Conditionally separating a string in R or alternative Regex expressions


I have a string containing 7 dimensions that I want to split into 7 separate strings. The string is comma-delimited, i.e. there is a comma between each pair of dimensions. Ordinarily, I would just use the separate() function and specify sep = ", ". However, one of my dimensions is a string that sometimes contains a comma itself, which makes that approach unusable. Is there a way to split the string conditionally, or a different regex pattern I could use, so that the dimensions are still separated while each keeps its proper value?

Below is an example of the kind of string I am working with: the dimension name, followed by a colon, followed by the value for that dimension. Note that I do not need help separating the dimension name from the value; I already have a solution for that:

my_str <- "dim1: 1, dim2: a, b, dim3: 3"

As you can see, dim2 has a value of a, b so if I use the separate() function with sep = ", ", I end up with the following three strings:

"dim1: 1"
"dim2: a"
"b"

when what I want is

"dim1: 1"
"dim2: a, b"
"dim3: 3"

There are 3 best solutions below

Onyambu On
unlist(strsplit(my_str, ",\\s*(?=\\w+:)", perl = TRUE))
[1] "dim1: 1"    "dim2: a, b" "dim3: 3"   
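
The pattern splits on a comma (plus any following whitespace) only when the lookahead (?=\\w+:) finds a name followed by a colon, so the comma inside a, b is left alone. Below is a minimal sketch applying it to several strings at once. The vector my_strs and its second element are made up for illustration, and the sketch assumes no value itself contains a comma followed by a word and a colon, since such a value would still be split.

```r
# Hypothetical input: two strings with the same three dimensions
my_strs <- c("dim1: 1, dim2: a, b, dim3: 3",
             "dim1: 2, dim2: c, dim3: 4")

# Split on ", " only where the next token looks like "name:"
parts <- strsplit(my_strs, ",\\s*(?=\\w+:)", perl = TRUE)

# Each list element holds the pieces of one input string
print(parts[[1]])
print(parts[[2]])

# Assuming every string yields the same number of pieces,
# the pieces can be stacked into a character matrix
dims <- do.call(rbind, parts)
print(dims)
```

Because strsplit() is vectorised, the same call works on a whole data frame column without a loop.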
The fourth bird On

You could get all the matches using a regex positive lookahead and asserting the same starting pattern with an optional leading comma.

\w+:.*?(?=,?\s+\w+:|$)

The pattern matches:

  • \w+: Match 1+ word characters followed by :
  • .*? Match any character except a newline, as few as possible
  • (?= Positive lookahead, assert that to the right is
    • ,?\s+\w+: An optional comma, 1+ whitespace characters, then 1+ word characters followed by :
    • | Or
    • $ Assert the end of the string
  • ) Close the lookahead


For example, using perl = TRUE due to the lookahead assertion:

my_str <- "dim1: 1, dim2: a, b, dim3: 3"

pattern <- "\\w+:.*?(?=,?\\s+\\w+:|$)"

regmatches(my_str, gregexpr(pattern, my_str, perl = TRUE))[[1]]

Output

[1] "dim1: 1"    "dim2: a, b" "dim3: 3"
Chris Ruehlemann On

Another solution is with str_extract_all:

library(stringr)
str_extract_all(my_str, "dim.*?(?=,?\\sdim|$)")

There are two key elements in the pattern:

  • the ? in dim.*? makes the matching 'lazy': .* consumes as few characters as possible and stops at the first position where the look-ahead succeeds (if you leave the ? out, the matching will be 'greedy' and match the whole string)
  • (?=,?\\sdim|$) is a positive look-ahead asserting that dim.*? only matches if it is followed either by (i) an optional comma, a whitespace character (\\s), and the string dim, OR (|) by (ii) the end of the string ($).
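
The difference between the lazy and the greedy quantifier can be checked directly. This is a minimal sketch that uses base R's regmatches() with perl = TRUE in place of stringr, so that it runs without extra packages; the helper name extract_all is hypothetical.

```r
my_str <- "dim1: 1, dim2: a, b, dim3: 3"

# Hypothetical helper: return all matches of a PCRE pattern in one string
extract_all <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

# Lazy: .*? stops as soon as the look-ahead is satisfied
lazy <- extract_all(my_str, "dim.*?(?=,?\\sdim|$)")

# Greedy: .* runs to the end of the string, where $ satisfies the look-ahead
greedy <- extract_all(my_str, "dim.*(?=,?\\sdim|$)")

print(lazy)    # one element per dimension
print(greedy)  # a single element holding the whole string
```

Note that both patterns hard-code the prefix dim, so they only apply when every dimension name starts with it.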