R - Regex to separate string based on first dot?


I have a column which is filled with strings containing multiple dots. I want to split this column into two containing the two substrings before and after the first dot.

I.e.

comb          num
UWEA.n.49.sp   3
KYFZ.n.89.kr   5
     ...

Into

 a         b       num
UWEA    n.49.sp     3
KYFZ    n.89.kr     5
     ...

I'm using the separate function from tidyr but cannot get the regexp correct. I'm trying to use the regex style from this answer:

foo %>%
    separate(comb, into=c('a', 'b'),
             sep="([^.]+)\\.(.*)")

So column a should be determined by the first capture group ([^.]+), which contains at least one non-dot character, followed by the first dot, after which the second capture group (.*) matches whatever remains.

However this doesn't seem to match anything:

a   b   num
         3
         5

Here's my dummy dataset:

library(dplyr)
library(tidyr)
foo <- data.frame(comb=replicate(10, 
                                 paste(paste(sample(LETTERS, 4), collapse=''),
                                       sample(c('p', 'n'), 1), 
                                       sample(1:100, 1), 
                                       paste(sample(letters, 2), collapse=''), 
                                       sep='.')
                                 ),
                  num = sample(1:10, 10, replace = TRUE))

There are 3 best solutions below

0
On BEST ANSWER

I think @aosmith's answer is great and definitely less clunky than a regex solution involving lookarounds. But since you're intent on using regex, here it is:

foo %>% 
    separate(comb, 
             into = c("a","b"), 
             sep = "(?<=[A-Z])\\.(?=[a-z]+)")

The trick is in the regex, which uses lookarounds. For the sep parameter it matches a dot (.) only when that dot is preceded by an uppercase letter (the lookbehind (?<=[A-Z])) and followed by a lowercase letter (the lookahead (?=[a-z]+)), as in UWEA.n. The lookarounds consume no characters themselves, so only the dot is removed.

This makes separate split the comb column only on the dot between A and n or between Z and n, in your example. The later dots each sit next to a digit, so they are left alone.
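
To see the lookaround pattern work outside of separate, here is a minimal base R sketch. The vector x is made-up sample data in the question's format, and strsplit with perl = TRUE stands in for separate, so treat it as an illustration rather than what tidyr does internally.

```r
# Made-up sample strings in the same shape as the 'comb' column
x <- c("UWEA.n.49.sp", "KYFZ.n.89.kr")

# perl = TRUE enables the lookbehind (?<=...) and lookahead (?=...)
pieces <- strsplit(x, "(?<=[A-Z])\\.(?=[a-z])", perl = TRUE)

# Each list element holds the text before and after the one matching dot
split_df <- data.frame(a = sapply(pieces, `[`, 1),
                       b = sapply(pieces, `[`, 2),
                       stringsAsFactors = FALSE)
print(split_df)
```

Only the first dot has an uppercase letter before it, so each string yields exactly two pieces.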

I hope this helps.

0
On

Here is a base R option. Replace the first . in the 'comb' column with a comma, read the result with read.csv to create two columns split on that comma, and cbind them with the remaining columns of 'foo'.

cbind(read.csv(text=sub("\\.", ",", foo$comb), 
          col.names = c('a', 'b'), header=FALSE), foo[-1])
#      a       b num
#1  GJMU n.83.cu   3
#2  IVMD p.85.ny   9
#3  HLQB p.94.rd   8
#4  WIJY n.92.sz   4
#5  QXCM n.38.lf   8
#6  UBNC n.82.js   5
#7  EPLZ n.56.kl   3
#8  YRBA  n.6.ny   8
#9  HQMR p.54.pn  10
#10 LBPO p.98.tv   7
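
The read.csv route assumes the strings contain no other commas. If that is a concern, a possible base R alternative is to locate the first dot with regexpr and split around it with regmatches(..., invert = TRUE). This is only a sketch; x is made-up sample data, not the OP's 'foo'.

```r
# Made-up sample strings in the same shape as the 'comb' column
x <- c("UWEA.n.49.sp", "KYFZ.n.89.kr")

# regexpr() finds only the first match in each string
first_dot <- regexpr("\\.", x)

# invert = TRUE returns the text around the match instead of the match itself
pieces <- regmatches(x, first_dot, invert = TRUE)

# Bind the two pieces of every string into a two-column data frame
split_df <- data.frame(a = sapply(pieces, `[`, 1),
                       b = sapply(pieces, `[`, 2),
                       stringsAsFactors = FALSE)
print(split_df)
```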

Another option is extract from tidyr, where we match one or more characters that are not a . and place them in a capture group (([^.]+)), followed by a dot (\\.), followed by the remaining characters in a second capture group ((.*)). The captured groups are returned as two columns that replace the original 'comb' column.

library(tidyr)
extract(foo, comb, into = c("a", "b"), "([^.]+)\\.(.*)")
#      a       b num
#1  GJMU n.83.cu   3
#2  IVMD p.85.ny   9
#3  HLQB p.94.rd   8
#4  WIJY n.92.sz   4
#5  QXCM n.38.lf   8
#6  UBNC n.82.js   5
#7  EPLZ n.56.kl   3
#8  YRBA  n.6.ny   8
#9  HQMR p.54.pn  10
#10 LBPO p.98.tv   7
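
The same pattern works in extract but not in separate because extract keeps the capture groups, while the sep argument of separate describes the delimiter to throw away. The base R sketch below illustrates that difference with regexec and strsplit as stand-ins for the tidyr functions; x is made-up sample data.

```r
# Made-up sample strings in the same shape as the 'comb' column
x <- c("UWEA.n.49.sp", "KYFZ.n.89.kr")
pattern <- "([^.]+)\\.(.*)"

# Capture-group use: each element is c(full match, group 1, group 2)
m <- regmatches(x, regexec(pattern, x))
a <- sapply(m, `[`, 2)
b <- sapply(m, `[`, 3)
print(data.frame(a = a, b = b))

# Delimiter use: the pattern matches the whole string, so nothing is left over
leftover <- strsplit(x, pattern)
print(leftover)
```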

NOTE: There was no set.seed in the OP's post, so the random values here differ from those shown in the question.

1
On

This is a case where you can take advantage of the extra = "merge" option in separate. Because separate by default splits on any run of non-alphanumeric characters (sep = "[^[:alnum:]]+"), you don't have to define the separator. If you wanted to be explicit, you could use sep = "\\."

foo %>%
    separate(comb, into=c('a', 'b'), extra = "merge")

      a       b num
1  NPTE p.10.ku   4
2  YAIU p.54.lw   4
3  CHUR n.51.kx   6
4  EPGX n.14.lg   3
5  POBJ n.11.ja   5
6  LEWI n.72.un   7
7  WLAP n.20.ve  10
8  XZUY p.75.cf   6
9  ZSNJ  p.4.aj   3
10 ABKR n.69.ua   3

extra = "merge" takes all the extra pieces beyond the columns you defined and merges them into the last column.
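
Conceptually, the merge behaviour can be sketched in base R as below. merge_extra is a hypothetical helper written for illustration only; it is not part of tidyr and does not claim to mirror how separate is implemented. It also assumes the delimiter is a literal dot.

```r
# Hypothetical helper: split on every dot, then glue the surplus
# pieces back together so that at most n pieces remain
merge_extra <- function(s, n = 2) {
  parts <- strsplit(s, ".", fixed = TRUE)[[1]]
  if (length(parts) <= n) return(parts)
  c(parts[seq_len(n - 1)],
    paste(parts[n:length(parts)], collapse = "."))
}

# Made-up sample strings in the same shape as the 'comb' column
x <- c("UWEA.n.49.sp", "KYFZ.n.89.kr")
result <- t(sapply(x, merge_extra, USE.NAMES = FALSE))
print(result)
```

With n = 2 this keeps the text before the first dot in the first column and everything after it in the second.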