I have a column which is filled with strings containing multiple dots. I want to split this column into two containing the two substrings before and after the first dot.
I.e.
comb num
UWEA.n.49.sp 3
KYFZ.n.89.kr 5
...
Into
a b num
UWEA n.49.sp 3
KYFZ n.89.kr 5
...
I'm using the separate function from tidyr but cannot get the regexp correct. I'm trying to use the regex style from this answer:
foo %>%
  separate(comb, into=c('a', 'b'),
           sep="([^.]+)\\.(.*)")
So that column a should be determined by the first capture group ([^.]+), containing at least one non-dot character, then the first dot, then the second capture group (.*), which just matches whatever remains after.
However this doesn't seem to match anything:
a b num
3
5
Here's my dummy dataset:
library(dplyr)
library(tidyr)
foo <- data.frame(comb=replicate(10,
                    paste(paste(sample(LETTERS, 4), collapse=''),
                          sample(c('p', 'n'), 1),
                          sample(1:100, 1),
                          paste(sample(letters, 2), collapse=''),
                          sep='.')
                  ),
                  num = sample(1:10, 10, replace=TRUE))
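A possible reading of the empty output, offered as a sketch rather than a definitive diagnosis: sep in separate describes the delimiter that is removed between fields, so a pattern that matches the whole string leaves nothing on either side. Capture groups are what tidyr's extract function consumes instead. The two-row data frame below is copied from the example rows in the question, and the names pieces and res_extract are made up for illustration.

```r
library(tidyr)

# Two rows copied from the example in the question
foo <- data.frame(comb = c("UWEA.n.49.sp", "KYFZ.n.89.kr"),
                  num = c(3, 5),
                  stringsAsFactors = FALSE)

# Base R illustration: when the delimiter pattern matches the whole
# string, splitting leaves no non-empty piece behind
pieces <- strsplit(foo$comb, "([^.]+)\\.(.*)")

# extract() creates one new column per capture group, so the same
# regex can be reused unchanged as its regex argument
res_extract <- tidyr::extract(foo, comb, into = c("a", "b"),
                              regex = "([^.]+)\\.(.*)")
print(res_extract)
```

With extract, the first group stops at the first dot because [^.]+ cannot cross it, and the greedy (.*) takes the rest, which matches the intent described in the question.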
I think @aosmith's answer is great and definitely less clunky than a regex solution involving lookarounds. But since you're intent on using regex, here it is:
The trick here is the regex itself. It uses what is known as lookaround. Basically, you are looking for a dot (.) that's placed between an uppercase letter and a lowercase letter (i.e. UWEA.n) for the sep parameter. It means: match a dot preceded by a capital letter and followed by a lowercase letter.
This allows the separate function to split the comb column on the dots that are between A and n or between Z and n, in your case.
I hope this helps.
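The lookaround idea described above can be sketched as follows. The exact pattern is an assumption reconstructed from the description (a dot preceded by a capital letter and followed by a lowercase letter), not necessarily the original code, and the two-row data frame is copied from the example rows in the question. The second call, using extra = "merge", is included only for comparison; res_look and res_merge are illustrative names.

```r
library(tidyr)

# Two rows copied from the example in the question
foo <- data.frame(comb = c("UWEA.n.49.sp", "KYFZ.n.89.kr"),
                  num = c(3, 5),
                  stringsAsFactors = FALSE)

# (?<=[A-Z]) : lookbehind, the dot must follow an uppercase letter
# \\.        : the literal dot, the only text consumed by the split
# (?=[a-z])  : lookahead, the dot must be followed by a lowercase letter
res_look <- separate(foo, comb, into = c("a", "b"),
                     sep = "(?<=[A-Z])\\.(?=[a-z])")

# For comparison: split on every literal dot, but let extra = "merge"
# stop after the first split so the remainder stays in column b
res_merge <- separate(foo, comb, into = c("a", "b"),
                      sep = "\\.", extra = "merge")

print(res_look)
```

Note that the lookaround pattern depends on the shape of this particular data: it only works as long as the first dot is the only one sitting between an uppercase and a lowercase letter. The extra = "merge" variant makes no such assumption.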