How to separate a column based on words made up only of upper case in R

Asked by user18081990 on 23 June 2023 at 19:31

I want to separate a column of strings into two columns, with all fully upper-case words in the first one. Each string starts with one or two upper-case words. Here is an example of the data frame:

mydataframe <- data.frame(species= c("ACTINIDIACEAE Actinidia arguta", 
           "ANACARDIACEAE Attilaea abalak E.Martínez & Ramos", 
           "LEGUMINOSAE CAESALPINIOIDEAE Biancaea decapetala (Roth) O.Deg."),
           trait= c(1,2,4))

I tried separate() with the following regular expression: "\\s+(?=[A-Z]+)". This is not working. For strings with two capitalized words, it splits between the first and the second capitalized word and drops the rest of the string. Here is the code:

library(dplyr)  # provides %>%
library(tidyr)  # provides separate()
mydataframe <- mydataframe %>%
              separate(species, into = c("family", "sp"), sep ="\\s+(?=[A-Z]+)")

This is the result of the code:

family	sp	trait
ACTINIDIACEAE	Actinidia arguta	1
ANACARDIACEAE	Attilaea abalak	2
LEGUMINOSAE	CAESALPINIOIDEAE	4
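
One way to see why this happens is to run the same pattern through base strsplit(). The lookahead (?=[A-Z]+) only requires a single upper-case letter after the whitespace, so it also matches before capitalized words such as Biancaea or O.Deg., not just before fully upper-case words. separate() then keeps the first two pieces and discards the rest. The sketch below assumes only base R; the names x and pieces are made up for illustration.

```r
# Third example string from the question
x <- "LEGUMINOSAE CAESALPINIOIDEAE Biancaea decapetala (Roth) O.Deg."

# perl = TRUE is required for the lookahead; split at whitespace that is
# followed by at least one upper-case letter
pieces <- strsplit(x, "\\s+(?=[A-Z]+)", perl = TRUE)[[1]]
print(pieces)
```

The string breaks into four pieces rather than two, and only the first two fit into the family and sp columns.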

I want the following format:

family	sp	trait
ACTINIDIACEAE	Actinidia arguta	1
ANACARDIACEAE	Attilaea abalak	2
LEGUMINOSAE CAESALPINIOIDEAE	Biancaea decapetala	4


Answer by r2evans on 23 June 2023 at 19:40 (best answer)

I think we can use base R's strcapture for this. The pattern finds the last run of two or more upper-case letters, then whitespace, then the rest of the string starting at a word that contains at least one lower-case letter.

library(dplyr)  # provides %>% and mutate()
mydataframe %>%
  mutate(strcapture("(.*[A-Z]{2,})\\s+(\\S*[a-z].*)", species, list(family="", sp="")))
#                                                          species trait                       family                                 sp
# 1                                 ACTINIDIACEAE Actinidia arguta     1                ACTINIDIACEAE                   Actinidia arguta
# 2               ANACARDIACEAE Attilaea abalak E.Martínez & Ramos     2                ANACARDIACEAE Attilaea abalak E.Martínez & Ramos
# 3 LEGUMINOSAE CAESALPINIOIDEAE Biancaea decapetala (Roth) O.Deg.     4 LEGUMINOSAE CAESALPINIOIDEAE  Biancaea decapetala (Roth) O.Deg.
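
The desired output in the question also drops the author citation after the species name, which the pattern above keeps in sp. A possible variation, offered as a sketch rather than a general solution, narrows the second capture group to a capitalized genus followed by one lower-case epithet. It assumes every name is a plain two-word binomial written with ASCII letters (no hybrids, subspecies, or hyphenated epithets); the names species and parts are made up for illustration.

```r
species <- c("ACTINIDIACEAE Actinidia arguta",
             "ANACARDIACEAE Attilaea abalak E.Martínez & Ramos",
             "LEGUMINOSAE CAESALPINIOIDEAE Biancaea decapetala (Roth) O.Deg.")

# Group 1: everything up to the last run of two or more upper-case letters
#          that is followed by a genus and epithet
# Group 2: capitalized genus, whitespace, lower-case epithet; whatever comes
#          after the epithet (the author citation) is not captured
parts <- strcapture("^(.*[A-Z]{2,})\\s+([A-Z][a-z]+\\s+[a-z]+)",
                    species,
                    data.frame(family = character(), sp = character()))
print(parts)
```

The same pattern and prototype could be passed to strcapture inside mutate() as in the answer above. Rows that do not match the pattern come back as NA, which makes names that break the two-word assumption easy to spot.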
