R strsplit ignore some text


I'm working on a survey where many of the written answers contain several categories separated by commas. I have successfully used strsplit to separate them, like this:

sss6 <- str_trim(unlist(strsplit(aiprm$step_do_you_anticipate, split=",")))

I have successfully separated strings like these, so I can count them each correctly in order to make visualizations.

Grammar, None of the above, Grammar, Subject matter expertise, Grammar, Subject matter expertise, Bias, Grammar, Subject matter expertise, Bias, Fact-checking

The problem now is that some of the text contains parentheses with commas inside them, and I would like the commas inside the parentheses "()" to be ignored. Here are some examples of that.

Ad copy, JavaScript code, headlines, compelling copy, commercial ideas, Ad copy, Title & meta description, Idea generation (topics, headlines), Code, Idea generation (topics, headlines), Ad copy, Idea generation (topics, headlines)

Is there any way to tell the strsplit() function not to split on the commas that are inside parentheses? The main problem is (topics, headlines).

Thanks!


There are 3 best solutions below

1
weakCoder On

Horrible (and really slow) solution:

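# s is assumed to be a single string, e.g. one survey response
chrs   <- strsplit(s, "")[[1]]
# an odd running count of "(" and ")" means we are inside parentheses (no nesting assumed)
inside <- cumsum(chrs == "(" | chrs == ")") %% 2 == 1
ind    <- which(chrs == "," & !inside)   # positions of top-level commas

# each piece runs from just after one top-level comma to just before the next;
# appending length(chrs) keeps the final piece, which has no trailing comma
starts <- c(1, ind + 1)
ends   <- c(ind - 1, length(chrs))
trimws(mapply(function(a, b) paste(chrs[a:b], collapse = ""), starts, ends))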

Best way to go about it is probably to use a regex. See this thread.

0
jpsmith On

In this specific case, since you note that the problematic string within the parentheses is always the same ("topics, headlines"), and if you're open to a slight modification, you can simply replace the comma within that phrase with some non-comma punctuation, such as a hyphen, i.e.:

gsub("topics, headlines", "topics-headlines", aiprm$step_do_you_anticipate)

You then only need to replace aiprm$step_do_you_anticipate in your original code with the above:

sss6 <- stringr::str_trim(unlist(strsplit(
  gsub("topics, headlines", "topics-headlines", aiprm$step_do_you_anticipate), 
  split=",")))

# [1] "Ad copy"                            "JavaScript code"                   
# [3] "headlines"                          "compelling copy"                   
# [5] "commercial ideas"                   "Ad copy"                           
# [7] "Title & meta description"           "Idea generation (topics-headlines)"
# [9] "Code"                               "Idea generation (topics-headlines)"
# [11] "Ad copy"                            "Idea generation (topics-headlines)"

If you really wanted the commas, you could sub back out quickly:

gsub("topics-headlines", "topics, headlines", sss6) 

# [1] "Ad copy"                             "JavaScript code"                    
# [3] "headlines"                           "compelling copy"                    
# [5] "commercial ideas"                    "Ad copy"                            
# [7] "Title & meta description"            "Idea generation (topics, headlines)"
# [9] "Code"                                "Idea generation (topics, headlines)"
# [11] "Ad copy"                             "Idea generation (topics, headlines)"
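
If the text inside the parentheses were not always the same, the same substitute, split, restore idea could probably be generalised with a lookahead. The following is only a sketch under assumptions: the sample string `x` (including the "Other (a, b, c)" item) and the `<COMMA>` placeholder are made up for illustration, the placeholder must not occur in the real data, and parentheses are assumed to be balanced and not nested.

```r
# Made-up sample string; "Other (a, b, c)" is not from the survey
x <- "Ad copy, Idea generation (topics, headlines), Code, Other (a, b, c)"

# Hypothetical placeholder; pick something that cannot appear in the data
placeholder <- "<COMMA>"

# Step 1: protect commas that are followed by ")" with no "(" or ")" in between,
# i.e. commas that sit inside a pair of parentheses
protected <- gsub(",(?=[^()]*\\))", placeholder, x, perl = TRUE)

# Step 2: split on the remaining (top-level) commas and trim whitespace
pieces <- trimws(strsplit(protected, ",", fixed = TRUE)[[1]])

# Step 3: put the protected commas back
restored <- gsub(placeholder, ",", pieces, fixed = TRUE)
print(restored)
```

Unlike the hyphen substitution above, this does not depend on knowing the parenthesised phrase in advance, at the cost of a less readable pattern.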

As an aside, you may also want to look into tidyr::separate_longer_delim():

aiprm$comma_replaced <- gsub("topics, headlines", "topics-headlines", aiprm$step_do_you_anticipate)

tidyr::separate_longer_delim(aiprm, comma_replaced, delim = ", ")  # ", " (comma + space) avoids leading spaces in the pieces

#                        comma_replaced
#1                              Ad copy
#2                      JavaScript code
#3                            headlines
#4                      compelling copy
#5                     commercial ideas
#6                              Ad copy
#7             Title & meta description
#8   Idea generation (topics-headlines)
#9                                 Code
#10  Idea generation (topics-headlines)
#11                             Ad copy
#12  Idea generation (topics-headlines)
0
thelatemail On

Upgrading my previous comment to a full answer, since it addresses the question more directly and does not require modifying the original string:

Example data:

x <- "Ad copy, JavaScript code, headlines, compelling copy, commercial ideas, Ad copy, Title & meta description, Idea generation (topics, headlines), Code, Idea generation (topics, headlines), Ad copy, Idea generation (topics, headlines)"

Regex splitting adapted to R from this Java question linked in @weakCoder's answer:

trimws(strsplit(x, ",(?![^(]*\\))", perl=TRUE)[[1]])

## [1] "Ad copy"                             "JavaScript code"                    
## [3] "headlines"                           "compelling copy"                    
## [5] "commercial ideas"                    "Ad copy"                            
## [7] "Title & meta description"            "Idea generation (topics, headlines)"
## [9] "Code"                                "Idea generation (topics, headlines)"
##[11] "Ad copy"                             "Idea generation (topics, headlines)"
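
Since the goal in the question is to count categories across a whole column, here is a rough sketch of how the same pattern might be applied to a vector rather than a single string. The `responses` vector is a made-up stand-in for `aiprm$step_do_you_anticipate`, and the pattern still assumes balanced, non-nested parentheses.

```r
# Hypothetical stand-in for the survey column (aiprm$step_do_you_anticipate)
responses <- c(
  "Ad copy, Idea generation (topics, headlines), Code",
  "Idea generation (topics, headlines), Ad copy",
  "Code"
)

# ","             a literal comma ...
# "(?![^(]*\\))"  ... not followed by a run of non-"(" characters and then ")",
#                 i.e. the comma is not inside an open "( ... )"
pattern <- ",(?![^(]*\\))"

# strsplit() is vectorised over its input, so unlist() pools the pieces
# from every response into one character vector
pieces <- trimws(unlist(strsplit(responses, pattern, perl = TRUE)))

# Frequency of each category, ready for plotting
counts <- table(pieces)
print(counts)
```

The negative lookahead needs `perl = TRUE`, because R's default regex engine (TRE) does not support lookaround assertions.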