Separating a string based on known expressions (with no viable delimiter or regular expression)

I have survey data from Google Forms with a "select all that apply" style question. Google Forms outputs the response to this question as a single string of all the selected answers separated by ", ". I would like to use something like separate_longer_delim() to put each participant's selected answers on separate rows. Ultimately, I intend to pivot these data using pivot_wider() so that each unique response becomes a column and each value is a 1 or a 0, depending on whether that participant selected that answer.

The major issue is that some of the survey answers themselves contain ", ", so I cannot use a tidyr separate-style function to split the string on a delimiter or a regular expression.

Is there any way to accomplish this using a vector of all possible responses for the question, or something along those lines? Is there also some way to identify the "other" responses that are written in by participants?

Here is a simplified mock up of some of the data I am working with:

known_answers <- 
c("I live alone",
  "I live in on campus",
  "I split housing costs with housemates, family, landlord, tenant, etc.",
  "I have dependents",
"other")

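library(dplyr)  # provides %>%

set.seed(123)
# store the mock data as `dd`; simplify = FALSE keeps replicate() returning a list
dd <- data.frame(
ID = 1:10, 
answer = replicate(n = 10, expr = (sample(x=known_answers, size = sample(1:3,1))), simplify = FALSE) %>% sapply(function(x) paste(x, collapse = ", "))
)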

Assume the "other" answer is a stand-in for a write-in answer. In the real survey data, if a participant writes in their own response, the output string does not include the word "other".

I would like the data to resemble the output of separate_longer_delim() (but I cannot use that function directly because I don't have a reliable delimiter):

   ID                               answer
1   1 I split housing costs with housemates, family, landlord, tenant, etc.
1   1 I live in on campus
1   1 other
2   2 I live in on campus
2   2 other
3   3 other
3   3 I have dependents
3   3 I live in on campus
4   4 I live alone
4   4 I live in on campus
5   5 other
5   5 I split housing costs with housemates, family, landlord, tenant, etc.
5   5 I have dependents
6   6 I have dependents
7   7 I live alone
.
.
.
.

MrFlick On

You could build a helper function that escapes the known terms. For example:

escape_comma_terms <- function(x, terms) {
  cleaned <- x
  for (i in seq_along(terms)) {
    cleaned <- gsub(terms[i], paste0("[[", i, "]]"), cleaned, fixed = TRUE)
  }
  cleaned
}

That replaces a term like "I live alone" with a placeholder like "[[1]]", so the known answers no longer contain any commas. Then you can safely split on ", " and join the original values back in.
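
To see the escaping step in isolation, here is a minimal base R sketch. It repeats the helper so the block runs on its own; the two-item `known_answers` vector and the input string `x` are made up for illustration and are not the question's data.

```r
# Helper from the answer above, repeated so this block is self-contained.
escape_comma_terms <- function(x, terms) {
  cleaned <- x
  for (i in seq_along(terms)) {
    # fixed = TRUE treats the term as literal text, not as a regular expression
    cleaned <- gsub(terms[i], paste0("[[", i, "]]"), cleaned, fixed = TRUE)
  }
  cleaned
}

# Hypothetical inputs for illustration only.
known_answers <- c(
  "I live alone",
  "I split housing costs with housemates, family, landlord, tenant, etc."
)
x <- "I split housing costs with housemates, family, landlord, tenant, etc., I live alone"

escaped <- escape_comma_terms(x, known_answers)
escaped
# "[[2]], [[1]]"

# The only ", " left is the one between answers, so splitting is now safe.
tokens <- strsplit(escaped, ", ", fixed = TRUE)[[1]]
tokens
# "[[2]]" "[[1]]"
```

One caveat worth keeping in mind: if one known answer is a substring of another, the order of `terms` matters, so it is probably safest to list longer answers first. The pipeline below applies the same idea to the whole data frame.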

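library(dplyr)  # mutate(), left_join(), tibble(), %>%

dd %>% 
  # swap each known answer for a comma-free placeholder such as "[[3]]"
  mutate(answer = escape_comma_terms(answer, known_answers)) %>% 
  tidyr::separate_longer_delim(answer, ", ") %>% 
  # map each placeholder back to its original text
  left_join(tibble(real_answer = known_answers, answer = paste0("[[", seq_along(known_answers), "]]")), by = "answer")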

This returns something like:

   ID answer                                                           real_answer
1   1  [[3]] I split housing costs with housemates, family, landlord, tenant, etc.
2   1  [[2]]                                                   I live in on campus
3   1  [[5]]                                                                 other
4   2  [[2]]                                                   I live in on campus
5   2  [[5]]                                                                 other
6   3  [[5]]                                                                 other
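
The question also asks about spotting write-in responses and pivoting to 0/1 columns. One possible extension, sketched here under assumptions: any piece that does not match a placeholder after the join is treated as a write-in. The two-row `dd`, the three-item `known_answers`, and the names `lookup`, `is_write_in`, `column`, and `selected` are all hypothetical and chosen only for this example.

```r
library(dplyr)
library(tidyr)

# Hypothetical known answers and responses; ID 2 contains a write-in.
known_answers <- c(
  "I live alone",
  "I split housing costs with housemates, family, landlord, tenant, etc.",
  "I have dependents"
)
dd <- data.frame(
  ID = 1:2,
  answer = c(
    "I split housing costs with housemates, family, landlord, tenant, etc., I have dependents",
    "I live alone, I rent a room"
  )
)

# Helper from the answer above, repeated so this block is self-contained.
escape_comma_terms <- function(x, terms) {
  cleaned <- x
  for (i in seq_along(terms)) {
    cleaned <- gsub(terms[i], paste0("[[", i, "]]"), cleaned, fixed = TRUE)
  }
  cleaned
}

# Placeholder -> original text lookup table.
lookup <- tibble(
  real_answer = known_answers,
  answer = paste0("[[", seq_along(known_answers), "]]")
)

long <- dd %>%
  mutate(answer = escape_comma_terms(answer, known_answers)) %>%
  separate_longer_delim(answer, ", ") %>%
  left_join(lookup, by = "answer") %>%
  mutate(
    # pieces with no matching placeholder are assumed to be write-ins
    is_write_in = is.na(real_answer),
    real_answer = coalesce(real_answer, answer)
  )

# One column per known answer plus a single "other" column, filled with 1/0.
wide <- long %>%
  mutate(column = if_else(is_write_in, "other", real_answer), selected = 1L) %>%
  distinct(ID, column, selected) %>%
  pivot_wider(names_from = column, values_from = selected, values_fill = 0L)
```

Note that a write-in that itself contains ", " would still be split into several rows by this approach; collapsing them into one "other" column, as above, sidesteps that for the 0/1 table, but the original write-in text would need to be reassembled separately.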