Returning the matched string from a grepl match of multiple strings, rather than the logical


Currently I'm using nested ifelse functions with grepl to check for matches to a vector of strings in a data frame, for example:

# vector of possible words to match
x <- c("Action", "Adventure", "Animation")

# data
my_text <- c("This one has Animation.", "This has none.", "Here is Adventure.")
my_text <- as.data.frame(my_text)

my_text$new_column <- ifelse (
  grepl("Action", my_text$my_text) == TRUE,
  "Action",
  ifelse (
    grepl("Adventure", my_text$my_text) == TRUE,
    "Adventure",
    ifelse (
      grepl("Animation", my_text$my_text) == TRUE,
      "Animation", NA)))

> my_text$new_column
[1] "Animation" NA          "Adventure"

This is fine for just a few elements (e.g., the three here), but how do I do this when the number of possible matches is much larger (e.g., 150)? Nested ifelse seems crazy. I know I can grepl multiple things at once as in the code below, but this returns a logical telling me only whether a string was matched, not which one was matched. I'd like to know what was matched (in the case of multiple matches, any of them is fine).

x <- c("Action", "Adventure", "Animation")
my_text <- c("This one has Animation.", "This has none.", "Here is Adventure.")
grepl(paste(x, collapse = "|"), my_text)

returns: [1]  TRUE FALSE  TRUE
What I'd like it to return: "Animation" ""(or FALSE) "Adventure"


1
On BEST ANSWER

Following the pattern here, a base solution.

x <- c("ActionABC", "AdventureDEF", "AnimationGHI")

regmatches(x, regexpr("(Action|Adventure|Animation)", x))

stringr has an easier way to do this; str_extract() returns NA for any element with no match:

library(stringr)
str_extract(x, "(Action|Adventure|Animation)")
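
Applied to the question's data, the base approach needs one extra step, because regmatches() drops elements that have no match. The following is a minimal sketch, assuming the terms in x contain no regex metacharacters (otherwise they would need escaping); the name pattern and the NA fill are illustrative choices, not part of the original answer.

```r
# Patterns and data from the question
x <- c("Action", "Adventure", "Animation")
my_text <- data.frame(
  my_text = c("This one has Animation.", "This has none.", "Here is Adventure."),
  stringsAsFactors = FALSE
)

# One alternation pattern built from the whole vector: "Action|Adventure|Animation"
pattern <- paste(x, collapse = "|")

# regexpr() returns -1 where nothing matched, and regmatches() drops those
# elements, so start from a column of NAs and fill only the matched positions
m <- regexpr(pattern, my_text$my_text)
my_text$new_column <- NA_character_
my_text$new_column[m != -1] <- regmatches(my_text$my_text, m)

my_text$new_column
# [1] "Animation" NA          "Adventure"
```

Because the pattern is built with paste(), the same code should work unchanged for a vector of 150 terms.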
0
On

This will do it...

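my_text$new_column <- unlist(
  apply(
    # logical matrix: one row per text, one column per term in x
    sapply(x, grepl, my_text$my_text),
    1,
    # collapse so that several matches in one row become a single string
    function(y) paste("", x[y], collapse = "")))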

The sapply produces a logical matrix showing which of the x terms appears in each element of your column. The apply then runs through this row-by-row and pastes together all of the values of x corresponding to TRUE values. (It pastes a "" at the start to avoid NAs and keep the length of the output the same as the original data.) If there are two terms in x matched for a row, they will be pasted together in the output.
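
To make the logical matrix concrete, here is a small sketch that builds it and then keeps only the first matching term per row, returning NA where nothing matches. The names hits, first_hit and first_match are illustrative, and fixed = TRUE is an assumption that the terms should be matched literally rather than as regular expressions.

```r
x <- c("Action", "Adventure", "Animation")
texts <- c("This one has Animation.", "This has none.", "Here is Adventure.")

# One column per term in x, one row per text; TRUE where the term occurs.
# fixed = TRUE treats each term as a literal string rather than a regex.
hits <- sapply(x, grepl, texts, fixed = TRUE)

# Index of the first matching term in each row (NA when no term matches)
first_hit <- apply(hits, 1, function(r) which(r)[1])

# Indexing x with NA yields NA, so unmatched rows come out as NA
first_match <- x[first_hit]
first_match
# [1] "Animation" NA          "Adventure"
```

Taking only the first match suits the question's requirement that any one of several matches is acceptable.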

0
On

Building on Benjamin's base solution, use lapply so that you will have a character(0) value when there is no match.

Using regmatches directly on your sample data will give the following error.

    my_text$new_column <-regmatches(x = my_text$my_text, m = regexpr(pattern = paste(x, collapse = "|"), text = my_text$my_text))

    Error in `$<-.data.frame`(`*tmp*`, new_column, value = c("Animation",  : 
  replacement has 2 rows, data has 3

This is because there are only 2 matches, and the assignment tries to fit the 2 matched values into a data frame column that has 3 rows.

To fill non-matches with a special value, so that this operation can be done directly, we can use lapply:

my_text$new_column <-
lapply(X = my_text$my_text, FUN = function(X){
  regmatches(x = X, m = regexpr(pattern = paste(x, collapse = "|"), text = X))
})

This puts character(0) where there is no match; note that new_column becomes a list column rather than a character vector.
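
If a plain character column is preferred, the list can be flattened afterwards. This is a sketch under the assumption that NA is an acceptable marker for a non-match; the names matches and flat are illustrative.

```r
x <- c("Action", "Adventure", "Animation")
texts <- c("This one has Animation.", "This has none.", "Here is Adventure.")
pattern <- paste(x, collapse = "|")

# List result as in the lapply approach: character(0) marks a non-match
matches <- lapply(texts, function(s) regmatches(s, regexpr(pattern, s)))

# Flatten to a character vector, replacing character(0) with NA
flat <- vapply(
  matches,
  function(m) if (length(m) == 0) NA_character_ else m,
  character(1)
)
flat
# [1] "Animation" NA          "Adventure"
```

The result has the same length as the input, so it can be assigned to a data frame column directly.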


Hope this helps.