Finding verbs in a list of trigrams (with partial match?)

I know that there are many questions out there about partial matches and I've read as many as I've been able to, but I have still not managed to extract what I need using R.

In a nutshell, my problem is that I have a data set with over a million rows of Spanish trigrams and I want to find only those that contain a verb. To make this easier, I added a column with the 500 most common verbs in Spanish so that I can try to match them against the trigrams.

I have a data set like this:

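# frequency and verb are padded with NA so that all columns have the same length
data <- data.frame(trigrams = c("no veo que", "no me gusta", "si habla de", "la mesa de", "el caso que"), frequency = c(112, 345, 578, NA, NA), verb = c("hablar", "gustar", "leer", NA, NA), stringsAsFactors = FALSE)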

The verbs in the third column ("verb") are infinitives and I would like to partially match them to the verb forms in the first column ("trigrams"). I think the ideal approach would be a for loop that iterates through the 500 verbs and partially matches each one against my more than one million trigrams.

So in this case, "gustar" should partially match "no me gusta", and nothing should match verbless trigrams like "el caso que".

I really do hope this makes sense. I have never worked with this amount of data before, and I am too new to regular expressions to figure this out on my own.

There is 1 best solution below

I think this approach using stringr might help you. You may need to adapt it to work on a data frame. The idea is to turn each verb, such as "hablar", into a regex pattern such as "hablar*" and then call str_extract() on the trigrams. In a regex, the * applies only to the final r and means "zero or more", so "hablar*" also matches "habla":

library(dplyr)
library(stringr)


trigrams <- c("no veo que", "no me gusta", "si habla de", "la mesa de", "el caso que")
verb <- c("hablar", "gustar", "leer")

# For each verb, build the pattern "<verb>*" and extract its first match
# from every trigram (NA where a trigram has no match).
# lapply() returns one vector per verb; unlist() flattens them into one vector.
result <- lapply(paste0(verb, "*"), str_extract, string = trigrams) %>%
            unlist(.)
# Drop the NA entries, keeping only the matched verb forms
result <- result[!is.na(result)]

result
#> [1] "habla" "gusta"

Created on 2018-09-16 by the reprex package (v0.2.0).
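
If the goal is to keep whole trigrams rather than the matched fragments, one possible variation is to build a single pattern from all the verbs and filter with base R's grepl(). This is only a sketch under an assumption that is not part of the answer above: that dropping the infinitive ending (-ar, -er, -ir) gives a usable stem. The names stems, pattern and has_verb are made up for illustration.

```r
trigrams <- c("no veo que", "no me gusta", "si habla de", "la mesa de", "el caso que")
verb <- c("hablar", "gustar", "leer")

# Assumption: removing the infinitive ending leaves a rough stem
# ("hablar" -> "habl", "gustar" -> "gust", "leer" -> "le")
stems <- sub("(ar|er|ir)$", "", verb)

# One alternation pattern for all stems; \\b anchors each stem at the start of a word
pattern <- paste0("\\b(", paste(stems, collapse = "|"), ")")

# TRUE for every trigram that contains a word starting with one of the stems
has_verb <- grepl(pattern, trigrams, perl = TRUE)

trigrams[has_verb]
#> [1] "no me gusta" "si habla de"
```

With a data frame, the same logical vector can be used to subset rows, for example data[has_verb, ]. Short stems such as "le" will also match unrelated words that begin with the same letters, and irregular forms (such as "veo" for "ver") will be missed, so the result is best treated as a first pass.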