I know that there are many questions out there about partial matches and I've read as many as I've been able to, but I have still not managed to extract what I need using R.
In a nutshell, my problem is that I have a data set with over a million rows of Spanish trigrams and I want to find only those that contain verbs. To make this easier, I added a column with the 500 most common verbs in Spanish so that I could try to match them against the trigrams.
I have a data set like this:
data <- tibble::tibble(trigrams = c("no veo que", "no me gusta", "si habla de", "la mesa de", "el caso que"), frequency = c(112, 345, 578, NA, NA), verb = c("hablar", "gustar", "leer", NA, NA))  # shorter columns padded with NA so all columns have the same length
The verbs in the third column ("verb") are infinitives and I would like to partially match them to the verbs in the first ("trigrams"). I think it would be ideal, in this case, to use a for loop to iterate through the 500 verbs that I want to partially match against my more than one million trigrams.
So in this case, "gustar" should partially match "no me gusta", and nothing should match verbless trigrams like "el caso que".
I really do hope this makes sense. I have never worked with this amount of data before and I am too new to regular expressions to figure this out on my own.
I think this approach using stringr might help you. You might have to make some modifications in order to use it in a data frame. Basically, we have to convert each verb such as "hablar" into a pattern such as 'hablar*' and then do a str_extract().
Created on 2018-09-16 by the reprex package (v0.2.0).
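The approach described above can be sketched roughly as follows. This is only an illustration, not the original reprex: the vector names (trigrams, verbs, patterns, combined, matched) are made up here, and joining all the patterns into a single alternation with a word boundary is an assumption added so that each trigram is scanned once instead of looping over the verbs.

```r
library(stringr)

# Sample data taken from the question
trigrams <- c("no veo que", "no me gusta", "si habla de", "la mesa de", "el caso que")
verbs <- c("hablar", "gustar", "leer")

# Turn each infinitive into a pattern where the final "r" is optional,
# e.g. "hablar" -> "hablar*", which matches "habla" as well as "hablar"
patterns <- paste0(verbs, "*")

# Assumption: join the patterns into one alternation that starts at a word
# boundary, so each trigram is scanned once rather than once per verb
combined <- paste0("\\b(?:", paste(patterns, collapse = "|"), ")")

# str_extract() returns the first matched substring, or NA if nothing matches
matched <- str_extract(trigrams, combined)

result <- data.frame(trigrams = trigrams, matched = matched, stringsAsFactors = FALSE)

# Keep only the trigrams in which one of the verbs was found
with_verbs <- result[!is.na(result$matched), ]
print(with_verbs)
```

On the sample data this keeps "no me gusta" and "si habla de". Note that a pattern like 'hablar*' only catches forms that begin with the infinitive minus its final "r", so a form such as "hablo" would be missed; covering more conjugations would probably require matching on verb stems or using a lemmatizer.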