I have a data frame that has a column containing some text.
I want to extract phrases from the text with the format NN + VB + NN or NN + NN + VB + NN or NN + ... + NN + VB + NN et cetera. Basically, I want to get the simple phrases with 1 to n nouns before the first encountered verb, followed by a noun.
I'm using nltk.pos_tag after tokenizing the texts to get the tag of each word, but I cannot find a way to extract these phrases from the tagged output.
I also thought about bigrams, trigrams, n-grams etc., but couldn't find a way to apply them.
Any help, please?
Here is a solution which uses nltk.RegexpParser with a custom grammar rule to match occurrences of one or more nouns, followed by a verb, followed by a noun.
Example
Parsing "Prodikos Socrates recommended Plato, and Plato recommended Aristotle" produces the following labelled parse tree:
Output:
Note: The above rule does not handle symbols and punctuation interrupting the initial sequence of nouns (e.g. "Prodikos, Socrates recommended Plato" will only match "Socrates recommended Plato"). There is likely a way to handle this case using some regexp pattern and the NLTK PoS tags, but it is not immediately obvious to me.
Solution
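The approach described above could be sketched roughly as follows. This is a minimal sketch, not necessarily the exact code of the original answer: the chunk label NVN, the grammar string, and the helper names extract_phrases and extract_from_text are assumptions made for illustration. The tag patterns rely on the Penn Treebank tags that nltk.pos_tag returns, where noun tags start with NN and verb tags start with VB.

```python
import nltk

# Assumed grammar: one or more nouns, then one verb, then one noun.
# "NVN" is an arbitrary chunk label chosen for this sketch.
GRAMMAR = "NVN: {<NN.*>+<VB.*><NN.*>}"
PARSER = nltk.RegexpParser(GRAMMAR)


def extract_phrases(tagged_tokens):
    """Return the matched phrases from a list of (word, tag) pairs."""
    tree = PARSER.parse(tagged_tokens)
    # Keep only the subtrees produced by our rule and join their words.
    return [
        " ".join(word for word, _tag in subtree.leaves())
        for subtree in tree.subtrees(filter=lambda t: t.label() == "NVN")
    ]


def extract_from_text(text):
    """Tokenize and tag raw text, then extract the phrases.

    Requires the NLTK tokenizer and tagger data to be downloaded.
    """
    return extract_phrases(nltk.pos_tag(nltk.word_tokenize(text)))
```

If the example sentence is tagged as NNP NNP VBD NNP , CC NNP VBD NNP, extract_phrases returns ['Prodikos Socrates recommended Plato', 'Plato recommended Aristotle']. For a data frame, something like df["text"].apply(extract_from_text) should work, where "text" stands in for whatever the column is actually called.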