import nltk
from nltk.parse import ViterbiParser

def pcfg_chartparser(grammarfile):
    # Read the grammar text and build a PCFG from it.
    with open(grammarfile) as f:
        grammar = f.read()
    return nltk.PCFG.fromstring(grammar)

grammarp = pcfg_chartparser("wsjp.cfg")
VP = ViterbiParser(grammarp)
print(VP)
# The two sentences discussed below.
sent = ["turn off the lights", "please turn off the lights"]
for w in sent:
    for tree in VP.parse(nltk.word_tokenize(w)):
        print(tree)
When I run the above code, it produces the following output for the sentence "turn off the lights":
(S (VP (VB turn) (PRT (RP off)) (NP (DT the) (NNS lights)))) (p=2.53851e-14)
However, it raises the following error for the sentence "please turn off the lights":
ValueError: Grammar does not cover some of the input words: u"'please'"
I am building a ViterbiParser by supplying it with a probabilistic context-free grammar. It works well on sentences whose words all appear in the rules of the grammar, but it fails to parse sentences containing a word that the grammar rules do not cover. How can I get around this limitation?
I am referring to this assignment.
First, try to use (i) namespaces and (ii) unambiguous variable names, e.g.:
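A minimal sketch of what that could look like. The names `load_pcfg` and `parse_sentences` are hypothetical, chosen only for illustration, and whitespace splitting stands in for `nltk.word_tokenize` so that the sketch does not depend on downloaded tokenizer models:

```python
import nltk
from nltk.parse import ViterbiParser


def load_pcfg(grammar_path):
    """Read a PCFG written in NLTK's text format from grammar_path."""
    with open(grammar_path) as grammar_file:
        return nltk.PCFG.fromstring(grammar_file.read())


def parse_sentences(parser, sentences):
    """Yield (sentence, trees) pairs, where trees lists the parses found."""
    for sentence in sentences:
        # Assumption: plain whitespace tokenization is good enough here.
        tokens = sentence.split()
        yield sentence, list(parser.parse(tokens))
```

With the grammar file from the question, this would be used as `parser = ViterbiParser(load_pcfg("wsjp.cfg"))` followed by a loop over `parse_sentences(parser, sentences)`. The parser object no longer shares its name with the `VP` non-terminal, which is the point of the renaming.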
If we look at the grammar:
To resolve the unknown-word issue, there are several options:
- Use wildcard non-terminal nodes to replace the unknown words. Find some way to replace the words that the grammar doesn't cover (the ones check_coverage() complains about) with the wildcard, then parse the sentence with the wildcard.
- Go back to the grammar production file that you had before learning the PCFG with learn_pcfg.py and add all possible words to the terminal productions.
- Add the unknown words to your PCFG grammar and then renormalize the weights, giving very small weights to the unknown words (you can also try smarter smoothing/interpolation techniques).
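The first and third options can be illustrated without touching the real grammar. The block below is only a rough sketch, not the assignment's solution: `replace_unknown`, `add_unknown_words`, the `UNK` token and the toy probabilities are all hypothetical names and values introduced here. It assumes you have already collected the words the grammar covers (in NLTK, the productions for which `is_lexical()` is true would be one source) and, for the wildcard route, that the grammar contains a terminal rule producing the wildcard token.

```python
def replace_unknown(tokens, vocabulary, wildcard="UNK"):
    """Swap every token the grammar does not cover for the wildcard token."""
    return [token if token in vocabulary else wildcard for token in tokens]


def add_unknown_words(word_probs, unknown_words, small_weight=1e-6):
    """Give unseen words a tiny weight, then rescale so the weights sum to 1.

    word_probs maps each word to its probability under ONE preterminal,
    for example all rules of the form UH -> 'word'.
    """
    # Ignore duplicates and words that already have a probability.
    new_words = sorted(set(unknown_words) - set(word_probs))
    total = sum(word_probs.values()) + small_weight * len(new_words)
    updated = {word: prob / total for word, prob in word_probs.items()}
    for word in new_words:
        updated[word] = small_weight / total
    return updated


# Toy data, made up for illustration only.
vocabulary = {"turn", "off", "the", "lights"}
tokens = "please turn off the lights".split()
print(replace_unknown(tokens, vocabulary))
# ['UNK', 'turn', 'off', 'the', 'lights']

uh_probs = {"yes": 0.6, "oh": 0.4}
print(add_unknown_words(uh_probs, ["please"]))
```

The constant `small_weight` is the crudest possible choice. It keeps the relative order of the known words intact, but a smoothing scheme that looks at word frequencies would usually assign the unseen mass more sensibly.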
Since this is a homework question I will not give the answer with the full code. But the hints above should be enough to resolve the problem.