Parsey McParseface incorrectly identifying root on questions


It seems to me that Parsey has severe issues with correctly parsing questions and any sentence containing "is".


Text: Is Barrack Obama from Hawaii?

GCloud Tokens (correct):

  • Is - [root] VERB
  • Barrack - [nn] NOUN
  • Obama - [nsubj] NOUN
  • from - [prep] ADP
  • Hawaii - [pobj] NOUN

Parsey Tokens (wrong):

  • Is - [cop] VERB
  • Barrack - [nsubj] NOUN
  • Obama - [root] NOUN
  • from - [prep] ADP
  • Hawaii - [pobj] NOUN

Parsey decides to make the noun (!) "Obama" the root, which throws off the rest of the parse.


Text: My name is Philipp

GCloud Tokens (correct):

  • My [poss] PRON
  • name [nsubj] NOUN
  • is [root] VERB
  • Philipp [attr] NOUN

Parsey Tokens (incorrect):

  • My [poss] PRON
  • name [nsubj] NOUN
  • is [cop] VERB
  • Philipp [root] NOUN

Again, Parsey chooses the NOUN as the root and struggles with the copula (cop) relation.


Any ideas why this is happening and how I could fix it?

Thanks, Phil


There are 3 best solutions below


I have to qualify my answer: I have limited knowledge of Parsey McParseface. However, since nobody else has answered, I hope I can add some value.

I think a major problem with most machine learning models is a lack of interpretability. This relates to your first question: "why is this happening?" It's very difficult to tell because this tool is founded on a 'black box' model, namely, a neural network. I will say that it seems extremely surprising, given the strong claims made about Parsey, that a common word like 'is' fools it consistently. Is it possible you've made some mistake? It's hard to tell without a code sample.

I'll assume you haven't made a mistake, in which case, I think you could solve this (or mitigate it) by taking advantage of your observation that the word 'is' seems to throw the model off. You could simply check the sentence in question for the word 'is' and use GCloud (or another parser) in that case. Conveniently, once you are using both, you can use GCloud as a fallback for other cases where Parsey seems to fail, should you find them in the future.
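If you do route sentences between the two parsers, the check can be as simple as a word-boundary match on forms of "be". A minimal sketch of that idea; `parsey` and `gcloud` are hypothetical stand-ins for your actual parser client calls:

```python
import re

# Forms of "be" that seem to trip Parsey up, per the examples above.
COPULA = re.compile(r"\b(am|is|are|was|were|be)\b", re.IGNORECASE)

def parse(sentence, parsey, gcloud):
    """Route copular sentences to GCloud, everything else to Parsey.

    `parsey` and `gcloud` are callables wrapping the real parsers
    (hypothetical names -- substitute your own client code).
    """
    if COPULA.search(sentence):
        return gcloud(sentence)
    return parsey(sentence)
```

Once both parsers are wired in, the same `gcloud` branch also serves as a fallback for any other sentence patterns you later find Parsey failing on.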

As for improving the base model, if you care enough, you could recreate it using the original paper, and perhaps optimize the training to suit your situation.


Regarding the first example, it appears that Parsey's training data is quite old and doesn't contain the word "Barack" at all. If you replace Barack Obama with Bill Clinton you get a correct parse.

Input: Is Bill Clinton from Hawaii ?

Parse:

  Is VBZ ROOT
   +-- Clinton NNP nsubj
   |   +-- Bill NNP nn
   +-- from IN prep
   |   +-- Hawaii NNP pobj
   +-- ? . punct

The second example is instead correctly parsed according to Stanford Dependencies (see "The treatment of copula verbs" in http://nlp.stanford.edu/software/dependencies_manual.pdf).

Input: My name is Philip

Parse:

  Philip NNP ROOT
   +-- name NN nsubj
   |   +-- My PRP$ poss
   +-- is VBZ cop
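Both analyses cover the same tokens and differ only in which token is the root. Writing each parse as a 1-based head array (0 marks the root) makes the disagreement explicit; this encoding is just an illustration, not either parser's actual output format:

```python
tokens = ["My", "name", "is", "Philipp"]

# 1-based index of each token's head; 0 marks the root.
gcloud_heads = [2, 3, 0, 3]  # "is" is the root; "Philipp" attaches to it
parsey_heads = [2, 4, 4, 0]  # "Philipp" is the root; "is" attaches as cop

def root_word(tokens, heads):
    """Return the token whose head is 0, i.e. the root of the parse."""
    return tokens[heads.index(0)]
```

Here `root_word(tokens, gcloud_heads)` gives "is" while `root_word(tokens, parsey_heads)` gives "Philipp": both are well-formed trees, and the Stanford Dependencies copula convention prefers the second.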


Since Parsey correctly tagged "Barrack Obama" as two nouns, I don't think unfamiliarity with the name is the problem. I think Parsey simply refuses to make "is" the root.

In theoretical dependency grammar, a noun is never the root of a complete sentence. Parsey, however, does not follow theory; it has a strong preference for making content words into heads. My guess is that when you say "X is Y" it decides the head of the sentence should be one of the content words rather than "is", because "is" is not an informative word.

...Except for the Bill Clinton example, which may prove me wrong! I have not yet gotten Parsey working on my own computer, so I'm not sure.