Should I stem domain words for named entity recognition?


My question is perhaps not strictly about programming, but I know many talented programmers work on NLP and might be able to answer it anyway.

I have compiled a document of domain words that I fuzzy-match against to extract named entities from text. The format is as follows:

  "ferry names": [
    {
      "stena danica": [
        "stena danica",
        "danica"
      ]
    },

The outer key is the category and the inner key is the entity. The innermost list holds the synonyms that the entity may be called by. My named entity recognition, simple as it is, works quite well. To make things easier for it, though, I decided to stem all the words in the text passed in. A match then looks like this:

{
  "category": "ferry names",
  "distance": 1,
  "entity": "stena danica",
  "interpreted": "stena danica",
  "raw": "stena danica",
  "stemmed": "stena danic"
}
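To make the setup concrete, here is a minimal sketch of how such a lexicon could be flattened and matched with an edit-distance threshold. It is not the asker's actual code: the names `LEXICON`, `flatten`, `levenshtein` and `match`, and the choice of Levenshtein distance as the fuzzy measure, are all assumptions for illustration.

```python
# Hypothetical lexicon in the same shape as the excerpt above.
LEXICON = {
    "ferry names": [
        {"stena danica": ["stena danica", "danica"]},
    ]
}


def flatten(lexicon):
    """Map each synonym to its (category, entity) pair."""
    index = {}
    for category, entities in lexicon.items():
        for entry in entities:
            for entity, synonyms in entry.items():
                for synonym in synonyms:
                    index[synonym] = (category, entity)
    return index


def levenshtein(a, b):
    """Edit distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def match(phrase, index, max_distance=1):
    """Return the closest synonym within max_distance, or None."""
    best = None
    for synonym, (category, entity) in index.items():
        distance = levenshtein(phrase, synonym)
        if distance <= max_distance and (best is None or distance < best["distance"]):
            best = {"category": category, "entity": entity,
                    "interpreted": synonym, "distance": distance}
    return best


index = flatten(LEXICON)
result = match("stena danic", index)  # the stemmed form from the question
```

Under these assumptions the stemmed phrase "stena danic" is one edit away from the synonym "stena danica", which is consistent with the "distance": 1 in the output above.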

The stemmer (NLTK's Snowball stemmer, SwedishStemmer) works brilliantly, but it also stems domain words, in this case Stena Danica.

Question: I'm not sure how to approach this. Should I simply not stem domain words, or should I add the stemmed versions to the synonyms? As it is, the stemmed form is still picked up by the fuzzy matcher, but it might introduce problems. Thank you.
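The first option (not stemming domain words) could be sketched roughly as follows. The stemmer here is a toy lookup table, not real Snowball output: only "danica" -> "danic" comes from the question, and the other entry is a guess. In practice you would presumably pass the `stem` method of NLTK's SwedishStemmer instead. The names `toy_stem` and `stem_except_protected` are hypothetical.

```python
# Toy stand-in for a real stemmer (assumption, illustrative only).
TOY_STEMS = {"danica": "danic", "stenarna": "sten"}


def toy_stem(word):
    """Look the word up in the toy table; unknown words pass through."""
    return TOY_STEMS.get(word, word)


def stem_except_protected(tokens, protected, stem=toy_stem):
    """Stem every token except those listed in `protected`."""
    return [tok if tok in protected else stem(tok) for tok in tokens]


# Protect every token that occurs in any synonym of the lexicon.
synonyms = ["stena danica", "danica"]
protected = {tok for syn in synonyms for tok in syn.split()}

tokens = ["stenarna", "vid", "stena", "danica"]
stemmed_all = [toy_stem(tok) for tok in tokens]
stemmed_protected = stem_except_protected(tokens, protected)
```

With the toy table, `stemmed_all` turns "danica" into "danic" while `stemmed_protected` leaves it intact. One limitation of this sketch is that it protects exact tokens only, so an inflected form of a domain word would still be stemmed.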


There are 2 best solutions below


There's really only one answer to your question: Try it both ways, test it (on data that you didn't use for training), and choose whichever works best.

In general the best way will depend on the domain, on the amount of training data, blah blah blah, try it and find out. Nobody can predict it with any certainty.
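As a rough illustration of "try it both ways", the comparison could look something like this. Everything here is made up for the sketch: the held-out examples, the two recognizer variants (plain dictionary lookups standing in for the real pipeline), and the function name `evaluate`.

```python
def evaluate(recognize, examples):
    """Score a recognizer on (text, expected_entity_or_None) pairs."""
    tp = fp = fn = 0
    for text, expected in examples:
        predicted = recognize(text)
        if predicted is not None and predicted == expected:
            tp += 1
        else:
            if predicted is not None:
                fp += 1  # predicted an entity that was not expected
            if expected is not None:
                fn += 1  # missed the expected entity
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical held-out data: text -> entity that should be found (or None).
HELD_OUT = [
    ("stena danica", "stena danica"),
    ("danica", "stena danica"),
    ("stenarna", None),
]

# Variant A: lexicon without stemmed synonyms (toy lookup).
variant_a = {"stena danica": "stena danica", "danica": "stena danica"}.get

# Variant B: same, but a stemmed synonym causes a spurious hit (toy lookup).
variant_b = {"stena danica": "stena danica", "danica": "stena danica",
             "stenarna": "stena line"}.get

scores_a = evaluate(variant_a, HELD_OUT)
scores_b = evaluate(variant_b, HELD_OUT)
```

Whichever variant scores better on data that was not used to build the lexicon or tune the matcher is the one to keep. The toy numbers here say nothing about real data.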


I might not be the most qualified person to answer this, but the way I see it, it depends on your goal. I perform stemming on my texts using NLTK to decrease my total vocabulary (to create document vectors and compare documents based on their content). I also stem named entities so that, for example, "Al Bundy" and "Al Bundys" can be recognized as the same thing. But I see a risk in adding the stemmed versions of your NEs to your synonyms. Consider the following example:

"ferry names": [
    {
      "stena line": [
        "stena line",
        "stena",
        "sten"     # Supposed to represent a stemmed version of Stena
      ]
    },

If the input contained "sten", "stenar", "stenarna", or any other word that is likely to be stemmed to "sten", you'd have a problem: it would be recognized as "Stena Line". Hope that helps. :)
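One way to spot this kind of collision before it bites is to stem a list of ordinary words and check which of them land on a stemmed synonym. The sketch below uses a toy stem table as a stand-in for the real stemmer (the stems are assumed, not verified Snowball output), and `find_collisions` is a hypothetical helper name.

```python
# Toy stand-in for a real stemmer (assumed stems, illustrative only).
TOY_STEMS = {"stenar": "sten", "stenarna": "sten"}


def toy_stem(word):
    """Look the word up in the toy table; unknown words pass through."""
    return TOY_STEMS.get(word, word)


def find_collisions(stemmed_synonyms, vocabulary, stem=toy_stem):
    """Return {stemmed_synonym: [ordinary words sharing that stem]}."""
    collisions = {}
    for word in vocabulary:
        stemmed = stem(word)
        if stemmed in stemmed_synonyms:
            collisions.setdefault(stemmed, []).append(word)
    return collisions


# Stemmed forms that were added to the lexicon as synonyms.
stemmed_synonyms = {"sten"}

# Ordinary, non-entity words, e.g. taken from a general word list.
vocabulary = ["sten", "stenar", "stenarna", "hamn"]

collisions = find_collisions(stemmed_synonyms, vocabulary)
```

Any stemmed synonym that shows up in `collisions` is a candidate for removal from the synonym list, or at least for a stricter distance threshold.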