My question is perhaps not strictly about programming, but I know many talented programmers work on NLP and might be able to answer it anyway.
I have compiled a document with domain words that I perform fuzzy matching on to extract named entities in text. The format is as follows:
    "ferry names": [
        {
            "stena danica": [
                "stena danica",
                "danica"
            ]
        },
The outer key names the category and the inner key names the entity. The innermost list holds the synonyms that the entity may be called by. My named entity recognition, simple as it is, works quite well. To make its job easier, though, I decided to stem all the words in the text passed in. A match then looks like this:
    {
        "category": "ferry names",
        "distance": 1,
        "entity": "stena danica",
        "interpreted": "stena danica",
        "raw": "stena danica",
        "stemmed": "stena danic"
    }
The stemmer (NLTK's Snowball stemmer, SwedishStemmer) works brilliantly, but it also stems domain words, in this case turning "stena danica" into "stena danic".
Question: I'm not sure how to approach this. Should I simply not stem domain words, or should I add the stemmed versions to the synonyms? As it is, the stemmed form is still picked up by the fuzzy matcher, but it might introduce problems. Thank you.
There's really only one answer to your question: Try it both ways, test it (on data that you didn't use for training), and choose whichever works best.
In general, the best approach will depend on the domain, the amount of training data, and so on, so try it and find out. Nobody can predict it with any certainty.
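To make "try it both ways" concrete, here is a minimal sketch of how the two options could be compared on held-out examples. Everything in it is an assumption for illustration: `toy_stem` is a hypothetical stand-in for the real stemmer, `find_entities` uses exact phrase matching rather than your fuzzy matcher, the gazetteer is flattened to entity -> synonyms, and the three test sentences are made up.

```python
from typing import Callable, Dict, List, Set, Tuple

Gazetteer = Dict[str, List[str]]  # entity -> synonyms (flattened, no categories)
Stemmer = Callable[[str], str]

GAZETTEER: Gazetteer = {"stena danica": ["stena danica", "danica"]}


def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: drops a final 'a' from long words."""
    return word[:-1] if len(word) > 5 and word.endswith("a") else word


def stem_all(text: str, stem: Stemmer) -> str:
    """Baseline preprocessing: stem every token."""
    return " ".join(stem(w) for w in text.lower().split())


def domain_words(gazetteer: Gazetteer) -> Set[str]:
    """Collect every single word that occurs in any synonym."""
    return {w for syns in gazetteer.values() for s in syns for w in s.split()}


def stem_protected(text: str, stem: Stemmer, protected: Set[str]) -> str:
    """Option A: leave domain words untouched and stem everything else."""
    return " ".join(w if w in protected else stem(w) for w in text.lower().split())


def with_stemmed_synonyms(gazetteer: Gazetteer, stem: Stemmer) -> Gazetteer:
    """Option B: add the stemmed form of each synonym to the synonym list."""
    expanded: Gazetteer = {}
    for entity, syns in gazetteer.items():
        stemmed = [stem_all(s, stem) for s in syns]
        # dict.fromkeys removes duplicates while keeping the original order.
        expanded[entity] = list(dict.fromkeys(syns + stemmed))
    return expanded


def find_entities(text: str, gazetteer: Gazetteer) -> Set[str]:
    """Exact whole-phrase lookup (an assumption; swap in the fuzzy matcher here)."""
    padded = f" {text} "
    return {e for e, syns in gazetteer.items() if any(f" {s} " in padded for s in syns)}


def accuracy(examples: List[Tuple[str, Set[str]]],
             preprocess: Callable[[str], str],
             gazetteer: Gazetteer) -> float:
    """Fraction of examples whose extracted entity set equals the expected set."""
    hits = sum(find_entities(preprocess(text), gazetteer) == expected
               for text, expected in examples)
    return hits / len(examples)


# Made-up held-out examples: (text, expected entities).
HELD_OUT: List[Tuple[str, Set[str]]] = [
    ("vi reste med stena danica", {"stena danica"}),
    ("danica var sen", {"stena danica"}),
    ("ingen resa idag", set()),
]

if __name__ == "__main__":
    protected = domain_words(GAZETTEER)
    stemmed_gazetteer = with_stemmed_synonyms(GAZETTEER, toy_stem)
    print("stem everything:", accuracy(HELD_OUT, lambda t: stem_all(t, toy_stem), GAZETTEER))
    print("option A:", accuracy(HELD_OUT, lambda t: stem_protected(t, toy_stem, protected), GAZETTEER))
    print("option B:", accuracy(HELD_OUT, lambda t: stem_all(t, toy_stem), stemmed_gazetteer))
```

To run this against the real setup, pass `SnowballStemmer("swedish").stem` from `nltk.stem.snowball` in place of `toy_stem`, replace `find_entities` with the fuzzy matcher, and use a labelled sample of real text as `HELD_OUT`. With a fuzzy matcher the unprotected baseline will score better than it does under exact matching, which is exactly the kind of difference only a test can show.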