Stemming vs. lemmatization for financial text in Python [NLTK]


To extract more information from annual reports (10-Ks), I am trying to compare companies based on cosine similarity. One of the steps in this research is the stemming or lemmatization of words. The reason for doing this is to reduce each word to its root, so that different variations of a word that mean the same thing at their core are not counted as different terms. For the stemmer and lemmatizer, I used SnowballStemmer and WordNetLemmatizer from the NLTK package.
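For context, here is a minimal sketch of the kind of pipeline described above, not the exact code used in the research. It assumes raw term counts as the document vectors (a real comparison might well use TF-IDF instead) and a simple regex tokenizer. The helper names `stemmed_counts` and `cosine_similarity` and the two sample sentences are made up for illustration.

```python
import math
import re
from collections import Counter

from nltk.stem import SnowballStemmer

# The English Snowball stemmer needs no extra corpus download.
_stemmer = SnowballStemmer("english")


def stemmed_counts(text):
    """Lowercase the text, keep alphabetic tokens, and count their stems."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(_stemmer.stem(token) for token in tokens)


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two bag-of-words count vectors."""
    shared_terms = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared_terms)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction to compare
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical sentences standing in for two 10-K excerpts.
    doc_a = "The company owed interest on outstanding loans."
    doc_b = "The company owes interest on its outstanding loan."
    print(cosine_similarity(stemmed_counts(doc_a), stemmed_counts(doc_b)))
```

Swapping the stemmer for a lemmatizer only changes the normalization step inside `stemmed_counts`, which makes it easy to compare both options on the same documents.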

E.g. of stemming:
walking -> walk
walked -> walk
owing -> owe
owed -> owe

E.g. of lemmatization:
walking -> walking
walked -> walked
owing -> owing
owed -> owed

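One likely reason the lemmatizer leaves these words unchanged is that `WordNetLemmatizer.lemmatize` treats its input as a noun unless a `pos` argument is passed. The sketch below runs the stemmer next to the lemmatizer, both with the default part of speech and with `pos="v"`, so the three outputs can be checked side by side. The `compare` helper is hypothetical. The lemmatizer needs the WordNet corpus (for example via `nltk.download("wordnet")`), so the sketch falls back to `None` when that data is missing.

```python
from nltk.stem import SnowballStemmer, WordNetLemmatizer

WORDS = ["walking", "walked", "owing", "owed"]

stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()


def compare(words):
    """Return (word, stem, noun_lemma, verb_lemma) tuples.

    The lemma fields are None when the WordNet corpus is not installed.
    """
    rows = []
    for word in words:
        stem = stemmer.stem(word)
        try:
            noun_lemma = lemmatizer.lemmatize(word)           # pos defaults to "n"
            verb_lemma = lemmatizer.lemmatize(word, pos="v")  # treat the word as a verb
        except LookupError:
            # Raised by NLTK when the WordNet data has not been downloaded.
            noun_lemma = verb_lemma = None
        rows.append((word, stem, noun_lemma, verb_lemma))
    return rows


if __name__ == "__main__":
    for row in compare(WORDS):
        print(row)
```

If the verb lemmas turn out to match the stems for the words that matter in the filings, the choice between the two tools may come down to how much part-of-speech tagging the pipeline can afford.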
The question is the following: should I use the stemmer or a lemmatizer for financial text?

The way I see it, a stemmer would be more appropriate for this kind of research.

Disclaimer: I know there is already a question discussing stemming vs. lemmatization on Stack Overflow. However, I am looking for clarification regarding financial text in particular, not the general case.
