I have written an application that measures text importance. It takes a text article, splits it into words, drops stopwords, performs stemming, and counts word-frequency and document-frequency. Word-frequency counts how many times a given word appears across all documents, and document-frequency counts how many documents a given word appears in.
Here's an example with two text articles:
- Article I) "A fox jumps over another fox."
- Article II) "A hunter saw a fox."
Article I gets split into words (after stemming and dropping stopwords):
- ["fox", "jump", "another", "fox"].
Article II gets split into words:
- ["hunter", "see", "fox"].
These two articles produce the following word-frequency and document-frequency counters:
- fox (word-frequency: 3, document-frequency: 2)
- jump (word-frequency: 1, document-frequency: 1)
- another (word-frequency: 1, document-frequency: 1)
- hunter (word-frequency: 1, document-frequency: 1)
- see (word-frequency: 1, document-frequency: 1)
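For reference, the counters above can be reproduced with a short sketch. It assumes the articles have already been tokenized, stemmed, and stripped of stopwords; the function name `count_frequencies` is made up for illustration and is not part of the application itself.

```python
from collections import Counter

# Token lists from the example (already stemmed, stopwords removed).
articles = [
    ["fox", "jump", "another", "fox"],  # Article I
    ["hunter", "see", "fox"],           # Article II
]

def count_frequencies(docs):
    """Return (word_freq, doc_freq) counters for a list of token lists."""
    word_freq = Counter()
    doc_freq = Counter()
    for tokens in docs:
        word_freq.update(tokens)       # every occurrence counts
        doc_freq.update(set(tokens))   # each word counts once per document
    return word_freq, doc_freq

word_freq, doc_freq = count_frequencies(articles)
for word in word_freq:
    print(f"{word} (word-frequency: {word_freq[word]}, "
          f"document-frequency: {doc_freq[word]})")
```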
Given a new text article, how do I measure how similar this article is to previous articles?
I've read about the tf-idf measure, but it doesn't seem to apply here because I'm dropping stopwords, so words like "a" and "the" don't appear in the counters.
For example, if I have a new text article that says "hunters love foxes", how do I come up with a measure that says this article is pretty similar to the ones previously seen?
As another example, if I have a new text article that says "deer are funny", then this is a totally new article and its similarity should be 0.
I imagine I somehow need to sum the word-frequency and document-frequency counter values, but what's a good formula to use?
I would suggest tf-idf and cosine similarity.
You can still use tf-idf if you drop stop-words. In fact, whether or not you include stop-words probably makes little difference: the inverse document frequency automatically downweights stop-words, since they are very frequent and appear in most documents.
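To make the downweighting concrete, here is one common textbook form of the weighting (an assumption for illustration; libraries often use smoothed variants). Here $N$ is the number of known documents, $\mathrm{df}(t)$ is the document-frequency of term $t$, and $\mathrm{tf}(t, d)$ is the number of times $t$ occurs in document $d$:

```latex
\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)},
\qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t)
```

A word that appears in every document has $\mathrm{df}(t) = N$, so $\mathrm{idf}(t) = \log 1 = 0$ and it contributes nothing to the similarity, whether or not it was removed beforehand.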
If your new document is entirely made of unknown terms, the cosine similarity will be 0 with every known document.
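A minimal sketch of this approach, using the two example articles, might look like the following. Several things here are assumptions rather than part of the question: the helper names (`document_frequency`, `tfidf_vector`, `cosine`) are invented for illustration, the new articles are assumed to have been stemmed the same way as the known ones, and the idf is a smoothed variant, `log((1 + N) / (1 + df)) + 1`. The smoothing is there because with the plain `log(N / df)` form, "fox" would get weight zero in this tiny two-document corpus. Terms never seen before are simply ignored when building the vector.

```python
import math
from collections import Counter

# Known articles, already stemmed and with stopwords removed.
known_docs = [
    ["fox", "jump", "another", "fox"],  # Article I
    ["hunter", "see", "fox"],           # Article II
]

def document_frequency(docs):
    """Count in how many documents each term appears."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df

def tfidf_vector(tokens, df, n_docs):
    """Sparse tf-idf vector as a dict; unknown terms are dropped.

    Uses a smoothed idf (an assumption of this sketch):
    idf(t) = log((1 + N) / (1 + df(t))) + 1
    """
    tf = Counter(tokens)
    return {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in tf.items()
        if term in df
    }

def cosine(a, b):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)

df = document_frequency(known_docs)
n_docs = len(known_docs)
known_vectors = [tfidf_vector(doc, df, n_docs) for doc in known_docs]

def similarities(new_tokens):
    """Similarity of a new (stemmed) article to each known article."""
    new_vector = tfidf_vector(new_tokens, df, n_docs)
    return [cosine(new_vector, vec) for vec in known_vectors]

# "hunters love foxes" and "deer are funny", assumed stemmed/filtered.
hunter_sims = similarities(["hunter", "love", "fox"])
deer_sims = similarities(["deer", "funny"])
print(hunter_sims)
print(deer_sims)
```

To get a single score for "how similar is this to what I have seen before", one option is to take the maximum (or the mean) of the per-article similarities. Which one fits best depends on what you want the score to express.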