Finding related texts (correlation between two texts)


I'm trying to find similar articles in a database via correlation.

So I split each text into an array of words, delete frequently used words (articles, pronouns, and so on), and then compare the two texts with a Pearson correlation coefficient function. It works for some texts, but for others it's not so good (longer texts get higher coefficients).

Can somebody advise a good method for finding related texts?


There are 2 best solutions below

0
On

Some of the problems you mention boil down to normalizing for document length and overall word frequency. Try tf-idf.
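To make the suggestion concrete, here is a minimal sketch of one common tf-idf variant in plain Python. The whitespace tokenizer, the function names, and the exact weighting (relative term frequency times the log of inverse document frequency) are assumptions chosen for illustration; libraries usually apply smoothing and other refinements.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    return text.lower().split()


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            # Dividing by the document length removes the length bias;
            # the log factor down-weights terms found in many documents.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
vectors = tf_idf_vectors(docs)
```

In this toy corpus "the" occurs in every document, so its weight is zero, while "mat" (found in one document) outweighs "cat" (found in two). That is the normalization over word frequency mentioned above, without a hand-made stop-word list.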

0
On

First and foremost, you need to specify precisely what you mean by similarity, and when two documents count as more or less similar.

If the similarity you are looking for is literal, then I would vectorise the documents using term frequencies and compare them with cosine similarity, given that texts are inherently directional data. Weighting schemes such as tf-idf and log-entropy may be worth testing, depending on your use case. Edit distance is inefficient with long texts.
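As a rough illustration of the term-frequency plus cosine approach, the sketch below uses only the standard library. The whitespace tokenizer and the function names are assumptions made for the example, not part of any particular library.

```python
import math
from collections import Counter


def term_frequencies(text):
    # Raw term counts; the tokenizer is a naive lowercase/whitespace split.
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


short = term_frequencies("database index query")
# The same content repeated three times: a longer text, same direction.
long = term_frequencies("database index query " * 3)
other = term_frequencies("banana apple cherry")

same_topic = cosine_similarity(short, long)
different_topic = cosine_similarity(short, other)
```

Because cosine similarity depends only on the direction of the vectors, the short text and its threefold repetition score approximately 1, while texts with no words in common score 0. Text length by itself does not raise the score.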

If you care more about semantics, word embeddings are your ally.
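One simple way to use embeddings, sketched below, is to average the word vectors of each document and compare the averages with cosine similarity. The three-dimensional vectors here are invented purely for illustration; in practice you would load pretrained embeddings, and the names `TOY_EMBEDDINGS` and `document_vector` are hypothetical.

```python
import math

# Toy vectors, made up for this example only (not real embeddings).
TOY_EMBEDDINGS = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.85, 0.15, 0.0],
    "engine": [0.7, 0.3, 0.1],
    "banana": [0.0, 0.2, 0.9],
    "fruit": [0.1, 0.1, 0.95],
}


def document_vector(text, embeddings):
    """Average the vectors of the known words; None if no word is known."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


query = document_vector("car engine", TOY_EMBEDDINGS)
related = document_vector("automobile", TOY_EMBEDDINGS)
unrelated = document_vector("banana fruit", TOY_EMBEDDINGS)

related_score = cosine_similarity(query, related)
unrelated_score = cosine_similarity(query, unrelated)
```

Note that "car engine" and "automobile" share no words, so a purely term-based comparison would score them 0, yet their averaged vectors point in nearly the same direction.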