Finding related texts (correlation between two texts)

2k Views Asked by x2. At 27 July 2025 at 12:08

I'm trying to find similar articles in a database via correlation.

So I split each text into an array of words, delete frequently used words (articles, pronouns and so on), then compare the two texts with a Pearson coefficient function. For some texts it works, but for others it's not so good (longer texts get a higher coefficient).

Can somebody advise a good method to find related texts?


There are 2 best solutions below

highBandWidth On 30 April 2011 at 14:45

Some of the problems you mention boil down to normalizing over document length and overall word frequency. Try tf-idf.
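To make the suggestion concrete, here is a minimal sketch of tf-idf weighting in plain Python. The function name `tfidf_vectors`, the whitespace tokenizer, and the smoothed idf formula are assumptions chosen for illustration; libraries use several variants. Term counts are divided by document length and each vector is scaled to unit length, so a long article does not get larger weights just because it has more words.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, scaled to unit length."""
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # tf is the count relative to document length; the idf here is a
        # smoothed variant (assumption): log((1 + N) / (1 + df)) + 1.
        weights = {
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        }
        # L2-normalize so vectors of long and short texts are comparable.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stock markets fell sharply today",
]
vectors = tfidf_vectors(docs)
```

In the first document, "cat" ends up with a higher weight than "sat" because "sat" also occurs in the second document, which is the down-weighting of common words that tf-idf is meant to provide.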

Rafs On 27 October 2020 at 15:20

First and foremost, you need to specify precisely what you mean by similarity and when two documents are more or less similar.

If the similarity you are looking for is literal, then I would vectorise the documents using term frequencies and use cosine similarity to compare them, given that texts are inherently directional data. tf-idf and log-entropy weighting schemes may be worth testing, depending on your use case. Edit distance is inefficient for long texts.
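As a rough sketch of that approach, the snippet below builds raw term-frequency vectors and compares them with cosine similarity. The helper names `term_frequencies` and `cosine_similarity` and the whitespace tokenizer are assumptions for illustration, and stop-word removal is left out. Because cosine similarity depends only on the direction of the vectors, repeating a text several times does not change its score against the original.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count words using a naive lowercase, whitespace tokenizer (assumption)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


short_text = "cats chase mice"
long_text = " ".join([short_text] * 3)  # same content, three times as long
other_text = "stocks fell today"

same_direction = cosine_similarity(
    term_frequencies(short_text), term_frequencies(long_text)
)
unrelated = cosine_similarity(
    term_frequencies(short_text), term_frequencies(other_text)
)
```

Here `same_direction` comes out at 1 (up to floating-point rounding) despite the length difference, and `unrelated` is 0.0 because the two texts share no words.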

If you care more about the semantics, word embeddings are your ally.
