I am building a plagiarism checker for text files. I have done all the preprocessing (stop-word removal, stemming, etc.), built my index, and filtered the results, so the system is almost done. I chunked both the corpus and the user document by sentences (the sentence separators are . ? !). When I tested the results, I noticed that chunking by sentences is not robust, since a user can change the punctuation to cheat my service. I read many articles about chunking, and the best approach seemed to be k-word overlapping chunking, i.e. splitting the text into fixed-size word windows that overlap. My question is: how do I calculate the similarity between a user chunk and a corpus chunk in this case, given that the overlapping words will inflate the similarity?
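For reference, a minimal sketch of the kind of sentence chunking I currently use (splitting on . ? !, which is exactly what a cheater can exploit by changing punctuation):

```python
import re

def sentence_chunks(text):
    # Split on the sentence separators . ? ! and drop empty pieces.
    return [s.strip() for s in re.split(r"[.?!]+", text) if s.strip()]

print(sentence_chunks("How can I find similar sentences? Here is one. Great!"))
# ['How can I find similar sentences', 'Here is one', 'Great']
```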
Example (ignoring stemming and stop-word removal): here the window size is 4 words and the window advances by 1 word at a time, so consecutive chunks overlap by 3 words (both values may change).
user sentence = How can I find similar sentences in your corpus.
chunks = how can I find, can I find similar, I find similar sentences, find similar sentences in, similar sentences in your, sentences in your corpus.
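To be concrete, this is the sliding-window chunking I mean; a small sketch where k (window size) and step (how far the window advances) are parameters:

```python
def word_chunks(text, k=4, step=1):
    # Slide a window of k words over the text, advancing by `step` words,
    # so consecutive chunks share k - step words.
    words = text.split()
    return [" ".join(words[i:i + k])
            for i in range(0, len(words) - k + 1, step)]

sentence = "How can I find similar sentences in your corpus"
print(word_chunks(sentence, k=4, step=1))
# ['How can I find', 'can I find similar', ..., 'sentences in your corpus']
```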
Now when I test those chunks against the corpus (say the corpus has the chunk "How can I find"), you can see that two of the user chunks (how can I find, can I find similar) match the corpus chunk, but those two user chunks are largely redundant with each other. So how can I eliminate this redundancy? Sorry for the long explanation.
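To make the redundancy concrete, here is a sketch (not my real matching code) that tags each chunk with its word offsets in the user document; the two matching chunks clearly cover overlapping word ranges:

```python
def chunk_spans(text, k=4, step=1):
    # Like word_chunks above, but keep each chunk's (start, end) word indices
    # so overlapping matches show up as overlapping spans.
    words = text.split()
    return [(" ".join(words[i:i + k]), i, i + k)
            for i in range(0, len(words) - k + 1, step)]

for chunk, start, end in chunk_spans("How can I find similar sentences in your corpus"):
    print(f"words [{start}:{end}] -> {chunk}")
# The two matches "How can I find" (words [0:4]) and "can I find similar"
# (words [1:5]) overlap on words 1..3 -- that overlap is the redundancy I mean.
```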