I am building a plagiarism checker for text files. I have done all the preprocessing (stop-word removal, stemming, etc.), built my index, and filtered the results, so the system is almost done. I chunked both the corpus and the user document by sentences (the sentence separators are . ? !). When I tested the results, I noticed that chunking by sentences is not robust, since a user can change the punctuation to cheat my service. I read many articles about chunking, and the best approach seemed to be k-word overlapping chunking, i.e. splitting the text into fixed-size word windows that overlap. My question is: how do I calculate the similarity between a user chunk and a corpus chunk in this case, given that the overlapping words will inflate the similarity?
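For reference, a minimal sketch of the kind of sentence chunking I currently use (splitting on . ? !, which is exactly what a cheater can exploit by changing punctuation):

```python
import re

def sentence_chunks(text):
    # Split on the sentence separators . ? ! and drop empty pieces.
    return [s.strip() for s in re.split(r"[.?!]+", text) if s.strip()]

print(sentence_chunks("How can I find similar sentences? Here is one. Great!"))
# ['How can I find similar sentences', 'Here is one', 'Great']
```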
Example (ignoring stemming and stop-word removal): here the window size is 4 words and the window advances by 1 word at a time, so consecutive chunks overlap by 3 words (both values may change).
user sentence = How can I find similar sentences in your corpus.
chunks = how can I find, can I find similar, I find similar sentences, find similar sentences in, similar sentences in your, sentences in your corpus.
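To be concrete, this is the sliding-window chunking I mean; a small sketch where k (window size) and step (how far the window advances) are parameters:

```python
def word_chunks(text, k=4, step=1):
    # Slide a window of k words over the text, advancing by `step` words,
    # so consecutive chunks share k - step words.
    words = text.split()
    return [" ".join(words[i:i + k])
            for i in range(0, len(words) - k + 1, step)]

sentence = "How can I find similar sentences in your corpus"
print(word_chunks(sentence, k=4, step=1))
# ['How can I find', 'can I find similar', ..., 'sentences in your corpus']
```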
Now when I test those chunks against the corpus (say the corpus has the chunk "How can I find"), you can see that two of the user chunks (how can I find, can I find similar) match the corpus chunk, but those two user chunks are largely redundant with each other. So how can I eliminate this redundancy? Sorry for the long explanation.
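To make the redundancy concrete, here is a sketch (not my real matching code) that tags each chunk with its word offsets in the user document; the two matching chunks clearly cover overlapping word ranges:

```python
def chunk_spans(text, k=4, step=1):
    # Like word_chunks above, but keep each chunk's (start, end) word indices
    # so overlapping matches show up as overlapping spans.
    words = text.split()
    return [(" ".join(words[i:i + k]), i, i + k)
            for i in range(0, len(words) - k + 1, step)]

for chunk, start, end in chunk_spans("How can I find similar sentences in your corpus"):
    print(f"words [{start}:{end}] -> {chunk}")
# The two matches "How can I find" (words [0:4]) and "can I find similar"
# (words [1:5]) overlap on words 1..3 -- that overlap is the redundancy I mean.
```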