I need to compare documents stored in a DB and come up with a similarity score between 0 and 1.
The method I need to use has to be very simple: a vanilla implementation of n-grams (where it is possible to define how many grams to use), along with a simple implementation of tf-idf and cosine similarity.
Is there any program that can do this? Or should I start writing this from scratch?
Check out the NLTK package (http://www.nltk.org); it has everything you need.
For the cosine_similarity:
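The original snippet is not shown here, so what follows is only a minimal standard-library sketch, not NLTK's own code. It assumes each document is represented as a dict mapping a term to a non-negative weight; the function name `cosine_similarity` and that representation are illustrative choices.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    # Only terms present in both vectors contribute to the dot product.
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in shared)

    # Euclidean length of each vector.
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))

    # An empty or all-zero vector has no direction; treat it as dissimilar.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

With non-negative weights such as raw counts or tf-idf values, the result falls between 0 and 1, which matches the score range asked for in the question.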
For ngrams:
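NLTK ships an n-gram helper (`nltk.util.ngrams`), but the original snippet is missing, so here is a dependency-free sketch of the same idea. The function name and the choice to return a list of tuples are assumptions made for illustration.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token sequence as a list of tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    tokens = list(tokens)
    # Slide a window of width n across the sequence; a sequence shorter
    # than n yields an empty list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

For example, `ngrams("the quick brown fox".split(), 2)` gives the three word pairs in order, and changing `n` selects how many grams to use.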
For tf-idf you will have to compute the term frequency distribution first. I am using Lucene to do that, but you can do something very similar with NLTK using FreqDist:
http://nltk.googlecode.com/svn/trunk/doc/book/ch01.html#frequency_distribution_index_term
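As a rough illustration of the step described above, this sketch uses `collections.Counter` from the standard library in the role that FreqDist would play. The weighting formula (relative term frequency times the log of inverse document frequency) is one common variant chosen here as an assumption; Lucene and other tools use different formulas.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute tf-idf vectors for a list of tokenized documents.

    documents: list of token lists.
    Returns one {term: weight} dict per document.
    """
    n_docs = len(documents)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))

    vectors = []
    for tokens in documents:
        counts = Counter(tokens)  # per-document frequency distribution
        total = len(tokens)
        vectors.append({
            # tf = count / document length, idf = log(N / df)
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors
```

Note that under this variant a term appearing in every document gets a weight of 0, so it contributes nothing to a later cosine comparison.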
If you like PyLucene, this will tell you how to compute tf-idf