Does Mahout provide a way to determine similarity between content?
I would like to produce content-based recommendations as part of a web application. I know Mahout is good at taking user-ratings matrices and producing recommendations based off of them, but I am not interested in collaborative (ratings-based) recommendations. I want to score how well two pieces of text match and then recommend items that match most closely to text that I store for users in their user profile...
I've read Mahout's documentation, and it looks like it mainly facilitates collaborative (ratings-based) recommendations, not content-based ones... Is this true?
That is not entirely true. Mahout does not have a content-based recommender, but it does have algorithms for computing similarities between items based on their content. One of the most popular approaches is TF-IDF weighting combined with cosine similarity. However, the computation is not done on the fly; it is done offline. You need Hadoop to compute the pairwise content similarities quickly. The steps below are for Mahout 0.8. I am not sure whether they changed in 0.9.
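To make the TF-IDF and cosine similarity idea concrete, here is a minimal plain-Java sketch for a handful of in-memory documents. It is not Mahout code: the class name `TfIdfCosine`, its methods, and the smoothed IDF formula are assumptions chosen for illustration, and Mahout's own weighting may differ.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Mahout): TF-IDF weighting plus cosine similarity. */
final class TfIdfCosine {

    /** Builds one sparse TF-IDF vector (term -> weight) per tokenized document. */
    static List<Map<String, Double>> tfidf(List<List<String>> docs) {
        // Document frequency: the number of documents that contain each term.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> doc : docs) {
            Map<String, Double> vec = new HashMap<>();
            for (String term : doc) {
                vec.merge(term, 1.0, Double::sum); // raw term frequency
            }
            for (Map.Entry<String, Double> e : vec.entrySet()) {
                // Smoothed IDF (an assumed variant): ln((1 + N) / (1 + df)) + 1
                double idf = Math.log((1.0 + docs.size()) / (1.0 + df.get(e.getKey()))) + 1.0;
                e.setValue(e.getValue() * idf);
            }
            vectors.add(vec);
        }
        return vectors;
    }

    /** Cosine similarity of two sparse vectors; returns 0 if either vector is empty. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }
}
```

Calling `TfIdfCosine.tfidf(...)` on your tokenized documents and then `TfIdfCosine.cosine(...)` on any two of the resulting vectors gives a score between 0 and 1 for non-negative weights. This is fine for a few hundred documents; for large collections the offline Hadoop route described here is the better fit.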
Step 1. You need to convert your text documents into sequence files. I lost the command for this in Mahout 0.8, but in 0.9 it is something like this (please check it for your version of Mahout):
Step 2. You need to convert your sequence files into sparse vectors like this:
where:
Step 3. Create a matrix from the vectors:
Step 4. Create a collection of similar docs for each row of the matrix above. This will generate the 50 most similar docs to each doc in the collection.
This will produce a file containing, for each item, its similarities to the 50 most similar items based on content.
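What Step 4 computes can be pictured with a small in-memory analogue: given a matrix of pairwise similarities, keep the N best neighbours of each row. The sketch below is plain Java, not the Mahout job; the class `TopNSimilar` and the dense `double[][]` input are assumptions for illustration (Mahout works on sparse, distributed vectors).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Mahout): top-N most similar rows per row. */
final class TopNSimilar {

    /**
     * For each row i of a square similarity matrix, returns the indices of up to
     * n other rows with the highest similarity to i, best first.
     */
    static List<List<Integer>> topN(double[][] sim, int n) {
        List<List<Integer>> result = new ArrayList<>();
        for (int i = 0; i < sim.length; i++) {
            final double[] row = sim[i];
            List<Integer> candidates = new ArrayList<>();
            for (int j = 0; j < row.length; j++) {
                if (j != i) {
                    candidates.add(j); // a document is not its own neighbour
                }
            }
            // Sort by similarity, highest first; the sort is stable, so ties keep index order.
            candidates.sort((a, b) -> Double.compare(row[b], row[a]));
            int keep = Math.min(n, candidates.size());
            result.add(new ArrayList<>(candidates.subList(0, keep)));
        }
        return result;
    }
}
```

With `n = 50` this mirrors the "50 most similar docs" setting, although sorting every row this way only scales to small collections.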
Now, to use it in your recommendation process you need to read the file or load it into a database, depending on how many resources you have. I loaded it into main memory using a Collection<GenericItemSimilarity.ItemItemSimilarity>. Here are two simple functions that did the job for me:
At the end, in your recommendation class you call this:
Where filename is your docIndex filename, and folder is the folder of the item-similarity files. At the end, this is nothing more than item-item based recommendation.
Hope this can help you.
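As a rough illustration of why this reduces to item-item recommendation, the following plain-Java sketch scores candidate items by summing their precomputed similarities to the items already in a user's profile. It does not use Mahout's classes: `ItemItemSketch`, the nested-map layout, and the sum-of-similarities scoring are all assumptions for illustration, not the functions from the original answer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper (not part of Mahout): item-item scoring from precomputed similarities. */
final class ItemItemSketch {

    /**
     * @param similar  itemId -> (neighbour itemId -> content similarity), loaded offline
     * @param profile  ids of the items already associated with the user
     * @param howMany  maximum number of recommendations to return
     * @return candidate item ids, highest accumulated similarity first
     */
    static List<Long> recommend(Map<Long, Map<Long, Double>> similar,
                                Set<Long> profile, int howMany) {
        Map<Long, Double> scores = new HashMap<>();
        for (Long owned : profile) {
            Map<Long, Double> neighbours =
                    similar.getOrDefault(owned, Collections.<Long, Double>emptyMap());
            for (Map.Entry<Long, Double> e : neighbours.entrySet()) {
                if (!profile.contains(e.getKey())) {
                    // Accumulate similarity over every profile item that points at this candidate.
                    scores.merge(e.getKey(), e.getValue(), Double::sum);
                }
            }
        }
        List<Long> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a)); // higher score first
            return byScore != 0 ? byScore : Long.compare(a, b);          // ties: lower id first
        });
        return new ArrayList<>(ranked.subList(0, Math.min(howMany, ranked.size())));
    }
}
```

In a real Mahout setup the similarity table would come from the files produced in Step 4, and the library's item-based recommender would take the place of this hand-rolled scoring.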