I am trying to build an item-item similarity recommendation engine with Mahout. The data set is in the following format (the attributes are text, not numeric):
name : category : cost : ingredients
x : xx1 : 15 : xxx1, xxx2, xxx3
y : yy1 : 14 : yyy1, yyy2, yyy3
z : xx1 : 12 : xxx1, xxy1
To train Mahout on this data set, what is the right way to convert it into a numeric (CSV Boolean data set) format that Mahout accepts?
With Mahout v1, the data can be encoded in a text-delimited/CSV-type file.
All IDs are strings, so you may want to assign IDs to cost ranges instead of using the actual cost as a numeric value. Also make sure that no two columns can contain the same ID, so that cost-range IDs are distinct from ingredient IDs and category IDs.
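As a rough sketch of that encoding step, the Python below turns the sample rows from the question into one line per item, with prefixed string IDs so the columns cannot collide. The prefixes (cat_, cost_, ing_), the bucket width of 5, and the helper names are all made up for illustration. The tab-then-space-separated layout is my reading of the default delimiters described in the docs linked at the end of this answer, so check it against your Mahout version before relying on it.

```python
def cost_bucket(cost, width=5):
    """Map a numeric cost to a range ID such as 'cost_10-14' (width is an assumption)."""
    low = (int(cost) // width) * width
    return f"cost_{low}-{low + width - 1}"


def parse_line(line):
    """Split 'name : category : cost : ing1, ing2' into its four parts."""
    name, category, cost, ingredients = [field.strip() for field in line.split(":")]
    return name, category, cost, [i.strip() for i in ingredients.split(",") if i.strip()]


def encode_row(name, category, cost, ingredients):
    """Build 'itemID<tab>id1 id2 ...' with a distinct prefix per attribute type."""
    tokens = [f"cat_{category}", cost_bucket(cost)]
    tokens += [f"ing_{i}" for i in ingredients]
    return name + "\t" + " ".join(tokens)


raw = """x : xx1 : 15 : xxx1, xxx2, xxx3
y : yy1 : 14 : yyy1, yyy2, yyy3
z : xx1 : 12 : xxx1, xxy1"""

encoded = [encode_row(*parse_line(line)) for line in raw.splitlines()]
for row in encoded:
    print(row)
```

With these assumptions, items y and z share the ID cost_10-14, and x and z share cat_xx1 and ing_xxx1, which is the kind of overlap the similarity job works from.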
Run
mahout spark-rowsimilarity
on this data and you'll get back files containing a list of similar items for each item. Each list is sorted, and the strength is the LLR (log-likelihood ratio) score for how similar the items are.
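If you want to read the results back in, something like the following might work. It assumes each output line looks like itemID, a tab, then space-separated otherID:strength pairs, which is how I understand the default delimiters; the sample line and its scores are invented for illustration and are not real Mahout output.

```python
def parse_similarity_line(line):
    """Parse 'item<tab>other1:score1 other2:score2' into (item, [(other, score), ...])."""
    item, _, rest = line.rstrip("\n").partition("\t")
    similar = []
    for token in rest.split():
        if ":" in token:
            other, score = token.rsplit(":", 1)
            similar.append((other, float(score)))
        else:
            # Strength may be absent if the job was configured to omit it.
            similar.append((token, None))
    return item, similar


# Made-up example line; real scores will differ.
sample = "x\tz:1.5 y:0.25"
item, similar = parse_similarity_line(sample)
print(item, similar)
```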
Docs here: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html