Creating a data model for Mahout

I am trying to build an item-item similarity recommendation engine with Mahout. The data set has the following format (the attributes are text, not numbers):

name : category : cost : ingredients

x : xx1 : 15 : xxx1, xxx2, xxx3

y : yy1 : 14 : yyy1, yyy2, yyy3

z : xx1 : 12 : xxx1, xxy1

To train Mahout on this data set, what is the right way to convert it into the numeric format (a CSV Boolean data set) that Mahout accepts?

There is 1 best solution below

With Mahout v1, the data can be encoded as a delimited text (CSV-style) file with one item per line:

name<tab>category-ID<space>cost-range-ID<space>ingredient-ID1<space>ingredient-ID2<space>etc...

All IDs are strings, so you may want to assign IDs to cost ranges instead of using the actual cost as a numeric value. Also make sure that no ID can appear in more than one column, so that cost-range IDs are distinct from ingredient IDs and category IDs.
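As a rough illustration of that encoding, here is a minimal Python sketch that turns the question's `name : category : cost : ingredients` records into the tab/space-delimited lines described above. The `cat_`, `cost_` and `ing_` prefixes, the bucket width of 5, and the helper names are all assumptions made for this example, not anything Mahout requires; it also assumes costs are whole numbers and names contain no colons.

```python
def make_id(prefix, value):
    # Prefix by column so IDs from different columns can never collide,
    # and replace inner whitespace because space is the token delimiter.
    return prefix + "_" + "_".join(value.split())


def cost_range_id(cost, width=5):
    # Bucket the numeric cost into a range label, e.g. 12 -> "cost_10_14".
    # The bucket width is a hypothetical choice; pick one that suits the data.
    low = (cost // width) * width
    return f"cost_{low}_{low + width - 1}"


def encode_line(line, width=5):
    # Split "name : category : cost : ingredients" into its four fields.
    name, category, cost, ingredients = [p.strip() for p in line.split(":", 3)]
    tokens = [make_id("cat", category), cost_range_id(int(cost), width)]
    tokens += [make_id("ing", i) for i in ingredients.split(",") if i.strip()]
    # Row key, then a tab, then space-separated IDs.
    return name + "\t" + " ".join(tokens)


raw = [
    "x : xx1 : 15 : xxx1, xxx2, xxx3",
    "y : yy1 : 14 : yyy1, yyy2, yyy3",
    "z : xx1 : 12 : xxx1, xxy1",
]
encoded = [encode_line(line) for line in raw]
for row in encoded:
    print(row)
```

With these assumptions, items y and z end up sharing the token `cost_10_14`, which is what lets a similar price range contribute to similarity alongside shared categories and ingredients.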

Run mahout spark-rowsimilarity on this data and you'll get back files of the form:

name<tab>name1:strength<space>name2:strength<space>etc...

Each line lists the items most similar to the named item. The list is sorted by strength, and the strength is the LLR (log-likelihood ratio) score for how similar the two items are.
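To use the results elsewhere, an output line in that form can be read back with a few lines of Python. This is only a sketch: the function name is hypothetical, and the sample line uses made-up strengths purely to show the shape of the data, not real Mahout output.

```python
def parse_similarity_line(line):
    # The row key comes before the first tab; the rest is space-separated pairs.
    name, _, rest = line.rstrip("\n").partition("\t")
    similar = []
    for token in rest.split():
        # Split on the last colon so an item name containing ':' stays intact.
        other, _, strength = token.rpartition(":")
        similar.append((other, float(strength)))
    return name, similar


# Illustrative input only; the strengths are invented for this example.
sample = "x\tz:4.5 y:1.25"
name, similar = parse_similarity_line(sample)
print(name, similar)
```

Each parsed entry is an `(item, strength)` pair kept in the order it appeared on the line.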

Docs here: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html