Collaborative Filtering Program: What to do for a Pearson Score When There Isn't Enough Data


I'm building a recommendation engine using collaborative filtering. For similarity scores, I use a Pearson correlation. This works well most of the time, but sometimes I have users who share only 1 or 2 fields. For example:

User 1 {
  a: 4,
  b: 2
}

User 2 {
  a: 4,
  b: 3
}

Since there are only 2 data points, the Pearson correlation will always be ±1 (two points always lie on a straight line, i.e. perfect correlation), and it is 1 in this example. This obviously isn't what I want, so what value should I use instead? I could just throw away all instances like this (assign them a correlation of 0), but my data is really sparse right now and I don't want to lose anything. Is there any similarity score I could use that would fit in with the rest of my similarity scores (all Pearson)?
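
To make the degeneracy concrete, here is a minimal sketch in plain Python (the pearson helper and the dict-of-ratings representation are illustrative, not from any particular library):

import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length rating lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return float("nan")  # undefined: one user's shared ratings are constant
    return cov / (sx * sy)

user1 = {"a": 4, "b": 2}
user2 = {"a": 4, "b": 3}
shared = sorted(set(user1) & set(user2))
print(pearson([user1[k] for k in shared],
              [user2[k] for k in shared]))  # 1.0, regardless of actual agreement

With only two co-rated items, both centered vectors are necessarily collinear, so the result is always 1, -1, or undefined.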


There are 4 answers below.



The canonical solution to this, described by Herlocker et al. in "Empirical Analysis of Design Choices in Neighborhood-based Collaborative Filtering Algorithms", is to "damp" the Pearson correlation to correct for excessively high correlation between users with small co-rating sets. Basically, you multiply the Pearson correlation by the lesser of 1 and cc/50, where cc is the number of items both users have rated. The effect is that, if they have at least 50 items in common, the similarity is raw Pearson; otherwise, it is scaled linearly with the number of co-rated items. That turns the spurious correlation of 1 into a similarity of 0.04 (2/50).

The threshold of 50 may need to be adapted to your domain and system.
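
As a sketch of that damping (sometimes called significance weighting), reusing the toy dict-of-ratings representation from the question; the helper names and the default threshold of 50 are just for illustration:

import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def damped_pearson(user1, user2, threshold=50):
    """Pearson correlation, scaled down when the co-rating set is small."""
    shared = sorted(set(user1) & set(user2))
    if len(shared) < 2:
        return 0.0  # too little overlap to correlate at all
    r = pearson([user1[k] for k in shared], [user2[k] for k in shared])
    # Full weight only with >= threshold co-rated items; linear below that.
    return r * min(1.0, len(shared) / threshold)

# The two users from the question: raw Pearson 1.0 becomes 1.0 * 2/50 = 0.04.
print(damped_pearson({"a": 4, "b": 2}, {"a": 4, "b": 3}))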

You can also use cosine similarity, which does not suffer from this limitation in the same way. For user-user CF, however, Pearson correlation is generally preferred.

Update: in more recent work, we found that cosine similarity was prematurely dismissed for user-based CF. Cosine similarity computed on normalized data (subtract each user's mean rating before computing the cosine) is very similar to Pearson correlation, except that it has a built-in self-damping term, and it outperforms Pearson in a "standard" environment. Of course, if possible, you should do some testing on your own data and environment to see what works best. Paper here: http://grouplens.org/node/479
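
A minimal sketch of that normalized cosine, under the same toy representation; note the denominator runs over all of each user's ratings, not just the co-rated ones, which is where the self-damping comes from:

import math

def mean_centered_cosine(user1, user2):
    """Cosine similarity after subtracting each user's own mean rating."""
    m1 = sum(user1.values()) / len(user1)
    m2 = sum(user2.values()) / len(user2)
    shared = set(user1) & set(user2)
    dot = sum((user1[k] - m1) * (user2[k] - m2) for k in shared)
    # Norms over ALL ratings, so a small overlap can't dominate the score.
    n1 = math.sqrt(sum((v - m1) ** 2 for v in user1.values()))
    n2 = math.sqrt(sum((v - m2) ** 2 for v in user2.values()))
    return dot / (n1 * n2) if n1 > 0 and n2 > 0 else 0.0

# When the second user has rated other items too, the 2-item overlap self-damps:
print(mean_centered_cosine({"a": 4, "b": 2},
                           {"a": 4, "b": 3, "c": 5, "d": 1}))  # ~0.24, not 1.0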

Disclaimer: I'm a student in the lab that produced the above-mentioned Herlocker paper.


Thanks for the tips as always, Sean! I agree that LogLikelihood is the best 'default' metric to start with, because it works with both binary and non-binary rating sets and returns similarity scores in the range (0, 1).

In my experience, a metric that maps similarity scores into the range (0, 1) is important because it avoids capping during preference estimation. Without that property, your best items can get lost among hundreds of lower-scoring items that end up with the same capped score.
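
For reference, a sketch of the Dunning log-likelihood ratio over a 2x2 co-occurrence table, with the common 1 - 1/(1 + LLR) mapping into (0, 1); the count names k11..k22 and the helper names are my own labels (Mahout's LogLikelihoodSimilarity works along these lines):

import math

def entropy(*counts):
    """Unnormalized Shannon entropy: -sum(c * log(c / total))."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c * math.log(c / total) for c in counts if c > 0)

def log_likelihood_ratio(k11, k12, k21, k22):
    """k11: items both users touched; k12/k21: only one did; k22: neither."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

def llr_similarity(k11, k12, k21, k22):
    # Squash the unbounded ratio into (0, 1) so downstream scores never cap.
    return 1.0 - 1.0 / (1.0 + log_likelihood_ratio(k11, k12, k21, k22))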


Yes, Pearson correlation is commonly mentioned in recommender-engine writeups, and it works reasonably well, but it has some quirks like this. (By the way, the correlation is 1 in your example, not 0.)

Cosine similarity is indeed a good alternative. However, if you "center" the data (shift each user's ratings so their mean is 0) before computing it, and there are good reasons to do so, it becomes identical to the Pearson correlation. So you end up with the same issues, or else a different set of issues from not centering.

Consider a Euclidean distance-based similarity metric, where user ratings are viewed as points in space and similarity is inversely related to distance. It doesn't have this sparseness problem, though it needs to be normalized for dimension so that it doesn't penalize users who co-rate many items, whose distance grows simply because it accumulates along many dimensions.
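
As a sketch, with the dimension normalization done by dividing the distance by the square root of the overlap size (one reasonable choice among several; the representation is the same toy dict of ratings):

import math

def euclidean_similarity(user1, user2):
    """Distance-based similarity, normalized for the size of the overlap."""
    shared = set(user1) & set(user2)
    if not shared:
        return 0.0
    dist = math.sqrt(sum((user1[k] - user2[k]) ** 2 for k in shared))
    # Dividing by sqrt(len(shared)) keeps large overlaps from looking "far".
    return 1.0 / (1.0 + dist / math.sqrt(len(shared)))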

But really, I'd suggest you look at a log-likelihood-based similarity metric. It also doesn't have these issues, and doesn't even need rating values. This is a great default.

There are more metrics to consider that wouldn't have this issue: Spearman correlation, and Tanimoto (Jaccard coefficient)-based distance, sketched below.
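
The Tanimoto variant is the simplest of these; as a sketch, treating each user as just the set of items they have rated:

def tanimoto_similarity(user1, user2):
    """Jaccard coefficient over the sets of items each user has rated."""
    a, b = set(user1), set(user2)
    return len(a & b) / len(a | b) if (a | b) else 0.0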

Where can you learn more and find an implementation? Voilà: Apache Mahout.


I think you should calculate item similarity instead of user similarity; that way you can still recommend new items to users who have rated only a few items.
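
A sketch of that item-based approach, under the same toy dict-of-ratings representation (the helper names are illustrative; centering each item's vector by its own mean is one reasonable normalization, essentially Pearson between items):

import math

def centered_cosine(v1, v2):
    """Cosine over the shared keys of two {key: rating} dicts, mean-centered."""
    m1 = sum(v1.values()) / len(v1)
    m2 = sum(v2.values()) / len(v2)
    shared = set(v1) & set(v2)
    dot = sum((v1[k] - m1) * (v2[k] - m2) for k in shared)
    n1 = math.sqrt(sum((x - m1) ** 2 for x in v1.values()))
    n2 = math.sqrt(sum((x - m2) ** 2 for x in v2.values()))
    return dot / (n1 * n2) if n1 > 0 and n2 > 0 else 0.0

def recommend(target_user, ratings_by_user, top_n=5):
    """Item-based CF sketch. ratings_by_user: {user: {item: rating}}."""
    # Transpose to {item: {user: rating}} so items can be compared directly.
    by_item = {}
    for user, ratings in ratings_by_user.items():
        for item, r in ratings.items():
            by_item.setdefault(item, {})[user] = r

    seen = ratings_by_user[target_user]
    scores = {}
    for item, vector in by_item.items():
        if item in seen:
            continue
        num = den = 0.0
        # Score an unseen item by the user's own ratings of similar items.
        for rated, r in seen.items():
            sim = centered_cosine(vector, by_item[rated])
            if sim > 0:
                num += sim * r
                den += sim
        if den > 0:
            scores[item] = num / den
    return sorted(scores, key=scores.get, reverse=True)[:top_n]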