I'm building a recommendation engine using collaborative filtering. For similarity scores, I use Pearson correlation. This works well most of the time, but sometimes I have users that only share 1 or 2 fields. For example:
User 1{
a: 4
b: 2
}
User 2{
a: 4
b: 3
}
Since this is only 2 data points, the Pearson correlation will always be ±1 (two points always lie on a straight line, i.e. perfect correlation). This obviously isn't what I want, so what value should I use instead? I could just throw away all instances like this (give them a correlation of 0), but my data is really sparse right now and I don't want to lose anything. Is there any similarity score I could use that would fit in with the rest of my similarity scores (all Pearson)?
You might want to consider using cosine similarity rather than Pearson correlation. It does not suffer from this problem, and is widely used in the recommender systems literature.

The canonical solution to this, described by Herlocker et al. in "Empirical Analysis of Design Choices in Neighborhood-based Collaborative Filtering Algorithms", is to "damp" the Pearson correlation to correct for excessively high correlation between users with small co-rating sets. Basically, you multiply the Pearson correlation by the lesser of 1 and cc/50, where cc is the number of items both users have rated. The effect is that, if they have at least 50 items in common, the similarity is the raw Pearson correlation; otherwise, it is scaled down linearly with the number of items they have rated in common. That turns the spurious correlation of 1 in your example into a similarity of 2/50 = 0.04.
50 may need to be adapted based on your domain and system.
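A minimal sketch of that damping in Python (the function name, the dict-of-ratings representation, and the default threshold of 50 are illustrative assumptions, not taken from the answer above):

    import numpy as np

    def damped_pearson(ratings_u, ratings_v, damping=50):
        # Pearson correlation over co-rated items, scaled by min(1, cc/damping).
        # ratings_u / ratings_v are dicts of item id -> rating; these names and
        # the damping default are assumptions for illustration.
        common = sorted(set(ratings_u) & set(ratings_v))
        cc = len(common)
        if cc < 2:
            return 0.0  # Pearson is undefined with fewer than 2 co-rated items

        u = np.array([ratings_u[i] for i in common], dtype=float)
        v = np.array([ratings_v[i] for i in common], dtype=float)
        if u.std() == 0 or v.std() == 0:
            return 0.0  # a user who rated everything the same has no variance

        pearson = np.corrcoef(u, v)[0, 1]
        return pearson * min(1.0, cc / damping)

    # The two-item example from the question: raw Pearson is 1.0, but with only
    # 2 co-rated items the damped similarity is 1.0 * 2/50 = 0.04.
    user1 = {"a": 4, "b": 2}
    user2 = {"a": 4, "b": 3}
    print(damped_pearson(user1, user2))  # ~0.04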
You can also use cosine similarity, which does not suffer from this limitation in the same way.
For user-user CF, however, Pearson correlation is generally preferred.

Update: In more recent work, we found that cosine similarity was prematurely dismissed for user-based CF. Cosine similarity, when performed on normalized data (subtract the user's mean from each rating prior to computing cosine similarity; the result is very similar to Pearson correlation, except that it has a built-in self-damping term), outperforms Pearson in a "standard" environment. Of course, if possible, you should do some testing on your own data and environment to see what works best. Paper here: http://grouplens.org/node/479
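Here is a rough sketch of that mean-centered cosine in the same style. Treating unrated items as 0 after centering means each user's full rating vector contributes to the denominator, which is where the built-in self-damping comes from; the function and variable names are my own, not from the paper:

    import numpy as np

    def mean_centered_cosine(ratings_u, ratings_v):
        # Subtract each user's mean rating, then take cosine similarity with
        # unrated items treated as 0. Only co-rated items contribute to the
        # numerator, but the norms cover each user's full rating vector, so a
        # tiny overlap is automatically damped.
        mu_u = np.mean(list(ratings_u.values()))
        mu_v = np.mean(list(ratings_v.values()))

        common = set(ratings_u) & set(ratings_v)
        num = sum((ratings_u[i] - mu_u) * (ratings_v[i] - mu_v) for i in common)

        norm_u = np.sqrt(sum((r - mu_u) ** 2 for r in ratings_u.values()))
        norm_v = np.sqrt(sum((r - mu_v) ** 2 for r in ratings_v.values()))
        if norm_u == 0 or norm_v == 0:
            return 0.0
        return num / (norm_u * norm_v)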
Disclaimer: I'm a student in the lab that produced the above-mentioned Herlocker paper.