Interpretation of cosine similarity and Jaccard similarity (similarity of histograms)

Introduction

I would like to assess the similarity between two "bin counts" arrays (from two histograms) using the MATLAB "pdist2" function:

% Input
bin_counts_a = [689   430   311   135    66    67    99    23    37    19     8     4     3     4     1     3     1     0     0     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0     0     1     0     0     0     0     1];
bin_counts_b = [569   402   200   166   262    90    50    16    33    12     6    35    49     4    12     8     8     2     1     0     0     0     0     1     0     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     2     0     0     0     0     0     0     1];

% Visualize the two "bin counts" vectors as bars:
bar(1:length(bin_counts_a),[bin_counts_a;bin_counts_b])

[Figure: grouped bar chart of bin_counts_a and bin_counts_b]

% Calculation of similarities
cosine_similarity  = 1 - pdist2(bin_counts_a,bin_counts_b,'cosine')
jaccard_similarity = 1 - pdist2(bin_counts_a,bin_counts_b,'jaccard')

% Output
cosine_similarity =

          0.95473215802008


jaccard_similarity =

        0.0769230769230769

Question

If the cosine similarity is close to 1, which means the two vectors are similar, shouldn't the Jaccard similarity be close to 1 as well?

Answer by Luis Mendo (best answer)

According to the documentation, the 'jaccard' distance is the "percentage of nonzero coordinates that differ". It only checks whether corresponding coordinates are equal or not, and ignores by how much they differ.

For instance, assume bin_counts_a as in your example and

bin_counts_b = bin_counts_a + 1;

Then

>> cosine_similarity  = 1 - pdist2(bin_counts_a,bin_counts_b,'cosine')
cosine_similarity =
   0.999971577948095

is almost 1 as expected, because the bin counts are very similar. However,

>> jaccard_similarity = 1 - pdist2(bin_counts_a,bin_counts_b,'jaccard')
jaccard_similarity =
     0

gives 0 because each entry in bin_counts_b is (slightly) different from that in bin_counts_a.
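
Both numbers from the question can be reproduced by hand, which makes the difference visible. The sketch below assumes the definitions described in the pdist2 documentation: cosine similarity as the normalized dot product, and Jaccard similarity as the fraction of exactly equal coordinates among those where at least one vector is nonzero. It uses the original vectors from the question; the variable names (a, b, either_nonzero, and so on) are illustrative only.

```matlab
% Bin counts from the question
a = [689 430 311 135 66 67 99 23 37 19 8 4 3 4 1 3 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1];
b = [569 402 200 166 262 90 50 16 33 12 6 35 49 4 12 8 8 2 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 1];

% Cosine similarity: normalized dot product, so it depends on magnitudes
cos_sim = dot(a, b) / (norm(a) * norm(b));

% Jaccard similarity (assumed definition, see lead-in): look only at
% coordinates where at least one vector is nonzero ...
either_nonzero = (a ~= 0) | (b ~= 0);
n_considered   = nnz(either_nonzero);

% ... and count how many of those hold exactly the same value
n_equal = nnz(a(either_nonzero) == b(either_nonzero));
jac_sim = n_equal / n_considered;

fprintf('cosine = %.4f, jaccard = %d/%d = %.4f\n', ...
        cos_sim, n_equal, n_considered, jac_sim);
```

With the question's data, only 2 of the 26 coordinates where either vector is nonzero hold identical counts, so jac_sim is 2/26 (about 0.0769). cos_sim is about 0.9547 because it is dominated by the large, roughly proportional counts in the first few bins.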

For assessing the similarity between the histograms, 'cosine' is probably a more meaningful option than 'jaccard'. You may also want to consider the Kullback-Leibler divergence, although it is not symmetric in the two distributions and is not one of the distances built into pdist2.
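
As a rough illustration of the Kullback-Leibler option, here is a minimal sketch that normalizes the bin counts into probability distributions and evaluates the divergence in both directions. The pseudocount used to avoid empty bins is an assumption of this sketch, not something from the answer. Without some smoothing the divergence is infinite for this data, because some bins are empty in one histogram but not in the other. The names pseudo, p, q, kl_pq, kl_qp and kl_sym are hypothetical.

```matlab
% Bin counts from the question
a = [689 430 311 135 66 67 99 23 37 19 8 4 3 4 1 3 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1];
b = [569 402 200 166 262 90 50 16 33 12 6 35 49 4 12 8 8 2 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 1];

% Assumption: add a small pseudocount so that no bin has probability zero;
% otherwise log(p ./ q) is infinite wherever q is 0 and p is not.
pseudo = 0.5;   % hypothetical value, chosen only for illustration

% Normalize the smoothed counts into probability distributions
p = (a + pseudo) / sum(a + pseudo);
q = (b + pseudo) / sum(b + pseudo);

% Kullback-Leibler divergence in both directions (natural log, in nats)
kl_pq = sum(p .* log(p ./ q));
kl_qp = sum(q .* log(q ./ p));

% One common way to symmetrize: add the two directions
kl_sym = kl_pq + kl_qp;

fprintf('KL(p||q) = %.4f, KL(q||p) = %.4f, symmetrized = %.4f\n', ...
        kl_pq, kl_qp, kl_sym);
```

The values depend on the chosen pseudocount, so they are best used to compare several histogram pairs under the same setting rather than read as absolute numbers. Unlike the similarities above, a divergence of 0 means identical distributions and larger values mean less similar ones. pdist2 also accepts a function handle as its distance argument, so a divergence like this could in principle be wrapped for use with it.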