Represent the similarity between many items in a nice manner MATLAB

I want to brainstorm an idea in MATLAB with you guys. Given a matrix with many columns (14K) and few rows (7), where columns are items and rows are features of the items, I would like to compute the similarity between all pairs of items and keep it in a matrix that is:

  1. Easy to compute
  2. Easy to access

For 1., I came up with a brilliant idea of using pdist(), which is very fast:

 % A is my matrix (7 rows x 14K columns)
 S = pdist(A');  % computes the similarity (distance) between all pairs of columns very fast

However, accessing S is not convenient. I would prefer to access the similarity between items i and j using S(i,j), e.g.:

 S(4,5)  % is the similarity between items 4 and 5

As returned by pdist(), S is a vector, not a matrix. Is turning it into a 2D matrix a bad idea storage-wise? Can we think of a cool way to quickly find which similarity corresponds to which pair of items?

Thank you.

There are 2 best solutions below

BEST ANSWER

You can use pdist2(A',A'). What is returned is essentially the distance matrix in its standard form, where element (i,j) is the dissimilarity (or similarity) between the i-th and j-th patterns.
Also, if you want to use pdist(), which is OK, you can convert the resulting vector into the well-known distance matrix by using the function squareform().

So, in conclusion, if A is your dataset and S the distance matrix, you can use either

S=pdist(A');
S=squareform(S);

or

S=pdist2(A',A');
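
As a quick sanity check, the two options can be compared on a small toy matrix. This is only a sketch: magic(7) is an arbitrary stand-in for the real data, both functions are assumed to use their default Euclidean distance, and the tolerance is an assumption to allow for floating-point rounding.

```matlab
% Toy data standing in for the real matrix: 7 features (rows) x 7 items (columns).
A = magic(7);

S1 = squareform(pdist(A'));   % condensed vector expanded to a square matrix
S2 = pdist2(A', A');          % square matrix computed directly

% Largest elementwise difference between the two results.
maxDiff = max(abs(S1(:) - S2(:)));
assert(maxDiff < 1e-9)        % tolerance is an assumption, not a documented bound
```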

Now, from the storage point of view, you will certainly notice that such a matrix is symmetric. What MATLAB essentially does with the vector returned by pdist() is save space: since the matrix is symmetric with a zero diagonal, storing only the entries below the diagonal in a vector is enough. Indeed, the vector has m(m-1)/2 elements, whereas the matrix form has m^2 elements (where m is the number of patterns in your dataset). On the other hand, accessing such a vector is certainly trickier, whereas the matrix is absolutely straightforward.
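
To put numbers on this for the 14K items in the question, here is a rough back-of-the-envelope sketch. It assumes double precision (8 bytes per element) and ignores any temporary copies MATLAB may make.

```matlab
m = 14000;                  % number of items (columns of A), from the question
bytesPerElement = 8;        % assumption: distances stored as double

nCondensed = m*(m-1)/2;     % entries in the vector returned by pdist
nFull      = m^2;           % entries in the square matrix

fprintf('condensed vector: %.2f GB\n', nCondensed*bytesPerElement/1e9);
fprintf('full matrix:      %.2f GB\n', nFull*bytesPerElement/1e9);
```

That is roughly 0.78 GB for the vector versus 1.57 GB for the square matrix, so the 2D form costs about twice the memory. Building it with squareform() also needs the vector and the matrix in memory at the same time, at least briefly.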

ANSWER

I'm not completely sure I understand your question, but if you want to access S(i, j) easily, then the function squareform is made for this:

S = squareform(pdist(A'));
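
If the square matrix turns out to be too large, one alternative is to keep the pdist vector and translate (i, j) into a position in it. The sketch below relies on the ordering pdist documents for its output, (2,1), (3,1), ..., (m,1), (3,2), ..., (m,m-1). The name condIdx is a hypothetical helper, not a built-in function, and the toy matrix stands in for the real data.

```matlab
% Toy data: 7 features (rows) x 5 items (columns), standing in for 7 x 14K.
A = magic(7);
A = A(:, 1:5);
m = size(A, 2);             % number of items

d = pdist(A');              % condensed vector with m*(m-1)/2 entries

% Hypothetical helper: position in d of the pair (i, j), valid for i ~= j.
% The order of i and j does not matter because the distance is symmetric.
condIdx = @(i, j, m) (min(i,j) - 1) .* (m - min(i,j)/2) + max(i,j) - min(i,j);

% Check against the square form on one pair.
S = squareform(d);
k = condIdx(4, 5, m);
assert(d(k) == S(4,5))
```

The diagonal is not stored in the vector, so i == j has to be handled separately (the distance of an item to itself is 0).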

Best,