I want to brainstorm an idea in MATLAB with you guys. Given a matrix with many columns (14K) and few rows (7), where columns are items and rows are features of the items, I would like to compute the similarity between all pairs of items and keep it in a matrix which is:
- Easy to compute
- Easy to access
For the first point, I came up with the idea of using pdist(), which is very fast:
A % my matrix
S = pdist(A') % computes the pairwise distance (dissimilarity) between all columns, very fast
However, accessing S is not convenient. I would prefer to access the similarity between items i and j using S(i,j), e.g.:
S(4,5) % is the similarity between item 4 and 5
In its original definition, S is an array, not a matrix. Is turning it into a 2D matrix a bad idea storage-wise? Is there a neat way to quickly find which similarity corresponds to which pair of items?
Thank you.
You can use pdist2(A',A'). It returns the distance matrix in its standard form, where element (i,j) is the dissimilarity (or similarity) between the i-th and j-th patterns. Alternatively, if you want to keep using pdist(), which is fine, you can convert the resulting array into the familiar square distance matrix with the function squareform().
So, in conclusion, if A is your dataset and S the distance matrix, you can use either
S = pdist2(A',A');
or
S = squareform(pdist(A'));
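As a quick sanity check, here is a minimal sketch showing that both routes give the same square matrix. It assumes the Statistics and Machine Learning Toolbox is available and uses a small random matrix as a stand-in for the real 7-by-14K data; the variable names S1, S2, maxDiff and d45 are just for illustration.

```matlab
% Small stand-in for the real dataset (random values, illustration only)
rng(0);
A = rand(7, 5);               % 7 features (rows), 5 items (columns)

S1 = pdist2(A', A');          % full 5-by-5 Euclidean distance matrix
S2 = squareform(pdist(A'));   % condensed vector expanded to a 5-by-5 matrix

% Both results should agree up to floating-point rounding
maxDiff = max(abs(S1(:) - S2(:)));

% Distance between item 4 and item 5, accessed as S(i,j)
d45 = S2(4, 5);
```

Both calls default to the Euclidean distance here; if you pass a different metric, pass the same one to both functions.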
Now, regarding storage, you will certainly notice that such a matrix is symmetric. What MATLAB does with the array S returned by pdist() is save space: because the matrix is symmetric (and its diagonal is zero), half of it can be stored in a vector instead. Indeed, the array S has m(m-1)/2 elements, whereas the matrix form has m^2 elements (where m is the number of patterns in your training set). On the other hand, the vector is certainly trickier to access, whereas the matrix is absolutely straightforward.
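If the full matrix is too large, the vector can still be indexed directly. The sketch below is one possible way to do it, not the only one: condIdx is a hypothetical helper name, and the index formula is the one I recall from the pdist documentation, so it is worth checking against squareform() on a small case as done here.

```matlab
% Small stand-in for the real dataset (random values, illustration only)
rng(0);
A = rand(7, 6);               % 7 features (rows), 6 items (columns)
m = size(A, 2);               % number of items

D = pdist(A');                % row vector with m*(m-1)/2 distances

% Hypothetical helper: position in D of the distance between items i and j.
% Valid only for i ~= j; the distance of an item to itself is 0 and is not stored.
condIdx = @(i, j, m) (min(i, j) - 1) .* (m - min(i, j) / 2) + max(i, j) - min(i, j);

% Distance between item 4 and item 5 without building the square matrix
d45 = D(condIdx(4, 5, m));
```

For a sense of scale, taking the "14K" in the question as m = 14000 and assuming double precision, the vector holds 97,993,000 elements (about 0.78 GB), while the full matrix holds 196,000,000 elements (about 1.57 GB).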