I have a numpy array of size (96341,1000). I want to find the cosine similarity of this array. The machine I'm working on is 8 vCPU 32 GB. This is my initial code. And I want this function to run faster , can control/limit the amount of memory used and keep the same result.
vec.shape
(96341,1000)
def csm(A):
num = np.dot(A, A.T)
p1 = np.sqrt(np.sum(A**2, axis=1))[:, cp.newaxis]
p2 = np.sqrt(np.sum(A**2, axis=1))[cp.newaxis, :]
result = num / (p1 * p2)
return result
cos = csm(vec)
As @mpw2 said, your numerator is 96341×96341. But so is your denominator, as is. Which means that at some point, you'll have 3× 74GB arrays in your memory (the one holding the numerator
num, the one holding temporarily the denominatorp1*p2, and the one in making holding the result).For the denominator, that is so easily solvable, that I can't see any reason no to do it
Note also that
p1andp2are the same arrays, with just a view change. So no need to compute it twiceAt last, if we want to avoid those 74 GB (still 2x74 GB with that code: we still have both num and result at one moment both in memory), we can compute by blocks
Then you can compute by block
Not only we avoid most of the redundant computing because of the symmetry already mentioned. But we never have arrays bigger than NxN in memory, but for
resitself (but that is unavoidable if that is the result you want).And, sure, we use for loops, and almost all my answer on [so] mention somewhere that
forloops are the ultimate crime in numpy. But those are ok, because if N is big enough, then the computation taken by the iterations will be negligible before the computation time in numpy functions.The last remaining problem being, as I said, the 74 GB of the result (at least, now, there is only one, not 3 as in your solution, not 2 as in my first solution). That, you can avoid in the application, depending on what you do with the result. Try to use only part of the result at a time.
For example, if the idea is find minimum value, you can compute blocks of that matrix, and minimum for each blocks, and compute the min of the min then. That is a classical
map/reducesolution somehowSo, instead of
do
Of course, I doubt what you want to do with that
csmis as simple as a min. But since you haven't said, I can only speculate that there are ways to do it, while computing csm on one block at a time.