I'm new to Python, and I'm trying to calculate the PageRank vector in Python according to this equation:

Pi(k) = Pi(k-1) * G = alpha * Pi(k-1) * H + (alpha * Pi(k-1) * a + 1 - alpha) * e^T / n

where Pi(k) is the PageRank vector after the k-th iteration, G is the Google matrix, H is the hyperlink matrix, a is the dangling-node vector, alpha = 0.85, e is a vector of ones, and n is the number of pages.
The calculation with G takes a lot of time, while using the hyperlink matrix H, which is a sparse matrix, should take significantly less time.
Here's my code:
for i in range(1, k_steps+1):
    for j in range(0, len(dictionary_urls)):
        for k in range(0, len(dictionary_urls)):
            if matrix_H[k][j] != 0:
                matrix_pi_k[i][j] += matrix_pi_k[i-1][k] * float(matrix_H[k][j])
            alpha_pi_k_a += matrix_pi_k[i-1][k] * float(vector_a[k])
        alpha_pi_k_a = alpha_pi_k_a * float(alpha)
        alpha_pi_k_a = alpha_pi_k_a + float(1 - alpha)
        alpha_pi_k_a = alpha_pi_k_a / float(len(dictionary_urls))
        matrix_pi_k[i][j] = matrix_pi_k[i][j] * float(alpha)
        matrix_pi_k[i][j] = matrix_pi_k[i][j] + float(alpha_pi_k_a)
        alpha_pi_k_a = 0
k_steps is the number of iterations needed.
dictionary_urls contains all the URLs.
After the code runs, matrix_pi_k should contain all the Pi vectors, one per iteration.
I have already calculated all the variables needed. The run time using the H matrix is almost equal to the run time using the G matrix, although in theory it should be different.
Why? And what should I change to reduce the run time?
Thank you.
The problem is that you're multiplying a sparse matrix by a dense vector using the same old dense matrix-vector multiplication algorithm. You won't see any speedups that way.
Suppose you have an n x n matrix A (dense or sparse) and an n-vector x. To compute y = Ax, we can write:
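A sketch in plain Python, assuming A is stored as a list of lists and x and y are plain length-n lists:

# y = A x with a dense double loop: every entry of every row is visited
y = [0.0] * n
for i in range(n):
    for j in range(n):
        y[i] += A[i][j] * x[j]
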
This works whether the matrix A is dense or sparse. Suppose, though, that A is sparse. We still loop over all columns of A to compute a single entry of y, even though most of the entries will be zero. So the outer loop goes through n iterations, and the inner loop also goes through n iterations.

If we know which entries of A are nonzero, we can do much better.
Suppose we have a list of all of the nonzero entries of row i, call it nonzero[i]. Then we can replace the inner loop with iteration over this list:
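Again as a sketch, where nonzero[i] is assumed to hold the column indices of the nonzero entries in row i:

# Same product, but each row only visits its nonzero columns
y = [0.0] * n
for i in range(n):
    for j in nonzero[i]:
        y[i] += A[i][j] * x[j]
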
So while our outer loop does n iterations, the inner loop only does as many iterations as there are nonzero entries. This is where the speedup in sparse matrix-vector multiplication comes from.
Use numpy!

But you have another problem: you're trying to do matrix multiplication with pure Python, which (due to type-checking, non-contiguous data structures, etc.) is slow. The solution is to use numpy, which provides fast algorithms and data structures. Then you can use scipy's sparse matrices, as they implement fast sparse matrix-vector multiplication for you.
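For your particular update, the whole triple loop collapses to a few lines. The following is only a sketch: it reuses the names from your code (matrix_H, vector_a, alpha, k_steps, dictionary_urls) and assumes matrix_H[k][j] holds the weight from page k to page j, as your loops do:

import numpy as np
import scipy.sparse

n = len(dictionary_urls)

H = scipy.sparse.csr_matrix(matrix_H)  # sparse hyperlink matrix
HT = H.T.tocsr()                       # transpose once, so pi * H becomes HT.dot(pi)
a = np.asarray(vector_a, dtype=float)  # dangling-node vector

pi = np.full(n, 1.0 / n)               # Pi(0): uniform start
for _ in range(k_steps):
    dangling = alpha * pi.dot(a)       # alpha * Pi(k-1) * a
    # Pi(k) = alpha * Pi(k-1) * H + (alpha * Pi(k-1) * a + 1 - alpha) * e^T / n
    pi = alpha * HT.dot(pi) + (dangling + (1 - alpha)) / n

This keeps only the latest vector; if you need every Pi(k), append pi to a list inside the loop.
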
Experiment
We can show all of this with a quick experiment. First we'll generate a 10,000 x 10,000 dense matrix A:
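For example, with numpy's random generator (the exact values don't matter):

import numpy as np

A = np.random.random(size=(10000, 10000))  # 10,000 x 10,000 dense matrix
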
Then we'll make a sparse matrix B by thresholding A. B is the same size as A, but only 10% of its entries are nonzero:
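One way to do the thresholding is to keep only the entries below 0.1 (roughly 10% of them, since A is uniform on [0, 1)) and zero out the rest:

B = np.where(A < 0.1, A, 0)  # same shape as A, about 90% zeros
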
Now we'll make a dense vector to multiply A and B with:
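Again, a random one will do:

x = np.random.random(size=(10000,))
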
It takes the same amount of time to compute Ax as it does to compute Bx, even though B is "sparse".
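You can check this with the timeit module (the exact numbers depend on your machine, but the two should be close):

from timeit import timeit

t_dense_A = timeit(lambda: A.dot(x), number=10)  # dense product with A
t_dense_B = timeit(lambda: B.dot(x), number=10)  # dense product with B: same algorithm
print(t_dense_A, t_dense_B)
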
Of course, it isn't really sparse: it's stored as a dense matrix with a lot of zero entries. Let's make it sparse:
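Converting it to CSR (compressed sparse row) format, for instance, and timing the product again:

import scipy.sparse

B_sparse = scipy.sparse.csr_matrix(B)  # stores only the nonzero entries
t_sparse_B = timeit(lambda: B_sparse.dot(x), number=10)
print(t_sparse_B)  # noticeably smaller than the dense timings above
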
There's our speedup! Now, just for fun, what if we store A as a sparse matrix, even though it's really dense?
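Same conversion, this time applied to the fully dense A:

A_sparse = scipy.sparse.csr_matrix(A)  # "sparse" storage of a matrix with no zeros
t_sparse_A = timeit(lambda: A_sparse.dot(x), number=10)
print(t_sparse_A)  # slower than the plain dense product
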
Ouch! But this is to be expected, as storing A as a sparse matrix will incur some overhead during the multiplication.