Page Rank in Python

Question

Page Rank in Python

8.3k Views Asked by Roman Yanovitski At 04 April 2025 at 19:16

I'm new to Python, and i'm trying to calculate Page Rank vector according to this equation in Python: enter image description here

Where Pi(k) is Page-rank vector after k-Th iteration, G is the Google matrix, H is Hyperlink matrix, a is a dangling node vector, alpha = 0.85 and e is vector of ones.

The calculation with G takes a lot of time, while using the Hyperlink matrix H, which is sparse matrix, should take significantly less time.

Here's my code:

for i in range(1, k_steps+1):
  for j in range(0, len(dictionary_urls)):
    for k in range(0, len(dictionary_urls)):
        if matrix_H[k][j] != 0:
            matrix_pi_k[i][j] += matrix_pi_k[i-1][k] * float(matrix_H[k][j])
        alpha_pi_k_a += matrix_pi_k[i-1][k]*float(vector_a[k])

    alpha_pi_k_a = alpha_pi_k_a * float(alpha)
    alpha_pi_k_a = alpha_pi_k_a + float((1- alpha))
    alpha_pi_k_a = alpha_pi_k_a / float(len(dictionary_urls))
    matrix_pi_k[i][j] = matrix_pi_k[i][j] * float(alpha)

    matrix_pi_k[i][j] = matrix_pi_k[i][j] + float(alpha_pi_k_a)
    alpha_pi_k_a = 0

k_steps is the number of iteration needed.

dictionary_links contains all the URLs.

After code execution, matrix_pi_k should have all the Pi vector's

I calculated all the variables that needed. I got a run time using H matrix is almost equal to run time using G matrix, although, in theory it should be different.

Why? And what should I change to reduce the run time?

Thank you.

Original Q&A

There are 2 best solutions below

Jivan On 22 December 2014 at 18:06

Based on your formula, calculation of matrix H doesn't look faster than for matrix G.

Explanation:

You might want to take a look at an introduction to Big O notation.

The right-most part (after the +) in the formula only consists in a simple calculation without loops, and its Big O notation is just O(1). Which means, it does not depend on the number of urls you are taking into account.

Whereas calculations both for H and G seem to be at least O(n^2) (n being the number of urls).

Edit:

In the deep nested part of your code, you have two instructions, one of them being conditioned upon whether matrix_H[k][j] is 0 or not. Still, if it is 0, which will be the case most of the time if H is a sparse matrix, the second instruction will be executed however. Plus, you enter the loop anyway.

That still gives you a complexity of O(n^2), thus parsing H is not (much) faster than parsing G.

**jme** · Accepted Answer

The problem is that you're multiplying a sparse matrix by a dense vector using the same-old dense matrix-vector multiplication algorithm. You won't see any speedups that way.

Suppose you have an nxn matrix A (dense or sparse) and an n-vector x. To compute y = Ax, we can write:

y = [0]*n
for i in range(n):
    for j in range(n):
        y[i] += A[i,j]*x[j]

This works whether the matrix A is dense or sparse. Suppose, though, that A is sparse. We still loop over all columns of A to compute a single entry of y, even though most of the entries will be zero. So the outer loop goes through n iterations, and the inner loop also goes through n iterations.

If we know which entries of A are nonzero, we can do much better. Suppose we have a list of all of the nonzero entries of row i, call it nonzero[i]. Then we can replace the inner loop with iteration over this list:

y = [0]*n
for i in range(n):
    for j in nonzero[i]:
        y[i] += A[i,j]*x[j]

So while our outer loop does n iterations, the inner loop only does as many iterations as there are nonzero entries.

This is where the speedup comes with sparse matrix-vector multiplication.

Use `numpy`!

But you have another problem: you're trying to do matrix multiplication with pure Python, which (due to type-checking, non-contiguous data structures, etc.) is slow. The solution is to use numpy, which provides fast algorithms and data structures. Then you can use scipy's sparse matrices, as they implement fast sparse matrix-vector multiplication for you.

Experiment

We can show all of this with a quick experiment. First we'll generate a 10,000 x 10,000 dense matrix A:

>>> import numpy as np
>>> n = 10000
>>> A = np.random.sample((n,n))

Then we'll make a sparse matrix B by thresholding A. B is the same size as A, but only 10% of its entries are nonzero:

>>> B = np.where(A < .1, A, 0).astype(float)

Now we'll make a dense vector to multiply A and B with:

>>> x = np.random.sample(n)
>>> %timeit A.dot(x)
10 loops, best of 3: 46.7 ms per loop
>>> %timeit B.dot(x)
10 loops, best of 3: 43.7 ms per loop

It takes the same amount of time to compute Ax as it does to compute Bx, even though B is "sparse". Of course, it isn't really sparse: it's stored as a dense matrix with a lot of zero entries. Let's make it sparse:

>>> sparse_B = scipy.sparse.csr_matrix(B)
>>> 100 loops, best of 3: 12 ms per loop

There's our speedup! Now, just for fun, what if we store A as a sparse matrix, even though it's really dense?

>>> sparse_A = scipy.sparse.csr_matrix(A)
>>> %timeit sparse_A.dot(x)
10 loops, best of 3: 112 ms per loop

Ouch! But this is to be expected, as storing A as a sparse matrix will incur some overhead during the multiplication.

Page Rank in Python

There are 2 best solutions below

Use `numpy`!

Experiment

Related Questions in PYTHON

Related Questions in ALGORITHM

Related Questions in PAGERANK

Trending Questions

Popular # Hahtags

Popular Questions

Page Rank in Python

There are 2 best solutions below

Use numpy!

Experiment

Related Questions in PYTHON

Related Questions in ALGORITHM

Related Questions in PAGERANK

Trending Questions

Popular # Hahtags

Popular Questions

Use `numpy`!