transition matrix from pyspark dataframe

223 Views Asked by At

I have two columns (such as):

from to
1 2
1 3
2 4
4 2
4 2
4 3
3 3

And I want to create a transition matrix (where sum of rows in a columns add up to 1):

         1.      2.     3.   4.

1.      0        0      0    0
2.      0.5*     0      0    2/3
3.      0.5      0.5    1    1/3
4.      0        0.5    0    0

where 1 -> 2 would be : (the number of times 1 (in 'from') is next to 2 (in 'to)) / (total times 1 points to any value).

1

There are 1 best solutions below

4
On BEST ANSWER

You can create this kind of transition matrix using a window and pivot.

First some dummy data:

import pandas as pd
import numpy as np

np.random.seed(42)
x = np.random.randint(1,5,100)
y = np.random.randint(1,5,100)

df = spark.createDataFrame(pd.DataFrame({'from': x, 'to': y}))
df.show()
+----+---+
|from| to|
+----+---+
|   3|  3|
|   4|  2|
|   1|  2|
...

To create a pct column, first group the data by unique combinations of from/to and get the counts. With that aggregated dataframe, create a new column, pct that uses the Window to find the total number of records for each from group which is used as the denominator.

Lastly, pivot the table to make the to values columns and the pct data the values of the matrix.

from pyspark.sql import functions as F, Window

w = Window().partitionBy('from')
grp = df.groupBy('from', 'to').count().withColumn('pct', F.col('count') / F.sum('count').over(w))

res = grp.groupBy('from').pivot('to').agg(F.round(F.first('pct'), 2))
res.show()
+----+----+----+----+----+
|from|   1|   2|   3|   4|
+----+----+----+----+----+
|   1| 0.2| 0.2|0.25|0.35|
|   2|0.27|0.31|0.19|0.23|
|   3|0.46|0.17|0.21|0.17|
|   4|0.13|0.13| 0.5|0.23|
+----+----+----+----+----+