I have a polars DataFrame like df, and I want to calculate the cross product of 'A' and 'B' in each row. The expected result is df2, built step by step below. How can I do this efficiently with polars?
Step 1:
import polars as pl
import itertools
df = pl.DataFrame({
    'A': [[1, 1], [2, 2]],
    'B': [[3, 4], [5, 6]],
})
print("df:", df)
out:
df: shape: (2, 2)
┌───────────┬───────────┐
│ A         ┆ B         │
│ ---       ┆ ---       │
│ list[i64] ┆ list[i64] │
╞═══════════╪═══════════╡
│ [1, 1]    ┆ [3, 4]    │
│ [2, 2]    ┆ [5, 6]    │
└───────────┴───────────┘
Step 2:
row1_prod = list(itertools.product([1, 1], [3, 4]))
row2_prod = list(itertools.product([2, 2], [5, 6]))
print("row1_prod: ", row1_prod)
print("row2_prod: ", row2_prod)
out:
row1_prod: [(1, 3), (1, 4), (1, 3), (1, 4)]
row2_prod: [(2, 5), (2, 6), (2, 5), (2, 6)]
Step 3:
C_row1 = [hash(e) % 100 for e in row1_prod]
C_row2 = [hash(e) % 100 for e in row2_prod]
print("C_row1:", C_row1)
print("C_row2:", C_row2)
out:
C_row1: [80, 14, 80, 14]
C_row2: [75, 72, 75, 72]
Step 4:
df2 = df.with_columns(
    pl.Series("C", [C_row1, C_row2])
)
print("df2:", df2)
out:
df2: shape: (2, 3)
┌───────────┬───────────┬────────────────┐
│ A         ┆ B         ┆ C              │
│ ---       ┆ ---       ┆ ---            │
│ list[i64] ┆ list[i64] ┆ list[i64]      │
╞═══════════╪═══════════╪════════════════╡
│ [1, 1]    ┆ [3, 4]    ┆ [80, 14, … 14] │
│ [2, 2]    ┆ [5, 6]    ┆ [75, 72, … 72] │
└───────────┴───────────┴────────────────┘
I tried to use the apply method in polars, but it's more than 20x slower. How can I speed it up? Thanks in advance.
As per the updated "jagged" example from the comments, this looks equivalent to running .explode() on each column individually.

We can then combine A and B into a single "item" with pl.struct() and .hash() the result.

Follow that with .group_by().agg() to get back a single list per row.

As we have used maintain_order=True in the .group_by(), the row order is guaranteed, meaning we can assign the result directly via .with_columns().

Note: .groupby() was renamed to .group_by() in 0.19.0.