I'm looking for an idiomatic Polars way to perform operations on vectors (List/Array) and matrices (List(List)/Array(Array)).
Polars version: 0.19.9
A small example df:
import polars as pl
df = pl.DataFrame({
    "a": [[1, 2], [3, 4]],
    "b": [[10, 20], [30, 40]],
})
- It seems that Polars does not support operations on lists/arrays:
df.with_columns(pl.col("a") + pl.col("b"))
PanicException: `add` operation not supported for dtype `list[i64]`
- However, it does support operations on structs (which surprised me):
df.with_columns((pl.col("a").list.to_struct() + pl.col("b").list.to_struct()).alias("sum"))
- For vectors we can probably use explode + group_by + join, with the downside of having to execute the join:
df = df.with_row_count('i')
c = (
    df
    .select(["a", "b", "i"])  # required to not explode other cols in frame
    .explode(['a', 'b'])
    .group_by('i')
    .agg(c=pl.col('a') + pl.col('b'))
    .select(['i', 'c'])
)
df = df.join(c, on="i")  # but now we need to join the resulting col back to the frame
- Another way to do it for vectors is explode + group_by(maintain_order=True) + hstack, which removes the need for the join:
df = df.with_row_count('i')
c = (
    df
    .select(["a", "b", "i"])  # required to not explode other cols in frame
    .explode(['a', 'b'])
    .group_by('i', maintain_order=True)  # keeps row order, which allows hstack
    .agg(c=pl.col('a') + pl.col('b'))
    .select(['c'])
)
df = df.hstack(c)
- Apply/map_elements with conversion of the lists to numpy arrays does not seem to be an option at all: Polars uses only one core when executing apply (whereas Vaex claims to be able to parallelise apply operations across multiple threads), and to my understanding no zero-copy conversion is happening:
import numpy as np
df.with_columns(
    pl.struct(["a", "b"])
    # `apply` was renamed to `map_elements` in polars 0.19
    .map_elements(lambda x: (np.array(x["a"]) + np.array(x["b"])).tolist())
    .alias("c")
)
What would be the recommendation here: which approach is considered the most idiomatic in Polars? How should I deal with cases where I have nested lists/arrays in a row/column? Doing multiple rounds of explode + group_by seems hardly manageable.
Thank you.
Your number 3 can be made a bit more concise like this:
Either this or the struct way are probably the best. You'd have to benchmark it to be sure.
You could try to make a ufunc with numba and the guvectorize decorator, but I'm not sure whether numba supports receiving a list dtype, so it might be a wild goose chase. Here's an example of numba with polars for a different application as a starting point. If you do this, please post it as an answer; I want to see it.
Using numba here won't work. Polars doesn't support converting a list to a C type for the ufunc to ingest. Even passing a Series.to_arrow() into a ufunc will error out. I think this is a limitation of ufuncs not being 100% interoperable with all Polars types, rather than a Polars shortcoming that could be fixed on the Polars side, but I could easily be mistaken here.
Another idea that does work is to round-trip it through numpy.
Another potential goose chase:
I forgot earlier that this exists. You can write your expression in Rust, compile it, and then use it as a custom vectorized expression.