We are looking to switch to Polars as our main data processing framework from DataFlow + mostly numpy. However some of transformation we apply atm are row-wise and can't be vectorised in neither numpy nor polars expressions.
Seems like the way to go here is to call map_elements
, former apply
, and transfer polars list/array into numpy to pass it into python function. This seems to be extremely slow, probably for two reasons:
- data transfer is happening
- polars use only one core
I know that all data frameworks suggest to not use apply
until it is really required and we have exactly such case here. The question is - what can we do to go beyond expectations :)?
VAEX seems to be claiming to be using multiprocessing when apply is called. There seems to be no such thing in Polars. Multiprocessing page in Polars User guide has a very basic example where the same frame is passed into spawned process and processed output is not returned back to main. While it seems like serialisation/deserialisation (I tried to arrow) to be taking too much time.
The question is how Polars users can improve applying map_elements
? Maybe there are some recent developments here, not yet documented ? Is there some recommendations on how users can efficiently pass data frame slices to and from child processes?
Thank you.
UPDT: thanks to @jqurious hint
we managed to apply multiprocessing for row-wise operations:
def parallel_apply(func, iterable, chunksize=1):
python_iterable = list(iterable.to_numpy())
with multiprocessing.get_context("spawn").Pool(processes=processes) as pool:
result = pool.imap(func, track(python_iterable), chunksize)
return pl.Series(result)
few lessons learned:
to_numpy()
is an order of magnitude faster thanto_list()
which seems to be called when iterating over polars series. converting to python iterable saved us 40% of runtime- smaller chunk size worked better for us, however this might be related to huge numpy matrices we have in every row
- to_numpy() converts series of list to array of arrays (as those arrays might be of variable size). thus some post processing might be required - like stacking