How can I improve the runtime of row-wise operations in Polars?


We are looking to switch to Polars as our main data processing framework, away from DataFlow plus mostly numpy. However, some of the transformations we apply at the moment are row-wise and can't be vectorised in either numpy or Polars expressions.

Seems like the way to go here is to call map_elements (formerly apply) and transfer the Polars list/array into numpy to pass it into a Python function. This turns out to be extremely slow, probably for two reasons:

  1. data transfer is happening
  2. Polars uses only one core

I know that all data frameworks suggest not using apply unless it is really required, and we have exactly such a case here. The question is: what can we do to go beyond expectations :)?

Vaex claims to use multiprocessing when apply is called. There seems to be no such thing in Polars. The Multiprocessing page in the Polars user guide has a very basic example where the same frame is passed into a spawned process and the processed output is not returned back to the main process. Meanwhile, serialisation/deserialisation (I tried Arrow) seems to take too much time.

The question is: how can Polars users speed up map_elements? Maybe there are some recent developments here that are not yet documented? Are there any recommendations on how to efficiently pass data frame slices to and from child processes?

Thank you.

UPDT: thanks to a hint from @jqurious, we managed to apply multiprocessing for row-wise operations:

import multiprocessing

import polars as pl
from rich.progress import track  # optional progress bar

def parallel_apply(func, iterable, processes=4, chunksize=1):
    # list(to_numpy()) is much faster than iterating the Series directly
    python_iterable = list(iterable.to_numpy())
    with multiprocessing.get_context("spawn").Pool(processes=processes) as pool:
        result = pool.imap(func, track(python_iterable), chunksize)
        return pl.Series(result)

A few lessons learned:

  • to_numpy() is an order of magnitude faster than to_list(), which seems to be called when iterating over a Polars Series; converting to a Python iterable saved us 40% of the runtime
  • a smaller chunksize worked better for us, although this might be related to the huge numpy matrices we have in every row
  • to_numpy() converts a Series of lists to an array of arrays (as those arrays might be of variable size), so some post-processing might be required, like stacking