Is it possible to use Series.str.extract with Dask?

I'm currently processing a large dataset with Pandas and I have to extract some data using pandas.Series.str.extract. It looks like this:

df['output_col'] = df['input_col'].str.extract(r'.*"mytag": "(.*?)"', expand=False).str.upper()

It works well; however, as this has to be done about ten times (on various source columns), the performance isn't very good. To improve it by using several cores, I wanted to try Dask, but extraction doesn't seem to be supported (I cannot find any reference to an extract method in Dask's documentation).

Is there any way to perform such a pandas operation in parallel? I have found this method where you basically split your dataframe into multiple ones, create a process per subframe, and then concatenate them back.
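
For reference, a minimal sketch of that split/process/concatenate approach (the chunk count and pool size are illustrative; df and the column names are those from the snippet above):

import numpy as np
import pandas as pd
from multiprocessing import Pool

def extract_tag(chunk):
    # same extraction as above, applied to one chunk of the dataframe
    chunk['output_col'] = chunk['input_col'].str.extract(
        r'.*"mytag": "(.*?)"', expand=False).str.upper()
    return chunk

if __name__ == '__main__':
    chunks = np.array_split(df, 4)   # split into 4 subframes
    with Pool(4) as pool:            # one worker process per subframe
        df = pd.concat(pool.map(extract_tag, chunks))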

2 Answers

You should be able to do this just as you would in pandas. It's mentioned in this segment of the documentation, but it might be valuable to expand on it.

import pandas as pd
import dask.dataframe as dd

s = pd.Series(["example", "strings", "are useful"])
ds = dd.from_pandas(s, npartitions=2)
ds.str.extract(r"[a-z\s]{4}(.{2})", expand=False).str.upper().compute()

which returns:

0    PL
1    NG
2    US
dtype: object
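
Applied to the pattern from the question, the same call should work unchanged (assuming ddf is a dask dataframe with an input_col column, e.g. built with dd.from_pandas):

ddf['output_col'] = ddf['input_col'].str.extract(
    r'.*"mytag": "(.*?)"', expand=False).str.upper()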

Your best bet is to use map_partitions, which lets you apply general pandas operations to the partitions of your dataframe, and acts like a managed version of the multiprocessing method you linked.

def inner(df):
    # each partition is a plain pandas DataFrame, so the usual
    # str.extract works unchanged
    df['output_col'] = df['input_col'].str.extract(
        r'.*"mytag": "(.*?)"', expand=False).str.upper()
    return df

out = df.map_partitions(inner)

Since this is a string operation, you probably want processes (e.g., the distributed scheduler) rather than threads. Note that your performance will be far better if you load your data with dask (e.g., dd.read_csv) rather than creating the dataframe in memory and then passing it to dask.
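
A minimal sketch putting these together (the file pattern is hypothetical; inner is the function defined above):

import dask.dataframe as dd

# loading with dask gives you partitions (and parallel reads) for free
ddf = dd.read_csv('data-*.csv')   # hypothetical file pattern
out = ddf.map_partitions(inner)

# regex extraction is GIL-bound Python work, so a process-based
# scheduler parallelizes better than the default threads
result = out.compute(scheduler='processes')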