Is it possible to use Series.str.extract with Dask?

I'm currently processing a large dataset with Pandas and I have to extract some data using pandas.Series.str.extract. It looks like this:

df['output_col'] = df['input_col'].str.extract(r'.*"mytag": "(.*?)"', expand=False).str.upper()

It works well; however, as this has to be done about ten times (on various source columns), the performance isn't very good. To improve it by using several cores, I wanted to try Dask, but extraction doesn't seem to be supported (I cannot find any reference to an extract method in Dask's documentation).

Is there any way to perform such a pandas operation in parallel? I have found this method where you basically split your dataframe into multiple ones, create a process per subframe, and then concatenate them back.
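
For reference, a minimal sketch of that split/process/concatenate approach (the chunk count and pool size are illustrative; df and the column names are those from the snippet above):

import numpy as np
import pandas as pd
from multiprocessing import Pool

def extract_tag(chunk):
    # same extraction as above, applied to one chunk of the dataframe
    chunk['output_col'] = chunk['input_col'].str.extract(
        r'.*"mytag": "(.*?)"', expand=False).str.upper()
    return chunk

if __name__ == '__main__':
    chunks = np.array_split(df, 4)   # split into 4 subframes
    with Pool(4) as pool:            # one worker process per subframe
        df = pd.concat(pool.map(extract_tag, chunks))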

2 Answers

You should be able to do this just as you would in pandas. It's mentioned in this segment of the documentation, but it might be valuable to expand on it.

import pandas as pd
import dask.dataframe as dd

s = pd.Series(["example", "strings", "are useful"])
ds = dd.from_pandas(s, npartitions=2)
ds.str.extract(r"[a-z\s]{4}(.{2})", expand=False).str.upper().compute()

which returns:

0    PL
1    NG
2    US
dtype: object
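
Applied to the pattern from the question, the same call should work unchanged (assuming ddf is a dask dataframe with an input_col column, e.g. built with dd.from_pandas):

ddf['output_col'] = ddf['input_col'].str.extract(
    r'.*"mytag": "(.*?)"', expand=False).str.upper()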

Your best bet is to use map_partitions, which lets you apply general pandas operations to the partitions of your dataframe, and acts like a managed version of the multiprocessing method you linked.

def inner(df):
    # each partition is a plain pandas DataFrame, so the usual
    # str.extract works unchanged
    df['output_col'] = df['input_col'].str.extract(
        r'.*"mytag": "(.*?)"', expand=False).str.upper()
    return df

out = df.map_partitions(inner)

Since this is a string operation, you probably want processes (e.g., the distributed scheduler) rather than threads. Note that your performance will be far better if you load your data with dask (e.g., dd.read_csv) rather than creating the dataframe in memory and then passing it to dask.
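
A minimal sketch putting these together (the file pattern is hypothetical; inner is the function defined above):

import dask.dataframe as dd

# loading with dask gives you partitions (and parallel reads) for free
ddf = dd.read_csv('data-*.csv')   # hypothetical file pattern
out = ddf.map_partitions(inner)

# regex extraction is GIL-bound Python work, so a process-based
# scheduler parallelizes better than the default threads
result = out.compute(scheduler='processes')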