I have the following code snippet using Python 3.10 and Pandas inside a class method (not `__init__`, since I noticed this could lead to problems):
```python
self.features = self.features.groupby(["token", "feature"], as_index=False).size() \
    .rename(columns={"size": "freq"})
```
My `self.features` DataFrame is very large, since I am processing a lot of textual data/documents. It also contains elements of custom classes that are not easily pickleable (I try to use dill whenever I can; e.g. for other parallelized tasks I used pathos instead of the standard multiprocessing module).
Are there any ways of parallelizing the processing of `.groupby(...).size()`? I know there are a few parallelization libraries for Pandas, but they often rely on `.apply()`, which I know is very slow.
`groupby(...).size()` can be replaced by `value_counts`, which is quite a bit faster. Parallelizing won't be of much help, as the limiting step (building the groups) cannot be parallelized.
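A minimal sketch of that replacement, assuming `self.features` is a plain DataFrame with `token` and `feature` columns (the `freq` column name mirrors your rename):

```python
# DataFrame.value_counts (pandas >= 1.1) counts unique row combinations,
# which is equivalent to groupby(["token", "feature"]).size() here.
self.features = (
    self.features
    .value_counts(["token", "feature"], sort=False)  # Series with a (token, feature) MultiIndex
    .reset_index(name="freq")                        # back to a flat DataFrame with a "freq" column
)
```

Note that `value_counts` drops NaN keys by default (matching `groupby`'s default), and without `sort=False` it would order rows by descending count rather than by group key, so pass `sort=False` if the original ordering matters to you.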