I have the following code snippet using Python 3.10 and Pandas inside a class method (not `__init__`, since I noticed this could lead to problems):
```python
self.features = self.features.groupby(["token", "feature"], as_index=False).size() \
    .rename(columns={"size": "freq"})
```
My `self.features` DataFrame is very large, since I am processing a lot of textual data/documents. It also contains elements of custom classes that are not easily pickleable (I try to use dill whenever I can; e.g. for other parallelized tasks I used pathos instead of the standard multiprocessing module).
Are there any ways of parallelizing the processing of `.groupby(...).size()`? I know there are a few parallelization libraries for Pandas, but they often rely on `.apply()`, which I know is very slow.
`groupby(...).size()` can be replaced by `value_counts`, which is quite a bit faster. Parallelizing won't be of much help, as the limiting step (building the groups) cannot be parallelized.
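A minimal sketch of that replacement, assuming `self.features` is a plain DataFrame with `token` and `feature` columns (the `freq` column name mirrors your rename):

```python
# DataFrame.value_counts (pandas >= 1.1) counts unique row combinations,
# which is equivalent to groupby(["token", "feature"]).size() here.
self.features = (
    self.features
    .value_counts(["token", "feature"], sort=False)  # Series with a (token, feature) MultiIndex
    .reset_index(name="freq")                        # back to a flat DataFrame with a "freq" column
)
```

Note that `value_counts` drops NaN keys by default (matching `groupby`'s default), and without `sort=False` it would order rows by descending count rather than by group key, so pass `sort=False` if the original ordering matters to you.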