I have a Dask dataframe describing the favourite snacks of three pets: Olive, George, and Maggie. The first three columns contain repeated values (one set per pet), while the fourth column, snack, has unique entries. age is a column of ints; the rest are strings.
Input:
pet_name species age snack
0 Olive cat 7 yogurt
1 Olive cat 7 chicken
2 George hamster 1 strawberry
3 George hamster 1 sunflower seed
4 George hamster 1 cucumber
5 Maggie dog 12 peanut butter
I want to group by the first three columns, join the snack entries into one comma-separated string per pet, sort by age, and reset the index, to get one row per pet with its list of favourite snacks, like so:
Expected output:
pet_name species age snack
0 George hamster 1 strawberry,sunflower seed,cucumber
1 Olive cat 7 yogurt,chicken
2 Maggie dog 12 peanut butter
I'm using groupby.apply(), which mostly works, but I'm stuck on what to pass for Dask's meta argument.
ddf = ddf.groupby(by=group_cols)['snack'].apply(','.join, meta=??).reset_index()
I'm using Dask 2024.2.1 and Pandas 2.2.1.
Input:
# import packages
import dask
# silence the warning recommending a dask-expr install
dask.config.set({'dataframe.query-planning-warning': False})
import dask.dataframe as dd
import pandas as pd
# create toy Pandas df
d = {'pet_name': ['Olive', 'Olive', 'George', 'George','George','Maggie'], 'species': ['cat', 'cat', 'hamster', 'hamster', 'hamster', 'dog'], 'age': [7,7,1,1,1,12], 'snack': ['yogurt', 'chicken', 'strawberry', 'sunflower seed', 'cucumber', 'peanut butter']}
df = pd.DataFrame(data=d)
# import to Dask df
ddf = dd.from_pandas(df, npartitions=3)
# groupby all columns except 'snack', make 'snack' into list, and reset index
group_cols = [x for x in ddf.columns if x!='snack']
ddf = ddf.groupby(by=group_cols)['snack'].apply(','.join).reset_index()
# sort by 'age'
ddf = ddf.sort_values("age")
# print result
print(ddf.compute())
Expected output:
pet_name species age snack
0 George hamster 1 strawberry,sunflower seed,cucumber
1 Olive cat 7 yogurt,chicken
2 Maggie dog 12 peanut butter
Actual output:
runfile('/home/madeline/.config/spyder-py3/temp.py', wdir='/home/madeline/.config/spyder-py3')
/home/madeline/.config/spyder-py3/temp.py:24: UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
Before: .apply(func)
After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
or: .apply(func, meta=('x', 'f8')) for series result
ddf = ddf.groupby(by=group_cols)['snack'].apply(','.join).reset_index()
pet_name species age snack
0 George hamster 1 cucumber,strawberry,sunflower seed
0 Olive cat 7 yogurt,chicken
0 Maggie dog 12 peanut butter
In this toy example, the output is correct (apart from the index being all zeroes), but I get a warning for not specifying the meta argument.
How would I specify meta in this case?
Things I've tried:
Attempts One and Two:
Meta 1:
meta=pd.DataFrame({'pet_name': str, 'species': str, 'age': int, 'snack': str}, index=[0])
Meta 2:
meta={'pet_name': 'f8', 'species': 'f8', 'age': 'f8', 'snack': 'f8'}
Error from 1 and 2:
Traceback (most recent call last):
File "/home/madeline/.config/spyder-py3/temp.py", line 26, in <module>
ddf = ddf.sort_values("age")
ValueError: cannot insert name, already exists
Attempt Three:
Meta 3:
meta = pd.DataFrame(columns=['pet_name', 'species', 'age', 'snack'], dtype=object)
Error from 3:
Traceback (most recent call last):
File "/home/madeline/.config/spyder-py3/temp.py", line 30, in <module>
ddf = ddf.sort_values("age")
AttributeError: 'DataFrame' object has no attribute 'name'
Answer:
Since you are using apply on a Series, you should use a Series or tuple object for meta. The catch is that reset_index() takes you back to a DataFrame, and I didn't find a way to tell Dask what that DataFrame will contain, so the sort_values part does not work on the Dask DataFrame in the following code: