How do I specify the 'meta' argument for .apply() on a Dask dataframe?


I have a Dask dataframe describing the favourite snacks of three pets: Olive, George, and Maggie. The first three columns (pet_name, species, age) repeat the same values across a pet's rows, while the fourth column, snack, has unique entries. age is a column of ints, and the rest are strings.

Input:

     pet_name  species  age           snack
0   Olive      cat    7          yogurt
1   Olive      cat    7         chicken
2  George  hamster    1      strawberry
3  George  hamster    1  sunflower seed
4  George  hamster    1        cucumber
5  Maggie      dog   12   peanut butter

I want to group by the first three columns, join the snack values of each group into a comma-separated string, sort by age, and reset the index, to get one row per pet with its favourite snacks, like so:

Expected output:

     pet_name  species  age                               snack
0  George  hamster    1  strawberry,sunflower seed,cucumber
1   Olive      cat    7                      yogurt,chicken
2  Maggie      dog   12                       peanut butter

I'm using groupby.apply(), which mostly works except that I'm getting stuck on writing Dask's meta argument.

ddf = ddf.groupby(by=group_cols)['snack'].apply(','.join, meta=??).reset_index()

I'm using Dask 2024.2.1 and Pandas 2.2.1.

Input:

# import packages
import dask
# silence the warning that recommends installing dask-expr
dask.config.set({'dataframe.query-planning-warning': False})
import dask.dataframe as dd
import pandas as pd

# create toy Pandas df
d = {'pet_name': ['Olive', 'Olive', 'George', 'George','George','Maggie'], 'species': ['cat', 'cat', 'hamster', 'hamster', 'hamster', 'dog'], 'age': [7,7,1,1,1,12], 'snack': ['yogurt', 'chicken', 'strawberry', 'sunflower seed', 'cucumber', 'peanut butter']}
df = pd.DataFrame(data=d)

# import to Dask df
ddf = dd.from_pandas(df, npartitions=3)

# group by all columns except 'snack', join each group's snacks into one comma-separated string, and reset index
group_cols = [x for x in ddf.columns if x!='snack']
ddf = ddf.groupby(by=group_cols)['snack'].apply(','.join).reset_index()

# sort by 'age'
ddf = ddf.sort_values("age")

# print result
print(ddf.compute())

Actual output:

runfile('/home/madeline/.config/spyder-py3/temp.py', wdir='/home/madeline/.config/spyder-py3')
/home/madeline/.config/spyder-py3/temp.py:24: UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
  Before: .apply(func)
  After:  .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
  or:     .apply(func, meta=('x', 'f8'))            for series result
  ddf = ddf.groupby(by=group_cols)['snack'].apply(','.join).reset_index()
     pet_name  species  age                               snack
0  George  hamster    1  cucumber,strawberry,sunflower seed
0   Olive      cat    7                      yogurt,chicken
0  Maggie      dog   12                       peanut butter

In this toy example the output is usable apart from the index being all zeroes, but I get a warning for not specifying the meta argument. How should I specify meta in this case?

Things I've tried:

Attempts One and Two:

Meta 1:

meta=pd.DataFrame({'pet_name': str, 'species': str, 'age': int, 'snack': str}, index=[0])

Meta 2:

meta={'pet_name': 'f8', 'species': 'f8', 'age': 'f8', 'snack': 'f8'}

Error from 1 and 2:

runfile('/home/madeline/.config/spyder-py3/temp.py', wdir='/home/madeline/.config/spyder-py3')
Traceback (most recent call last):

  File "/home/madeline/.config/spyder-py3/temp.py", line 26, in <module>
    ddf = ddf.sort_values("age")

ValueError: cannot insert name, already exists

Attempt Three:

Meta 3:

meta = pd.DataFrame(columns=['pet_name', 'species', 'age', 'snack'], dtype=object)

Error from 3:

runfile('/home/madeline/.config/spyder-py3/temp.py', wdir='/home/madeline/.config/spyder-py3')
Traceback (most recent call last):

  File "/home/madeline/.config/spyder-py3/temp.py", line 30, in <module>
    ddf = ddf.sort_values("age")

AttributeError: 'DataFrame' object has no attribute 'name'
There is 1 best solution below

Answer by Guillaume EB

Since you are calling apply on a Series (the grouped snack column), meta should describe a Series: pass either a pandas Series or a (name, dtype) tuple. The other catch is that reset_index() turns the result back into a DataFrame, and I didn't find a way to tell Dask what that DataFrame would contain. That is why sort_values does not work on the Dask DataFrame, and why the following code sorts in pandas after compute() instead:

# import packages
import dask
# silence the warning that recommends installing dask-expr
dask.config.set({'dataframe.query-planning-warning': False})
import dask.dataframe as dd
import pandas as pd

# create toy Pandas df
d = {'pet_name': ['Olive', 'Olive', 'George', 'George','George','Maggie'], 'species': ['cat', 'cat', 'hamster', 'hamster', 'hamster', 'dog'], 'age': [7,7,1,1,1,12], 'snack': ['yogurt', 'chicken', 'strawberry', 'sunflower seed', 'cucumber', 'peanut butter']}
df = pd.DataFrame(data=d)

# import to Dask df
ddf = dd.from_pandas(df, npartitions=3)

# group by all columns except 'snack', join each group's snacks into one comma-separated string, and reset index
group_cols = [x for x in ddf.columns if x!='snack']
ddf = ddf.groupby(by=group_cols)['snack'].apply(','.join, meta=('snack', 'object')).reset_index()

# compute to a pandas DataFrame, then sort by 'age' in pandas and print
print(ddf.compute().sort_values("age"))
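
To see why meta=('snack', 'object') fits, it can help to run the same groupby in plain pandas and look at what apply returns. The sketch below deliberately uses only pandas, not Dask, so it shows the shape that meta has to describe rather than Dask's own behaviour. The names joined, meta_tuple, meta_series and tidy are made up for this illustration. As far as I know from the Dask documentation, an empty pandas Series with the right name and dtype is accepted for meta in place of the tuple, but I have only used the tuple form above.

```python
import pandas as pd

# Same toy data as in the question.
df = pd.DataFrame({
    "pet_name": ["Olive", "Olive", "George", "George", "George", "Maggie"],
    "species": ["cat", "cat", "hamster", "hamster", "hamster", "dog"],
    "age": [7, 7, 1, 1, 1, 12],
    "snack": ["yogurt", "chicken", "strawberry", "sunflower seed",
              "cucumber", "peanut butter"],
})
group_cols = [c for c in df.columns if c != "snack"]

# In pandas, apply() on the grouped 'snack' column returns a Series named
# 'snack', indexed by the group columns. This is the object meta describes.
joined = df.groupby(group_cols)["snack"].apply(",".join)

# Two ways to spell that description (illustrative names):
# a (name, dtype) tuple, or an empty Series with the same name and dtype.
meta_tuple = ("snack", "object")
meta_series = pd.Series(name="snack", dtype="object")

# The remaining steps are plain pandas. drop=True discards the old index
# after sorting, so the rows are renumbered 0, 1, 2.
tidy = joined.reset_index().sort_values("age").reset_index(drop=True)
print(tidy)
```

The last line also covers the all-zero index from the question: once the data is in pandas, calling reset_index(drop=True) after sort_values renumbers the rows.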