Performing the equivalent of pd.Grouper() in the Pandas API on Spark

I've been trying to transition some of our codebases from pure pandas to the Pandas API on Spark (in Databricks), and one function I've had trouble replicating so far is pd.Grouper().

Specifically, the existing code has many situations where we have a table like the following (simplified for this example):

ds          segment  value
11-12-2023  A        1
11-13-2023  B        2
12-11-2023  A        3
12-12-2023  B        5
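
For reference, the sample table can be reproduced with something like the following (column names and the month-day-year date format are taken from the example above; pd.Grouper needs 'ds' to be datetime-like):

import pandas as pd

df = pd.DataFrame({
    'ds': ['11-12-2023', '11-13-2023', '12-11-2023', '12-12-2023'],
    'segment': ['A', 'B', 'A', 'B'],
    'value': [1, 2, 3, 5],
})
# convert 'ds' to datetime so pd.Grouper can bucket it by frequency
df['ds'] = pd.to_datetime(df['ds'], format='%m-%d-%Y')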

We then use the following code to aggregate by segment and month:

import pandas as pd

# group by segment and by month-end buckets of the 'ds' datetime column
monthly = df.groupby(['segment', pd.Grouper(key='ds', freq='M')]).sum()
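
With the sample data above, monthly should come out with one row per (segment, month-end) pair, e.g. ('A', 2023-11-30) = 1, ('B', 2023-11-30) = 2, ('A', 2023-12-31) = 3 and ('B', 2023-12-31) = 5, since freq='M' labels each group by its month end.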

How could we accomplish the same thing, grouping on different frequencies without creating a new helper column for every frequency we want to group on? pd.Grouper supports the full list of offset aliases, and we actively rely on several of them.
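
For concreteness, the helper-column workaround we are trying to avoid looks roughly like this (a sketch only: psdf is the pyspark.pandas version of the sample df, the 'month' column name is made up for illustration, and it assumes Series.dt.strftime is available in pyspark.pandas):

import pyspark.pandas as ps

psdf = ps.from_pandas(df)
# one derived grouping column per frequency, which is what we want to avoid repeating
psdf['month'] = psdf['ds'].dt.strftime('%Y-%m')
monthly = psdf.groupby(['segment', 'month'])['value'].sum()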

I've tried the same pd.Grouper approach after swapping pandas for pyspark.pandas, but Grouper is not available there:

import pyspark.pandas as ps

# fails: pyspark.pandas does not provide a Grouper equivalent
monthly = psdf.groupby(['segment', ps.Grouper(key='ds', freq='M')]).sum()