I have a DataFrame like the one below:
import pandas as pd

df = pd.DataFrame({'region': [1, 1, 1, 1, 1, 1, 2, 2, 2, 3],
                   'store': ['A', 'A', 'C', 'C', 'D', 'B', 'F', 'F', 'E', 'G'],
                   'call_date': ['2022-03-10', '2022-03-09', '2022-03-08', '2022-03-07',
                                 '2022-03-06', '2022-03-06', '2022-03-10', '2022-03-09',
                                 '2022-03-08', '2022-03-09']})
df['call_date'] = pd.to_datetime(df['call_date'], format='%Y-%m-%d')
sdf = spark.createDataFrame(df)
+------+-----+-------------------+
|region|store| call_date|
+------+-----+-------------------+
| 1| A|2022-03-10 00:00:00|
| 1| A|2022-03-09 00:00:00|
| 1| C|2022-03-08 00:00:00|
| 1| C|2022-03-07 00:00:00|
| 1| D|2022-03-06 00:00:00|
| 1| B|2022-03-06 00:00:00|
| 2| F|2022-03-10 00:00:00|
| 2| F|2022-03-09 00:00:00|
| 2| E|2022-03-08 00:00:00|
| 3| G|2022-03-09 00:00:00|
+------+-----+-------------------+
I need to find, within each region, the top 3 unique stores that were called most recently. The expected output is:
+------+-----+
|region|store|
+------+-----+
| 1| A|
| 1| C|
| 1| D|
| 2| F|
| 2| E|
| 3| G|
+------+-----+
I have tried to generate a rank based on call_date within each region:

from pyspark.sql import functions as F, Window

sdf.withColumn('RANK', F.dense_rank().over(Window.partitionBy("region").orderBy(F.col('call_date').desc()))).show()
+------+-----+-------------------+----+
|region|store| call_date|RANK|
+------+-----+-------------------+----+
| 1| A|2022-03-10 00:00:00| 1|
| 1| A|2022-03-09 00:00:00| 2|
| 1| C|2022-03-08 00:00:00| 3|
| 1| C|2022-03-07 00:00:00| 4|
| 1| D|2022-03-06 00:00:00| 5|
| 1| B|2022-03-06 00:00:00| 5|
| 2| F|2022-03-10 00:00:00| 1|
| 2| F|2022-03-09 00:00:00| 2|
| 2| E|2022-03-08 00:00:00| 3|
| 3| G|2022-03-09 00:00:00| 1|
+------+-----+-------------------+----+
But the issue is: how can I get the top 3 unique stores based on this rank? Obviously I cannot just take the top 3 ranks from each region, because the same store can appear more than once. I would appreciate it if anyone can help out. I have thought about this for a long time but still have no clue.
How about dropping the duplicates before you rank? Reduce the data to one row per (region, store) first, keeping each store's most recent call, and then rank within each region.