How to groupBy on two columns and work out the average and total values for each grouped column using PySpark


I have the following DataFrame and, using PySpark, I'm trying to get the following answers:

  1. Total Fare by Pick
  2. Total Tip by Pick
  3. Avg Drag by Pick
  4. Avg Drag by Drop

Pick Drop Fare Tip Drag
1 1 4.00 4.00 1.00
1 2 5.00 10.00 8.00
1 2 5.00 15.00 12.00
3 2 11.00 12.00 17.00
3 5 41.00 25.00 13.00
4 6 50.00 70.00 2.00

My query so far looks like this:

from pyspark.sql import functions as func
from pyspark.sql.functions import desc

data = [
    (1, 1, 4.00, 4.00, 1.00),
    (1, 2, 5.00, 10.00, 8.00),
    (1, 2, 5.00, 15.00, 12.00),
    (3, 2, 11.00, 12.00, 17.00),
    (3, 5, 41.00, 25.00, 13.00),
    (4, 6, 50.00, 70.00, 2.00)
]

columns = ["Pick", "Drop", "Fare", "Tip", "Drag"]
df = spark.createDataFrame(data, columns)


df.groupBy('Pick', 'Drop') \
    .agg(
        func.sum('Fare').alias('FarePick'),
        func.sum('Tip').alias('TipPick'),
        func.avg('Drag').alias('AvgDragPick'),
        func.avg('Drag').alias('AvgDragDrop')) \
    .orderBy('Pick').show()

However, I don't think this is correct. I'm a bit stuck on (4) because the groupBy doesn't seem right. Can anyone suggest a correction? The output needs to be in one (1) table, such as:

Pick Drop FarePick TipPick AvgDragPick AvgDragDrop

There are 2 solutions below.

Samuel Demir (Best Answer)

If you want to keep all the columns together in a single table, just use window functions.

from pyspark.sql import functions as f
from pyspark.sql import Window

data = [
    (1, 1, 4.00, 4.00, 1.00),
    (1, 2, 5.00, 10.00, 8.00),
    (1, 2, 5.00, 15.00, 12.00),
    (3, 2, 11.00, 12.00, 17.00),
    (3, 5, 41.00, 25.00, 13.00),
    (4, 6, 50.00, 70.00, 2.00)
]

columns = ["Pick", "Drop", "Fare", "Tip", "Drag"]
df = spark.createDataFrame(data, columns)

df_new = (
    df
    .withColumn("TotalFarePick", f.sum("Fare").over(Window.partitionBy("Pick")))
    .withColumn("TotalTipPick", f.sum("Tip").over(Window.partitionBy("Pick")))
    .withColumn("AvgDragPick", f.avg("Drag").over(Window.partitionBy("Pick")))
    .withColumn("AvgDragDrop", f.avg("Drag").over(Window.partitionBy("Drop")))
    .drop("Fare", "Tip", "Drag")
)

df_new.show()

Also, please don't use \ for line continuation; PEP 8 recommends implied line continuation inside parentheses instead.

See https://peps.python.org/pep-0008/ :

The preferred way of wrapping long lines is by using Python’s implied line continuation inside parentheses, brackets and braces. Long lines can be broken over multiple lines by wrapping expressions in parentheses. These should be used in preference to using a backslash for line continuation.
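For example, the same aggregation can be written with implied line continuation inside parentheses instead of trailing backslashes. A small sketch reusing the df and f alias defined above (the totals name is just illustrative):

# trailing backslashes (works, but discouraged by PEP 8)
totals = df.groupBy("Pick") \
    .agg(f.sum("Fare").alias("TotalFarePick")) \
    .orderBy("Pick")

# implied line continuation inside parentheses (preferred)
totals = (
    df.groupBy("Pick")
    .agg(f.sum("Fare").alias("TotalFarePick"))
    .orderBy("Pick")
)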

meshkati

I added your table data into a data variable and separated out the 4th step.

from pyspark.sql import SparkSession
from pyspark.sql import functions as func

spark = SparkSession.builder \
    .appName("testSession") \
    .getOrCreate()

data = [
    (1, 1, 4.00, 4.00, 1.00),
    (1, 2, 5.00, 10.00, 8.00),
    (1, 2, 5.00, 15.00, 12.00),
    (3, 2, 11.00, 12.00, 17.00),
    (3, 5, 41.00, 25.00, 13.00),
    (4, 6, 50.00, 70.00, 2.00)
]

columns = ["Pick", "Drop", "Fare", "Tip", "Drag"]
df = spark.createDataFrame(data, columns)

# 1, 2 and 3: totals of Fare and Tip and average Drag, grouped by Pick
df.groupBy('Pick').agg(
    func.sum('Fare').alias('TotalFarePick'),
    func.sum('Tip').alias('TotalTipPick'),
    func.avg('Drag').alias('AvgDragPick')
).orderBy('Pick').show()

# 4: average Drag, grouped by Drop
df.groupBy('Drop').agg(
    func.avg('Drag').alias('AvgDragDrop')
).orderBy('Drop').show()

spark.stop()

The output of the two tables:

+----+-------------+------------+-----------+
|Pick|TotalFarePick|TotalTipPick|AvgDragPick|
+----+-------------+------------+-----------+
|   1|         14.0|        29.0|        7.0|
|   3|         52.0|        37.0|       15.0|
|   4|         50.0|        70.0|        2.0|
+----+-------------+------------+-----------+

+----+------------------+
|Drop|       AvgDragDrop|
+----+------------------+
|   1|               1.0|
|   2|12.333333333333334|
|   5|              13.0|
|   6|               2.0|
+----+------------------+
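If you need everything in a single table, as the question asks, one option is to join the two aggregated DataFrames back onto the distinct (Pick, Drop) pairs. A minimal sketch, assuming the df and func names from the code above and run before spark.stop() (the pick_agg, drop_agg and combined names are just illustrative):

# totals and average per Pick (steps 1-3)
pick_agg = df.groupBy('Pick').agg(
    func.sum('Fare').alias('TotalFarePick'),
    func.sum('Tip').alias('TotalTipPick'),
    func.avg('Drag').alias('AvgDragPick')
)

# average Drag per Drop (step 4)
drop_agg = df.groupBy('Drop').agg(
    func.avg('Drag').alias('AvgDragDrop')
)

# one row per distinct (Pick, Drop) pair, with both sets of aggregates attached
combined = (
    df.select('Pick', 'Drop').distinct()
    .join(pick_agg, on='Pick', how='left')
    .join(drop_agg, on='Drop', how='left')
    .orderBy('Pick', 'Drop')
)

combined.show()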