PySpark using both aggregate and group by

Question

PySpark using both aggregate and group by

392 Views Asked by Wispers At 05 April 2021 at 00:44

Can someone help me with pyspark using both aggregate and groupby functions? I have made my data frames, and applied filters and selects to get the data I want. However, I am now stuck trying to aggregate things correctly.

Currently, my code outputs the content below:

+----------+-----------+--------------+---------------+----------+---------+
|l_orderkey|o_orderdate|o_shippriority|l_extendedprice|l_discount|      rev|
+----------+-----------+--------------+---------------+----------+---------+
|     53634| 1995-02-22|             0|       20517.44|      0.08|18876.045|
|    265539| 1995-01-25|             0|       70423.08|      0.01| 69718.85|
|    331590| 1994-12-10|             0|       46692.75|      0.03| 45291.97|
|    331590| 1994-12-10|             0|        37235.1|       0.1| 33511.59|
|    420545| 1995-03-05|             0|        75542.1|      0.04|72520.414|
|    420545| 1995-03-05|             0|         1062.0|      0.07|987.66003|
|    420545| 1995-03-05|             0|        9729.45|       0.1| 8756.505|
|    420545| 1995-03-05|             0|        15655.6|      0.04|15029.375|
|    420545| 1995-03-05|             0|         3121.3|      0.03|3027.6611|
|    420545| 1995-03-05|             0|        71723.0|      0.03| 69571.31|
|    488928| 1995-02-15|             0|        1692.77|      0.01|1675.8423|
|    488928| 1995-02-15|             0|       22017.84|      0.01|21797.662|
|    488928| 1995-02-15|             0|       57100.42|      0.04|54816.402|
|    488928| 1995-02-15|             0|        3807.76|      0.05| 3617.372|
|    488928| 1995-02-15|             0|       73332.52|      0.01|72599.195|
|    510754| 1994-12-21|             0|       41171.78|      0.09| 37466.32|
|    512422| 1994-12-26|             0|       87251.56|      0.07| 81143.95|
|    677761| 1994-12-26|             0|       60123.34|       0.0| 60123.34|
|    956646| 1995-03-07|             0|       61853.68|      0.05|58760.996|
|   1218886| 1995-02-13|             0|        24844.0|      0.01| 24595.56|
+----------+-----------+--------------+---------------+----------+---------+

I wish to apply a group by: l_orderkey and aggregate the Rev as a sum.

Here is my most recent attempt with 't' being the dataframe and F being functions from pyspark.sql "from pyspark.sql import functions as F"

(t .groupby(t.l_orderkey,t.o_orderdate, t.o_shippriority) 
 .agg(F.collect_set(sum(t.rev)), F.collect_set(t.l_orderkey)) .show())

Can someone help me know if I'm on the right track? I keep getting "Column is not iterable"

Original Q&A

There are 1 best solutions below

**ggordon** · Accepted Answer · 2021-04-05T03:41:36.003000

total_rev = t.groupby(t.l_orderkey).agg(F.sum(t.rev).alias('total_rev'))

# print /show the top results
total_rev.show()

would give you a new df with columns l_orderkey, total_rev where total_rev would store the aggregated sum of rev

You use collect_set when attempting to remove duplicates.

You are also getting Column is not iterable because you are using the built in python function sum and not the spark function F.sum

PySpark using both aggregate and group by

There are 1 best solutions below

Related Questions in APACHE-SPARK

Related Questions in PYSPARK

Related Questions in TPC

Trending Questions

Popular # Hahtags

Popular Questions