DataFrame in PySpark - How to apply aggregate functions to two columns?


I'm using a DataFrame in PySpark. I have one table like Table 1 below, and I need to obtain Table 2, where:

  • num_category - the number of distinct categories for each id
  • sum(count) - the sum of the third column of Table 1 for each id

Example:

Table 1

id   |category | count 

1    |    4    |   1 
1    |    3    |   2
1    |    1    |   2
2    |    2    |   1
2    |    1    |   1

Table 2

id   |num_category| sum(count) 

1    |    3       |   5 
2    |    2       |   2
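For reference, a minimal sketch that builds Table 1 as a DataFrame (assuming an active SparkSession; the variable name data matches the code that follows):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Table 1 as a DataFrame
data = spark.createDataFrame(
    [(1, 4, 1), (1, 3, 2), (1, 1, 2), (2, 2, 1), (2, 1, 1)],
    ["id", "category", "count"],
)
data.show()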

I tried:

table1 = data.groupBy("id","category").agg(count("*"))
cat = table1.groupBy("id").agg(count("*"))
count = table1.groupBy("id").agg(func.sum("count"))
table2 = cat.join(count, cat.id == count.id)

Error:

     1 table1 = data.groupBy("id","category").agg(count("*"))
---> 2 cat = table1.groupBy("id").agg(count("*"))
     3 count = table1.groupBy("id").agg(func.sum("count"))
     4 table2 = cat.join(count, cat.id == count.id)

TypeError: 'DataFrame' object is not callable
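Two things likely go wrong here. First, the assignment count = table1.groupBy(...) rebinds the name count to a DataFrame, so when the cell is re-run, count("*") on line 2 calls that DataFrame instead of the count function, which raises the TypeError. Second, table1 only carries the row counts per (id, category), not the original count column, so func.sum("count") on line 3 would also fail. A minimal sketch of a fix, assuming pyspark.sql.functions is imported as func, keeps qualified function names and aggregates from the source DataFrame:

from pyspark.sql import functions as func

# Qualified names (func.countDistinct, func.sum) cannot be shadowed
# by local variables; `cnt` avoids rebinding the name `count`.
cat = data.groupBy("id").agg(func.countDistinct("category").alias("num_category"))
cnt = data.groupBy("id").agg(func.sum("count").alias("sum(count)"))
table2 = cat.join(cnt, "id")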

1 Answer

Best Answer

You can apply aggregations to multiple columns on a single grouped DataFrame:

data.groupby('id') \
    .agg({'category': 'count', 'count': 'sum'}) \
    .withColumnRenamed('count(category)', 'num_category') \
    .show()
+---+------------+----------+
| id|num_category|sum(count)|
+---+------------+----------+
|  1|           3|         5|
|  2|           2|         2|
+---+------------+----------+
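Note that the dict form of agg auto-generates the output names (count(category), sum(count)), which is why the withColumnRenamed is needed. An equivalent sketch with explicit functions and alias controls the names directly; countDistinct is also the safer choice if an (id, category) pair could ever repeat, since 'category': 'count' counts rows rather than distinct values:

from pyspark.sql import functions as F

table2 = data.groupBy("id").agg(
    F.countDistinct("category").alias("num_category"),
    F.sum("count").alias("sum(count)"),
)
table2.show()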