Suppose I have a big Spark DataFrame and I don't know how many columns it has.
(The solution has to be in PySpark using a pandas UDF, not a different approach.)
I want to perform an action on all columns. It's OK to loop over the columns inside the UDF, but I don't want to loop through the rows. I want it to act on a whole column at once.
I couldn't find anything on the internet about how this could be done.
Suppose I have this DataFrame:
A B C
5 3 2
1 7 0
Now I want to send it to a pandas UDF to get the sum of each row:
Sum
10
8
The number of columns is not known.
I can do it inside the UDF by looping one row at a time, but I don't want that. I want it to act on all rows without looping. Looping through the columns is fine if needed.
One option I tried is combining all the columns into an array column:
ARR
[5,3,2]
[1,7,0]
But even here it doesn't work for me without looping. I send this column to the UDF, and then inside it I need to loop through its rows and sum the values of each row's list.
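A minimal sketch of that array-column attempt, not the exact code that was tried. It assumes the array column reaches the pandas UDF body as a pandas Series whose elements are one array per row; the name `sum_array_rows` is made up for illustration.

```python
import numpy as np
import pandas as pd


def sum_array_rows(arr: pd.Series) -> pd.Series:
    """Body of a hypothetical pandas UDF that receives the ARR column.

    Each element of `arr` is one row's list, so `apply` still calls the
    lambda once per row. That is the row loop this question wants to avoid.
    """
    return arr.apply(lambda values: int(np.sum(values)))
```

The per-row `apply` is hidden inside pandas, but it is still Python-level work done once per row.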
It would be nice if I could pass each column separately and act on the whole column at once.
How do I act on the column at once, without looping through the rows?
If I loop through the rows, I guess it's no better than a regular Python UDF.
I wouldn't reach for pandas UDFs here. Resort to UDFs only when it can't be done in native PySpark. Anyway, code for both is below.
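A minimal sketch of both options, under a few assumptions: every column is an integer type (hence the `"long"` return type), and the names `sum_columns` and `add_row_sums` are made up for illustration. PySpark is imported inside `add_row_sums` so the pandas part can be tried without a Spark installation.

```python
from functools import reduce
from operator import add

import pandas as pd


def sum_columns(*cols: pd.Series) -> pd.Series:
    """Add whole columns together: loops over columns, never over rows."""
    return reduce(add, cols)


def add_row_sums(df):
    """Return `df` with a native-Spark sum and a pandas-UDF sum appended."""
    from pyspark.sql import functions as F
    from pyspark.sql.functions import pandas_udf

    cols = [F.col(c) for c in df.columns]

    # Option 1 (preferred): plain column arithmetic, no UDF at all.
    native_sum = reduce(add, cols)

    # Option 2: a scalar pandas UDF. Each Spark column arrives as one
    # pandas Series per batch, so the addition is vectorised.
    # Assumption: integer inputs, so the result fits a "long".
    sum_udf = pandas_udf(sum_columns, returnType="long")

    return df.select(
        "*",
        native_sum.alias("Sum"),
        sum_udf(*cols).alias("Sum_udf"),
    )
```

With an existing `SparkSession` named `spark`, this could be tried as `add_row_sums(spark.createDataFrame([(5, 3, 2), (1, 7, 0)], ["A", "B", "C"])).show()`. Passing the columns as separate arguments, rather than packing them into an array, is what lets the UDF work on whole columns at once.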