Dividing a set of columns by its average in Pyspark

555 Views Asked by At

I have to divide a set of columns in a pyspark.sql.dataframe by their respective column average but I am not able to find an correct way to do it. Below is a sample data and my present code.

Input Data

columns = ["Col1", "Col2", "Col3","Name"]
data = [("1","40", "56" , "john jones"),
    ("2", "45", "55", "tracey smith"),
    ("3", "33", "23", "amy sanders")]

df = spark.createDataFrame(data=data,schema=columns)

Col1    Col2    Col3    Name
 1      40      56  john jones
 2      45      55  tracey smith
 3      33      23  amy sanders

Expected Output

Col1    Col2    Col3    Name
 0.5    1.02    1.25    john jones
 1      1.14    1.23    tracey smith
 1.5    0.84    0.51    amy sanders

Function as of now. Not working:

#function to divide few columns by the column average and overwrite the column

def avg_scaling(df):

  #List of columns which have to be scaled by their average
  col_list = ['col1', 'col2', 'col3']

  for i in col_list:
   df = df.withcolumn(i, col(i)/df.select(f.avg(df[i])))

   return df

new_df = avg_scaling(df)
1

There are 1 best solutions below

0
On BEST ANSWER

You can make use of a Window here partitioned on a pusedo column and run average on that window.

The code goes like this,

columns = ["Col1", "Col2", "Col3","Name"]
data = [("1","40", "56" , "john jones"),
    ("2", "45", "55", "tracey smith"),
    ("3", "33", "23", "amy sanders")]

df = spark.createDataFrame(data=data,schema=columns)

df.show()

+----+----+----+------------+
|Col1|Col2|Col3|        Name|
+----+----+----+------------+
|   1|  40|  56|  john jones|
|   2|  45|  55|tracey smith|
|   3|  33|  23| amy sanders|
+----+----+----+------------+


from pyspark.sql import Window

def avg_scaling(df, cols_to_scale):
    
    w = Window.partitionBy(F.lit(1))
    for col in cols_to_scale:
        df = df.withColumn(f"{col}", F.round(F.col(col) / F.avg(col).over(w), 2))
    return df

new_df = avg_scaling(df, ["Col1", 'Col2', 'Col3'])
new_df.show()


+----+----+----+------------+
|Col1|Col2|Col3|        Name|
+----+----+----+------------+
| 0.5|1.02|1.25|  john jones|
| 1.5|0.84|0.51| amy sanders|
| 1.0|1.14|1.23|tracey smith|
+----+----+----+------------+