I have to divide a set of columns in a pyspark.sql.dataframe by their respective column average but I am not able to find an correct way to do it. Below is a sample data and my present code.
Input Data
columns = ["Col1", "Col2", "Col3","Name"]
data = [("1","40", "56" , "john jones"),
("2", "45", "55", "tracey smith"),
("3", "33", "23", "amy sanders")]
df = spark.createDataFrame(data=data,schema=columns)
Col1 Col2 Col3 Name
1 40 56 john jones
2 45 55 tracey smith
3 33 23 amy sanders
Expected Output
Col1 Col2 Col3 Name
0.5 1.02 1.25 john jones
1 1.14 1.23 tracey smith
1.5 0.84 0.51 amy sanders
Function as of now. Not working:
#function to divide few columns by the column average and overwrite the column
def avg_scaling(df):
#List of columns which have to be scaled by their average
col_list = ['col1', 'col2', 'col3']
for i in col_list:
df = df.withcolumn(i, col(i)/df.select(f.avg(df[i])))
return df
new_df = avg_scaling(df)
You can make use of a Window here partitioned on a pusedo column and run average on that window.
The code goes like this,