I have a Python function that receives a pyspark dataframe and checks if it has all the columns expected by other functions used in a script. In particular, if the column 'weight' is missing, I want to update the dataframe passed by the user by assigning a new column to it.
For example:
from pyspark.sql import functions as F
def verify_cols(df):
if 'weight' not in df.columns:
df = df.withColumn('weight', F.lit(1)) # Can I update `df` inside this function?
As you can see, I want the function to update df. How can I achieve this? If possible, I would like to avoid using a return statement.
This post is very similar but uses pandas' inplace argument.
To avoid the return statement you can declare a class and keep the df as a member field.
After calling verify_cols methods, the field df will be updated.