I have a Spark DataFrame that needs to be forward-filled (ffill). The DataFrame is large (>100 million rows). I'm able to achieve what I want using pandas, as shown below:
new_df = df_pd.set_index('someDateColumn') \
.groupby(['Column1', 'Column2', 'Column3']) \
.resample('D') \
.ffill() \
.reset_index(['Column1', 'Column2', 'Column3'], drop=True) \
.reset_index()
I got stuck at the .resample('D') step when trying this with Koalas. Is there a better alternative for replicating the ffill logic with Spark-native functions? I want to avoid pandas because it is not distributed and executes only on the driver node.

How can I achieve the same as above using Spark or Koalas?
If you are looking for forward fill in Spark, this tutorial shows how to do it - here