I'm trying to create a PySpark function that takes a DataFrame as input and returns a data-profile report. I've already used the describe and summary functions, which give results like min, max, count, etc., but I need a more detailed report that includes things like unique values, and some visuals too.
If anyone knows anything that can help, feel free to comment below.
A generic function that produces the output described above would be very helpful.
Option 1:
If the Spark DataFrame is not too big, you can convert it to pandas and use a profiling library such as
sweetviz.
You can check out more sweetviz features here, such as how to compare two populations.
Option 2:
Use a profiler that accepts a
pyspark.sql.DataFrame directly, e.g. ydata-profiling.