I'm trying to create a PySpark function that takes a DataFrame as input and returns a data-profile report. I've already used the describe and summary functions, which give results like min, max, count, etc., but I need a more detailed report that includes things like unique values, and some visuals too.
If anyone knows anything that can help, feel free to comment below.
A generic function that produces the output described above would be very helpful.
Option 1:
If the Spark DataFrame is not too big, you can convert it to pandas and use a profiling library such as
sweetviz.
You can check out more sweetviz features here, such as how to compare two populations.
Option 2:
Use a profiler that accepts a
pyspark.sql.DataFrame directly, e.g. ydata-profiling.