Spark report with pandas profiling

I'm trying to generate a ydata-profiling report in an AWS Glue environment, with the following versions:

  • glue_version 3.0
  • ydata_profiling 4.5.1
  • pyspark 3.1.1+amzn.0

I have also tried glue_version 2.0 and other versions of ydata_profiling (e.g. 4.3.2), but I get the same issue.

After loading the data (just 3397 rows) successfully with

dataset = glueContext.create_data_frame_from_catalog(database=config['schema'], table_name=table)

I used the following lines to generate the ydata-profiling report:

prof = ydata_profiling.ProfileReport(dataset, config_file=config['profiler_config'])
report = prof.get_description()

and got this error:

DispatchError: Function <code object spark_get_series_descriptions at 0x7f8c28632a50, file "/home/spark/.local/lib/python3.7/site-packages/ydata_profiling/model/spark/summary_spark.py", line 67>
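
As far as I understand, a DispatchError is raised when a type-dispatched function finds no registered implementation matching the runtime types of its arguments, so the exact class of `dataset` may be worth checking. A minimal sketch of such a check, using only the standard library; `qualified_mro` is a hypothetical helper name of my own, not part of ydata-profiling or Glue:

```python
def qualified_mro(obj):
    """Return the fully qualified class names in obj's method resolution order."""
    return [f"{cls.__module__}.{cls.__qualname__}" for cls in type(obj).__mro__]


# In the Glue job this would be called as qualified_mro(dataset).
# A plain list stands in here so the snippet runs without Spark.
print(qualified_mro([1, 2, 3]))
```

In the Glue job I would expect the first entry for `dataset` to be `pyspark.sql.dataframe.DataFrame`; anything else would suggest the object is not the type the Spark backend is registered for. This is only a diagnostic under that assumption, not a confirmed cause.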

The config file shouldn't be the problem, since I tried the suggested config from the ydata-profiling page:

prof = ydata_profiling.ProfileReport(
    dataset,
    infer_dtypes=False,
    interactions=None,
    missing_diagrams=None,
    correlations={
        "auto": {"calculate": False},
        "pearson": {"calculate": True},
        "spearman": {"calculate": True},
    },
)
report = prof.get_description()

but I get the same issue. The issue is the same if I do

prof.to_file('prova.json')

or

prof.to_html('prova.html')

I have no idea how to fix the problem. Does anyone have a suggestion, or has anyone had the same issue?
