I am doing exploratory data analysis on my numeric data. I tried to run pandas-profiling, but I got an error while generating the report structure.
import pandas as pd
from pandas_profiling import ProfileReport
df = pd.read_csv('mydatadata.csv')
print(df)
profile = ProfileReport(df)
profile.to_file(output_file="mydata.html")
The error log looks like this:
Summarize dataset:  99%|███████████████████████████████████████████████████████████████████████▌| 1144/1150 [46:07<24:03, 240.60s/it, Calculate cramers correlation]
C:\Users\USER\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas_profiling\model\correlations.py:101: UserWarning: There was an attempt to calculate the cramers correlation, but this failed. To hide this warning, disable the calculation (using df.profile_report(correlations={"cramers": {"calculate": False}})). If this is problematic for your use case, please report this as an issue: https://github.com/pandas-profiling/pandas-profiling/issues (include the error message: 'No data; observed has size 0.')
  warnings.warn(
Summarize dataset: 100%|██████████████████████████████████████████████████████████████████████████████████▋| 1145/1150 [46:19<17:32, 210.49s/it, Get scatter matrix]
Fail to allocate bitmap
Reason your code may have failed
If your code failed for the same reason as mine, the report ran out of memory while drawing its plots: "Fail to allocate bitmap" is Matplotlib failing to allocate the backing image for a figure, which tends to happen on large datasets where the report draws many plots back to back.
Possible fix in your code
There is a workaround documented on the GitHub page for pandas-profiling under "Large datasets". In it, there is this example:
Possible fix in the pandas-profiling source?
I got the exact same error. I tried looking it up, and it seemed to be coming from a memory leak in Matplotlib: the plots were not being properly released after they were drawn. I tried adding the following to the utils.py file within the visualization folder of pandas_profiling:
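Roughly the following (a sketch, since the exact contents and location of utils.py differ across pandas-profiling versions; the idea is to clear figure state and close every open figure, not just the current one):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, as used for report rendering
import matplotlib.pyplot as plt

# Added alongside the existing cleanup in
# pandas_profiling/visualisation/utils.py (sketch; exact spot
# varies by version):
plt.cla()           # clear the current axes
plt.clf()           # clear the current figure's contents
plt.close("all")    # close *every* open figure, not just the last one
```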
The code there originally had
plt.close()
which I have found in the past not to be enough when making multiple plots back to back. However, I still got this error, which makes me think it may not be Matplotlib (or I missed it somewhere). The
minimal=True
fix above may work sometimes. It still fails for me occasionally when my datasets are too big.