Fail to allocate bitmap error on numeric data using pandas profiling


I am doing exploratory data analysis on my numeric data. I tried to run pandas profiling, but I got an error while generating the report structure.

import pandas as pd
from pandas_profiling import ProfileReport
df = pd.read_csv('mydatadata.csv')
print(df)
profile = ProfileReport(df) 
profile.to_file(output_file="mydata.html")

and the error log looks like this:

    Summarize dataset:  99%|███████████████████████████████████████████████████████████████████████▌| 1144/1150 [46:07<24:03, 240.60s/it, Calculate cramers correlation]
    C:\Users\USER\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas_profiling\model\correlations.py:101: UserWarning: There was an attempt to calculate the cramers correlation, but this failed. To hide this warning, disable the calculation (using df.profile_report(correlations={"cramers": {"calculate": False}}) If this is problematic for your use case, please report this as an issue: https://github.com/pandas-profiling/pandas-profiling/issues (include the error message: 'No data; observed has size 0.')
      warnings.warn(
    Summarize dataset: 100%|██████████████████████████████████████████████████████████████████████████████████▋| 1145/1150 [46:19<17:32, 210.49s/it, Get scatter matrix]
    Fail to allocate bitmap

There are 2 best solutions below

Reason your code may have failed

If your code failed for the same reason as mine, you either:

  1. tried making multiple profiles at the same time, or
  2. tried making a profile of a dataset that is large in terms of variables.

Possible fix in your code

There is a workaround documented on the GitHub page for pandas-profiling under "large datasets". In it, there is this example:

    from pandas_profiling import ProfileReport
    profile = ProfileReport(large_dataset, minimal=True)
    profile.to_file("output.html")

Possible fix in the pandas-profiling source?

I got the exact same error. When I looked it up, it appeared to come from a memory leak in Matplotlib: the plots were not being properly released after they were created. I tried adding the following to the utils.py file in the visualization folder of pandas_profiling:

    plt.clf()
    plt.cla()
    plt.close('all')

The code there originally had plt.close(), which I have found in the past to not be enough when making multiple plots back to back. However, I still got this error, which makes me think it may not be Matplotlib (or I missed it somewhere).
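To illustrate the difference outside of pandas-profiling, here is a minimal, self-contained sketch of the three-step cleanup applied after each plot in a back-to-back loop:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; no GUI required
import matplotlib.pyplot as plt

def make_plots(n):
    """Render n plots back to back, freeing each figure's memory afterwards."""
    for i in range(n):
        fig, ax = plt.subplots()
        ax.plot([0, 1], [0, i])
        fig.savefig(f"plot_{i}.png")
        # The three-step cleanup described above:
        plt.clf()         # clear the current figure
        plt.cla()         # clear the current axes
        plt.close("all")  # close *every* open figure and release its memory

make_plots(3)
print(len(plt.get_fignums()))  # → 0: no figures left open
```

With only plt.close() (no argument), any figure that is not the "current" one stays alive, which is how memory piles up when many plots are generated in a row.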

The minimal=True fix above may work sometimes. It still fails for me occasionally when my datasets are too big.

I, too, have run into that particular error. Personal experimentation with configuration settings narrowed things down to the quantity of plots generated (or requested) up to that point in the run. What varied in my experiments was the number of ProfileReport() generations that required interaction graphs: the fewer interaction graphs plotted overall, the less frequently the error occurred.

Isolating the work in separate processes via multiprocessing Pool.map() calls might help if you are only trying to generate a single profile and you absolutely need all of your interaction graphs. But that is a time- and/or RAM-greedy option: you would have to get creative with instanced joins that pair up column values for interaction graphs from smaller DataFrames, i.e. more reports overall.

Regardless, since the documentation for ProfileReport configuration settings is terrible, here are the ones I have stumbled into figuring out that you should probably investigate:

  • vars.cat.words -- Boolean, triggers word cloud graph generation for string/categorical variables, probably don't need it.
  • missing_diagrams.bar -- Boolean, turn off if graph is unnecessary
  • missing_diagrams.matrix -- Boolean, turn off if graph is unnecessary
  • missing_diagrams.heatmap -- Boolean, turn off if graph is unnecessary
  • missing_diagrams.dendrogram -- Boolean, turn off if graph is unnecessary
  • correlations.&lt;everything that isn't pearson, if you don't need it&gt; -- Boolean calculate flags (spearman, kendall, phi_k, cramers); turn off whichever you don't need.
  • interactions.targets -- Python list of strings. Specifying one or more column names here will limit interaction graphs to just those involving these variables. This is probably what you're looking for if you just can't bear to drop columns prior to report generation.
  • interactions.continuous -- Boolean, turn off if you just don't want interactions anyway.
  • plot.image_format -- string, 'svg' or 'png'; the two differ in the size of the underlying data structures, which may or may not be related to the apparent memory leak the other answer describes.
  • plot.dpi -- integer, specifies dpi of plotted images, and likewise may be related to an apparent memory leak.
  • plot.pie.max_unique -- integer, set to 0 to disable occasional pie charts being graphed, likewise may be related to apparent memory leak from graph plotting.

Good luck, and don't be afraid to try out other options like DataPrep. I wish I did before going down this rabbit hole.