pandas-profiling with a Dask DataFrame: IndexError


I get an IndexError (IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices) when running pandas-profiling on data loaded with Dask. Data shape: 290170 x 55.

import dask.dataframe as dd
from pandas_profiling import ProfileReport
df = dd.read_csv("covtype.data").compute()

df.columns = ["Elevation", "Aspect", "Slope", "Horizontal_d_to_hydrology", "vertical_d_to_hydrology", "Horizontal_Distance_To_Roadways", "Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm", "Horizontal_Distance_To_Fire_Points", "Rawah Wilderness Area","Neota Wilderness Area", "Comanche Peak Wilderness Area", "Cache la Poudre Wilderness Area", "2702", "2703", "2704", "2705", "2706", "2717", "3501", "3502", "4201", "4703", "4704", "4744", "4758", "5101", "5151", "6101", "6102", "6731", "7101", "7102", "7103", "7201", "7202", "7700", "7701", "7702", "7709", "7710", "7745", "7746", "7755", "7756", "7757", "7790", "8703", "8707", "8708", "8771", "8772", "8776", "Cover_Type"]

ProfileReport(df)

1 Answer


Quick fix: following Issue #991, you can change line 13 of utils_pandas.py, as ieaves suggested there.

From: w_median = (data[weights == np.max(weights)])[0]

To: w_median = (data[np.where(weights == np.max(weights))])[0]
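
To see what the change does, here is a minimal sketch on plain NumPy arrays. The sample values are made up for illustration; the real `data` and `weights` come from inside the library. On ordinary arrays both forms pick the same element. The difference is that `np.where` turns the boolean mask into integer indices, which the error message lists as a valid index type.

```python
import numpy as np

# Hypothetical sample values, not taken from the library or the dataset.
data = np.array([10.0, 20.0, 30.0, 40.0])
weights = np.array([0.1, 0.5, 0.5, 0.2])

# Boolean mask marking the positions that hold the maximum weight.
mask = weights == np.max(weights)

# np.where with a single argument returns a tuple of integer index arrays.
idx = np.where(mask)

original = data[mask][0]  # "From" form: index with the boolean mask
patched = data[idx][0]    # "To" form: index with integer positions

print(idx[0], original, patched)
```

If several weights tie for the maximum, both forms take the first matching element.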

Also, take a look at Paul H's comment: the Dask DataFrame itself may be incompatible.
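
On the compatibility point, one way to rule Dask out is to make sure a plain pandas DataFrame is what gets passed to the profiler. The sketch below is only an illustration: `ensure_pandas` and `FakeLazyFrame` are hypothetical names, and the stand-in class mimics only the `.compute()` method that Dask collections expose, so the example runs without Dask installed.

```python
import pandas as pd


def ensure_pandas(df):
    """Return a pandas DataFrame, materializing lazy objects via .compute()."""
    if isinstance(df, pd.DataFrame):
        return df
    if hasattr(df, "compute"):  # Dask collections expose .compute()
        return df.compute()
    raise TypeError(f"unsupported type: {type(df).__name__}")


class FakeLazyFrame:
    """Stand-in for a lazy Dask DataFrame, used only for this illustration."""

    def __init__(self, frame):
        self._frame = frame

    def compute(self):
        return self._frame


# Made-up rows using two of the column names from the question.
frame = pd.DataFrame({"Elevation": [2596, 2590], "Cover_Type": [5, 5]})
result = ensure_pandas(FakeLazyFrame(frame))
print(type(result).__name__, result.shape)
```

In the question, `dd.read_csv(...).compute()` already returns a pandas DataFrame, so there the check would simply pass the frame through unchanged.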