Faltering report with pandas profiling after converting numeric values to categorical values


I'm currently working with pandas profiling and I can't get it to create a proper report. When I just read the CSV file, some columns end up with the wrong data type: values that should be categorical are labeled as numeric. When I instead define the specific data types within the read_csv method, the creation of the report gets stuck at a certain point and takes forever (I canceled it after 30 minutes). When I don't change the data types, the report is done in less than a minute.

Here is also the output of df_data.isnull().sum():

A                   0
B                   0
C                   3
D                   0
E                   0
F                   0
G               86317
H                  39
I                6871
J                   0

I tried to cast the datatypes within the read_csv:

df_data = pd.read_csv('example.csv', parse_dates=['A', 'B'], dtype={
    'C' : 'string',
    'D' : 'string',
    'E' : 'string'
}
)
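
A note on the dtype used above: 'string' asks pandas for its text dtype, not a categorical one. The following is a minimal sketch, assuming a small in-memory CSV in place of example.csv (the column values are made up for illustration), of requesting the 'category' dtype directly in read_csv. Whether this avoids the slow report is not something the post establishes.

```python
import io

import pandas as pd

# Hypothetical stand-in for example.csv; the real file is not shown in the post.
csv_text = "A,C,D\n2021-01-01,1,10\n2021-01-02,2,10\n2021-01-03,,20\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["A"],                          # parse column A as datetime
    dtype={"C": "category", "D": "category"},   # categorical instead of 'string'
)

print(df.dtypes)
# Empty cells stay missing in a categorical column.
print(df["C"].isnull().sum())
```

With this approach the missing entries are still counted by isnull(), so the null counts listed earlier remain meaningful after the conversion.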

And I've also tried to cast the data types with astype() after a normal read_csv:

df_data = pd.read_csv('example.csv')
df_data['C'] = df_data['C'].astype(str)
df_data['D'] = df_data['D'].astype(str)
df_data['E'] = df_data['E'].astype(str)

Both ways had the same result: a report that gets stuck halfway through.
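
One thing that may be worth checking before profiling, offered as an assumption rather than a confirmed cause: a column with very many distinct values can be costly to summarize once it is treated as text or categories. A minimal sketch, using a small made-up frame in place of df_data, that lists dtype, distinct count, and missing count per column:

```python
import pandas as pd

# Hypothetical data standing in for df_data; column names follow the post.
df = pd.DataFrame({
    "C": [1, 2, 3, 4, None],
    "D": [7, 7, 7, 8, 8],
})

summary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),   # current dtype of each column
    "n_unique": df.nunique(),         # distinct values, missing values excluded
    "n_missing": df.isnull().sum(),   # same counts as df.isnull().sum()
})

print(summary)
```

Columns whose n_unique is close to the number of rows are the ones to look at first.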

hexxetexxeh On

Instead of casting the columns in pandas, I declared the types in a type_schema passed to ProfileReport, like this:

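import pandas as pd
from pandas_profiling import ProfileReport  # the package was later renamed ydata_profiling

# Read the file without forcing any dtypes
df_data = pd.read_csv('example.csv')

# Tell the profiler how to treat each column
type_schema = {
    'A' : 'datetime',
    'B' : 'categorical',
    'C' : 'categorical',
    'D' : 'categorical',
    'E' : 'categorical',
    'F' : 'categorical',
    'G' : 'categorical',
    'H' : 'categorical',
    'I' : 'categorical'
}

profile = ProfileReport(df_data, type_schema=type_schema)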