LabelEncoder TypeError?

I'm trying to encode some text values with LabelEncoder. For this I'm writing:

onehot = pd.DataFrame()
encoders = []
for column in df_resolved.loc[:, ((df_resolved.dtypes != np.int64)&(df_resolved.dtypes != np.int32))]:
    enc = preprocessing.LabelEncoder()
    encoders.append(enc)
    onehot[column] = enc.fit_transform(df_resolved[column])

I need the encoding to be reproducible with new data, so I need to store the encoders; that's why I'm doing it this way. However, I get an error:

TypeError: '>' not supported between instances of 'str' and 'int'

I don't understand why this is happening. The encoder should be able to encode strings according to the documentation. What am I missing?

Full stack trace:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-330-f9a564c7c9ab> in <module>()
      8     enc = preprocessing.LabelEncoder()
      9     encoders.append(enc)
---> 10     onehot[column] = enc.fit_transform(df_resolved[column])

/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/preprocessing/label.py in fit_transform(self, y)
    129         y = column_or_1d(y, warn=True)
    130         _check_numpy_unicode_bug(y)
--> 131         self.classes_, y = np.unique(y, return_inverse=True)
    132         return y
    133 

/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/numpy/lib/arraysetops.py in unique(ar, return_index, return_inverse, return_counts)
    209 
    210     if optional_indices:
--> 211         perm = ar.argsort(kind='mergesort' if return_index else 'quicksort')
    212         aux = ar[perm]
    213     else:

TypeError: '>' not supported between instances of 'str' and 'int'

UPDATE:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1436 entries, 0 to 1706
Data columns (total 26 columns):
u_category                       1436 non-null object
caller_id.country                1436 non-null object
number                           1436 non-null object
priority                         1436 non-null object
urgency                          1436 non-null object
incident_state                   1436 non-null object
u_subcategory                    1436 non-null object
assigned_to                      1436 non-null object
short_description                1436 non-null object
sys_created_on                   1436 non-null datetime64[ns]
business_duration                1436 non-null int64
u_resolved_time                  1436 non-null datetime64[ns]
u_reopen_count                   1436 non-null int64
sys_created_by                   1436 non-null int64
caller_id.u_display_name         1436 non-null object
u_on_behalf_of.u_display_name    1436 non-null object
u_on_behalf_of.email             1436 non-null object
u_actual_time_to_resolve         1436 non-null int64
comments                         1436 non-null object
u_comments_and_work_notes        1436 non-null object
description                      1436 non-null object
impact                           1436 non-null object
u_problem_classification         1436 non-null object
resolution_time                  1436 non-null float64
rawtext                          1436 non-null object
cluster                          1436 non-null int32
dtypes: datetime64[ns](2), float64(1), int32(1), int64(4), object(18)
memory usage: 337.3+ KB

The above is the DataFrame info output. My scikit-learn version is 0.18.1.
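
A possible explanation, offered as a sketch rather than a confirmed diagnosis: the trace ends in `np.unique`, which sorts the column, and on Python 3 sorting fails when an `object` column holds both `str` and `int` values. The info output cannot show which of the 18 `object` columns is mixed, so the toy frame below is a hypothetical stand-in for `df_resolved`, and the names `text_columns`, `encoded`, and `new_data` are illustrative assumptions. It casts each column to `str` before fitting and keeps the fitted encoders in a dict keyed by column name so they can be reused on new data. It also selects only `object` columns, since the original filter excludes only `int64` and `int32` and would therefore also pass the `datetime64[ns]` and `float64` columns to the encoder.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Hypothetical stand-in for df_resolved: "priority" is an object column
# that mixes str and int values, which cannot be sorted on Python 3.
df_resolved = pd.DataFrame({
    "priority": pd.Series(["high", 3, "low", 3], dtype=object),
    "u_category": pd.Series(["network", "email", "network", "email"], dtype=object),
    "business_duration": np.array([10, 20, 30, 40], dtype=np.int64),
})

# Select only object-dtype columns, so integer, float, and datetime
# columns are left out of the label encoding.
text_columns = df_resolved.select_dtypes(include=["object"]).columns

encoded = pd.DataFrame(index=df_resolved.index)
encoders = {}  # column name -> fitted LabelEncoder
for column in text_columns:
    enc = preprocessing.LabelEncoder()
    # Cast to str so every value in the column has one sortable type.
    encoded[column] = enc.fit_transform(df_resolved[column].astype(str))
    encoders[column] = enc

# Reuse the stored encoders on new data (hypothetical rows).
new_data = pd.DataFrame({
    "priority": pd.Series(["low", 3], dtype=object),
    "u_category": pd.Series(["email", "network"], dtype=object),
})
new_encoded = pd.DataFrame({
    column: encoders[column].transform(new_data[column].astype(str))
    for column in text_columns
})
```

One caveat with this choice: after the cast, the integer `3` and the string `"3"` map to the same class, and `transform` still raises a `ValueError` for values that were not seen during fitting. To find the mixed column in the real data, something like `df_resolved[column].map(type).unique()` for each `object` column should list the Python types present.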
