I'm trying to encode some text values with LabelEncoder. For this I'm writing:
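import numpy as np
import pandas as pd
from sklearn import preprocessing

onehot = pd.DataFrame()
encoders = []  # one fitted encoder per column, kept for reuse on new data
for column in df_resolved.loc[:, (df_resolved.dtypes != np.int64) & (df_resolved.dtypes != np.int32)]:
    enc = preprocessing.LabelEncoder()
    encoders.append(enc)
    onehot[column] = enc.fit_transform(df_resolved[column])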
I need the encoding to be reproducible with new data, so I need to store the encoders; that's why I'm doing it this way. However, I get an error:
TypeError: '>' not supported between instances of 'str' and 'int'
I don't understand why this is happening. The encoder should be able to encode strings according to the documentation. What am I missing?
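To make the goal of storing the encoders concrete, here is a minimal sketch of fitting one LabelEncoder per column and reusing it on new rows. The toy frames `train` and `new` and their values are made up for illustration and are not taken from df_resolved; keeping the encoders in a dict keyed by column name (rather than a list) is just one possible choice.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy training data (hypothetical values, not from df_resolved).
train = pd.DataFrame({"priority": ["low", "high", "low"],
                      "impact": ["minor", "major", "minor"]})

encoders = {}  # column name -> fitted LabelEncoder
encoded = pd.DataFrame(index=train.index)
for column in train.columns:
    enc = LabelEncoder()
    encoded[column] = enc.fit_transform(train[column])
    encoders[column] = enc

# Later: reuse the stored encoders on new rows with already-seen categories.
new = pd.DataFrame({"priority": ["high"], "impact": ["minor"]})
new_encoded = pd.DataFrame(index=new.index)
for column, enc in encoders.items():
    new_encoded[column] = enc.transform(new[column])

print(new_encoded)
```

Note that `transform` raises a ValueError for labels the encoder did not see during fitting, so new data with unseen categories would need separate handling.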
Full stack trace:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-330-f9a564c7c9ab> in <module>()
8 enc = preprocessing.LabelEncoder()
9 encoders.append(enc)
---> 10 onehot[column] = enc.fit_transform(df_resolved[column])
/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/preprocessing/label.py in fit_transform(self, y)
129 y = column_or_1d(y, warn=True)
130 _check_numpy_unicode_bug(y)
--> 131 self.classes_, y = np.unique(y, return_inverse=True)
132 return y
133
/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/numpy/lib/arraysetops.py in unique(ar, return_index, return_inverse, return_counts)
209
210 if optional_indices:
--> 211 perm = ar.argsort(kind='mergesort' if return_index else 'quicksort')
212 aux = ar[perm]
213 else:
TypeError: '>' not supported between instances of 'str' and 'int'
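The traceback ends inside a sort (`ar.argsort` within `np.unique`), and in Python 3 ordering a str against an int raises a TypeError. That would be consistent with, though it does not prove, one of the object columns holding a mix of strings and integers. A small sketch in plain Python, using made-up values rather than the real column contents:

```python
# Hypothetical column contents mixing text and numbers.
values = ["low", 3, "high"]

try:
    sorted(values)           # ordering str against int is not defined in Python 3
    error = None
except TypeError as exc:
    error = str(exc)

print(error)

# Casting every value to str first makes them comparable again.
as_text = sorted(str(v) for v in values)
print(as_text)
```

The exact wording of the message (which operator and which type order) can differ between plain Python, NumPy, and library versions, so only the exception type is relied on here.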
UPDATE:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1436 entries, 0 to 1706
Data columns (total 26 columns):
u_category 1436 non-null object
caller_id.country 1436 non-null object
number 1436 non-null object
priority 1436 non-null object
urgency 1436 non-null object
incident_state 1436 non-null object
u_subcategory 1436 non-null object
assigned_to 1436 non-null object
short_description 1436 non-null object
sys_created_on 1436 non-null datetime64[ns]
business_duration 1436 non-null int64
u_resolved_time 1436 non-null datetime64[ns]
u_reopen_count 1436 non-null int64
sys_created_by 1436 non-null int64
caller_id.u_display_name 1436 non-null object
u_on_behalf_of.u_display_name 1436 non-null object
u_on_behalf_of.email 1436 non-null object
u_actual_time_to_resolve 1436 non-null int64
comments 1436 non-null object
u_comments_and_work_notes 1436 non-null object
description 1436 non-null object
impact 1436 non-null object
u_problem_classification 1436 non-null object
resolution_time 1436 non-null float64
rawtext 1436 non-null object
cluster 1436 non-null int32
dtypes: datetime64[ns](2), float64(1), int32(1), int64(4), object(18)
memory usage: 337.3+ KB
This is the DataFrame's info() output. My scikit-learn version is 0.18.1.
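Since `info()` only reports `object` for these columns, it cannot show whether a column mixes Python types. One way to check is sketched below; `mixed_type_columns` is a hypothetical helper name, and the `sample` frame is invented stand-in data, not the real df_resolved.

```python
import pandas as pd

def mixed_type_columns(df):
    """Return {column: set of Python type names} for object columns
    that hold values of more than one Python type."""
    found = {}
    for column in df.select_dtypes(include="object").columns:
        type_names = set(df[column].map(lambda v: type(v).__name__))
        if len(type_names) > 1:
            found[column] = type_names
    return found

# Invented stand-in data: "priority" mixes str and int, the others do not.
sample = pd.DataFrame({"priority": ["1 - High", 3, "2 - Medium"],
                       "u_category": ["Network", "Email", "Network"],
                       "cluster": [0, 1, 2]})

print(mixed_type_columns(sample))
```

If a column shows up in the result, casting it with `astype(str)` before calling `fit_transform` is one possible workaround; whether that is appropriate depends on what the mixed values mean in the data.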