How to order values when label-encoding?

I want to label-encode a column called article_id which has unique identifiers for an article.

Integer values kind of implicitly have an order to them, because 3 > 2 > 1.

I wonder what the most reasonable way is to sort the values before factorizing them, so that this natural order carries some meaning. I thought about sorting them by their occurrence, so that the most common article_id gets the highest label and the least common one gets the lowest label.

Does this make sense and are there more reasonable ways of doing this?

This is what I am doing right now: sorting by occurrence and then factorizing.

df = df.iloc[df.groupby('article_id').article_id.transform('size').argsort(kind='mergesort')]

df['article_id'], article_labels = df['article_id'].factorize()

There are 1 best solutions below

If you just want to assign a custom ordering to the items in some column, you can consider using a Categorical (i.e. a Series whose dtype is a CategoricalDtype). When defining the dtype, use ordered=True.

import numpy as np
import pandas as pd

# Example data
article_ids = np.random.choice(['abc', 'def', 'ghi'], size=100, p=[0.2, 0.3, 0.5])
df = pd.DataFrame({'article_id': article_ids})

# Obtain counts to determine the order
vc = df['article_id'].value_counts()
dtype = pd.CategoricalDtype(vc.index, ordered=True)

# Convert your column to use the new dtype
df['article_id'] = df['article_id'].astype(dtype)

# If you sort that column, it will sort according
# to your custom ordering, not by string value.
print(df.sort_values('article_id'))
   article_id
29        ghi
30        ghi
31        ghi
32        ghi
33        ghi
..        ...
76        abc
75        abc
74        abc
71        abc
58        abc
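
Sorting is not the only thing that respects the custom order. As a minimal sketch (the category order below is hard-coded for illustration rather than derived from counts), comparisons and min/max on an ordered Categorical also follow the category order instead of the alphabetical one:

```python
import pandas as pd

# Hypothetical fixed order for illustration: 'ghi' < 'def' < 'abc'
order = pd.CategoricalDtype(['ghi', 'def', 'abc'], ordered=True)
s = pd.Series(['abc', 'ghi', 'def', 'ghi']).astype(order)

# Comparisons use the position in the category list, not the string value
mask = s < 'def'          # True only where the value is 'ghi'

# min/max also follow the category order
lowest, highest = s.min(), s.max()
```

With ordered=False these comparisons would raise a TypeError, which is why the flag matters here.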

You can assign ordered integer values for the categories by inspecting dtype.categories:

categories = df['article_id'].dtype.categories
# Map each category to its position in the ordering
code = {c: i for i, c in enumerate(categories)}
df['article_code'] = df['article_id'].map(code)

print(df)
   article_id article_code
0         def            1
1         ghi            0
2         ghi            0
3         ghi            0
4         def            1
..        ...          ...
95        ghi            0
96        def            1
97        def            1
98        abc            2
99        def            1

[100 rows x 2 columns]
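
Note that value_counts() sorts in descending order by default, so in the output above the most common article gets code 0. If you want the most common article to get the highest label, as the question describes, one option is to build the order from ascending counts. The sketch below also uses the .cat.codes accessor, which should give the same integers as the manual mapping; the seeded generator is only an assumption to make the example repeatable:

```python
import numpy as np
import pandas as pd

# Example data (seeded generator is an assumption, for repeatability only)
rng = np.random.default_rng(0)
article_ids = rng.choice(['abc', 'def', 'ghi'], size=100, p=[0.2, 0.3, 0.5])
df = pd.DataFrame({'article_id': article_ids})

# Least common first, so the most common article ends up with the highest code
vc = df['article_id'].value_counts(ascending=True)
dtype = pd.CategoricalDtype(vc.index, ordered=True)
df['article_id'] = df['article_id'].astype(dtype)

# .cat.codes gives each row the position of its category in dtype.categories
df['article_code'] = df['article_id'].cat.codes
```

Values that are not in the categories (for example, unseen ids in new data) become NaN on astype and get the code -1, so that case needs separate handling.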