How to order values when label-encoding?

Question

340 Views Asked by christallclear At 04 July 2025 at 01:00

I want to label-encode a column called article_id which has unique identifiers for an article.

Integer values kind of implicitly have an order to them, because 3 > 2 > 1.

I wonder what the most reasonable way is to sort the values before factorizing them, so that the encoding benefits from this natural order. I thought about sorting them by their occurrence, so that the most common article_id has the highest label representation and the one which occurs the least has the lowest label representation.

Does this make sense and are there more reasonable ways of doing this?

This is what I am doing right now: sorting by occurrence and then factorizing.

df = df.iloc[df.groupby('article_id').article_id.transform('size').argsort(kind='mergesort')]

df['article_id'], article_labels = df['article_id'].factorize()


**Stuart Berg** · Answer 1

If you just want to assign a custom ordering to the items in some column, you can consider using a Categorical (i.e. a Series whose dtype is a CategoricalDtype). When defining the dtype, use ordered=True.

import numpy as np
import pandas as pd

# Example data
article_ids = np.random.choice(['abc', 'def', 'ghi'], size=100, p=[0.2, 0.3, 0.5])
df = pd.DataFrame({'article_id': article_ids})

# Obtain counts to determine the order
vc = df['article_id'].value_counts()
dtype = pd.CategoricalDtype(vc.index, ordered=True)

# Convert your column to use the new dtype
df['article_id'] = df['article_id'].astype(dtype)

# If you sort that column, it sorts according
# to your custom ordering, not by string value.
print(df.sort_values('article_id'))

   article_id
29        ghi
30        ghi
31        ghi
32        ghi
33        ghi
..        ...
76        abc
75        abc
74        abc
71        abc
58        abc
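
Besides sorting, ordered=True also lets comparisons and min/max follow the category order. A minimal sketch, assuming a small hypothetical Series with the order written out by hand rather than derived from counts:

```python
import pandas as pd

# Hypothetical toy column; the category order is given explicitly here
dtype = pd.CategoricalDtype(['ghi', 'def', 'abc'], ordered=True)
s = pd.Series(['def', 'abc', 'ghi', 'def'], dtype=dtype)

# Comparisons follow the category order (ghi < def < abc),
# not the alphabetical order of the strings
mask = s > 'ghi'

# min/max are also taken with respect to the category order
lowest, highest = s.min(), s.max()
```

Here mask is True for every row except the 'ghi' one, lowest is 'ghi' and highest is 'abc'.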

You can assign ordered integer values for the categories by inspecting dtype.categories:

# Map each category to its position in the ordered category list
categories = df['article_id'].dtype.categories
code = {c: i for i, c in enumerate(categories)}
df['article_code'] = df['article_id'].map(code)

print(df)

   article_id article_code
0         def            1
1         ghi            0
2         ghi            0
3         ghi            0
4         def            1
..        ...          ...
95        ghi            0
96        def            1
97        def            1
98        abc            2
99        def            1

[100 rows x 2 columns]
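
Note that with value_counts() in its default descending order, the most common article gets code 0. The question asked for the opposite (most common article gets the highest label). One possible way to get that, sketched here on seeded toy data, is to sort the counts ascending before building the dtype and then read the integer codes from the .cat.codes accessor instead of building the mapping by hand:

```python
import numpy as np
import pandas as pd

# Toy data mirroring the example above (seeded so the run is repeatable)
rng = np.random.default_rng(0)
article_ids = rng.choice(['abc', 'def', 'ghi'], size=100, p=[0.2, 0.3, 0.5])
df = pd.DataFrame({'article_id': article_ids})

# Sort counts ascending so the rarest article comes first (code 0)
# and the most common article comes last (highest code)
vc = df['article_id'].value_counts(ascending=True)
dtype = pd.CategoricalDtype(vc.index, ordered=True)
df['article_id'] = df['article_id'].astype(dtype)

# .cat.codes gives each row's position in dtype.categories
df['article_code'] = df['article_id'].cat.codes
```

The codes come back as a small integer dtype. Values that are not among the categories would get the code -1, which cannot happen here because the categories are taken from the column itself.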
