I want to label-encode a column called article_id which has unique identifiers for an article.
Integer values kind of implicitly have an order to them, because 3 > 2 > 1.
I wonder what the most reasonable way is to sort the values before factorizing them, so that this natural order carries some meaning. I thought about sorting them by their occurrence, so that the most common article_id gets the highest label and the one that occurs least gets the lowest label.
Does this make sense and are there more reasonable ways of doing this?
This is what I am doing right now: sorting by occurrence and then factorizing.
df = df.iloc[df.groupby('article_id').article_id.transform('size').argsort(kind='mergesort')]
df['article_id'], article_labels = df['article_id'].factorize()
If you just want to assign a custom ordering to the items in some column, you can consider using a Categorical (i.e. a Series whose dtype is a CategoricalDtype). When defining the dtype, use ordered=True. You can assign ordered integer values for the categories by inspecting dtype.categories:
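A minimal sketch of this approach, assuming a small made-up DataFrame and ordering the categories by ascending frequency as in the question. The sample values and the names counts, article_label, and label_of are illustrative and not taken from the original post.

```python
import pandas as pd

# Hypothetical sample data; only the column name comes from the question.
df = pd.DataFrame({"article_id": ["a9", "b7", "a9", "c3", "a9", "b7"]})

# Order categories from least to most frequent, so that the most common
# article_id ends up with the highest integer code.
counts = df["article_id"].value_counts(ascending=True)
dtype = pd.CategoricalDtype(categories=counts.index, ordered=True)

# Convert the column to the ordered categorical dtype.
df["article_id"] = df["article_id"].astype(dtype)

# Each row's integer label is the position of its category in dtype.categories.
df["article_label"] = df["article_id"].cat.codes

# Explicit mapping from category to integer label (illustrative helper).
label_of = {cat: i for i, cat in enumerate(dtype.categories)}
```

Unlike sorting the whole DataFrame before calling factorize, this leaves the row order untouched; only the category order determines the integer codes. Note that categories with equal counts still need some tie-breaking rule, which this sketch does not define.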