Why are the codes from Label Encoding not in sequential order?

These are my label values:

df['Label'].value_counts()
------------------------------------
Benign                    4401366
DDoS attacks-LOIC-HTTP     576191
FTP-BruteForce             193360
SSH-Bruteforce             187589
DoS attacks-GoldenEye       41508
DoS attacks-Slowloris       10990
Name: Label, dtype: int64

I use label encoding to encode them:

from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
label_encoder.fit(df.Label)
df['Label']= label_encoder.transform(df.Label)

And this is the result:

df['Label'].value_counts()
------------------------------
0    4380628
1     576191
4     193354
5     187589
2      41508
3      10990
Name: Label, dtype: int64

I want the result to look like this:

df['Label'].value_counts()
------------------------------
0    4380628
1     576191
2     193354
3     187589
4      41508
5      10990
Name: Label, dtype: int64

Does anyone know what the problem is and how to solve it?

There is 1 best solution below

Example

We need a minimal, reproducible example to answer this. Let's make one:

import pandas as pd

df = pd.DataFrame(list('BACCCCAAAA'), columns=['col1'])

df

    col1
0   B
1   A
2   C
3   C
4   C
5   C
6   A
7   A
8   A
9   A

Code

df['col1'].value_counts()

A    5
C    4
B    1
Name: col1, dtype: int64

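Your problem arises because LabelEncoder assigns codes by the sorted (alphabetical) order of the class labels, not by how often each label occurs. In your data, "DoS attacks-GoldenEye" and "DoS attacks-Slowloris" sort before "FTP-BruteForce" and "SSH-Bruteforce", so they receive codes 2 and 3 even though they are the rarest classes.

For this df, LabelEncoder gives A-0, B-1, C-2 because of that sorted order.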
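
To see which code each label receives, the fitted encoder can be inspected directly. The following is a small sketch on the same toy df; the names `encoder` and `label_to_code` are only illustrative, and it assumes scikit-learn is installed.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame(list('BACCCCAAAA'), columns=['col1'])

encoder = LabelEncoder()
codes = encoder.fit_transform(df['col1'])

# classes_ stores the distinct labels in sorted order;
# the position of a label in classes_ is the code it receives.
label_to_code = {str(label): code for code, label in enumerate(encoder.classes_)}
print(label_to_code)
print(list(codes))
```

The mapping follows the sorted order of the labels, so the most frequent label only gets code 0 if it also happens to sort first.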

If you want A-0, C-1, B-2 (ordered by frequency), this can be done with pandas alone (no other library is needed), using the following code:

s = df['col1'].map(lambda x: df['col1'].value_counts().index.get_loc(x))

s

0    2
1    0
2    1
3    1
4    1
5    1
6    0
7    0
8    0
9    0
Name: col1, dtype: int64

Assign s to the col1 column:

out = df.assign(col1=s)

out

    col1
0   2
1   0
2   1
3   1
4   1
5   1
6   0
7   0
8   0
9   0

Check value_counts:

out['col1'].value_counts()

0    5
1    4
2    1
Name: col1, dtype: int64

Update

A more efficient way (build the mapping once instead of calling value_counts for every row):

m = pd.Series(range(df['col1'].nunique()), index=df['col1'].value_counts().index)
s = df['col1'].map(m)

s

0    2
1    0
2    1
3    1
4    1
5    1
6    0
7    0
8    0
9    0
Name: col1, dtype: int64
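
The same idea can be wrapped in a small helper that also returns the mapping, so the codes can be decoded later. This is only a sketch: `frequency_encode` is a hypothetical name, and the sample counts below are made up for illustration (only the label names come from the question). Ties in frequency are avoided here on purpose, since the relative order of tied labels should not be relied on.

```python
import pandas as pd

def frequency_encode(labels: pd.Series):
    """Return (codes, mapping); the most frequent label gets code 0."""
    counts = labels.value_counts()  # sorted by count, descending
    mapping = pd.Series(range(len(counts)), index=counts.index)
    return labels.map(mapping), mapping

# Tiny made-up sample that reuses label names from the question.
sample = pd.Series(
    ['Benign'] * 5
    + ['FTP-BruteForce'] * 3
    + ['DoS attacks-GoldenEye'] * 1
    + ['DDoS attacks-LOIC-HTTP'] * 4
    + ['SSH-Bruteforce'] * 2,
    name='Label',
)

codes, mapping = frequency_encode(sample)
print(mapping.to_dict())
print(codes.value_counts())

# Invert the mapping to recover the original labels from the codes.
code_to_label = pd.Series(mapping.index, index=mapping.values)
decoded = codes.map(code_to_label)
```

Keeping `mapping` around plays the role that `label_encoder.classes_` plays for LabelEncoder: without it there is no way to tell which class a given code stands for.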