How to add a unique id for each value in a new column of a Dask DataFrame


I have the following dask dataframe

column1  column2
a        1
a        2
b        3
c        4
c        5

I need to add a new column that assigns a consecutive number to each unique value in column1. My expected output is:

column1  column2  column3
a        1      1
a        2      1
b        3      2
c        4      3
c        5      3

How do I achieve this? Thanks in advance for your help.


There is 1 best solution below

ljdyer

You are talking about a label encoding, which you can find implemented in scikit-learn's LabelEncoder (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).
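To make the idea concrete, here is a minimal pure-Python sketch of what label encoding computes. The helper name label_encode and its start parameter are hypothetical, chosen only for illustration; this is not scikit-learn's implementation, just the same idea of numbering the sorted unique values.

```python
def label_encode(values, start=0):
    """Map each value to the position of its class among the sorted unique values."""
    classes = sorted(set(values))  # e.g. ['a', 'b', 'c']
    index = {value: position + start for position, value in enumerate(classes)}
    return [index[value] for value in values]


# start=1 gives labels that begin at 1, as in the question
codes = label_encode(["a", "a", "b", "c", "c"], start=1)
print(codes)
```

Equal values always receive the same number, and the numbers follow the sorted order of the values. LabelEncoder does the same job with labels starting at 0.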

Here it is applied to your Dask DataFrame:

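import dask.dataframe as dd
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame(
    [('a', 1), ('a', 2), ('b', 3), ('c', 4), ('c', 5)],
    columns=['column1', 'column2'],
)
# npartitions is passed explicitly because older Dask versions require it
ddf = dd.from_pandas(df, npartitions=1)

le = preprocessing.LabelEncoder()
# LabelEncoder works on in-memory data, so column1 is computed first
column1 = ddf['column1'].compute()
labels = pd.Series(le.fit_transform(column1) + 1, index=column1.index)
# Wrap the labels in a Dask Series partitioned like ddf before assigning
ddf['column3'] = dd.from_pandas(labels, npartitions=ddf.npartitions)
print(ddf.head())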

Note: the + 1 is there because your labels start from 1. By default, LabelEncoder's labels start from 0 and follow the sorted order of the unique values.

Output:

    column1  column2  column3
0       a        1        1
1       a        2        1
2       b        3        2
3       c        4        3
4       c        5        3