Pandas categorical columns to factorize tables

69 Views Asked by At

I am working on a huge denormalized table on a SQL server (10 columns x 130m rows). Take this as data example :

import pandas as pd
import numpy as np
data = pd.DataFrame({
    'status' :  ['pending', 'pending','pending', 'canceled','canceled','canceled', 'confirmed', 'confirmed','confirmed'],
    'clientId' : ['A', 'B', 'C', 'A', 'D', 'C', 'A', 'B','C'],
    'partner' :  ['A', np.nan,'C', 'A',np.nan,'C', 'A', np.nan,'C'],
    'product' : ['afiliates', 'pre-paid', 'giftcard','afiliates', 'pre-paid', 'giftcard','afiliates', 'pre-paid', 'giftcard'],
    'brand' : ['brand_1', 'brand_2', 'brand_3','brand_1', 'brand_2', 'brand_3','brand_1', 'brand_3', 'brand_3'],
    'gmv' : [100,100,100,100,100,100,100,100,100]})

data = data.astype({'partner':'category','status':'category','product':'category', 'brand':'category'})

As you can see, many of it columns are categories/strings that could be factorize (replaced by a small int identification to another x.1 join).

My question is if there is a easy way to extract another "dataframe" from each category columns and factory factorize the main table, so the bytes transmitted over a single query could be faster! Is there any easy library for it?

I would expect to get this output:

    data = pd.DataFrame({
        'status' :  ['1', '1','1', '2','2','2', '3', '3','3'],
        'clientId' : ['1', '2', '3', '1', '4', '3', '1', '2','3'],
        'partner' :  ['A', np.nan,'C', 'A',np.nan,'C', 'A', np.nan,'C'],
        'product' : ['afiliates', 'pre-paid', 'giftcard','afiliates', 'pre-paid', 'giftcard','afiliates', 'pre-paid', 'giftcard'],
        'brand' : ['brand_1', 'brand_2', 'brand_3','brand_1', 'brand_2', 'brand_3','brand_1', 'brand_3', 'brand_3'],
        'gmv' : [100,100,100,100,100,100,100,100,100]})
    
status_df = {1 : 'pending', 2:'canceled', 3:'confirmed'} 
clientid = {1 : 'A', 2:'B', 3:'C', 4:'D'}

and so on!

Bonus question! My table is big, so I probably would need to apply something using DASK.

2

There are 2 best solutions below

3
Nick On

You can use factorize to do this. For example:

codes, uniques = pd.factorize(data['status'])
data['status'] = codes
status_df = pd.DataFrame(uniques)

Output (data):

   status clientId partner    product    brand  gmv
0       0        A       A  afiliates  brand_1  100
1       0        B     NaN   pre-paid  brand_2  100
2       0        C       C   giftcard  brand_3  100
3       1        A       A  afiliates  brand_1  100
4       1        D     NaN   pre-paid  brand_2  100
5       1        C       C   giftcard  brand_3  100
6       2        A       A  afiliates  brand_1  100
7       2        B     NaN   pre-paid  brand_3  100
8       2        C       C   giftcard  brand_3  100

Output (status_df):

           0
0    pending
1   canceled
2  confirmed

For columns like partner, where there are NaN values, you can choose to have them replaced with -1 (the default behaviour), or to have NaN included in partner_df (along with its own index) by specifying use_na_sentinel=False.

4
Bill On

Did you try using the built-in categorical data type in Pandas?

To demonstrate this I will make the example dataframe you provided 100 times bigger:

data = pd.concat([data]*100).reset_index()
assert data.shape == (900, 6)

Then compare memory usage before:

print(data.memory_usage(index=False))

status      7200
clientId    7200
partner     7200
product     7200
brand       7200
gmv         7200
dtype: int64

Now convert some of the columns to categorical type:

cols_to_convert = ['status', 'clientId', 'partner', 'product', 'brand']
for col in cols_to_convert:
    data[col] = data[col].astype('category')

Then see the memory usage after:

status      1032
clientId    1104
partner     1024
product     1032
brand       1032
gmv         7200
dtype: int64

Would this achieve what you want? I'm sure Dask has the same features.

If you want to get the categories and category lookup values you can access them as follows:

clientId_categories = data['clientId'].cat.categories
clientId_codes = data['clientId'].cat.codes
print(clientId_categories)
print(clientId_codes[:5])
Index(['A', 'B', 'C', 'D'], dtype='object')
0    0
1    1
2    2
3    0
4    3
dtype: int8