One Hot Encoding of large dataset


I want to build a recommendation system using association rules with the Apriori algorithm implemented in the mlxtend library. My sales data contains about 36 million transactions and 50k unique products. I tried sklearn's OneHotEncoder and pandas' get_dummies(), but both give an out-of-memory error because they cannot create a frame of shape (36 million, 50k):

MemoryError: Unable to allocate 398. GiB for an array with shape (36113798, 50087) and data type uint8

Is there any other solution?


2 Answers

Answer 1:

I think a good solution would be to use embeddings instead of one-hot encoding for this problem. In addition, I recommend splitting your dataset into smaller subsets to further reduce memory consumption, as sketched below.
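A minimal sketch of the splitting idea (not from the original answer): stream over the transactions in fixed-size chunks instead of materializing one giant frame. `itemSetList` is an assumed name for your list of lists of product IDs, and the chunked frequency count only illustrates per-chunk processing; whatever you compute per chunk must still be aggregated globally.

```python
from collections import Counter

chunk_size = 1_000_000  # placeholder; tune to the RAM you have available

item_counts = Counter()
for start in range(0, len(itemSetList), chunk_size):
    # Work on one slice of the 36M transactions at a time
    chunk = itemSetList[start:start + chunk_size]
    for transaction in chunk:
        item_counts.update(transaction)  # e.g. accumulate global item frequencies
```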

You should also consult this thread: https://datascience.stackexchange.com/questions/29851/one-hot-encoding-vs-word-embeding-when-to-choose-one-or-another

Answer 2:

Like you, I initially hit an out-of-memory error with mlxtend, but the following small changes fixed the problem completely.
```python
from mlxtend.preprocessing import TransactionEncoder
import pandas as pd

te = TransactionEncoder()

# The dense version below is what runs out of memory:
# te_ary = te.fit(itemSetList).transform(itemSetList)
# df = pd.DataFrame(te_ary, columns=te.columns_)

# Ask TransactionEncoder for a scipy sparse matrix instead
fitted = te.fit(itemSetList)
te_ary = fitted.transform(itemSetList, sparse=True)

# Build a sparse DataFrame from the sparse matrix; this worked for me
df = pd.DataFrame.sparse.from_spmatrix(te_ary, columns=te.columns_)

# now you can call mlxtend's fpgrowth() followed by association_rules()
```

You should also use fpgrowth instead of apriori on big transaction datasets. Apriori repeatedly generates and scans candidate itemsets, which becomes very slow at this scale; FP-Growth builds a compact FP-tree and avoids candidate generation, yet produces the same frequent itemsets. The mlxtend library supports both apriori and fpgrowth.
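For completeness, a minimal sketch of that mining step on the sparse DataFrame built above; the min_support and min_threshold values are placeholders, not from the original answer, and should be tuned to your data:

```python
from mlxtend.frequent_patterns import fpgrowth, association_rules

# Mine frequent itemsets from the sparse one-hot DataFrame `df`
frequent_itemsets = fpgrowth(df, min_support=0.01, use_colnames=True)

# Derive association rules from the frequent itemsets
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
print(rules.head())
```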