One Hot Encoding of large dataset


I want to build a recommendation system using association rules with the Apriori algorithm implemented in the mlxtend library. My sales data contains about 36 million transactions and 50k unique products. I tried sklearn's OneHotEncoder and pandas' get_dummies(), but both give an out-of-memory error because they cannot create a frame of shape (36 million, 50k):

MemoryError: Unable to allocate 398. GiB for an array with shape (36113798, 50087) and data type uint8

Is there any other solution?


2 Answers

Answer 1:

I think a good solution would be to use embeddings instead of one-hot encoding for this problem. In addition, I recommend splitting your dataset into smaller subsets to further reduce memory consumption, as sketched below.
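A minimal sketch of the splitting idea (not from the original answer): stream over the transactions in fixed-size chunks instead of materializing one giant frame. `itemSetList` is an assumed name for your list of lists of product IDs, and the chunked frequency count only illustrates per-chunk processing; whatever you compute per chunk must still be aggregated globally.

```python
from collections import Counter

chunk_size = 1_000_000  # placeholder; tune to the RAM you have available

item_counts = Counter()
for start in range(0, len(itemSetList), chunk_size):
    # Work on one slice of the 36M transactions at a time
    chunk = itemSetList[start:start + chunk_size]
    for transaction in chunk:
        item_counts.update(transaction)  # e.g. accumulate global item frequencies
```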

You should also consult this thread: https://datascience.stackexchange.com/questions/29851/one-hot-encoding-vs-word-embeding-when-to-choose-one-or-another

Answer 2:

Like you, I initially hit an out-of-memory error with mlxtend, but the following small changes fixed the problem completely.
```python
from mlxtend.preprocessing import TransactionEncoder
import pandas as pd

te = TransactionEncoder()

# The dense version below is what runs out of memory:
# te_ary = te.fit(itemSetList).transform(itemSetList)
# df = pd.DataFrame(te_ary, columns=te.columns_)

# Ask TransactionEncoder for a scipy sparse matrix instead
fitted = te.fit(itemSetList)
te_ary = fitted.transform(itemSetList, sparse=True)

# Build a sparse DataFrame from the sparse matrix; this worked for me
df = pd.DataFrame.sparse.from_spmatrix(te_ary, columns=te.columns_)

# now you can call mlxtend's fpgrowth() followed by association_rules()
```

You should also use fpgrowth instead of apriori on big transaction datasets. Apriori repeatedly generates and scans candidate itemsets, which becomes very slow at this scale; FP-Growth builds a compact FP-tree and avoids candidate generation, yet produces the same frequent itemsets. The mlxtend library supports both apriori and fpgrowth.
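For completeness, a minimal sketch of that mining step on the sparse DataFrame built above; the min_support and min_threshold values are placeholders, not from the original answer, and should be tuned to your data:

```python
from mlxtend.frequent_patterns import fpgrowth, association_rules

# Mine frequent itemsets from the sparse one-hot DataFrame `df`
frequent_itemsets = fpgrowth(df, min_support=0.01, use_colnames=True)

# Derive association rules from the frequent itemsets
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
print(rules.head())
```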