I am working on a two-class machine learning problem. The training set contains 2 million rows of URLs (strings) with labels 0 and 1. A LogisticRegression() classifier should predict one of the two labels when the test data is passed in. I get about 95% accuracy when I use a smaller dataset, i.e. 78,000 URLs with 0/1 labels.
The problem I am having is that when I feed in the big dataset (2 million rows of URL strings), I get this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 540, in runfile
execfile(filename, namespace)
File "C:/Users/Slim/.xy/startups/start/chi2-94.85 - Copy.py", line 48, in <module>
bi_counts = bi.fit_transform(url_list)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 780, in fit_transform
vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 717, in _count_vocab
j_indices.append(vocabulary[feature])
MemoryError
My code, which works for small datasets with fair enough accuracy, is:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

bi = CountVectorizer(ngram_range=(3, 3), binary=True, max_features=9000, analyzer='char_wb')
bi_counts = bi.fit_transform(url_list)
# use_idf belongs in the TfidfTransformer constructor, not in fit_transform
tf = TfidfTransformer(norm='l2', use_idf=True)
X_train_tf = tf.fit_transform(bi_counts)
clf = LogisticRegression(penalty='l1', intercept_scaling=0.5, random_state=0)
clf.fit(X_train_tf, y)
I tried keeping max_features as small as possible, e.g. max_features=100, but I still get the same error.
Please note:
- I am using a Core i5 with 4 GB RAM
- I tried the same code on 8 GB RAM, but no luck
- I am using Python 2.7.6 with scikit-learn, NumPy 1.8.1, SciPy 0.14.0, and Matplotlib 1.3.1
UPDATE:
@Andreas Mueller suggested using HashingVectorizer(). I used it with both the small and the large dataset: the 78,000-row dataset ran successfully, but the 2-million-row dataset gave the same memory error as shown above. I also tried it on 8 GB RAM; in-use memory was around 30% while processing the big dataset.
ANSWER: IIRC, max_features is only applied after the whole vocabulary has been computed. The easiest way out is to use the HashingVectorizer, which does not compute a vocabulary. You will lose the ability to map a feature back to its token, but you shouldn't run into memory issues any more.
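A minimal sketch of that swap, reusing url_list and y from the question (n_features=2**20 is only an illustrative size, not a tuned value, and the TF-IDF step is dropped here for brevity):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

# Stateless hashing: char trigrams are mapped into a fixed-size feature
# space, so no in-memory vocabulary is ever built or stored.
hv = HashingVectorizer(ngram_range=(3, 3), analyzer='char_wb',
                       binary=True, n_features=2 ** 20)
X_train = hv.transform(url_list)  # transform only; there is nothing to fit

clf = LogisticRegression(penalty='l1')
clf.fit(X_train, y)

Since the hashing step is stateless, the 2 million URLs can also be transformed in chunks and stacked with scipy.sparse.vstack if memory is still tight.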