I am using Python 2.7 to store and read a very large file in svmlight format.
I am reading the file with:

    import sklearn.datasets

    rows, labels = sklearn.datasets.load_svmlight_file(matrixPath, zero_based=True)
The file is too big to be stored in memory. I am looking for a way to iterate over the file in batches without the need to split the file in advance.
For now, the best way I have found is to split the svmlight file with the terminal command split and then read the partial files it creates.
I know that a good way to read big files is line by line, or in batches, so as not to overflow memory.
How can I do this with svmlight-formatted files?
Thanks!
I came across the same problem; here is my solution:
Using the load_svmlight_file function from scikit-learn, you can specify the offset and length parameters. From the documentation:

    offset : integer, optional, default 0
    length : integer, optional, default -1

offset makes the parser seek that many bytes forward and discard the remainder of the line it lands in, and a positive length makes it stop starting new lines once its position in the file reaches offset + length bytes. As a result, consecutive byte windows partition the file cleanly: every line is read by exactly one window.
Here is an example of how to iterate over your svmlight file in batches:
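This is a minimal sketch: BATCH_BYTES and N_FEATURES are placeholder values you will need to adapt to your data, and iter_svmlight_batches is just an illustrative helper name. Note that n_features should be passed explicitly, because a single chunk may not contain the highest feature index, and without it the batches could come out with different widths.

    import os
    from sklearn.datasets import load_svmlight_file

    BATCH_BYTES = 100 * 1024 * 1024  # placeholder: read ~100 MB of the file per batch
    N_FEATURES = 2 ** 20             # placeholder: total number of features in the data

    def iter_svmlight_batches(path, batch_bytes=BATCH_BYTES, n_features=N_FEATURES):
        # Walk the file in fixed-size byte windows. load_svmlight_file
        # skips the partial line at the start of each window and finishes
        # the line that crosses its end, so no line is lost or duplicated.
        file_size = os.path.getsize(path)
        offset = 0
        while offset < file_size:
            X, y = load_svmlight_file(path, n_features=n_features,
                                      zero_based=True,
                                      offset=offset, length=batch_bytes)
            if X.shape[0] > 0:  # a window smaller than one line can come back empty
                yield X, y
            offset += batch_bytes

    for X, y in iter_svmlight_batches(matrixPath):
        pass  # process one batch, e.g. feed it to partial_fit of an online estimator

Each batch is an ordinary (sparse matrix, labels) pair, so only one byte window of the file is in memory at a time.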