I am using Python 2.7 to store and read a very large file in svmlight format.
I am reading the file with:

    import sklearn.datasets

    rows, labels = sklearn.datasets.load_svmlight_file(matrixPath, zero_based=True)
The file is too big to be stored in memory. I am looking for a way to iterate over the file in batches without the need to split the file in advance.
For now, the best way I have found is to split the svmlight file with the terminal command split and then read the partial files it creates.
I know that a good way to read big files is line by line, or in batches, so as not to overflow memory.
How can I do this with svmlight-formatted files?
Thanks!
I came across the same problem; here is my solution:
Using the load_svmlight_file function from scikit-learn, you can specify the offset and length parameters. From the documentation:

    offset : integer, optional, default 0
    length : integer, optional, default -1

offset makes the parser seek that many bytes forward and discard the remainder of the line it lands in, and a positive length makes it stop starting new lines once its position in the file reaches offset + length bytes. As a result, consecutive byte windows partition the file cleanly: every line is read by exactly one window.
Here is an example of how to iterate over your svmlight file in batches:
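This is a minimal sketch: BATCH_BYTES and N_FEATURES are placeholder values you will need to adapt to your data, and iter_svmlight_batches is just an illustrative helper name. Note that n_features should be passed explicitly, because a single chunk may not contain the highest feature index, and without it the batches could come out with different widths.

    import os
    from sklearn.datasets import load_svmlight_file

    BATCH_BYTES = 100 * 1024 * 1024  # placeholder: read ~100 MB of the file per batch
    N_FEATURES = 2 ** 20             # placeholder: total number of features in the data

    def iter_svmlight_batches(path, batch_bytes=BATCH_BYTES, n_features=N_FEATURES):
        # Walk the file in fixed-size byte windows. load_svmlight_file
        # skips the partial line at the start of each window and finishes
        # the line that crosses its end, so no line is lost or duplicated.
        file_size = os.path.getsize(path)
        offset = 0
        while offset < file_size:
            X, y = load_svmlight_file(path, n_features=n_features,
                                      zero_based=True,
                                      offset=offset, length=batch_bytes)
            if X.shape[0] > 0:  # a window smaller than one line can come back empty
                yield X, y
            offset += batch_bytes

    for X, y in iter_svmlight_batches(matrixPath):
        pass  # process one batch, e.g. feed it to partial_fit of an online estimator

Each batch is an ordinary (sparse matrix, labels) pair, so only one byte window of the file is in memory at a time.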