I have to build a text classification engine over a large training dataset (~1 million texts, each assigned one or more categories).
The problem is temporary storage for my statistics (the text features extracted from each text). For each text I keep two arrays: int[numOfFeatures] (the indexes of the features present) and double[numOfFeatures] (the weight of each feature). With numOfFeatures == 3000, that is 3000 × (4 + 8) bytes = 36,000 bytes, i.e. about 36 KB per text, or about 36 GB for the whole corpus.
I tried to do this with an SQLite database, packing the arrays as BLOBs:
CREATE TABLE docBlobs (docId INTEGER, featureIndexes BLOB, featureWeights BLOB);
CREATE INDEX docIdIndex_docBlobs ON docBlobs (docId ASC);
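For completeness, writing one document looks roughly like this (a sketch assuming the sqlite-jdbc driver; insertDoc and the big-endian ByteBuffer packing are illustrative choices, not fixed by anything above):

import java.nio.ByteBuffer;
import java.sql.*;

static void insertDoc(Connection conn, long docId, int[] idx, double[] w) throws SQLException {
    ByteBuffer ib = ByteBuffer.allocate(4 * idx.length); // 4 bytes per feature index
    for (int i : idx) ib.putInt(i);
    ByteBuffer wb = ByteBuffer.allocate(8 * w.length);   // 8 bytes per feature weight
    for (double d : w) wb.putDouble(d);
    try (PreparedStatement ps = conn.prepareStatement(
            "INSERT INTO docBlobs (docId, featureIndexes, featureWeights) VALUES (?, ?, ?)")) {
        ps.setLong(1, docId);
        ps.setBytes(2, ib.array());
        ps.setBytes(3, wb.array());
        ps.executeUpdate();
    }
}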
The main use case for these statistics is sequential, iterative loading of the BLOBs that belong to a given category. The mapping from each category to its document IDs is kept in memory, and I tried several ways of loading BLOBs by a set of known IDs.
First, I tried executing
SELECT * FROM docBlobs WHERE docId = ?
in a loop, binding one docId per iteration. It gave about 20 rows per second.
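For reference, a loop of that shape in JDBC terms (a sketch; the names are illustrative, and toInts/toDoubles stand for hypothetical helpers that reverse the BLOB packing shown above):

try (PreparedStatement ps = conn.prepareStatement(
        "SELECT featureIndexes, featureWeights FROM docBlobs WHERE docId = ?")) {
    for (long docId : idsOfCategory) {           // the in-memory category -> IDs mapping
        ps.setLong(1, docId);
        try (ResultSet rs = ps.executeQuery()) {
            if (rs.next()) {
                int[] idx   = toInts(rs.getBytes(1));    // hypothetical unpacking helper
                double[] w  = toDoubles(rs.getBytes(2)); // hypothetical unpacking helper
                // ... feed (idx, w) into the trainer ...
            }
        }
    }
}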
Then I tried creating a temp table in memory, putting the needed IDs into it and using it as an IN list:
PRAGMA temp_store = MEMORY;
CREATE TEMP TABLE tempTable (docId INTEGER);
INSERT INTO tempTable VALUES (?);
SELECT * FROM docBlobs WHERE docId IN (SELECT docId FROM tempTable);
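In JDBC terms, filling the temp table batched inside one transaction and then issuing the single query might look like this (a sketch; names are illustrative):

conn.setAutoCommit(false);
try (Statement st = conn.createStatement()) {
    st.execute("PRAGMA temp_store = MEMORY");
    st.execute("CREATE TEMP TABLE tempTable (docId INTEGER)");
}
try (PreparedStatement ins = conn.prepareStatement("INSERT INTO tempTable VALUES (?)")) {
    for (long docId : idsOfCategory) {
        ins.setLong(1, docId);
        ins.addBatch();
    }
    ins.executeBatch();   // one round trip for all IDs
}
conn.commit();
// One query then streams every matching row:
// SELECT * FROM docBlobs WHERE docId IN (SELECT docId FROM tempTable)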
It gave me about 20 rows per second.
Then I tried a JOIN with the temp table. It also gave about 20 rows per second.
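(A join of that shape would be, reconstructed hypothetically rather than quoted verbatim:)

String joinSql = "SELECT d.docId, d.featureIndexes, d.featureWeights "
               + "FROM docBlobs d JOIN tempTable t ON d.docId = t.docId";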
Does anyone know a way to get a reasonable reading speed for this task?
P.S. With a plain full scan (SELECT * FROM docBlobs), the reading speed was about 1000 rows per second, i.e. limited only by the sequential throughput of my HDD.
If the records to be loaded are not adjacent on disk, the disk head must seek, and those delays cannot be avoided. Your numbers are consistent with that: 20 rows per second is about 50 ms per row, i.e. a few seeks per 36 KB BLOB (SQLite stores a large BLOB as a chain of overflow pages, which need not be contiguous).
If you're loading documents by category, you could try storing them ordered by category, but that breaks down when a document belongs to several categories.
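If the extra disk space is acceptable, one workaround for the multi-category case is to duplicate each document's BLOBs once per category, so every category's rows sit together on disk. A sketch, assuming the same Java/sqlite-jdbc setup as in the question (rowid tables keep rows in rowid order, so inserting category by category leaves each category's rows roughly contiguous):

try (Statement st = conn.createStatement()) {
    st.execute("CREATE TABLE catBlobs (category INTEGER, docId INTEGER, "
             + "featureIndexes BLOB, featureWeights BLOB)");
    st.execute("CREATE INDEX catIndex ON catBlobs (category)");
}
// Insert rows grouped by category, then read one category near-sequentially:
PreparedStatement load = conn.prepareStatement(
    "SELECT docId, featureIndexes, featureWeights FROM catBlobs WHERE category = ?");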