Well, I am making a c++ program, that goes through long streams of symbols and I need to store information for further analysis where in the stream appears symbol sequences of certain length. For instance in binary stream
100110010101
I have a sequences for example of length 6 like this:
- 100110 starting on position 0
- 001100 starting on position 1
- 011001 starting on position 2
- etc.
What I need to store are vectors of all positions where I can find the one certain sequence. So the result should be something like a table, maybe resembling a hash table that look like this:
sequence/ positions
10010101 | 1 13 147 515
01011011 | 67 212 314 571
00101010 | 2 32 148 322 384 419 455
etc.
Now, I figured mapping strings to integers is slow, so because I have information about symbols in the stream upfront, I can use it to map this fixed length sequences to an integer.
The next step was to create a map, that maps these "representing integers" to a corresponding index in the table, where I add next occurence of this sequence. However this is slow, much slower than I can afford. I tried both ordered and unordered map of both std and boost libraries, none having enough efficiency. And I tested it, the map is the real bottleneck here
And here is the loop in pseudocode:
for (int i=seqleng-1;i<stream.size();i++) {
//compute characteristic value for the sequence by adding one symbol
charval*=symb_count;
charval+=sdata[j][i]-'0';
//sampspacesize is number off all possible sequence with this symbol count and this length
charval%=sampspacesize;
map<uint64,uint64>::iterator &it=map.find(charval);
//if index exists, add starting position of the sequence to the table
if (it!=map.end()) {
(table[it->second].add(i-seqleng+1);
}
//if current sequence is found for the first time, extend the table and add the index
else {
table.add_row();
map[charval]=table.last_index;
table[table.last_index].add(i-seqleng+1)
}
}
So the question is, can I use something better than a map to keep the record of corresponding indeces in the table, or is this the best way possible?
NOTE: I know there is a fast way here, and thats creating a storage large enough for every possible symbol sequence (meaning if I have sequence of length 10 and 4 symbols, I reserve 4^10 slots and can omitt the mapping), but I am going to need to work with lengths and number of symbols that results in reserving amount of memory way beyond the computer's capacity. But the the actual number of used slots will not exceed 100 million (which is guaranteed by the maximal stream length) and that can be stored in a computer just fine.
Please ask anything if there is something unclear, this is my first large question here, so I lack experience to express myself the way others would understand.
An unordered map with pre-allocated space is usually the fastest way to store any kind of sparse data.
Given that
std::string
has SSO I can't see why something like this won't be about as fast as it gets:(I have used an unordered_multimap but I may have misunderstood the requirements)
output: