Is there better option than map?

316 Views Asked by At

Well, I am making a c++ program, that goes through long streams of symbols and I need to store information for further analysis where in the stream appears symbol sequences of certain length. For instance in binary stream

100110010101

I have a sequences for example of length 6 like this:

  • 100110 starting on position 0
  • 001100 starting on position 1
  • 011001 starting on position 2
  • etc.

What I need to store are vectors of all positions where I can find the one certain sequence. So the result should be something like a table, maybe resembling a hash table that look like this:

sequence/ positions

10010101 | 1 13 147 515

01011011 | 67 212 314 571

00101010 | 2 32 148 322 384 419 455

etc.

Now, I figured mapping strings to integers is slow, so because I have information about symbols in the stream upfront, I can use it to map this fixed length sequences to an integer.

The next step was to create a map, that maps these "representing integers" to a corresponding index in the table, where I add next occurence of this sequence. However this is slow, much slower than I can afford. I tried both ordered and unordered map of both std and boost libraries, none having enough efficiency. And I tested it, the map is the real bottleneck here

And here is the loop in pseudocode:

for (int i=seqleng-1;i<stream.size();i++) {
    //compute characteristic value for the sequence by adding one symbol
    charval*=symb_count;
    charval+=sdata[j][i]-'0';
    //sampspacesize is number off all possible sequence with this symbol count and this length
    charval%=sampspacesize;
    map<uint64,uint64>::iterator &it=map.find(charval);
    //if index exists, add starting position of the sequence to the table
    if (it!=map.end()) {
        (table[it->second].add(i-seqleng+1);
    }
    //if current sequence is found for the first time, extend the table and add the index
    else {
        table.add_row();
        map[charval]=table.last_index;
        table[table.last_index].add(i-seqleng+1)
    }
}

So the question is, can I use something better than a map to keep the record of corresponding indeces in the table, or is this the best way possible?

NOTE: I know there is a fast way here, and thats creating a storage large enough for every possible symbol sequence (meaning if I have sequence of length 10 and 4 symbols, I reserve 4^10 slots and can omitt the mapping), but I am going to need to work with lengths and number of symbols that results in reserving amount of memory way beyond the computer's capacity. But the the actual number of used slots will not exceed 100 million (which is guaranteed by the maximal stream length) and that can be stored in a computer just fine.

Please ask anything if there is something unclear, this is my first large question here, so I lack experience to express myself the way others would understand.

2

There are 2 best solutions below

5
On

An unordered map with pre-allocated space is usually the fastest way to store any kind of sparse data.

Given that std::string has SSO I can't see why something like this won't be about as fast as it gets:

(I have used an unordered_multimap but I may have misunderstood the requirements)

#include <unordered_map>
#include <string>
#include <iostream>

using sequence = std::string; /// @todo - perhaps replace with something faster if necessary

using sequence_position_map = std::unordered_multimap<sequence, std::size_t>;


int main()
{
    auto constexpr sequence_size = std::size_t(6);
    sequence_position_map sequences;
    std::string input = "11000111010110100011110110111000001111010101010101111010";

    if (sequence_size <= input.size()) {
        sequences.reserve(input.size() - sequence_size);

        auto first = std::size_t(0);
        auto last = input.size();

        while (first + sequence_size < last) {
            sequences.emplace(input.substr(first, sequence_size), first);
            ++first;
        }
    }

    std::cout << "results:\n";
    auto first = sequences.begin();
    auto last = sequences.end();
    while(first != last) {
        auto range = sequences.equal_range(first->first);

        std::cout << "sequence: " << first->first;
        std::cout << " at positions: ";
        const char* sep = "";
        while (first != range.second) {
            std::cout << sep << first->second;
            sep = ", ";
            ++first;
        }
        std::cout << "\n";
    }
}

output:

results:
sequence: 010101 at positions: 38, 40, 42, 44
sequence: 000011 at positions: 30
sequence: 000001 at positions: 29
sequence: 110000 at positions: 27
sequence: 011100 at positions: 25
sequence: 101110 at positions: 24
sequence: 010111 at positions: 46
sequence: 110111 at positions: 23
sequence: 011011 at positions: 22
sequence: 111011 at positions: 19
sequence: 111000 at positions: 26
sequence: 111101 at positions: 18, 34, 49
sequence: 011110 at positions: 17, 33, 48
sequence: 001111 at positions: 16, 32
sequence: 110110 at positions: 20
sequence: 101010 at positions: 37, 39, 41, 43
sequence: 010001 at positions: 13
sequence: 101000 at positions: 12
sequence: 101111 at positions: 47
sequence: 110100 at positions: 11
sequence: 011010 at positions: 10
sequence: 101101 at positions: 9, 21
sequence: 010110 at positions: 8
sequence: 101011 at positions: 7, 45
sequence: 111010 at positions: 5, 35
sequence: 011101 at positions: 4
sequence: 001110 at positions: 3
sequence: 100000 at positions: 28
sequence: 000111 at positions: 2, 15, 31
sequence: 100011 at positions: 1, 14
sequence: 110001 at positions: 0
sequence: 110101 at positions: 6, 36
0
On

After many suggestions in comments and answer, I tested most of them and picked the fastest possibility, reducing the bottleneck caused by the mapping to almost the same time it ran without the "map"(but producing incorrect data, however I needed to find the minimum speed this can be reduced to)

This was achieved by replacing the unordered_map<uint64,uint> and vector<vector<uint>> with just unordered_map<uint64, vector<uint> >, more precisely boost::unordered_map. I tested it also with unord_map<string,vector<uint>> and it surprised me that it was not that much slower as I expected. However it was slower.

Also, probably due to the fact ordered_map moves nodes to remain a balanced tree in its internal structure, ord_map<uint64, vector<uint>> was a bit slower than ord_map<uint64,uint> together with vector<vector<uint>>. But since unord_map does not move its internal data during computation, seems that it is the fastest possible configuration one can use.