Create, access, store and load boost::bimap in an efficient way


In continuation of my earlier question about serializing bitsets: to avoid rebuilding the bimap repeatedly on the same data, I want to save the bimap once and load it when needed.

I have chosen boost::bimap to store the data (as bitsets) in <key,value> pairs because it uses hashing and needs O(1) operations for a search. The bimap may have 40 million bitset entries, and I want to perform the following.

  1. Insert bitsets into the bimap in the minimum possible time. The answer to my earlier question takes more time (nearly 5 seconds for 0.25 million bitset entries, roughly five-fold compared to the hash function given under 2). For the same reason, unordered_set_of and unordered_multiset_of are used.

  2. I want the bimap to consume as little memory as possible and to avoid copying, unlike the following hash function, which copies every block into a temporary vector on each call (see the sketch after this list for one way to avoid that copy).

    namespace std {
        template <typename Block, typename Alloc>
        struct hash<boost::dynamic_bitset<Block, Alloc> > {
    
            using bitset_type = boost::dynamic_bitset<Block, Alloc>;
            using block_type = typename bitset_type::block_type ;
    
            size_t operator()(boost::dynamic_bitset<Block, Alloc> const& bs) const
            {
                thread_local static std::vector<block_type> block_data;
                auto blocks = bs.num_blocks();
                block_data.assign(blocks, 0);
                to_block_range(bs, block_data.begin());
                return boost::hash<std::vector<block_type>>()(block_data);
            }
        };
    }
    
  3. O(1) search for a key/value.

  4. Load the bimap in a short time. Again, loading the bimap takes a long time (nearly 20 seconds for a bimap of 0.25 million entries, 12 MB in size).
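
One possible way to avoid the temporary vector in the hash under point 2 is to feed the blocks straight into the hash as they are produced, using boost::function_output_iterator. A minimal sketch (a drop-in replacement for the specialization above; note the resulting hash value differs from boost::hash<std::vector<block_type>>, so it must be used consistently):

    #include <boost/dynamic_bitset.hpp>
    #include <boost/function_output_iterator.hpp>
    #include <boost/functional/hash.hpp>

    namespace std {
        template <typename Block, typename Alloc>
        struct hash<boost::dynamic_bitset<Block, Alloc> > {
            size_t operator()(boost::dynamic_bitset<Block, Alloc> const& bs) const {
                size_t seed = boost::hash_value(bs.size());
                // to_block_range hands each block to the output iterator; the
                // function output iterator folds it into the seed directly,
                // so no intermediate std::vector is filled.
                boost::to_block_range(bs, boost::make_function_output_iterator(
                    [&seed](Block block) { boost::hash_combine(seed, block); }));
                return seed;
            }
        };
    }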

So I want to achieve 1, 2, 3 and 4 on top of my earlier question, for which the answer code by @sehe is shown below.

#include <boost/archive/binary_iarchive.hpp>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/bimap.hpp>
#include <boost/bimap/unordered_multiset_of.hpp>
#include <boost/bimap/unordered_set_of.hpp>
#include <boost/dynamic_bitset/serialization.hpp>
#include <fstream>
#include <iostream>
#include <string>

#include <boost/iostreams/device/back_inserter.hpp>
#include <boost/iostreams/stream_buffer.hpp>
#include <boost/iostreams/stream.hpp>

#include <boost/functional/hash.hpp>

namespace serial_hashing { // see https://stackoverflow.com/questions/30097385/hash-an-arbitrary-precision-value-boostmultiprecisioncpp-int
    namespace io = boost::iostreams;

    struct hash_sink {
        hash_sink(size_t& seed_ref) : _ptr(&seed_ref) {}

        typedef char         char_type;
        typedef io::sink_tag category;

        std::streamsize write(const char* s, std::streamsize n) {
            boost::hash_combine(*_ptr, boost::hash_range(s, s+n));
            return n;
        }
      private:
        size_t* _ptr;
    };

    template <typename T> struct hash_impl {
        size_t operator()(T const& v) const {
            using namespace boost;
            size_t seed = 0;
            {
                iostreams::stream<hash_sink> os(seed);
                archive::binary_oarchive oa(os, archive::no_header | archive::no_codecvt);
                oa << v;
            }
            return seed;
        }
    };
}

namespace std {
    template <typename Block, typename Alloc> struct hash<boost::dynamic_bitset<Block, Alloc> >
        : serial_hashing::hash_impl<boost::dynamic_bitset<Block, Alloc> > 
    {};
} // namespace std

namespace bimaps = boost::bimaps;
using Bitset = boost::dynamic_bitset<>;

typedef boost::bimap<
    bimaps::unordered_set_of<Bitset, std::hash<Bitset> >,
     bimaps::unordered_multiset_of<Bitset, std::hash<Bitset> > > Index;

int main() {
    using namespace std::string_literals;

    {
        std::cout << "# Writing binary file ... " << std::endl;
        Index index;
        index.insert({Bitset("10010"s), Bitset("1010110110101010101"s)});

        std::ofstream ofs("binaryfile", std::ios::binary);
        boost::archive::binary_oarchive oa(ofs);
        oa << index;
    }

    {
        std::cout << "# Loading binary file ... " << std::endl;
        std::ifstream ifs("binaryfile", std::ios::binary); // name of loading file

        boost::archive::binary_iarchive ia(ifs);

        Index index;
        ia >> index;
    }
}
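
For point 3, a quick sketch of what lookups against this Index look like (a fragment only; it reuses the Index/Bitset typedefs and the std::hash specialization from the listing above):

// Fragment: assumes the Index/Bitset typedefs and hash specialization above.
using namespace std::string_literals;

Index index;
index.insert({Bitset("10010"s), Bitset("1010110110101010101"s)});

// Left view: key -> value, average O(1) via unordered_set_of
auto lit = index.left.find(Bitset("10010"s));
if (lit != index.left.end())
    std::cout << "value: " << lit->second << "\n";

// Right view: value -> key(s); unordered_multiset_of allows duplicate values,
// so equal_range returns every matching key
auto range = index.right.equal_range(Bitset("1010110110101010101"s));
for (auto rit = range.first; rit != range.second; ++rit)
    std::cout << "key: " << rit->second << "\n";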

EDIT

AIM: I have a real-life example where I have a large string of, say, 2000 million or more characters, and, say, 40-100 million short strings of length 200 or more characters each. My aim is to search for these short strings in the large string. I thought of creating a bimap of bitsets for the large string and then searching for the short strings in that bimap. I also thought of using unordered containers to make insertion and search very fast, unlike ordered ones.

Key bitset length is around 3-40 bits (all combinations at a time).

Value bitset length is around 100-2000 bits (only one length at a time; for example, if it is 100, all the value entries will be around 90-110 bits or so).
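
For concreteness, one hypothetical way to turn such strings into bitset keys is a fixed-width per-character code; the 2-bits-per-character mapping and the encode function below are assumptions for illustration, not something from the question:

#include <boost/dynamic_bitset.hpp>
#include <cstddef>
#include <string>

using Bitset = boost::dynamic_bitset<>;

// Hypothetical: pack each character into 2 bits (enough for a 4-letter
// alphabet such as ACGT); any fixed-width code per character would do.
Bitset encode(const std::string& s) {
    Bitset bs(2 * s.size());
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned code = 0;
        switch (s[i]) {
            case 'A': code = 0; break;
            case 'C': code = 1; break;
            case 'G': code = 2; break;
            case 'T': code = 3; break;
        }
        bs[2 * i]     = code & 1;
        bs[2 * i + 1] = (code >> 1) & 1;
    }
    return bs;
}

// A short string would then be looked up as index.left.find(encode(short_str))
// or index.right.equal_range(encode(...)), with the Index from the code above.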

1 Answer

You framed the whole question in terms of unordered maps with bitsets. What is the goal? What real life problems are you modeling with this design?

Just how big are your bitsets? What is the variance in size? What is the distribution in certain subsets of the bits? You might be far better off with a quick-and-dirty, ad-hoc hash that assumes some things than with this general-purpose approach (see ¹ and ² below).

You will want to reduce allocations, load "packed" data into a non-node-based container, and control when the elements are sorted (instead of carrying that invariant all the time), as sketched below.
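
For example, a plain sorted vector of fixed-width entries keeps everything contiguous and lets you establish the sorted invariant once, after bulk loading. A sketch only; the packed Key/Value layout is an assumption based on the bit lengths given in the question:

#include <algorithm>
#include <array>
#include <cstdint>
#include <utility>
#include <vector>

// Assumed packed layout: keys are 3-40 bits, so one 64-bit word suffices;
// values are up to ~2000 bits, so a fixed array of 32-bit blocks.
using Key   = std::uint64_t;
using Value = std::array<std::uint32_t, 64>;
using Flat  = std::vector<std::pair<Key, Value>>;

// Bulk load with plain push_back (no per-node allocation, no rehashing),
// then establish the sorted invariant exactly once.
inline void finalize(Flat& data) {
    std::sort(data.begin(), data.end(),
              [](auto const& a, auto const& b) { return a.first < b.first; });
}

// O(log n) lookup; duplicate keys end up adjacent, like a multimap.
inline auto find_all(Flat const& data, Key k) {
    return std::equal_range(data.begin(), data.end(), std::make_pair(k, Value{}),
                            [](auto const& a, auto const& b) { return a.first < b.first; });
}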

I've had excellent results putting such containers in a Boost Interprocess shared memory segment / memory mapped file.
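
A minimal sketch of that idea, assuming a packed plain-old-data Entry similar to the layout above (the file name "index.bin" and the 1 GiB segment size are arbitrary); loading later is just re-mapping the file, with no deserialization pass:

#include <boost/interprocess/allocators/allocator.hpp>
#include <boost/interprocess/containers/vector.hpp>
#include <boost/interprocess/managed_mapped_file.hpp>
#include <cstdint>

namespace bip = boost::interprocess;

struct Entry {                 // plain-old-data, so it can live in the mapped file
    std::uint64_t key;         // packed key (3-40 bits fit easily)
    std::uint32_t value[64];   // packed value blocks (up to 2048 bits)
};

using EntryAlloc  = bip::allocator<Entry, bip::managed_mapped_file::segment_manager>;
using EntryVector = bip::vector<Entry, EntryAlloc>;

int main() {
    // Create or reopen a file-backed segment (the size is an arbitrary upper bound).
    bip::managed_mapped_file seg(bip::open_or_create, "index.bin", 1ull << 30);

    EntryVector* entries = seg.find_or_construct<EntryVector>("entries")(
        EntryAlloc(seg.get_segment_manager()));

    // ... bulk-insert into *entries, sort once, and look up with std::equal_range
    //     exactly as in the flat-vector sketch above.
    (void)entries;
}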

Benchmarks

I've used the following code to benchmark generating/saving/loading the data.

Note this implements none of the above suggestions, except that it opts out of the hash-table choice. Not having to instantiate an archive each time you insert or look up a key is going to help a lot. Also, bear in mind that hash tables rehash when the load factor is reached; tuning is of the essence for them to really work smoothly.
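
For instance, with a standard unordered container you can set the maximum load factor and reserve buckets before bulk inserting, so the table never rehashes mid-load. Shown here with std::unordered_map; to the extent the bimap's unordered views expose the standard unordered-container interface, the same calls would apply to index.left / index.right:

#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<std::string, int> m;

    m.max_load_factor(0.7f);   // lower load factor: fewer collisions, more buckets
    m.reserve(40'000'000);     // expected element count, so no rehash during bulk insert

    // ... bulk insert; rehashing (and re-hashing every stored key) is avoided
}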

Live On Wandbox

#include <boost/archive/binary_iarchive.hpp>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/bimap.hpp>
#include <boost/bimap/multiset_of.hpp>
#include <boost/dynamic_bitset/serialization.hpp>
#include <fstream>
#include <vector>
#include <random>
#include <chrono>
#include <iostream>

namespace bimaps = boost::bimaps;
using Block = uint32_t;
using Bitset = boost::dynamic_bitset<Block>;

typedef boost::bimap<bimaps::set_of<Bitset>, bimaps::multiset_of<Bitset>> Index;

template <typename Caption, typename F>
auto timed(Caption const& task, F&& f) {
    using namespace std::chrono;
    using namespace std::chrono_literals;
    struct _ {
        high_resolution_clock::time_point s;
        Caption const& task;
        ~_() { std::cout << " -- (" << task << " completed in " << (high_resolution_clock::now() - s) / 1.0s << "s)\n"; }
    } timing { high_resolution_clock::now(), task };

    return f();
}

int main(int argc, char**) {
    using namespace std::string_literals;

    auto gen_bitset = [
        data=std::vector<Block>(64), // max 2048 bits
        prng=std::mt19937{42} // { std::random_device{}() }
    ]() mutable {
        auto length_gen = std::uniform_int_distribution<size_t>(data.size()/2, data.size());
        auto block_gen = std::uniform_int_distribution<Block>{};

        size_t n = length_gen(prng);
        std::generate_n(data.begin(), n, [&]{ return block_gen(prng); });

        return Bitset(data.begin(), data.begin()+n);
    };

    if (argc>1) {
        std::cout << "# Creating ... " << std::endl;
        Index index;

        timed("Generating data set", [&] {
            for (size_t i = 0; i < 52<<19; ++i) {
                index.insert({gen_bitset(), gen_bitset()});
            }
        });

        timed("Writing binary file", [&] {
            std::ofstream ofs("binaryfile", std::ios::binary);
            boost::archive::binary_oarchive oa(ofs);
            oa << index;
        });
        std::cout << "Written " << index.size() << " key/value pairs\n";
    } else {
        std::cout << "# Loading ... " << std::endl;
        Index index;

        timed("Loading binary file", [&] {
            std::ifstream ifs("binaryfile", std::ios::binary); // name of loading file
            boost::archive::binary_iarchive ia(ifs);
            ia >> index;
        });

        std::cout << "Roundtripped " << index.size() << " key/value pairs\n";
    }
}

This creates an 11 GiB file of 27262976 key/value pairs. All keys and values are uniformly random bitsets with lengths distributed uniformly between 1024 and 2048 bits.

 rm binaryfile                                                                    
 time ./sotest 1                                                                  
     -- (Generating data set completed in 228.499s)
     -- (Writing binary file completed in 106.083s)
    Written 27262976 key/value pairs

    real    5m48.362s
    user    5m32.876s
    sys     0m14.704s
 ls -ltrah binaryfile 
    -rw-rw-r-- 1 sehe sehe 11G dec 14 01:16 binaryfile
 time ./sotest
    # Loading binary file ... 
     -- (Loading binary file completed in 135.167s)
    Roundtripped 27262976 key/value pairs

    real    2m19.397s
    user    2m11.624s
    sys     0m7.764s

When reducing the dataset to your 0.25 million entries, I get a file of 106 MiB¹ and the following timings:

 rm binaryfile 
 time ./sotest 1
    # Creating ... 
     -- (Generating data set completed in 1.13267s)
     -- (Writing binary file completed in 0.586325s)
    Written 262144 key/value pairs

    real    0m1.825s
    user    0m1.676s
    sys     0m0.140s
 ls -ltrah binaryfile 
    -rw-rw-r-- 1 sehe sehe 106M dec 14 01:44 binaryfile
 time ./sotest
    # Loading ... 
     -- (Loading binary file completed in 0.776928s)
    Roundtripped 262144 key/value pairs

    real    0m0.823s
    user    0m0.764s
    sys     0m0.056s

¹ this basically tells me your bitsets are much smaller than in my benchmark - I think this could weigh strongly in favour of other data-structure choices

² I noticed the older, non-scaling hash implementation was written by Richard Hodges in an older answer. You see what's happening? You asked for X, giving too little information for people to really understand your problem, so you got the answer to X. But it's not optimal. Your real problem is something else.

StackOverflow may be populated with the very best programmers, but they will not magically see through your X/Y problems, even though they might smell one and try to draw it out. In the end, we're not psychics. There's no substitute for working together with senior mentors/coworkers who can guide you every step of the way.