In continuation of my earlier question about serializing bitsets, I want to avoid rebuilding the bimap repeatedly on the same data: save the bimap once and load it when needed.

I chose `boost::bimap` to store the data (bitsets) in `<key, value>` pairs because it uses hashing and needs only O(1) operations to search. The bimap may hold 40 million bitset entries, and I want to do the following:

1. Insert bitsets into the bimap in the minimum possible time. The answer to my earlier question takes too long (nearly 5 seconds for 0.25 million bitset entries, a five-fold increase compared to the hash function given under 2). For the same reason `unordered_set_of` and `unordered_multiset_of` are used.

2. Have the bimap consume as little memory as possible and avoid copying, unlike the following hash function:

   ```cpp
   namespace std {
       template <typename Block, typename Alloc>
       struct hash<boost::dynamic_bitset<Block, Alloc> > {
           using bitset_type = boost::dynamic_bitset<Block, Alloc>;
           using block_type  = typename bitset_type::block_type;

           size_t operator()(boost::dynamic_bitset<Block, Alloc> const& bs) const {
               thread_local static std::vector<block_type> block_data;
               auto blocks = bs.num_blocks();
               block_data.assign(blocks, 0);
               to_block_range(bs, block_data.begin());
               return boost::hash<std::vector<block_type>>()(block_data);
           }
       };
   }
   ```

3. Search for a key/value in O(1).

4. Load the bimap in a short time. Again, loading the bimap takes a long time (nearly 20 seconds for a bimap of 0.25 million entries, 12 MB on disk).

So I want to achieve goals 1, 2, 3, and 4 for my already-asked question, for which the answer code by @sehe is shown below.
```cpp
#include <boost/archive/binary_iarchive.hpp>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/bimap.hpp>
#include <boost/bimap/unordered_multiset_of.hpp>
#include <boost/bimap/unordered_set_of.hpp>
#include <boost/dynamic_bitset/serialization.hpp>
#include <fstream>
#include <iostream>
#include <string>

#include <boost/iostreams/device/back_inserter.hpp>
#include <boost/iostreams/stream_buffer.hpp>
#include <boost/iostreams/stream.hpp>

#include <boost/functional/hash.hpp>

namespace serial_hashing { // see https://stackoverflow.com/questions/30097385/hash-an-arbitrary-precision-value-boostmultiprecisioncpp-int
    namespace io = boost::iostreams;

    struct hash_sink {
        hash_sink(size_t& seed_ref) : _ptr(&seed_ref) {}

        typedef char char_type;
        typedef io::sink_tag category;

        std::streamsize write(const char* s, std::streamsize n) {
            boost::hash_combine(*_ptr, boost::hash_range(s, s + n));
            return n;
        }
      private:
        size_t* _ptr;
    };

    template <typename T> struct hash_impl {
        size_t operator()(T const& v) const {
            using namespace boost;
            size_t seed = 0;
            {
                iostreams::stream<hash_sink> os(seed);
                archive::binary_oarchive oa(os, archive::no_header | archive::no_codecvt);
                oa << v;
            }
            return seed;
        }
    };
}

namespace std {
    template <typename Block, typename Alloc> struct hash<boost::dynamic_bitset<Block, Alloc> >
        : serial_hashing::hash_impl<boost::dynamic_bitset<Block, Alloc> >
    {};
} // namespace std

namespace bimaps = boost::bimaps;
using Bitset = boost::dynamic_bitset<>;

typedef boost::bimap<
    bimaps::unordered_set_of<Bitset, std::hash<Bitset> >,
    bimaps::unordered_multiset_of<Bitset, std::hash<Bitset> > > Index;

int main() {
    using namespace std::string_literals;

    {
        std::cout << "# Writing binary file ... " << std::endl;
        Index index;
        index.insert({Bitset("10010"s), Bitset("1010110110101010101"s)});

        std::ofstream ofs("binaryfile", std::ios::binary);
        boost::archive::binary_oarchive oa(ofs);
        oa << index;
    }

    {
        std::cout << "# Loading binary file ... " << std::endl;
        std::ifstream ifs("binaryfile", std::ios::binary); // name of loading file
        boost::archive::binary_iarchive ia(ifs);

        Index index;
        ia >> index;
    }
}
```
EDIT

AIM

I have a real-life use case: a large string of, say, 2000 million characters or more, and, say, 40-100 million short strings of length 200 characters or more. My aim is to search for these short strings in the large string. I thought of creating a bimap of bitsets for the large string and then searching for the short strings in that bimap. I also thought of using `unordered` collection types to make insertion and search very fast, unlike `ordered` ones.

- Key bitset length: around 3-40 bits (all combinations present at a time).
- Value bitset length: around 100-2000 bits (only one length at a time; for example, if it is 100, all value entries will be around 90-110 bits or so).
You framed the whole question in terms of unordered maps with bitsets. What is the goal? What real-life problem are you modeling with this design?

Just how big are your bitsets? What is the variance in size? What is the distribution in certain subsets of the bits? You might be far better off with a quick-and-dirty ad-hoc hash that assumes some things, rather than this general-purpose approach (see ¹ and ² below).
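For illustration, such an ad-hoc hash over the raw blocks could look like this (a sketch under the assumption of 64-bit blocks; `fnv1a_blocks` is a hypothetical helper, not part of Boost, and not the hash from the answer above):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical ad-hoc hash: FNV-1a over the raw 64-bit blocks of a bitset,
// avoiding any intermediate copy or serialization round-trip.
inline std::size_t fnv1a_blocks(std::vector<std::uint64_t> const& blocks) {
    std::uint64_t h = 1469598103934665603ull;  // FNV offset basis
    for (std::uint64_t b : blocks) {
        for (int i = 0; i < 8; ++i) {          // mix one byte at a time
            h ^= (b >> (i * 8)) & 0xFFu;
            h *= 1099511628211ull;             // FNV prime
        }
    }
    return static_cast<std::size_t>(h);
}
```

With `boost::dynamic_bitset` you would feed it the blocks obtained via `to_block_range`; for 3-40-bit keys that is a single block, so the whole hash is a handful of multiplications.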
You will want to reduce allocations, load "packed" data into a non-node-based container, control when the elements are sorted (instead of carrying that invariant all the time).
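One way to read that advice is the following sketch (an illustration, not sehe's benchmark code; `PackedIndex` is a hypothetical type, and it assumes keys fit in a `std::uint64_t`, which holds for 3-40-bit keys):

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// "Packed" index: one contiguous, sorted vector instead of a node-based
// unordered map. Sorting happens once, after all insertions, instead of
// maintaining the hash-table invariant on every insert.
struct PackedIndex {
    std::vector<std::pair<std::uint64_t, std::uint64_t>> entries;

    void add(std::uint64_t key, std::uint64_t value) {
        entries.emplace_back(key, value);  // amortized O(1), no rehashing
    }
    void finalize() {                      // sort once, when loading is done
        std::sort(entries.begin(), entries.end());
    }
    // O(log n) lookup by key; returns nullptr if the key is absent
    std::uint64_t const* find(std::uint64_t key) const {
        auto it = std::lower_bound(entries.begin(), entries.end(),
                                   std::make_pair(key, std::uint64_t{0}));
        return (it != entries.end() && it->first == key) ? &it->second : nullptr;
    }
};
```

The lookup is O(log n) rather than O(1), but the contiguous layout avoids per-node allocations and is far friendlier to the cache, which tends to dominate at these sizes.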
I've had excellent results putting such containers in a Boost Interprocess shared memory segment / memory mapped file.
Benchmarks
I've used the following code to benchmark generating/saving/loading the data.
Live On Wandbox
This creates an 11 GiB file of 27,262,976 key/value pairs. All keys and values are uniformly random bitsets with lengths distributed uniformly between 1024 and 2048 bits.
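The data generation can be sketched roughly as follows (a simplified stand-in for the benchmark code, producing raw 64-bit blocks rather than `boost::dynamic_bitset` objects; `random_bitset_blocks` is a hypothetical helper):

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Generate one random bitset as raw 64-bit blocks: length drawn uniformly
// from [1024, 2048] bits, each bit independently random.
std::vector<std::uint64_t> random_bitset_blocks(std::mt19937_64& rng) {
    std::uniform_int_distribution<std::size_t> len(1024, 2048);
    std::size_t bits   = len(rng);
    std::size_t blocks = (bits + 63) / 64;

    std::vector<std::uint64_t> out(blocks);
    for (auto& b : out) b = rng();
    if (bits % 64)  // clear the unused high bits of the last block
        out.back() &= (~0ull >> (64 - bits % 64));
    return out;
}
```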
When reducing the dataset to your 0.25 million entries, I get a file of 106 MiB¹, and the following timings:
¹ this basically tells me your bitsets are much smaller - I think this could weigh strongly in favour of other data-structure choices
² I noticed the older non-scaling hash implementation had been written by Richard Hodges in an older answer. You see what's happening? You ask for X, giving too little information for people to really know your problem, so you get the answer to X. But it's not optimal. Your real problem is something else.
StackOverflow may be populated with the very best programmers, but they will not magically see through your X/Y problems, even though they might smell them and try to draw them out. In the end, we're not psychics. There's no substitute for working together with senior mentors/coworkers who can guide you every step of the way.