How to get the actual bytes of a bitmap in C++ with either vector<bool> or Boost::dynamic_bitset?

153 Views Asked by At

I am developing a compressor that needs to handle bitmaps, and I wonder how to best handle it in C++. Here, a bitmap means an array of boolean values but each value is stored in a bit instead of a byte. For instance, "10010010" has 8 bits and can be stored in a std::vector<bool>. Without optimization, it can be 8 bytes, each storing a boolean value. And I know that vector<bool> will optimize it to take only 1 byte. I hope to keep this behavior when sending it to a compressor or storing it to a file. Note: other containers are also acceptable. I know there is std::bitset but it requires a compiler-time-known size of the bit array, which is not acceptable in my case. There is Boost::dynamic_bitset that can dynamically increase the capacity. I think it is a good option, but I don't know how to use it correctly for the following scenario.

Let's say we have a compress and decompress function declared below. The compress function takes the address of a variable and requires the number of bytes to be sent in by dataLength. It will return a byte array containing the compressed data, and the compressed size is stored in outSize.

char *compress(const char *data, size_t dataLength, size_t &outSize);
char *decompress(const char *data, size_t &compressedSize);

If I hope to compress a vector<bool> bits, I can do

// vector<bool> bits is defined and initialized earlier
int compressed_size;
char* compressed_data = compress(reinterpre_cast<const char*>(&bits),
                                 bits.size(), compressed_size);

However, I am treating this vector<bool> as an actual array of boolean values, taking more storage space. If it has 100 bits, the size() function would return 100, and the compressed size would be larger than 50 bytes if using the ZSTD algorithm. The boolean array can be represented by 13 bytes (if 8 bits consist of 1 byte) and would be much smaller in this way. How should I address this problem?

I tried vector<bool> and expect the size to be automatically optimized. It might have been optimized in memory, but I want it to keep this behavior when writing to files or sending to compressors.

---- Updates (A method I tried and it seemed to work but hacky)-----

I read through the boost::dynamic_bitset<> source code, and it seems the private member m_bits is a vector that actually stores the bit array. Each block is an unsigned integer. I can use the boost::to_block_range() to copy that out. So there is a way to do the compression

// The definition in dynamic_bitset.hpp
template <typename Block, typename Allocator, typename BlockOutputIterator>
inline void
to_block_range(const dynamic_bitset<Block, Allocator>& b,
               BlockOutputIterator result)
{
    // note how this copies *all* bits, including the
    // unused ones in the last block (which are zero)
    std::copy(b.m_bits.begin(), b.m_bits.end(), result);
}

// I can use this function to get the vector
// boost::dynamic_bitset<> bitset is defined earlier
std::vector<boost::dynamic_bitset<>::block_type> filterBlocks(bitset.num_blocks());
// This function copies the internal array out
boost::to_block_range(bitset, filterBlocks.begin());
char *compressed_bitset = zstdCompressor.compress(reinterpret_cast<const char *>(&filterBlocks), bitset.num_blocks() * sizeof(boost::dynamic_bitset<>::block_type), compressed_size);

The above method only uses the public method, and should work properly. The issue is that it copies the whole bit array. The bit array can be extremely huge in my scenario. And for compression, it does not modify the bit array, so I don't want it to be copied.

I think if I can get the m_bits out, I should be fine. I found a way to hack this out. Because to_block_range() is a friend function of dynamic_bitset, I can access m_bits inside this function. So I can do the following.

#include <tuple>
namespace boost {
    template <>
    inline void
    to_block_range(const dynamic_bitset<Block, Allocator>& b,
                   std::tuple<std::vector<boost::dynamic_bitset<>::block_type>&, size_t&>param)
    {
        std::get<0>(param) = b.m_bits;
        std::get<1>(param) = b.num_blocks() * sizeof(boost::dynamic_bitset<>::block_type);
    }
}

std::vector<boost::dynamic_bitset<>::block_type> filterBlocks;
size_t dataLength;
// This line will call my customized function instead of the original one. Since filterBlocks is a reference, there should be no copy.
boost::to_block_range(bitset, std::make_tuple(std::ref(filterBlocks), std::ref(dataLength)));
char *compressed_bitset = zstdCompressor.compress(reinterpret_cast<const char *>(&filterBlocks), dataLength, compressed_size);

This method accessed the private member m_bits in a hacky way. I don't know if it is safe. It should be more efficient if the blocks are continuous inside boost::dynamic_bitset.

1

There are 1 best solutions below

1
On

A boost::dynamic_bitset is made of of "blocks", where each block is some unsigned integral type (which you can specify, if you want). The dynamic_bitset has a method called to_block_range() that will write these blocks to an output iterator.

Each block is some contiguous integer, but the blocks themselves are not necessarily contiguous, so you'll have to call compress once for each block.

If you write a function that takes a reference to a block, you can use boost::make_function_output_iterator() to wrap that function in an output iterator that you can pass to to_block_range().

I'm not including example code because how exactly your "write-this-block" function looks depends on a few things. For instance, if you care about a particular bit ordering, you'll have to deal with the endianness of your integral type. You'll also have to handle the case of what you want to output in a partial block (i.e., your bitset size isn't a multiple of the block size).

Response to your edits:

I think the copy you're trying to avoid is only an issue if you can only call compress() once. In that case, you would need to put the blocks in some contiguous container. It's stored that way in the bitset, but that's not part of the public interface, and I don't think it's a good idea to hack your way in to using private data members.

If your output iterator takes a block by const reference and then calls compress() on it, you should be able to avoid any copies.