Fastest way to populate a boost::dynamic_bitset<> from a vector of strings

494 Views Asked by At

I am implementing a program that uses huffman encoding to compress a file. I am having trouble writing the bits of the compressed string to another bitset. I have a vector of bytes (8 digit integers), and a vector of strings huffCodes, which is of size 256 that stores bit strings for each index. (For example, 0 is represented by 11, 1 is represented by 11011, etc..)

Here is my current method:

string compressed = "";
boost::dynamic_bitset<unsigned char> output;

for(byte b : bytes) 
{
    compressed += huffCodes [ ByteToInt(std::to_string(b)) ];
}

output = boost::dynamic_bitset<unsigned char> (compressed);

This goes through each byte and grabs its corresponding string of bits from the huffCodes vector, then appends that string to the compressed string. Once the compressed string is done, it converts it to a bitset. The problem with this method is that it fills the bitset EXTREMELY slowly, because I have 72 million bytes in my vector. I don't like this method though, because it seems unnecessary to populate this enormous string just to convert it into a bitset. What I would prefer is something like this:

boost::dynamic_bitset<unsigned char> output;
string temp = "";
    for(byte b : bytes) 
    {
        temp = huffCodes [ ByteToInt(std::to_string(b)) ];
        output.append(temp);
    }

Obviously this isn't real code, but ideally I would be populating the output bitset while I collect all of the strings from the huffCodes vector. Is it possible to do this through some sort of concatenation or appending of strings to the bitset?

Note: the contents of the huffCodes vector is strings of up to size 8 consisting of only 1's and 0's

1

There are 1 best solutions below

3
On

Your bottleneck is almost certainly this line:

compressed += huffCodes [ ByteToInt(std::to_string(b)) ];

because the output string (compressed) will be reallocated and copied many times over as you iterate round the loop.

Instead of doing that, try the following. Note that this pre-allocates a vector of the appropriate size to avoid the need to make expensive reallocations and copies. I also don't see the need to convert b to a string and then back to an int so I took that bit out:

std::string s;
int nbytes = 0;
for (b : bytes)
    nbytes += huffcodes [b].size ();

{
    std::vector <char> v (nbytes + 1);
    for (b : bytes)
    {
        auto hc = huffcodes [b];
        for (auto c : hc)
            v.push_back (c);
    }

    v.push_back (0);    // NUL terminator
    s = v.data ();
}

auto output = boost::dynamic_bitset<unsigned char> (s);

As you can see, the conversion to a string is done in a single operation. It's a shame one has to copy so large a data structure but there seems to be no other way.