I was going through a great article on the LZW compression algorithm by Mark Nelson and found something in the code I hadn't encountered before. In the code, he uses an unordered_map to store strings and their corresponding codes. The declaration of the map is:
std::unordered_map<std::string, unsigned int> codes( (max_code * 11)/10 );
max_code holds the maximum code value, and so bounds the number of entries in the map; it defaults to 32767:
void compress( INPUT &input, OUTPUT &output, const unsigned int max_code = 32767 )
{
    //code
}
I don't understand what the unsigned int argument passed to the constructor of codes represents. Also, could someone explain why max_code is multiplied by 11 and then divided by 10?
Here is the compress function for reference:
template<class INPUT, class OUTPUT>
void compress( INPUT &input, OUTPUT &output, const unsigned int max_code = 32767 )
{
    input_symbol_stream<INPUT> in( input );
    output_code_stream<OUTPUT> out( output, max_code );
    std::unordered_map<std::string, unsigned int> codes( (max_code * 11)/10 );
    for ( unsigned int i = 0 ; i < 256 ; i++ )
        codes[std::string(1,i)] = i;
    unsigned int next_code = 257;
    std::string current_string;
    char c;
    while ( in >> c ) {
        current_string = current_string + c;
        if ( codes.find(current_string) == codes.end() ) {
            if ( next_code <= max_code )
                codes[ current_string ] = next_code++;
            current_string.erase(current_string.size()-1);
            out << codes[current_string];
            current_string = c;
        }
    }
    if ( current_string.size() )
        out << codes[current_string];
}
The std::unordered_map template has a number of constructors that accept a size_type (typically std::size_t, an unsigned integer) as their first parameter: the initial bucket count. This value is related to the semantics, or rather the typical implementation, of an unordered map as a hash table. From the link above: it serves as an indicator of how much data you expect the container to hold. This of course will not be exact, because of hash collisions, but it serves as a hint for how much memory the container should allocate up front, similar to what std::vector::reserve() would do.

My guess is that by choosing slightly more buckets than you actually need, here 1.1 times as many, you are more likely to avoid rehashing at the bucket level. 1.1 sounds a bit too conservative to me; I might have gone for 1.5 or higher. But possibly experiments were made to determine what safety factor to use.