How to handle the remaining bits in file decompression

125 Views Asked by At

I am creating a file compression and decompression, and I don't know how to handle the remaining bits when I decompressed.

For example, I have 63 bits.length() and since byte = 8 bits, bits.length() % 8 = 7 there would still 7 bits. Now whenever I decompress the file it has a missing character.

Here is my compression code:

    void compressFile(string inputFile) {
    huffmanTree();
    system("cls");
    cout << "\n\n\t\t\t\tProcessing...";
    Sleep(5000);

    ifstream inputedFile(inputFile); // Open the file in binary mode
    ofstream compressedFile("compressed.huff");

    if (!inputedFile.is_open() || !compressedFile.is_open()) {
        cout << "\t\t\t\tError: Unable to open file for compression." << endl;
        return;
    }

    string bits; // Use a string to accumulate bits for each character
    char ch;
    while (inputedFile.get(ch)) {
        bits += treeCode[(int)ch];
        if(bits.length() >= 8){
            // Process complete groups of 8 bits
            for (int i = 0; i + 8 <= bits.length(); i += 8) {
                compressedFile.put((char)stoi(bits.substr(i, 8), NULL, 2));
            }
            bits = bits.substr(bits.length() - bits.length() % 8);
        }
    }
    if (!bits.empty()) {
        compressedFile.put((char)stoi(bits, NULL, 2));
    }

    system("cls");
    cout << "\t\t\t\t---------------------------------------------" << endl;
    cout << "\n\n\t\t\t\tSuccessful: File has been compressed." << endl;
    cout << "\n\n\t\t\t\tThe file name is compressed.huff." << endl;
    cout << "\t\t\t\t---------------------------------------------" << endl;
    cout << "\t\t\t\t";
    system("pause");
    inputedFile.close();
    compressedFile.close();
}

Here is my decompression code:

    void decompressFile(string compressedFile) {

    system("cls");
    cout << "\n\n\t\t\t\tProcessing...";
    Sleep(5000);

    ifstream compressedFileStream(compressedFile, ios::binary); // Open the file in binary mode
    ofstream decompressedFile("decompressed.txt");

    if (!compressedFileStream.is_open() || !decompressedFile.is_open()) {
        cout << "\n\t\t\t\tError: Unable to open file for decompression." << endl;
        return;
    }

    huffmanTree();

    Node* root = head->node; // Save the root of the Huffman tree
    Node* current = root;    // Initialize the current node

    char byte; // Read bytes for decompression
    while (compressedFileStream.get(byte)) {
        for (int i = 7; i >= 0; i--) {

            // Traverse the tree based on each bit in the byte
            char bit = (byte & (1 << i)) ? '1' : '0';

            if (bit == '0') {
                current = current->left;
            }
            else if (bit == '1') {
                current = current->right;
            }

            if (current->left == NULL && current->right == NULL) {
                decompressedFile << current->character;
                cout << "decompressed" <<current->character;
                current = root; // Reset current to the root for the next character
            }
        }
    }
    system("pause");
    system("cls");
    cout << "\t\t\t\t---------------------------------------------" << endl;
    cout << "\n\n\t\t\t\tSuccessful: File has been decompressed." << endl;
    cout << "\n\n\t\t\t\tThe file name is decompressed.txt." << endl;
    cout << "\t\t\t\t---------------------------------------------" << endl;
    cout << "\t\t\t\t";
    system("pause");
    compressedFileStream.close();
    decompressedFile.close();
} 

What should I do to decompress my compressed file without having a missing character.

2

There are 2 best solutions below

2
Atmo On

Your code currently does:

  1. Read a byte (it gets you a value to uncompress + advances the cursor in the file).
  2. Process the bits 1 by 1, which starts with checking you did not reach the end of the file.

I see some issues with it:

  • Putting the if (compressedFileStream.eof()) in the for loop looks like you expect a file to end in the middle of a byte. That does not happen.
    Moreover, once you have done the test, it only makes sense to do it again after reading from the file: if you do not read anything, the value returned by it will not change.
  • Like I said above, the first thing you do after reading a valid byte is test whether you have reached the end of the file. On the last byte, you exit your loop before processing anything.
    If you managed to read a byte, you know it is valid; hence your test would make much more sense to do right before reading the next byte.

With that in mind, it seems all you have to do is move the eof right before reading from the file:

Node* root = head->node; // Save the root of the Huffman tree
Node* current = root;    // Initialize the current node

char byte; // Read bytes for decompression
while (!compressedFileStream.eof()) {
    compressedFileStream.get(byte);
    for (int i = 7; i >= 0; i--) {
        if (!(byte >> i) & 1) {
            current = current->left;
        }
        else {
            current = current->right;
        }

        if (!current->left && !current->right) {
            decompressedFile << current->character;
            current = root; // Reset current to the root for the next character
        }
    }
} 
// Close the file / free resources here.

PS:

  • Depending on what type compressedFileStream was (you did not mention it in your question and I did not want to make any assumption there), it may be possible to just test the value returned by get to know when you are trying to read after the file end.
  • While I was at it, I corrected the part where you do char bit = (byte & (1 << i)) ? '1' : '0'
  • NULL is not C++. Use nullptr in the future (or test the pointers the way I did).

Note: during compression, you must ensure the last bits of the last byte cannot be interpreted as a character.

  • If you choose to fill the last byte with zeroes, then you must make sure no character is represented by a sequence of 7 or less consecutive zeroes.
  • The opposite approach is that you calculate your compression normally and as you are going, detect a sequence that can for sure fill the last byte without being interpreted as a character. For instance:
    • If a slightly rare character is represented by 8 bits or more, then the first 7 bits to decode it won't represent a character.
    • If you cannot find such a case, then you can create it: pick the rarest character and add 0 to its sequence during compression. Then the sequence that has the same start as it, but with 1 instead of 0, then a series of 0 to be enough to fill a byte will do.
  • You could also save how many characters your initial file has, making it trivial to know when you reach the end (that is what I would probably do, at least for v1 of the procedure).
4
anatolyg On

The compressor has to "flush" its output bitstream somehow. That is, if the number of bits in the last compressed byte is less than 8, it still has to output the byte. Fill the remaining bits with any value — zeros, ones or garbage.

This leads to the next problem — how will the decompressor understand where to stop? If it decompresses extra bits after the intended end, the decompressed output will contain garbage at the end.

  1. The simplest way is to pass the "real" length of the bitstream (number of bits) from the compressor to the decompressor, separately from the bitstream. This is inconvenient.

  2. The compressor can have a dedicated token "end of message"; it should encode it at the end, and fill the unused bits in last byte with garbage (doesn't matter with what).

  3. The compressor can put a dedicated "flush" bit-sequence at the end of bitstream. The best flush is "100..." — that is, put a bit equal to 1, and then, put bits equal to 0 until the byte is full.

The latter way is the cleanest, but most difficult to implement in decompressor. It should read 1 byte ahead and use the end-of-file indication for that byte to find the end of stream in current byte.

    Node* root = head->node; // Save the root of the Huffman tree
    Node* current = root;    // Initialize the current node

    char byte; // Read bytes for decompression
    char next_byte;
    compressedFileStream.get(byte);
    while (true) {
        compressedFileStream.get(next_byte);
        int num_bits = 8;
        if (compressedFileStream.eof()) {
            // current byte is the last one!
            // find the last bit equal to 1 in it.
            // All bits preceding it are data bits.
            if (byte == 0)
            {
                fprintf(stderr, "Error!\n");
            }
            for (int i = 0; i < 8; i++)
            {
                if (byte & (1 << i))
                {
                    num_bits = 7 - i;
                    break;
                }
            }
        }
        for (int i = 7; i >= 8 - num_bits; i--) {
            // Traverse the tree based on each bit in the byte
            char bit = (byte & (1 << i)) ? '1' : '0';
            ...
        }
        if (num_bits < 8)
            break;
        byte = next_byte;
    } 

The code above is untested; I hope it's not buggy. Bit-level alignment using this idea requires precise coding, but once you do it right, the problem is solved.