Java BitSet.size() is not returning the size I gave at the constructor

110 Views Asked by At

I have the following code:

 private BitSet arrayListToBitSet(ArrayList<Boolean> code) {
                // The problem is here, the bitset automatically is created using 64 bits even tho we are clearly 
                    // giving it a max size in constructor, and in the test we are running,
                    // we are only using 9 bits, leading to the rest of the bits being automatically 0, so we need to
                    // search for a way that the size of the bitset is exactly the size of the array list "code"
                int x = code.size(); // With the test I'm running, x = 9;
                BitSet bitset = new BitSet(x);
                System.out.println(bitset.size());
        
                for (int i = 0; i < x; i++)
                {
                    bitset.set(i, code.get(i));
                }
        
                return bitset;
        }

The System.out.println() is showing 64 on console, even tho it should show only 9.

And even when I hard coded 9 into the constructor of BitSet, it still returned that the size of the bitset is 64. As I am making a program that compress .txt files, is is crucial for me to use BitSets intead of booleans, or if anyone knows another way to manipulate single bits in java, it can also be helpful.

In the documentation of BitSet.size() just says the following

public int size()
Returns the number of bits of space actually in use by this BitSet to represent bit values. The maximum element in the set is the size - 1st element.
Returns:
the number of bits currently in this bit set 
  • Docs.Oracle

And this really does not help much to my problem. Anyone knows the reason this happens? and how to fix it? Thanks a lot.

More context: What I am doing is an implementation of the huffman algorithm to compress .txt files given, to a new class called "CompressedFile" that is going to be stored in the pc memory.

What happens before the above method is that we count the chars in the text and save them to a hashmap, with this hashmap we then create the Huffman tree, once we have the Huffman tree, we use it for getting the huffman enconding for each one of the chars in the text, each of the code for a given char are saved in another HashMap with this value pairs:

(char, ArrayList), the code is an ArrayList of booleans for now because of the problem with the BitSets.

Once we have this, we give this HashMap to another function that creates the "encodedText", this is an ArrayList that stores the encoded version of the tex, we do this by going by every char in the .txt, and adding the encoded value of that char to the "encodedText" variable (recall that the encoded value of a char is saved also as an ArrayList in a HashMap).

Once we have the complete Huffman code, is where this arrayListToBitSet method comes into play, it converts the encodedText in ArrayList version, to a BitSet version, that is the one that is going to be stored in the CompressedText class.

The .txt file we are using for test purposes is just this: "aaabbc"

After the getting the Huffman tree and getting each character encoded version, the value pairs look like this (im gonna use 0 and 1, as to not write True and False):

(a, 0) (b, 11) (c, 10)

Therefore the encodedVersion of the text will look like this: 000111110

when this ArrayList is converted to BitSet, the output (I print the Bitset to console) looks like this: 0001111100000000000... (another 42 ceros) ... 00

So the first 9 bits are correct, but the following 53 bits shouldn't be there, and are cero. If I were to leave it like this this is what the decoded text would look like:

aaabbcaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...

very different from the original. One way to fix this would be to include the correct size of the encoded message in the CompressFile class, so when I decompress it, I only check the exact number of bits, but I want to know if there is a way to not have to save this integer in the class (this would save me 8 bytes, not much in the grand scheme of things, coming to be 8 chars, but the purpose is to use as less memory as possible). Still if there is no other way, I will put that original size into the CompressedFile class.

Thanks in advance for the help :)

4

There are 4 best solutions below

5
Sweeper On BEST ANSWER

Notice what the constructor's documentation says:

Creates a bit set whose initial size is large enough to explicitly represent bits with indices in the range 0 through nbits-1.

The phrase "large enough" is key here. This constructor is not supposed to create a bit set that is exactly the size you specified - just large enough to store the number of bits you specified.

Also note this paragraph in the class documentation:

Every bit set has a current size, which is the number of bits of space currently in use by the bit set. Note that the size is related to the implementation of a bit set, so it may change with implementation.

BitSet, at least in OpenJDK, is implemented by using a long[]. This explains why you see 64. The least it can do to create a bit set that is "large enough" to store 9 bits, is to allocate a long[] with length 1. A single long is 64 bits.

I don't think it is even possible in the JVM to implement a bit set that can store exactly 9 bits. Even if you use a boolean[] to back it, each boolean is still one byte (as you may know already).

so having extra ceros might lead to adding further characters in the decompressed version of the file

If that is the case, you can check the length of the BitSet. It gives you how many bits there are in the bit set, minus the leading zero bits.

0
Laughcheeta1 On

As I hinted at in the context of the question, I decided to have an extra int variable that keeps track of the amount of bits that I am using, since there is no other way that does not affect the Huffman algorithm to solve this.

Note that this is a solution that only applies for this very specific case, It is not a way to solve the problem of the BitSet.size() method not returning the actual size of the array. Since there is no work aroung this, please take your time to read the comments on the post, specially since they give awesome explanations of how BitSets work and may help you in your case.

0
harold On

One way to fix this would be to include the correct size of the encoded message in the CompressFile class, so when I decompress it, I only check the exact number of bits

You will need at least one of:

  • The number of characters in the uncompressed file, which is nice to have anyway.
  • An "end" symbol, that does not correspond to any character but signals to stop decoding.
  • The exact number of encoded bits, but it's more annoying to work with.

The number of encoded bits may not exactly fill a number of bytes (nor anything bigger like longs), so there can be some extra bits at the end that do not code for anything - but they may accidentally look like valid Huffman codes. So you need something to tell you where to stop.

A BitSet doesn't remember its "exact length", and its length method ignores leading zeroes even if they are part of the data, but a BitSet plus an extra integer (long if you want) can work. Or as mentioned by Johannes Kuhn, you may prepend an extra set bit that isn't part of the actual data, that way length would work (remember to ignore the extra bit during decompression, of course).


Once we have this, we give this HashMap to another function that creates the "encodedText", this is an ArrayList that stores the encoded version of the tex, we do this by going by every char in the .txt, and adding the encoded value of that char to the "encodedText" variable (recall that the encoded value of a char is saved also as an ArrayList in a HashMap).

You should avoid creating at least the large ArrayList that represents the entire compressed data, since it will be much bigger than the actual data that it represents (it uses one Boolean per bit after all).

If you stuff the bits into bytes/integers/a BitSet while converting the characters to their encoded values, there is no "temporary expansion" anymore.

Putting ArrayList<Boolean> and BitSet aside for the moment, if you represented your codes as pairs of integers, using one integer for the length of the code and another integer to store the bits of the code, you can efficiently encode your data without doing any bit-by-bit iteration. This limits the lengths of the codes to at most 32 (or 64 if you use long), but efficient table-based decoding requires a limit too and very long codes are not required for good compression anyway, a limit of 32 is already way into the diminishing returns, much smaller limits are common in practice (eg 16) and you'd get more out of switching to ANS as several modern compression formats have.

0
Reilas On

To quickly answer your first inquiry, in terms of how the BitSet is creating a set larger than your requested amount, this is because there is no definitive way to represent 1 bit of memory.
At this rate, even a boolean is occupying 1 byte of space, despite only needing 1 bit, arguably.
So, it appears BitSet is utilizing a long to store the values.

If this is too deviated from the required logic you're needing, you can attempt to create your own class, and implement the procedures as needed.

It depends on the how you intend to utilize the bits.

You mention the following.

I am making a program that compress .txt files, [it] is crucial for me to use BitSets intead of booleans, if anyone knows another way to manipulate single bits in java, it can also be helpful.

BitSet is essentially the class you're looking for.