I'm taking an algorithms class where we have to implement LZW compression in Java. I decided to use a Trie data structure for this, and I've already implemented the Trie and got it working, but is very slow, and it barely compresses.
We are supposed to use 8-bit symbols and be able to compress any binary file.
Given a ~4MB file (bible.txt), I get around 549,012 elements in my codes array. When I write these elements to a file (one integer code per line), I end up with a "compressed" file of 3.5MB, so I get .5MB of compression.
How can I make this program more efficient? I feel like I misunderstood something fundamental here, or I'm missing something obvious, but I'm out of ideas on why this doesn't compress.
(I got my test file bible.txt from this website: https://corpus.canterbury.ac.nz/descriptions/)
I read bytes from a binary file like this (reading as int and converting to char is necessary so that values above 0x80 are not negative):
public String readFile(String path) throws IOException, FileNotFoundException {
File file = new File(path);
StringBuilder string = new StringBuilder();
try (FileInputStream fileInputStream = new FileInputStream(file)) {
int singleCharInt;
char singleChar;
while((singleCharInt = fileInputStream.read()) != -1) {
singleChar = (char) singleCharInt;
string.append(singleChar);
}
}
return string.toString();
}
My main method looks like this:
public static void main(String args[]) throws FileNotFoundException, IOException {
String bytes = new FileReader().readFile("/home/user/Code/Trie/bible.txt");
ArrayList<Integer> codes = new Compress().compress(bytes);
}
My Compress class looks like this:
public class Compress {
private int code = 0;
public ArrayList<Integer> compress(String data) {
Trie trie = new Trie();
// Initialize Trie Data Structure with alphabet (256 possible values with 8-bit
// symbols)
for (code = 0; code <= 255; code++) {
trie.insert(Character.toString((char) code), code);
}
code++;
String s = Character.toString(data.charAt(0));
ArrayList<Integer> codes = new ArrayList<Integer>();
for (int i = 1; i < data.length(); i++) {
String c = Character.toString(data.charAt(i));
if (trie.find(s + c) > 0) {
s += c;
} else {
codes.add(trie.find(s));
trie.insert(s + c, code);
code++;
s = c;
}
}
codes.add(trie.find(s));
return codes;
}
}
My Trie class looks like this:
public class Trie {
private TrieNode root;
public Trie() {
this.root = new TrieNode(false);
}
public void insert (String word, int code) {
TrieNode current = root;
for (char l: word.toCharArray()) {
current = current.getChildren().computeIfAbsent(Character.toString(l), c -> new TrieNode(false));
}
current.setCode(code);
current.setWordEnd(true);
}
public int find(String word) {
TrieNode current = root;
for (int i = 0 ; i < word.length(); i++) {
char ch = word.charAt(i);
TrieNode node = current.getChildren().get(Character.toString(ch));
if (node == null) {
return -1;
}
current = node;
}
return current.getCode();
}
}
My TrieNode class looks like this:
public class TrieNode {
private HashMap<String, TrieNode> children;
private int code;
private boolean wordEnd;
public TrieNode(boolean wordEnd) {
this.children = new HashMap<String, TrieNode>();
this.wordEnd = wordEnd;
}
public HashMap<String, TrieNode> getChildren() {
return this.children;
}
public void setWordEnd(boolean wordEnd) {
this.wordEnd = wordEnd;
}
public boolean isWordEnd() {
return this.wordEnd;
}
public int getCode() {
return this.code;
}
public void setCode(int code) {
this.code = code;
}
}
Thank you for your time!
What does this mean: "When I write these elements to a file (one integer code per line)"? You're writing four bytes to the file for each code? You're writing four bytes and new line? You're writing a number in decimal and a new line?
In any case, all of those are wrong. You need to store the codes as bits. In the usual LZW implementation, the number of bits in a code starts at 9, and then increments as more codes are created. Further into a file, a code might be, for example, 12 or 13 bits. The decoder knows from the data when the encoder incremented, so it always knows how many bits to get for the next code. Every once in a while it makes sense to reset back to 9 bits, which is signaled by the encoder to the decoder.
So how do you read and write bits to a file? You will quickly find that there are no functions for that. You need to write them yourself.
In short, you keep a buffer of bits in an integer, using the bit shift and or operations to add bits to the buffer, keeping track of how many bits are in the buffer in another integer. For encoding, after you add bits to the buffer, you see if there are at least 8 bits in there. If so, write out a byte to the file, and remove the 8 bits from the buffer. Repeat until there are less than 8 bits in the buffer.
Some care must be taken at the end to write out the last few bits into one more byte, making sure you have thought through how the decoder will know when to stop decoding bits.
Same thing on the decoder side, reading bytes from the input file and adding 8 bits at a time to your buffer until you have enough bits to pull for the next code.