I'm thinking of using a Huffman code to compress text, but with symbols of variable length (strings). For example (using an underscore as a space):
huffman-code | symbol
------------------------------------
00 | _
01 | E
100 | THE
101 | A
1100 | UP
1101 | DOWN
11100 | .
11101 |
1111...
(etc...)
How can I construct the frequency table? Obviously there are some overlapping issues, the sequence _TH
would appear neary as often as THE
, but would be useless in the table (both _
and THE
have short huffman code).
Does such an algorithm exists? Does it have a special name? What would be the tricks to generate the frequency table? Do I need to tokenize the input? I did not found anything in the litterature / web. (All this make me think also of radix trees).
I was thinking of using an iterative process:
- Generate an huffman tree for all symbols of length 1 to N
- Remove from the tree all symbols with N>1 and below a certain count threshold
- Regenerate a second huffman tree, but this time tokenizing the input with the previous one (probably using a radix tree for lookup)
- Repeat to 1 until we converge (or for a few times)
But I can't figure out how can I prevent the problem of overlaps (_TH
vs THE
) with this.
As long as you tokenize the text properly you don't have to worry about the overlap problem. You can define each token to be a word (longest continuous stream of characters), punctuation symbol or a whitespace character (' ', '\t', \n'). Thus by definition the tokens/symbols do not overlap.
But using Huffman coding directly isn't ideal for compressing text since it cannot make use of the dependencies between the symbols. For e.g. 'q' is likely followed by 'u', 'qu' is likely followed by a vowel, 'thank' is likely followed by 'you' and so on. You may want to look into a high order encoder like 'LZ' which can exploit this redundancy, by converting the data into a sequence of lookup addresses, copy lengths, and deviating symbols. Here's an example of how LZ works. You can then apply Huffman coding on each of the three streams to further compress the data. DEFLATE algorithm works exactly this way.