Huffman encoding with variable length symbols


I'm thinking of using a Huffman code to compress text, but with symbols of variable length (strings). For example (using an underscore as a space):

huffman-code | symbol
------------------------------------
00           | _
01           | E
100          | THE
101          | A
1100         | UP
1101         | DOWN
11100        | .
11101        |
1111...
(etc...)

How can I construct the frequency table? Obviously there are overlapping issues: the sequence _TH would appear nearly as often as THE, but would be useless in the table (both _ and THE already have short Huffman codes).

Does such an algorithm exist? Does it have a special name? What are the tricks to generate the frequency table? Do I need to tokenize the input? I could not find anything in the literature / on the web. (All this also makes me think of radix trees.)

I was thinking of using an iterative process:

  1. Generate a Huffman tree for all symbols of length 1 to N
  2. Remove from the tree all symbols of length > 1 whose count is below a certain threshold
  3. Regenerate a second Huffman tree, but this time tokenizing the input with the previous one (probably using a radix tree for lookup)
  4. Repeat from step 1 until we converge (or for a fixed number of iterations)

But I can't figure out how to prevent the problem of overlaps (_TH vs. THE) with this approach.
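A sketch of this loop in Python (the function names, thresholds, and the greedy longest-match tokenizer are my own choices, not part of the question; `huffman_lengths` is a standard heap-based Huffman build). The key observation is that after re-tokenizing, the counts reflect which symbol the tokenizer actually consumed, so overlaps like _TH vs. THE resolve themselves over the iterations:

```python
from collections import Counter
import heapq

def huffman_lengths(freqs):
    """Code length per symbol from a standard heap-based Huffman build."""
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    nxt = len(heap)
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (fa + fb, nxt, merged))
        nxt += 1
    return heap[0][2]

def tokenize_longest_match(text, symbols):
    """Greedy longest-match segmentation against the current symbol set."""
    out, i = [], 0
    maxlen = max(map(len, symbols))
    while i < len(text):
        for L in range(min(maxlen, len(text) - i), 0, -1):
            if text[i:i + L] in symbols:
                out.append(text[i:i + L])
                i += L
                break
        else:  # no symbol matched: fall back to the single character
            out.append(text[i])
            i += 1
    return out

def iterate(text, max_len=3, min_count=2, rounds=3):
    # Step 1: count all substrings of length 1..max_len.
    counts = Counter(text[i:i + L]
                     for L in range(1, max_len + 1)
                     for i in range(len(text) - L + 1))
    for _ in range(rounds):
        # Step 2: drop rare multi-character symbols (keep single chars).
        symbols = {s for s, c in counts.items()
                   if len(s) == 1 or c >= min_count}
        # Step 3: re-tokenize with the surviving symbols and recount, so
        # the table only counts symbols the tokenizer actually consumed.
        counts = Counter(tokenize_longest_match(text, symbols))
    # Step 4 (after the loop): build the final Huffman code.
    return huffman_lengths(counts)
```

This is only a sketch: greedy longest match is not an optimal segmentation, but recounting after each tokenization pass keeps the frequency table consistent with the segmentation that will actually be used.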

There are 2 best solutions below


As long as you tokenize the text properly you don't have to worry about the overlap problem. You can define each token to be a word (a longest continuous run of letters), a punctuation symbol, or a whitespace character (' ', '\t', '\n'). By definition, the tokens/symbols then never overlap.
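A minimal sketch of such a tokenizer (the regex and the exact token classes are one possible choice, not prescribed by the answer):

```python
import re
from collections import Counter

def tokenize(text):
    # A token is a maximal run of letters (a word), a single whitespace
    # character, or a single punctuation character -- so tokens never overlap.
    return re.findall(r"[A-Za-z]+|\s|[^\w\s]", text)

# Each token becomes one Huffman symbol; 'THE' and ' ' are separate
# symbols, so the '_TH' vs 'THE' ambiguity never arises.
counts = Counter(tokenize("THE CAT SAT ON THE MAT."))
```

The resulting `Counter` is exactly the frequency table the question asks about.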

But using Huffman coding directly isn't ideal for compressing text, since it cannot exploit the dependencies between symbols: 'q' is likely followed by 'u', 'qu' is likely followed by a vowel, 'thank' is likely followed by 'you', and so on. You may want to look into a higher-order encoder like LZ, which exploits this redundancy by converting the data into a sequence of lookup addresses, copy lengths, and deviating symbols. You can then apply Huffman coding to each of the three streams to compress the data further. The DEFLATE algorithm works exactly this way.
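A toy illustration of the LZ idea (this is not DEFLATE's actual match finder; the window size, minimum match length, and triple layout are arbitrary choices for the sketch):

```python
def lz77(data, window=255, min_len=3):
    """Toy LZ77 parse: emits (distance, length, literal) triples.
    distance == length == 0 marks a plain literal."""
    out, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Search the sliding window for the longest earlier match.
        for j in range(max(0, i - window), i):
            L = 0
            while i + L < len(data) and data[j + L] == data[i + L]:
                L += 1
            if L > best_len:
                best_len, best_dist = L, i - j
        if best_len >= min_len:
            nxt = data[i + best_len] if i + best_len < len(data) else ""
            out.append((best_dist, best_len, nxt))
            i += best_len + 1
        else:
            out.append((0, 0, data[i]))
            i += 1
    return out
```

Note how `"ABABABAB"` collapses to two literals plus one back-reference; in DEFLATE, the distances, lengths, and literals would then each be Huffman-coded.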


This is not a complete solution.

Since you have to store both the sequence and the lookup table, maybe you can greedily pick symbols that minimize the storage cost.

Step 1: Store all the symbols of length at most k in a trie and keep track of their counts

Step 2: For each candidate symbol, calculate the space saved (or compression ratio).

Encode_length(symbol) = log2(N) - log2(count(symbol))

Space_saved(symbol) = length(symbol)*count(symbol) - Encode_length(symbol)*count(symbol) - (length(symbol)+Encode_length(symbol))

N is the total frequency of all symbols (which we don't know yet; it may have to be approximated).

Step 3: Select the optimal symbol and subtract the frequencies of the other symbols that overlap with it.

Step 4: If the whole sequence is not yet encoded, pick the next optimal symbol (i.e. go to step 2)
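The two formulas above translate directly into code (this follows the answer's formula as written, treating one character as one storage unit and the ideal code length as -log2 of the symbol's probability):

```python
from math import log2

def encode_length(symbol, counts, N):
    # Ideal code length in bits: log2(N) - log2(count) = -log2(probability).
    return log2(N) - log2(counts[symbol])

def space_saved(symbol, counts, N):
    c = counts[symbol]
    e = encode_length(symbol, counts, N)
    # c occurrences shrink from len(symbol) units to e bits each, minus the
    # one-off table cost of storing the symbol string and its code.
    return len(symbol) * c - e * c - (len(symbol) + e)

# Greedy step 3: pick the candidate with the largest saving.
def best_symbol(counts, N):
    return max(counts, key=lambda s: space_saved(s, counts, N))
```

For example, with counts = {"THE": 8} and N = 16, the code length is 1 bit and the saving is 3*8 - 1*8 - (3+1) = 12 units.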

NOTE: This is just an outline; it is neither complete nor computationally efficient. If you are looking for a practical, quick solution you should use krjampani's solution. This answer is purely academic.