Let's assume that we have sequences consisting of symbols from an alphabet of 50 symbols. So, every symbol can be encoded by 6 bits (2^6 = 64 > 50). It means that every given sequence of symbols can be represented as a sequence of 0s and 1s.
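As a baseline, this fixed-width representation can be sketched in a few lines of Python (the 50-symbol alphabet here is an invented example, not one from the question):

```python
# Fixed-width baseline: with a 50-symbol alphabet, each symbol fits in
# 6 bits, since 2**6 = 64 >= 50. The alphabet below is a made-up example.
ALPHABET = [chr(ord('A') + i) for i in range(26)] + [str(i) for i in range(24)]

def fixed_encode(seq):
    # Each symbol becomes its alphabet index, zero-padded to 6 bits.
    return ''.join(format(ALPHABET.index(s), '06b') for s in seq)

def fixed_decode(bits):
    # Read the bit string back in fixed 6-bit chunks.
    return [ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6)]
```

This costs exactly 6 bits per symbol no matter how predictable the sequence is, which is the cost the predictive model should let us beat.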
Now, let's assume that the symbols in our sequences are not totally random, so they are to some extent predictable. In more detail: given the first N symbols of a sequence, we can estimate how probable each symbol is as the next one. For example, we can say that A is expected to come with probability 0.01, B is expected to come with probability 0.3, and so on.
I believe that such a predictive model can be used to compress data, and my question is how exactly it should be done. In other words, what is the best way to use the predictive model to compress the data?
I thought in the following direction. At a given stage we have estimated probabilities for all the symbols, so the symbols can be ordered from the most probable to the least probable. Then the first symbol is encoded by 0, the second by 1, the third by 00, and so on. So, the encodings are:
[0, 1, 00, 01, 10, 11, 000, 001, ..., 111110, 111111]
In this way the more probable symbols usually get encodings with a small number of bits. However, this code is not uniquely decodable: the codewords have to be separated by commas. For example, the original sequence of symbols could be represented as:
[0, 00, 1, 10, 0, 0, 1, 0110, ...]
And commas are not part of the binary alphabet.
I also thought about the following encoding of the symbols in the list ordered by probabilities:
[0, 10, 110, 1110, 11110, 111110, ....]
Then 0 is used as a separator (instead of commas), and the number of 1s represents the position of the symbol in the list. But again, I am not sure that this is the most efficient way to use bits, or the best way to utilise the predictive model.
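For concreteness, this second scheme can be sketched as follows; `probs` is a made-up stand-in for one step of the predictive model (in general the model would give a fresh distribution at every position):

```python
# The question's second scheme: sort symbols by estimated probability,
# then encode rank k as k ones followed by a terminating zero (a unary
# code over ranks). `probs` is a fixed example distribution.

def unary_encode(seq, probs):
    order = sorted(probs, key=probs.get, reverse=True)  # most probable first
    rank = {s: i for i, s in enumerate(order)}
    return ''.join('1' * rank[s] + '0' for s in seq)

def unary_decode(bits, probs):
    order = sorted(probs, key=probs.get, reverse=True)
    out, count = [], 0
    for b in bits:
        if b == '1':
            count += 1          # still reading the rank
        else:
            out.append(order[count])  # 0 terminates the codeword
            count = 0
    return out
```

Note that this code assigns k+1 bits to the symbol of rank k, so it is only near-optimal when the k-th most probable symbol actually has probability close to 2^-(k+1); for other distributions it wastes bits.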
Yes, such a predictive model can be used to compress data, so long as the model has a good shot at predicting the next symbol. This is exactly the sort of probabilistic model that arithmetic coding was conceived for.
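To illustrate how an arithmetic coder turns per-symbol probabilities into bits, here is a simplified sketch. It uses exact `Fraction` arithmetic and a single fixed example distribution for clarity; a real coder uses fixed-width integer ranges with renormalization, and would query the predictive model for a fresh distribution before every symbol.

```python
# Minimal arithmetic-coding sketch (illustrative, not production code).
# Exact fractions keep the logic simple; practical coders instead use
# fixed-width integer arithmetic with renormalization.
from fractions import Fraction

def intervals(probs):
    """Map each symbol to its slice [a, b) of the unit interval."""
    out, low = {}, Fraction(0)
    for sym, p in probs.items():
        out[sym] = (low, low + p)
        low += p
    return out

def encode(seq, probs):
    low, high = Fraction(0), Fraction(1)
    iv = intervals(probs)
    for sym in seq:
        # Narrow the current interval to this symbol's slice of it.
        a, b = iv[sym]
        low, high = low + (high - low) * a, low + (high - low) * b
    # Emit just enough bits to pin down a binary fraction inside [low, high).
    mid = (low + high) / 2
    bits, x, step = [], Fraction(0), Fraction(1)
    while not (low <= x and x + step <= high):
        step /= 2
        if x + step <= mid:
            x += step
            bits.append(1)
        else:
            bits.append(0)
    return bits

def decode(bits, probs, n):
    # Reconstruct the binary fraction the bits represent.
    x, step = Fraction(0), Fraction(1, 2)
    for b in bits:
        if b:
            x += step
        step /= 2
    # Replay the interval narrowing, picking the slice containing x.
    iv = intervals(probs)
    out, low, high = [], Fraction(0), Fraction(1)
    for _ in range(n):
        for sym, (a, b) in iv.items():
            lo = low + (high - low) * a
            hi = low + (high - low) * b
            if lo <= x < hi:
                out.append(sym)
                low, high = lo, hi
                break
    return out
```

Each symbol shrinks the interval in proportion to its probability, so a symbol of probability p costs about -log2(p) bits, without any separators and without rounding every symbol up to a whole number of bits. That is exactly where an accurate predictive model pays off: the better the predictions, the larger the probability assigned to the symbol that actually occurs, and the fewer bits it costs.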