Analyzing a string before Burrows-Wheeler transform?

214 Views Asked by At

If we consider this aaabccba as our input string, baaacacb would be the output string after applying Burrows-Wheeler transform on the input. observing the output you'll see that the two clumped c are separated. It's clear, the input string will result in a better compression than the output.

How to decide whether or not to apply Burrows-Wheeler transform on an input string ? Can we do some sort of fast analysis to make a decision ?

2

There are 2 best solutions below

0
On

Just try to compress it with something much faster than BWT, e.g. lz4, and see how much it compresses. You can then through experiment set a threshold on that ratio above which to apply BWT, based on whatever criteria you derive for your application.

0
On

The simplest solution would be to actually compress each string, and see which results in the smallest compression.

If you don't want to do that, you could count the lengths of each group:

aaabccba -> aaa b cc b a

    aaa has length 3
    b has length 1
    cc has length 2
    b has length 1
    a has length 1

    there where 3 groups of length 1
    there where 1 group of length 2
    there where 1 group of length 3
                ^

    -> [3, 1, 1]
baaacacb -> b aaa c a c b

    b has length 1
    aaa has length 3
    c has length 1
    a has length 1
    c has length 1
    b has length 1

    there where 5 groups of length 1
    there where 0 groups of length 2
    there where 1 group of length 3
                ^

    -> [5, 0, 1]
  • Compare the lists lexicographically: 3 < 5 so [3, 1, 1] < [5, 0, 1] — Pick the smallest one.

  • Alternatively, you could reverse the lists: [1, 1, 3] > [1, 0, 5] — Pick the largest one.

  • Another way to compare them, would be the total count: 3+1+1=5 < 5+0+1=6. — Pick the one with smaller sum.