If we consider this aaabccba as our input string, baaacacb would be the output string after applying Burrows-Wheeler transform on the input. observing the output you'll see that the two clumped c are separated. It's clear, the input string will result in a better compression than the output.
How to decide whether or not to apply Burrows-Wheeler transform on an input string ? Can we do some sort of fast analysis to make a decision ?
The simplest solution would be to actually compress each string, and see which results in the smallest compression.
If you don't want to do that, you could count the lengths of each group:
Compare the lists lexicographically:
3 < 5so[3, 1, 1] < [5, 0, 1]— Pick the smallest one.Alternatively, you could reverse the lists:
[1, 1, 3] > [1, 0, 5]— Pick the largest one.Another way to compare them, would be the total count:
3+1+1=5 < 5+0+1=6. — Pick the one with smaller sum.