If we take aaabccba as our input string, baaacacb is the output string after applying the Burrows-Wheeler transform to it. Looking at the output, you'll see that the two adjacent c characters, which were clumped together in the input, are now separated. Clearly, the input string will compress better than the output.
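The example can be checked with a minimal sketch of the transform. It assumes the plain textbook variant (sort all rotations, take the last column, no end-of-string sentinel); the function name `bwt` is only illustrative, and this quadratic-memory version is for small inputs, not production use.

```python
def bwt(s: str) -> str:
    """Naive Burrows-Wheeler transform without a sentinel character."""
    # Build every rotation of s and sort them lexicographically.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last character of each sorted rotation.
    return "".join(rot[-1] for rot in rotations)


print(bwt("aaabccba"))  # baaacacb
```

Real implementations typically use a suffix array instead of materialising the rotations.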
How can I decide whether or not to apply the Burrows-Wheeler transform to an input string? Is there some sort of fast analysis that could make that decision?
Just try compressing it with something much faster than BWT, e.g. lz4, and see how much it compresses. You can then determine by experiment a threshold on that compression ratio, above which you apply BWT, based on whatever criteria you derive for your application.
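A rough sketch of that probe might look like the following. To keep it dependency-free it uses `zlib` at level 1 from the Python standard library as a stand-in for lz4, which is a substitution and not what the answer names. The function names, the 64 KiB sample size, and the threshold of 1.2 are all hypothetical placeholders; the threshold in particular has to come from your own experiments.

```python
import zlib


def fast_ratio(data: bytes, sample_size: int = 64 * 1024) -> float:
    """Compression ratio (original / compressed) of a prefix of data.

    Assumption: a prefix is representative of the whole input.
    """
    sample = data[:sample_size]
    if not sample:
        return 1.0
    # Level 1 is zlib's fastest setting; swap in lz4 here if it is available.
    return len(sample) / len(zlib.compress(sample, 1))


def should_apply_bwt(data: bytes, threshold: float = 1.2) -> bool:
    """Apply BWT only when the fast probe compresses above the threshold."""
    # threshold is a placeholder value; tune it experimentally.
    return fast_ratio(data) >= threshold
```

For example, `should_apply_bwt(b"abc" * 10000)` passes the probe, while data with no repeated content, such as `bytes(range(256))`, does not.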