Burrows-Wheeler Transform without EOF character

2k Views Asked by At

I need to perform a well-known Burrows-Wheeler Transform in linear time. I found a solution with suffix sorting and EOF character, but appending EOF changes the transformation. For example: consider the string bcababa and two rotations

  • s1 = abababc
  • s2 = ababcab

it's clear that s1 < s2. Now with an EOF character:

  • s1 = ababa#bc
  • s2 = aba#bcab

and now s2 < s1. And the resulting transformation will be different. How can I perform BWT without EOF?

3

There are 3 best solutions below

5
On

You need to have EOF character in the string for BWT to work, because otherwise you can't perform the inverse transform to get the original string back. Without EOF, both strings "ba" and "ab" have the same transformed version ("ba"). With EOF, the transforms are different

ab        ba

a b |     a | b
b | a     b a |
| a b     | b a

i.e. ab transforms to "|ab" and ba to "b|a".

EOF is needed for BWT because it marks the point where the character cycle starts.

Re: doing it without the EOF character, according to Wikipedia,

Since any rotation of the input string will lead to the same transformed string, the BWT cannot be inverted without adding an 'EOF' marker to the input or, augmenting the output with information, such as an index, that makes it possible to identify the input string from the class of all of its rotations.

There is a bijective version of the transform, by which the transformed string uniquely identifies the original. In this version, every string has a unique inverse of the same length.

The bijective transform is computed by first factoring the input into a non-increasing sequence of Lyndon words; such a factorization exists by the Chen–Fox–Lyndon theorem, and can be found in linear time. Then, the algorithm sorts together all the rotations of all of these words; as in the usual Burrows–Wheeler transform, this produces a sorted sequence of n strings. The transformed string is then obtained by picking the final character of each of these strings in this sorted list.

0
On

I know this thread is quite old but I had the same problem and came up with the following solution:

  • Find the lexicographical minimal string rotation and save the offset (needed to reverse) (I use the lydon factorization)
  • Use the normal bwt algorithms on the rotated string (this produces the right output because all algos asume that the string is followed by the lexicographically minimal char)
  • To reverse: unbwt using e.g. backward search starting at index 0 and write the corrosponding char to the saved offset
6
On

You can perform the transform in linear time and space without the EOF character by computing the suffix array of the string concatenated with itself. Then iterate over the suffix array. If the current suffix array value is less than n, add to your output array the last character of the rotation starting at the position denoted by the current value in the suffix array. This approach will produce a slightly different BWT transform result, however, since the string rotations aren't sorted as if the EOF character were present.

A more thorough description can be found here: http://www.quora.com/Algorithms/How-I-can-optimize-burrows-wheeler-transform-and-inverse-transform-to-work-in-O-n-time-O-n-space