Naive suffix array optimisation c++

266 Views Asked by At

Do you have an idea how to optimize the following function while still using std::sort(). It sorts the suffixes of a text to create a suffix array. I think the problem is in the compare function as not much more can be done for the rest. compare is basically the lexicographical < operator for a string.

I added the function I use to test the time in the main() and the segment is now reproducible. I hope the comments make it easier to understand the logic. The function works fine but is too slow because of the multiple ifs. Construction time is now around 101 ms (on my CPU), but we aim at around 70.

// compile with: g++ -std=c++17 -Wall -pedantic -fsanitize=address  -O3 test.cpp -o test
#include <string>
#include <vector>
#include <iostream>
#include <chrono>
#include <algorithm>

/// Build suffix array from text.
/// Input: the text (might be empty)
/// Output: a suffix array (sorted). The variable 'sa' is cleared before it's filled and returned.
void construct(std::vector<uint32_t>& sa, const std::string& text) {

    sa.clear();
    unsigned tsize = text.size();

    for (unsigned i = 0; i < tsize; ++i) {
        sa.push_back(i);
    }

    // with this function we compare the suffix at position 'a' to the suffix at position 'b' in the text.
    // we do this letter by letter calling compare recursively
    std::function<bool(uint32_t, uint32_t)> compare = [&](uint32_t a, uint32_t b) {

        if (a>tsize) return true;                 //if we reach the end of the vector on a it means a<b
        else if (b>tsize) return false;            // if we reach the end on b it means b>a
        else if (text[a] < text[b]) return true;    // case a<b
        else if (text[a] == text[b]) return compare(a + 1, b + 1); //if a and b are the same call compare on the next letters in both suffixes
        else return false;                                  // only the case b>a is left

    };

    std::sort(sa.begin(), sa.end(), compare);
}

// main tests the construction speed
int main(){

    std::vector<uint32_t> sa;

    srand(0);
    std::string big(100000, ' '); // 'large' random text
    for (auto& c : big) c = rand() % 128;
    // time the construction
    auto begin = std::chrono::steady_clock::now();
    construct(sa, big);
    auto end = std::chrono::steady_clock::now();

    size_t time = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();

    std::cout << "Construction time: " << time << "\n";

}
2

There are 2 best solutions below

1
On

I haven't profiled it, but - depending on how clever your compiler is - I think a non-recursive formulation might be more efficient:

std::function<bool(uint32_t, uint32_t)> compare = [&](uint32_t a, uint32_t b) {
        do
        {
            if (a>=tsize) return true;
            else if (b>=tsize) return false;
            else if (text[a] < text[b]) return true;
            else if (text[a] > text[b]) return false;
            a += 1;
            b += 1;
        } while (true);
    };
0
On

The code in compare compares strings according to their characters, in lexicographical order. This is very similar to how numbers are compared - most significant bits/bytes, if different, force the result of the comparison, and if equal, defer the result to less significant bits/bytes. So, if your strings are usually long, you can compare 8 bytes at once, using uint64_t.

To do that, you must pad your strings (with zero-valued bytes), to prevent out-of-bounds reads (alternatively, do the 8-byte comparison only when far from the string's end, otherwise do the original 1-byte comparison). Also, if your system is little-endian (most likely), you have to reverse bytes in your uint64_t numbers.