Efficient Means of Implementing Collation & Sorting?


I'm writing lexicography software, which may theoretically need to sort tens of thousands of strings with arbitrary (dictionary-project-specific) collations. There are two means of specifying custom collations:

  1. a map of graphemes to Unicode-style multi-level collation keys.
  2. an array of alphabetic graphemes (possibly including digraphs, etc.) in sorting order, which can be internally transformed into a map of collation keys.
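
For concreteness, here's a minimal sketch of that second transformation in Python, assuming single-level keys (the helper name is mine):

    def keys_from_alphabet(alphabet):
        # Turn an ordered list of graphemes (option 2) into a map of
        # collation keys (option 1); here each key is just the grapheme's
        # position in the alphabet.
        return {g: i for i, g in enumerate(alphabet)}

    keys = keys_from_alphabet(["a", "b", "ch", "d"])  # digraph "ch" sorts between "b" and "d"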

The naive method of comparing two strings is to walk them grapheme by grapheme until you find a mismatch, and then look up and compare the collation keys for the mismatched graphemes, but I'm hoping there's a more efficient way of doing it.
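
As a rough sketch of that naive comparison (Python; this assumes words are already segmented into grapheme lists, so digraph handling is done elsewhere, and that keys maps each grapheme to a sortable weight):

    from functools import cmp_to_key

    def naive_compare(keys):
        def compare(a, b):
            # a, b: lists of graphemes, already segmented
            for ga, gb in zip(a, b):
                if ga != gb:
                    ka, kb = keys[ga], keys[gb]
                    if ka != kb:
                        return -1 if ka < kb else 1
            return len(a) - len(b)  # a prefix sorts before its extensions
        return compare

    words = [["c", "h", "a"], ["c", "a", "t"]]
    words.sort(key=cmp_to_key(naive_compare({"a": 0, "c": 1, "h": 2, "t": 3})))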

The best idea I've got so far comes from noticing that strings of equal length can be treated as big-endian base-n numbers (the first grapheme being the most significant digit), so I can pre-compute an integer key for each string, which turns collation into cheap integer comparison. But this breaks for strings of different lengths (a big deal when sorting a dictionary), and there's no bound on the size of the integers that could be generated. To account for length differences, I thought I could compute a list of keys for all prefixes of each string, and then compare the keys for the prefixes whose length equals that of the shorter string. That seems to work pretty well, but key sizes are still unbounded, and storing all the prefix keys could use a lot of memory.
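
A sketch of that prefix-key idea (Python, whose unbounded integers conveniently demonstrate the unbounded-key problem; the names are mine):

    def prefix_keys(graphemes, rank, n):
        # rank maps each grapheme to an integer in [0, n); earlier
        # graphemes become the most significant digits (big-endian).
        keys, k = [], 0
        for g in graphemes:
            k = k * n + rank[g]
            keys.append(k)
        return keys

    def compare_by_prefix(a_keys, b_keys):
        m = min(len(a_keys), len(b_keys))
        if m == 0 or a_keys[m - 1] == b_keys[m - 1]:
            return len(a_keys) - len(b_keys)  # one string is a prefix of the other
        return -1 if a_keys[m - 1] < b_keys[m - 1] else 1

    # To sort words, pair each with its key list and compare the key
    # lists via functools.cmp_to_key(compare_by_prefix).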

Is there a way to improve that approach? Or am I just going about it entirely wrong, and there's a much better means of sorting strings with arbitrary collations?


2 Answers

Answer 1

I'm no expert, but I might suggest a hybrid between the naive approach and yours: look at a fixed number of graphemes in each string, treat that block as a big-endian base-n number, and compare using pre-calculated collation keys. If the blocks are equal, move to the next block of the same length and repeat. The tricky part is dealing with variable-length graphemes (such as UTF-8 sequences or digraphs). The simplest solution would be to use a fixed-width representation in the dictionary, but there might be another, more sophisticated solution, which I can't think of right now.

Once you get to the end of the shorter string, you zero-extend it to the next block boundary and then do the comparison.
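
A sketch of what I mean (Python; the names and block size are mine, and rank is assumed to map graphemes to integers starting at 1 so the zero padding sorts before every real grapheme):

    def block_keys(graphemes, rank, n, block=8):
        # Pack collation ranks into fixed-width, big-endian integer blocks;
        # rank values lie in [1, n), so zero-extending the final partial
        # block makes a shorter string sort before its extensions.
        out, k, used = [], 0, 0
        for g in graphemes:
            k = k * n + rank[g]
            used += 1
            if used == block:
                out.append(k)
                k, used = 0, 0
        if used:
            out.append(k * n ** (block - used))  # zero-extend the last block
        return out

    # Lists of blocks compare lexicographically, so they work directly
    # as sort keys: sorted(words, key=lambda w: block_keys(w, rank, n))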

You could also look at open-source implementations of collation to see if they do something more sophisticated (for instance, the GNU C library's implementation of the strcoll function).

Answer 2

How about a grapheme-by-grapheme radix sort (most-significant-digit first)? That gives you O(n * m) sorting, where n is the number of words and m is the length of the longest word. The idea is fairly simple: put all the words that start with "A" in the A bucket, the "B" words in the B bucket, and so on, then recurse on each bucket using the next grapheme in each word.
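
A sketch of that idea (an MSD radix sort in Python; it assumes words are pre-segmented into grapheme lists and that rank maps each grapheme to its bucket index):

    def msd_radix_sort(words, rank, depth=0):
        # Bucket words by the grapheme at position `depth`, then recurse
        # on each bucket with the next position (most-significant-digit
        # radix sort). Words exhausted at this depth sort first.
        if len(words) <= 1:
            return words
        done, buckets = [], {}
        for w in words:
            if len(w) <= depth:
                done.append(w)
            else:
                buckets.setdefault(rank[w[depth]], []).append(w)
        out = done
        for r in sorted(buckets):
            out += msd_radix_sort(buckets[r], rank, depth + 1)
        return out

    words = [["c", "h", "a"], ["c", "a", "t"], ["c", "a"]]
    print(msd_radix_sort(words, {"a": 0, "c": 1, "h": 2, "t": 3}))
    # [['c', 'a'], ['c', 'a', 't'], ['c', 'h', 'a']]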