How to sort and search in text while ignoring diacritics of all kinds?


Background

Various languages have what are called "diacritics": special signs that accompany "normal" letters in one way or another. They might change how the letters sound, or just give a hint about how they are supposed to sound.

The problem

When searching and sorting strings the basic way, the comparison uses the raw Unicode values of the characters, so results can appear in the wrong order when sorting, or fail to match when searching.
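For illustration, a minimal Java sketch of the mismatch (the exact output depends on the JDK's collation data):

    import java.text.Collator;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Locale;

    public class NaiveVsCollator {
        public static void main(String[] args) {
            List<String> words = new ArrayList<>(Arrays.asList("échelle", "zèbre", "eau"));
            // Natural order compares raw char values: "é" (U+00E9) comes after "z" (U+007A)
            words.sort(null);
            System.out.println(words); // [eau, zèbre, échelle]
            // A locale-aware Collator puts "échelle" between "eau" and "zèbre"
            words.sort(Collator.getInstance(Locale.FRENCH));
            System.out.println(words); // [eau, échelle, zèbre]
        }
    }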

Searching should let me find the occurrences of one string within another: not just whether they exist, but also where.

If I take the French string "Le Garçon", for example, and search for "rc", it should find a match that starts at the position of "r" and ends at the position of "ç". Finding the locations is important in case you wish to highlight where the text was found.

What I've found

Collator and CollationKey can help for sorting: https://stackoverflow.com/a/75334111/878126
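A rough sketch of how those two fit together for sorting (assuming the JDK's default collation rules for French):

    import java.text.CollationKey;
    import java.text.Collator;
    import java.util.Arrays;
    import java.util.Locale;

    public class CollationKeySort {
        public static void main(String[] args) {
            Collator collator = Collator.getInstance(Locale.FRENCH);
            String[] words = {"échelle", "zèbre", "eau"};
            // When strings are compared many times, precomputed CollationKeys
            // are cheaper than calling Collator.compare on the raw strings.
            CollationKey[] keys = Arrays.stream(words)
                    .map(collator::getCollationKey)
                    .toArray(CollationKey[]::new);
            Arrays.sort(keys); // CollationKey implements Comparable
            for (CollationKey key : keys) {
                System.out.println(key.getSourceString()); // eau, échelle, zèbre
            }
        }
    }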

Normalizer might help for searching, as it decomposes letters that have diacritics: https://stackoverflow.com/a/10700023/878126
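A minimal sketch of that approach for Latin text (note: here each precomposed letter folds back to exactly one base letter, so the match index happens to line up with the original string, but that alignment is not guaranteed in general):

    import java.text.Normalizer;

    public class NormalizedSearch {
        public static void main(String[] args) {
            String text = "Le Garçon";
            // NFD splits "ç" into "c" plus a combining cedilla; stripping
            // the combining marks (category Mn) leaves plain "Le Garcon".
            String folded = Normalizer.normalize(text, Normalizer.Form.NFD)
                    .replaceAll("\\p{Mn}+", "");
            System.out.println(folded.indexOf("rc")); // 5, the index of "r"
        }
    }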

But these don't seem to cover some languages. I know Hebrew, for example, and Hebrew has Niqqud signs (roughly the equivalent of vowels in English, but optional) which, as Unicode characters, are added after the letter, even though the sign itself is displayed inside/around the letter.

https://en.wikipedia.org/wiki/Diacritic#Hebrew

In this case, normalizing the word doesn't change anything, so searching for the text and sorting it become a problem.

Example:

import java.text.Normalizer
import java.util.regex.Pattern

val regex = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+").toRegex()
val string = "בְּרֵאשִׁית"
val length = string.length // 11, not 6: each niqqud mark is a separate Char
val normalized = Normalizer.normalize(string, Normalizer.Form.NFD)
val result = normalized.replace(regex, "") // still equal to the original string, instead of "בראשית"

I was told (here) that perhaps the ICU4J library could help with these two operations (search and sort), but I couldn't find this information.

The questions

Is there a better solution in the Java/Kotlin API for searching and sorting while ignoring diacritics? One that covers as many languages as possible?

Can ICU4J help? If so, how? I couldn't find much information or many samples about how to use it for this purpose in Java/Kotlin.


2 Answers

Philippe Fery

Try this. It will normalize your string for search:

    import java.text.Normalizer;
    import java.text.Normalizer.Form;

    String s = "çéèïïÔé";
    // Decompose, then strip the combining marks left after each base letter
    s = Normalizer.normalize(s, Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    System.out.println(s); // prints "ceeiiOe"
erickson

In the case of your example, בְּרֵאשִׁית, the "diacritics" don't actually appear to be classified as diacritics in Unicode. They are in the category "non-spacing marks," Mn.

This regex satisfies your test: [\\p{IsHebrew}&&\\p{IsMn}]. I don't know Hebrew script, so I can't tell whether it causes problems elsewhere or misses some other elements of the script.


Here is a test demonstrating [\\p{IsHebrew}&&\\p{IsMn}]:

import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;

public class SO75476483 {

    @Test
    public void inTheBeginning() {
        var niqqud = "[\\p{IsHebrew}&&\\p{IsMn}]";
        var text = "בְּרֵאשִׁית";
        int length = text.length();
        Assertions.assertEquals(11, length);
        String actual = text.replaceAll(niqqud, "");
        Assertions.assertEquals("בראשית", actual);
    }

}

Equivalence and sorting rules for the same characters are different in different locales. It follows ineluctably that you must select a specific locale appropriate for each use. There are no universal rules that work for everyone.
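For example, "ä" sorts near "a" in German but after "z" in Swedish. A minimal sketch (the exact ordering comes from the JDK's collation data):

    import java.text.Collator;
    import java.util.Locale;

    public class LocaleOrdering {
        public static void main(String[] args) {
            Collator german = Collator.getInstance(Locale.GERMAN);
            Collator swedish = Collator.getInstance(Locale.forLanguageTag("sv"));
            // German treats "ä" as a variant of "a", so it precedes "z"
            System.out.println(german.compare("ä", "z") < 0);  // true
            // Swedish places "ä" at the end of the alphabet, after "z"
            System.out.println(swedish.compare("ä", "z") > 0); // true
        }
    }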

For search applications, you'll segregate documents by language and build a separate index for each group, using the language-appropriate collator. When making a query, the user will provide a keyword and its language tag (though the language is likely to be implied, for example via the Accept-Language header in an HTTP request). The language is used to select an appropriate collator and an index to search with the resulting collation key.

Here is a test demonstrating the right way to approach this problem (in memory), with a Collator.

    @Test
    public void collateInTheBeginning() {
        var hebrewCollator = Collator.getInstance(Locale.forLanguageTag("he"));
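        // PRIMARY strength ignores accent, mark, and case differences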
        hebrewCollator.setStrength(Collator.PRIMARY);

        var hebrewIndex = new HashMap<CollationKey, String>();
        var document = "בְּרֵאשִׁית";
        var ref = "Gen. 1:1";
        hebrewIndex.put(hebrewCollator.getCollationKey(document), ref);

        var query = "בראשית";
        String actual = hebrewIndex.get(hebrewCollator.getCollationKey(query));

        Assertions.assertEquals(ref, actual);
    }

Of course, many applications have too much text to index to keep all of it in memory as CollationKey instances. Most relational databases support collations internally, if the proper one is specified when a column is defined, and a decent full-text search engine will provide equivalent capabilities.

In the worst case, a CollationKey can be converted to a byte array in the application, and used as a key for searching, range queries, and sorting in nearly any type of external database.
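A minimal sketch of that fallback (the storage side is omitted; any store that compares binary keys byte-wise will do):

    import java.text.Collator;
    import java.util.Locale;

    public class ExternalKey {
        public static void main(String[] args) {
            Collator collator = Collator.getInstance(Locale.forLanguageTag("he"));
            collator.setStrength(Collator.PRIMARY);
            byte[] key = collator.getCollationKey("בְּרֵאשִׁית").toByteArray();
            // Store the bytes as a binary column or key. Byte-wise comparison
            // of these arrays matches the collator's ordering, so equality
            // lookups, range queries, and sorting all behave correctly.
            System.out.println(key.length);
        }
    }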


While Arabic and Hebrew are abjads, where vowels can be inferred, you should be aware that this is not representative. Abugidas like Devanagari are commonly used, and stripping vowel marks from those scripts would make the text illegible.

Decomposing characters with a Normalizer will allow you to remove non-spacing marks, but to be safe, you would need to limit this behavior to abjads (mostly scripts descended from Samaritan or Aramaic).
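A sketch of that kind of guard, using Character.UnicodeScript to strip marks only from an allowlist of scripts (the two-script allowlist here is an assumption; extend it as needed):

    import java.text.Normalizer;
    import java.util.Set;

    public class ScriptAwareStrip {
        // Hypothetical allowlist: abjads where removing vowel marks is safe
        private static final Set<Character.UnicodeScript> ABJADS =
                Set.of(Character.UnicodeScript.HEBREW, Character.UnicodeScript.ARABIC);

        static String stripMarks(String input) {
            String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
            StringBuilder out = new StringBuilder(decomposed.length());
            decomposed.codePoints().forEach(cp -> {
                boolean mark = Character.getType(cp) == Character.NON_SPACING_MARK;
                // Keep everything except non-spacing marks from allowlisted scripts
                if (!(mark && ABJADS.contains(Character.UnicodeScript.of(cp)))) {
                    out.appendCodePoint(cp);
                }
            });
            return out.toString();
        }

        public static void main(String[] args) {
            System.out.println(stripMarks("בְּרֵאשִׁית")); // בראשית
        }
    }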

On the other hand, a Collator configured with the proper language and PRIMARY strength will handle this distinction for you, and ignore marks that don't matter.
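To come back to the ICU4J part of the question: ICU4J's StringSearch performs collator-based matching and reports match positions, which covers the "Le Garçon"/"rc" case from the question. A minimal sketch, assuming the com.ibm.icu:icu4j dependency is available:

    import com.ibm.icu.text.Collator;
    import com.ibm.icu.text.RuleBasedCollator;
    import com.ibm.icu.text.StringSearch;
    import java.text.StringCharacterIterator;
    import java.util.Locale;

    public class IcuSearch {
        public static void main(String[] args) {
            RuleBasedCollator collator =
                    (RuleBasedCollator) Collator.getInstance(Locale.FRENCH);
            collator.setStrength(Collator.PRIMARY); // ignore accents and case
            StringSearch search = new StringSearch(
                    "rc", new StringCharacterIterator("Le Garçon"), collator);
            for (int start = search.first();
                    start != StringSearch.DONE;
                    start = search.next()) {
                // Prints "found at 5, length 2": the match covers "rç"
                System.out.println("found at " + start
                        + ", length " + search.getMatchLength());
            }
        }
    }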