How to detect the script system/alphabet from UTF-8 input?


I am currently building a transliteration web interface based on icu4j. What is the best way to automatically detect which script the user's query is written in?

E.g., if the input is 身体里 or عالمتاب, how can I recognize which script it comes from?

Best answer

The simplest way would be to check the script of the first character:

static Character.UnicodeScript getScript(String s) {
    if (s.isEmpty()) {
        return null;
    }
    return Character.UnicodeScript.of(s.codePointAt(0));
}
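
As a rough illustration of where the first-character check works and where it falls short, here is a minimal sketch. The class name FirstCharDemo and the method name firstScript are made up for this example; the only library call is Character.UnicodeScript.of. Digits, spaces, and most punctuation belong to the COMMON script, so any input that starts with one of them is reported as COMMON regardless of what follows.

```java
final class FirstCharDemo {
    private FirstCharDemo() {}

    // Hypothetical helper: script of the first code point, or null for an empty string.
    static Character.UnicodeScript firstScript(String s) {
        if (s.isEmpty()) {
            return null;
        }
        // codePointAt(0) also handles a leading surrogate pair correctly.
        return Character.UnicodeScript.of(s.codePointAt(0));
    }
}

// FirstCharDemo.firstScript("身体里")      returns HAN
// FirstCharDemo.firstScript("عالمتاب")     returns ARABIC
// FirstCharDemo.firstScript("123 身体里")  returns COMMON, because '1' is a digit
```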

A better way would be to find the most frequently occurring script:

static Character.UnicodeScript getScript(String s) {
    // One counter per script, indexed by the enum constant's ordinal.
    int[] counts = new int[Character.UnicodeScript.values().length];

    Character.UnicodeScript mostFrequentScript = null;
    int maxCount = 0;

    // i is a char (UTF-16) index, so it is bounded by s.length(), not by the
    // code point count; offsetByCodePoints steps over surrogate pairs.
    for (int i = 0; i < s.length(); i = s.offsetByCodePoints(i, 1)) {
        int codePoint = s.codePointAt(i);
        Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);

        int count = ++counts[script.ordinal()];
        if (count > maxCount) {
            maxCount = count;
            mostFrequentScript = script;
        }
    }

    // null if the string is empty
    return mostFrequentScript;
}
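
One caveat with plain frequency counting is that spaces, digits, and punctuation are all counted as COMMON, and combining marks as INHERITED, so a short query with a lot of punctuation can come out as COMMON. The following is a possible variation, not part of the answer above, that skips COMMON, INHERITED, and UNKNOWN before counting. The names ScriptDetector and dominantScript are assumptions made for this sketch.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

final class ScriptDetector {
    private ScriptDetector() {}

    // Hypothetical helper: most frequent script in s, ignoring COMMON, INHERITED
    // and UNKNOWN. Returns null if no other script occurs. On a tie, the script
    // that reaches the highest count first wins.
    static UnicodeScript dominantScript(String s) {
        Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);
        UnicodeScript best = null;
        int bestCount = 0;

        for (int i = 0; i < s.length(); ) {
            int codePoint = s.codePointAt(i);
            i += Character.charCount(codePoint); // 1 or 2 chars per code point

            UnicodeScript script = UnicodeScript.of(codePoint);
            if (script == UnicodeScript.COMMON
                    || script == UnicodeScript.INHERITED
                    || script == UnicodeScript.UNKNOWN) {
                continue; // carries no information about the writing system
            }

            int count = counts.merge(script, 1, Integer::sum);
            if (count > bestCount) {
                bestCount = count;
                best = script;
            }
        }
        return best;
    }
}
```

With this variant, an input such as "123 身体里!!!" is classified as HAN, while an input made up only of digits and spaces yields null, which the caller has to handle (for example by falling back to a default script).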