How to detect if Chinese text contains simplified or traditional characters?


What is a reliable way in Java to detect whether a Unicode string contains simplified or traditional Chinese characters? Assume that characters common to both simplified and traditional Chinese are treated as simplified by default.

Ideally, the check would be a regex match against specific Unicode character ranges. Are these ranges documented and defined, and would this approach be reliable?

Update

Summary
  • To detect the presence of Chinese characters (both simplified and traditional), a regex like ".*[\\u4E00-\\u9FA5]+.*" can be used.
  • To further classify hanzi as Traditional or Simplified, the character lists extracted from cedict can be used. Removing the characters common to both lists leaves two exclusive subsets, which provide the required differentiation, as shown in the sample gist.
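
The two summary points can be sketched together as follows. This is a minimal illustration rather than the gist's actual code: the class name HanScriptSketch, the Script enum, and both character sets are hypothetical, and the sets hold only five hand-picked pairs (国/國, 汉/漢, 学/學, 书/書, 龙/龍) written as code points, where a real implementation would load the full exclusive lists derived from the dictionary. It also uses the \p{IsHan} script property instead of a fixed range, which is a substitution for the regex above, not the same check.

```java
import java.util.Set;
import java.util.regex.Pattern;

// Sketch only: the two sets are tiny samples, not real dictionary-derived lists.
final class HanScriptSketch {
    enum Script { NONE, SIMPLIFIED, TRADITIONAL, MIXED }

    // \p{IsHan} matches any code point whose Unicode script is Han
    private static final Pattern HAN = Pattern.compile("\\p{IsHan}");

    // Sample characters that occur only in simplified text
    private static final Set<Integer> SIMPLIFIED_ONLY =
            Set.of(0x56FD, 0x6C49, 0x5B66, 0x4E66, 0x9F99);
    // Their traditional counterparts, which occur only in traditional text
    private static final Set<Integer> TRADITIONAL_ONLY =
            Set.of(0x570B, 0x6F22, 0x5B78, 0x66F8, 0x9F8D);

    static boolean containsHan(String s) {
        return HAN.matcher(s).find();
    }

    static Script classify(String s) {
        if (!containsHan(s)) {
            return Script.NONE;
        }
        boolean simplified = s.codePoints().anyMatch(SIMPLIFIED_ONLY::contains);
        boolean traditional = s.codePoints().anyMatch(TRADITIONAL_ONLY::contains);
        if (simplified && traditional) {
            return Script.MIXED;
        }
        if (traditional) {
            return Script.TRADITIONAL;
        }
        // Only shared characters (or simplified-only ones) were found:
        // default to simplified, as the question assumes
        return Script.SIMPLIFIED;
    }

    public static void main(String[] args) {
        System.out.println(classify("\u4F60\u597D"));  // shared characters only
        System.out.println(classify("\u6F22\u5B57"));  // contains a traditional-only character
        System.out.println(classify("Hello"));         // no Han characters
    }
}
```

The accuracy of such a classifier depends entirely on how complete the two exclusive lists are; with the sample sets above, most traditional text would still fall through to the simplified default.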

There is 1 best solution below

public class ChineseCharacterDetector {
    // Caveat: Unicode encodes simplified and traditional characters in the same
    // blocks, so the range checks below only detect Han characters in general.

    public static boolean containsSimplifiedChinese(String input) {
        // Iterate over code points so supplementary characters are not split
        return input.codePoints().anyMatch(ChineseCharacterDetector::isSimplifiedChinese);
    }

    public static boolean containsTraditionalChinese(String input) {
        return input.codePoints().anyMatch(ChineseCharacterDetector::isTraditionalChinese);
    }

    private static boolean isSimplifiedChinese(int codePoint) {
        // CJK Unified Ideographs block
        return codePoint >= 0x4E00 && codePoint <= 0x9FFF;
    }

    private static boolean isTraditionalChinese(int codePoint) {
        return (codePoint >= 0x4E00 && codePoint <= 0x9FFF)    // CJK Unified Ideographs
            || (codePoint >= 0x3400 && codePoint <= 0x4DBF)    // Extension A
            || (codePoint >= 0x20000 && codePoint <= 0x2A6DF); // Extension B (beyond a 16-bit char)
    }

    public static void main(String[] args) {
        String input = "你好,世界!Hello, 世界!";
        
        if (containsSimplifiedChinese(input)) {
            System.out.println("Contains Simplified Chinese characters");
        } else if (containsTraditionalChinese(input)) {
            System.out.println("Contains Traditional Chinese characters");
        } else {
            System.out.println("Contains neither Simplified nor Traditional Chinese characters");
        }
    }
}

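The isSimplifiedChinese function checks the CJK Unified Ideographs range (U+4E00 to U+9FFF), whereas isTraditionalChinese checks that same range plus Extension A (U+3400 to U+4DBF) and Extension B (U+20000 to U+2A6DF). The functions containsSimplifiedChinese and containsTraditionalChinese scan the input for characters in those ranges. Note that Unicode encodes simplified and traditional characters in the same blocks, so a range check only establishes that Han characters are present. For the sample input, the first branch always wins because every Chinese character falls in the shared range. Telling the two apart requires character lists such as those described in the summary above.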
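
To make the limitation concrete, here is a small sketch. The class and method names are hypothetical and not part of the answer above. It shows that a simplified character and its traditional counterpart (国, U+56FD, and 國, U+570B) both pass the same range test, and that an Extension B character occupies two Java char values, which is why such characters have to be compared as int code points.

```java
// Sketch: why a range test alone cannot separate the two scripts.
final class RangeCheckDemo {
    // True if the code point lies in the CJK Unified Ideographs block
    static boolean inUnifiedBlock(int codePoint) {
        return codePoint >= 0x4E00 && codePoint <= 0x9FFF;
    }

    public static void main(String[] args) {
        int simplifiedForm = 0x56FD;   // simplified "country"
        int traditionalForm = 0x570B;  // traditional "country"
        System.out.println(inUnifiedBlock(simplifiedForm));   // true
        System.out.println(inUnifiedBlock(traditionalForm));  // true

        // U+20000 (first Extension B code point) is stored as a surrogate pair
        String extensionB = new String(Character.toChars(0x20000));
        System.out.println(extensionB.length());                               // 2
        System.out.println(extensionB.codePointCount(0, extensionB.length())); // 1
    }
}
```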