How to transliterate Chinese characters to Zhuyin (in Java)

How to convert Chinese traditional or simplified characters to Zhuyin phonetic notation?

Example

# simplified
没关系 --> ㄇㄟˊㄍㄨㄢㄒㄧ

# traditional
沒關係 --> ㄇㄟˊㄍㄨㄢㄒㄧ

There is 1 answer below

With Python

The dragonmapper module does hanzi to zhuyin conversion (internally it converts first to pinyin and then to zhuyin):

# install dependency: pip install dragonmapper

>>> from dragonmapper import hanzi
>>> hanzi.to_zhuyin('太阳')
'ㄊㄞˋ ㄧㄤ˙'

With Java

A possible sequence:

  1. Convert the Chinese text (Simplified or Traditional) to numbered pinyin using pinyin4j (Java), pypinyin (Python), etc.
  2. Tokenize the numbered pinyin into syllables, e.g. with a regex alternation built from the full pinyin syllable inventory.
  3. Substitute each pinyin syllable with its zhuyin equivalent using documented mappings such as http://www.pinyin.info/romanization/bopomofo/basic.html or https://terpconnect.umd.edu/~nsw/chinese/pinyin.htm.
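Steps #2 and #3 can be sketched in Python. The syllable and tone tables below are tiny illustrative subsets I made up for the example; a complete converter needs the full (roughly 400-entry) pinyin-to-zhuyin syllable table from one of the mappings linked above:

```python
import re

# Tiny illustrative subset of the pinyin -> zhuyin syllable table
SYLLABLES = {'mei': 'ㄇㄟ', 'guan': 'ㄍㄨㄢ', 'xi': 'ㄒㄧ'}
# Zhuyin tone marks: tone 1 is unmarked, tone 5 is the neutral-tone dot
TONES = {'1': '', '2': 'ˊ', '3': 'ˇ', '4': 'ˋ', '5': '˙'}

def pinyin_to_zhuyin(numbered_pinyin):
    """Convert numbered pinyin like 'mei2 guan1 xi1' to zhuyin."""
    out = []
    for syllable, tone in re.findall(r'([a-zü]+)([1-5])', numbered_pinyin.lower()):
        out.append(SYLLABLES[syllable] + TONES[tone])
    return ''.join(out)

print(pinyin_to_zhuyin('mei2 guan1 xi1'))  # ㄇㄟˊㄍㄨㄢㄒㄧ
```

This reproduces the 没关系 example from the question once step #1 has produced the numbered pinyin.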

Possible scenario for step #1:

Java code

import net.sourceforge.pinyin4j.PinyinHelper;
import net.sourceforge.pinyin4j.format.*;

HanyuPinyinOutputFormat outputFormat = new HanyuPinyinOutputFormat();
outputFormat.setToneType(HanyuPinyinToneType.WITH_TONE_NUMBER);
outputFormat.setVCharType(HanyuPinyinVCharType.WITH_U_AND_COLON);
outputFormat.setCaseType(HanyuPinyinCaseType.LOWERCASE);

// toHanyuPinyinStringArray accepts a single char, not a String; for a whole
// string use toHanyuPinyinString, which joins the syllables with a separator
String pinyin = PinyinHelper.toHanyuPinyinString(chineseText, outputFormat, " ");

Python code

from pypinyin import pinyin, Style

hanzi_text = '當然可以'
# Style.TONE3 yields numbered pinyin (e.g. 'dang1'), matching step #2 above
pinyin_text = ' '.join(seg[0] for seg in pinyin(hanzi_text, style=Style.TONE3))
print(pinyin_text)

Scenario for step #2:

Provided that step #1 produced numbered pinyin, you can now break it into syllables and replace each one using a pinyin-to-zhuyin map (several such maps are available online, including in JS/JSON format).
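If the pinyin comes back as one unsegmented string, the tokenizing regex can be generated from the keys of that map. A minimal sketch with a three-syllable toy inventory (the real inventory has ~400 syllables):

```python
import re

# Toy syllable inventory; a real one covers every valid pinyin syllable
syllables = ['mei', 'guan', 'xi']

# Longest-first alternation so longer syllables win over shorter prefixes
alternation = '|'.join(sorted(syllables, key=len, reverse=True))
tokenizer = re.compile('(' + alternation + ')([1-5])')

tokens = tokenizer.findall('mei2guan1xi1')
print(tokens)  # [('mei', '2'), ('guan', '1'), ('xi', '1')]
```

Sorting the alternation longest-first matters because regex alternation is ordered: without it, a shorter syllable that is a prefix of a longer one would match too early.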

Alternative approach

Another solution would be to map Chinese characters directly to zhuyin using one of the available mappings, such as https://github.com/osfans/rime-tool/blob/master/data/y/taiwan.dict.yaml. The downside is that this particular source only covers Simplified Chinese and won't process Traditional characters.
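A minimal sketch of the direct approach, using a three-character toy table (a real table such as taiwan.dict.yaml has tens of thousands of entries):

```python
# Toy character -> zhuyin table; unknown characters pass through unchanged
CHAR_TO_ZHUYIN = {'没': 'ㄇㄟˊ', '关': 'ㄍㄨㄢ', '系': 'ㄒㄧ'}

def to_zhuyin(text):
    """Replace each known character with its zhuyin reading."""
    return ''.join(CHAR_TO_ZHUYIN.get(ch, ch) for ch in text)

print(to_zhuyin('没关系'))  # ㄇㄟˊㄍㄨㄢㄒㄧ
```

Note that per-character lookup cannot disambiguate characters with multiple readings; that is where word-level data (below) helps.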

UPDATE: The mapping from the libchewing project covers both simplified and traditional characters (plus frequency data and special cases for multi-character words): see word.src (400K) and tsi.src (5.2MB). To handle multi-character segments you'll probably also want a decent Chinese segmentation library such as jieba (Python) or jieba-analysis (Java).
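Short of a full segmentation library, word-level lookup can be sketched with greedy longest-match over a word table. The two-entry table below is a toy stand-in; real data such as libchewing's tsi.src would supply the entries:

```python
# Toy word-level zhuyin table; real data has tens of thousands of words
WORDS = {'沒關係': 'ㄇㄟˊㄍㄨㄢㄒㄧ', '可以': 'ㄎㄜˇㄧˇ'}
MAX_LEN = max(len(w) for w in WORDS)

def segment_to_zhuyin(text):
    """Greedy longest-match: try the longest word starting at each position."""
    out, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + MAX_LEN), i, -1):
            if text[i:j] in WORDS:
                out.append(WORDS[text[i:j]])
                i = j
                break
        else:
            out.append(text[i])  # unknown character passes through unchanged
            i += 1
    return ''.join(out)

print(segment_to_zhuyin('沒關係'))  # ㄇㄟˊㄍㄨㄢㄒㄧ
```

Greedy longest-match is the simplest word segmenter; libraries like jieba do better on ambiguous boundaries by using frequency data, which is exactly what word.src/tsi.src provide.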