Pinyin Named Entity Recognition

I'm trying to do named entity recognition, i.e. pull out the persons, places, etc., from text that contains Pinyin, the romanization of Chinese characters.

For example (from Wikipedia):

 "Jiang Zemin, Li Peng and Zhu Rongji led the nation in the 1990s. Under their administration, China's economic performance pulled an estimated 150 million peasants out of poverty and sustained an average annual gross domestic product growth rate of 11.2%.[125][better source needed][126][better source needed] The country joined the World Trade Organization in 2001, and maintained its high rate of economic growth under Hu Jintao and Wen Jiabao's leadership in the 2000s. However, the growth also severely impacted the country's resources and environment,[127][128] and caused major social displacement.[129][130]
Chinese Communist Party general secretary Xi Jinping has ruled since 2012 and has pursued large-scale efforts to reform China's economy [131][132] (which has suffered from structural instabilities and slowing growth),[133][134][135] and has also reformed the one-child policy and prison system,[136] as well as instituting a vast anti corruption crackdown.[137] In 2013, China initiated the Belt and Road Initiative, a global infrastructure investment project.[138] The COVID-19 pandemic broke out in Wuhan, Hubei in 2019.[139][140]"

I am hoping to extract entities from the above like:

Jiang Zemin
Li Peng
Zhu Rongji
Hu Jintao
Wuhan
Hubei
etc...

NER on Chinese characters is pretty sophisticated, but I don't know of a way to do the same for Pinyin.

My current plan is to build every one- and two-syllable combination of the 1300+ Chinese syllables, as follows:

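import numpy as np
import pandas as pd

# Import the syllable list: one "pinyin character" pair per line
data = pd.read_csv('chinese_tones.txt', sep=" ", header=None)
data.columns = ["pinyin", "character"]

# Strip the tone numbers (the text being searched has no tones, which makes this harder).
# regex=True is required: newer pandas treats the pattern as a literal string by default.
data['pinyin'] = data['pinyin'].str.replace(r'\d+', '', regex=True)
s = data['pinyin'].drop_duplicates().to_numpy(dtype=object)

# Every ordered pair of syllables, concatenated (e.g. "ze" + "min" -> "zemin")
combos = pd.Series(np.add.outer(s, s).ravel())

# Combine single syllables and pairs into one giant list
all_pinyin = pd.concat([pd.Series(s), combos], ignore_index=True)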

I was then going to use something along the lines of .isin() to check each word in the text against this list of Pinyin strings.
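A rough sketch of that comparison which avoids materialising every combination: test whether a word can be segmented into known syllables, then group consecutive capitalised words that pass. The syllable set below is a tiny hand-written subset for illustration only (the full set would come from the deduplicated pinyin column), and the names is_pinyin and pinyin_candidates are hypothetical. With the full syllable set this will over-generate, since ordinary words such as "China" (chi + na) also segment cleanly.

```python
import re

# Toy subset for illustration only; assumed to be replaced by the full
# list of toneless syllables loaded from the data file.
SYLLABLES = {"jiang", "ze", "min", "li", "peng", "zhu", "rong", "ji",
             "hu", "jin", "tao", "wen", "jia", "bao", "wu", "han", "bei"}

def is_pinyin(word, syllables=SYLLABLES):
    """Return True if `word` can be split entirely into known syllables."""
    w = word.lower()
    # reachable[i] is True when the prefix w[:i] can be segmented
    reachable = [True] + [False] * len(w)
    for end in range(1, len(w) + 1):
        # The longest toneless syllables (e.g. "zhuang") have 6 letters
        for start in range(max(0, end - 6), end):
            if reachable[start] and w[start:end] in syllables:
                reachable[end] = True
                break
    return reachable[len(w)]

def pinyin_candidates(text, syllables=SYLLABLES):
    """Collect runs of consecutive capitalised words that segment as pinyin."""
    runs, current = [], []
    # Words become one token each; punctuation and digits become single tokens
    for token in re.findall(r"[A-Za-z]+|[^A-Za-z\s]", text):
        if token[0].isupper() and is_pinyin(token, syllables):
            current.append(token)
        else:
            if current:
                runs.append(" ".join(current))
            current = []
    if current:
        runs.append(" ".join(current))
    return runs

print(pinyin_candidates("Jiang Zemin, Li Peng and Zhu Rongji led the nation."))
```

Because punctuation ends a run, "Wuhan, Hubei" comes out as two separate candidates rather than one.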

Does anyone know of a better way to extract entities from Pinyin?

There is 1 solution below

You can train a character-level sequence tagger (such as a BiLSTM) to extract Chinese names from the text. You will also need to construct some difficult cases for the model, such as ordinary words that look similar to names. You can easily find a lot of Chinese names from here and then use a Hanzi-to-Pinyin tool (such as python-pinyin) to convert the names into their Pinyin form.
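As a minimal sketch of how the training data for such a tagger might be prepared, the snippet below labels every character of a sentence with B-PER / I-PER / O tags and mixes in name-free hard negatives. Everything here is an assumption for illustration: the function names, the sentence templates, and the short name list, which in practice would be produced by converting a large list of Hanzi names with a Hanzi-to-Pinyin tool. The tagger itself is not shown.

```python
def char_bio_tags(sentence, names):
    """Pair each character of `sentence` with a BIO tag for the given names."""
    tags = ["O"] * len(sentence)
    for name in names:
        start = sentence.find(name)
        while start != -1:
            tags[start] = "B-PER"
            for i in range(start + 1, start + len(name)):
                tags[i] = "I-PER"
            # Continue searching after this occurrence
            start = sentence.find(name, start + len(name))
    return list(zip(sentence, tags))

def build_dataset(names, templates, hard_negatives):
    """Fill each template with each name, then append all-O hard negatives."""
    data = []
    for template in templates:
        for name in names:
            data.append(char_bio_tags(template.format(name), [name]))
    for sentence in hard_negatives:
        data.append(char_bio_tags(sentence, []))
    return data

# Hypothetical inputs: names already in pinyin form, hand-written templates,
# and sentences whose words merely look like pinyin ("sang", "song", "China").
names = ["Li Peng", "Zhu Rongji"]
templates = ["{} led the nation.", "The report was written by {}."]
hard_negatives = ["He sang a long song in China."]

dataset = build_dataset(names, templates, hard_negatives)
for char, tag in dataset[0][:9]:
    print(repr(char), tag)
```

Tagging at the character level means the space inside a name is labelled too, so the model can learn that "Li Peng" is one span rather than two.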