Pinyin packages: accuracy and efficiency

I am looking to get the pinyin of Simplified Mandarin characters, and have come across two packages: pinyin_jyutping_sentence and pinyin.

Both offer similar features in terms of the ability to print character pinyin with and without the diacritics, but I am curious if one is more efficient than the other.

Right off the bat, I noticed that the first time I run import pinyin_jyutping_sentence, the package builds a prefix dict:

import pinyin_jyutping_sentence as pnyn
Building prefix dict from Path\to\python\lib\site-packages\pinyin_jyutping_sentence\dict.txt.big ...
Dumping model to file cache Path\to\AppData\Local\Temp\jieba.ue5a383df573783d4e379d21ab891d92a.cache
Loading model cost 0.793 seconds.
Prefix dict has been built successfully.

Running import pinyin, by contrast, did not build any kind of dictionary.

Is there a difference between the two packages in speed and accuracy?

Best answer:

NOTE: Due to StackOverflow's rules about the inclusion of Mandarin characters, I was unable to include either the 294-character Mandarin string or the 8-item list of Mandarin names that I used to test this.


Because this seems to be an obscure topic with no existing questions or answers here on StackOverflow, I ran a quick efficiency and accuracy comparison of the two packages using timeit and datetime.

Here is the code:

'''
Test differences between pinyin packages
'''

import timeit
from datetime import datetime as dt
from tabulate import tabulate

# Timeit import statement for pinyin_jyutping_sentence
import_pinyin_jyutping_sentence = 'import pinyin_jyutping_sentence as pnyn1'

# Timeit import statement for pinyin
import_pinyin = 'import pinyin as pnyn2'

# Empty list to drop tabulate information into
table = []

# Test pinyin_jyutping_sentence package
pinyin_jyutping_sentence = '''
string1 = '[294 character string]'
list_of_names = [list, of, mandarin, names, len, is, 8, long]
pnyn1.pinyin(string1)
pinyin_list = [pnyn1.pinyin(e) for e in list_of_names]
'''
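# Wall-clock time around the whole timeit call; unlike timeit's own result,
# this also includes the setup import
dt1_1 = dt.now()
pinyin_jyutping_sentence_timeit = timeit.timeit(setup=import_pinyin_jyutping_sentence, stmt=pinyin_jyutping_sentence, number=1000)
dt1_2 = dt.now()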

# Calculate the pinyin_jyutping_sentence runtime with datetime
method1_time = dt1_2 - dt1_1

# Append runtime information to tabulate list
table.append(['pinyin_jyutping_sentence', pinyin_jyutping_sentence_timeit, method1_time])


# Test pinyin package
pinyin = '''
string1 = '[294 character string]'
list_of_names = [list, of, mandarin, names, len, is, 8, long]
pnyn2.get(string1)
pinyin_list = [pnyn2.get(e) for e in list_of_names]
'''

# Wall-clock time around the whole timeit call, setup import included
dt2_1 = dt.now()
pinyin_timeit = timeit.timeit(setup=import_pinyin, stmt=pinyin, number=1000)
dt2_2 = dt.now()
# Calculate the pinyin runtime with datetime
method2_time = dt2_2 - dt2_1

# Append runtime information to tabulate list
table.append(['pinyin', pinyin_timeit, method2_time])

print(tabulate(table, headers=['Package', 'Timeit', 'Datetime'], tablefmt='grid'))



import pinyin_jyutping_sentence as pyn1
import pinyin as pyn2

# String 1
print('String1')
string1 = '[294 character string]'
print(pyn1.pinyin(string1),'\n', pyn2.get(string1))

With the following output:

Building prefix dict from Path\to\python\Python\Python38\lib\site-packages\pinyin_jyutping_sentence\dict.txt.big ...
Loading model from cache Path\to\AppData\Local\Temp\jieba.ue5a383df573783d4e379d21ab891d92a.cache
Loading model cost 0.766 seconds.
Prefix dict has been built successfully.
+--------------------------+----------+----------------+
| Package                  |   Timeit | Datetime       |
+==========================+==========+================+
| pinyin_jyutping_sentence | 2.29024  | 0:00:15.148325 |
+--------------------------+----------+----------------+
| pinyin                   | 0.464655 | 0:00:00.588944 |
+--------------------------+----------+----------------+
String1
měiguó jīngjìxuéjiā màikè dùgéěr ( macdougail ) yú yījiǔliùlíng nián jiànlì le guójì zīběn yídòng de ⼀ bān móxíng , yòu chēngwéi guójì tóuzī lìyì fēnpèi móxíng , yòngyǐ fēnxī guójì zīběn yídòng kěnéng gěi dōngdàoguó dàilái de lìyì 。 qíhòu , yòu yóu kěnpǔ děng rén jiàng gāi móxíng fāzhǎn chéng fēnxī jièdàizīběn guójì yùndòng zhōng shuāngfāng dāngshìguó lìyì fēnpèi de ⼀ bān gōngjù 。 zhèzhǒng yǐ guójiā wéi dānwèi de lìyì fēnpèi móxíng rútú yī  yī suǒshì 。 jiǎshè yǒujiǎ  yǐliǎngguó , jiǎwéi zīběn chōngyù guó , yǐwéi zīběn duǎnquē guó 。 jiǎguó yōngyǒu zīběn liàng ma , yǐguó yōngyǒu zīběn liàng na 。 gēnjù biānjì chǎnchū dìjiǎn guīlǜ , zài qítā yàosù tóurùliàng bùbiàn de qíngkuàng xià , měi ⼀ zhuījiā zīběn de dānwèi chǎnchūlǜ dìjiǎn 。 jiǎdìng liǎng guónèi jūn cúnzài wánquán jìngzhēng , zīběn de shōuyìlǜ děngyú zīběn de biānjì chǎnchūlǜ , tú yī  yī zhōng eo biǎoshì jiǎguó de biānjì chǎnchū qūxiàn , fo biǎoshì yǐguó de biānjì chǎnchū qūxiàn 。

 měiguójīngjìxuéjiāmàikèdùgéěr(Macdougail)yú1960niánjiànlìleguójìzīběnyídòngde⼀bānmóxíng,yòuchēngwèiguójìtóuzīlìyìfēnpèimóxíng,yòngyǐfēnxīguójìzīběnyídòngkěnénggěidōngdàoguódàiláidelìyì。qíhòu,yòuyóukěnpǔděngrénjiānggāimóxíngfāzhǎnchéngfēnxījièdàizīběnguójìyùndòngzhōngshuāngfāngdāngshìguólìyìfēnpèide⼀bāngōngjù。zhèzhǒngyǐguójiāwèidānwèidelìyìfēnpèimóxíngrútú1-1suǒshì。jiǎshèyǒujiǎ、yǐliǎngguó,jiǎwèizīběnchōngyùguó,yǐwèizīběnduǎnquēguó。jiǎguóyǒngyǒuzīběnliàngMA,yǐguóyǒngyǒuzīběnliàngNA。gēnjùbiānjìchǎnchūdìjiǎngūilv̀,zàiqítāyàosùtóurùliàngbùbiàndeqíngkuàngxià,měi⼀zhūijiāzīběndedānwèichǎnchūlv̀dìjiǎn。jiǎdìngliǎngguónèijūncúnzàiwánquánjìngzhēng,zīběndeshōuyìlv̀děngyúzīběndebiānjìchǎnchūlv̀,tú1-1zhōngEObiǎoshìjiǎguódebiānjìchǎnchūqūxiàn,FObiǎoshìyǐguódebiānjìchǎnchūqūxiàn。
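The Datetime column above wraps the whole timeit.timeit call, so it mixes the one-off setup import with the conversions themselves. As a rough sketch of how the two costs could be reported separately, the snippet below times the import and the per-call conversion on their own. The helper names (time_import, best_per_call, report) and the sample text are my own placeholders, not part of either package; only pinyin_jyutping_sentence.pinyin and pinyin.get come from the code above. Packages that are not installed are skipped.

```python
import importlib
import time
import timeit

# Placeholder text of my own; the 294-character test string is not available.
SAMPLE_TEXT = "你好世界"

# (module name, conversion function name) as used in the test code above.
PACKAGES = [("pinyin_jyutping_sentence", "pinyin"), ("pinyin", "get")]


def time_import(module_name):
    """Import a module and return (module, seconds the import took)."""
    start = time.perf_counter()
    module = importlib.import_module(module_name)
    return module, time.perf_counter() - start


def best_per_call(func, text, number=100, repeat=5):
    """Return the best-of-`repeat` average time in seconds for one func(text) call."""
    func(text)  # warm-up call so any one-off initialisation is not measured
    timer = timeit.Timer(lambda: func(text))
    return min(timer.repeat(repeat=repeat, number=number)) / number


def report(packages=PACKAGES, text=SAMPLE_TEXT):
    """Print import time and per-call time for each installed package."""
    for module_name, func_name in packages:
        try:
            module, import_seconds = time_import(module_name)
        except ImportError:
            print(f"{module_name}: not installed, skipped")
            continue
        per_call = best_per_call(getattr(module, func_name), text)
        print(f"{module_name}: import {import_seconds:.3f} s, "
              f"per call {per_call * 1e3:.3f} ms")


report()
```

Run this in a fresh interpreter: Python caches imported modules, so a second import of the same package in the same session would report a time close to zero.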

Based on the output of the timeit and datetime modules, pinyin_jyutping_sentence is much slower than pinyin: about 2.29 s against 0.46 s for 1,000 runs under timeit (roughly five times slower), and about 15.1 s against 0.59 s by datetime, which also includes the setup import. However, after comparing the pinyin output of both packages with each other and with the original Mandarin characters, pinyin_jyutping_sentence is far more accurate and readable.* pinyin made several errors in its output for the 294-character string, and on closer examination of the output for the list of names, pinyin got the tone wrong in several places, whereas pinyin_jyutping_sentence got it right in every case (as far as I was able to identify). I will update this answer if I find and test other Mandarin-to-pinyin packages in Python.

*Interestingly, pinyin_jyutping_sentence converted the numbers in the string into their corresponding pinyin (for example, 1960 became yījiǔliùlíng), whereas pinyin left them as digits.
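The side-by-side comparison of the two outputs can be partly automated. The following is a minimal sketch using only the standard library (difflib and unicodedata); the function names are my own and belong to neither package. It ignores whitespace and letter case, lists the spans where the two outputs disagree, and flags a disagreement as tone-only when both sides are identical once combining marks are removed. It only shows where the outputs differ; deciding which reading is correct still requires checking against the original characters.

```python
import difflib
import unicodedata


def strip_tones(text):
    """Remove combining marks (tone marks, and also the dots on ü)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalise(text):
    """Compose characters, drop all whitespace and lower-case the result."""
    composed = unicodedata.normalize("NFC", text)
    return "".join(composed.split()).lower()


def pinyin_differences(output_a, output_b):
    """Return (part_a, part_b, tone_only) for each span where the outputs differ."""
    a, b = normalise(output_a), normalise(output_b)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        part_a, part_b = a[i1:i2], b[j1:j2]
        tone_only = strip_tones(part_a) == strip_tones(part_b)
        differences.append((part_a, part_b, tone_only))
    return differences


# Fragments taken from the two outputs shown above.
sample_a = "yòu chēngwéi guójì tóuzī"   # pinyin_jyutping_sentence
sample_b = "yòuchēngwèiguójìtóuzī"      # pinyin

for part_a, part_b, tone_only in pinyin_differences(sample_a, sample_b):
    kind = "tone only" if tone_only else "other"
    print(f"{part_a!r} vs {part_b!r} ({kind})")
```

Because the comparison is character-based rather than syllable-based, a long stretch of disagreement (such as digits against spelled-out numbers) is reported as one span rather than syllable by syllable.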