Identifying languages of individual words within a list


Suppose I have a list of words in different languages and I need to sort the words by language. What would be the most efficient way to do this? Currently I am using Python's langdetect, which is very good at identifying individual words written in a script that is unique to one language. For example, '这是什么' has a > 0.99 probability of being Mandarin and 'תפוחים' a > 0.99 probability of being Hebrew. It struggles, however, when given a single word from one of the many languages that use the Latin alphabet. For example, 'intro' is given a > 0.99 probability of being Italian and 'já' a > 0.99 probability of being Hungarian. I was thinking of somehow pooling the words together, as each language has multiple words in the list: 'sweaty intro' produces a > 0.99 probability of being English, while 'ele já' produces a > 0.99 probability of being Portuguese.

testWords = ['这是什么', '这是什么', '王明是学生。', 'sweaty', 'intro', 'am', 'תפוח', 'תפוחים', 'אני', 'criança', 'ele', 'já']
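
One possible first step toward the pooling idea is to group the words by Unicode script before running any detector, so that each pooled string contains only one writing system. The sketch below is an illustration under assumptions, not a complete solution: the helper names `script_of`, `pool_by_script` and `detect_pools` are hypothetical, and the script label is simply the first word of the Unicode character name from the standard `unicodedata` module. The detector is passed in as a callable, on the assumption that something like `langdetect.detect` would be supplied.

```python
import unicodedata
from collections import defaultdict


def script_of(word):
    """Return a rough script label taken from the first letter's Unicode name."""
    for ch in word:
        if ch.isalpha():
            # Unicode names begin with the script, e.g. "LATIN SMALL LETTER A",
            # "HEBREW LETTER TAV", "CJK UNIFIED IDEOGRAPH-8FD9".
            return unicodedata.name(ch, "UNKNOWN").split()[0]
    return "UNKNOWN"


def pool_by_script(words):
    """Group words into lists keyed by their script label, preserving order."""
    pools = defaultdict(list)
    for word in words:
        pools[script_of(word)].append(word)
    return dict(pools)


def detect_pools(pools, detect):
    """Run a detector (assumed: a callable taking a string) on each pooled string."""
    return {script: detect(" ".join(words)) for script, words in pools.items()}


test_words = ['这是什么', '这是什么', '王明是学生。', 'sweaty', 'intro', 'am',
              'תפוח', 'תפוחים', 'אני', 'criança', 'ele', 'já']

pools = pool_by_script(test_words)
for script, words in pools.items():
    print(script, words)
```

This only separates writing systems. The Latin pool here still mixes English and Portuguese words, so a single detector call on it would return one label for both; splitting that pool further (for example by trying different word groupings and comparing detector probabilities) remains the open part of the question.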
