How to compare non-English (Chinese) characters in a Python program?

In one of my Python programs (Python 2.7), I need to process some Chinese characters:

  1. I have a file, A.txt, with two columns: "name" and "score". The "name" column can hold Chinese strings, and "score" is an integer between 1 and 10. A.txt is encoded in GBK, a Chinese character encoding.

  2. I insert every row of A.txt into my MySQL table tb_name_score, which has three columns: ID, NAME, and SCORE. The NAME column uses the latin1_swedish_ci collation.

  3. Now I have another file, B.txt, which also has two columns, "name" and "score", and I need to update the SCORE column of tb_name_score according to B.txt. B.txt is also encoded in GBK.

  4. So I traverse B.txt, read a line, and compare its "name" value with the records in tb_name_score.NAME. If they are equal, I update tb_name_score.SCORE. However, even though the "name" value in B.txt is the same Chinese string as the value in tb_name_score.NAME, the "=" comparison returns false, so I can't update the table. Can anybody help? Thanks!
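
The four steps above could be sketched roughly as follows. This is only an illustration under several assumptions that are not in the question: the columns in B.txt are separated by whitespace, the database driver hands back NAME either as raw GBK bytes or as already-decoded text, and `db_rows` is a hypothetical stand-in for the result of a `SELECT ID, NAME` query. The helper names `to_text`, `load_scores` and `plan_updates` are made up for this sketch. The idea is to decode both sides to unicode before comparing.

```python
from __future__ import print_function
import io


def to_text(value, encoding='gbk'):
    """Return value as a unicode string, decoding byte strings with `encoding`."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def load_scores(path, encoding='gbk'):
    """Read 'name score' lines into a {unicode name: int score} dict.

    Assumption: columns are separated by whitespace and the score is the
    last field on the line.
    """
    scores = {}
    with io.open(path, encoding=encoding) as handle:
        for line in handle:
            parts = line.strip().rsplit(None, 1)
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            name, score = parts
            try:
                scores[name] = int(score)
            except ValueError:
                continue  # skip a header row or a non-numeric score
    return scores


def plan_updates(db_rows, new_scores, db_encoding='gbk'):
    """Return (row_id, new_score) pairs for rows whose name is in new_scores.

    db_rows is a hypothetical query result: (id, name) tuples, where name
    may be raw bytes or already-decoded text.
    """
    updates = []
    for row_id, name in db_rows:
        key = to_text(name, db_encoding)
        if key in new_scores:
            updates.append((row_id, new_scores[key]))
    return updates
```

Each pair returned by `plan_updates` could then be passed to a parameterized `UPDATE tb_name_score SET SCORE = %s WHERE ID = %s` statement. Matching on ID rather than on NAME keeps the Chinese text out of the SQL comparison entirely.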

There are 2 best solutions below

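Hope it helps. In Python 2, a unicode string and a byte string (str) holding non-ASCII bytes do not compare equal. Decode the byte string, or encode the unicode string, so that both sides have the same type: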

Python 2.7.3 (default, Apr 10 2013, 06:20:15) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a=u'后者'
>>> b='后者'
>>> type(a)
<type 'unicode'>
>>> type(b)
<type 'str'>
>>> a==b
__main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
False
>>> b
'\xe5\x90\x8e\xe8\x80\x85'
>>> a
u'\u540e\u8005'
>>> b.decode('utf8')
u'\u540e\u8005'
>>> a.encode('utf8')
'\xe5\x90\x8e\xe8\x80\x85'
>>> 
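
The same point can be shown for GBK, the encoding mentioned in the question. The following is a small sketch rather than a definitive recipe; `same_text` is a hypothetical helper name, and the encodings passed to it are assumptions the caller has to get right. It is written to run on both Python 2.7 and Python 3.

```python
from __future__ import print_function

text = u'\u540e\u8005'                      # unicode string, like `a` above
utf8_bytes = b'\xe5\x90\x8e\xe8\x80\x85'    # UTF-8 bytes, like `b` above
gbk_bytes = text.encode('gbk')              # the same two characters in GBK

# Byte strings in different encodings differ even for identical text.
print(utf8_bytes == gbk_bytes)

# Decoding each one with its own encoding gives equal unicode strings.
print(utf8_bytes.decode('utf8') == text)
print(gbk_bytes.decode('gbk') == text)


def same_text(left, right, left_encoding='utf8', right_encoding='utf8'):
    """Compare two values as text, decoding byte strings first."""
    if isinstance(left, bytes):
        left = left.decode(left_encoding)
    if isinstance(right, bytes):
        right = right.decode(right_encoding)
    return left == right
```

For data read from a GBK file, `same_text(name_from_file, name_from_db, 'gbk', 'gbk')` would be the corresponding call, assuming both values really are GBK bytes.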
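
With pandas, decode the byte string to unicode before comparing:

# -*- coding: utf-8 -*-
import pandas as pd

df_raw = pd.read_excel('/Users/zh/workspace/CityRealEstate/CityDataset20180521-4.xlsx')

df_train = df_raw.iloc[:, 3:59]
print df_raw.loc[df_raw['Year'] != 2016]

# In Python 2 this literal is a byte string (UTF-8, given the coding line above).
city = '深圳'
print df_raw['City'].values
# Decode it so that it matches the unicode values pandas read from the file.
df_train = df_raw.loc[df_raw['City'] == city.decode('utf8')]

It works for me.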