Strange Characters after downloading Chinese table from html

156 Views Asked by Yijiao Liu At 27 June 2025 at 21:31

I am using Mac OS X 10.12. I downloaded a table from http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2011/51/06/81/510681114.html . The page is encoded in GB2312, but I told the parser to treat it as GBK. The main part of the code is as follows:

import urllib2
from bs4 import BeautifulSoup, SoupStrainer

# url and path are defined elsewhere in the script
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
page = urllib2.urlopen(req).read()
tables = BeautifulSoup(page, 'html.parser',
                       parse_only=SoupStrainer(), from_encoding='gbk')
f = open(path, 'w')
for row in tables.findAll("tr"):
    cells = row.findAll("td")
    # Join the first two cells of the row into one CSV line
    write_to_file = (cells[0].find(text=True) + "," +
                     cells[1].find(text=True) + "\n")
    write_to_unicode = write_to_file.encode('utf-8')
    f.write(write_to_unicode)
f.close()

I used the same code for many other similar tables. For some links (like the one posted here), however, the downloaded Chinese tables contain strange characters. Here is an example:

´úÂë,³ÇÏç·ÖÀà,Ãû³Æ
510681114001,121,½ÖµÀ¾ÓÃñÎ¯Ô±»á
510681114201,220,ðÀÃù´å´åÃñÎ¯Ô±»á
510681114202,220,°×º×´å´åÃñÎ¯Ô±»á
510681114203,122,Áâ½Ç´å´åÃñÎ¯Ô±»á
510681114204,122,»Æ¼Òµê´å´åÃñÎ¯Ô±»á
510681114205,122,»¨ÌÁ´å´åÃñÎ¯Ô±»á
510681114206,220,ÔÂÍå´å´åÃñÎ¯Ô±»á
510681114207,122,°×ÔÆ´å´åÃñÎ¯Ô±»á
510681114208,220,Á¹Ë®¾®´å´åÃñÎ¯Ô±»á
510681114209,122,Çàþh´å´åÃñÎ¯Ô±»á

How can I convert this table back to readable Chinese, or download the table correctly in the first place?

The problem is that if I switch to GB2312, this table might display correctly, but other tables would still show these strange characters.


There is 1 best solution below

Yijiao Liu On 05 July 2017 at 09:04

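I got the idea from http://zzi.io/?p=275 . The strange characters are GBK bytes that were decoded as ISO-8859-1, so encoding them back to ISO-8859-1 and decoding as GBK recovers the text. For example:

# -*- coding: utf-8 -*-
a = u"´úÂë"
# Undo the wrong ISO-8859-1 decoding, then decode the raw bytes as GBK
print a.encode('iso-8859-1').decode('gbk')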

Result is

代码

So this problem is partly solved.
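
To apply the same trick to every line of a saved table, the round trip can be wrapped in a small helper. This is only a sketch: `fix_mojibake` is a hypothetical name, not a library function, and it assumes the garbled text came from GBK bytes decoded as ISO-8859-1, as in the example above. Text that cannot be round-tripped is returned unchanged.

```python
# -*- coding: utf-8 -*-


def fix_mojibake(text, wrong='iso-8859-1', right='gbk'):
    """Re-decode text that was decoded with `wrong` instead of `right`.

    Hypothetical helper for illustration. Returns the input unchanged
    when the round trip is not possible.
    """
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text contains characters outside `wrong` (it is
        # probably already correct Chinese), or the recovered bytes are
        # not valid `right` data.
        return text


rows = [
    u"´úÂë",  # garbled sample from the question
    u"代码",  # already correct, so it is left unchanged
]
fixed = [fix_mojibake(row) for row in rows]
```

Here `fixed` holds the repaired strings, ready to be encoded as UTF-8 and written out. If a line still comes back garbled, it may contain bytes that are not valid GBK; passing `right='gb18030'` (a superset of GBK) might be worth trying in that case.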
