Some problems with a Python crawler

I'm struggling with a few problems in a Python crawler.

First, the websites use two different hexadecimal representations of Chinese characters. I can convert one of them (E4BDA0E5A5BD), but I have no method to convert the other one (C4E3BAC3), or maybe I am missing something. Both hexadecimal values stand for '你好' in Chinese.

Second, I have found a website that can convert text to hexadecimal, and to my surprise its result is exactly the value I cannot convert myself.

The url is http://www.uol123.com/hantohex.html

That leads to my question: how do I get the result that appears in the text box (I don't know what it is called exactly)? I used Firefox + HttpFox to observe the POST data, and I found that the result converted by the website is in the Content tab.

But when I print the POST response in my script, it has the POST data and some headers, but no information about the Content.

Third, I googled how to make an AJAX request from Python, and I found some code for it.

Here is the URL: http://outofmemory.cn/code-snippet/1885/python-moni-ajax-request-get-ajax-request-response. But when I run it, it fails with the error "ValueError: No JSON object could be decoded."

Sorry, I am new here, so I cannot post images.

I am looking forward to your help sincerely.

Any help will be appreciated.

There are 1 best solutions below

You're talking about different encodings of the same Chinese characters. There are at least three widely used ones: Guobiao (GB2312, for mainland China), Big5 (in Taiwan) and Unicode (usually as UTF-8, everywhere else). Your E4BDA0E5A5BD is the UTF-8 form and C4E3BAC3 is the Guobiao form.

Here's how to convert your characters into the different encodings (Python 2 interactive session):

>>> a = u'你好'             # your original characters
>>> a
u'\u4f60\u597d'            # in Unicode
>>> a.encode('utf-8')
'\xe4\xbd\xa0\xe5\xa5\xbd' # in UTF-8
>>> a.encode('big5')
'\xa7A\xa6n'               # in Taiwanese Big5
>>> a.encode('gb2312-80')
'\xc4\xe3\xba\xc3'         # in Guobiao
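Since the question quotes the bytes as hex strings, here is a minimal Python 3 sketch of the same conversion that prints hex directly. The helper name `to_hex` is made up for illustration, and `bytes.hex()` assumes Python 3.5 or later.

```python
# Python 3 sketch: encode the same text in several encodings and show
# the bytes as uppercase hex, the way the question quotes them.
# `to_hex` is a hypothetical helper name, not part of any library.

def to_hex(text, encoding):
    """Encode `text` with `encoding` and return the bytes as uppercase hex."""
    return text.encode(encoding).hex().upper()

text = "你好"
for encoding in ("utf-8", "gb2312", "big5"):
    print(encoding, to_hex(text, encoding))
```

The UTF-8 and GB2312 lines should match the two hex values from the question.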

The other available encodings are listed in the "Standard Encodings" section of the Python codecs module documentation.

Ah, I almost forgot. To convert from Unicode into a specific encoding, use the encode() method. To convert the encoded contents of a web site back into Unicode, use the decode() method. Just don't forget to specify the correct encoding.
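Going the other way, a rough Python 3 sketch for turning the hex strings from the question back into text could look like this. The function name `decode_hex` and the list of candidate encodings are assumptions for illustration. Trying encodings in order is only a heuristic, so pass the real encoding explicitly when you know it (for example from the page's charset declaration).

```python
# Python 3 sketch: turn a hex string back into text.
# `decode_hex` and its default candidate list are hypothetical; guessing
# by trial can pick the wrong encoding for ambiguous byte sequences.

def decode_hex(hex_string, encodings=("utf-8", "gb2312", "big5")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    raw = bytes.fromhex(hex_string)
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
    raise ValueError("none of the candidate encodings could decode the bytes")

utf8_text, utf8_encoding = decode_hex("E4BDA0E5A5BD")
gb_text, gb_encoding = decode_hex("C4E3BAC3")
print(utf8_text, utf8_encoding)
print(gb_text, gb_encoding)
```

Here C4E3BAC3 is rejected by the UTF-8 decoder and only then decoded as GB2312, which is why it could not be converted with the UTF-8 approach alone.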