I would like to read the source code of a webpage using urllib2; however, I'm seeing strange output that I've not seen before. Here's the code (Python 2.7, Linux):
import urllib2
open_url = urllib2.urlopen("http://www.elegantthemes.com/gallery/")
site_html = open_url.read()
site_html[:50]
Which gives the output:
'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xe5\\ms\xdb\xb6\xb2\xfel\xcf\xe4?\xc0<S[\x9a\x8a\xa4^\xe28u,\xa5\x8e\x93\xf4\xa4\x93&\x99:9\xbdw\x9a\x8e\x07"'
Does anyone know why it's showing this as the output and not the correct HTML?
The HTTP response sent by the site is gzip-compressed, hence the strange output: the leading bytes \x1f\x8b are the gzip magic number. urllib2 does not automatically decompress gzipped content. There are two ways to solve this:
1) Decompress the gzipped content yourself before printing it.
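A minimal sketch of the first approach, assuming the server reports the compression in the Content-Encoding response header. The helper names gunzip_if_needed and fetch_html are made up for illustration, and the fallback import is only there so the same code also runs on Python 3:

```python
import gzip
import io

try:
    from urllib2 import urlopen  # Python 2.7, as in the question
except ImportError:
    from urllib.request import urlopen  # Python 3 equivalent


def gunzip_if_needed(raw, content_encoding):
    """Return the decompressed body if the server reported gzip encoding."""
    if content_encoding and "gzip" in content_encoding.lower():
        # GzipFile needs a file-like object, so wrap the raw bytes.
        return gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    return raw


def fetch_html(url):
    """Fetch url and return the response body with any gzip layer removed."""
    response = urlopen(url)
    raw = response.read()
    encoding = response.info().get("Content-Encoding")
    return gunzip_if_needed(raw, encoding)
```

With this in place, site_html = fetch_html("http://www.elegantthemes.com/gallery/") should give the HTML rather than compressed bytes, provided the site still sets that header.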
2) Use the Requests library, which decompresses gzip responses automatically.
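A rough sketch of the second approach, assuming the third-party Requests package is installed (pip install requests). The function name fetch_html and its optional http argument are hypothetical; the argument exists only so the function can be tried without network access:

```python
def fetch_html(url, http=None):
    """Return the page at url as text; Requests undoes gzip transparently.

    http is a hypothetical hook: any object with a requests-style get()
    method. It defaults to the requests module itself.
    """
    if http is None:
        import requests  # third-party package, imported lazily
        http = requests
    response = http.get(url)
    response.raise_for_status()  # raise on 4xx/5xx instead of returning an error page
    return response.text  # already decompressed and decoded to text
```

Calling fetch_html("http://www.elegantthemes.com/gallery/") then returns the decoded HTML; use response.content instead of response.text if you want the decompressed bytes.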