Strange Output from Python urllib2


I would like to read the source code of a webpage using urllib2; however, I'm seeing strange output that I've not seen before. Here's the code (Python 2.7, Linux):

import urllib2
open_url = urllib2.urlopen("http://www.elegantthemes.com/gallery/")
site_html = open_url.read()
site_html[:50]

Which gives the output:

'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xe5\\ms\xdb\xb6\xb2\xfel\xcf\xe4?\xc0<S[\x9a\x8a\xa4^\xe28u,\xa5\x8e\x93\xf4\xa4\x93&\x99:9\xbdw\x9a\x8e\x07"'

Does anyone know why it's showing this as the output and not the correct HTML?

Best answer

The HTTP response sent by the site is gzip-compressed, hence the strange output: the leading bytes '\x1f\x8b' are the gzip magic number. urllib2 does not automatically decompress gzipped content. There are two ways to solve this:

1) Decompress the gzipped content before printing:

import urllib2
import io
import gzip

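open_url = urllib2.urlopen("http://www.elegantthemes.com/gallery/")
site_html = open_url.read()  # raw bytes, still gzip-compressed
# Wrap the bytes in a file-like object so GzipFile can read from them
bi = io.BytesIO(site_html)
gf = gzip.GzipFile(fileobj=bi, mode="rb")
s = gf.read()  # decompressed HTML
print s[:50]  # first 50 characters of the page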

2) Use the Requests library, which decompresses gzip responses automatically:

import requests
r = requests.get('http://www.elegantthemes.com/gallery/')
print r.content
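
If you stay with option 1, note that a server may not gzip every response, so decompressing unconditionally can fail on a plain body. Here is a minimal sketch of a guard that checks for the gzip magic bytes first. The helper name maybe_gunzip is made up for illustration and is not part of urllib2 or gzip. The code uses only io and gzip, so it should run on Python 2.7 and Python 3 alike:

```python
import gzip
import io

# Every gzip stream starts with these two bytes (seen in the question's output).
GZIP_MAGIC = b"\x1f\x8b"


def maybe_gunzip(raw):
    """Return the decompressed body if raw looks gzipped, else raw unchanged.

    Hypothetical helper for illustration only.
    """
    if raw[:2] == GZIP_MAGIC:
        return gzip.GzipFile(fileobj=io.BytesIO(raw), mode="rb").read()
    return raw
```

With the code from option 1, this would be called as maybe_gunzip(open_url.read()). The more formal signal is the response's Content-Encoding header, which urllib2 exposes through open_url.info().get("Content-Encoding"). The magic-byte check is just a simple fallback that needs only the body.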