How can I download only the text/HTML/JavaScript of a webpage in Python?
I'm trying to get some statistics about the text written by authors of blogs. Needing only the text, I want to increase my program speed by avoiding the download of images, etc.
I'm already able to separate the text from the HTML markup, so my main goal is to avoid downloading additional content from a webpage (images, .swf files, and the like).
So far I use:
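import urllib2

def fetch_html(url):
    # Send a browser-like User-Agent, since some sites reject the default one.
    user_agent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3'
    headers = {'User-Agent': user_agent}
    req = urllib2.Request(url, None, headers)
    response = urllib2.urlopen(req, timeout=60)
    # getheader() returns None when the header is missing, so fall back to ''.
    content_type = response.info().getheader('Content-Type') or ''
    if 'text/html' in content_type:
        return response.read()
    return None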
But I'm not sure whether I'm doing the right thing (i.e. downloading text only).
BeautifulSoup is one of the best Python libraries for parsing web pages.
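On the original worry: a single `urlopen` call fetches only the one resource at that URL. Images, .swf files and stylesheets referenced in the HTML are separate requests that a browser would make, so they are not downloaded unless the program asks for them. What follows is a minimal sketch of that idea, not a definitive implementation. It assumes Python 3, where `urllib.request` replaces `urllib2`. The names `is_html`, `fetch_html`, `TextExtractor` and `html_to_text` are hypothetical helpers made up for this example, and the standard-library `html.parser` stands in for BeautifulSoup so the block has no third-party dependency.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def is_html(content_type):
    """Return True when a Content-Type header value names an HTML document."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")


def fetch_html(url, timeout=60):
    """Fetch one URL and return its body as bytes, or None if it is not HTML."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=timeout) as response:
        # Headers arrive before the body, so the body of a non-HTML
        # response is skipped here rather than read in full.
        if not is_html(response.headers.get("Content-Type")):
            return None
        return response.read()


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, joined with single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Calling `fetch_html(url)` and decoding the result before passing it to `html_to_text` would give the text for the statistics. The `<img>` tags in the page are only parsed as markup here, so no image is ever requested.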