Download only the text content of a webpage in Python

How can I download only the text/HTML/JavaScript of a webpage in Python?

I'm trying to get some statistics about the text written by authors of blogs. Needing only the text, I want to increase my program speed by avoiding the download of images, etc.

I'm already able to separate the text from the HTML markup, so my main goal is to avoid downloading additional content referenced by a webpage (images, .swf files and the like).

So far I use:

import urllib2

def fetch_html(url):  # function name is illustrative; the original snippet was a fragment
    user_agent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3'
    headers = {'User-Agent': user_agent}
    req = urllib2.Request(url, None, headers)
    response = urllib2.urlopen(req, timeout=60)
    # Read the body only when the server reports an HTML document.
    content_type = response.info().getheader('Content-Type') or ''
    if 'text/html' in content_type:
        return response.read()

But I'm not sure if I'm doing the right thing (i.e. downloading text only)

Best answer

BeautifulSoup is one of the best Python libraries for parsing web pages:

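import urllib.request

import bs4

link = "http://example.com/"  # placeholder URL; replace with the page to fetch

# Pass the raw bytes to BeautifulSoup; wrapping them in str() would yield "b'...'" text.
webpage = urllib.request.urlopen(link).read()
soup = bs4.BeautifulSoup(webpage, "html.parser")

print(soup.get_text())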
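
On the original worry about images: urlopen retrieves only the single resource at the given URL. Images, .swf files and external scripts that the page references are separate requests that a browser would make, and urllib does not make them. What can still be worth doing is checking the Content-Type header before reading the body, as the question does. The following is a minimal Python 3 sketch of that idea, using the standard library's html.parser instead of BeautifulSoup so it has no dependencies. The names TEXT_TYPES, TextExtractor, html_to_text and fetch_text are made up for illustration, and the set of accepted content types is an assumption.

```python
import urllib.request
from html.parser import HTMLParser

# Assumed set of content types worth downloading; adjust as needed.
TEXT_TYPES = ("text/html", "application/xhtml+xml")


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the text of an HTML string, with chunks joined by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def fetch_text(url, timeout=60):
    """Return the page text, or None if the response is not HTML."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Headers arrive before the body, so this check runs before read().
        if response.headers.get_content_type() not in TEXT_TYPES:
            return None
        charset = response.headers.get_content_charset() or "utf-8"
        return html_to_text(response.read().decode(charset, errors="replace"))
```

Calling fetch_text(url) on a blog post URL would return its text, or None for a PDF or an image. If BeautifulSoup is available, html_to_text can be replaced by the soup.get_text() call shown above.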