Getting text-only content from a non-English website

I am trying to get the text-only content of a non-English website. For example, I want to get the Hindi content of http://www.bbc.co.uk/hindi/.

For a text dump of an English website, I use wget to fetch the contents, then run the result through an HTML parser to strip the tags and leave clean text. Roughly, the pipeline looks like the sketch below.
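(A minimal sketch of what I mean; urllib stands in for wget here, and the URL is just a placeholder.)

import urllib.request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    # Collect text nodes, skipping script and style blocks.
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())

# Fetch the page (the wget step) and strip the tags.
html = urllib.request.urlopen('http://example.com/').read().decode('utf-8')  # assumes a UTF-8 page
extractor = TextExtractor()
extractor.feed(html)
print('\n'.join(extractor.parts))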

What are the equivalent tools for working on a non-English website?

This is just a pet project I'm exploring, so speed is not much of a concern. I would be coding in a Linux environment, preferably in Python, Java, or C/C++ (in that order).

1 Answer

It sounds like the method you're using to parse HTML falls down when it encounters Unicode. There's a library called BeautifulSoup (installed as beautifulsoup4, imported as bs4) that's great for parsing all manner of websites, and it handles Unicode just fine. Try it interactively:

>>> import urllib.request
>>> from bs4 import BeautifulSoup
>>> html = urllib.request.urlopen('http://www.bbc.co.uk/hindi/').read()
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.find('title').contents
['BBC Hindi - पहला पन्ना']

If your terminal has a Devanagari-capable font, the characters print directly; otherwise, however you usually display Hindi text should work here as well.
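
To get the full text-only dump the question asks for, rather than just the title, BeautifulSoup's get_text() strips every tag in one call. A minimal sketch continuing the same session (the output filename is just a placeholder; dropping script and style elements first keeps their contents out of the dump):

>>> for tag in soup(['script', 'style']):
...     tag.decompose()
...
>>> text = soup.get_text(separator='\n', strip=True)
>>> with open('hindi.txt', 'w', encoding='utf-8') as f:
...     f.write(text)
...

The separator='\n' keeps text from adjacent elements on separate lines instead of running together, and strip=True trims the leftover whitespace around each piece.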