Convert html entities file to Unicode (with BeautifulSoup and Python?)


I have installed Python 2.7.13, pip and beautifulsoup on Win10. I want to convert a big file with html entities into Unicode characters and I am not sure how to go about it (I don't know much about Python). The file contents look like this:

<b>&#947;&#941;&#961;&#969;&#957;</b>, <i>&#959;&#957;&#964;&#959;&#962;, &#8001;</i>, Wurzel <i>&#915;&#917;&#929;</i>, verwandt mit <i>&#947;&#941;&#961;&#945;&#962;, &#947;&#949;&#961;&#945;&#961;&#972;&#962;, &#947;&#949;&#961;&#945;&#953;&#972;&#962;</i>

I can do small parts with EmEditor (using Edit > Encode/Decode Selection -> HTML/XML character reference to Unicode), but it is too slow and cannot cope with converting a big file.

I would be happy for any (offline) solution for this.

4

There are 4 best solutions below

0
On

Thank you for your help. I did manage to do it quite easily with the latest version of EmEditor, which proved to be quite fast:

Select text > Edit > Encode/Decode Selection -> HTML/XML character reference to Unicode

0
On
import bs4

html = '''<b>&#947;&#941;&#961;&#969;&#957;</b>, <i>&#959;&#957;&#964;&#959;&#962;, &#8001;</i>, Wurzel <i>&#915;&#917;&#929;</i>, verwandt mit <i>&#947;&#941;&#961;&#945;&#962;, &#947;&#949;&#961;&#945;&#961;&#972;&#962;, &#947;&#949;&#961;&#945;&#953;&#972;&#962;</i>'''

soup = bs4.BeautifulSoup(html, 'lxml')  # the 'lxml' parser must be installed separately
print(soup)

out:

<html><body><b>γέρων</b>, <i>οντος, ὁ</i>, Wurzel <i>ΓΕΡ</i>, verwandt mit <i>γέρας, γεραρός, γεραιός</i></body></html>

From the documentation:

To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("index.html"))  # you can open your file here

soup = BeautifulSoup("<html>data</html>")

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:
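As a rough illustration of that conversion, here is a minimal sketch using a shortened version of the sample from the question. It assumes Python 3 and the bs4 package; the `html.parser` choice and the variable names are mine, not part of the quoted documentation.

```python
from bs4 import BeautifulSoup

# Shortened sample from the question (numeric character references).
markup = "<b>&#947;&#941;&#961;&#969;&#957;</b>, <i>&#915;&#917;&#929;</i>"

# "html.parser" ships with Python, so no extra parser package is needed;
# unlike 'lxml' it does not wrap the fragment in <html><body>.
soup = BeautifulSoup(markup, "html.parser")

converted = str(soup)    # markup with the entities replaced by characters
plain = soup.get_text()  # text only, with the tags stripped
```

Use `converted` if you want to keep the `<b>`/`<i>` tags, or `plain` if you only need the text.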

1
On

This is HTML-encoded text; try this:

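import io
from HTMLParser import HTMLParser  # Python 2 module name

h = HTMLParser()
# io.open decodes on read and encodes on write; a plain open(..., 'w') in
# Python 2 raises UnicodeEncodeError on the Greek characters.
# UTF-8 is assumed here; adjust the encoding to match your file.
with io.open("myfile.txt", encoding="utf-8") as f:
    new_file_content = h.unescape(f.read())
with io.open("newfile.txt", "w", encoding="utf-8") as new_file:
    new_file.write(new_file_content)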
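
Since the question mentions a big file, here is a sketch of the same idea that works line by line instead of reading everything at once. The function name `convert_file` and the UTF-8 default are my own assumptions; the import falls back to `HTMLParser` so it should also run on Python 2.7.

```python
import io

try:
    from html import unescape  # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7 fallback
    unescape = HTMLParser().unescape


def convert_file(src_path, dst_path, encoding="utf-8"):
    """Copy src_path to dst_path, replacing HTML entities with Unicode characters."""
    with io.open(src_path, encoding=encoding) as src, \
         io.open(dst_path, "w", encoding=encoding) as dst:
        # Iterate line by line so the whole file never sits in memory;
        # this relies on no entity being split across a line break.
        for line in src:
            dst.write(unescape(line))
```

Call it as `convert_file("myfile.txt", "newfile.txt")`. Note that `unescape` also turns `&lt;` and `&amp;` into `<` and `&`, which matters if the output still has to be valid HTML.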
2
On

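Byte strings have a built-in method called .decode() that turns the raw bytes into Unicode text. Simply add this to the end of the line when you read in the file! Note that this only decodes the bytes; entities such as &#947; are left as they are and still have to be converted, for example with BeautifulSoup or HTMLParser.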

Example:

site_read = site_download.read().decode('utf-8')
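
To see what `.decode('utf-8')` does on its own, here is a small sketch with a byte string standing in for the downloaded data. The variable names are made up for illustration, and the unescape step is my addition, not part of the answer above.

```python
try:
    from html import unescape  # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7 fallback
    unescape = HTMLParser().unescape

# Stand-in for site_download.read(): raw bytes as they come from a file or URL.
raw = b"<b>&#947;&#941;&#961;&#969;&#957;</b>"

# decode() only turns bytes into text; the entities are still spelled out.
text = raw.decode("utf-8")

# A separate unescape step replaces the entities with the actual characters.
converted = unescape(text)
```

After this, `text` still contains `&#947;` and so on, while `converted` holds the Greek word itself.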