Converting HTML to plain text that looks like it was copied from a browser using Python


I want to convert HTML to plain text in Python, and I want the result to look as if it had been copied from a browser. I tried several libraries, including html2text, html-text, and BeautifulSoup, but none of them gives the result I want. For example, the following HTML:

<div>aaa</div> <div>AAA</div>
<div><br></div>
<div>bbb</div> <div>BBB</div>
<div><br></div>
<div>ccc</div> <div>CCC</div>

looks like this in the browser:

aaa
AAA

bbb
BBB

ccc
CCC

But when I use html2text, the result is:

aaa

AAA



bbb

BBB



ccc

CCC



The result of html-text is:

aaa
AAA
bbb
BBB
ccc
CCC

and BeautifulSoup just removes the tags:


aaa AAA

bbb BBB

ccc CCC

I also tried soup.get_text('\n') and soup.get_text('\n', strip=True), but neither gave the correct result.

Does anyone have a good way to solve the problem? Thank you very much.


There are 2 best solutions below


As @dabingsou suggested, here is the same approach wrapped in a reusable function:

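from simplified_scrapy.simplified_doc import SimplifiedDoc

def print_html(html):
    # Mirrors the replaceReg usage from @dabingsou's answer.
    doc = SimplifiedDoc(html)
    # Turn each closing </div> into a line break, as a block element would render.
    text = doc.replaceReg(doc.html, "</div>", "\n")
    # Strip the remaining tags from the already-processed text.
    text = doc.replaceReg(text, "<.*>", "")
    print(text)

# Let's say the HTML is
html = """
<div> Hello, World! </div>
<div> By Faran </div>
"""

print_html(html)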

The result will be:

Hello, World!
By Faran
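
For readers who would rather not install a third-party package, the same two-step idea (turn each closing div into a line break, then drop the remaining tags) can be sketched with only the standard library `re` module. The function name `divs_to_text` is hypothetical, and this is a rough illustration for simple div-only markup, not a general HTML converter. It uses `<[^>]*>` instead of `<.*>` because the greedy `<.*>` would also delete the text between two tags that sit on the same line, for example the `x` in `<b>x</b>`.

```python
import re

def divs_to_text(html: str) -> str:
    """Hypothetical helper: convert simple <div>-based HTML to text lines."""
    # Drop whitespace between adjacent tags so it does not leak into the output.
    text = re.sub(r">\s+<", "><", html.strip())
    # Each closing </div> ends a line, as a block element does in a browser.
    text = re.sub(r"</div\s*>", "\n", text, flags=re.IGNORECASE)
    # Remove every remaining tag; [^>]* stops at the first '>' of each tag.
    text = re.sub(r"<[^>]*>", "", text)
    # Trim stray spaces on each line and blank lines at both ends.
    text = "\n".join(line.strip() for line in text.split("\n"))
    return text.strip("\n")

if __name__ == "__main__":
    sample = "\n<div> Hello, World! </div>\n<div> By Faran </div>\n"
    print(divs_to_text(sample))
```

Regular expressions cannot parse arbitrary HTML reliably, so this only makes sense when the input is known to be flat and well-formed.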

What about this?

from simplified_scrapy.simplified_doc import SimplifiedDoc 
html = '''<div>aaa</div> <div>AAA</div>
<div><br></div>
<div>bbb</div> <div>BBB</div>
<div><br></div>
<div>ccc</div> <div>CCC</div>'''
doc = SimplifiedDoc(html)
html = doc.replaceReg(doc.html,"</div>","\n")
html = doc.replaceReg(html,"<.*>","")
print(html)

Result:

aaa
AAA

bbb
BBB

ccc
CCC
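
A parser-based alternative is also possible with the standard library alone. The sketch below uses `html.parser.HTMLParser` and treats a small, assumed set of tags as block-level: text is collected into a line, the line is closed whenever a block element starts or ends, and a `<br>` closes the line even when it is empty, which is what produces the blank lines in the browser rendering. The names `BlockTextExtractor`, `BLOCK_TAGS`, and `html_to_text` are hypothetical, and the code ignores CSS, `<script>`/`<style>` content, and table layout, so it only approximates what a browser would copy.

```python
from html.parser import HTMLParser

# Assumed subset of block-level tags; extend it for other documents.
BLOCK_TAGS = {
    "div", "p", "ul", "ol", "li", "table", "tr", "blockquote", "pre",
    "section", "article", "h1", "h2", "h3", "h4", "h5", "h6",
}

class BlockTextExtractor(HTMLParser):
    """Collects text line by line, breaking at block boundaries and <br>."""

    def __init__(self):
        super().__init__()
        self.lines = []    # finished output lines
        self.current = []  # text fragments of the line being built

    def flush(self, keep_empty=False):
        # Collapse runs of whitespace, as a browser does for normal text.
        text = " ".join("".join(self.current).split())
        self.current = []
        if text or keep_empty:
            self.lines.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            # A <br> always ends the line, even if the line is empty.
            self.flush(keep_empty=True)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        self.current.append(data)

def html_to_text(html: str) -> str:
    parser = BlockTextExtractor()
    parser.feed(html)
    parser.close()   # lets the parser emit any buffered trailing text
    parser.flush()   # close the last line if it has content
    return "\n".join(parser.lines)

if __name__ == "__main__":
    html = (
        "<div>aaa</div> <div>AAA</div>\n<div><br></div>\n"
        "<div>bbb</div> <div>BBB</div>\n<div><br></div>\n"
        "<div>ccc</div> <div>CCC</div>"
    )
    print(html_to_text(html))
```

Compared with the regex replacement, this handles nested inline tags and whitespace between elements without special cases, at the cost of a little more code.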