I want to convert HTML to plain text in Python, and I want the result to look as if it had been copied from the browser. I tried several libraries, including html2text, html-text and BeautifulSoup, but none of them gives the result I want. For example, the following HTML:
<div>aaa</div> <div>AAA</div>
<div><br></div>
<div>bbb</div> <div>BBB</div>
<div><br></div>
<div>ccc</div> <div>CCC</div>
looks like this in the browser:
aaa
AAA

bbb
BBB

ccc
CCC
But when I use html2text, the result is:
aaa
AAA
bbb
BBB
ccc
CCC
The result of html-text is:
aaa
AAA
bbb
BBB
ccc
CCC
and BeautifulSoup just removes the tags:
aaa AAA
bbb BBB
ccc CCC
I also tried soup.get_text('\n') and soup.get_text('\n', strip=True), but neither gave the correct result.
Does anyone have a good way to solve this problem? Thank you very much.
As @dabingsou said, this is a generic solution written as a function:
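A minimal sketch of what such a function could look like is given here. It is not necessarily the code this answer refers to, and it uses only the standard library's html.parser rather than any of the libraries named in the question. The idea is to treat block-level tags as line boundaries and <br> as a forced line break, so that an otherwise empty <div><br></div> becomes a blank line, as it does in a rendered page. The names html_to_text, _TextExtractor and BLOCK_TAGS are hypothetical, and the tag set is an incomplete assumption.

```python
from html.parser import HTMLParser

# Assumption: a small, incomplete set of block-level tags; extend as needed.
BLOCK_TAGS = {
    "address", "article", "aside", "blockquote", "dd", "div", "dl", "dt",
    "footer", "form", "h1", "h2", "h3", "h4", "h5", "h6", "header", "li",
    "ol", "p", "pre", "section", "table", "tr", "ul",
}


class _TextExtractor(HTMLParser):
    """Collects text line by line, roughly the way a browser lays out blocks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []    # completed output lines
        self._buffer = []  # text fragments of the line in progress

    def _end_line(self, keep_empty):
        # Collapse runs of whitespace, as HTML rendering does.
        text = " ".join("".join(self._buffer).split())
        self._buffer = []
        if text or keep_empty:
            self.lines.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            # <br> always ends the current line, even if the line is empty.
            self._end_line(keep_empty=True)
        elif tag in BLOCK_TAGS:
            # A block boundary ends the line only if there is pending text.
            self._end_line(keep_empty=False)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._end_line(keep_empty=False)

    def handle_data(self, data):
        self._buffer.append(data)


def html_to_text(html):
    """Return the text of `html` with one output line per rendered line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()                      # flush text the parser still holds
    parser._end_line(keep_empty=False)  # flush any trailing text
    return "\n".join(parser.lines)


html = (
    "<div>aaa</div> <div>AAA</div>\n"
    "<div><br></div>\n"
    "<div>bbb</div> <div>BBB</div>\n"
    "<div><br></div>\n"
    "<div>ccc</div> <div>CCC</div>"
)
print(html_to_text(html))
```

For the HTML in the question, this prints aaa and AAA on separate lines, then one blank line, then bbb and BBB, another blank line, and finally ccc and CCC. The sketch does not skip the contents of script or style elements and knows nothing about CSS, so it only approximates what a browser would show.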
The result will be: