Sanitise HTML content with Python


I am working with an external API that sends me the text of HTML emails. The text arrives as HTML fragments without the enclosing document structure (e.g. no <html> ... </html> wrapper). I need to sanitise this text and output it to Slack. I have tried both BeautifulSoup and Bleach, but neither works, presumably because the HTML in the input is only partial.

A sample of the input text looks like so:

&lt;div style=&#39;box-sizing:border-box;margin:0px 0px 24px;background-image:initial;background-position:initial;background-size:initial;background-repeat:initial;background-origin:initial;background-clip:initial;border:0px;padding:0px;vertical-align:baseline;color:rgb(51,51,51);font-family:Georgia,&quot;Bitstream Charter&quot;,serif;font-size:16px&#39;&gt;Bacon ipsum dolor amet cupim meatball ham hock pancetta ball tip ribeye cow brisket bresaola short ribs drumstick short loin. Turkey pastrami boudin andouille fatback tenderloin pork beef jowl rump hamburger buffalo capicola prosciutto. Meatball jerky pig filet mignon cow. Tenderloin flank tongue venison. Spare ribs fatback jerky pig boudin biltong filet mignon pancetta capicola.&lt;/div&gt;
&lt;div style=&#39;box-sizing:border-box;margin:0px 0px 24px;background-image:initial;background-position:initial;background-size:initial;background-repeat:initial;background-origin:initial;background-clip:initial;border:0px;padding:0px;vertical-align:baseline;color:rgb(51,51,51);font-family:Georgia,&quot;Bitstream Charter&quot;,serif;font-size:16px&#39;&gt;Jerky salami brisket, landjaeger beef ribs meatball swine alcatra. Pork chop doner kielbasa jowl biltong tri-tip. Sausage sirloin prosciutto ribeye meatball capicola andouille picanha rump bacon turkey kevin pancetta landjaeger jowl. Spare ribs burgdoggen landjaeger buffalo capicola cow corned beef flank frankfurter boudin salami t-bone doner. Kevin filet mignon ribeye, pork belly andouille chuck pig drumstick. Short ribs tri-tip ball tip rump flank.&lt;/div&gt;
&lt;div style=&#39;box-sizing:border-box;margin:0px 0px 24px;background-image:initial;background-position:initial;background-size:initial;background-repeat:initial;background-origin:initial;background-clip:initial;border:0px;padding:0px;vertical-align:baseline;color:rgb(51,51,51);font-family:Georgia,&quot;Bitstream Charter&quot;,serif;font-size:16px&#39;&gt;Pig biltong doner fatback. Tail hamburger kielbasa pastrami buffalo boudin cupim, pig jerky prosciutto venison pork chop chuck sirloin kevin. Bresaola bacon drumstick ball tip salami ribeye capicola beef ribs. Meatball tenderloin drumstick bresaola rump short ribs. Salami venison chuck burgdoggen.&lt;/div&gt;
&lt;div style=&#39;box-sizing:border-box;margin:0px 0px 24px;background-image:initial;background-position:initial;background-size:initial;background-repeat:initial;background-origin:initial;background-clip:initial;border:0px;padding:0px;vertical-align:baseline;color:rgb(51,51,51);font-family:Georgia,&quot;Bitstream Charter&quot;,serif;font-size:16px&#39;&gt;Strip steak ham prosciutto, biltong meatball kielbasa boudin shankle ground round bacon. Alcatra short loin chuck shankle hamburger shank, buffalo sausage turkey prosciutto tongue kielbasa venison. Shank cow turducken beef ribs meatloaf pork belly. Pastrami leberkas ball tip pancetta short loin sirloin turducken rump hamburger cupim strip steak ground round brisket filet mignon pork. Beef shankle kevin tail picanha bacon beef ribs cow ground round pig ham rump. Bresaola spare ribs tenderloin pastrami, ham jowl short loin hamburger shankle tail venison pig meatloaf.&lt;/div&gt;

I would like the following output for the input above:

Bacon ipsum dolor amet cupim meatball ham hock pancetta ball tip ribeye cow brisket bresaola short ribs drumstick short loin. Turkey pastrami boudin andouille fatback tenderloin pork beef jowl rump hamburger buffalo capicola prosciutto. Meatball jerky pig filet mignon cow. Tenderloin flank tongue venison. Spare ribs fatback jerky pig boudin biltong filet mignon pancetta capicola.
Jerky salami brisket, landjaeger beef ribs meatball swine alcatra. Pork chop doner kielbasa jowl biltong tri-tip. Sausage sirloin prosciutto ribeye meatball capicola andouille picanha rump bacon turkey kevin pancetta landjaeger jowl. Spare ribs burgdoggen landjaeger buffalo capicola cow corned beef flank frankfurter boudin salami t-bone doner. Kevin filet mignon ribeye, pork belly andouille chuck pig drumstick. Short ribs tri-tip ball tip rump flank.
Pig biltong doner fatback. Tail hamburger kielbasa pastrami buffalo boudin cupim, pig jerky prosciutto venison pork chop chuck sirloin kevin. Bresaola bacon drumstick ball tip salami ribeye capicola beef ribs. Meatball tenderloin drumstick bresaola rump short ribs. Salami venison chuck burgdoggen.
Strip steak ham prosciutto, biltong meatball kielbasa boudin shankle ground round bacon. Alcatra short loin chuck shankle hamburger shank, buffalo sausage turkey prosciutto tongue kielbasa venison. Shank cow turducken beef ribs meatloaf pork belly. Pastrami leberkas ball tip pancetta short loin sirloin turducken rump hamburger cupim strip steak ground round brisket filet mignon pork. Beef shankle kevin tail picanha bacon beef ribs cow ground round pig ham rump. Bresaola spare ribs tenderloin pastrami, ham jowl short loin hamburger shankle tail venison pig meatloaf.

I used the following simple Bleach routine:

import bleach

def textify(html):
    text = bleach.clean(html)
    return text

With BeautifulSoup I also used some regex to clean the output:

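import re
from bs4 import BeautifulSoup

def textify(html):
    # Turn explicit line breaks into newlines before parsing
    html = re.sub('<br>', '\n', html)
    soup = BeautifulSoup(html, 'html.parser')
    text = soup.get_text()
    # Try to undo a few HTML escapes by hand
    text = re.sub(r'&lt;', '<', text)
    text = re.sub(r'&gt;', '>', text)
    text = re.sub(r'&#39;', "'", text)
    return text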
Best answer

The input is HTML-escaped (&lt;div ...&gt; rather than <div ...>), so both libraries see plain text instead of tags. Unescape the string first with the standard library's html module, then pass it to Bleach or BeautifulSoup:

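from html import unescape

import bleach
from bs4 import BeautifulSoup

html = "&lt;div style=&#39;bo...div&gt;"
unescaped_html = unescape(html)

# Option 1: Bleach. With an empty allow-list and strip=True, tags are
# removed rather than escaped (the default is to escape them).
text = bleach.clean(unescaped_html, tags=set(), strip=True)

# Option 2: BeautifulSoup, keeping only the text nodes
soup = BeautifulSoup(unescaped_html, 'html.parser')
text = soup.get_text()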
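
If you would rather avoid a third-party dependency, the same unescape-then-strip idea can be sketched with the standard library alone. This is a minimal sketch, not a full sanitiser: the class name _TextExtractor, the BLOCK_TAGS set, and the choice to end a line at each closing div or p are assumptions made for illustration, picked to match the one-paragraph-per-line output the question asks for.

```python
from html import unescape
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and breaks lines at <br> and closing block tags."""

    # Assumption: these are the only tags that should end a line.
    BLOCK_TAGS = {"div", "p"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def textify(escaped_html):
    """Unescape the entity-encoded markup, then keep only its text."""
    parser = _TextExtractor()
    parser.feed(unescape(escaped_html))
    parser.close()  # flush any text still buffered by the parser
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    # Drop the blank lines left behind by whitespace between tags
    return "\n".join(line for line in lines if line)


sample = (
    "&lt;div style=&#39;font-family:Georgia,&quot;Bitstream Charter&quot;,serif&#39;&gt;"
    "Bacon ipsum.&lt;/div&gt;\n"
    "&lt;div&gt;Jerky salami.&lt;/div&gt;"
)
print(textify(sample))
```

The key step is still the call to unescape before parsing; without it the parser receives no tags at all and returns the markup as literal text.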