python bleach: inconsistent cleaning behaviour

I would like to use bleach to format some potentially unclean HTML. In the following sample, ideally bleach should remove:

  • the extra spaces in the first opening <p >
  • the attribute in the closing link tag </a attr="test">
  • the extra spaces in the last closing </p >

My code looks like this:

import bleach
html = """<p   >This <a href="book"> book </a attr="test"> will help you</p  >"""
html_cleaned = bleach.clean(html)

# html_cleaned is:
#'&lt;p  &gt;This <a href="book"> book </a> will help you&lt;/p&gt;'

As you can see, bleach is very inconsistent:

  • the < and > of the opening and closing p tags are escaped to &lt; and &gt;, but this doesn't happen for the link tag
  • the spaces in the closing </p > are removed, but those in the opening <p > are not
  • additionally, if I add an attribute to the closing p tag, </p attr="test">, it is not removed, while the illegal attribute in the closing </a attr="test"> is removed

What is happening here?

Best answer

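bleach.clean accepts an optional tags parameter that specifies the allowed tags. The p tag is not allowed by default, so bleach does not sanitize it as markup: with the default strip=False, a disallowed tag is escaped and emitted as text, whereas the allowed a tag is parsed and written back out in clean form.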

My problem can be fixed by:

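# ALLOWED_TAGS is a list in older bleach releases and a frozenset since 6.0; a set union works with both
html_cleaned = bleach.clean(html, tags=set(bleach.sanitizer.ALLOWED_TAGS) | {"p"})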
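
To compare the two behaviours side by side, here is a minimal sketch that cleans the same input once with bleach's defaults and once with p added to the allowed tags. The helper names allowed_tags_with and clean_default_and_with_p are hypothetical and not part of bleach; only bleach.clean, its tags parameter and bleach.sanitizer.ALLOWED_TAGS come from the library. It assumes bleach is installed.

```python
def allowed_tags_with(default_tags, extra_tags):
    """Return the default tags plus extra_tags as a set.

    Hypothetical helper (not part of bleach). It accepts a list or a
    frozenset, because the type of bleach.sanitizer.ALLOWED_TAGS differs
    between bleach releases.
    """
    return set(default_tags) | set(extra_tags)


def clean_default_and_with_p(html):
    """Clean html twice: with bleach's defaults, then with <p> also allowed."""
    # Third-party dependency, imported here so that allowed_tags_with
    # stays usable without bleach installed.
    import bleach

    tags = allowed_tags_with(bleach.sanitizer.ALLOWED_TAGS, ["p"])
    default_result = bleach.clean(html)          # <p> is escaped as text
    with_p_result = bleach.clean(html, tags=tags)  # <p> is treated as markup
    return default_result, with_p_result
```

Calling clean_default_and_with_p(html) with the sample from the question returns both versions, which makes it easy to check that the p tags are only escaped in the first one. If disallowed tags should be dropped rather than escaped, bleach.clean also takes strip=True.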