I would like to use bleach to format some potentially unclean HTML. In the following sample, ideally bleach should remove:
- the extra spaces in the first opening <p >
- the attribute in the closing link tag </a attr="test">
- the extra spaces in the last closing </p >
My code looks like this:
import bleach
html = """<p >This <a href="book"> book </a attr="test"> will help you</p >"""
html_cleaned = bleach.clean(html)
# html_cleaned is:
# '&lt;p &gt;This <a href="book"> book </a> will help you&lt;/p&gt;'
As you can see, bleach is very inconsistent:
- the < and > of the opening and closing p tags are escaped to &lt; and &gt;. For the link tag, this doesn't happen
- the spaces in </p > are removed, but in the opening <p > they are not
- additionally, if I add an attribute to the closing p tag, </p attr="test">, it is not removed, while for the closing </a attr="test"> the illegal attribute is removed.
What is happening here?
bleach.clean accepts an optional tags parameter which specifies the allowed tags. The p tag is not allowed by default and is therefore escaped instead of getting the sanitizing treatment. My problem can be fixed by:
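A minimal sketch of allowing the p tag, assuming the goal is to keep bleach's default allow-list and only add p to it. The name allowed_tags is illustrative, and building a set is an assumption made so the same code works whether ALLOWED_TAGS is a list or a frozenset in the installed bleach version:

```python
import bleach
from bleach.sanitizer import ALLOWED_TAGS

html = """<p >This <a href="book"> book </a attr="test"> will help you</p >"""

# "p" is not part of the default allow-list, which is why it gets escaped.
p_allowed_by_default = "p" in ALLOWED_TAGS

# Assumption: keep the defaults and add "p" (allowed_tags is an illustrative name).
allowed_tags = set(ALLOWED_TAGS) | {"p"}

# With "p" allowed, the tag is parsed and re-serialized instead of escaped.
html_cleaned = bleach.clean(html, tags=allowed_tags)
print(html_cleaned)
```

With p allowed, both p tags should go through the same parse-and-serialize path as the link tag, so the stray spaces and the attribute on the closing tag are dropped rather than escaped.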