bleach clean adds "<pre><code>" tag at the beginning rather than cleaning

I scraped some HTML content from the internet. Below is only the beginning of it:

<p style="max-width: 100%;min-height: 1em;letter-spacing: 0.544px;text-align: center;box-sizing: border-box !important;word-wrap: break-word !important;"><strong style="max-width: 100%;letter-spacing: 0.544px;font-size: 24px;box-sizing: border-box !important;word-wrap: break-word !important;"><strong style="max-width: 100%;letter-spacing: 0.544px;box-sizing: border-box !important;word-wrap: break-word !important;"><span style="max-width: 100%;color: rgb(255, 41, 65);box-sizing: border-box !important;word-wrap: break-word !important;"><strong style="max-width: 100%;letter-spacing: 0.544px;color: rgb(0, 0, 0);font-size: 18px;box-sizing: border-box !important;word-wrap: break-word !important;"><span style="max-width: 100%;font-size: 24px;letter-spacing: 0.544px;box-sizing: border-box !important;word-wrap: break-word !important;"><strong style="max-width: 100%;letter-spacing: 0.544px;box-sizing: border-box !important;word-wrap: break-word !important;"><span style="max-width: 100%;letter-spacing: 0.544px;box-sizing: border-box !important;word-wrap: break-word !important;"><strong style="max-width: 100%;box-sizing: border-box !important;word-wrap: break-word !important;"><strong style="max-width: 100%;letter-spacing: 0.544px;box-sizing: border-box !important;word-wrap: break-word !important;"><span style="max-width: 100%;letter-spacing: 0.544px;color: rgb(61, 167, 66);box-sizing: border-box !important;word-wrap: break-word !important;"><strong style="max-width: 100%;box-sizing: border-box !important;word-wrap: break-word !important;">...

I am using

body_html = bleach.clean(markdown(value, output_format='html'), tags=['SOME_ALLOWED_TAGS'], attributes=['SOME_ALLOWED_ATTRIBUTES'], styles=['SOME_ALLOWED_STYLES'], strip=True, strip_comments=True)

but the result is not what I expected:

<pre><code> &lt;p style="max-width: 100%;min-height: 1em;letter-spacing: 0.544px;text-align: center;box-sizing: border-box !important;word-wrap: break-word !important;"&gt;&lt;strong style="max-width: 100%;letter-spacing: 0.544px;font-size: 24px;box-sizing: border-box !important;word-wrap: break-word !important;"&gt;&lt;strong style="max-width: 100%;letter-spacing: 0.544px;box-sizing: border-box !important;word-wrap: break-word !important;"&gt;&lt;span style="max-width: 100%;color: rgb(255, 41, 65);box-sizing: border-box !important;word-wrap: break-word !important;"&gt;&lt;strong style="max-width: 100%;letter-spacing: 0.544px;color: rgb(0, 0, 0);font-size: 18px;box-sizing: border-box !important;word-wrap: break-word !important;"&gt;&lt;span style="max-width: 100%;font-size: 24px;letter-spacing: 0.544px;box-sizing: border-box !important;word-wrap: break-word !important;"&gt;&lt;strong style="max-width: 100%;letter-spacing: 0.544px;box-sizing: border-box !important;word-wrap: break-word !important;"&gt;&lt;span style="max-width: 100%;letter-spacing: 0.544px;box-sizing: border-box  

What is wrong with bleach clean? Is it because there are too many tags and styles to clean, so it just added "<pre><code>" at the beginning and closed it at the end?

Best answer

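Figured it out. The content to be cleaned contains \n \n\n \n\n \n \n at the beginning, and that leading whitespace has to be removed first. The likely mechanism is that Markdown treats text indented by four or more spaces as a code block, so markdown() wraps the indented HTML in <pre><code> and escapes it before bleach clean ever sees it.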
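A minimal sketch of that fix, assuming the stray whitespace sits only around the fragment and not inside it. The helper names normalize_scraped_html and render_body and the sample string are made up for illustration. The styles argument from the question is left out because newer bleach releases no longer accept it.

```python
def normalize_scraped_html(raw: str) -> str:
    # Drop blank lines and indentation around the fragment so the first
    # tag starts at column 0 instead of looking like an indented block.
    return raw.strip()


def render_body(value: str, allowed_tags, allowed_attributes) -> str:
    # Hypothetical wrapper around the call from the question.
    # Third-party imports are local so the helper above runs without them.
    import bleach
    from markdown import markdown

    html = markdown(normalize_scraped_html(value), output_format="html")
    return bleach.clean(
        html,
        tags=allowed_tags,
        attributes=allowed_attributes,
        strip=True,
        strip_comments=True,
    )


# Illustrative input: leading newlines and spaces like those in the answer.
scraped = '\n \n\n \n\n \n \n     <p style="text-align: center;"><strong>Hello</strong></p>'
print(repr(normalize_scraped_html(scraped)))
```

Calling render_body(scraped, ['p', 'strong'], ['style']) would then pass the stripped fragment through markdown and bleach. If lines inside the content are also indented, strip() alone may not be enough and each line would need its own treatment.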