Extract both CSS and HTML comments from HTML doc (Python)

Question

Asked by Randy Banks on 09 June 2015 at 22:04 · 1k views

I am passing HTML into BeautifulSoup with Python, and HTML comments are corrupting my output. I have this Python script to remove HTML comments, but it fails to remove an HTML comment that has CSS comments nested inside it.

My code:

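from bs4 import BeautifulSoup, Comment

# Read the whole HTML file into one string
with open('output.txt') as f:
    input_text = f.read()

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(input_text, 'html.parser')

# Collect every HTML comment node and detach it from the tree
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    comment.extract()

print(soup)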

For example, it removes all HTML comments from my test input except for:

<!-- /* Font Definitions */ @font-face {font-family:"Cambria Math"; panose-1:2 4 5 3 5 4 6 3 2 4;} @font-face {font-family:Calibri; panose-1:2 15 5 2 2 2 4 3 2 4;} @font-face {font-family:Verdana; panose-1:2 11 6 4 3 5 4 4 2 4;} @font-face {font-family:inherit;} /* Style Definitions */ p.MsoNormal, li.MsoNormal, div.MsoNormal {margin:0in; margin-bottom:.0001pt; font-size:12.0pt; font-family:"Times New Roman",serif;} a:link, span.MsoHyperlink {mso-style-priority:99; color:blue; text-decoration:underline;} a:visited, span.MsoHyperlinkFollowed {mso-style-priority:99; color:purple; text-decoration:underline;} span.EmailStyle17 {mso-style-type:personal-reply; font-family:"Calibri",sans-serif; color:#1F497D;} .MsoChpDefault {mso-style-type:export-only; font-family:"Calibri",sans-serif;} @page WordSection1 {size:8.5in 11.0in; margin:1.0in 1.0in 1.0in 1.0in;} div.WordSection1 {page:WordSection1;} -->

Here is a chunk of code from the input that includes 2 comments that are successfully removed after running my script on it, as well as the comment that is not removed:

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> <meta name="Generator" content="Microsoft Word 15 (filtered medium)"> <!--[if !mso]><style>v\:* {behavior:url(#default#VML);} o\:* {behavior:url(#default#VML);} w\:* {behavior:url(#default#VML);} .shape {behavior:url(#default#VML);} </style><![endif]--><style><!-- /* Font Definitions */ @font-face {font-family:"Cambria Math"; panose-1:2 4 5 3 5 4 6 3 2 4;} @font-face {font-family:Calibri; panose-1:2 15 5 2 2 2 4 3 2 4;} @font-face {font-family:Verdana; panose-1:2 11 6 4 3 5 4 4 2 4;} @font-face {font-family:inherit;} /* Style Definitions */ p.MsoNormal, li.MsoNormal, div.MsoNormal {margin:0in; margin-bottom:.0001pt; font-size:12.0pt; font-family:"Times New Roman",serif;} a:link, span.MsoHyperlink {mso-style-priority:99; color:blue; text-decoration:underline;} a:visited, span.MsoHyperlinkFollowed {mso-style-priority:99; color:purple; text-decoration:underline;} span.EmailStyle17 {mso-style-type:personal-reply; font-family:"Calibri",sans-serif; color:#1F497D;} .MsoChpDefault {mso-style-type:export-only; font-family:"Calibri",sans-serif;} @page WordSection1 {size:8.5in 11.0in; margin:1.0in 1.0in 1.0in 1.0in;} div.WordSection1 {page:WordSection1;} --></style><!--[if gte mso 9]><xml> <o:shapedefaults v:ext="edit" spidmax="1026" /> </xml><![endif]--><!--[if gte mso 9]><xml> <o:shapelayout v:ext="edit"> <o:idmap v:ext="edit" data="1" /> </o:shapelayout></xml><![endif]-->

I'm not sure of the best way to remove the CSS comments first. I shouldn't need to remove the content of the CSS comments, just the /* and */ delimiters, as the rest should be stripped along with the HTML comment it is nested within.
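One likely reason the comment survives is visible in the chunk above: it sits inside a <style> element. HTML parsers treat the contents of <style> as raw text, so the <!-- ... --> there is reported as ordinary text rather than as a comment node, and a filter on Comment never sees it. A minimal sketch of that behavior follows, using Python's standard-library html.parser instead of BeautifulSoup (on the assumption that BeautifulSoup's parsers classify the text the same way). CommentProbe and the sample markup are hypothetical, made up for illustration.

```python
from html.parser import HTMLParser


class CommentProbe(HTMLParser):
    """Hypothetical helper: records what the parser reports as comments
    and what it reports as plain text inside <style>."""

    def __init__(self):
        super().__init__()
        self.in_style = False
        self.comments = []
        self.style_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self.in_style = True

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False

    def handle_comment(self, data):
        # Called only for markup the parser recognizes as a comment
        self.comments.append(data)

    def handle_data(self, data):
        # Text inside <style> arrives here, even if it looks like a comment
        if self.in_style:
            self.style_text.append(data)


probe = CommentProbe()
probe.feed("<p>Hi</p><!-- plain --><style><!-- /* x */ p {margin:0in;} --></style>")
probe.close()

print(probe.comments)             # comments seen by the parser
print("".join(probe.style_text))  # raw text seen inside <style>
```

Only the top-level comment ends up in probe.comments; the one wrapped in <style> arrives through handle_data as plain text.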

**Randy Banks** · Accepted Answer · 2015-06-10T20:34:48.833000

I solved my issue. I removed the CSS comment markers using regexes before parsing, and for anyone who is curious, here is my new code:

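from bs4 import BeautifulSoup, Comment
import re

# Read the whole HTML file into one string
with open('output.txt') as f:
    input_text = f.read()

# Strip the CSS comment delimiters /* and */ (the text between them is kept).
# The asterisk is escaped so each pattern matches the literal two characters.
text = re.sub(r'/\*', '', input_text)
text = re.sub(r'\*/', '', text)

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(text, 'html.parser')

# Extract all HTML comments
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
for comment in comments:
    comment.extract()

print(soup)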
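Because text inside a <style> element may not be parsed as a comment at all, another option is to strip both kinds of comment from the raw string before it reaches BeautifulSoup. The following is a rough sketch of that idea, not the code from the answer above. strip_comments and the sample string are hypothetical, and the patterns assume that comments are never nested and that /* or <!-- never appear inside attribute values or script strings.

```python
import re

# Non-greedy patterns; DOTALL lets a comment span several lines.
CSS_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments(html):
    """Remove CSS comments, then HTML comments, from an HTML string."""
    without_css = CSS_COMMENT.sub("", html)
    return HTML_COMMENT.sub("", without_css)


# Shortened stand-in for the Word-generated markup in the question
sample = (
    "<head><!--[if !mso]><style>v\\:* {x:y;}</style><![endif]-->"
    "<style><!-- /* Font Definitions */ p {margin:0in;} --></style></head>"
    "<body><p>Hello</p><!-- note --></body>"
)
print(strip_comments(sample))
```

The cleaned string can then be passed to BeautifulSoup as usual. Note that this also drops conditional comments such as <!--[if !mso]>...<![endif]--> together with everything inside them, which matches what the question asks for but may be too aggressive for other inputs.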
