Reading an HTML file in .NET with CSS/JavaScript comments

I'm trying to load an HTML document into an XDocument in C# and am running into issues with comments inside <style> and <script> tags. Specifically, some of the comments contain < characters, so XDocument throws errors complaining that they contain illegal names.

Here's my C# code:

XDocument doc = XDocument.Load(fileName);

And (a portion of) my HTML:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8"/>
<meta name="generator" content="..."/>
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"/>
<style type="text/css">
/*! 
 * Copyright 2012,2013 --- <[email protected]> 
...

The one thing I can think of so far is reading the file as a string and wrapping the CSS/JavaScript in CDATA sections (using a regex), but I was hoping there was an easier way.
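
For reference, the CDATA workaround described above could look roughly like the following. This is only a sketch under assumptions: the class and method names (XhtmlLoader, WrapRawTextInCData, Parse) are made up for illustration, and the regex assumes <style> and <script> elements are not nested and that the rest of the file is already well-formed XML.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
public static class XhtmlLoader
{
    // Captures the opening tag, the raw body, and the closing tag of each
    // <style> or <script> element. The (?<!/) lookbehind skips self-closing
    // tags such as <script src="..."/>.
    private static readonly Regex RawTextElement = new Regex(
        @"(<(style|script)\b[^>]*(?<!/)>)(.*?)(</\2\s*>)",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string WrapRawTextInCData(string html)
    {
        return RawTextElement.Replace(html, m =>
        {
            string body = m.Groups[3].Value;

            // Leave empty bodies and bodies that already use CDATA alone.
            if (body.Trim().Length == 0 || body.Contains("<![CDATA["))
                return m.Value;

            // A literal "]]>" would end the CDATA section early, so split it
            // across two adjacent sections.
            body = body.Replace("]]>", "]]]]><![CDATA[>");

            return m.Groups[1].Value + "<![CDATA[" + body + "]]>" + m.Groups[4].Value;
        });
    }

    public static XDocument Parse(string html)
    {
        return XDocument.Parse(WrapRawTextInCData(html));
    }
}
```

It would be called as XhtmlLoader.Parse(File.ReadAllText(fileName)). Anything else in the file that is valid HTML but not valid XML (unclosed tags, named entities such as &nbsp;) would still make XDocument throw.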

There is 1 solution below

I agree with ryascl's comment above: an HTML parser, not an XML parser, would be a better fit for this problem. I haven't used any myself in a while, but here are a few that a search turned up:

Html Agility Pack

A managed wrapper for the HTML Tidy library

Chilkat .NET HTML Conversion - this parses HTML and converts it to XML
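
As a rough illustration of the HTML-parser route, here is a minimal sketch using Html Agility Pack (NuGet package HtmlAgilityPack). The class and method names (HtmlRawText, ReadStyleAndScript) are made up for this example; HtmlDocument, LoadHtml, DocumentNode, Descendants and InnerHtml are the library's own members. It shows that a < inside a CSS or JavaScript comment is treated as plain text rather than as markup.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

// Hypothetical helper, not part of Html Agility Pack.
public static class HtmlRawText
{
    // Returns the raw contents of every <style> and <script> element.
    public static List<string> ReadStyleAndScript(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // doc.Load(fileName) reads from a file instead

        var contents = new List<string>();
        foreach (string tag in new[] { "style", "script" })
        {
            foreach (HtmlNode node in doc.DocumentNode.Descendants(tag))
            {
                // InnerHtml is the unparsed text between the tags, so a
                // "<" inside a comment comes back unchanged.
                contents.Add(node.InnerHtml);
            }
        }
        return contents;
    }
}
```

If an XDocument is still needed afterwards, HtmlDocument has an OptionOutputAsXml setting that makes Save write the parsed tree as XML; whether that output then loads cleanly into XDocument depends on the input, so treat it as something to try rather than a guarantee.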

Relevant SE and Wikipedia links:

C# library for parsing HTML?

Comparison of HTML parsers