I'm trying to load an HTML document into an XDocument in C# and running into issues with comments inside <style> and <script> tags. Specifically, some of those comments contain < characters, so XDocument throws errors complaining that they contain illegal names.
Here's my C# code:
XDocument doc = XDocument.Load(fileName);
And (a portion of) my HTML:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8"/>
<meta name="generator" content="..."/>
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"/>
<style type="text/css">
/*!
* Copyright 2012,2013 --- <[email protected]>
...
The only thing I can think of so far is reading the file as a string and wrapping the CSS/JavaScript in CDATA sections (using a regex), but I was hoping there was an easier way.
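For reference, the CDATA workaround described above might look roughly like the sketch below. The class and method names (CDataWrapper, WrapRawText, LoadWithCData) are made up for illustration, and the regex assumes reasonably tidy XHTML: lowercase or uppercase tag names, no > inside attribute values, and no nested <script> or <style> text.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class CDataWrapper
{
    // Matches a non-self-closing <style> or <script> element and captures
    // its opening tag, body, and closing tag. The (?<!/) lookbehind skips
    // self-closing forms such as <script src="..."/>.
    private static readonly Regex RawTextElement = new Regex(
        @"(?<open><(?<tag>style|script)\b[^>]*(?<!/)>)(?<body>.*?)(?<close></\k<tag>\s*>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Hypothetical helper: wraps each style/script body in a CDATA section.
    public static string WrapRawText(string markup)
    {
        return RawTextElement.Replace(markup, m =>
        {
            string body = m.Groups["body"].Value;

            // Leave empty bodies and bodies that already use CDATA alone.
            if (string.IsNullOrWhiteSpace(body) || body.Contains("<![CDATA["))
                return m.Value;

            // A literal "]]>" would end the CDATA section early, so split it
            // across two adjacent CDATA sections.
            body = body.Replace("]]>", "]]]]><![CDATA[>");

            return m.Groups["open"].Value
                 + "<![CDATA[" + body + "]]>"
                 + m.Groups["close"].Value;
        });
    }

    // Hypothetical helper: reads the file as text, wraps, then parses as XML.
    public static XDocument LoadWithCData(string fileName)
    {
        string markup = File.ReadAllText(fileName);
        return XDocument.Parse(WrapRawText(markup));
    }
}
```

Usage would be XDocument doc = CDataWrapper.LoadWithCData(fileName); in place of XDocument.Load(fileName). This only addresses < and & inside style/script bodies; anything else in the file that is not well-formed XML will still make XDocument.Parse throw.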
I agree with ryascl's comment above - a better fit for your problem domain would be an HTML parser, not an XML parser. I haven't used any myself in a while, but here are a couple that a search turned up:
Html Agility Pack
A managed wrapper for the HTML Tidy library
Chilkat .NET HTML Conversion - this converts to XML, but parses HTML
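As a rough illustration of the first option, here is a minimal sketch that assumes the HtmlAgilityPack NuGet package is referenced. The StyleReader class and ReadStyleBlocks method are hypothetical names; HtmlDocument, LoadHtml, DocumentNode, SelectNodes and InnerText come from the library. I haven't run this against your file, so treat it as a starting point.

```csharp
using System;
using HtmlAgilityPack;

public static class StyleReader
{
    // Hypothetical helper: returns the raw text of every <style> element.
    public static string[] ReadStyleBlocks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant HTML parsing, no XML well-formedness required

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//style");
        if (nodes == null)
            return new string[0];

        var result = new string[nodes.Count];
        for (int i = 0; i < nodes.Count; i++)
            result[i] = nodes[i].InnerText;
        return result;
    }
}
```

To read straight from disk, doc.Load(fileName) can be used in place of doc.LoadHtml(html).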
Relevant SE and Wikipedia links:
C# library for parsing HTML?
Comparison of HTML parsers