I need to pull various bits of information out of this HTML.
In a perfect world I'd have some helper attributes to hook into, but for various reasons I'm stuck with this structure and have to work with the mess as it is.
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<p>Paragraph text.</p>
<p>Paragraph text.</p>
<p>Paragraph text.</p>
<p>Paragraph text.</p>
<p><strong>Heading 1 text I can extract</strong><br />Paragraph text - this is where the extraction ends for this paragraph/strong, I need to include the list (and any other content before the next paragraph/strong)</p>
<ul>
<li>I need to pull out these list items;</li>
<li>I need to pull out these list items;</li>
<li>I need to pull out these list items;</li>
</ul>
<p><strong>Heading 2 text I can extract</strong><br />Paragraph text - this extracts fine</p>
</body>
</html>
I am doing this:
public static void Parse(string html)
{
    var document = new HtmlDocument();
    document.LoadHtml(html);
    var paragraphs = new List<string>();
    var heading = string.Empty;
    var nodes = document.DocumentNode.SelectNodes("//p");
    for (int i = 0; i < nodes.Count; i++)
    {
        var paragraphNode = nodes[i];
        paragraphs.Add(paragraphNode.InnerText.Trim() + Environment.NewLine);
    }
}
paragraphNode.NextSibling does not contain the UL. What's the best way to go about parsing this?
I need to be careful, as the UL must form part of the preceding paragraph, so this is one content block:
<p><strong>Heading 1 text I can extract</strong><br />Paragraph text - this is where the extraction ends for this paragraph/strong, I need to include the list (and any other content before the next paragraph/strong)</p>
<ul>
<li>I need to pull out these list items;</li>
<li>I need to pull out these list items;</li>
<li>I need to pull out these list items;</li>
</ul>
and this is the next content block:
<p><strong>Heading 2 text I can extract</strong><br />Paragraph text - this extracts fine</p>
I cannot change the structure of the HTML or rely on any other content. Is there a somewhat sane way of doing this?
You may want to try this extension: https://github.com/hcesar/HtmlAgilityPack.CssSelector
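If you would rather stay with plain HtmlAgilityPack, one option is to walk the siblings yourself. NextSibling most likely hands you the whitespace text node that sits between </p> and <ul>, not the list itself, so the loop has to skip non-element nodes. Below is a rough sketch, not a definitive implementation. It assumes that a content block always starts at a <p> with a direct <strong> child and runs until the next such paragraph. ContentBlockParser, ParseBlocks, IsBlockStart and Clean are names I made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ContentBlockParser
{
    // Assumption: a block starts at a <p> that has a direct <strong> child.
    private static bool IsBlockStart(HtmlNode node) =>
        node.NodeType == HtmlNodeType.Element
        && node.Name == "p"
        && node.Element("strong") != null;

    // Decode entities and trim surrounding whitespace.
    private static string Clean(HtmlNode node) =>
        HtmlEntity.DeEntitize(node.InnerText).Trim();

    // Returns one list per content block; the first entry is the heading text.
    public static List<List<string>> ParseBlocks(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var blocks = new List<List<string>>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var starts = document.DocumentNode.SelectNodes("//p[strong]");
        if (starts == null) return blocks;

        foreach (var start in starts)
        {
            var lines = new List<string> { Clean(start.Element("strong")) };

            // Rest of the heading paragraph: everything except the <strong>.
            // The <br /> has empty text and is dropped by the Length check.
            lines.AddRange(start.ChildNodes
                .Where(child => child.Name != "strong")
                .Select(Clean)
                .Where(text => text.Length > 0));

            // Walk forward until the next block start or the end of the parent.
            for (var sibling = start.NextSibling;
                 sibling != null && !IsBlockStart(sibling);
                 sibling = sibling.NextSibling)
            {
                // Skip whitespace text nodes and comments between elements.
                if (sibling.NodeType != HtmlNodeType.Element) continue;

                var items = sibling.SelectNodes(".//li");
                if (items != null)
                    lines.AddRange(items.Select(Clean));   // one entry per list item
                else
                    lines.Add(Clean(sibling));             // any other element, as plain text
            }

            blocks.Add(lines);
        }

        return blocks;
    }
}
```

With the HTML from the question, the four plain paragraphs at the top are ignored because they have no <strong>, the first block should pick up its heading, its paragraph text and the three list items, and the second block should contain only its heading and paragraph text. If blocks can be nested inside other containers, or the heading is not always a direct <strong> child, IsBlockStart is the place to adjust.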