How to better parse sibling content with HtmlAgilityPack

I need to pull various bits of information out of this HTML.

In a perfect world, I'd have some helper attributes to use, but I am stuck with this structure and have to work with it as it is.

<!DOCTYPE html>
<html>
<head>
</head>
<body>
<p>Paragraph text.</p>
<p>Paragraph text.</p>
<p>Paragraph text.</p>
<p>Paragraph text.</p>
<p><strong>Heading 1 text I can extract</strong><br />Paragraph text - this is where the extraction ends for this paragraph/strong, I need to include the list (and any other content before the next paragraph/strong)</p>
<ul>
<li>I need to pull out these list items;</li>
<li>I need to pull out these list items;</li>
<li>I need to pull out these list items;</li>
</ul>
<p><strong>Heading 2 text I can extract</strong><br />Paragraph text - this extracts fine</p>
</body>
</html>

I am doing this:

 public static void Parse(string html)
 {
     var document = new HtmlDocument();
     document.LoadHtml(html);
     var paragraphs = new List<string>();
     var heading = string.Empty;
     // SelectNodes returns null (not an empty collection) when nothing matches.
     var nodes = document.DocumentNode.SelectNodes("//p");
     if (nodes == null) return;
     for (int i = 0; i < nodes.Count; i++)
     {
         var paragraphNode = nodes[i];
         // InnerText covers only the <p> itself, so the following <ul> is missed.
         paragraphs.Add(paragraphNode.InnerText.Trim() + Environment.NewLine);
     }
 }

paragraphNode.NextSibling does not contain the UL. What's the best way to parse this?
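A likely cause: HtmlAgilityPack keeps the whitespace between </p> and <ul> as a text node of its own, so NextSibling returns that text node rather than the list. A minimal sketch of a helper that skips ahead to the next element follows; the names SiblingHelpers and NextElementSibling are hypothetical and not part of the library.

```csharp
using HtmlAgilityPack;

public static class SiblingHelpers
{
    // Hypothetical helper (not an HtmlAgilityPack API): returns the next
    // sibling that is an element, skipping whitespace text nodes and
    // comments. Returns null when no element sibling follows.
    public static HtmlNode NextElementSibling(HtmlNode node)
    {
        var sibling = node.NextSibling;
        while (sibling != null && sibling.NodeType != HtmlNodeType.Element)
        {
            sibling = sibling.NextSibling;
        }
        return sibling;
    }
}
```

Inside the loop this could be used as: var next = SiblingHelpers.NextElementSibling(paragraphNode); followed by a check such as next != null && next.Name == "ul" before reading the list items.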

I need to be careful, as the UL must form part of the preceding paragraph, so this is one content block:

<p><strong>Heading 1 text I can extract</strong><br />Paragraph text - this is where the extraction ends for this paragraph/strong, I need to include the list (and any other content before the next paragraph/strong)</p>
<ul>
<li>I need to pull out these list items;</li>
<li>I need to pull out these list items;</li>
<li>I need to pull out these list items;</li>
</ul>

and this is the next content block:

<p><strong>Heading 2 text I can extract</strong><br />Paragraph text - this extracts fine</p>

I cannot change the structure of the HTML or rely on any other content. Is there a somewhat sane way of doing this?

There is 1 answer below.

Marco Merola answered:

You may want to try this extension: https://github.com/hcesar/HtmlAgilityPack.CssSelector

 document.QuerySelectorAll("p, li");
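
The selector above returns the matching nodes as a flat list, so the caller still has to group them into content blocks. Here is a minimal sketch that does the grouping with HtmlAgilityPack's built-in XPath support instead of the extension. It assumes that a block starts at any <p> that has a <strong> child, and that everything up to the next such <p> belongs to it. ContentBlockParser and ContentBlock are hypothetical names made up for this illustration.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ContentBlockParser
{
    // Hypothetical container: one heading plus the text lines that follow it.
    public sealed class ContentBlock
    {
        public string Heading { get; set; } = string.Empty;
        public List<string> Lines { get; } = new List<string>();
    }

    public static List<ContentBlock> Parse(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var blocks = new List<ContentBlock>();
        ContentBlock current = null;

        // "//body/*" selects only element children, in document order, so the
        // whitespace text nodes between tags never appear in this collection.
        var elements = document.DocumentNode.SelectNodes("//body/*");
        if (elements == null) return blocks;

        foreach (var element in elements)
        {
            var strong = element.Name == "p" ? element.Element("strong") : null;
            if (strong != null)
            {
                // Assumption: a <p> with a <strong> child opens a new block.
                current = new ContentBlock { Heading = Clean(strong.InnerText) };
                blocks.Add(current);

                // The rest of the paragraph is its direct text nodes.
                foreach (var child in element.ChildNodes)
                {
                    if (child.NodeType == HtmlNodeType.Text) AddLine(current, child.InnerText);
                }
            }
            else if (current != null)
            {
                if (element.Name == "ul" || element.Name == "ol")
                {
                    // Lists contribute one line per item.
                    foreach (var item in element.Elements("li")) AddLine(current, item.InnerText);
                }
                else
                {
                    AddLine(current, element.InnerText);
                }
            }
            // Elements before the first heading paragraph are skipped.
        }

        return blocks;
    }

    private static string Clean(string text)
    {
        return HtmlEntity.DeEntitize(text).Trim();
    }

    private static void AddLine(ContentBlock block, string text)
    {
        var cleaned = Clean(text);
        if (cleaned.Length > 0) block.Lines.Add(cleaned);
    }
}
```

Run against the HTML in the question, this should give two blocks: the first holds the paragraph text plus the three list items, the second holds only its paragraph text. The four plain paragraphs at the top are dropped because no heading precedes them; if they matter, they would need a block of their own.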