C# HtmlDocument Extract Classes


I am writing some code to loop through every element in an HTML page and extract all IDs and classes.

My current code is able to extract the IDs, but I can't see a way to get the classes. Does anybody know where I can access these?

    private void ParseElements()
    {
        // GET: Document from Browser
        HtmlDocument ThisDocument = Browser.Document;

        // DECLARE: List of IDs
        List<string> ListIdentifiers = new List<string>();

        // LOOP: Through Each Element
        for (int LoopA = 0; LoopA < ThisDocument.All.Count; LoopA += 1)
        {
            // DETERMINE: Whether ID Exists in Element
            if (ThisDocument.All[LoopA].Id != null)
            {
                // ADD: Identifier to List
                ListIdentifiers.Add(ThisDocument.All[LoopA].Id);
            }
        }
    }

There is 1 solution below

You could take the outer HTML of each element and use a regular expression to pull out the class attribute (the inner HTML does not include the element's own tag). Or you could try HTML Agility Pack.

Something like...

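    // Parse the browser's current markup with HTML Agility Pack
    HtmlAgilityPack.HtmlDocument AgilePack = new HtmlAgilityPack.HtmlDocument();
    AgilePack.LoadHtml(ThisDocument.Body.OuterHtml);

    // Select every element; SelectNodes returns null when nothing matches
    HtmlAgilityPack.HtmlNodeCollection Nodes = AgilePack.DocumentNode.SelectNodes("//*");

    if (Nodes != null)
    {
        foreach (HtmlAgilityPack.HtmlNode Node in Nodes)
        {
            // The indexer returns null when the element has no class attribute
            if (Node.Attributes["class"] != null)
                MessageBox.Show(Node.Attributes["class"].Value);
        }
    }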
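
If you would rather not add a dependency, another option that is often suggested for the WinForms WebBrowser control is to read the class through `HtmlElement.GetAttribute("className")`. Below is a minimal sketch, assuming the control exposes the class under the DOM property name `className` (this can depend on the document mode the control is running in, so treat it as something to verify). `ClassNameSketch`, `SplitClassNames` and `CollectClassNames` are hypothetical names made up for illustration, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Windows.Forms;

static class ClassNameSketch
{
    // Hypothetical helper: splits a raw class value such as
    // "btn  btn-primary" into its individual class names.
    public static string[] SplitClassNames(string rawValue)
    {
        if (string.IsNullOrWhiteSpace(rawValue))
            return new string[0];

        // A null separator array means "split on any whitespace".
        return rawValue.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    }

    // Hypothetical helper: collects the distinct class names used in a document,
    // in the order they are first encountered.
    public static List<string> CollectClassNames(HtmlDocument document)
    {
        var seen = new HashSet<string>();
        var classNames = new List<string>();

        foreach (HtmlElement element in document.All)
        {
            // ASSUMPTION: the class is exposed as "className" (the DOM
            // property name) rather than "class".
            string rawValue = element.GetAttribute("className");

            foreach (string name in SplitClassNames(rawValue))
            {
                if (seen.Add(name))
                    classNames.Add(name);
            }
        }

        return classNames;
    }
}
```

Calling `ClassNameSketch.CollectClassNames(Browser.Document)` once the page has finished loading should give a list of classes that can sit alongside `ListIdentifiers` from the question.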