Html Parser - C# Regex html tags (div, img, a, h5 etc) plus attributes


I need a C# regex that pulls out the content of HTML tags (div, img, a, h5, etc.), that is, the text between > and < as in: >me im the content<. The HTML tags are closed in a number of different ways.

Why am I doing this, you might ask? I have inherited prototype code that performs phrase replacement, for example Home -> Casa (Spanish). As you can imagine, I have quite a lot of phrases (350 and rising), such as "Add New Contact", which vary in length and word count.

First requirement: A regex is required to pull out the tag content. Output must be: here is the content to be matched by the regex. This will let me further manipulate the string in order to perform phrase replacement.

Second requirement: A regex is required to pull out the attribute content of a tag, for input such as: here is the content to be matched by the regex/> Output must be:

Please, please don't respond with "use the HTML Agility Pack". I have bespoke requirements that do not allow me to look at: a. A well-formed document. b. Client-side XSL transforms. c. XML data islands which determine content.

string file = @"<html>
        <body>
            <input class='moth'>Add New Organisation  </>
<input class='moth'>Org&#160;role
 </>
         </body>
           </html>";

string searchText = "Add New Organisation";

<([\d\w]*)\b[^>]*>([\d\w\s]*?{0}[\d\w\s]*)

Can anyone help? So far I have been using the regex above, applied like this:

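// regExpressionContent presumably holds the pattern above with {0} replaced by searchText.
// Caution: IgnorePatternWhitespace makes the engine skip unescaped spaces in the pattern,
// so a phrase such as "Add New Organisation" must be escaped before it is inserted.
var myContentMatches = Regex.Matches(file, regExpressionContent.ToString(),
        RegexOptions.IgnoreCase
        | RegexOptions.IgnorePatternWhitespace
        | RegexOptions.Multiline)
    .Cast<Match>()
    .Select(pp => pp.ToString())
    .ToList();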
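One thing that may explain incorrect matching with the code above: under RegexOptions.IgnorePatternWhitespace, unescaped spaces inside the pattern are ignored, so a multi-word phrase dropped into {0} will not match text that contains those spaces. A minimal sketch of one way around this, assuming the pattern is built with string.Format and the phrase is passed through Regex.Escape (which escapes spaces as well as metacharacters). The names TagContentSketch, PatternTemplate and FindContent are hypothetical, not part of the original code, and the option is simply dropped here because the pattern has no layout whitespace to ignore:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagContentSketch
{
    // Pattern from the question; {0} is the slot for the (escaped) search phrase.
    private const string PatternTemplate =
        @"<([\d\w]*)\b[^>]*>([\d\w\s]*?{0}[\d\w\s]*)";

    // Returns the trimmed tag content (capture group 2) of every match.
    public static List<string> FindContent(string html, string searchText)
    {
        // Regex.Escape protects metacharacters and whitespace in the phrase.
        string pattern = string.Format(PatternTemplate, Regex.Escape(searchText));

        return Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
            .Cast<Match>()
            .Select(m => m.Groups[2].Value.Trim())
            .ToList();
    }
}
```

Called as TagContentSketch.FindContent(file, searchText) on the sample above, this should return a single entry, "Add New Organisation". It still inherits the limits of the original pattern: content containing entities such as &#160; or nested tags is not matched, because the character class only allows digits, word characters and whitespace.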

I am trying not to overload the question here. If any further information is required, please ask. I have been banging my head against both the speed and the correctness of the matching for some time now.


There are 2 best solutions below


HTML is not a regular language, and cannot be parsed with Regular Expressions. I do not believe that there is a realistic solution to your problem that does not leverage an existing library for parsing HTML.

This is one of the most up-voted question/answer combos on StackOverflow, and I suggest you read it: RegEx match open tags except XHTML self-contained tags


I am closing this question; using the HAP (HTML Agility Pack) has solved a proportion of my requirements. Thank you all for your suggestions.