HTML parser: C# regex to extract HTML tag content, e.g. >me im the content< (div, img, a, h5 etc.). The HTML tags are closed in a number of different ways.
Why am I doing this, you might ask? I have inherited prototype code that performs phrase replacement, for example Home -> Casa (Spanish). As you can imagine, I have quite a lot of phrases (350 and rising), such as "Add New Contact", which vary in length and word count.
First requirement: a regex is required to pull out the tag content. Output must be: here is the content to be matched by the regex. This will allow me to further manipulate the string in order to perform phrase replacement.
Second requirement: a regex is required to pull out the attribute content of a tag such as: here is the content to be matched by the regex/> Output must be:
Please, please don't respond with "use the HTML Agility Pack". I have bespoke requirements that do not allow me to look at: a. a well-formed document, b. client-side XSL transforms, c. XML data islands which determine content.
string file = @"<html>
<body>
<input class='moth'>Add New Organisation </>
<input class='moth'>Org role
</>
</body>
</html>";
string searchText = "Add New Organisation";
<([\d\w]*)\b[^>]*>([\d\w\s]*?{0}[\d\w\s]*)
So, can anyone help? So far I have been using the regex above, applied like this:
var myContentMatches = new List<string>(
    Regex.Matches(file, regExpressionContent.ToString(),
            RegexOptions.IgnoreCase
            | RegexOptions.IgnorePatternWhitespace
            | RegexOptions.Multiline)
        .Cast<Match>()
        .Select(pp => pp.ToString()));
I am trying not to overload the question here. If any further information is required, please ask. I have been banging my head against both the speed and the correctness of the matching for some time now.
HTML is not a regular language, and cannot be parsed with Regular Expressions. I do not believe that there is a realistic solution to your problem that does not leverage an existing library for parsing HTML.
This is one of the most up-voted question/answer combos on Stack Overflow, and I suggest you read it: "RegEx match open tags except XHTML self-contained tags".
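If a regex has to be used regardless, here is a minimal sketch of how the pattern from the question might be applied to flat markup like the sample. It is not a general HTML parser and will not cope with nested tags. The class name `PhraseFinder` and method `FindTagContent` are hypothetical, and the sketch assumes the `{0}` placeholder is filled with `string.Format`. One detail worth noting: `RegexOptions.IgnorePatternWhitespace` strips unescaped spaces from the pattern, so a phrase such as "Add New Organisation" would no longer match as written. The sketch therefore escapes the phrase with `Regex.Escape` (which also escapes spaces) and drops that option.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original code.
public static class PhraseFinder
{
    // Returns the text that directly follows an opening tag,
    // for every tag whose following text contains searchText.
    public static List<string> FindTagContent(string file, string searchText)
    {
        // Regex.Escape protects metacharacters and spaces in the phrase.
        // \w already covers digits, so [\d\w] is simplified to \w.
        string pattern = string.Format(
            @"<(\w*)\b[^>]*>([\w\s]*?{0}[\w\s]*)",
            Regex.Escape(searchText));

        return Regex.Matches(file, pattern, RegexOptions.IgnoreCase)
            .Cast<Match>()
            .Select(m => m.Groups[2].Value.Trim()) // group 2 is the tag content
            .ToList();
    }
}
```

Called as `PhraseFinder.FindTagContent(file, searchText)` with the `file` and `searchText` values from the question, this should return the single phrase that follows the first `input` tag. Anything beyond this flat case (nested elements, attribute values containing `>`, comments, scripts) is where a regex approach breaks down.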