Regex in .NET to match newlines in multiline string yaml values beginning with a pipe symbol


Quick Summary

I have a small yaml file as an example, which I am parsing with Regex using the .NET flavor.

My goal is to match multiline strings beginning with a | symbol, as seen on line 6 below.

Input and Background

kind: ConfigMap
apiVersion: v1
metadata:
  name: {{ .Release.Name }}-nginx-index-html-configmap
data:
  startRegexMatch.html: |
    <!DOCTYPE html>
    <html>
    <head>
    <title>Up & Running!</title>
    <style>
        body {
            width: 60em;
            margin: 0 auto;
            font-family: Tahoma, Verdana, Arial, sans-serif;
        }
    </style>
    </head>
    <body>
    <h1>Up & Running!</h1>

    <pre style="float:left;">
    888        888           888          d88P  888 8888888P"  8888888P"  888     888      888d88888b888 888     888 8888888P"  8888888b        "Y88b. 888
    8888888888 88888888       "Y8888P" d88P     888 888   T88b 888   T88b  "Y88888P"       888P     Y888  "Y88888P"  888   T88b 888    Y88b  "Y8888P"  888 </pre>
    </body>
    </html>
  stopRegexMatch:

Originally I matched multiline strings using a different regex that searched for non-yaml syntax and assumed it was a string. The problem with multiline html strings is that they contain lines like:

            width: 60em;
            margin: 0 auto;
            font-family: Tahoma, Verdana, Arial, sans-serif;

as these could be legitimate yaml key-value pairs if they weren't preceded by a | several lines back. My existing regex identifies these as valid yaml and, incorrectly, doesn't regard them as part of the multiline string anymore.

Goals

So now I am attempting to reapproach this with a regex that matches everything after a | symbol, and uses the indentation anchored at the beginning of that line as a reference for when to stop matching. Going off the above simplified example, this means:

  1. There are 2 spaces before startRegexMatch.html.
  2. There are 2 spaces before stopRegexMatch
  3. I would like to:
    1. grab the line containing a '|'
    2. capture the number of indented spaces at the beginning of this line
    3. match every newline afterward that isn't followed by a line indented by the same number of spaces as the starting line, or fewer.

I match the newlines only, because I will be replacing them in a subsequent step to create a single line. I am aiming for a precise solution that will properly match for yaml files potentially spanning hundreds of lines that may have multiple occurrences of such multiline strings.

Current Attempt

My current attempt looks like this:

(?m)(?<=^(\s*).*: \|\s*(\n+\1\s+.*)*)(\r\n|\n)+(?!\1\S)

My thought process here:

  1. ^(\s*).*: \|\s*: The lookbehind checks for a line containing the string : |, and it sets a capture group of its spaces anchored at the beginning of that line. Call this capture group \1

  2. (\n+\1\s+.*) After this, it searches for lines that follow the pattern of:

    1. newlines
    2. a number of spaces greater than the capture group \1

    Everything else in the line doesn't matter. This is how I check that I am still inside the multiline string: subsequent lines must have a greater indentation than \1.

  3. The above search in #2 can repeat itself an arbitrary number of times, hence the * after that group.

    In summary, the lookbehind should be matching all newlines that are preceded by lines with a greater indentation than a line containing : |.

  4. (\r\n|\n)+ Following the lookbehind, I must match newlines.

  5. (?!\1\S) I set up a negative lookahead to rule out newlines that are followed by the end of my pattern: a line that starts with \1. This doesn't appear to work perfectly however; see problem #3 below.

Problems

  1. The main problem is that my first backreference \1 is not defined in the lookbehind, because the lookbehind parses from right to left. If I replace my above regex with: (?m)(?<=^(\s*).*: \|\s*(\n+  \s+.*)*)(\r\n|\n)+(?!\1\S), i.e., the first backreference is now hardcoded as 2 spaces, then it matches almost exactly what I want. However, I need this to be dynamic to handle an arbitrary initial indentation.

    So my first and largest issue is establishing this initial backreference somehow.

  2. My negative lookahead currently only checks for lines beginning with \1. I don't actually check if there would be less whitespace than \1—in this example, a line that might begin with 0 or 1 spaces. In short, my lookahead should be checking for anything from 0-2 whitespaces followed by \S, but I need a way to dynamically define the maximum of 2 whitespaces, as the indentation can vary in other files.

    From the regex documentation, I was wondering if maybe a balancing group could help here, but I am not sure as I've never used them before.

  3. I also have the minor issue that I am matching the final linebreak between </html> and stopRegexMatch. I thought my negative lookahead would rule the last linebreak out, but it doesn't seem to work. However, I can fiddle with it some more to maybe figure that out. My main issues are the 1st point, with the 2nd point close behind, and this 3rd point further down in priority.

Does anyone have any ideas on how to solve the above points, or maybe a better way to write the regex altogether?

Answer by The fourth bird

If you must use a regex for the example data, you might use:

(?m)(?<=^(\s*)(?:[^\s:][^:\r\n]*)?: \|\s*)(?:\r?\n(?!\1\S).*)+

Note that in C# if you want to match spaces without newlines you can use [\p{Zs}\t] instead of \s

The pattern matches:

  • (?m) Inline multiline flag
  • (?<= Positive lookbehind
    • ^(\s*) Start of a line (^ matches at every line start because of the multiline flag), followed by capturing 0+ whitespace characters in group 1
    • (?:[^\s:][^:\r\n]*)?: Optionally match a non whitespace character other than : followed by optional characters other than : or a newline, then match :
    • \|\s* Match | and optional whitespace chars
  • ) Close the lookbehind
  • (?: Non capture group to repeat as a whole part
    • \r?\n(?!\1\S).* Match a newline, assert that what follows is not the indentation captured in group 1 directly followed by a non-whitespace character, then match the rest of the line
  • )+ Close the non capture group and repeat 1+ times
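
To illustrate the note about \s versus [\p{Zs}\t], here is a small sketch rather than a definitive fix. The sample text, the class name IndentCaptureDemo and the HorizontalOnly variant are made up for this illustration; only WithBackslashS is the pattern given above. Because \s also matches line breaks, a blank line directly before the key can end up inside group 1, after which (?!\1\S) no longer recognizes the sibling key that should end the block.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IndentCaptureDemo
{
    // Pattern from the answer: group 1 uses \s*, which can also consume newlines.
    public const string WithBackslashS =
        @"(?m)(?<=^(\s*)(?:[^\s:][^:\r\n]*)?: \|\s*)(?:\r?\n(?!\1\S).*)+";

    // Hypothetical variant (an assumption, not the answer's exact pattern):
    // only horizontal whitespace is allowed for the indentation.
    public const string HorizontalOnly =
        @"(?m)(?<=^([\p{Zs}\t]*)(?:[^\s:][^:\r\n]*)?: \|[\p{Zs}\t]*)(?:\r?\n(?!\1\S).*)+";

    // Made-up sample: note the blank line before the block scalar key.
    public const string Sample = "a: 1\n\n  key: |\n    text\n  next: 2\n";

    // The indentation captured in group 1 for the first match.
    public static string Indent(string pattern) =>
        Regex.Match(Sample, pattern).Groups[1].Value;

    // The text matched after the pipe symbol for the first match.
    public static string Block(string pattern) =>
        Regex.Match(Sample, pattern).Value;

    public static void Main()
    {
        foreach (var p in new[] { WithBackslashS, HorizontalOnly })
        {
            // Regex.Escape is used only to make whitespace visible in the output.
            Console.WriteLine("indent: " + Regex.Escape(Indent(p)));
            Console.WriteLine("block:  " + Regex.Escape(Block(p)));
        }
    }
}
```

With this sample, the \s version should capture a line break plus two spaces as the indentation and run on past "next: 2", while the horizontal-only variant should capture just the two spaces and stop after "text".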

In C#

string pattern = @"(?m)(?<=^(\s*)(?:[^\s:][^:\r\n]*)?: \|\s*)(?:\r?\n(?!\1\S).*)+";

See a .NET regex demo and a C# demo
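
Since the pattern matches the whole block rather than the individual newlines, one possible way to get a single line is to rewrite the line breaks inside each match. The following is a minimal sketch under assumptions: the class name BlockScalarFolder, the sample YAML, and the choice to replace each line break with the two characters backslash and n are all made up for illustration, as the question does not say what the newlines should become.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BlockScalarFolder
{
    // Pattern copied from the answer above.
    private const string Pattern =
        @"(?m)(?<=^(\s*)(?:[^\s:][^:\r\n]*)?: \|\s*)(?:\r?\n(?!\1\S).*)+";

    // Replaces every line break inside a matched block with the literal
    // two-character sequence backslash + n (an assumed replacement).
    public static string Fold(string yaml) =>
        Regex.Replace(
            yaml,
            Pattern,
            m => m.Value.Replace("\r\n", "\\n").Replace("\n", "\\n"));

    public static void Main()
    {
        // Made-up sample with a CSS line that looks like a YAML key-value pair.
        string yaml =
            "data:\n" +
            "  page.html: |\n" +
            "    <style>\n" +
            "        body {\n" +
            "            width: 60em;\n" +
            "        }\n" +
            "    </style>\n" +
            "  other: value\n";

        Console.WriteLine(Fold(yaml));
    }
}
```

The line break before "other: value" is left alone because that line starts with the captured indentation followed by a non-whitespace character. Note that the lookahead only tests for exactly the captured indentation, so a following line indented less than the key would not end the match; that case needs separate handling.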