Regex for web extraction. Positive lookahead issues


Below is an example of some data I'm using. I've read a number of posts involving this topic, as well as tried for a while on regex101.

BotInfo[-]: Source IP:[10.1.1.100] Target Host:[CentOS70-1] Target OS:[CentOS 7.0] Description:[HTTP Connection Request] Details:[10.1.1.101 - - [28/May/2013:12:24:08 +0000] "GET /math/html.mli HTTP/1.0" 404 3567 "-" "-" ] Phase:[Access] Service:[WEB]

The goal is to have two capture groups: one for the tag (e.g. Source IP, Target Host, Description, etc.) and another for the content contained in the outermost square brackets.

It's the "outermost" that's getting me, because the content for the Details tag has square brackets in it.

Here is my current progress on said regex. I am using the /g flag:

\s?([^:]+):\[(.*?(?=\]\s.*?:\[))\]

This handles everything except the edge case (it's more complex than needed because I've been fiddling with trying to get the edge case to work).

My current lookahead (\]\s.*?:\[) is meant, at a high level, to match the closing bracket and then the next tag. Another issue is that it fails on the last match, because no tag follows it.
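
For reference, the two problems described above (the inner bracket in Details and the missing tag after the last field) can be sketched in Python's standard re module. This is only a heuristic, not a general solution: it assumes the sample is a single line, that tag names consist of word characters and spaces, and that a value ends at a ] followed either by the next tag or by the end of the input. The names sample and FIELD are made up for this illustration.

```python
import re

# The sample line from the question, assumed here to be one single line.
sample = (
    'BotInfo[-]: Source IP:[10.1.1.100] Target Host:[CentOS70-1] '
    'Target OS:[CentOS 7.0] Description:[HTTP Connection Request] '
    'Details:[10.1.1.101 - - [28/May/2013:12:24:08 +0000] '
    '"GET /math/html.mli HTTP/1.0" 404 3567 "-" "-" ] '
    'Phase:[Access] Service:[WEB]'
)

# The lazy value stops at the first "]" that is followed by either
#   - whitespace, a tag made of word characters/spaces, and ":[", or
#   - optional whitespace and the end of the input (covers the last field).
FIELD = re.compile(r'\s?([^:]+):\[(.*?)\](?=\s+[\w ]+:\[|\s*$)')

pairs = FIELD.findall(sample)
for tag, value in pairs:
    print(repr(tag), repr(value))
```

The lookahead only inspects what follows the closing bracket and does not consume it, so the next match can start at the following tag. A value that itself contains text shaped like "] name:[" would still be split too early.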


Edit: An example of successful output was requested. Using the data provided, the goal is to have two capture groups resulting in these pairs:

MATCH 1
1.  `Source IP`
2.  `10.1.1.100`
MATCH 2
1.  `Target Host`
2.  `CentOS70-1`
MATCH 3
1.  `Target OS`
2.  `CentOS 7.0`
MATCH 4
1.  `Description`
2.  `HTTP Connection Request`
MATCH 5
1.  `Details`
2.  `10.1.1.101 - - [28/May/2013:12:24:08 +0000] "GET /math/html.mli HTTP/1.0" 404 3567 "-" "-" `
MATCH 6
1.  `Phase`
2.  `Access` 
MATCH 7
1.  `Service`
2.  `WEB`

On BEST ANSWER

Heavily inspired by this answer about nested patterns, I ended up with this regex (demo here):

\s*([\w ]+):\s*(\[((?>[^[\]]+|(?2))*)\])

The main idea is to match bracketed content as far as possible: whenever a nested opening bracket is found, the pattern recurses into the second group with (?2). The data you're looking for is in fact in the first and third capture groups; the second group captures the value together with its outer brackets, so that the recursion works properly.
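
The (?2) recursion is a PCRE-style feature, and Python's standard re module does not support it. As a rough illustration of what the recursion achieves, here is a sketch that swaps in a different technique: it finds each tag with a plain regex and then walks the string with an explicit bracket-depth counter. It assumes the sample is a single line; sample, TAG_OPEN and extract_fields are hypothetical names introduced only for this example.

```python
import re

# The sample line from the question, assumed here to be one single line.
sample = (
    'BotInfo[-]: Source IP:[10.1.1.100] Target Host:[CentOS70-1] '
    'Target OS:[CentOS 7.0] Description:[HTTP Connection Request] '
    'Details:[10.1.1.101 - - [28/May/2013:12:24:08 +0000] '
    '"GET /math/html.mli HTTP/1.0" 404 3567 "-" "-" ] '
    'Phase:[Access] Service:[WEB]'
)

# A tag followed by the opening bracket of its value, e.g. "Source IP:[".
TAG_OPEN = re.compile(r'([\w ]+):\s*\[')


def extract_fields(line):
    """Return (tag, value) pairs; value is the text inside the
    outermost [...] that follows each tag."""
    pairs = []
    pos = 0
    while True:
        m = TAG_OPEN.search(line, pos)
        if m is None:
            break
        depth = 1          # we are just past the opening "["
        i = m.end()
        while i < len(line) and depth > 0:
            if line[i] == '[':
                depth += 1
            elif line[i] == ']':
                depth -= 1
            i += 1
        if depth != 0:
            break          # unbalanced brackets: stop rather than guess
        # i is one past the matching "]", so the value ends at i - 1.
        pairs.append((m.group(1).strip(), line[m.end():i - 1]))
        pos = i
    return pairs


fields = extract_fields(sample)
for tag, value in fields:
    print(repr(tag), repr(value))
```

Like the recursive pattern, the counter handles any nesting depth, because the value only ends when the depth returns to zero.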

Details on the regex:

  • \s* matches (and discards) any spaces before the field name
  • ([\w ]+): captures the field name (everything before the :)
  • \s* again discards any spaces, this time before the value
  • (\[ starts the second capture group and matches a literal [
  • ((?>[^[\]]+ starts the third capture group with an atomic group (no backtracking into it, which avoids runaway matching) that matches a run of anything but brackets
  • |(?2)) otherwise, if a bracket is found, tries to match the whole second group again
  • *) repeats the atomic group with its alternation zero or more times to handle nested brackets, and ends the third capture group
  • \]) matches the final closing bracket and ends the second capture group, the one used by (?2) in the alternation.
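
The breakdown above can be mirrored as a commented pattern. Since Python's standard re module has no (?2) recursion, this sketch replaces the recursive call with one explicit level of nested brackets. That is an assumption that happens to fit the sample data, not an equivalent of the recursive regex: deeper nesting would need real recursion or a manual scan. The value also ends up in group 2 here rather than group 3, because the wrapper group that existed only for the recursion is gone. The names sample and FIELD are made up for this illustration.

```python
import re

# The sample line from the question, assumed here to be one single line.
sample = (
    'BotInfo[-]: Source IP:[10.1.1.100] Target Host:[CentOS70-1] '
    'Target OS:[CentOS 7.0] Description:[HTTP Connection Request] '
    'Details:[10.1.1.101 - - [28/May/2013:12:24:08 +0000] '
    '"GET /math/html.mli HTTP/1.0" 404 3567 "-" "-" ] '
    'Phase:[Access] Service:[WEB]'
)

FIELD = re.compile(r"""
    \s*                      # skip spaces before the field name
    ([\w ]+) :               # group 1: field name, up to the colon
    \s*                      # skip spaces before the value
    \[                       # literal opening bracket of the value
    (                        # group 2: the value without its outer brackets
      (?:
          [^\[\]]            # any single character that is not a bracket
        | \[ [^\[\]]* \]     # or one nested bracket pair (stand-in for the recursion)
      )*
    )
    \]                       # literal closing bracket of the value
""", re.VERBOSE)

pairs = FIELD.findall(sample)
for tag, value in pairs:
    print(repr(tag), repr(value))
```

The inner alternatives cannot both match at the same position (one requires a non-bracket character, the other starts with a bracket), which keeps backtracking under control without needing an atomic group.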