Oniguruma regex: negative lookbehind not working

I am trying to capture a line in a log file using the Oniguruma regex library (in Logstash) with a negative lookbehind, but it still matches the line that it shouldn't. I want to match only the top-level exception, not the one starting with Caused by:

Somebody helped me write this

Tested on Rubular http://rubular.com/r/N3AzySNHiS

Tested Regex

^(?<!Caused by: ).*?Exception

(?<!^Caused by: ).*?Exception

Message:

2016-11-15 05:19:28,801 ERROR [App-Initialisation-Thread] appengine.java:520 Failed to initialize external authenticator myapp Support Access || appuser@vm23-13:/mnt/data/install/assembly app-1.4.12@cad85b224cce11eb5defa126030f21fa867b0dad
java.lang.IllegalArgumentException: Could not check if provided root is a directory
    at com.myapp.jsp.KewServeInitContextListener$1.run(QServerInitContextListener.java:104)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.file.NoSuchFileException: fh-ldap-config/
    at com.upplication.s3fs.util.S3Utils.getS3ObjectSummary(S3Utils.java:55)
    at com.upplication.s3fs.util.S3Utils.getS3FileAttributes(S3Utils.java:64)

Logstash result

"exception" => "Caused by: java.nio.file.NoSuchFileException"

Wiktor Stribiżew (best answer)

It seems some additional options are set in your Logstash environment. From my tests, I suspect the "verbose" or "ignore whitespace" option is enabled. Also, to rule out any other issues with . (which may be redefined to match line-break characters), you may use the unambiguous [^\r\n] (any character other than \r and \n):

^(?!Caused\ by:)(?<exception>[^\r\n]*?Exception)
          ^^                 ^^^^^^^

The escaped space will always match a single regular space.
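
The effect of a whitespace-ignoring mode can be sketched outside Logstash. The snippet below is only a rough illustration using Python's re module rather than Oniguruma, so several things are assumptions: re.VERBOSE stands in for whatever extended option Logstash might have enabled, re.MULTILINE is added because Python needs it for ^ to match at each line start, and the named group is written (?P<exception>...) because Python does not accept (?<exception>...). The sample text is the stack-trace part of the message from the question.

```python
import re

# Stack-trace portion of the log message from the question.
LOG = "\n".join([
    "java.lang.IllegalArgumentException: Could not check if provided root is a directory",
    "    at com.myapp.jsp.KewServeInitContextListener$1.run(QServerInitContextListener.java:104)",
    "    at java.lang.Thread.run(Thread.java:745)",
    "Caused by: java.nio.file.NoSuchFileException: fh-ldap-config/",
    "    at com.upplication.s3fs.util.S3Utils.getS3ObjectSummary(S3Utils.java:55)",
    "    at com.upplication.s3fs.util.S3Utils.getS3FileAttributes(S3Utils.java:64)",
])

# Assumption: verbose mode approximates the suspected Logstash setting.
FLAGS = re.MULTILINE | re.VERBOSE

# Unescaped space: verbose mode drops it, so the lookahead tests for "Causedby:".
unescaped = re.compile(r"^(?!Caused by:)(?P<exception>[^\r\n]*?Exception)", FLAGS)
# Escaped space: remains a literal space even in verbose mode.
escaped = re.compile(r"^(?!Caused\ by:)(?P<exception>[^\r\n]*?Exception)", FLAGS)


def exceptions(pattern):
    """Return every 'exception' capture found in LOG."""
    return [m.group("exception") for m in pattern.finditer(LOG)]


print(exceptions(unescaped))  # also captures the "Caused by: ..." line
print(exceptions(escaped))    # captures only the top-level exception
```

If the unescaped variant also captures the Caused by: line in your environment, that is a hint that a whitespace-ignoring option is active.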

ddrake12

Note: Throughout this answer I assume that the two log lines shown in the question (and repeated below) contain no newlines, i.e. that they have been processed through the multiline codec plugin in Logstash or that the newlines were removed in some other way.

TL;DR The Solution Using a Negative Lookbehind

A negative lookbehind will work if it is followed by an appropriate anchor. For the two lines in question, this works well:

^(?<!Caused by: )java.*Exception

Note: it could just be ^(?<!Caused by: )j.*Exception but I think the java makes it more readable.
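
As a quick check of this pattern, here is a small sketch in Python's re module (an assumption: Logstash uses Oniguruma, and the two engines may differ in details). The two strings are the log lines from the question with newlines replaced by spaces, matching the assumption stated at the top of this answer.

```python
import re

# The two log lines with newlines removed, as this answer assumes.
TOP_LEVEL = (
    "java.lang.IllegalArgumentException: Could not check if provided root is a directory"
    " at com.myapp.jsp.KewServeInitContextListener$1.run(QServerInitContextListener.java:104)"
    " at java.lang.Thread.run(Thread.java:745)"
)
CAUSED_BY = (
    "Caused by: java.nio.file.NoSuchFileException: fh-ldap-config/"
    " at com.upplication.s3fs.util.S3Utils.getS3ObjectSummary(S3Utils.java:55)"
    " at com.upplication.s3fs.util.S3Utils.getS3FileAttributes(S3Utils.java:64)"
)

ANCHORED = re.compile(r"^(?<!Caused by: )java.*Exception")

top = ANCHORED.search(TOP_LEVEL)
caused = ANCHORED.search(CAUSED_BY)

print(top.group(0) if top else None)  # the top-level exception is captured
print(caused)                         # the "Caused by:" line is rejected
```

In this sketch the Caused by: line is rejected because the literal java must appear immediately at the start of the string.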

Explanation of Problem with Sample Code

The problem with the given regular expressions, ^(?<!Caused by: ).*?Exception and (?<!^Caused by: ).*?Exception, is the reluctant *? quantifier, which allows something to be matched 0 or more times. As explained in this answer, the regex engine starts at the beginning of the string and moves left to right. The smallest possible number of characters (since the quantifier is reluctant) is none, but the engine cannot match Exception there, so it incrementally tries to match one more character (.) before Exception ("backtracking"), moving left to right.

So the regex engine keeps trying to match one more character at a time (from left to right) until Exception is found after what it has consumed. Therefore the string

Caused by: java.nio.file.NoSuchFileException: fh-ldap-config/ at com.upplication.s3fs.util.S3Utils.getS3ObjectSummary(S3Utils.java:55) at com.upplication.s3fs.util.S3Utils.getS3FileAttributes(S3Utils.java:64)

does match, because the engine has consumed everything up to Exception and Caused by: doesn't appear before the start of this match. Essentially, the .*? has consumed the Caused by: that the negative lookbehind is looking for.
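
To make the position of the lookbehind visible, here is a minimal sketch in Python's re module (an assumption, since Logstash uses Oniguruma; the behaviour shown is the same for this simple pattern as far as I know). The second pattern is a hypothetical variant added only for illustration: it asserts the same lookbehind directly in front of the literal java instead of at the start of the string.

```python
import re

# Start of the "Caused by:" line from the question.
CAUSED_BY = "Caused by: java.nio.file.NoSuchFileException: fh-ldap-config/"

# Original pattern: the lookbehind is asserted at the start of the string,
# where nothing precedes it, and .*? then consumes "Caused by: " itself.
original = re.search(r"^(?<!Caused by: ).*?Exception", CAUSED_BY)
print(original.start(), repr(original.group(0)))

# Hypothetical variant: the lookbehind is asserted right before "java",
# where "Caused by: " really is the preceding text, so the match fails.
variant = re.search(r"(?<!Caused by: )java.*?Exception", CAUSED_BY)
print(variant)
```

The match of the original pattern starts at index 0 and includes the Caused by: prefix, which is exactly the Logstash result shown in the question.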

Understanding Deeper

To understand what the regex engine is actually doing with lookarounds, I recommend viewing this answer.

I think it's easy to get tripped up by quantifiers and lookarounds, and as a general rule I think lookarounds need to be anchored by something concrete (not .). To see what I mean, let's look at a slight variation on the given regex that uses the greedy * quantifier. The regex ^(?<!Caused by: ).*Exception also matches the quoted string.

The reason is that the greedy * quantifier starts by consuming the entire string and then backtracks from right to left, as explained in the first linked answer above. For the same reason (but from the other side), once the engine matches Exception it holds everything from the start of the string up to Exception. The lookbehind looks behind what has been consumed, does not find Caused by: there, and the string matches successfully.
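
A short sketch comparing the two quantifiers on the quoted string, again in Python's re module as a stand-in for Oniguruma (an assumption):

```python
import re

# The "Caused by:" line from the question with newlines removed.
CAUSED_BY = (
    "Caused by: java.nio.file.NoSuchFileException: fh-ldap-config/"
    " at com.upplication.s3fs.util.S3Utils.getS3ObjectSummary(S3Utils.java:55)"
    " at com.upplication.s3fs.util.S3Utils.getS3FileAttributes(S3Utils.java:64)"
)

reluctant = re.search(r"^(?<!Caused by: ).*?Exception", CAUSED_BY)
greedy = re.search(r"^(?<!Caused by: ).*Exception", CAUSED_BY)

# Both quantifiers end up with the same text, because the string contains
# "Exception" only once; they merely reach it from opposite directions.
print(repr(reluctant.group(0)))
print(repr(greedy.group(0)))
```

Switching between greedy and reluctant therefore does not help here; only what is anchored around the lookbehind changes the outcome.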

In Summary, as a General Rule

Always anchor lookarounds when using greedy or reluctant quantifiers.