How would I rewrite PCRE regular expressions so that they're compatible with JavaScript?

As part of a security-related project written in Node.js, I'm looking at some of the work done by the team behind PHPIDS, specifically their filter list, which is composed of a large number of regular expressions that match a variety of different attack payloads.

I'm fully aware that this project hasn't been maintained for almost eight years now, but I still see how these filters could play a valuable role in a larger detection system.

With that out of the way, I have been struggling to find a good way to "convert" some of these PCRE-specific expressions to a format that is compatible with the standard JavaScript implementation.

So far I've tried different tools, such as regex101, pcre-to-regexp and babel-plugin-transform-modern-regexp, but they all choke on the same features: "negative lookbehinds" and "group conditionals".

From what I understand, many features that have been lacking in the JS implementation are on their way, which is great, but there's basically no word on these two specifically (as far as I can find).

My hope is that for someone who actually understands the inner workings of these features, rewriting these could be fairly straightforward, maybe using a combination of significantly less complex expressions and/or some extra processing before and after they are run, to act more or less like a "polyfill".

I'm attaching a link to one of these patterns on RegExr, because of its incredibly helpful autogenerated explanation of the pattern and all of its parts. The full expression is included here as well.

RegExr: Pattern with PCRE features

([^*:\\s\\w,.\\\/?+-]\\s*)?(?<![a-z]\\s)(?<![a-z\\\/_@\\-\\|])(\\s*return\\s*)?(?:create(?:element|attribute|textnode)|[a-z]+events?|setattribute|getelement\\w+|appendchild|createrange|createcontextualfragment|removenode|parentnode|decodeuricomponent|\\wettimeout|(?:ms)?setimmediate|option|useragent)(?(1)[^\\w%\"]|(?:\\s*[^@\\s\\w%\",.+\\-]))

It can't be impossible to achieve the same thing as this nearly decade-old expression in JavaScript, can it?

There is 1 best solution below

Most modern JavaScript engines support negative lookbehinds, so the only feature in your regex that is not supported is the conditional group (?(1)subpattern1|subpattern2), which chooses a subpattern to try to match based on whether anything was matched by the first capture group.
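Lookbehind assertions were standardized in ES2018, so whether they work depends on the engine the code runs on. If you want to check at runtime rather than assume, a minimal sketch could look like the following. The function name supportsLookbehind is made up for illustration; the check relies only on the RegExp constructor throwing a SyntaxError for syntax the engine does not understand.

```javascript
// Hypothetical helper (not part of any library): reports whether the
// current engine can compile a negative lookbehind.
function supportsLookbehind() {
  try {
    // Use the constructor rather than a literal, so that an unsupported
    // engine throws at runtime instead of failing to parse the whole file.
    new RegExp('(?<!a)b');
    return true;
  } catch (err) {
    return false;
  }
}

console.log(supportsLookbehind());
```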

This can be emulated by applying the regex with the conditional group removed and then, if there is a match, checking whether anything was matched by the first capture group:

let rex = new RegExp(patternWithoutConditionalGroup, 'i');
let match = text.match(rex);
if (match !== null) {
  // match[1] is undefined when the optional first group did not participate
  const subpattern = match[1] !== undefined ? subpattern1 : subpattern2;

and then concatenating subpattern1 or subpattern2 to the regex accordingly and re-applying it:

  rex = new RegExp(patternWithoutConditionalGroup + subpattern, 'i');
  match = text.match(rex);
}
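One thing to watch with the two-pass approach: the first pass can settle on a match where group 1 participated even though the only complete match is one where it did not, so the re-applied regex may come back empty. An alternative that avoids this, offered here as a sketch rather than as the method described above, is to expand the conditional into a plain alternation: (G)?M(?(1)A|B) becomes (?:(G)MA|MB). The helper name expandConditional and its parameters are hypothetical, and the a/b/c/d pattern is a toy stand-in for the real filter.

```javascript
// Hypothetical helper: rewrites (G)?M(?(1)A|B) as (?:(G)MA|MB).
// All arguments are regex source strings.
function expandConditional(group1, middle, ifSet, ifUnset, flags = 'i') {
  // Wrap each part in a non-capturing group so that any top-level
  // alternation inside it stays contained.
  const m = `(?:${middle})`;
  const source = `(?:(${group1})${m}(?:${ifSet})|${m}(?:${ifUnset}))`;
  return new RegExp(source, flags);
}

// Toy equivalent of the PCRE pattern (a)?b(?(1)c|d)
const rex = expandConditional('a', 'b', 'c', 'd');

console.log(rex.test('abc'));        // true: group 1 matched, so "c" is required
console.log(rex.test('bd'));         // true: group 1 absent, so "d" is required
console.log('abd'.match(rex)[0]);    // "bd": the match starts after the "a"
console.log(rex.test('ab'));         // false: neither branch can complete
```

Because the middle part appears twice in the expanded source, any capture groups inside it are duplicated and renumbered in the second branch, so code that reads groups other than group 1 (or a pattern that uses backreferences) would need adjusting.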

Let me know how you get on.