Is this a bug in Ruby Regexp? How to guard against an "infinite loop" from a regex match without using Timeout?


I have this regex:

regex = /(Si.ges[a-zA-Z\W]*avec\W*fonction\W*m.moires)/i

And when I use it on some, but not all, texts e.g. this one:

text = "xation de 2 sièges-enfants sur la banquette AR),Pack \"Assistance\",Keyless Access avec alarme : Système de verrouillage/déverrouillage et de démarrage sans clé,Park Assist: Système d'assistance au stationnement en créneauet et en bataille,Rear Assist: Caméra de recul avec visualisation de la zone situ"

like so: text.match(regex), Ruby appears to run in an infinite loop. Why? And is there any way to guard against this, e.g. by having Ruby raise an exception instead, without using Timeout? Timeout is a known problem when used with Sidekiq (https://github.com/mperham/sidekiq/wiki/Problems-and-Troubleshooting#add-timeouts-to-everything).

Ruby version: 2.7.2

Best answer, by sln:

Built-in character classes tend to be table-driven.
Because of that, negated built-in classes such as \W and \S
are difficult for engines to merge into a positive character class.

In this case a bug looks likely because, as you said, the match only hangs on
some target strings.

In fact, [a-xzA-XZ\W] works on the sample string. The match times out as soon as y or Y is put back into the class,
but only for that particular string.

Let's see if we can determine if this is a bug or not.

First, some tests:

Test - Fail [a-zA-Z\W]

https://rextester.com/FHUQG84843

# Test - Fail  [a-zA-Z\W]
puts "Hello World!"
# Same pattern as in the question, with the u (UTF-8) flag added.
regex = /(Si.ges[a-zA-Z\W]*avec\W*fonction\W*m.moires)/ui
text = "xation de 2 sièges-enfants sur la banquette AR),Pack \"Assistance\",Keyless Access avec alarme : Système de verrouillage/déverrouillage et de démarrage sans clé,Park Assist: Système d'assistance au stationnement en créneauet et en bataille,Rear Assist: Caméra de recul avec visualisation de la zone situ"
# In the failing case this call does not return.
res = text.match(regex)
# Printed only if the match returns.
puts "Done"

Test - Pass [a-xzA-XZ\W]

https://rextester.com/RPV28606

Test - Pass [a-zA-Z\P{Word}]

https://rextester.com/DAMW9069


Conclusion: report this as a bug.
IMO the bug is in the built-in class \W, which is engine-defined,
whereas \P{Word} is resolved through a Unicode property function, not a range.
And we see that [a-zA-Z\P{Word}] works just fine.
Use \P{Word} inside character classes as a temporary workaround.
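
As a minimal sketch of that workaround, the pattern from the question can be kept as-is with only the \W inside the bracket class swapped for \P{Word}. The helper name memory_seat_phrase and the two short sample strings are made up for illustration; they are not from the question.

```ruby
# Workaround pattern: \P{Word} replaces \W inside the bracket class only.
# The \W* parts outside the class are left unchanged.
WORKAROUND = /(Si.ges[a-zA-Z\P{Word}]*avec\W*fonction\W*m.moires)/i

# Hypothetical helper: returns the captured phrase, or nil when nothing matches.
def memory_seat_phrase(text)
  match = WORKAROUND.match(text)
  match && match[1]
end

# Illustrative inputs (assumptions, not taken from the question).
puts memory_seat_phrase("Sièges avant avec fonction mémoires").inspect
puts memory_seat_phrase("2 sièges-enfants sur la banquette AR").inspect
```

The first call should return the whole phrase and the second nil, since "avec fonction" never appears after "sièges" there.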

In reality, when modern-day engines were first designed, a negated class [^...]
was handled by treating each item as AND NOT. A positive class treats each item
as OR, so combining the two can produce errors in scope.
Perl still had character class errors until a short time ago.
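
On the second half of the question (failing fast without Timeout): one option, not covered by the tests above, is the per-regexp timeout that Ruby gained in 3.2. The sketch below assumes an upgrade from 2.7.2 is possible. match_with_limit is a hypothetical helper name, and whether the limit actually interrupts this particular hang has not been verified here.

```ruby
# Sketch only: assumes Ruby 3.2+, where Regexp.new accepts a timeout: keyword.
# match_with_limit is a hypothetical helper, not part of Ruby or Sidekiq.
def match_with_limit(source, text, seconds: 1.0)
  unless Regexp.respond_to?(:timeout=)
    raise "per-regexp timeouts need Ruby 3.2 or newer (running #{RUBY_VERSION})"
  end

  regex = Regexp.new(source, Regexp::IGNORECASE, timeout: seconds)
  # Raises Regexp::TimeoutError if matching runs longer than the limit.
  regex.match(text)
end

# The workaround pattern from the answer, given as a string source.
source = '(Si.ges[a-zA-Z\P{Word}]*avec\W*fonction\W*m.moires)'

begin
  result = match_with_limit(source, "Sièges avant avec fonction mémoires")
  puts result ? result[1] : "no match"
rescue StandardError => e
  # Covers both the version check above and a regexp timeout.
  puts "#{e.class}: #{e.message}"
end
```

Because the limit is enforced inside the regexp engine rather than by a separate timer thread, this avoids the Timeout module the Sidekiq wiki warns about.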