I have this regex:
regex = /(Si.ges[a-zA-Z\W]*avec\W*fonction\W*m.moires)/i
And when I use it on some, but not all, texts e.g. this one:
text = "xation de 2 sièges-enfants sur la banquette AR),Pack \"Assistance\",Keyless Access avec alarme : Système de verrouillage/déverrouillage et de démarrage sans clé,Park Assist: Système d'assistance au stationnement en créneauet et en bataille,Rear Assist: Caméra de recul avec visualisation de la zone situ"
like so: text.match(regex), then Ruby just runs in what seems like an infinite loop. Why? And is there any way to guard against this, e.g. by having Ruby raise an exception instead, without using Timeout, which has a known issue when used with Sidekiq (https://github.com/mperham/sidekiq/wiki/Problems-and-Troubleshooting#add-timeouts-to-everything)?
Ruby version: 2.7.2
Built-in character classes are more table-driven.
Given that, negative built-in ones like \W, \S, etc. are difficult for engines to merge into a positive character class.
In this case, there are some obvious bugs because, as you've said, it doesn't time out on some target strings.
In fact, [a-xzA-XZ\W] works given the sample string. It times out when Y is included anywhere, but just for that particular string.
Let's see if we can determine if this is a bug or not.
First, some tests:
Test - Fail [a-zA-Z\W]
https://rextester.com/FHUQG84843
Test - Pass [a-xzA-XZ\W]
https://rextester.com/RPV28606
Test - Pass [a-zA-Z\P{Word}]
https://rextester.com/DAMW9069
Conclusion: Report this as a BUG.
IMO this is a BUG with their built-in class \W, which is engine-defined, since \P{Word} is a Unicode property defined function, not a range. And we see that [a-zA-Z\P{Word}] works just fine.
Use \P{Word} inside classes as a temporary workaround.
In reality, when modern-day engines were first designed, the logic of a negative class [^] was that each item is AND NOT, which, when combined with a positive class where each item is ORed, results in errors in scope.
Perl still had class errors until a short time ago.
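As a rough sketch of the \P{Word} workaround, here is the regex from the question with only the bracketed class changed, run against the posted sample text. The variable names are illustrative, and the two checks on "è" are an added caveat rather than part of the tests above: in Ruby \W is defined as [^a-zA-Z0-9_], while \P{Word} follows Unicode, so the two are not identical sets.

```ruby
# Sample text from the question, truncated exactly as posted.
text = "xation de 2 sièges-enfants sur la banquette AR),Pack \"Assistance\",Keyless Access avec alarme : Système de verrouillage/déverrouillage et de démarrage sans clé,Park Assist: Système d'assistance au stationnement en créneauet et en bataille,Rear Assist: Caméra de recul avec visualisation de la zone situ"

# Workaround: \P{Word} replaces \W inside the bracketed class only.
# The \W* parts outside the brackets are left as in the original regex.
workaround = /(Si.ges[a-zA-Z\P{Word}]*avec\W*fonction\W*m.moires)/i

# The sample never contains "fonction", so no match is expected here;
# the point is that the call returns instead of spinning.
workaround_result = text.match(workaround)
puts workaround_result.inspect

# Caveat: the two classes differ on accented letters.
ascii_non_word   = "è".match?(/\W/)        # true:  \W is [^a-zA-Z0-9_]
unicode_non_word = "è".match?(/\P{Word}/)  # false: è is a Unicode word character
puts ascii_non_word, unicode_non_word
```

Because of that difference, [a-zA-Z\P{Word}] stops at accented letters such as è where [a-zA-Z\W] would have continued, so results can differ from the original pattern on text that has accents between the two keywords. Treat it as a stopgap and re-check it against your real inputs.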