I have a list of several hundred strings and an array of 10k regular expressions. I now have to iterate over all strings and check which of the 10k regular expressions match. What's the most performant way to do this?
Currently I'm doing this:
    myRegularExpression.firstMatch(in: myString, options: myMatchingOption, range: NSMakeRange(0, myString.utf16.count)) == nil

where myRegularExpression is an NSRegularExpression stored for reuse and myMatchingOption is NSRegularExpression.MatchingOptions(rawValue: 0).
Is there a faster, more performant way to check if a string matches one of those 10k regular expressions?
EDIT:
I need to know not only IF one of my 10k regular expressions matches but also WHICH one. So currently I have a for loop inside a for loop: the outer one iterates over my several hundred strings, and for each of those strings I iterate over my 10k rules to see if one of them fits (of course, once one fits I can stop for that string). So roughly:
    outer: for string in stringsToCheck {
        for rule in myRules {
            if string.matches(rule) {
                // match found; continue with the next string of stringsToCheck
                continue outer
            }
        }
    }
Depending on what platform you're running this on, splitting the work across multiple threads may provide some response-time improvement, but I believe that really dramatic optimization would require some insight into the nature of the regular expressions.
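Since NSRegularExpression is documented to be thread-safe, the strings can be matched in parallel with GCD. Here is a minimal sketch under that assumption; the function name and the idea of returning, per string, the index of the first matching rule are mine, not from the question:

    import Foundation

    // For each string, find the index of the first rule that matches (or nil).
    // The iterations run concurrently; NSRegularExpression itself is thread-safe.
    func matchInParallel(stringsToCheck: [String],
                         myRules: [NSRegularExpression]) -> [Int?] {
        var results = [Int?](repeating: nil, count: stringsToCheck.count)
        let lock = NSLock()

        DispatchQueue.concurrentPerform(iterations: stringsToCheck.count) { i in
            let string = stringsToCheck[i]
            let range = NSRange(string.startIndex..., in: string)
            let match = myRules.firstIndex { rule in
                rule.firstMatch(in: string, options: [], range: range) != nil
            }
            // Serialize writes to the shared results array.
            lock.lock()
            results[i] = match
            lock.unlock()
        }
        return results
    }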
For example, if the expressions don't have a specific precedence order, you could rearrange (reorder) them to make the most likely "matches" come first in the list. This ordering could be established up front, either by the supplier of the expressions or by some function that estimates each expression's complexity (e.g. its length, or the presence of optional or combinatory symbols). Or it could be established statistically by collecting (and persisting) hit/miss counts for each expression, as sketched below. Of course such an optimization assumes that every string will match at least one expression and that the 80/20 rule applies (i.e. 20% of the expressions match 80% of the strings).
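A minimal sketch of the statistical variant, with illustrative names of my own (RankedRule, reorderByHitRate): run a sample of strings through the rules, count hits, and sort the rules by descending hit count for subsequent runs.

    import Foundation

    struct RankedRule {
        let regex: NSRegularExpression
        var hitCount = 0
    }

    func reorderByHitRate(rules: [NSRegularExpression],
                          sampleStrings: [String]) -> [NSRegularExpression] {
        var ranked = rules.map { RankedRule(regex: $0) }
        for string in sampleStrings {
            let range = NSRange(string.startIndex..., in: string)
            // Credit the first rule that matches, then move on to the next string.
            if let i = ranked.firstIndex(where: {
                $0.regex.firstMatch(in: string, options: [], range: range) != nil
            }) {
                ranked[i].hitCount += 1
            }
        }
        // Most frequently matching rules first; persist the counts between runs.
        return ranked.sorted { $0.hitCount > $1.hitCount }.map { $0.regex }
    }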
If the expressions are very simple and only make use of letter patterns, then you would get better performance from more "manual" implementations of the matching functions (instead of regex). In the best-case scenario, simple letter patterns can be converted into a character tree and yield orders of magnitude in performance improvement; see the sketch below.
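As a sketch of that idea (all names are illustrative): insert each literal pattern into a tree keyed by character, tagged with its rule index, and a single walk over a string then finds a matching rule without running any regex. The prefix-matching semantics here are one assumption among several possible ones.

    final class TrieNode {
        var children: [Character: TrieNode] = [:]
        var ruleIndex: Int?   // set on the node where a pattern ends
    }

    final class LiteralTrie {
        private let root = TrieNode()

        func insert(_ pattern: String, ruleIndex: Int) {
            var node = root
            for ch in pattern {
                let next = node.children[ch] ?? TrieNode()
                node.children[ch] = next
                node = next
            }
            node.ruleIndex = ruleIndex
        }

        // Returns the rule index of the longest pattern that prefixes `string`,
        // or nil if no pattern does.
        func match(_ string: String) -> Int? {
            var node = root
            var found: Int?
            for ch in string {
                guard let next = node.children[ch] else { break }
                node = next
                if let i = node.ruleIndex { found = i }
            }
            return found
        }
    }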
Note that these solutions are not mutually exclusive. For example, if a large proportion of the expressions are simple patterns and only a few have complex patterns, then you don't have to throw the baby out with the bathwater: you can apply the simple-pattern optimization to that subset of the rules and use the "brute force" nested loop for the remaining complex ones.
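Splitting the rule set could be as simple as scanning each pattern for regex metacharacters; patterns without any can be routed to the character tree. The metacharacter list below is my assumption of what counts as "simple", not something from the question:

    import Foundation

    // Patterns containing none of these characters behave as plain literals.
    let regexMetacharacters = CharacterSet(charactersIn: ".*+?|()[]{}^$\\")

    func isLiteralPattern(_ pattern: String) -> Bool {
        return pattern.rangeOfCharacter(from: regexMetacharacters) == nil
    }

    func partition(patterns: [String]) -> (literals: [String], regexes: [String]) {
        return (patterns.filter(isLiteralPattern),
                patterns.filter { !isLiteralPattern($0) })
    }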
I had a similar problem in the past where thousands of rules needed to be applied to hundreds of thousands of records for processing insurance claims. The traditional "expert system" approach was to create a list of rules and run every record through it. Obviously that would take a ridiculous amount of time (something like 2 months of execution time to process one month of claims). Looking at it with a less than "purist" mindset, I was able to convince my customer that the rules should be defined hierarchically. So we split them into a set of eligibility rules and a set of decision rules. Then we further refined the structure by creating eligibility groups and decision groups. What we ended up with was a coarse tree structure in which the early rules let the system narrow down the set of rules that should be applied to a given record. With this, the 6-week processing time for 250,000 records was cut down to 7 hours (this was in 1988, mind you).
All this to say that stepping back to consider the nature of the problem may reveal optimization opportunities that are not visible when looking merely at the mechanics of one processing option.