Check if a regex is ambiguous

2.5k Views Asked by At

I wonder if there is a way to check the ambiguity of a regular expression automatically. A regex is considered ambiguous if there is an string which can be matched by more that one ways from the regex. For example, given a regex R = (ab)*(a|b)*, we can detect that R is an ambiguous regex since there are two ways to match string ab from R.

UPDATE

The question is about how to check if the regex is ambiguous by definition. I know in practical implementation of regex mechanism, there is always one way to match a regex, but please read and think about this question in academic way.

4

There are 4 best solutions below

9
On

You are forgetting greed. Usually one section gets first dibs because it is a greedy match, and so there is no ambiguity.

If instead you are talking about a mythical pattern matching engine without the practical details like greed; then the answer is yes you can.

Take every element of the pattern. And try every possible subset against every possible string. If more than one subset matches the same pattern then there's an ambiguity. Optimizing this to take less than infinite time is left as an exercise for the reader.

3
On

A possible solution:

Construct an NFA for the regexp. Then analyse the NFA where you start with a set of states consisting solely of the initial state. Then do a depth, or width first traversal where you keep track of if you can be in multiple states. You also need to track the path taken in order to eliminate cycles.

For example your (ab)*(a|b)* can be modeled with three states.

 |   a   |   b
p| {q,r} |  {r}
q|  {}   |  {p}
r|  {r}  |  {r}

Where p is the starting state and p and r accepts.

You then need to consider both letters and proceed with the sets {q,r} and {r}. The set {r} only leads to {r} giving a cycle and we can close that path. The set {q,r}, from {q,r} a takes us to {r} which is an accepting state, but since this path can not accept if we start with going to q we only have a single path here, we can then close this when we identify the cycle. Getting a b from {q,r} takes us to {p,r}. Since both of these accepts we have identified an ambigous position and we can conclude that the regexp is ambigous.

0
On

I read a paper published around 1980 which showed that whether a regular expression is ambiguous can be determined in O(n^4) time. I wish I could give you a reference but I no longer know the reference or even the journal. A more expensive way to determine if a regular expression is ambiguous is to construct a finite state machine (exponential in time and space in worst case) from the regular expression using subset construction. Now consider any state X of the FSM constructed from nfa states N. If, for any two nfa states n1, n2 of X, follow(n1) intersect follow(n2) is not empty then the regular expression is ambiguous. If this is not true for any state of the FSM then the regular expression is not ambiguous.

0
On

A regular expression is one-ambiguous if and only if the corresponding Glushkov automaton is not deterministic. This can be done in linear time now. Here's a link. BTW, deterministic regular expressions have been investigated also under the name of one-unambiguity.