When you read such posts as Regex: NFA and Thompson's algorithm everything looks rather straightforward until you realize in real life you need not only direct characters like "7" or "b", but also:
[A-Z]
[^_]
.
namely character classes (or ranges). And thus my question -- how to build NFA using character ranges? Using meta-characters like "not A", "anything else" and then computing overlapping ranges? This would lead to using tree-like structure when using final automaton, instead of just a table.
Update: please assume non-trivial in size (>>256) alphabet.
I am asking about NFA, but later I would like to convert NFA to DFA as well.
The simplest approach would be:
Use segments as labels for transitions in both NFA and DFA. For example, range [a-z] would be reperesented as segment
[97, 122]
; single character 'a' would become[97,97]
; and any character '.' would become[minCode, maxCode]
.Each negated range [^a-z] would result in two transitions from starting state to next state. In this example two transitions
[minCode, 96]
and[123, maxCode]
should be created.When range is represented by enumerating all possible characters [abcz], either transition per character should be created, or the code migh first group characters into ranges to optimize the number of transitions. So the [abcz] would become
[a-c]|z
. Thus two transitions instead of four.This should be enough for NFA. However the classical power set construction to transform NFA to DFA will not work when there are transitions with intersecting character ranges. To solve this issue only one additional generalization step is required. Once a set of all input symbols created, in our case it will be a set of segments, it should be transformed into a set of non-intersecting segments. This can be done in time O(n*Log(n)), where n is a number of segments in a set using priority equeue (PQ) in which segments are ordered by the left component. Example:
Step 2. To create new transitions from a "set state" the algorithm should be modified as following:
To speed up matching when searching for a transition and input symbol C, transitions per state might be sorted by segments and binary search applied.