NFA recognizer in LISP

1.4k Views Asked by At

I have to define a function in lisp that, given a regular expression and an e-NFA as input, it returns true if the expression is accepted by the automata.

For start, I've defined a function that generates an e-NFA (as a cons-cell) from a regular expression with these operators {|, *, +, ...}.

For example: with the expression (or a b) the output will be:

((INITIAL 0) (DELTA 0 EPSILON 1) (DELTA 0 EPSILON 3) (DELTA 2 EPSILON 5) (DELTA 4 EPSILON 5) (DELTA 1 A 2) (DELTA 3 B 4) (FINAL 5))

I've come up with this idea: I've written the function recognize-or (which processes the or case):

(defun nfa-recognize-or (fa input init final)
 (let ((arc (cdr (car fa))))
  (cond ((and (equal (first arc) init) (equal (second arc) 'epsilon))
         (nfa-recognize-or (cdr fa) input (third arc) final))
        ((not (equal (first arc) init))
         (nfa-recognize-or (cdr fa) input init final))
        ((and (equal (first arc) init) (equal (second arc) (car input)))
         (nfa-recognize-or fa (cdr input) (third arc) final))
        ((equal (third arc) final)
         T)
  )
 )
)

If I call the function this way:

(nfa-recognize-or (cdr fa) '(a) 0 5)

It returns 'stack overflow'. The problem is that, after some calls, the value of fa =

(DELTA 1 A 2) (DELTA 3 B 4) (FINAL 5))

with init = 2 and final = 5 as before. At this point, the next state the program should consider should be

(DELTA 2 EPSILON 5)

in order to return TRUE, but this is not possible because at this point NFA is "consumed" and I can't backtrack it to verify the other states.

Do you have any suggestions?

2

There are 2 best solutions below

0
On

I think I've worked out what you're trying to do.

Edit: I thought you were trying to convert regular expressions to e-NFA's but it seems you want to do something else. I'll leave this for now anyway.

Let's try to make some headway in creating a function re->enfa which takes the following parameters:

  1. A regular expression in some Lisp format
  2. A symbol for a state which will have an epsilon transition to the sub-e-NFA corresponding to this regular expression
  3. A symbol for the corresponding leaving state

We will use symbols to identify states so we can easily come up with unique identifiers. We'll write a few small functions in case we change our minds

(defun new-state () (gensym))
(defun transition (from to on) `((delta ,from ,on ,to)))
(defun nfa-union (&rest args) (apply #'append args))

Now let's look at two cases which we will put in functions. One for concatenation of two regular expressions and one for alternation.

(defun concat-nfa (a b s0 sn)
  (let ((m (new-state)))
    (nfa-union (re->enfa a s0 m) (re->enfa a m sn))))
(defun or-nfa (a b s0 sn)
  (let ((s1 (new-state))
        (s2 (new-state))
        (sk (new-state))
        (sl (new-state)))
    (nfa-union (transition s0 s1 'epsilon)
               (transition s0 s2 'epsilon)
               (transition sk sn 'epsilon)
               (transition sl sn 'epsilon)
               (re->enfa a s1 sk)
               (re->enfa b s2 sl))))

Then you can define what re->enfa might look like:

(defun re->enfa (x s0 sn)
  (if (atom x) (transition s0 sn x)
    (case (car x)
      (or (if (cddr x) (or-nfa (cadr x) (cons 'or (cddr x) s0 sn)
              (re->enfa (cadr x) s0 sn)))
      (concat  (if (cddr x) (concat-nfa (cadr x) (cons 'concat (cddr x) s0 sn)
              (re->enfa (cadr x) s0 sn)))
      (* (kleene-nfa (cadr x) s0 sn))
      (+ (re->nfa `(concat ,(cadr x) (* ,(cadr x))))))))

That should give you a starting point with some things left to fill in. There are also some changes you could make. You might for example want to implement + in a way that only requires the e-NFA for the inner expression to be calculated once.

You also need to add a function which adds a marker of the initial and final state and wraps re->enfa.

2
On

I'll start at a high level and work down into details. I won't give a complete answer since it is not short but hopefully it will be enough to set you on the right path.

given a regular expression and an e-NFA as input, it returns true if the expression is accepted by the automata.

For simplicity, I will assume that the regular expression is given in a form that is compatible with a string of characters from the alphabet E plus the usual (, ), +, *. I also assume the e-NFA is given in a form which contains all the necessary information to constitute all the information you'd expect in the formal definition of an e-NFA. The difficulty of translating between the actual formats and my preferred formats will not be considered.

I always recommend figuring out how you'd solve a problem by hand before you write any code to do it. How would you solve this problem on a test? How can you tell if an e-NFA accepts a regular expression? More fundamentally, what is a reasonable interpretation of that requirement? My interpretation is as follows:

An e-NFA M accepts a regular expression r if L(r) is a subset of L(M).

In other words, if w is a string matched by r, it should be accepted by M. This first step is important because it casts the problem into one that we can begin to solve formally. We need to see if the language of one thing is a subset of the language of another. There is no straightforward formal machinery to answer this question directly for regular expressions of which I'm aware. However, there is a well-known construction to answer questions like this with finite automata: I refer to this construction as the Cartesian Product Machine construction, and it is used to produce a finite automaton which accepts a language based on the languages of two other finite automata. In particular: if L \ R = 0 (where 0 is the empty set and \ is set difference), then L is a subset of R. The Cartesian Product Machine construction allows us to construct a machine for L \ R given machines for L and R. We already have one for M; it is given. All we need is a machine for L(r) and we are ready to proceed with the deterministic algorithm for producting a machine for the difference of languages. Then, all that remains is checking whether the resulting language is empty. For details of the Cartesian Product Machine construction, see my other answer here. Don't worry that you will have NFAs; the construction works the same as for DFAs as noted here.

Given a regular expression r, there is an algorithm for producing an NFA which accepts L(r). I don't have a ready link for this so I'll gloss over it. Basically, we define some recursive rules for constructing e-NFAs based on each term of a regular expression. Here are the rules:

1. Alphabet symbol a: M(a)

--->()-a->(O)


2. Concatenation rs: M(rs)

--->[M(r)]-e->[M(s)]

* Note: one -e-> from each accepting state of M(r) to initial state of M(s)
        initial state is that of M(r)
        accepting states are those of M(s)


3. M(r+s):

-->()-e->[M(r)]
    \-e->[M(s)]

* Note: new initial state added
        accepting states are those of M(r) and M(s)

4. M(r*):

     /--e--\
    V       \
--->()-e->[M(r)]

* Note: one -e-> from each accepting state of M(r) to initial state
        new initial state
        accepting state is only the new initial state

Now we have an NFA for your regular expression and we know how to construct the Cartesian Product Machine for the difference. What we end up with is a big e-NFA representing the difference of L(r) and L(M). We have already stated that the difference of these languages is empty iff L(r) is a subset of L(M), and L(r) is a subset of L(M) iff M accepts r. The only question that's left is this: is the language of our Cartesian product machine empty or not?

There are lots of ways to answer this question, but the most direct would be simply to perform a graph search algorithm starting at the initial state to see if any of the accepting states are reachable. If a graph search shows that a state is reachable, then there are some strings in the language. If none of the states reachable from the initial state is accepting, the language is empty. Any search algorithm for directed graphs - depth-first, breadth-first, etc. - will do fine. Note that the graph is not acyclic so don't use any methods that require acyclic graphs.