Recognizing permutations of a finite set of strings in a formal grammar

1.1k Views Asked by At

Goal: find a way to formally define a grammar that recognizes elements from a set 0 or 1 times in any order. Subsequently, I want to parse it and generate an AST as well.

For example: Say the set of valid strings in my language is {A, B, C}. I want to define a grammar that recognizes all valid permutations of any number of those elements.

Syntactically valid strings would include:

  • (the empty string)
  • A,
  • B A, and
  • C A B

Syntactically invalid strings would include:

  • A A, and
  • B A C B

To be clear, defining all possible permutations explicitly in a CFG is unacceptable for my purposes, since larger sets would be impossible to maintain.

From what I understand, such a language fails the pumping lemma for context free languages, so the solution will not be context free or regular.


Update

What I'm after is called a "permutation language", which Benedek Nagy has done some theoretical work on as an extension to context free languages.

Regarding a parser generator, I've only found talk of implementing parsers with a permutation phase (link). Parsers evidently have an exponential lower bound on the size of resulting CFG, and I haven't found any parser generators that support it anyhow.

A sort-of solution to this problem was written in ANTLR. It uses semantic predicates to 'code around' the issue.

1

There are 1 best solutions below

5
On

Assuming that the set of alternative strings is fixed and known in advance, say of size n, one can come up with a (non context-free) grammar of size O(n!). This is not asymptotically smaller than enumerating all permutations, so I suppose it cannot be considered a good solution. I believe that this grammar can be reformulated as a context-sensitive grammar (although in the form I'm suggesting below it is not).

For the example {a, b, c} mentioned in the question, one such grammar is the following. I'm using lower case letters for terminal symbols and upper case letters for non-terminals, as is customary. S is the initial non-terminal symbol.

S ::= XabcY
XabcY ::= aXbcY | bXacY | cXabY
XabY  ::= ab | ba
XacY  ::= ac | ca
XbcY  ::= bc | cb

Non-terminals X and Y enclose the substring in the production which has not been finalized yet; this substring will eventually be replaced by a permutation of the terminals that are given between X and Y (in some arbitrary order).