Append text file to lexicon in Rascal

92 Views Asked by At

Is it possible to append terminals retrieved from a text file to a lexicon in Rascal? This would happen at run time, and I see no obvious way to achieve this. I would rather keep the data separate from the Rascal project. For example, if I had read in a list of countries from a text file, how would I add these to a lexicon (using the lexical keyword)?

2

There are 2 best solutions below

3
On BEST ANSWER

In the data-dependent version of the Rascal parser this is even easier and faster but we haven't released this yet. For now I'd write a generic rule with a post-parse filter, like so:

rascal>set[str] lexicon = {"aap", "noot", "mies"};
set[str]: {"noot","mies","aap"}
rascal>lexical Word = [a-z]+;
ok
rascal>syntax LexiconWord = word: Word w;
ok
rascal>LexiconWord word(Word w) { // called when the LexiconWord.word rule is use to build a tree
>>>>>>> if ("<w>" notin lexicon) 
>>>>>>>   filter;  // remove this parse tree
>>>>>>> else fail; // just build the tree
>>>>>>>}
rascal>[Sentence] "hello"
|prompt:///|(0,18,<1,0>,<1,18>): ParseError(|prompt:///|(0,18,<1,0>,<1,18>))
        at $root$(|prompt:///|(0,64,<1,0>,<1,64>))
rascal>[Sentence] "aap"
Sentence: (Sentence) `aap`
rascal>

Because the filter function removed all possible derivations for hello, the parser eventually returns a parse error on hello. It does not do so for aap which is in the lexicon, so hurray. Of course you can make interestingly complex derivations with this kind of filtering. People sometimes write ambiguous grammars and use filters like so to make it unambiguous.

Parsing and filtering in this way is in cubic worst-case time in terms of the length of the input, if the filtering function is in amortized constant time. If the grammar is linear, then of course the entire process is also linear.

0
On

A completely different answer would be to dynamically update the grammar and generate a parser from this. This involves working against the internal grammar representation of Rascal like so:

set[str] lexicon = {"aap", "noot", "mies"};
syntax Word = ; // empty definition
typ = #Word;
grammar = typ.definitions;
grammar[sort("Word")] = { prod(sort("Word"), lit(x), {}) | x <- lexicon };
newTyp = type(sort("Word"), grammar); 

This newType is a reified grammar + type for the definition of the lexicon, and which can now be used like so:

import ParseTree;
if (type[Word] staticGrammar := newType) {
  parse(staticGrammar, "aap");
}

Now having written al this, two things:

  • I think this may trigger unknown bugs since we did not test dynamic parser generation, and
  • For a lexicon with a reasonable size, this will generate an utterly slow parser since the parser is optimized for keywords in programming languages and not large lexicons.