Unbelievably bad parse time

372 Views Asked by At

I need to analyze Elisp (Emacs Lisp) code so I wrote a parser for it using Instaparse. I expected it to be slow but doing 1k lines per second is way too slow to be right even on a calculator (or my pretty old i7). Can it be that bad or do I do something extremely wrong?

It's unambiguous and I tried to keep look ahead/behinds at minimum, unfortunately Elisp is very liberal with what constitutes as a symbol so I had to add some ahead/behinds there to differentiate numbers and symbols. Also I tried to deffer this by parsing symbols, numbers and keywords as "ident" it only gave me back like 30% of time. From my tests, it looks like Instaparse struggles a lot with recursive rules and lisps have highly recursive nature so maybe I didn't mess it up - it's just that slow...

The parser:

(ns slowparse
  (:require [clojure.string :as str]
            [instaparse.combinators :as c]
            [instaparse.core :as insta]))

(def grammar
  "Elisp grammar."
  "<root> = any +

  <any> = sexp | keyword | number | symbol | prefix | string | vector |
          comment | whitespace | char | Epsilon

  comment = comment-tok #'(?:[^\\n]*|$)'

  string = <str-l-tok> #'(?:(?:\\\\\\\\)|(?:\\\\\")|[^\"])*' <str-r-tok>

  char = <char-tok> #'(?:(?:\\\\(?:C|M)-)|(?:\\\\))?(?:.|\\s)'

  <whitespace> = <#'\\s+'>

  sexp   = sexp-l-tok any + sexp-r-tok

  vector = vec-l-tok any + vec-r-tok

  <prefix>   = quote | template | spread | hole

  <prfxbl> = sexp | symbol | keyword | number | prefix | vector

  quote    = quote-tok prfxbl
  template = tmpl-tok prfxbl
  hole     = hole-tok ! spread-tok prfxbl
  spread   = hole-tok spread-tok prfxbl

  <sexp-l-tok>      = <'('>
  <sexp-r-tok>      = <')'>

  <vec-l-tok>       = <'['>
  <vec-r-tok>       = <']'>

  <str-l-tok>       = <'\"'>
  <str-r-tok>       = <'\"'>

  <quote-tok>       = '#' ? <\"'\">

  <tmpl-tok>        = <'`'>

  <num-b-x-tok>     = '#'

  <hole-tok>        = <','>

  <spread-tok>      = <'@'>

  <comment-tok>     = <';'>

  <char-tok>        = '?'

  <kv-tok>          = <':'>

  symbol    = ! ( number | kv-tok | comment-tok | num-b-x-tok | char-tok )
               ident

  keyword = kv-tok ident

  number    = num-b10 | num-bx
  <num-b10> = #'[-+]?(?:(?:[\\d]*\\.[\\d]+)|(?:[\\d]+\\.[\\d]*)|(?:[\\d]+))' &
              ( ! ident )
  <num-bx>  = #'(?i)#(?:b|o|x|(?:\\d+r))[-+]?[a-z0-9]+'")

(def ident
  {:ident
   (let [esc-ch (str/join ["\\[" "\\]" "\\(" "\\)" "\"" "\\s" "'" "," "`" ";"])
         tmpl "(?:(?:\\\\[{{ec}}])|[^{{ec}}])+"]
     (->> esc-ch (str/replace tmpl "{{ec}}") c/regexp c/hide-tag))})

(insta/defparser ^{:doc "Elisp parser."} elisp-parser
  (merge ident (c/ebnf grammar))
  :start :root)

(def test-text (slurp "/tmp/foo.el"))

(time (insta/parse elisp-parser test-text))
2

There are 2 best solutions below

0
JAre On BEST ANSWER

As @akond suggested I ported the grammar to ANTLR (using https://github.com/aphyr/clj-antlr). It parses 1k lines in 20ms or less... Yeah looks like Instaparse is really slow.

Didn't have to change much, but Instaparse definitely feels a lot nicer to write rules for. It has simple ordering and look ahead/behind, standard regex, easy way to hide junk.

ANTLR grammar:

(ns fastparse
  (:require [clj-antlr.core :as antlr]))

(def grammar
  "Elisp grammar."
  "grammar EmacsLisp ;

   source: any* EOF ;

   any: list | keyword | number | symbol | prefix | string | vector | char |
        whitespace | comment;

   vector: '[' any* ']' ;

   list: '(' any* ')' ;

   prefix: quote | template | spread | hole ;

   quote: '#' ? '\\'' any ;

   template: '`' any ;

   spread: ',@' any ;

   hole: ',' any ;

   number: NUMB10 | NUMBX ;

   char: CHAR ;

   string: STRING ;

   keyword: KEYWORD ;

   symbol: IDENT ;

   whitespace: WS ;

   comment: COMLINE ;

   CHAR: '?' ( ( '\\\\' ( 'C' | 'M' ) '-' ) | '\\\\' )? . ;

   STRING: '\"' ( '\\\\\\\\' | '\\\\\"' | . )*? '\"' ;

   NUMB10: [+-] ? ( ( D* '.' D+ ) | ( D+ '.' D* ) | D+ ) ;

   NUMBX: '#' ( 'b' | 'o' | 'x' | ( D+ 'r' ) ) [-+]? ( A | D )+ ;

   fragment
   D: '0'..'9' ;

   fragment
   A: 'a'..'z' ;

   KEYWORD: ':' IDENT ;

   IDENT: ( ( '\\\\' [\\\\[\\]() \\n\\t\\r\"',`;] )+? |
            ( ~[[\\]() \\n\\t\\r\"',`;] )+? )+ ;

   COMLINE: ';' ~[\\n\\r]* ;

   WS: [ \\n\\t\\r]+ ;")

(def elisp-str->edn (antlr/parser grammar))

(def text (slurp "/tmp/foo.el"))

(time (elisp-str->edn text))
0
Nikolay Handzhiyski On

If you are interested in speed, and you do not want to worry for stack overflow occurrences, you may try Tunnel Grammar Studio, a parser generator I work on. The generated parsers from it are iterative during lexing, parsing, the tree construction, tree iteration, tree to string conversion and tree release. The accepted grammars are in ABNF (RFC 5234) with case sensitivity per token (RFC 7405).

Its a good idea to have a deterministic grammar with any parser you use. TGS does check for LL(1) conflicts at compile time, and will help you to create a deterministic grammar by visualizing the conflict places.

There is a demo of the tool, you can test the speed yourself. There is an option in the tool to generate fully ready test case project that will log into the console at runtime the time taken to parse, iterate the tree and free it, by only supplying the input data. Meaning that no development (except compiling the generated code) from you is expected if you want to test the speed for a grammar.

In my tests with a JSON grammar (RFC 8259) with removed ambiguity, that only emits syntax tree build events (like SAX) the iterative parser runs with around 8 megabytes per second, that are many lines per second and takes memory only proportional to the depth of the parsing, because only one token is technically needed at runtime for LL(1) grammars, i.e. it is practically "streaming" the input.

You can have also statically typed or dynamically typed concrete syntax tree, or dynamically typed abstract syntax tree with different levels of abstraction (i.e. automatic node pruning). The syntax trees builders (if chosen) for this trees are using the build events to create the relevant trees. You will need an ABNF grammar and C++ as a language target however.

The tool supports token ranges inside the parser grammar (additionally to the character ranges inside the lexer grammar). This means that you may develop your grammar without the extra care of the lexical rules order.