I'm using lark, an excellent Python parsing library.
It provides Earley and LALR(1) parsers, and grammars are written in a custom EBNF format (EBNF stands for Extended Backus–Naur form).
Lowercase definitions are rules, uppercase definitions are terminals. Lark also lets you attach a priority to a terminal (for example NAME.-1) to control which terminal is preferred when several could match.
I have defined a grammar, but I am a little stuck: I am unsure why it works and whether it is implemented well.
In particular, I don't understand why I need:
// column names
_NAME: /\_{0,2}[a-zA-Z][a-zA-Z_0-9\%]*/
NAME.-1: _NAME
to properly parse some of my formulas, such as:
SUMNAME := A+B
IF(1>0, TRUE, FALSE)
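One way to see the overlap that may be involved here is the sketch below. It uses only Python's re module, not lark, so it does not reproduce lark's lexer. It only checks which patterns can match at the start of each formula. The KEYWORDS table and the candidates helper are hypothetical names made up for this illustration. The patterns themselves are copied from the grammar.

```python
import re

# Column-name pattern, copied verbatim from the _NAME terminal.
NAME_RE = re.compile(r"\_{0,2}[a-zA-Z][a-zA-Z_0-9\%]*")

# A few keyword terminals from the grammar (hypothetical lookup table).
# Literals declared with the "i" flag in the grammar are case-insensitive.
KEYWORDS = {
    "_IF": re.compile(r"if", re.IGNORECASE),
    "_SUM": re.compile(r"sum", re.IGNORECASE),
    "_TRUE": re.compile(r"TRUE"),
    "_FALSE": re.compile(r"FALSE"),
}

def candidates(text):
    """Map each terminal that matches at the start of text to what it matched."""
    found = {}
    m = NAME_RE.match(text)
    if m:
        found["NAME"] = m.group()
    for name, pattern in KEYWORDS.items():
        m = pattern.match(text)
        if m:
            found[name] = m.group()
    return found

print(candidates("IF(1>0, TRUE, FALSE)"))  # {'NAME': 'IF', '_IF': 'IF'}
print(candidates("TRUE, FALSE)"))          # {'NAME': 'TRUE', '_TRUE': 'TRUE'}
print(candidates("SUMNAME := A+B"))        # {'NAME': 'SUMNAME', '_SUM': 'SUM'}
```

For IF and TRUE, the name pattern and the keyword match exactly the same text, so something has to break the tie. A lower priority on NAME is one plausible way to let the keyword win. For SUMNAME the name pattern matches a longer string than the sum keyword, which is a different kind of conflict.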
Here is the full grammar I came up with to provide users with a syntax to calculate KPIs:
?start: expr
| NAME ":=" expr -> create_statement
?expr: expr_or
?exprif: _IF "(" expr_or "," expr_or ["," expr_or] ")" -> if_clause
?expr_or: expr_and
| expr_and (_OR expr_and)+ -> or_
?expr_and: expr_cond
| expr_cond (_AND expr_cond)+ -> and_
?expr_cond: sum
| sum COMPARISON sum -> condition
| sum _IN "(" list_expr ")" -> is_in
| sum _BETWEEN sum _AND sum -> between
| _NOT expr_atom -> not_
?sum: product
| sum "+" product -> add
| sum "-" product -> sub
?product: division
| product "*" division -> mul
?division: power
| power "/" power -> div
| _DIVIDE "(" expr_or "," expr_or ")" -> div
?power: exprfactor
| power "**" exprfactor -> pow
| power "^" exprfactor -> pow
?exprfactor: expr_atom
| "-" expr_atom -> neg
?expr_atom: atom
| _SUM "(" list_expr ")" [_OVER"(" list_str ")"] -> sum
| _MEAN "(" list_expr ")" [_OVER"(" list_str ")"] -> mean
| _STD "(" list_expr ")" [_OVER"(" list_str ")"] -> std
| _MAX "(" list_expr ")" -> max
| _MIN "(" list_expr ")" -> min
| _COALESCE "(" list_expr ")" -> coalesce
| _ABS "(" expr_or ")" -> abs
| _SQRT "(" expr_or ")" -> sqrt
| _FLOAT "(" expr_or ")" -> float_
| _MOD "(" expr_or "," atom ")" -> mod
| _ROUND "(" expr_or "," atom ")" -> round
| _YEAR "(" atom ")" -> year
| _QUARTER "(" atom ")" -> quarter
| _MONTH "(" atom ")" -> month
| _DAY "(" atom ")" -> day
| _HOUR "(" atom ")" -> hour
| _MINUTE "(" atom ")" -> minute
| _IS_NULL "(" expr_or ")" -> is_null
| _SECOND "(" atom ")" -> second
| _SUBSTR "(" expr_or "," atom ["," atom] ")" -> substr
| "(" expr ")"
| exprif
?atom: NAME -> variable
| NUMBER -> variable
| BOOLEAN -> variable
| STRING -> variable
| NAN -> variable
| NULL -> variable
| DATETIME -> variable
| DATE -> variable
list_expr: expr ("," expr)*
list_str: STRING ("," STRING)*
// over
_OVER: /\_{2}OVER\_{2}/
//special values
NAN: "nan"i
NULL: "null"i
// comparison operators
COMPARISON: GREATER | GREATER_OR_EQUAL | SMALLER | SMALLER_OR_EQUAL | EQUAL | UN_EQUAL
GREATER: ">"
GREATER_OR_EQUAL: GREATER"="
SMALLER: "<"
SMALLER_OR_EQUAL: SMALLER"="
EQUAL: "=="
UN_EQUAL: "!="
_BETWEEN: "between"i
_IN: "in"i
// IF token
_IF: "if"i
// logical operators.
_NOT.1: "not"i
_AND.1: "and"i
_OR.1: "or"i
_XOR.1: "xor"i
// boolean literals
_TRUE: "TRUE"
_FALSE: "FALSE"
// syntax tokens
_SUM: "sum"i
_MEAN: "mean"i
_STD: "std"i
_IS_NULL: "isnull"i | "is_null"i
_MAX: "max"i
_MIN: "min"i
_DIVIDE: "divide"i
_COALESCE: "coalesce"i
_SQRT: "sqrt"i
_FLOAT: "float"i
_MOD: "mod"i
_ROUND: "round"i
_ABS: "abs"i
_YEAR: "year"i
_QUARTER: "quarter"i
_MONTH: "month"i
_DAY: "day"i
_HOUR: "hour"i
_MINUTE: "minute"i
_SECOND: "second"i
_SUBSTR: "substr"i
// tokens for timeshifts
FREQUENCY: "M" | "Q" | "Y"
SHIFT: PLUS | MINUS
PYE: "PYE"
TIME_SHIFT: "T" SHIFT INT FREQUENCY | PYE
// operators
PLUS: "+"
MINUS: "-"
// time and date formats
DATETIME.1: "'" /\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}/ "'"
DATE.1: "'" /\d{4}-\d{2}-\d{2}/ "'"
// strings
STRING : "'" _STRING_ESC_INNER "'" | "\"" _STRING_ESC_INNER "\""
// booleans
BOOLEAN: _TRUE | _FALSE
// column names
_NAME: /\_{0,2}[a-zA-Z][a-zA-Z_0-9\%]*/
NAME.-1: _NAME
%import common.INT
%import common.NUMBER
%import common.WS_INLINE
%import common._STRING_ESC_INNER
%ignore WS_INLINE