I have been trying to get my head around Scala's parser combinators. They seem pretty powerful, but the only tutorial examples I can find deal with mathematical expressions; there are very few proper real-world examples with DSLs that need to be parsed and mapped to different entities, etc.
For the sake of this example, let's say I have this BNF where I have this entity named Model, which is made up of a string like this: [model [name <name> ]]. This is a simplistic example of a much larger BNF I have, and there are more entities in reality.
So I defined my own class Model, which takes the name as its constructor argument, and then defined my own ModelParser object, which extends JavaTokenParsers. I then defined the following parsers, following the BNF (I know some may have a simpler regex matcher but I preferred to follow the BNF exactly for other reasons).
def model : Parser[Model] = "[model" ~> "[name" ~> name <~ "]]" ^^ ( Model(_) )
def name : Parser[String] = (letter ~ (anyChar*)) ^^ {case text => text.toString()}
def anyChar = letter | digit | "_".r | "-".r
def letter = """[a-zA-Z]""".r
def digit = """\d""".r
The toString of Model looks like this:
override def toString : String = "[model " + name + "]"
When I try to run it with a string like [model [name helloWorld]], I get this:
[model [h~List(e, l, l, o, W, o, r, l, d)]]
instead of what I am expecting: [model helloWorld]
How do I get those individual characters to join back in the string they were originally in?
I am also confused by the individual parsers and the use of .r. Sometimes I saw examples where they had just the following as a parser (to parse "hello"):
def hello = "hello"
Isn't that just a String? How on Earth did it suddenly become a parser that can be combined with other parsers? And what is the .r actually doing? I have read at least 3 tutorials but am still totally lost as to what is actually happening.
The problem is that anyChar* parses a List[String] (where in this case each string is a single character), and the result of calling toString on a list of strings is "List(...)", not the string you'd get by concatenating the contents. In addition, the case text => pattern is matching on the entire letter ~ (anyChar*), not just the anyChar* part.
It's possible to address both of these issues pretty straightforwardly:
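A minimal sketch of such a fix, reusing the names from the question. It assumes the scala-parser-combinators module is on the classpath (since Scala 2.11 it ships separately from the standard library), extends RegexParsers directly (JavaTokenParsers, which extends it, should behave the same here), and writes anyChar* in its method form rep(anyChar) to avoid the postfix-operator language import:

```scala
import scala.util.parsing.combinator.RegexParsers

// Model as described in the question: wraps the parsed name.
case class Model(name: String) {
  override def toString: String = "[model " + name + "]"
}

object ModelParser extends RegexParsers {
  def model: Parser[Model] = "[model" ~> "[name" ~> name <~ "]]" ^^ (Model(_))

  // Destructure the `letter ~ rest` result instead of calling toString on it:
  // `first` is the leading letter, `rest` is the List[String] from rep(anyChar).
  def name: Parser[String] = letter ~ rep(anyChar) ^^ {
    case first ~ rest => (first :: rest).mkString
  }

  def anyChar: Parser[String] = letter | digit | "_".r | "-".r
  def letter: Parser[String] = """[a-zA-Z]""".r
  def digit: Parser[String] = """\d""".r
}
```

It could be exercised with ModelParser.parseAll(ModelParser.model, "[model [name helloWorld]]"), which returns a ParseResult[Model].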
We just append the first character string to the list of the rest, and then call mkString on the entire list, which will concatenate the contents. This works as expected: the input [model [name helloWorld]] now parses to [model helloWorld].
As you note, it would be possible (and possibly clearer and more performant) to let the regular expressions do more of the work:
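One way this regex-heavy variant might look is sketched below. The object name RegexModelParser and the exact pattern are illustrative assumptions; the pattern simply folds the question's letter, digit, "_" and "-" rules into one expression. As before, the scala-parser-combinators module is assumed to be available:

```scala
import scala.util.parsing.combinator.RegexParsers

// Same Model as in the question.
case class Model(name: String) {
  override def toString: String = "[model " + name + "]"
}

// Hypothetical name, chosen only to distinguish it from the combinator version.
object RegexModelParser extends RegexParsers {
  def model: Parser[Model] = "[model" ~> "[name" ~> name <~ "]]" ^^ (Model(_))

  // One letter followed by any number of letters, digits, underscores or hyphens.
  // The regex already yields a String, so no ^^ mapping is needed.
  def name: Parser[String] = """[a-zA-Z][a-zA-Z\d_-]*""".r
}
```

One difference worth keeping in mind: RegexParsers skips leading whitespace before every literal and regex parser, so the character-by-character version would accept whitespace between the characters of a name, whereas a single regex matches the name as one contiguous token.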
This example also illustrates the way that the parsing combinator library uses implicit conversions to cut down on some of the verbosity of writing parsers. As you say, def hello = "hello" defines a string, and "[a-zA-Z]+".r defines a Regex (via the r method on StringOps), but either can be used as a parser because RegexParsers defines implicit conversions from String (this one's named literal) and Regex (regex) to Parser[String].
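To make the conversions visible, here is a small sketch that writes the same parsers both ways. The object ImplicitDemo and its member names are hypothetical; literal and regex are the conversion methods on RegexParsers mentioned above, and the scala-parser-combinators module is again assumed to be on the classpath:

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical demo object; only literal and regex come from the library.
object ImplicitDemo extends RegexParsers {
  // Explicit: call the conversion methods by name.
  def helloExplicit: Parser[String] = literal("hello")
  def wordExplicit: Parser[String] = regex("[a-zA-Z]+".r)

  // Implicit: the expected type Parser[String] triggers the same conversions.
  def helloImplicit: Parser[String] = "hello"
  def wordImplicit: Parser[String] = "[a-zA-Z]+".r

  // A String has no ~ method, so the compiler converts "hello" via literal
  // to find one; the Regex argument is then converted via regex.
  def greeting: Parser[String ~ String] = "hello" ~ "[a-zA-Z]+".r
}
```

A bare def hello = "hello" with no type annotation stays a plain String; it only becomes a parser at the point where it is used in a position that expects a Parser or has a combinator method called on it.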