Using parboiled2 to parse multiple lines instead of a String


I would like to use parboiled2 to parse multiple CSV lines instead of a single CSV String. The result would be something like:

val parser = new CSVRecordParser(fieldSeparator)
io.Source.fromFile("my-file").getLines().map(line => parser.parse(line))

where CSVRecordParser is my parboiled2 parser for CSV records. The problem is that, as far as I can tell, I cannot do this because parboiled2 parsers require the input in the constructor, not in the run method. So I can either create a new parser for each line, which is not good, or find a way to pass a new input to the same parser for every line. I tried to hack around this by making the input a variable and wrapping the parser in another object:

object CSVRecordParser {

  private object CSVRecordParserWrapper extends Parser with StringBuilding {

    val textBase = CharPredicate.Printable -- '"'
    val qTextData = textBase ++ "\r\n"

    var input: ParserInput = _
    var fieldDelimiter: Char = _

    def record = rule { zeroOrMore(field).separatedBy(fieldDelimiter) ~> (Seq[String] _) }
    def field = rule { quotedField | unquotedField }
    def quotedField = rule {
      '"' ~ clearSB() ~ zeroOrMore((qTextData | '"' ~ '"') ~ appendSB()) ~ '"' ~ ows ~ push(sb.toString)
    }
    def unquotedField = rule { capture(zeroOrMore(textData)) }
    def textData = textBase -- fieldDelimiter

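    def ows = rule { zeroOrMore(' ') }
  }

  def parse(input: ParserInput, fieldDelimiter: Char): Result[Seq[String]] = {
    // Mutate the shared wrapper, then run the top-level rule on the new input
    CSVRecordParserWrapper.input = input
    CSVRecordParserWrapper.fieldDelimiter = fieldDelimiter
    CSVRecordParserWrapper.record.run()
  }
}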

and then just call CSVRecordParser.parse(input, separator) when I want to parse a line. Besides being horrible, this doesn't work: I often get strange errors that seem related to previous uses of the parser. I know this is not how a parboiled2 parser should be written, so I was wondering what the best way is to achieve this with the library.


There are 2 best solutions below


I've done this for CSV files of over 1 million records, in a project that requires high speed and low resources, and I find it works well to instantiate a new parser for each line.

I tried this approach after I noticed that the parboiled2 readme mentions that parsers are extremely lightweight.

I have not even needed to increase JVM memory or heap limits from their defaults. Parser instantiation for each line works very well.
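
For illustration, here is a minimal sketch of the per-line approach. It assumes the field rules from the question; the names CSVLineParser and parseLines are made up for this example and are not part of parboiled2. The input and the delimiter are constructor arguments, so no state is shared between lines.

```scala
import org.parboiled2._
import scala.util.Try

// Hypothetical per-line parser: one instance is created for each line.
class CSVLineParser(val input: ParserInput, fieldDelimiter: Char)
    extends Parser with StringBuilding {

  private val textBase  = CharPredicate.Printable -- '"'
  private val qTextData = textBase ++ "\r\n"
  private val textData  = textBase -- fieldDelimiter

  // A record is one or more delimiter-separated fields up to the end of the line.
  def record = rule { oneOrMore(field).separatedBy(fieldDelimiter) ~ EOI }
  def field = rule { quotedField | unquotedField }
  // A doubled quote inside a quoted field stands for one literal quote.
  def quotedField = rule {
    '"' ~ clearSB() ~ zeroOrMore((qTextData | "\"\"") ~ appendSB()) ~ '"' ~ ows ~ push(sb.toString)
  }
  def unquotedField = rule { capture(zeroOrMore(textData)) }
  def ows = rule { zeroOrMore(' ') }
}

// Hypothetical helper: builds a fresh parser for every line and runs it lazily.
def parseLines(lines: Iterator[String], delimiter: Char): Iterator[Try[Seq[String]]] =
  lines.map(line => new CSVLineParser(line, delimiter).record.run())
```

With this in place, something like parseLines(io.Source.fromFile("my-file").getLines(), ',') should yield one Try per line, so a malformed line does not abort the rest of the file.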


Why not add an end-of-record rule to the parser?

def EOR = rule { "\r\n" | "\n" }

def record = rule { zeroOrMore(field).separatedBy(fieldDelimiter) ~ EOR ~> (Seq[String] _) }

Then you can pass in as many lines as you want.
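
As a rough sketch of how that could look end to end, the parser below takes the whole text as input and uses EOR as the separator between records. The names CSVFileParser, file and parseAll are assumptions made for this example. Note that this variant needs the complete input in memory, unlike a line-by-line iterator.

```scala
import org.parboiled2._
import scala.util.Try

// Hypothetical whole-input parser: records are separated by an end-of-record rule.
class CSVFileParser(val input: ParserInput, fieldDelimiter: Char)
    extends Parser with StringBuilding {

  private val textBase  = CharPredicate.Printable -- '"'
  private val qTextData = textBase ++ "\r\n"
  private val textData  = textBase -- fieldDelimiter

  // All records, separated by line breaks, followed by the end of the input.
  def file = rule { oneOrMore(record).separatedBy(EOR) ~ EOI }
  def record = rule { oneOrMore(field).separatedBy(fieldDelimiter) }
  def field = rule { quotedField | unquotedField }
  def quotedField = rule {
    '"' ~ clearSB() ~ zeroOrMore((qTextData | "\"\"") ~ appendSB()) ~ '"' ~ ows ~ push(sb.toString)
  }
  def unquotedField = rule { capture(zeroOrMore(textData)) }
  def EOR = rule { "\r\n" | "\n" }
  def ows = rule { zeroOrMore(' ') }
}

// Hypothetical helper: parses every record of the given text in a single run.
def parseAll(text: String, delimiter: Char): Try[Seq[Seq[String]]] =
  new CSVFileParser(text, delimiter).file.run()
```

Because line breaks are not printable characters, an unquoted field stops at the end of its line, while a quoted field may still span several lines.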