The Monday Morning Haskell post Parsing Part 2: Applicative Parsing says this about alternation with regex-applicative:
Note that order matters! If we put the integer parser first, we’ll be in trouble! If we encounter a decimal, the integer parser will greedily succeed and parse everything before the decimal point. We'll either lose all the information after the decimal, or worse, have a parse failure.
Referring to this function from their Git repository:
numberParser :: RE Char Value
numberParser = (ValueNumber . read) <$>
(negativeParser <|> decimalParser <|> integerParser)
where
integerParser = some (psym isNumber)
decimalParser = combineDecimal <$> many (psym isNumber) <*> sym '.' <*> some (psym isNumber)
negativeParser = (:) <$> sym '-' <*> (decimalParser <|> integerParser)
combineDecimal :: String -> Char -> String -> String
combineDecimal base point decimal = base ++ (point : decimal)
However, I can't figure out why that would be so. When I change decimalParser <|> integerParser to integerParser <|> decimalParser, it still seems like it always parses the right thing (in particular, I did that and ran stack test, and their tests all still passed). The decimal parser must have a decimal point, and the integer parser can't have one, so it will stop parsing there, resulting in the decimal point making the next piece of the parse fail, backtracking us back to the decimal parser. It seems like the only case this wouldn't occur in would be if the next part of the overall parser after this one could accept the decimal point (making it an ambiguous grammar), but you still wouldn't "lose all the information after the decimal, or worse, have a parse failure". Is my reasoning correct and this a mistake in that article, or is there a case I'm not seeing where one of their outcomes could happen?
There is a difference, and it matters, but part of the reason is that the rest of the parser is quite fragile.
The tests pass because the tests don't cover this part of the parser (the closest ones only exercise
stringParser).Here's a test that currently passes, but wouldn't if you swapped those parsers (stick it in
test/Spec.hsand add it to thedoblock undermain):If you swap the parsers, you get as a result
ValueNumber 3.0: theintegerParser(which is now first) succeeds parsing3, but then the rest of the input gets discarded.To give more context, we have to see where
numberParseris used:numberParseris one of the alternatives ofvalueParser...exampleLineParser, wherevalueParseris followed byreadThroughBar(and I mean the relevant piece of code is literallyvalueParser <* readThroughBar);readThroughBardiscards all characters until the next vertical bar (usingmany (psym (\c -> c /= '|' && c /= '\n'))).So if
valueParsersucceeds parsing just3, then the subsequentreadThroughBarwill happily consume and discard the rest.4 |.The explanation from the blogpost you quote is only partially correct:
(emphasis mine) You will only lose information if your parser actively discards it, which
readThroughBardoes here.As you already suggested, the backtracking behavior of
REmeans that the noncommutativity of<|>really only matters for correctness with ambiguous syntaxes (it might still have an effect on performance in general), which would not be a problem here ifreadThroughBarwere less lenient, e.g., by consuming only whitespace before|.I think that shows that using
psymwith(/=)is at least a code smell, if not a clear antipattern. By only looking for the delimiter without restricting the characters in the middle, it makes it hard to catch mistakes where the preceding parser does not consume as much input as it should. A better alternative is to ensure that the consumed characters may contain no meaningful information, for example, requiring them to all be whitespace.