Does order really matter with <|> here in this parser?

Question

Does order really matter with <|> here in this parser?

142 Views Asked by Joseph Sible-Reinstate Monica At 12 January 2020 at 18:07

The Monday Morning Haskell post Parsing Part 2: Applicative Parsing says this about alternation with regex-applicative:

Note that order matters! If we put the integer parser first, we’ll be in trouble! If we encounter a decimal, the integer parser will greedily succeed and parse everything before the decimal point. We'll either lose all the information after the decimal, or worse, have a parse failure.

Referring to this function from their Git repository:

numberParser :: RE Char Value
numberParser = (ValueNumber . read) <$>
  (negativeParser <|> decimalParser <|> integerParser)
  where
    integerParser = some (psym isNumber)
    decimalParser = combineDecimal <$> many (psym isNumber) <*> sym '.' <*> some (psym isNumber)
    negativeParser = (:) <$> sym '-' <*> (decimalParser <|> integerParser)

    combineDecimal :: String -> Char -> String -> String
    combineDecimal base point decimal = base ++ (point : decimal)

However, I can't figure out why that would be so. When I change decimalParser <|> integerParser to integerParser <|> decimalParser, it still seems like it always parses the right thing (in particular, I did that and ran stack test, and their tests all still passed). The decimal parser must have a decimal point, and the integer parser can't have one, so it will stop parsing there, resulting in the decimal point making the next piece of the parse fail, backtracking us back to the decimal parser. It seems like the only case this wouldn't occur in would be if the next part of the overall parser after this one could accept the decimal point (making it an ambiguous grammar), but you still wouldn't "lose all the information after the decimal, or worse, have a parse failure". Is my reasoning correct and this a mistake in that article, or is there a case I'm not seeing where one of their outcomes could happen?

Original Q&A

There are 2 best solutions below

Paul Johnson On 12 January 2020 at 18:23

The issue comes with parsing something like:

12.34

If you try the integer parser first then it will find the 12, conclude that it has found an integer, and then try to parse the .34 as the next token.

[...] the decimal point making the next piece of the parse fail, backtracking us back to the decimal parser.

Yes, as long as your parser is doing backtracking. For performance reasons most production parsers (e.g. megaparsec) don't do that unless they are specifically told to (usually using the try combinator). The RE parser used in the blog post may well be an exception, but I can't find its interpreter to check.

**Li-yao Xia** · Accepted Answer · 2020-01-12T19:53:07.517000

There is a difference, and it matters, but part of the reason is that the rest of the parser is quite fragile.

When I change decimalParser <|> integerParser to integerParser <|> decimalParser, it still seems like it always parses the right thing (in particular, I did that and ran stack test, and their tests all still passed).

The tests pass because the tests don't cover this part of the parser (the closest ones only exercise stringParser).

Here's a test that currently passes, but wouldn't if you swapped those parsers (stick it in test/Spec.hs and add it to the do block under main):

badex :: Spec
badex = describe "Bad example" $ do
  it "Should fail" $
    shouldMatch
      exampleLineParser
      "| 3.4 |\n"
      [ ValueNumber 3.4 ]

If you swap the parsers, you get as a result ValueNumber 3.0: the integerParser (which is now first) succeeds parsing 3, but then the rest of the input gets discarded.

To give more context, we have to see where numberParser is used:

numberParser is one of the alternatives of valueParser...
which is used in exampleLineParser, where valueParser is followed by readThroughBar (and I mean the relevant piece of code is literally valueParser <* readThroughBar);
readThroughBar discards all characters until the next vertical bar (using many (psym (\c -> c /= '|' && c /= '\n'))).

So if valueParser succeeds parsing just 3, then the subsequent readThroughBar will happily consume and discard the rest .4 |.

The explanation from the blogpost you quote is only partially correct:

Note that order matters! If we put the integer parser first, we’ll be in trouble! If we encounter a decimal, the integer parser will greedily succeed and parse everything before the decimal point. We'll either lose all the information after the decimal, or worse, have a parse failure.

(emphasis mine) You will only lose information if your parser actively discards it, which readThroughBar does here.

As you already suggested, the backtracking behavior of RE means that the noncommutativity of <|> really only matters for correctness with ambiguous syntaxes (it might still have an effect on performance in general), which would not be a problem here if readThroughBar were less lenient, e.g., by consuming only whitespace before |.

I think that shows that using psym with (/=) is at least a code smell, if not a clear antipattern. By only looking for the delimiter without restricting the characters in the middle, it makes it hard to catch mistakes where the preceding parser does not consume as much input as it should. A better alternative is to ensure that the consumed characters may contain no meaningful information, for example, requiring them to all be whitespace.

Does order really matter with <|> here in this parser?

There are 2 best solutions below

Related Questions in REGEX

Related Questions in PARSING

Related Questions in HASKELL

Related Questions in ALTERNATION

Trending Questions

Popular # Hahtags

Popular Questions