Consider the usage of these different parser combinators.
import Control.Applicative.Combinators
import Text.Regex.Applicative
main :: IO ()
main = do
let parser1 = sym '"' *> manyTill anySym (sym '"')
print $ match parser1 "\"abc\""
let parser2 = sym '"' *> many anySym <* sym '"'
print $ match parser2 "\"abc\""
import Control.Applicative.Combinators
import Text.ParserCombinators.ReadP hiding(many, manyTill)
main :: IO ()
main = do
let parser1 = char '"' *> manyTill get (char '"')
print $ readP_to_S parser1 "\"abc\""
let parser2 = char '"' *> many get <* char '"'
print $ readP_to_S parser2 "\"abc\""
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative.Combinators
import Data.Attoparsec.Text hiding(manyTill)
main :: IO ()
main = do
let parser1 = char '"' *> manyTill anyChar (char '"')
print $ parseOnly parser1 "\"abc\""
let parser2 = char '"' *> many anyChar <* char '"'
print $ parseOnly parser2 "\"abc\""
import Control.Applicative.Combinators
import Text.Megaparsec hiding(many, manyTill)
import Data.Void
main :: IO ()
main = do
let parser1 = single '"' *> manyTill anySingle (single '"') :: Parsec Void String String
print $ parseMaybe parser1 "\"abc\""
let parser2 = single '"' *> many anySingle <* single '"' :: Parsec Void String String
print $ parseMaybe parser2 "\"abc\""
With all four of them, the manyTill parser successfully matches abc, since this doesn't depend on backtracking. With regex-applicative and ReadP, the many parser also successfully matches abc, since they both backtrack by default. With megaparsec, the many parser fails to match, since it doesn't backtrack by default. So far, everything makes sense. However, with attoparsec, the many parser fails to match, even though it does backtrack: its documentation says "attoparsec parsers always backtrack on failure" and "if you feed incremental input to an a parser, it will require memory proportional to the amount of input you supply. (This is necessary to support arbitrary backtracking.)". Why is this? Isn't that supposed to be exactly what backtracking is?
The meaning of "backtracking" in the Attoparsec documentation is different than the meaning of backtracking for the other backtracking parsers.
It helps to review what "backtracking" means when using
tryfor a Parsec or Megaparsec parser. These parsers have a concept of failing after consuming input ("consume err" = cerr) versus failing after consuming nothing ("empty err" = eerr). For these parsers, thep <|> qalternative operator treats failure ofpdifferently if it's cerr (fail the wholep <|> qimmediately) versus eerr (try the alternativeqinstead). Thetryfunction backtracks by converting cerr to eerr. That is,try p <|> qwill "backtrack" an erroneous consumption of the input stream in the eventpfails with cerr. It's a single step of backtracking on failure within alternatives (though with nestedtrycalls, multiple steps of backtracking can be performed in a sequence/cascade of parse failures).Attoparsec doesn't differentiate betweeen cerr and eerr, so it's as if all parsers are surrounded by
trycalls. This means that it automatically performs multiple steps of backtracking on failure within alternatives.ReadPbacktracks implicitly by simultaneously evaluating every possible parse in parallel, discarding those that ever fail, and picking the "first" one that remains. It "backtracks" failures over a tree of all possible parses, whether the failures were generated in the context of an alternative or not.It turns out that "multiple steps of backtracking on failure within alternatives" is a more modest form of backtracking than "backtracking over a tree of all possible parses".
A couple of simplified examples may help show the difference. Consider the parser:
and the input string
"bd". This parser fails with Parsec/Megaparsec. The left-hand alternative consumes the"b"withanyCharbefore failing, having consumed input (cerr), and the whole parser fails. This works fine with Attoparsec, though: the left-hand alternative fails atchar 'a', and Attoparsec backtracks on this failure within an alternative to trychar 'b'which succeeds. It also works withReadPwhich constructs all possible parses in parallel, before discarding the parse from the left-hand alternative whenchar 'a'fails, resulting in a single successful parse bychar 'b'.Now, consider the parser:
and the input string
"b". (Recall thatpure '*'consumes nothing and always succeeds.) This parser fails with Parsec/Megaparsec, becauseanyCharparses the"b", thepure '*'is ignored, and the empty string isn't matched bychar 'b'. It also fails with Attoparsec:anyCharsuccessfully parses the"b", and there's no failure in the context of an alternative, so no backtracking to try thepure '*'alternative. The attempt to parse the empty string withchar 'b'subsequently fails. (This failure, if it occurs in the context of another alternative might result in backtracking of that alternative, but never reconsideration of thispure '*'alternative.)In contrast, this parses fine with
ReadP.ReadPparses the alternatives in parallel, considering bothanyCharparsing the"b"andpure '*'parsing nothing. When thechar 'b'parse is tried, it fails on the former but succeeds on the latter.Back to your example. When parsing with Attoparsec, because:
the left alternative
(:) <$> anyChar <*> many anyCharwill continue to successfully match all the way up to and including the point whereanyCharmatches the closing quotation mark. At EOF, the left-hand side will fail (without consuming input, though Attoparsec doesn't care about that), and the right-hand side will succeed. The only failure within an alternative is at EOF, which didn't consume anything anyway, so Attoparsec's automatic "backtracking" plays no role; Megaparsec would have done the same thing. Anyway, once thismany anyCharhas succeeded, it won't be revisited, even if the terminatingchar '"'subsequently fails.So, that's why you need
manyTillto explicitly watch for the terminating character.