Parsec, read text ended by a string

Question

1.5k Views Asked by eskaev At 07 June 2025 at 04:48

I am struggling with Parsec to parse a small subset of the Google project wiki syntax, and convert it into HTML. My syntax is limited to text sequences and item lists. Here is an example of what I want to recognize:

Text that can contain any kind of characters,
except the string "\n *"
 * list item 1
 * list item 2

End of list

My code so far is:

import Text.Blaze.Html5 (Html, toHtml)
import qualified Text.Blaze.Html5 as H
import Text.ParserCombinators.Parsec hiding (spaces)

parseList :: Parser Html
parseList = do
    items <- many1 parseItem
    return $ H.ul $ sequence_ items

parseItem :: Parser Html
parseItem = do
    string "\n *"
    item <- manyTill anyChar $
        (try $ lookAhead $ string "\n *") <|>
        (try $ string "\n\n")
    return $ H.li $ toHtml item

parseText :: Parser Html
parseText = do
    text <- manyTill anyChar $
        (try $ lookAhead $ string "\n *") <|>
        (eof >> (string ""))
    return $ toHtml text

parseAll :: Parser Html
parseAll = do
    l <- many (parseUl <|> parseText)
    return $ H.html $ sequence_ l

When I apply parseAll to any sequence of characters, I get the following error message: "*** Exception: Text.ParserCombinators.Parsec.Prim.many: combinator 'many' is applied to a parser that accepts an empty string." I understand that this happens because my parser parseText can succeed on an empty string, but I can't see any other way. How can I recognize text delimited by a string ("\n *" here)?

I am also open to any remarks or suggestions about the way I am using Parsec. I can't help but see that my code is a bit ugly. Can I do all this in a simpler way? For example, there is code duplication (which is kind of painful) because the string "\n *" is used to recognize the end of a text sequence, the beginning of a list item, AND the end of a list item...

Original Q&A

There are 2 best solutions below

user3125280 On 23 December 2013 at 01:24

The problem is that the manyTill combinator matches zero or more anyChar, so parseText can succeed without consuming any input. Just change parseText to match at least one anyChar, so that it fails when it is positioned at one of the separators. Unfortunately, there is no many1Till combinator.

Also I prefer parseAll = fmap (H.html . sequence_) $ many (parseUl <|> parseText), since you asked for tips on the ugliness.

parseText = do
               notFollowedBy $ string "\n *"
               first <- anyChar
               rest <- manyTill anyChar $
                       (try $ lookAhead $ string "\n *") <|>
                       (eof >> (string ""))
               return $ toHtml (first:rest)

parseAll = fmap (H.html . sequence_) $ many (parseUl <|> parseText)
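Since Parsec has no many1Till, another way to get "at least one character, stopping before the separator" is to guard every character with notFollowedBy. This is only a minimal sketch: the names bullet and textBlock are made up for illustration and do not come from the question, and the result is a plain String rather than Html.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical name: the delimiter that starts a list item.
bullet :: Parser String
bullet = string "\n *"

-- One or more characters, stopping before the next bullet or at end of input.
-- many1 fails without consuming when no character can be taken, so this
-- parser never accepts the empty string and is safe to use under `many`.
textBlock :: Parser String
textBlock = many1 (notFollowedBy bullet >> anyChar)

main :: IO ()
main = do
    print (parse textBlock "demo" "some text\n * item")
    print (parse (many textBlock) "demo" "")
```

The guard is applied to the consuming parser bullet directly, not to a lookAhead of it, and no separate eof branch is needed because anyChar simply fails at end of input.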

That said, "parseUl" on google gives only this question so I don't know a better solution without understanding that parser.

Desperate for my first accepted answer, I wrote it out in full :) just add the html stuff on top with fmap (preferred) or return.

module Main where
import System.Environment
import Control.Monad
import Text.ParserCombinators.Parsec hiding (spaces)

parseList :: Parser [String]
parseList = many1 parseItem

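parseItem :: Parser String
-- try: a lone '\n' that does not start a bullet must not be consumed,
-- otherwise many1 parseItem fails after the last item
parseItem = try (string "\n *") >> (manyTill anyChar $ try $ lookAhead $ char '\n')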

parseText :: Parser String
parseText = do
               notFollowedBy $ string "\n *" 
               first <- anyChar
               rest <- manyTill anyChar $
                   (try $ lookAhead $  string "\n *") <|>
                   (eof >> (string ""))
               return $ first:rest

parseAll :: Parser [String]
parseAll = many $ parseText <|> fmap concat parseList

parseIt :: String -> String
parseIt input = case parse parseAll "wiki" input of
    Left err -> "No match: " ++ show err
    Right _ -> "It worked"

main :: IO ()
main = do
          args <- getArgs
          putStrLn (parseIt (args !! 0))

I assumed list items can't contain newlines, but the try $ lookAhead $ char '\n' is easily tweaked. You can factor out string "\n *" to avoid duplication. Here I flattened each list with concat and left out the sequence step, but you'll have to change that. It would all be simpler if you divided the "text" into lines of text and then just checked for either a line of text or a line from a list.
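The line-based idea can be sketched without Parsec at all. The following is an illustration under assumptions, not the answerer's code: Block, classify and toBlocks are hypothetical names, any line beginning with " *" is treated as a list item (including the very first line), and consecutive lines of the same kind are grouped together.

```haskell
import Data.List (stripPrefix)

-- Hypothetical block type for this sketch.
data Block = Para String | Bullets [String]
    deriving (Show, Eq)

-- Right item for a line starting with " *", Left text for anything else.
classify :: String -> Either String String
classify l = case stripPrefix " *" l of
    Just item -> Right (dropWhile (== ' ') item)
    Nothing   -> Left l

-- Group consecutive list lines into one Bullets block and consecutive
-- text lines into one Para block (joined with newlines).
toBlocks :: String -> [Block]
toBlocks = foldr step [] . map classify . lines
  where
    step (Right item) (Bullets items : rest) = Bullets (item : items) : rest
    step (Right item) rest                   = Bullets [item] : rest
    step (Left txt)   (Para p : rest)        = Para (txt ++ "\n" ++ p) : rest
    step (Left txt)   rest                   = Para txt : rest

main :: IO ()
main = print (toBlocks "intro\n * a\n * b\n\nend")
```

Each Block can then be mapped to HTML in a separate step, which keeps the delimiter " *" in a single place.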

**user2407038** · Accepted Answer

parseItem :: Parser String
parseItem = do
    manyTill anyChar $
        (try $ lookAhead $ string "\n *") <|>
        (try $ string "\n\n")

parseText :: Parser [String]
parseText = 
  string "\n *" >> -- remove this if text *can't* contain a leading '\n *'
  sepBy1 parseItem (string "\n *")

I removed the HTML stuff because for whatever reason I couldn't get blaze-html to install on my machine. But in principle it should be essentially the same thing. This parses strings delimited by the string "\n *" and ended by the string "\n\n". I don't know if having a leading "\n *" is what you want, but that is easy to fix.

Also, I don't know if the empty string is valid. You should change sepBy1 to sepBy if it is.

As for the error you were getting: you have string "" inside of many. Not only does this give the error you got, it doesn't make any sense! The parser string "" will always succeed without consuming anything, since the empty string is a prefix of all strings and "" ++ x == x. If you try to do this multiple times then you will never finish parsing.
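To see which parsers trigger this, it can help to check whether they succeed on empty input. The sketch below is a rough check only (a parser can also succeed without consuming on non-empty input); acceptsEmpty and looseText are hypothetical names, with looseText shaped like the question's parseText.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical helper: True when the parser succeeds on empty input.
acceptsEmpty :: Parser a -> Bool
acceptsEmpty p = either (const False) (const True) (parse p "check" "")

-- Shaped like the question's parseText: the eof branch lets it succeed
-- with an empty result, which is what `many` complains about.
looseText :: Parser String
looseText = manyTill anyChar (try (lookAhead (string "\n *")) <|> ("" <$ eof))

main :: IO ()
main = do
    print (acceptsEmpty (string ""))      -- succeeds without consuming
    print (acceptsEmpty looseText)        -- also accepts the empty string
    print (acceptsEmpty (many1 anyChar))  -- needs at least one character
```

Any parser for which this check returns True should not be passed to many directly.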

Besides all that, your parseList should parse your language. It essentially does the same thing that sepBy does. I just think sepBy is cleaner :)
