Parsing sentences in Haskell more effectively using attoparsec


I'm trying to parse an ebook in .txt form, to learn more about attoparsec and Haskell (I'm a newbie). In this case, I'm trying to count the number of sentences in the given text file. Here's my code:

{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Text
import qualified Data.Text as T
import qualified Data.Text.IO as Txt
import Data.List
import Control.Applicative ((<*>), (*>), (<$>), (<|>), pure)

data Prose = Prose {
  word :: [Char]
} deriving Show

optional :: Parser a -> Parser ()
optional p = option () (try p *> pure ())

specialChars = ['-', '_', '…', '“', '”', '\"', '\'', '’', '@', '#', '$',
                '%', '^', '&', '*', '(', ')', '+', '=', '~', '`', '{', '}',
                '[', ']', '/', ':', ';', ',']

inputSentence :: Parser Prose
inputSentence = Prose <$> many1' (letter <|> digit <|> space <|> satisfy (inClass specialChars))

sentenceSeparator :: Parser ()
sentenceSeparator = many1 (space <|> satisfy (inClass ".?!")) >> pure ()

sentenceParser :: String -> [Prose]
sentenceParser str = case parseOnly wp (T.pack str) of
    Left err -> error err
    Right x -> x
    where
        wp = optional sentenceSeparator *> inputSentence `sepBy1` sentenceSeparator

main :: IO()
main = do
  input <- readFile "test.txt"
  let sentences = sentenceParser input
  print sentences
  print $ length sentences

The full code is in the GitHub repo if you want a complete look at what I'm doing. My problem is that when I try to parse a text file with this input:

[screenshot of the input text]

I get output as follows:

[screenshot of the parser output]

So my question is, how can I:

  1. Make the parser realize that anything with "\n\n.." is a different sentence.
  2. Make the parser treat input like "Daniel G. Brinton" as just 1 sentence.

I've tried using isHorizontalSpace, but to no avail.
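
To make the two requirements above concrete, here is a minimal sketch of one possible approach, not a definitive solution. It does not reuse the parser from the question. Instead it tokenizes the text with attoparsec into whitespace-separated chunks and paragraph breaks, then groups the chunks into sentences. All names in it (Token, Chunk, ParaBreak, tokenStream, endsSentence, groupSentences, sentences) are hypothetical. It assumes that a paragraph break is any whitespace run containing two or more newlines, and that an "initial" is a single uppercase letter followed by a period. Under those assumptions, abbreviations such as "Dr." would still split a sentence, and a sentence that really ends in "I." would not be split.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Control.Applicative (many, (<|>))
import Data.Attoparsec.Text (Parser, endOfInput, parseOnly, takeWhile1)
import Data.Char (isSpace, isUpper)
import Data.Maybe (catMaybes)
import qualified Data.Text as T

-- Hypothetical token type: a run of non-space characters, or a blank-line gap.
data Token = Chunk T.Text | ParaBreak
  deriving (Eq, Show)

-- A whitespace run counts as a paragraph break if it holds 2+ newlines;
-- any other whitespace run is dropped (Nothing).
gap :: Parser (Maybe Token)
gap = do
  ws <- takeWhile1 isSpace
  pure (if T.count "\n" ws >= 2 then Just ParaBreak else Nothing)

chunk :: Parser (Maybe Token)
chunk = Just . Chunk <$> takeWhile1 (not . isSpace)

tokenStream :: Parser [Token]
tokenStream = catMaybes <$> many (gap <|> chunk) <* endOfInput

-- A chunk ends a sentence if, ignoring trailing quotes/brackets, it ends in
-- '.', '?' or '!' and is not an initial such as "G." (assumption, see above).
endsSentence :: T.Text -> Bool
endsSentence w
  | T.null core = False
  | isInitial = False
  | otherwise = T.last core `elem` (".?!" :: String)
  where
    -- '\8221' and '\8217' are the right double and single curly quotes.
    core = T.dropWhileEnd (`elem` ("\"')\8221\8217" :: String)) w
    isInitial = T.length core == 2 && isUpper (T.head core) && T.last core == '.'

-- Close the current sentence at a terminating chunk or a paragraph break.
groupSentences :: [Token] -> [T.Text]
groupSentences = go []
  where
    go acc [] = flush acc []
    go acc (ParaBreak : ts) = flush acc (go [] ts)
    go acc (Chunk w : ts)
      | endsSentence w = flush (w : acc) (go [] ts)
      | otherwise = go (w : acc) ts
    flush [] rest = rest
    flush acc rest = T.unwords (reverse acc) : rest

sentences :: T.Text -> Either String [T.Text]
sentences = fmap groupSentences . parseOnly tokenStream

main :: IO ()
main = do
  let sample = "THE MYTHS\n\nBy Daniel G. Brinton. It was late! Was it?"
  case sentences sample of
    Left err -> putStrLn ("parse error: " ++ err)
    Right ss -> do
      mapM_ (putStrLn . T.unpack) ss
      print (length ss)
```

Separating tokenization from grouping keeps the attoparsec part small: the parser only has to tell words from gaps, and rules such as "an initial does not end a sentence" live in an ordinary function that can be adjusted without touching the parser.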
