I am trying to parse MediaWiki dump files using xml-conduit. There are two elements that I am interested in, siteinfo and page, which I want to parse into SiteInfo and WikiDoc values. Here is a sample XML file: https://gist.github.com/shadow-fox/7ff8df7a953e0ca9534bef45700686fe
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Control.Monad.Trans.Resource (runResourceT, MonadThrow)
import Data.Conduit (Consumer, ($$))
import Data.Text (Text, pack, unpack)
import Data.Text.Read (decimal)
import Data.XML.Types (Event)
import Text.XML.Stream.Parse

data SiteInfo = SiteInfo {
    name :: Text,
    dbname :: Text,
    base :: Text,
    generator :: Text,
    isCaseSensitive :: Bool,
    namespaces :: [NameSpace]
} deriving (Show, Read)

data NameSpace = NameSpace {
    keyns :: Int,
    casens :: Text,
    value :: Text
} deriving (Show, Read)

data WikiDoc = WikiDoc {
    title :: Text,
    namespace :: Text,
    pageId :: Text,
    revision :: Revision
} deriving (Show, Read)

data Revision = Revision {
    id :: Int,
    parentId :: Int,
    timestamp :: Text,
    comment :: Text,
    model :: Text,
    format :: Text,
    text :: Text,
    sha :: Text
} deriving (Show, Read)

parseSiteInfo :: MonadThrow m => Consumer Event m SiteInfo
parseSiteInfo = force "siteinfo tag missing" $ do
    n <- tagNoAttr "{http://www.mediawiki.org/xml/export-0.10/}sitename" content
    db <- tagNoAttr "{http://www.mediawiki.org/xml/export-0.10/}dbname" content
    b <- tagNoAttr "{http://www.mediawiki.org/xml/export-0.10/}base" content
    g <- tagNoAttr "{http://www.mediawiki.org/xml/export-0.10/}generator" content
    c <- tagNoAttr "{http://www.mediawiki.org/xml/export-0.10/}case" content
    ns <- tag' "{http://www.mediawiki.org/xml/export-0.10/}namespaces" $ many parseNamespace
    return SiteInfo { name = n, dbname = db, base = b, generator = g, isCaseSensitive = c, namespaces = ns }

parseNamespace :: MonadThrow m => Consumer Event m NameSpace
parseNamespace = do
    tag' "{http://www.mediawiki.org/xml/export-0.10/}namespace" (requireAttr "key") $ \key -> do
        v <- content
        return $ NameSpace { key = read $ unpack key, value = v}

parseRevision :: MonadThrow m => Consumer Event m Revision
parseRevision = force "revision tag missing" $ do
    tagNoAttr "{http://www.mediawiki.org/xml/export-0.10/}id" content
    pid <- tagNoAttr "{http://www.mediawiki.org/xml/export-0.10/}parentid" content
    ts <- tagNoAttr "{http://www.mediawiki.org/xml/export-0.10/}timestamp" content
    con <- tagNoAttr "{http://www.mediawiki.org/xml/export-0.10/}contributor" content
    un <- tagNoAttr "{http://www.mediawiki.org/xml/export-0.10/}username" content
    revid <- tagNoAttr "{http://www.mediawiki.org/xml/export-0.10/}id" content
    com <- tagNoAttr "{http://www.mediawiki.org/xml/export-0.10/}comment" content
    m <- tagNoAttr "{http://www.mediawiki.org/xml/export-0.10/}model" content
    f <- tagNoAttr "{http://www.mediawiki.org/xml/export-0.10/}format" content
    t <- tagIgnoreAttrs "{http://www.mediawiki.org/xml/export-0.10/}text" content
    s <- tagNoAttr "{http://www.mediawiki.org/xml/export-0.10/}sha1" content
    return Revision {id = revid, parentId = pid, timestamp = ts, comment = com, model = m, format = f, text = t, sha = s}

parsePage :: MonadThrow m => Consumer Event m WikiDoc
parsePage = force "page tag missing" $
    t <- force "title tag missing" $ tagNoAttr "{http://www.mediawiki.org/xml/export-0.10/}title" content
    ns <- force "ns tag missing" $ tagNoAttr "{http://www.mediawiki.org/xml/export-0.10/}ns" content
    id <- force "id tag missing" $ tagNoAttr "{http://www.mediawiki.org/xml/export-0.10/}id" content
    _ <- tagNoAttr "{http://www.mediawiki.org/xml/export-0.10/}restrictions" content
    rev <- tagNoAttr "{http://www.mediawiki.org/xml/export-0.10/}revision" $ parseRevision
    return $ WikiDoc {title = t, namespace = ns, pageId = id, revision = rev}

main :: IO ()
main = do
    wikiPages <- parseFile def "sample.xml" $$ parseXml
    print wikiPages

I have the bits and pieces, but I don't know how to tie them all together and get the desired result.
I also don't know how to read a tag that has more than one attribute, for example the namespace tag: <namespace key="-2" case="case-sensitive">Media</namespace>
I want the final result to hold both the SiteInfo and the WikiDoc.
You should use the "Applicative" pattern to combine the two parsers:
(,) <$> parserOne <*> parserTwo
Here's a complete example which I wrote yesterday:
Here's an example XML:
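The applicative combination described above can be sketched as follows. This is an illustrative sketch, not the original example referred to in this answer. It assumes xml-conduit 1.5 or later (for tag' and attr) and conduit 1.3 or later (for ConduitT, runConduit and .|), whereas the question uses the older Consumer and $$ names. The types Info, Namespace and Page, the helper parseSample, and the inline sampleXml are hypothetical and much smaller than a real dump. The sample has no xmlns declaration, so plain element names are used; for a real MediaWiki dump each name would presumably need the {http://www.mediawiki.org/xml/export-0.10/} prefix, as in the question.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.ByteString.Lazy.Char8 as BL
import           Data.Conduit               (ConduitT, runConduit, (.|))
import           Data.Text                  (Text)
import           Data.XML.Types             (Event)
import           Text.XML.Stream.Parse

-- Hypothetical, cut-down result types (not the question's full records).
data Namespace = Namespace
  { nsKey :: Text, nsCase :: Maybe Text, nsValue :: Text } deriving (Show, Eq)

data Info = Info
  { siteName :: Text, dbName :: Text, namespaces :: [Namespace] } deriving (Show, Eq)

data Page = Page
  { pageTitle :: Text, pageNs :: Text } deriving (Show, Eq)

-- Several attributes: AttrParser is itself Applicative, so the same
-- (,) <$> ... <*> ... pattern collects "key" (required) and "case" (optional).
parseNamespace :: ConduitT Event o IO (Maybe Namespace)
parseNamespace =
  tag' "namespace" ((,) <$> requireAttr "key" <*> attr "case") $ \(k, c) ->
    Namespace k c <$> content

-- Child elements are consumed in document order by chaining with <*>.
parseInfo :: ConduitT Event o IO (Maybe Info)
parseInfo = tagNoAttr "siteinfo" $
  Info <$> force "sitename missing"   (tagNoAttr "sitename" content)
       <*> force "dbname missing"     (tagNoAttr "dbname" content)
       <*> force "namespaces missing" (tagNoAttr "namespaces" (many parseNamespace))

parsePage :: ConduitT Event o IO (Maybe Page)
parsePage = tagNoAttr "page" $
  Page <$> force "title missing" (tagNoAttr "title" content)
       <*> force "ns missing"    (tagNoAttr "ns" content)

-- The two parsers tied together: one siteinfo, then any number of pages.
parseDump :: ConduitT Event o IO (Info, [Page])
parseDump = force "mediawiki missing" $ tagIgnoreAttrs "mediawiki" $
  (,) <$> force "siteinfo missing" parseInfo <*> many parsePage

-- Tiny stand-in for a dump file (hypothetical data).
sampleXml :: BL.ByteString
sampleXml = BL.pack $ unlines
  [ "<mediawiki version=\"0.10\">"
  , "  <siteinfo>"
  , "    <sitename>Wikipedia</sitename>"
  , "    <dbname>enwiki</dbname>"
  , "    <namespaces>"
  , "      <namespace key=\"-2\" case=\"case-sensitive\">Media</namespace>"
  , "      <namespace key=\"0\">Main</namespace>"
  , "    </namespaces>"
  , "  </siteinfo>"
  , "  <page><title>First</title><ns>0</ns></page>"
  , "  <page><title>Second</title><ns>0</ns></page>"
  , "</mediawiki>"
  ]

parseSample :: IO (Info, [Page])
parseSample = runConduit $ parseLBS def sampleXml .| parseDump

main :: IO ()
main = parseSample >>= print
```

To read from a file instead, parseLBS def sampleXml could be swapped for parseFile def "sample.xml" run inside runResourceT, as the question already attempts. Note that a parser built with <*> expects the elements in exactly the listed order, so any element that appears in the dump but is not consumed (for example base or generator) would have to be parsed or explicitly skipped.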