Download Wikipedia markup using Haskell


Using http-conduit, I want to download the raw MediaWiki markup for any page, for example the Wikipedia page Stack Overflow.

The solution should also work for Wikimedia sites other than en.wikipedia.org, for example de.wikibooks.org.

Note: This question was immediately answered in Q&A form and therefore intentionally does not show research effort!

Best answer

This answer uses query parameters in http-conduit, as described in this previous SO answer.

We will use the method described here on SO to download the markup content of a page.

Although this task is also possible using the MediaWiki API, it seems significantly simpler to use the ?action=raw method, which does not require calling the API explicitly.
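To make the ?action=raw approach concrete, here is a minimal sketch of the URL it produces. The helper name rawMarkupUrl is hypothetical (it is not part of any library), and the sketch assumes the wiki serves pages under /wiki/.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as B
import Data.Monoid ((<>))
import Network.HTTP.Types (urlEncode)

-- | Build the URL that returns the raw markup of a page.
-- Hypothetical helper; assumes MediaWiki is reachable under /wiki/.
rawMarkupUrl :: ByteString -- ^ Domain, e.g. "en.wikipedia.org"
             -> ByteString -- ^ Page title
             -> ByteString
rawMarkupUrl domain page =
    "https://" <> domain <> "/wiki/" <> urlEncode True page <> "?action=raw"

main :: IO ()
main = B.putStrLn (rawMarkupUrl "en.wikipedia.org" "Haskell")
-- prints https://en.wikipedia.org/wiki/Haskell?action=raw
```

Opening such a URL in a browser is a quick way to check that a given wiki supports this method before writing any HTTP code.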

In order to support different wikis (e.g. de.wikibooks.org), I wrote two functions, getWikipediaPageMarkup and getEnwikiPageMarkup. The former is more general and accepts a custom domain (any domain should work, assuming MediaWiki is installed under /wiki).

{-# LANGUAGE OverloadedStrings #-}
import Network.HTTP.Conduit
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy.Char8 as LB
import Network.HTTP.Types (urlEncode)
import Data.Monoid ((<>))

-- | Get the MediaWiki markup of a page
getWikipediaPageMarkup :: ByteString -- ^ The wiki domain, e.g. "en.wikipedia.org"
                       -> ByteString -- ^ The title of the page to download
                       -> IO LB.ByteString -- ^ The page markup
getWikipediaPageMarkup domain page = do
    let url = "https://" <> domain <> "/wiki/" <> urlEncode True page
    request <- parseUrl $ B.unpack url
    let request' = setQueryString [("action", Just "raw")] request
    fmap responseBody $ withManager $ httpLbs request'

-- | Like @getWikipediaPageMarkup@, but hardcoded to 'en.wikipedia.org'
getEnwikiPageMarkup :: ByteString -> IO LB.ByteString
getEnwikiPageMarkup = getWikipediaPageMarkup "en.wikipedia.org"
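
Both functions return the markup as raw bytes. If text is needed (for example for pages on de.wikibooks.org that contain umlauts), the bytes can be decoded. The following is a minimal sketch, assuming the wiki serves UTF-8 and that the text package is available; markupToText is a hypothetical helper name.

```haskell
import qualified Data.ByteString.Lazy as LB
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TLIO
import Data.Text.Lazy.Encoding (decodeUtf8With)
import Data.Text.Encoding.Error (lenientDecode)

-- | Decode downloaded markup as UTF-8 (assumption: the wiki serves UTF-8).
-- Invalid byte sequences are replaced with U+FFFD instead of throwing.
markupToText :: LB.ByteString -> TL.Text
markupToText = decodeUtf8With lenientDecode

main :: IO ()
main =
    -- The UTF-8 bytes of "Grüße", standing in for a downloaded page
    TLIO.putStrLn (markupToText (LB.pack [0x47, 0x72, 0xC3, 0xBC, 0xC3, 0x9F, 0x65]))
```

Lenient decoding is a design choice here: a single malformed byte then does not abort processing of an otherwise valid page.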

Note that http-conduit 2.1 or later is required to compile getWikipediaPageMarkup as written (tested with 2.1.4).
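
In later http-conduit releases, parseUrl and withManager were deprecated. The following is a sketch of the same download using parseRequest and an explicit Manager instead, assuming a release that exports parseRequest. The names rawRequest and getWikipediaPageMarkup', and the User-Agent string, are placeholders chosen for this example. Setting a User-Agent is an addition that the original code does not make.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Network.HTTP.Conduit
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy.Char8 as LB
import Network.HTTP.Types (urlEncode, hUserAgent)
import Data.Monoid ((<>))

-- | Build the ?action=raw request without sending it (hypothetical helper).
rawRequest :: ByteString -> ByteString -> IO Request
rawRequest domain page = do
    request <- parseRequest $ B.unpack ("https://" <> domain <> "/wiki/" <> urlEncode True page)
    -- Placeholder User-Agent; replace it with one that identifies your client.
    return $ setQueryString [("action", Just "raw")]
           request { requestHeaders = [(hUserAgent, "example-markup-fetcher/0.1")] }

-- | Variant of getWikipediaPageMarkup that uses an explicit Manager.
getWikipediaPageMarkup' :: ByteString -> ByteString -> IO LB.ByteString
getWikipediaPageMarkup' domain page = do
    manager <- newManager tlsManagerSettings
    request <- rawRequest domain page
    fmap responseBody (httpLbs request manager)

main :: IO ()
main = getWikipediaPageMarkup' "en.wikipedia.org" "Stack Overflow" >>= LB.putStrLn
```

Unlike parseUrl, parseRequest does not throw on non-2xx status codes, so a missing page yields the error body rather than an exception. Creating a Manager per call keeps the sketch short; a long-running program would normally create one Manager and reuse it.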