I realized that accents in my texts get converted to �. I boiled it down to the following example, which writes (and overwrites) the file test.txt.
It exclusively uses functions from Data.Text, which are supposed to handle Unicode text. I checked that both the source file and the output file are encoded in UTF-8.
{-# LANGUAGE OverloadedStrings #-}
import Prelude hiding (writeFile)
import Data.Text
import Data.Text.IO
someText :: Text
someText = "Université"
main :: IO ()
main = do
  writeFile "test.txt" someText
After running the code, test.txt contains: Universit�. In ghci, I get the following:
*Main> someText
"Universit\233"
Is this already encoded incorrectly? I also found a comment on � in https://hackage.haskell.org/package/text-1.2.2.2/docs/Data-Text.html, but I still do not know how to correct the example above.
How do I use accents in an OverloadedString and correctly write them to a file?
This has nothing to do with Data.Text, and certainly not with OverloadedStrings; both handle Unicode just fine.
However, Data.Text.IO will not write a BOM or anything else that indicates the encoding, i.e. the file really just contains the text as-is. On any modern system, this means it will be in raw UTF-8 form. So depending on what editor you open the file with, it may guess a wrong encoding, and that's apparently your issue. On Linux, UTF-8 has long been the standard, so there is no issue there, but Windows isn't so up-to-date. It should be possible to manually select the encoding in any editor, though.
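To see what actually ends up in the file, one option is to fix the encoding on the handle and read the raw bytes back. The following is only a minimal sketch: `writeFileWith` is a hypothetical helper (not part of the text package), the file names are arbitrary, and it assumes the `text` and `bytestring` packages that ship with GHC.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import System.IO

someText :: T.Text
someText = "Université"

-- Hypothetical helper: write a Text with an explicitly chosen encoding
-- instead of relying on the locale's default.
writeFileWith :: TextEncoding -> FilePath -> T.Text -> IO ()
writeFileWith enc path txt = withFile path WriteMode $ \h -> do
  hSetEncoding h enc
  TIO.hPutStr h txt

main :: IO ()
main = do
  -- Plain UTF-8: no BOM, 'é' is stored as the two bytes 195 169 (0xC3 0xA9).
  writeFileWith utf8 "test.txt" someText
  BS.readFile "test.txt" >>= print . BS.unpack
  -- prints [85,110,105,118,101,114,115,105,116,195,169]

  -- utf8_bom is documented to prepend the BOM bytes 0xEF 0xBB 0xBF on output.
  writeFileWith utf8_bom "test-bom.txt" someText
  BS.readFile "test-bom.txt" >>= print . BS.unpack
```

If the first list ends in 195 169, the file is valid UTF-8 and any remaining � is down to the program that displays it.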
In fact, Data.Text.IO.writeFile will use your locale to decide how to encode the file. Everybody should have UTF-8 as their locale nowadays; if you don't, please change that. To get a BOM in your file and thus preclude such issues, use utf8_bom as the encoding of the file handle (it can be set with hSetEncoding from System.IO).
Regarding the output you see in GHCi: that's the Show instance at work; it escapes any string-like value to the safest conceivable form, i.e. anything that's not ASCII becomes an escape sequence, which for 'é' happens to be '\233'. Again, this is not specific to Text; in fact you get this even for single characters. This escaping never happens when you use the direct IO output actions for your string types, i.e. putChar, putStr or putStrLn.
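The difference between Show and the direct output actions can be illustrated with a small program. This is only a sketch: setting stdout to UTF-8 is an assumption made here so that the last line does not depend on the locale, and it is not something the original example did.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import System.IO (hSetEncoding, stdout, utf8)

someText :: T.Text
someText = "Université"

main :: IO ()
main = do
  -- Assumption for this sketch: force UTF-8 on stdout, independent of locale.
  hSetEncoding stdout utf8
  print 'é'              -- goes through Show: prints '\233'
  print someText         -- goes through Show: prints "Universit\233"
  TIO.putStrLn someText  -- no Show involved: writes the characters themselves
```

GHCi uses Show when it echoes the value of an expression, which is why someText appears escaped at the prompt even though the Text value itself is intact.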