Encoding does not switch when trying to read json file


I have a JSON file, file.json, encoded in KOI8-R.

Boost's JSON parser only works with UTF-8, so I'm trying to convert the file from KOI8-R to UTF-8 while reading it:

boost::property_tree::ptree tree;

std::locale loc = boost::locale::generator().generate("ru_RU.UTF-8");
std::ifstream ifs("file.json", std::ios::binary);
ifs.imbue(loc);

boost::property_tree::read_json(ifs, tree);

However, the file cannot be read. What am I doing wrong?

UPDATE:

I made up a JSON file "test.txt":

{
    "соплодие": "лысеющий",
    "обсчитавший": "перегнавший",
    "кариозный": "отдёргивающийся",
    "суверенен": "носившийся",
    "рецидивизм": "поляризуются"
}

And saved it in koi8-r.

I have this code:

#include <boost/property_tree/ptree.hpp>
#include <boost/property_tree/json_parser.hpp>

int main() {
    boost::property_tree::ptree pt;
    boost::property_tree::read_json("test.txt", pt);
}

Compiled, ran and got the following error:

terminate called after throwing an instance of 'boost::wrapexcept<boost::property_tree::json_parser::json_parser_error>'
  what():  test.txt(2): invalid code sequence
Aborted (core dumped)

Then I tried Boost.Locale:

#include <boost/property_tree/ptree.hpp>
#include <boost/property_tree/json_parser.hpp>

#include <boost/locale/generator.hpp>
#include <boost/locale/encoding.hpp>

#include <fstream>
#include <locale>


int main() {
    std::locale loc = boost::locale::generator().generate("ru_RU.utf8");
    std::ifstream ifs("test.txt", std::ios::binary);
    ifs.imbue(loc);
    
    boost::property_tree::ptree pt;
    boost::property_tree::read_json(ifs, pt);
}

Compiled (g++ main.cpp -lboost_locale), ran and got the following error:

terminate called after throwing an instance of 'boost::wrapexcept<boost::property_tree::json_parser::json_parser_error>'
  what():  <unspecified file>(2): invalid code sequence
Aborted (core dumped)

Answer:

The JSON spec (RFC 8259) requires UTF-8:

8.1. Character Encoding

 JSON text exchanged between systems that are not part of a closed
 ecosystem MUST be encoded using UTF-8 [RFC3629].

It makes sense for a general-purpose library to support only that. For more context, see the question "JSON character encoding - is UTF-8 well-supported by browsers or should I use numeric escape sequences?"
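To illustrate why a UTF-8-only parser reports something like "invalid code sequence" on line 2 of the file (the first line with Cyrillic text), here is a minimal sketch of a UTF-8 well-formedness check. The helper name `is_valid_utf8` is made up for this example and is not part of Boost; the real parser's checks may differ in detail.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a Boost API): true if `s` is well-formed UTF-8.
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra;     // continuation bytes expected after the lead byte
        unsigned long cp;      // code point being decoded
        unsigned long min_cp;  // smallest code point allowed for this length
        if (lead < 0x80)                { extra = 0; cp = lead;        min_cp = 0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min_cp = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (s.size() - i - 1 < extra) return false;  // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;    // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp > 0x10FFFF) return false;                 // outside Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate code point
        i += extra + 1;
    }
    return true;
}
```

In KOI8-R the word "соплодие" starts with the bytes D3 CF. D3 looks like the lead byte of a two-byte UTF-8 sequence, but CF is not a continuation byte, so a check like this fails immediately.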

How To Do It Anyways

Convert the text to UTF-8 before parsing, for example with libiconv or ICU (libicu). Boost.Locale supports the latter.
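For the libiconv route, a minimal sketch using the POSIX `iconv` interface might look like the following. The helper name `to_utf8_iconv` is made up for this example, and the code assumes a platform where `iconv` takes a `char**` input pointer (as glibc does) and knows the "KOI8-R" charset name.

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts `input` from `from_charset` to UTF-8 via POSIX iconv.
std::string to_utf8_iconv(const std::string& input, const char* from_charset) {
    iconv_t cd = iconv_open("UTF-8", from_charset);  // arguments: (tocode, fromcode)
    if (cd == reinterpret_cast<iconv_t>(-1))
        throw std::runtime_error("iconv_open failed: unsupported conversion");

    std::string output;
    char buffer[4096];
    // iconv wants a non-const pointer but does not modify the input bytes.
    char* in = const_cast<char*>(input.data());
    std::size_t in_left = input.size();

    while (in_left > 0) {
        char* out = buffer;
        std::size_t out_left = sizeof buffer;
        std::size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
        int err = errno;  // save before any other call can change it
        output.append(buffer, sizeof buffer - out_left);
        // E2BIG only means the output buffer filled up; loop again to continue.
        if (rc == static_cast<std::size_t>(-1) && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("iconv failed: invalid or incomplete input");
        }
    }
    // KOI8-R is stateless, so no final flush call is made in this sketch.
    iconv_close(cd);
    return output;
}
```

The returned string can then be handed to the JSON parser, for example via a `std::istringstream` for `read_json`.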

Using Boost Locale/ICU

This requires that your Boost.Locale build has ICU support and, possibly, that the required locales are installed, which is likely already the case on your system.

It also assumes the source code is saved in UTF-8, which, again, is likely.

Live On Compiler Explorer

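#include <boost/locale.hpp>
#include <boost/json.hpp>
#include <boost/json/src.hpp> // header-only mode: builds Boost.JSON into this translation unit
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

namespace json = boost::json;

int main() {
    // Read the raw KOI8-R bytes. istreambuf_iterator preserves whitespace,
    // whereas istream_iterator<char> would silently skip it.
    std::string koi8r = [] {
        std::ifstream ifs("input.txt", std::ios::binary);
        return std::string(std::istreambuf_iterator<char>(ifs), {});
    }();

    // Convert KOI8-R to UTF-8 first, then parse the UTF-8 text.
    json::value doc =
        json::parse(boost::locale::conv::to_utf<char>(koi8r, "KOI8-R"));

    std::cout << "Serialized back: " << doc << "\n";

    // The key literal matches only if this source file is saved as UTF-8.
    std::cout << "Extracting a single key: " << doc.as_object()["соплодие"] << "\n";
}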
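One detail worth spelling out when slurping a file into a string before conversion: `std::istream_iterator<char>` reads through `operator>>` and therefore skips spaces and newlines, while `std::istreambuf_iterator<char>` returns every byte. A minimal sketch of a reusable reader, with the made-up name `read_file_bytes`:

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper: reads the whole file as raw bytes, whitespace included.
std::string read_file_bytes(const std::string& path) {
    std::ifstream ifs(path, std::ios::binary);
    if (!ifs)
        throw std::runtime_error("cannot open " + path);
    // istreambuf_iterator does unformatted reads, so nothing is dropped.
    return std::string(std::istreambuf_iterator<char>(ifs),
                       std::istreambuf_iterator<char>());
}
```

For this particular input the difference does not show up in the output, because none of the strings contain spaces and the serializer drops insignificant whitespace anyway, but it would matter for values such as "два слова".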

I made up a random JSON:

{
    "соплодие": "лысеющий",
    "обсчитавший": "перегнавший",
    "кариозный": "отдёргивающийся",
    "суверенен": "носившийся",
    "рецидивизм": "поляризуются"
}

And saved it in koi8-r as "input.txt":

00000000: 7b0a 2020 2020 22d3 cfd0 cccf c4c9 c522  {.    "........"
00000010: 3a20 22cc d9d3 c5c0 ddc9 ca22 2c0a 2020  : "........",.  
00000020: 2020 22cf c2d3 dec9 d4c1 d7db c9ca 223a    "...........":
00000030: 2022 d0c5 d2c5 c7ce c1d7 dbc9 ca22 2c0a   "...........",.
00000040: 2020 2020 22cb c1d2 c9cf dace d9ca 223a      ".........":
00000050: 2022 cfd4 c4a3 d2c7 c9d7 c1c0 ddc9 cad3   "..............
00000060: d122 2c0a 2020 2020 22d3 d5d7 c5d2 c5ce  .",.    ".......
00000070: c5ce 223a 2022 cecf d3c9 d7db c9ca d3d1  ..": "..........
00000080: 222c 0a20 2020 2022 d2c5 c3c9 c4c9 d7c9  ",.    "........
00000090: dacd 223a 2022 d0cf ccd1 d2c9 dad5 c0d4  ..": "..........
000000a0: d3d1 220a 7d0a                           ..".}.

Now running that program shows:

Serialized back: {"соплодие":"лысеющий","обсчитавший":"перегнавший","кариозный":"отдёргивающийся","суверенен":"носившийся","рецидивизм":"поляризуются"}
Extracting a single key: "лысеющий"