Given I have the following file:
{
    "title": {
        "ger": "Komödie" (utf8, encoded as c3 b6)
    },
    "files": [
        {
            "filename": "Kom�die" (latin1, encoded as f6)
        }
    ]
}
(might look different if you try to copy-paste it)
This happened due to an application bug; I cannot fix the source that generates these files.
I am now trying to fix the charset of the filename field(s); there can be multiple of them. I first tried jq (single field):
value="$(jq '.files[0].filename' <in.txt | iconv -f latin1 -t utf-8)"
jq --arg f "$value" '.files[0].filename = $f' <in.txt
But jq interprets the whole file as UTF-8, which damages the single f6 byte.
I would like to find a solution in Python, but there, too, the input is interpreted as UTF-8 by default on Linux. I tried 'ascii', but that doesn't allow characters >= 128.
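To make the failure concrete, here is a small demo (just an illustration; the byte string is made up to match the example above) of what happens to the lone f6 byte under both decoders:

raw = b"Kom\xf6die"  # latin1-encoded "ö" is the single byte 0xf6, not valid UTF-8

# strict UTF-8 decoding rejects the byte outright ...
try:
    raw.decode('utf-8')
except UnicodeDecodeError as e:
    print('utf-8:', e)

# ... a lenient decode replaces it with U+FFFD, so the information is lost,
# and 'ascii' refuses anything >= 128 anyway
print(raw.decode('utf-8', errors='replace'))  # Kom�die
try:
    raw.decode('ascii')
except UnicodeDecodeError as e:
    print('ascii:', e)  # can't decode byte 0xf6 in position 3: ordinal not in range(128)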
Now I think I found a way, but the JSON serializer escapes all non-ASCII characters. As I (intentionally) work with the wrong character set, the escaped sequences are also garbage.
#!/usr/bin/python3
import json

# Read everything as latin1: every byte maps to a code point, so nothing gets rejected.
with open('in.txt', encoding='latin1') as fh:
    j = json.load(fh)

# The filename fields are now correct text; re-map them so that writing the
# file back as latin1 turns them into utf-8 bytes.
for f in j['files']:
    f['filename'] = f['filename'].encode('utf-8').decode('latin1')  # might be wrong, couldn't test

with open('out.txt', 'w', encoding='latin1') as fh:
    json.dump(j, fh)
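Tracing a single character shows that the re-mapping itself should be right, and where the escaping then ruins it (a minimal sketch; the variable names are only for the demo):

import json

s = 'ö'                                       # filename text after reading in.txt as latin1
mapped = s.encode('utf-8').decode('latin1')   # 'Ã¶' - latin1-encodes back to the utf-8 bytes c3 b6
print(mapped.encode('latin1'))                # b'\xc3\xb6' - exactly the bytes I want in out.txt
print(json.dumps(mapped))                     # "\u00c3\u00b6" - json escapes the non-ASCII mojibake, which is garbage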
What can I do to achieve the expected result, a clean, non-escaped UTF-8 JSON file?
A workaround I found so far is not using json at all. Instead, I only process those lines which are known to have the wrong character set. This works because I know the files are very well structured and formatted, and I can easily find the broken lines by static text patterns. Note that I'm still working with the wrong charset, but so far, opening utf-8 as latin1 and saving it as latin1 again hasn't damaged any of the utf-8 parts.
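Roughly, the line-based pass looks like this (a sketch under my assumptions: one filename per line, and the literal pattern "filename" is enough to find the broken lines):

#!/usr/bin/python3

with open('in.txt', encoding='latin1') as fh:
    lines = fh.readlines()

out = []
for line in lines:
    # Only the filename lines carry real latin1 text; everything else is utf-8
    # read as latin1, which the final latin1 write turns back into the original bytes.
    if '"filename"' in line:
        line = line.encode('utf-8').decode('latin1')
    out.append(line)

with open('out.txt', 'w', encoding='latin1') as fh:
    fh.writelines(out)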
output: