I am confused about what parsing needs to be done, and at which end (client or server).
When I send an umlaut 'Ö' to my ejabberd server, ejabberd receives it as the two bytes <<195,150>>.
I then forward this to my client as a push notification (silently, via GCM/APNS). From there, the client rebuilds the text by UTF-8 decoding each number one at a time (which is wrong), i.e. 195 on its own is decoded to the gibberish character � and so on.
Reconstructing the text properly requires knowing whether a character takes two, three or more bytes, and that seems to vary with the language of the letters (German here, for example).
How should the client identify which language it is reconstructing, i.e. how many bytes to decode in one go?
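For what it's worth, the byte count is a property of UTF-8 itself rather than of the language: the leading byte of each character announces how many bytes belong to it, so the whole byte sequence should go through a UTF-8 decoder instead of being decoded numeral by numeral. A minimal Erlang sketch using the bytes from above:

    %% 195 = 2#11000011: the leading '110' bits say this character is two bytes long.
    Decoded = unicode:characters_to_list(<<195,150>>, utf8),
    %% Decoded =:= [214], the single code point for Ö
    Encoded = unicode:characters_to_binary(Decoded, unicode, utf8).
    %% Encoded =:= <<195,150>>, back to the original two bytes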
To add more detail, the call

    lists:flatten(mochijson2:encode({struct,
        [{registration_ids, [Reg_id]},
         {data, [{message, Message},
                 {type, Type},
                 {enum, ENUM},
                 {groupid, Groupid},
                 {groupname, Groupname},
                 {sender, Sender_list},
                 {receiver, Content_list}]},
         {time_to_live, 2419200}]})).

produced the JSON:
"{\"registration_ids\":[\"APA91bGLjnkhqZlqFEp7mTo9p1vu9s92_A0UIzlUHnhl4xdFTaZ_0HpD5SISB4jNRPi2D7_c8D_mbhUT_k-T2Bo_i_G3Jt1kIqbgQKrFwB3gp1jeGatrOMsfG4gAJSEkClZFFIJEEyow\"],\"data\":{\"message\":[104,105],\"type\":[71,82,79,85,80],\"enum\":2001,\"groupid\":[71,73,68],\"groupname\":[71,114,111,117,112,78,97,109,101],\"sender\":[49,64,100,101,118,108,97,98,47,115,100,115],\"receiver\":[97,115,97,115]},\"time_to_live\":2419200}"
Here I passed "hi" as the message, and mochijson gave me the character codes [104,105]. The groupname field was given the value "Groupname", and its character codes after JSON creation are also correct, i.e. 71,114,111,117,112,78,97,109,101.
However, when I feed these values into http://www.unit-conversion.info/texttools/ascii/ they decode as Ǎo��me and not as "Groupname".
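To confirm that the numbers themselves are fine, in an Erlang shell (default settings assumed) they map straight back to the text:

    1> [71,114,111,117,112,78,97,109,101].
    "Groupname"
    2> list_to_binary([71,114,111,117,112,78,97,109,101]).
    <<"Groupname">>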
So, who should do the parsing, and how should this be handled? My reconstructed message is all gibberish once the character codes are put back together.
Thanks
There are several things to worry about here, and they have to do with both the encoding you want and the data structure you use. In Erlang, text is handled in one of the following ways:

- Lists of latin-1 characters ([0..255, ...]). These can be printed with io:format("~s~n", [List]). When that happens (with the ~s flag specifically), the VM assumes the encoding is latin-1 (ISO-8859-1).
- Lists of Unicode code points ([0..1114111, ...]). These are not in any particular encoding (UTF-x); they are raw code points. They can be printed with io:format("~ts~n", [List]), where ~ts is like ~s but as unicode.
- Binaries (<<0..255, ...>>), i.e. sequences of bytes in the binary format. A binary may contain:
  - bytes (0..255) without specific meaning (<<Bin/binary>>)
  - UTF-8 encoded data (<<Bin/utf8>>)
  - UTF-16 encoded data (<<Bin/utf16>>)
  - UTF-32 encoded data (<<Bin/utf32>>)
  In any case, io:format("~s~n", [Bin]) will still assume the sequence is latin-1, while io:format("~ts~n", [Bin]) will assume UTF-8 only.
- Mixed lists of the above, known as iodata (iodata()), used exclusively for output.

So, in a gist: a list carries no encoding of its own (it is read as latin-1 bytes or as raw code points depending on how you use it), and a binary is just bytes until you decide which encoding to read it with.
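To make this concrete, here is a small sketch (exactly how the characters display depends on your terminal and shell settings):

    CodePoints = [214],                                       %% Ö as a code point list (also a valid latin-1 list)
    Utf8Bin    = unicode:characters_to_binary(CodePoints, unicode, utf8),
    %% Utf8Bin =:= <<195,150>>: the same character as two UTF-8 bytes
    io:format("~s~n",  [CodePoints]),   %% printed assuming latin-1
    io:format("~ts~n", [Utf8Bin]).      %% printed assuming UTF-8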
Also note: until version 17.0, all Erlang source files were latin-1 only. 17.0 added an option to have the compiler read your source file as Unicode by adding this header near the top of the file:

    %% -*- coding: utf-8 -*-
The next factor is that JSON, by specification, assumes UTF-8 as the encoding for everything it contains. Furthermore, JSON libraries in Erlang tend to assume that a binary is a string and that a list is a JSON array. This means that if you want your output to be adequate, you must use UTF-8 encoded binaries to represent any JSON string.
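This is exactly why the output in the question shows [104,105] for "hi": an Erlang string is a list, so it was serialized as a JSON array. A quick sketch of the expected behaviour with mochijson2 (the library used in the question):

    %% A list is treated as a JSON array of numbers:
    lists:flatten(mochijson2:encode({struct, [{message, "hi"}]})).
    %% -> "{\"message\":[104,105]}"   (the same shape as in the question)

    %% A binary is treated as a JSON string:
    lists:flatten(mochijson2:encode({struct, [{message, <<"hi">>}]})).
    %% -> "{\"message\":\"hi\"}"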
If what you have is:

- a list of latin-1 or ASCII integers: use list_to_binary(List) to get the proper binary representation;
- a list of Unicode code points: use unicode:characters_to_binary(List, unicode, utf8) to get a UTF-8 encoded binary;
- a latin-1 binary: use unicode:characters_to_binary(Bin, latin1, utf8);
- a UTF-16 or UTF-32 binary: use unicode:characters_to_binary(Bin, utf16 | utf32, utf8);
- a UTF-8 binary: use it as is.

Take that UTF-8 binary and send it to the JSON library. If the JSON library is correct and the client parses it properly, then it should be right.
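As a sketch of how that could look for the message in the question (variable names are made up; mochijson2 as used above):

    Message = [214],                                                  %% "Ö" as a code point list
    Utf8Msg = unicode:characters_to_binary(Message, unicode, utf8),   %% <<195,150>>
    Json    = iolist_to_binary(
                mochijson2:encode({struct, [{message, Utf8Msg}]})).
    %% Depending on the library version, Ö may appear literally or as a
    %% \u00d6 escape; both are valid JSON and decode back to the same text.
    %% The client then parses the JSON and reads the string as UTF-8 text,
    %% never one byte at a time.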