Parsing ASCII characters with Erlang

I'm confused about what parsing needs to be done, and at which end (client or server).

When I send an umlaut 'Ö' to my ejabberd server,
it is received by ejabberd as <<195,150>>.

I then send this to my client as a push notification (silently, via GCM/APNS). The client rebuilds the text by UTF-8 decoding each number one by one (this is wrong).

That is, 195 is first decoded to the gibberish character �, and so on.

The reconstruction needs to know whether two, three, or more bytes have to be decoded together. This varies with the language of the letters (German here, for example).

How would the client identify which language it is reconstructing (that is, the number of bytes to decode in one go)?

To add more,

lists:flatten(mochijson2:encode({struct,[{registration_ids,[Reg_id]},{data ,[{message,Message},{type,Type},{enum,ENUM},{groupid,Groupid},{groupname,Groupname},{sender,Sender_list},{receiver,Content_list}]},{time_to_live,2419200}]})).

This produced the JSON:

"{\"registration_ids\":[\"APA91bGLjnkhqZlqFEp7mTo9p1vu9s92_A0UIzlUHnhl4xdFTaZ_0HpD5SISB4jNRPi2D7_c8D_mbhUT_k-T2Bo_i_G3Jt1kIqbgQKrFwB3gp1jeGatrOMsfG4gAJSEkClZFFIJEEyow\"],\"data\":{\"message\":[104,105],\"type\":[71,82,79,85,80],\"enum\":2001,\"groupid\":[71,73,68],\"groupname\":[71,114,111,117,112,78,97,109,101],\"sender\":[49,64,100,101,118,108,97,98,47,115,100,115],\"receiver\":[97,115,97,115]},\"time_to_live\":2419200}"

where I had given "hi" as the message and mochijson2 gave me the ASCII values [104,105].

The groupname field was given the value "Groupname",
and the ASCII values are also correct after JSON creation, i.e. 71,114,111,117,112,78,97,109,101

However, when I use http://www.unit-conversion.info/texttools/ascii/

it is decoded as Ǎo��me and not "Groupname".

So, who should do the parsing, and how should it be handled?

My reconstructed message is all gibberish when it is rebuilt from the ASCII values.

Thanks

There are 1 best solutions below

There are several things to worry about here, and they have to do with both the desired encoding and the data structure. In Erlang, text is handled in one of the following ways:

  1. lists of bytes ([0..255, ...])
    • This is what you get if you listen to a socket and the data is returned as a list.
    • The VM assumes no encoding. They're bytes and mean little more.
    • The VM can however interpret these as strings (say in io:format("~s~n", [List])). When that happens (with the ~s flag specifically), the VM assumes the encoding is latin-1 (ISO-8859-1).
  2. lists of Unicode codepoints ([0..1114111, ...]).
    • You may get those from files that are read as unicode and as a list.
    • You can use them in output when you have a formatter such as io:format("~ts~n", [List]) where ~ts is like ~s but as unicode.
    • Those lists represent the codepoints you see in the unicode standard, without any encoding (they are not UTF-x)
    • This can work in conjunction with latin-1 lists of characters because the Unicode codepoints and latin-1 characters have the same numbers from 0 to 255.
  3. Binaries (<<0..255, ...>>)
    • This is what you get when you listen on, or read from, anything in binary mode.
    • The VM can be told to assume many things:
      1. They are sequences of bytes (0..255) without specific meaning (<<Bin/binary>>)
      2. They are utf-8 encoded sequences (<<Char/utf8, Rest/binary>>, one codepoint per utf8 segment)
      3. They are utf-16 encoded sequences (<<Char/utf16, Rest/binary>>)
      4. They are utf-32 encoded sequences (<<Char/utf32, Rest/binary>>)
    • io:format("~s~n", [Bin]) will still assume any sequence is a latin-1 sequence; io:format("~ts~n", [Bin]) will assume UTF-8 only.
  4. A mixed list of both unicode lists and utf-encoded binaries (known as unicode:chardata(); the byte-only equivalent is iodata()), used mostly for output.
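
To make the list above concrete, here is a minimal sketch that shows the character "Ö" (U+00D6) from the question in several of these forms. The module and function names (text_forms, demo) are made up for illustration; the calls themselves are from the standard unicode module and the bit syntax.

```erlang
-module(text_forms).
-export([demo/0]).

%% Hypothetical demo: one character, "Ö" (U+00D6), in several representations.
demo() ->
    Utf8Bin    = <<195, 150>>,                  % UTF-8 encoded binary (two bytes)
    ByteList   = binary_to_list(Utf8Bin),       % list of bytes, no encoding implied
    Codepoints = unicode:characters_to_list(Utf8Bin, utf8), % list of codepoints
    %% The bit syntax decodes one codepoint per utf8 segment.
    <<First/utf8, Rest/binary>> = Utf8Bin,
    {ByteList, Codepoints, First, Rest}.
```

Calling text_forms:demo() should return {[195,150], [214], 214, <<>>}: the same two bytes are one codepoint (214) once they are read as UTF-8.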

So, in short:

  • lists of bytes
  • lists of latin-1 characters
  • lists of Unicode codepoints
  • binary of bytes
  • utf-8 binary
  • utf-16 binary
  • utf-32 binary
  • lists of many of these for output that is quickly concatenated

Also to note: before R16B, all Erlang source files were latin-1 only. R16B added an option to have the compiler read your source file as UTF-8 by adding the header below, and 17.0 made UTF-8 the default:

%% -*- coding: utf-8 -*-

The next factor is that JSON, by specification, assumes UTF-8 as the encoding for all of its text. Furthermore, JSON libraries in Erlang tend to assume that a binary is a string and that a list is a JSON array.

This means that if you want your output to be correct, you must use UTF-8 encoded binaries to represent any JSON string.
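
As a rough sketch of what that could look like for the payload in the question: the helper below turns the string values of a property list into UTF-8 binaries before the list is handed to the JSON encoder. The module and function names (push_payload, data_fields, to_utf8) are hypothetical, and it assumes that string values arrive as lists of Unicode codepoints (not as lists of UTF-8 bytes) and that no field is meant to be a real JSON array.

```erlang
-module(push_payload).
-export([to_utf8/1, data_fields/1]).

%% Assumption: a list is a string of Unicode codepoints; a binary is
%% already UTF-8 encoded and is passed through unchanged.
to_utf8(Bin) when is_binary(Bin) -> Bin;
to_utf8(List) when is_list(List) ->
    unicode:characters_to_binary(List, unicode, utf8).

%% Convert every list value in a proplist; other values (for example
%% integers) are left as they are. Do not use this on fields that are
%% meant to be JSON arrays.
data_fields(Props) ->
    [{Key, convert(Value)} || {Key, Value} <- Props].

convert(Value) when is_list(Value) -> to_utf8(Value);
convert(Value) -> Value.
```

With this, push_payload:data_fields([{message, "hi"}, {enum, 2001}]) gives [{message, <<"hi">>}, {enum, 2001}], which could then be wrapped in {struct, ...} as in the question so that the encoder emits "hi" rather than [104,105].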

If what you have is:

  • A list of bytes that represent a UTF-8 encoded string: use list_to_binary(List) to get the proper binary representation
  • A list of codepoints: use unicode:characters_to_binary(List, unicode, utf8) to get a utf-8 encoded binary
  • A binary representing a latin-1 string: use unicode:characters_to_binary(Bin, latin1, utf8)
  • A binary of any other UTF encoding: use unicode:characters_to_binary(Bin, utf16 | utf32, utf8)
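
The four cases above can be sketched as one small module. The module and function names are invented for this example; the utf16 case relies on the unicode module treating plain utf16 as big-endian.

```erlang
-module(to_utf8).
-export([from_utf8_bytes/1, from_codepoints/1, from_latin1/1, from_utf16/1]).

%% List of bytes that already hold UTF-8 data: just pack them into a binary.
from_utf8_bytes(Bytes) -> list_to_binary(Bytes).

%% List of Unicode codepoints: encode them as UTF-8.
from_codepoints(Codepoints) ->
    unicode:characters_to_binary(Codepoints, unicode, utf8).

%% Latin-1 binary: re-encode as UTF-8.
from_latin1(Bin) -> unicode:characters_to_binary(Bin, latin1, utf8).

%% UTF-16 binary (big-endian by default): re-encode as UTF-8.
from_utf16(Bin) -> unicode:characters_to_binary(Bin, utf16, utf8).
```

For "Ö", all four should end up at the same binary, <<195,150>>: from_utf8_bytes([195,150]), from_codepoints([214]), from_latin1(<<214>>) and from_utf16(<<0,214>>). Note that unicode:characters_to_binary/3 returns an {error, ...} or {incomplete, ...} tuple on bad input, which real code should check for.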

Take that UTF-8 binary, and send it to the JSON library. If the JSON library is correct and the client parses it properly, then it should be right.
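
On the question of how many bytes to decode in one go: the receiver does not need to know the language, because in UTF-8 the first byte of each sequence states how long the sequence is. A client would normally just hand the whole byte sequence to its platform's UTF-8 decoder. The sketch below shows both ideas in Erlang for illustration only; the module and function names (utf8_client, seq_len, decode) are hypothetical.

```erlang
-module(utf8_client).
-export([seq_len/1, decode/1]).

%% Number of bytes in the UTF-8 sequence that starts with this lead byte.
seq_len(B) when is_integer(B), B >= 0, B < 16#80 -> 1;  % 0xxxxxxx (ASCII)
seq_len(B) when B >= 16#C2, B =< 16#DF -> 2;            % 110xxxxx
seq_len(B) when B >= 16#E0, B =< 16#EF -> 3;            % 1110xxxx
seq_len(B) when B >= 16#F0, B =< 16#F4 -> 4;            % 11110xxx
seq_len(_) -> invalid.               % continuation byte or illegal lead byte

%% Decode the whole UTF-8 byte sequence at once, never byte by byte.
decode(Bytes) when is_list(Bytes) -> decode(list_to_binary(Bytes));
decode(Bin) when is_binary(Bin) -> unicode:characters_to_list(Bin, utf8).
```

Here seq_len(195) is 2, so 195 and 150 belong together, and utf8_client:decode([195,150]) yields the single codepoint [214] ("Ö") instead of two gibberish characters.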