Ruby string escape for supplementary plane Unicode characters

I know that I can escape a basic Unicode character in Ruby with the \uNNNN escape sequence. For example, for a smiling face U+263A (☺) I can use the string literal "\u263A".

How do I escape Unicode characters greater than U+FFFF that fall outside the basic multilingual plane, like a winking face: U+1F609 (😉)?

Using the surrogate pair form like in Java doesn't work; it results in an invalid string that contains the individual surrogate code points:

s = "\uD83D\uDE09" # => "\xED\xA0\xBD\xED\xB8\x89"
s.valid_encoding? # => false

Best answer

You can use the escape sequence \u{XXXXXX}, where XXXXXX is between 1 and 6 hex digits:

s = "\u{1F609}" # => "😉"
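
As a quick sanity check, the sketch below inspects the string produced by the braced escape. It assumes nothing beyond core String methods; the variable name s is simply carried over from the line above.

```ruby
# Build the winking face with the braced escape (source stays pure ASCII).
s = "\u{1F609}"

s.valid_encoding?  # => true, unlike the surrogate-pair attempt
s.length           # => 1 (one character)
s.codepoints       # => [128521], i.e. 0x1F609
s.bytes.length     # => 4 (UTF-8 needs four bytes for this code point)
s.encoding         # => #<Encoding:UTF-8>
```

A literal containing a \u escape comes out UTF-8 encoded, which is why the byte count and validity check line up here.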

The braces can also contain several code points (each 1 to 6 hex digits), separated by single spaces or tabs, to encode multiple characters in one escape:

s = "\u{41f 440 438 432 435 442 2c 20 43c 438 440}!" # => "Привет, мир!"
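
The multi-code-point form should behave like writing each escape separately, and it can mix BMP and supplementary-plane characters. A small sketch to illustrate (the variable names greeting and separate are made up for this example):

```ruby
# One escape holding four code points: "H", "i", a space, and U+1F609.
greeting = "\u{48 69 20 1F609}"

# The same string written as four individual escapes.
separate = "\u{48}\u{69}\u{20}\u{1F609}"

greeting == separate  # => true
greeting.length       # => 4
greeting.codepoints   # => [72, 105, 32, 128521]
```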

You could also use byte escapes to write the character's UTF-8 encoding directly. That is less convenient, and it does not necessarily produce a UTF-8-encoded string, because the literal takes on the encoding of the source file:

# encoding: utf-8
s = "\xF0\x9F\x98\x89" # => "😉"
s.length # => 1

# encoding: iso-8859-1
s = "\xF0\x9F\x98\x89" # => "\xF0\x9F\x98\x89"
s.length # => 4
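
If you do end up with the four raw bytes tagged with the wrong encoding, one way to recover is to relabel them with force_encoding, which changes the encoding tag without touching the bytes. This is only a sketch: it simulates the ISO-8859-1 case inside a single file rather than using a second source file, and the names raw and utf8 are illustrative.

```ruby
# Simulate the ISO-8859-1 source file: same four bytes, Latin-1 label.
raw = "\xF0\x9F\x98\x89".dup.force_encoding(Encoding::ISO_8859_1)
raw.length   # => 4 (each byte counts as one Latin-1 character)

# Relabel a copy of the bytes as UTF-8; the bytes themselves are unchanged.
utf8 = raw.dup.force_encoding(Encoding::UTF_8)
utf8.length           # => 1
utf8.valid_encoding?  # => true
utf8 == "\u{1F609}"   # => true
```

Note that this is different from encode, which would transcode each Latin-1 character to UTF-8 and give four characters instead of one.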