If you are not interested in the details of Mongolian but just want a quick answer about using and converting Unicode values in Swift, then skip down to the first part of the accepted answer.
Background
I want to render Unicode text for traditional Mongolian to be used in iOS apps. The better and long term solution is to use an AAT smart font that would render this complex script. (Such fonts do exist but their license does not allow modification and non-personal use.) However, since I have never made a font, let alone all of the rendering logic for an AAT font, I just plan to do the rendering myself in Swift for now. Maybe at some later date I can learn to make a smart font.
Externally I will use Unicode text, but internally (for display in a UITextView
) I will convert the Unicode to individual glyphs that are stored in a dumb font (coded with Unicode PUA values). So my rendering engine needs to convert Mongolian Unicode values (range: U+1820 to U+1842) to glyph values stored in the PUA (range: U+E360 to U+E5CF). Anyway, this is my plan since it is what I did in Java in the past, but maybe I need to change my whole way of thinking.
Example
The following image shows su written twice in Mongolian using two different forms for the letter u (in red). (Mongolian is written vertically with letters being connected like cursive letters in English.)
In Unicode these two strings would be expressed as
var suForm1: String = "\u{1830}\u{1826}"
var suForm2: String = "\u{1830}\u{1826}\u{180B}"
The Free Variation Selector (U+180B) in suForm2
is recognized (correctly) by Swift String
to be a unit with the u (U+1826) that precedes it. It is considered by Swift to be a single character, an extended grapheme cluster. However, for the purposes of doing the rendering myself, I need to differentiate u (U+1826) and FVS1 (U+180B) as two distinct UTF-16 code points.
For internal display purposes, I would convert the above Unicode strings to the following rendered glyph strings:
suForm1 = "\u{E46F}\u{E3BA}"
suForm2 = "\u{E46F}\u{E3BB}"
Question
I have been playing around with Swift String
and Character
. There are a lot of convenient things about them, but since in my particular case I deal exclusively with UTF-16 code units, I wonder if I should be using the old NSString
rather than Swift's String
. I realize that I can use String.utf16
to get UTF-16 code points, but the conversion back to String
isn't very nice.
Would it be better to stick with String
and Character
or should I use NSString
and unichar
?
What I have read
Updates to this question have been hidden in order to clean the page up. See the edit history.
Updated for Swift 3
String and Character
For almost everyone in the future who visits this question,
String
andCharacter
will be the answer for you.Set Unicode values directly in code:
Use hexadecimal to set values
Note that the Swift Character can be composed of multiple Unicode code points, but appears to be a single character. This is called an Extended Grapheme Cluster.
See this question also.
Convert to Unicode values:
Convert from Unicode hex values:
Or alternatively:
A few more examples
Note that for UTF-8 and UTF-16 the conversion is not always this easy. (See UTF-8, UTF-16, and UTF-32 questions.)
NSString and unichar
It is also possible to work with
NSString
andunichar
in Swift, but you should realize that unless you are familiar with Objective C and good at converting the syntax to Swift, it will be difficult to find good documentation.Also,
unichar
is aUInt16
array and as mentioned above the conversion fromUInt16
to Unicode scalar values is not always easy (i.e., converting surrogate pairs for things like emoji and other characters in the upper code planes).Custom string structure
For the reasons mentioned in the question, I ended up not using any of the above methods. Instead I wrote my own string structure, which was basically an array of
UInt32
to hold Unicode scalar values.Again, this is not the solution for most people. First consider using extensions if you only need to extend the functionality of
String
orCharacter
a little.But if you really need to work exclusively with Unicode scalar values, you could write a custom struct.
The advantages are:
String
,Character
,UnicodeScalar
,UInt32
, etc.) when doing string manipulation.String
is easy.Disadavantages are:
You can make your own, but here is mine for reference. The hardest part was making it Hashable.