How to find the UTF-8 reference of a composite Unicode character

At work, I have an issue where I need to find the UTF-8 reference of a composite Unicode character.

The character in question is an "n" with a "^" on top: n̂. In Unicode, it is represented by the character "n" (U+006E) followed by the combining circumflex accent (U+0302).

What I'm looking for is the single reference for this character in UTF-8.

I've been looking all around, but I can't seem to find an answer. I feel stupid, because finding such a simple thing doesn't seem like it should be hard.

Edit: I thought that the composition of "n" and "^" could be mapped to a single UTF-8 code point (I hope I'm using the terminology right). However, you explained to me that this is not the case. Thank you all for the help.

Loïc.


Remy Lebeau (best answer)

UTF-8 is a byte encoding for a sequence of individual Unicode codepoints. There is no single Unicode codepoint defined for n̂, not even when a Unicode string is normalized to NFC or NFKC. As you have noted, n̂ consists of codepoint U+006E LATIN SMALL LETTER N followed by codepoint U+0302 COMBINING CIRCUMFLEX ACCENT. In UTF-8, U+006E is encoded as byte 0x6E, and U+0302 is encoded as bytes 0xCC 0x82.
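As a quick check of the byte values above, here is a minimal Python sketch that encodes the two-codepoint sequence and lists each codepoint alongside the resulting bytes. The variable names are only illustrative.

```python
import unicodedata

# n followed by U+0302 COMBINING CIRCUMFLEX ACCENT (two codepoints)
s = 'n\u0302'

# List each codepoint with its official Unicode name
codepoints = [(f'U+{ord(c):04X}', unicodedata.name(c)) for c in s]
for cp, name in codepoints:
    print(cp, name)

# Encode the whole sequence to UTF-8 and show the bytes in hex
encoded = s.encode('utf-8')
hex_bytes = ' '.join(f'{b:02X}' for b in encoded)
print(hex_bytes)
```

The encoded result is three bytes long: one byte for "n" and two for the combining accent.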

Joe

If you want the string to be as composed as possible, then you want it in NFC (Normalization Form C, canonical composition; see Unicode equivalence). You can do this in Python, as in this example:

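#!/usr/bin/python3

import unicodedata

# First string: n + U+0303 COMBINING TILDE; second: n + U+0302 COMBINING CIRCUMFLEX ACCENT
for s in ['Jalapen\u0303o', 'n\u0302']:
    print(s)
    print(ascii(s))  # show non-ASCII codepoints as escapes
    print('NFC:', ascii(unicodedata.normalize('NFC', s)))
    print('NFD:', ascii(unicodedata.normalize('NFD', s)))
    print('')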

This will give you:

Jalapeño
'Jalapen\u0303o'
NFC: 'Jalape\xf1o'
NFD: 'Jalapen\u0303o'

n̂
'n\u0302'
NFC: 'n\u0302'
NFD: 'n\u0302'

As you can see, while the 'ñ' has both a composed and a decomposed form, the 'n̂' does not. Its only form is decomposed, as two separate code points.
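The same test can be wrapped in a small helper. This is only a sketch: `has_precomposed_form` is a hypothetical name, and it assumes that "has a composed form" means the sequence collapses to a single code point under NFC.

```python
import unicodedata

def has_precomposed_form(s):
    """Return True if s normalizes to exactly one code point under NFC.

    Hypothetical helper for illustration; not part of unicodedata.
    """
    return len(unicodedata.normalize('NFC', s)) == 1

# n + COMBINING TILDE composes to U+00F1 (ñ)
print(has_precomposed_form('n\u0303'))
# n + COMBINING CIRCUMFLEX ACCENT stays as two code points
print(has_precomposed_form('n\u0302'))
```

The result depends on the Unicode data shipped with the Python version in use, available as `unicodedata.unidata_version`.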