How to find the UTF-8 reference of a composite Unicode character


At work, I have this issue where I need to find the UTF-8 reference of a composite Unicode character.

The character in question is an "n" with a "^" on top: n̂. This is represented in Unicode by the character "n" (U+006E) followed by the combining circumflex accent (U+0302).

What I'm looking to find is the single reference of this character in UTF-8.

I've been looking all around, but I can't seem to find an answer to this. I feel stupid, because finding such a simple thing doesn't seem like it should be hard.

Edit: I thought that the composition of "n" and "^" could be mapped to a single UTF-8 code point (I hope I'm using the terminology right). However, you explained to me that this is not the case. Thank you all for the help.

Loïc.


There are 2 best solutions below

Best answer

UTF-8 is a byte encoding for a sequence of individual Unicode codepoints. There is no single Unicode codepoint defined for n̂, not even when a Unicode string is normalized to NFC or NFKC. As you have noted, n̂ consists of codepoint U+006E LATIN SMALL LETTER N followed by codepoint U+0302 COMBINING CIRCUMFLEX ACCENT. In UTF-8, U+006E is encoded as byte 0x6E, and U+0302 is encoded as bytes 0xCC 0x82.
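To check those byte values yourself, a minimal Python sketch along these lines should work. The variable names `s` and `encoded` are just illustrative, and the `hex(" ")` separator argument assumes Python 3.8 or later.

```python
import unicodedata

# n followed by COMBINING CIRCUMFLEX ACCENT: two codepoints, one visible glyph
s = "n\u0302"

# Show each codepoint with its Unicode name and its own UTF-8 bytes
for ch in s:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8").hex(" "))
# U+006E LATIN SMALL LETTER N 6e
# U+0302 COMBINING CIRCUMFLEX ACCENT cc 82

# The UTF-8 encoding of the whole string is just the concatenation
encoded = s.encode("utf-8")
print(encoded.hex(" "))  # 6e cc 82
```

So the closest thing to a "UTF-8 reference" for this character is the three-byte sequence 0x6E 0xCC 0x82.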


If you want the string as composed as possible, then you want it in NFC (Normalization Form C, i.e. canonical composition; see Unicode equivalence). You can do this in Python, as in this example:

#!/usr/bin/python3

import unicodedata

# 'n\u0303' is n + COMBINING TILDE; 'n\u0302' is n + COMBINING CIRCUMFLEX ACCENT
for s in ['Jalapen\u0303o', 'n\u0302']:
    print(s)
    print(ascii(s))
    print('NFC:', ascii(unicodedata.normalize('NFC', s)))
    print('NFD:', ascii(unicodedata.normalize('NFD', s)))
    print('')

This will give you:

Jalapeño
'Jalapen\u0303o'
NFC: 'Jalape\xf1o'
NFD: 'Jalapen\u0303o'

n̂
'n\u0302'
NFC: 'n\u0302'
NFD: 'n\u0302'

As you can see, while the 'ñ' has both a composed and a decomposed form, the 'n̂' does not. Its only form is the decomposed one, made of two separate code points.
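If you need to run the same check on other base-plus-accent pairs, it can be wrapped in a small helper. This is only a sketch: `composes_to_single_code_point` is a made-up name, and it only tests whether NFC collapses the input into one code point, which is not quite the same as asking whether any precomposed character exists (a few precomposed characters are excluded from composition).

```python
import unicodedata

def composes_to_single_code_point(s: str) -> bool:
    """Return True if NFC normalization collapses s into one code point."""
    return len(unicodedata.normalize("NFC", s)) == 1

print(composes_to_single_code_point("n\u0303"))  # n + tilde      -> True  (U+00F1)
print(composes_to_single_code_point("n\u0302"))  # n + circumflex -> False
```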