C++ implementation of Python's unicodedata library

New user here, please be gentle.

We are looking to port a piece of Python code to C++, but it relies on the Unicode module unicodedata, in particular this function:

unicodedata.category('A')  # 'L'etter, 'u'ppercase
'Lu'

Is there any chance that this can be readily achieved in C++? Would embedding compiled Python code in C++ be worthwhile, given that we want to do this in the context of online TensorFlow model serving? Thanks!

There is 1 best solution below

Just stick the output of this Python code into a C++ source file:

import unicodedata

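# One enumerator per Unicode general category code returned by unicodedata.category().
print('typedef enum {Cn, Cc, Cf, Co, Cs, Ll, Lm, Lo, Lt, Lu, Mc, Me, Mn, Nd, Nl, No, Pc, Pd, Pe, Pf, Pi, Po, Ps, Sc, Sk, Sm, So, Zl, Zp, Zs} CATEGORY_e;')
# One table entry per code point from U+0000 to U+10FFFF, indexed by code point.
print('const CATEGORY_e CHAR_CATEGORIES[] = {%s};' % ', '.join(unicodedata.category(chr(codepoint)) for codepoint in range(0x110000)))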

(If you are still using Python 2.x instead of 3.x, replace chr with unichr.)

You now have a convenient lookup table of Unicode character categories to use in your C++ programs.
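
If copy-pasting over a million table entries by hand sounds unappealing, the same generator can write the file directly. The following is only a minimal sketch: the helper names build_table_source and write_table and the file name char_categories.h are hypothetical and not part of the answer above, and the check for unknown categories is an added safety net rather than something the original snippet does.

```python
import unicodedata

# Enumerator order copied from the typedef in the answer; each name is a
# general category code as returned by unicodedata.category().
CATEGORY_NAMES = [
    'Cn', 'Cc', 'Cf', 'Co', 'Cs', 'Ll', 'Lm', 'Lo', 'Lt', 'Lu',
    'Mc', 'Me', 'Mn', 'Nd', 'Nl', 'No', 'Pc', 'Pd', 'Pe', 'Pf',
    'Pi', 'Po', 'Ps', 'Sc', 'Sk', 'Sm', 'So', 'Zl', 'Zp', 'Zs',
]


def build_table_source(limit=0x110000):
    """Return C++ source text for the enum and a table of `limit` entries."""
    categories = [unicodedata.category(chr(cp)) for cp in range(limit)]
    # Fail early if this Python reports a category the enum does not declare.
    unknown = set(categories) - set(CATEGORY_NAMES)
    if unknown:
        raise ValueError('categories missing from enum: %s' % sorted(unknown))
    enum_line = 'typedef enum {%s} CATEGORY_e;' % ', '.join(CATEGORY_NAMES)
    table_line = ('const CATEGORY_e CHAR_CATEGORIES[] = {%s};'
                  % ', '.join(categories))
    return enum_line + '\n' + table_line + '\n'


def write_table(path='char_categories.h'):  # hypothetical file name
    """Write the generated source to `path`, replacing any existing file."""
    with open(path, 'w') as f:
        f.write(build_table_source())
```

Calling write_table() once, for example from a build script, produces a file that the C++ code can include. The table reflects whichever Unicode version the running Python ships (see unicodedata.unidata_version), so regenerate it if you need a newer one.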