How to enforce character length for Asian languages such as Chinese?

I'm using Django v1.10 and Postgres.

There's a data field that may contain a mixture of symbols (such as \|?), digits, alphabetic letters, and Asian-language characters.

The user says the maximum length of this field should be 15 characters.

How do I enforce this using Django with Postgres as the database? In Postgres, we use UTF-8 encoding.

One character may be a digit, a Chinese character, or an English letter.

I know PHP has a function called mb_strlen, and that in Python the equivalent would be to use unicode strings.

What's the best way to enforce a maximum string length the Django way?

There are 1 best solutions below

To begin with, you have to define what you mean by characters. You mentioned Asian languages; Korean is one where many string length functions give a count that differs from what the user sees.

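Multiple Unicode code points may be combined to form a single grapheme (a user-perceived character). For example, the Korean syllable 한 can be written as three conjoining jamo:

>>> len(u"\u1112\u1161\u11ab")  # 한 in decomposed form
3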

Using unicode strings makes it easy to count the number of Unicode code points, but that is not the same as the number of user-perceived characters. I would recommend reading this article on Python text length.
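To make the difference concrete, here is a minimal sketch (assuming Python 3, where `str` is always a Unicode string) that measures the same Korean syllable three ways. The variable names are only illustrative.

```python
import unicodedata

# The syllable "han" spelled as three conjoining jamo (decomposed form).
decomposed = "\u1112\u1161\u11ab"

# The same syllable as a single precomposed code point (NFC form, U+D55C).
composed = unicodedata.normalize("NFC", decomposed)

# len() on a str counts code points, not user-perceived characters.
code_points_decomposed = len(decomposed)
code_points_composed = len(composed)

# Encoding shows how many bytes the composed form occupies in UTF-8.
utf8_bytes = len(composed.encode("utf-8"))

print(code_points_decomposed, code_points_composed, utf8_bytes)  # 3 1 3
```

All three numbers describe what a user would see as one character, which is why the definition of "character" has to be settled before a limit can be enforced.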

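If you do wish to count Unicode code points instead of graphemes, then it's simple: just use a CharField with a max_length argument (on your model and your forms). Postgres applies a varchar(n) limit in characters, not bytes, so the UTF-8 encoding does not change the count.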

If, however, you wish to limit the field to a maximum of 15 graphemes, you have to let the database field hold more than 15 code points and add custom validation to your forms.

A helpful library for such a validator might be grapheme, which can calculate the number of graphemes in a string.
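A rough sketch of what such a validator could look like, assuming Python 3. The name `validate_max_graphemes` and the constant `MAX_GRAPHEMES` are hypothetical. The two guarded imports exist only so the sketch also runs where Django or the grapheme package is not installed; the fallback counter is a crude approximation, not real grapheme segmentation.

```python
import unicodedata

try:
    # Django validators signal failure by raising this exception.
    from django.core.exceptions import ValidationError
except ImportError:
    # Assumption: fallback so the sketch runs without Django installed.
    ValidationError = ValueError

try:
    # The third-party "grapheme" package mentioned above.
    from grapheme import length as grapheme_length
except ImportError:
    def grapheme_length(text):
        """Crude stand-in: compose with NFC, then skip combining marks.

        This is not full grapheme segmentation (emoji sequences, for
        example, are miscounted); it only keeps the sketch runnable.
        """
        composed = unicodedata.normalize("NFC", text)
        return sum(1 for ch in composed if unicodedata.combining(ch) == 0)


MAX_GRAPHEMES = 15  # the limit requested in the question


def validate_max_graphemes(value):
    """Reject values with more than MAX_GRAPHEMES user-perceived characters."""
    count = grapheme_length(value)
    if count > MAX_GRAPHEMES:
        raise ValidationError(
            "Ensure this value has at most %d characters (it has %d)."
            % (MAX_GRAPHEMES, count)
        )
```

In a project, a function like this could be passed in the `validators` list of the model or form field, with the field's `max_length` set comfortably above 15 so that multi-code-point graphemes still fit in the column.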