isalpha giving True for some Sinhala words

Question

43.9k Views Asked by cmgchess At 15 December 2022 at 06:48

I'm trying to check whether a sentence contains only Sinhala words (they can be nonsense words, as long as they are written in Sinhala). Sometimes English words are mixed in with the Sinhala words. The problem is that isalpha() returns True for some Sinhala words and False for others, which gives incorrect results in my classification.

For example, I did something like this:

for i in ['මට', 'කෑම', 'කන්න', 'ඕන']:
    print(i.isalpha())

which gives:

True
False
False
True

Is there a way to overcome this?

There are 3 best solutions below

James Demisse On 15 December 2022 at 07:22

This might help:

from string import ascii_letters

for word in ['මට', 'කෑම', 'කන්න', 'ඕන']:
    # True if the word contains at least one ASCII (English) letter
    print(any(ch in ascii_letters for ch in word))

Here is the output:

False
False
False
False

Andj On 22 May 2023 at 12:47

This is an old question, but the analysis so far is somewhat incomplete. At its simplest: not all word-forming characters are alphabetic characters, so isalpha() is insufficient for matching words. Python defines alphabetic characters as those Unicode characters assigned the general categories “Lm”, “Lt”, “Lu”, “Ll”, or “Lo”.

This excludes many word-forming characters, including combining diacritics, dependent vowels in South Asian and Southeast Asian scripts, the punt volat in Catalan, etc.

Additionally, Python's definition of an alphabetic character doesn't always align with Unicode's. Unicode's Alphabetic property covers the categories “Lm”, “Lt”, “Lu”, “Ll”, “Lo”, and “Nl”, plus characters with the Other_Alphabetic property.
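To see which characters in the sample words fall outside the categories that isalpha() accepts, the general category of each code point can be printed with the standard-library unicodedata module. This is only an illustrative sketch; the helper name categories is made up for this example.

```python
import unicodedata

def categories(word):
    """Return (code point, general category) pairs for each character in word."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in word]

for word in ['මට', 'කෑම', 'කන්න', 'ඕන']:
    print(word, categories(word))
```

The vowel sign in කෑම is reported as Mc and the al-lakuna (virama) in කන්න as Mn. Neither category is in the list that isalpha() accepts, which is why those two words return False.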

The question gives the results for Python's interpretation:

for i in ['මට', 'කෑම', 'කන්න', 'ඕන']:
    print(i.isalpha())

Results in:

True
False
False
True

For the Unicode definition, using the third-party regex module:

import regex
for i in ['මට', 'කෑම', 'කන්න', 'ඕන']:
    print(bool(regex.match(r'^\p{Alphabetic}+$', i)))

With the results:

True
True
False
True

This is slightly better, but still not sufficient. One option is to expand the pattern so that it also allows combining marks (Mn, Mc) and the middle dot (U+00B7) after the first character:

for i in ['මට', 'කෑම', 'කන්න', 'ඕන']:
    if len(i) == 1:
        result = bool(regex.match(r'[\p{Alphabetic}]', i))
    else:
        result = bool(regex.match(r'^\p{Alphabetic}[\p{Alphabetic}\p{Mn}\p{Mc}\u00B7]*$', i))
    print(result)

Which gives:

True
True
True
True

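Alternatively, use \w, the metacharacter for word-forming characters (note that it also matches digits and the underscore):

for i in ['මට', 'කෑම', 'කන්න', 'ඕන']:
    # fullmatch requires the whole word to match, not just a prefix
    print(bool(regex.fullmatch(r'\w+', i)))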

which gives:

True
True
True
True
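Since the goal in the question is to tell Sinhala words apart from English ones, another option is to test the script directly instead of testing for "alphabetic". The following is a rough standard-library sketch, not a definitive solution. It assumes that the Unicode Sinhala block (U+0D80 to U+0DFF) plus the zero-width joiner is a good enough approximation of "written in Sinhala", and the function names are hypothetical.

```python
SINHALA_START, SINHALA_END = 0x0D80, 0x0DFF  # Unicode Sinhala block (assumed sufficient)
ZWJ = "\u200d"  # zero-width joiner, which appears in some Sinhala conjuncts

def is_sinhala_word(word):
    """True if every character is in the Sinhala block (or is ZWJ)
    and the word contains at least one letter."""
    in_block = all(SINHALA_START <= ord(ch) <= SINHALA_END or ch == ZWJ for ch in word)
    has_letter = any(ch.isalpha() for ch in word)
    return in_block and has_letter

def is_sinhala_sentence(sentence):
    """True if the sentence has at least one word and every
    whitespace-separated word passes is_sinhala_word."""
    words = sentence.split()
    return bool(words) and all(is_sinhala_word(w) for w in words)

print(is_sinhala_sentence('මට කෑම කන්න ඕන'))
print(is_sinhala_sentence('මට food කන්න ඕන'))
```

The range check is coarse: it also accepts Sinhala digits and punctuation that live in the same block, and it rejects sentences containing ASCII punctuation such as a full stop, so punctuation would need to be stripped first.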

ReetS On 15 December 2022 at 07:26 (Accepted Answer)

isalpha() works by checking whether the Unicode general category of each character is Lm, Lt, Lu, Ll, or Lo. See below for their meanings.

Ll    Lowercase Letter
Lm    Modifier Letter
Lo    Other Letter
Lt    Titlecase Letter
Lu    Uppercase Letter

This "breaks" isalpha() when characters are joined together. In your first example, ම and ට both have the category Lo (per the lookup tool below), which is valid, so the result is True. In your second example, the first letter කෑ is actually two characters (ක and ෑ). The category of ෑ is Mc (Spacing Mark), which is not a letter category, so isalpha() returns False.

Long story short, Python is technically right. To do what you intended, you would have to split the joined characters and then remove the extra characters that were added on.

So, it is complicated. There may be a library out there that does this but I do not know any.
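One way to approximate the "split and remove the extra characters" idea with only the standard library is to drop combining marks before calling isalpha(). This is a minimal sketch under the assumption that ignoring every character in the Mark categories (Mn, Mc, Me) is acceptable; the function name is_word is made up for this example.

```python
import unicodedata

def is_word(text):
    """True if text starts with a letter and consists only of letters
    and combining marks."""
    if not text or not text[0].isalpha():
        return False
    # Drop combining marks (general categories Mn, Mc, Me), then test what is left.
    base = "".join(ch for ch in text if not unicodedata.category(ch).startswith("M"))
    return base.isalpha()

for word in ['මට', 'කෑම', 'කන්න', 'ඕන']:
    print(word, is_word(word))
```

This does not handle format characters such as the zero-width joiner, which can appear inside some Sinhala conjuncts, so words containing it would still be rejected.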

Cheers

source: https://docs.python.org/3/library/stdtypes.html#str.isalnum
character lookup: https://www.compart.com/en/unicode/
