Unknown character when capitalizing Turkish characters

I have a dataframe consisting of two columns: (1) Turkish cities, (2) corresponding values.

dict_ = {'City': {0: 'ADANA',
  1: 'ANKARA',
  2: 'ANTALYA',
  3: 'AYDIN',
  4: 'BALIKESİR',
  5: 'BURSA',
  6: 'DENİZLİ',
  7: 'DÜZCE',
  8: 'DİYARBAKIR',
  9: 'ELAZIĞ',
  10: 'GAZİANTEP',
  11: 'GİRESUN',
  12: 'HATAY',
  13: 'KAHRAMANMARAŞ',
  14: 'KARABÜK',
  15: 'KARS',
  16: 'KAYSERİ',
  17: 'KIRIKKALE',
  18: 'KIRKLARELİ',
  19: 'KIRŞEHİR',
  20: 'KOCAELİ',
  21: 'KONYA',
  22: 'KÜTAHYA',
  23: 'MANİSA',
  24: 'MARDİN',
  25: 'MERSİN',
  26: 'MUĞLA',
  27: 'ORDU',
  28: 'OSMANİYE',
  29: 'SAKARYA',
  30: 'SAMSUN',
  31: 'TRABZON',
  32: 'UŞAK',
  33: 'YALOVA',
  34: 'ZONGULDAK',
  35: 'ÇORUM',
  36: 'İSTANBUL',
  37: 'İZMİR'},
 'Value': {0: 15,
  1: 25,
  2: 19,
  3: 2,
  4: 6,
  5: 5,
  6: 3,
  7: 1,
  8: 1,
  9: 1,
  10: 7,
  11: 2,
  12: 31,
  13: 5,
  14: 1,
  15: 1,
  16: 4,
  17: 5,
  18: 1,
  19: 1,
  20: 6,
  21: 4,
  22: 2,
  23: 1,
  24: 1,
  25: 5,
  26: 5,
  27: 4,
  28: 3,
  29: 2,
  30: 3,
  31: 2,
  32: 2,
  33: 1,
  34: 2,
  35: 2,
  36: 221,
  37: 6}}

import pandas as pd

data = pd.DataFrame(dict_)

When I try to capitalize the City column (first letter uppercase, the rest lowercase), I run into a strange character issue.

data['City'].apply(str.capitalize)

The lowercase version of "İ" turns into a character that I cannot identify. Trying to look up its name fails as well:

import unicodedata
unicodedata.name("i̇")
# TypeError: name() argument 1 must be a unicode character, not str

I tried many solutions but to no avail!
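
The TypeError above arises because unicodedata.name() accepts exactly one character, while Python's default lowercasing of "İ" (U+0130) produces two code points. A minimal sketch for inspecting them one at a time (the variable names are illustrative):

```python
import unicodedata

# "\u0130" is LATIN CAPITAL LETTER I WITH DOT ABOVE ("İ").
lowered = "\u0130".lower()

# unicodedata.name() takes a single character, so name each code point separately.
names = [unicodedata.name(ch) for ch in lowered]

print(len(lowered))  # 2
print(names)         # ['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']
```

The "unknown character" is therefore a plain "i" followed by the combining dot U+0307, which is what str.capitalize leaves behind for every "İ" after the first letter.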

There are 2 best solutions below

Best answer
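def turkish_title_case(text):
    # Lowercase the Turkish letters explicitly so that "İ" -> "i" and "I" -> "ı".
    # str.lower() alone would turn "İ" into "i" plus a combining dot (U+0307).
    turkish_lower = {"İ": "i", "I": "ı", "Ç": "ç", "Ğ": "ğ", "Ü": "ü", "Ş": "ş", "Ö": "ö"}
    for upper, lower in turkish_lower.items():
        text = text.replace(upper, lower)
    text = text.lower()

    # Uppercase only the first letter, using Turkish rules: "i" -> "İ", "ı" -> "I".
    # Replacing every "I" with "İ" afterwards would break names starting with "I".
    turkish_upper = {"i": "İ", "ı": "I"}
    first = text[:1]
    return turkish_upper.get(first, first.upper()) + text[1:]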

Considering that the city names are fixed, this may work for this case.
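
The remark above suggests the function is tuned to this fixed list of names. For input beyond it, such as multi-word names, a more general variant might look like the following sketch built on str.translate. The helper names tr_capitalize and tr_title are hypothetical, and multi-word input is an assumption that does not occur in the question's data.

```python
# Only two letters need special handling; str.lower() and str.upper()
# already map Ç, Ğ, Ö, Ş and Ü correctly.
_TR_LOWER = str.maketrans({"\u0130": "i", "I": "\u0131"})  # İ -> i, I -> ı
_TR_UPPER = str.maketrans({"i": "\u0130", "\u0131": "I"})  # i -> İ, ı -> I


def tr_capitalize(word):
    """Capitalize one word using Turkish rules for dotted and dotless i."""
    lowered = word.translate(_TR_LOWER).lower()
    return lowered[:1].translate(_TR_UPPER).upper() + lowered[1:]


def tr_title(text):
    """Capitalize every whitespace-separated word in `text`."""
    return " ".join(tr_capitalize(word) for word in text.split())


print(tr_title("\u0130ZM\u0130R"))     # İZMİR
print(tr_title("D\u0130YARBAKIR"))     # DİYARBAKIR
```

With pandas, this could be applied to the column as data["City"].apply(tr_title).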

Second answer

Based on this solution, you could try the unicode_tr package, which can be installed with:

pip install unicode_tr

With this you can do:

import pandas as pd
from unicode_tr import unicode_tr

dict_ = {
    'City': {
        0: 'ADANA',
        1: 'ANKARA',
        2: 'ANTALYA',
        3: 'AYDIN',
        4: 'BALIKESİR',
        5: 'BURSA',
        6: 'DENİZLİ',
        7: 'DÜZCE',
        8: 'DİYARBAKIR',
        9: 'ELAZIĞ',
        10: 'GAZİANTEP',
        11: 'GİRESUN',
        12: 'HATAY',
        13: 'KAHRAMANMARAŞ',
        14: 'KARABÜK',
        15: 'KARS',
        16: 'KAYSERİ',
        17: 'KIRIKKALE',
        18: 'KIRKLARELİ',
        19: 'KIRŞEHİR',
        20: 'KOCAELİ',
        21: 'KONYA',
        22: 'KÜTAHYA',
        23: 'MANİSA',
        24: 'MARDİN',
        25: 'MERSİN',
        26: 'MUĞLA',
        27: 'ORDU',
        28: 'OSMANİYE',
        29: 'SAKARYA',
        30: 'SAMSUN',
        31: 'TRABZON',
        32: 'UŞAK',
        33: 'YALOVA',
        34: 'ZONGULDAK',
        35: 'ÇORUM',
        36: 'İSTANBUL',
        37: 'İZMİR'
    },
    'Value': {
        0: 15,
        1: 25,
        2: 19,
        3: 2,
        4: 6,
        5: 5,
        6: 3,
        7: 1,
        8: 1,
        9: 1,
        10: 7,
        11: 2,
        12: 31,
        13: 5,
        14: 1,
        15: 1,
        16: 4,
        17: 5,
        18: 1,
        19: 1,
        20: 6,
        21: 4,
        22: 2,
        23: 1,
        24: 1,
        25: 5,
        26: 5,
        27: 4,
        28: 3,
        29: 2,
        30: 3,
        31: 2,
        32: 2,
        33: 1,
        34: 2,
        35: 2,
        36: 221,
        37: 6
    }
}

data = pd.DataFrame(dict_)

data["City"].apply(unicode_tr.capitalize)

which outputs:

0             Adana
1            Ankara
2           Antalya
3             Aydın
4         Balıkesir
5             Bursa
6           Denizli
7             Düzce
8        Diyarbakır
9            Elazığ
10        Gaziantep
11          Giresun
12            Hatay
13    Kahramanmaraş
14          Karabük
15             Kars
16          Kayseri
17        Kırıkkale
18       Kırklareli
19         Kırşehir
20          Kocaeli
21            Konya
22          Kütahya
23           Manisa
24           Mardin
25           Mersin
26            Muğla
27             Ordu
28         Osmaniye
29          Sakarya
30           Samsun
31          Trabzon
32             Uşak
33           Yalova
34        Zonguldak
35            Çorum
36         İstanbul
37            İzmir
Name: City, dtype: object