How to capitalize Polish special letters in C++?

I've got a string I want to capitalize, but it might contain Polish special letters (ą, ć, ę, ł, ń, ó, ś, ż, ź). The call transform(string.begin(), string.end(), string.begin(), ::toupper); only capitalizes the Latin alphabet, so I wrote a function like this:


    string to_upper(string nazwa)
    {
        transform(nazwa.begin(), nazwa.end(), nazwa.begin(), ::toupper);

        for (int i = 0; i < (int)nazwa.size(); i++)
        {
            switch(nazwa[i])
            {
                case u'ą':
                {
                    nazwa[i] = u'Ą';
                    break;
                }
                case u'ć':
                {
                    nazwa[i] = u'Ć';
                    break;
                }
                case u'ę':
                {
                    nazwa[i] = u'Ę';
                    break;
                }
                case u'ó':
                {
                    nazwa[i] = u'Ó';
                    break;
                }
                case u'ł':
                {
                    nazwa[i] = u'Ł';
                    break;
                }
                case u'ń':
                {
                    nazwa[i] = u'Ń';
                    break;
                }
                case u'ś':
                {
                    nazwa[i] = u'Ś';
                    break;
                }
                case u'ż':
                {
                    nazwa[i] = u'Ż';
                    break;
                }
                case u'ź':
                {
                    nazwa[i] = u'Ź';
                    break;
                }
            }
        }

        return nazwa;
    }

I also tried using if instead of switch, but it doesn't change anything. Next to every capital letter to be inserted, apart from u'Ó', Qt Creator shows a similar warning: Implicit conversion from 'char16_t' to 'std::basic_string<char>::value_type' (aka 'char') changes value from 260 to 4 (this one is for u'Ą'). After running the program, the chars in the string aren't swapped.

RedStoneMatt On BEST ANSWER

The source of your issue

std::string stores characters as chars, which are one byte long, so each one can only hold 256 distinct values (0 to 255 when read as unsigned).

This makes it impossible to store u'ą' in one char for example, as the unicode value for ą is 0x105 (= 261 in decimal, which is higher than 255).

To avoid this problem, humans have invented UTF-8, which is a character encoding standard that lets you encode any Unicode characters as bytes. Characters that have a higher value will of course take multiple bytes to encode.

It is very likely that your std::string has its characters encoded in UTF-8. (I say very likely because your code doesn't directly indicate it, but it is almost certainly the case, because UTF-8 is the de facto standard way to encode accented letters in char-based strings. To be absolutely sure, you'd need to check Qt's code, since that seems to be what you are using.)

The result of this is that you can't just use a for loop to iterate through the chars of your std::string the way that you do, because you basically assume that one char equals one letter, which is simply not the case.

In the case of ą for example, it'll be encoded as bytes C4 85, so you will have one char that will have the value 0xC4 (= 196) followed by another char of value 0x85 (= 133).
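To see the "one char is not one letter" point concretely, here is a minimal sketch. The helper name hex_bytes is made up for this illustration, and the letter is spelled with explicit byte escapes so that the result does not depend on how the source file is encoded.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: renders every byte of a string as two uppercase hex digits,
// separated by spaces
std::string hex_bytes(const std::string& s)
{
    std::string out;
    char buf[4];
    for (unsigned char byte : s) // unsigned char, so values above 0x7F stay positive
    {
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(byte));
        if (!out.empty())
            out += ' ';
        out += buf;
    }
    return out;
}
```

Calling hex_bytes("\xC4\x85") gives "C4 85", and size() on that same string reports 2, even though it displays as the single letter ą.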


The specific case for the characters you want to capitalize

The Latin Extended-A part of the Unicode table (archive) fortunately shows us that these special capital letters come right before their lowercase counterparts.

More than that, we can see that:

  • From Unicode index 0x100 to 0x137 (both included), lowercase letters are the odd indices.
  • From 0x139 to 0x148 (both included), lowercases are the even indices.
  • From 0x14A to 0x177 (both included), lowercases are the odd indices.
  • From 0x179 to 0x17E (both included), lowercases are the even indices.

This will make it easier to convert lowercase code points to uppercase ones, since all we have to do is check if the index of a character corresponds to a lowercase one, and if so, subtract one from it to make it uppercase.
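As a rough sketch of that parity rule on its own, working on code points rather than on UTF-8 bytes (the function name latin_ext_a_to_upper is hypothetical and only used for this illustration):

```cpp
#include <cstdint>

// Hypothetical helper: uppercases a Latin Extended-A code point using the
// odd/even rule from the four ranges listed above.
// Code points outside those ranges are returned unchanged.
std::uint16_t latin_ext_a_to_upper(std::uint16_t cp)
{
    const bool lowercaseIsOdd  = (cp >= 0x100 && cp <= 0x137) || (cp >= 0x14A && cp <= 0x177);
    const bool lowercaseIsEven = (cp >= 0x139 && cp <= 0x148) || (cp >= 0x179 && cp <= 0x17E);
    const bool isOdd = (cp % 2) == 1;

    // A lowercase letter sits right after its uppercase counterpart
    if ((lowercaseIsOdd && isOdd) || (lowercaseIsEven && !isOdd))
        return static_cast<std::uint16_t>(cp - 1);
    return cp;
}
```

For example, ą (0x105) becomes Ą (0x104) and ł (0x142) becomes Ł (0x141). One caveat of such a purely positional rule: the dotless ı (0x131) ends up as İ (0x130), which is not its real uppercase form, so this is only a shortcut for letters like the Polish ones.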


Encoding one of those characters in UTF-8

To encode these in UTF-8 (source):

  • Convert the code point (the Unicode value, if you prefer to say it like that) to binary
  • The first byte of your UTF-8-encoded character will have binary value 110xxxxx; replace xxxxx with the upper five bits of the binary code point of the character
  • The second byte will have binary value 10xxxxxx; replace xxxxxx with the lower six bits of the binary code point of the character

So for ą, the value is 0x105 in hex, so 00100000101 in binary (padded to 11 bits).

First byte value is then 11000100 (= 0xC4).

Second byte value is then 10000101 (= 0x85).

Note that this encoding 'technique' works because the characters you want to capitalize have their value (code point) between 0x80 and 0x7FF. It changes depending on how high the value is, see documentation here.
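The steps above can be sketched as a pair of small functions. The names encode_two_byte_utf8 and decode_two_byte_utf8 are hypothetical, and the sketch assumes the code point really is in the 0x80 to 0x7FF range; nothing is validated.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encodes a code point in the range 0x80..0x7FF as two UTF-8 bytes
std::string encode_two_byte_utf8(std::uint16_t cp)
{
    std::string out;
    out += static_cast<char>(0xC0 | ((cp >> 6) & 0x1F)); // 110xxxxx: upper five bits
    out += static_cast<char>(0x80 | (cp & 0x3F));        // 10xxxxxx: lower six bits
    return out;
}

// Hypothetical helper: decodes such a two-byte sequence back to its code point
std::uint16_t decode_two_byte_utf8(char first, char second)
{
    return static_cast<std::uint16_t>(((first & 0x1F) << 6) | (second & 0x3F));
}
```

With these, encode_two_byte_utf8(0x105) yields the bytes C4 85, and decoding those two bytes gives 0x105 back.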


Fixing your code

I have rewritten your to_upper function according to what I have written so far:

string to_upper(string nazwa)
{
    for (int i = 0; i < (int)nazwa.size(); i++)
    {
        // Getting the current character we are working with
        char chr1 = nazwa[i];

        // We want to find UTF-8-encoded Polish letters here
        // So we are looking for a character that has its first three bits set to 110,
        // as all Polish special letters encoded in UTF-8 are two bytes long,
        // the first byte being of binary value 110xxxxx
        if(((chr1 >> 5) & 0b111) != 0b110) {
            // Cast to unsigned char first: passing a negative char to toupper is undefined behavior
            nazwa[i] = toupper((unsigned char)chr1); // Do the std toupper here for regular characters
            continue;
        }

        // If we are here, then the character we are dealing with is two bytes long, so get its value.
        // Stop if the string ends right after a first byte (truncated or invalid UTF-8)
        if(i + 1 >= (int)nazwa.size())
            break;

        // We won't need to check for that second byte during next iteration, so we increment i
        i++;
        char chr2 = nazwa[i];

        // Get the unicode value of the encoded character
        uint16_t fullChr = ((chr1 & 0b11111) << 6) | (chr2 & 0b111111);

        // Get the various conditions to check for lowercase code points
        bool lowercaseIsOdd =  (fullChr >= 0x100 && fullChr <= 0x137) || (fullChr >= 0x14A && fullChr <= 0x177);
        bool lowercaseIsEven = (fullChr >= 0x139 && fullChr <= 0x148) || (fullChr >= 0x179 && fullChr <= 0x17E);
        bool chrIndexIsOdd =   (fullChr % 2) == 1;

        // Depending on whether the code point needs to be odd or even to be lowercase, and on whether
        // the code point is odd or even, decrease it by one to make it uppercase
        if((lowercaseIsOdd && chrIndexIsOdd)
        || (lowercaseIsEven && !chrIndexIsOdd))
            fullChr--;

        // Support for some additional, more commonly used accented letters
        if(fullChr >= 0xE0 && fullChr <= 0xF6)
            fullChr -= 0x20;

        // Re-encode the character point in UTF-8
        nazwa[i-1] = (0b110 << 5) | ((fullChr >> 6) & 0b11111); // We incremented i earlier, so subtract one to edit the first byte of the letter we're encoding
        nazwa[i] = (0b10 << 6) | (fullChr & 0b111111);
    }

    return nazwa;
}

Note: don't forget to #include <cstdint> for uint16_t to work (and <cctype> for toupper).

Note 2: I have added support for some Latin 1 Supplement (archive) letters because you asked for it in comments. Although we subtract 0x20 from lowercase code points to get the uppercase ones, it is pretty much the same principle as for other letters I have covered in this answer.

I have included lots of comments in my code, please consider reading them for a better understanding.

I have tested it with the string "ĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıIJijĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽž" and it converted it to "ĀĀĂĂĄĄĆĆĈĈĊĊČČĎĎĐĐĒĒĔĔĖĖĘĘĚĚĜĜĞĞĠĠĢĢĤĤĦĦĨĨĪĪĬĬĮĮİİIJIJĴĴĶĶĸĹĹĻĻĽĽĿĿŁŁŃŃŅŅŇŇŊŊŌŌŎŎŐŐŒŒŔŔŖŖŘŘŚŚŜŜŞŞŠŠŢŢŤŤŦŦŨŨŪŪŬŬŮŮŰŰŲŲŴŴŶŶŸŹŹŻŻŽŽ", so it works perfectly:

int main() {
    string str1 = "ĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıIJijĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽž";
    string str2 = to_upper(str1);

    printf("str1: %s\n", str1.c_str());
    printf("str2: %s\n", str2.c_str());
}

Picture of a CMD printing the results of the above code

Note: Most terminals use UTF-8 by default, Qt labels as well; basically everything uses UTF-8 nowadays, except the Windows CMD. So if you are testing the above code in a Windows CMD or PowerShell, you need to switch it to UTF-8 with the command chcp 65001, or by adding a Windows API call that changes the console encoding when your program starts.
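The Windows API call mentioned in this note could look roughly like the sketch below. SetConsoleOutputCP and SetConsoleCP are real Windows console functions; the wrapper name enable_utf8_console is hypothetical, and the #ifdef only exists so that the same file still compiles on other systems.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: switches the Windows console to UTF-8 (code page 65001).
// On other platforms it does nothing and reports success.
bool enable_utf8_console()
{
#ifdef _WIN32
    // Both functions return nonzero on success
    return SetConsoleOutputCP(CP_UTF8) != 0 && SetConsoleCP(CP_UTF8) != 0;
#else
    return true;
#endif
}
```

You would call it once at the start of main, before printing anything.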

Note 2: When you write string literals directly in your code, GCC and Clang encode them in UTF-8 by default (MSVC does so when you compile with the /utf-8 option). This is why my version of the to_upper function works with Polish letters written directly in code without further modifications.

Note 3: I kept it to avoid causing problems with your current code, but you use string instead of std::string, implying that you have a using namespace std; somewhere in your code. In which case, please see Why is "using namespace std;" considered bad practice?


Note about the other answers

Please keep in mind that my answer is very specific to your case. It aims to, as you asked for, capitalize Polish letters.

Other answers rely on std features which are apparently more universal and work with all languages, so I'd invite you to give them a look.

It's always better to rely on existing features rather than reinventing the wheel, but I think it's also good to have a self-made alternative that might be easier to understand and sometimes is more efficient.

Marek R On

The easiest way to handle this is to use a wide string. The only trap is the proper handling of encoding/locale.

So try this:

#include <algorithm>
#include <iostream>
#include <locale>
#include <string>

int main()
try {
    std::locale cLocale{ "C.UTF-8" };
    std::locale::global(cLocale);

    std::locale sys { "" };
    std::wcin.imbue(sys);
    std::wcout.imbue(sys);

    std::wstring line;
    while (getline(std::wcin, line)) {
        std::transform(line.begin(), line.end(), line.begin(), [&cLocale](auto ch) { return std::toupper(ch, cLocale); });
        std::wcout << line << L'\n';
    }
} catch (const std::exception& e) {
    std::cerr << e.what() << '\n';
}

https://godbolt.org/z/3cKaEeW3z

Now:

  • cLocale defines the locale that the standard library will use when interacting with your program.
  • sys is the system locale, which defines what kind of encoding should be used on the input and output streams. Note which overload of toupper is used.

The same code should work with std::string, std::cin and std::cout only if you use a single-byte encoding that covers the Polish language. In that case, you should change the string in cLocale, like this:
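To make the overload point explicit, here is the transform step pulled out into a small function. This is only a sketch: the name to_upper_wide is hypothetical, and whether non-ASCII letters such as ą are actually mapped depends on the locale you pass in being installed on your system.

```cpp
#include <algorithm>
#include <locale>
#include <string>

// Hypothetical helper: uppercases a wide string with the locale-aware
// std::toupper(ch, loc) overload from <locale>, not ::toupper from <cctype>
std::wstring to_upper_wide(std::wstring text, const std::locale& loc)
{
    std::transform(text.begin(), text.end(), text.begin(),
                   [&loc](wchar_t ch) { return std::toupper(ch, loc); });
    return text;
}
```

With the classic "C" locale, to_upper_wide(L"abc", std::locale::classic()) returns L"ABC"; pass a UTF-8 locale such as the cLocale above to cover more than ASCII.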

#include <algorithm>
#include <iostream>
#include <locale>
#include <string>

int main()
try {
    std::locale cLocale{ ".1250" };
    std::locale::global(cLocale);

    std::locale sys { "" };
    std::cin.imbue(sys);
    std::cout.imbue(sys);

    std::string line;
    while (getline(std::cin, line)) {
        std::transform(line.begin(), line.end(), line.begin(), [&cLocale](auto ch) { return std::toupper(ch, cLocale); });
        std::cout << line << '\n';
    }
} catch (const std::exception& e) {
    std::cerr << e.what() << '\n';
}

Note that this locale name is platform- and compiler-dependent, and the system also has to be configured for it to work. The above works on Windows with MSVC (I've tested that). I can't demo this, since there is no online compiler that supports a Polish locale.

If a multibyte encoding is used, then transform will fail, since it will not be able to process the multibyte characters.

n. m. could be an AI On

This should work on most Unix-y systems, except for weird cases like Turkish I and possibly German ß.

#include <clocale>
#include <locale>
#include <iostream>
#include <string>
#include <cwctype>
#include <codecvt>

inline std::wstring stow(const std::string& p)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> wconv;
    return wconv.from_bytes(p);
}

inline std::string wtos(const std::wstring& p)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> wconv;
    return wconv.to_bytes(p);
}


int main()
{
    std::locale loc("");

    // AFAICT the calls below are optional on a Mac 
    // for this particular task but it could be a 
    // good idea to use them anyway
    // std::setlocale(LC_ALL, "");
    // std::locale::global(loc);
    // std::cin.imbue(loc);
    // std::cout.imbue(loc);

    std::string s;
    std::getline(std::cin, s);

    std::wstring w = stow(s);
    for (auto& c: w)
    {
        c = std::toupper(c, loc);
    }

    std::cout << wtos(w) << "\n";
}

Note that it uses deprecated C++ facilities for UTF-8 code conversion. If this bothers you, substitute any UTF-8 to UTF-32 and back converters in stow and wtos. Also feel free to substitute a locale that exists on your system (could be "pl_PL.UTF-8" or similar).
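If you want to avoid the deprecated facilities, hand-written converters along these lines are one option. This is a minimal sketch, not a hardened implementation: the names utf8_to_utf32 and utf32_to_utf8 are hypothetical, and the decoder assumes valid UTF-8 input and does not detect malformed sequences.

```cpp
#include <cstddef>
#include <string>

// Hypothetical replacement for stow: decodes UTF-8 into UTF-32 code points.
// Assumes the input is valid UTF-8; malformed input is not detected.
inline std::u32string utf8_to_utf32(const std::string& in)
{
    std::u32string out;
    for (std::size_t i = 0; i < in.size();)
    {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        // Number of continuation bytes, and the payload bits of the lead byte
        std::size_t extra = 0;
        char32_t cp = lead;
        if      (lead >= 0xF0) { extra = 3; cp = lead & 0x07; }
        else if (lead >= 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if (lead >= 0xC0) { extra = 1; cp = lead & 0x1F; }

        ++i;
        // Each continuation byte contributes its lower six bits
        for (; extra > 0 && i < in.size(); --extra, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        out += cp;
    }
    return out;
}

// Hypothetical replacement for wtos: encodes UTF-32 code points as UTF-8
inline std::string utf32_to_utf8(const std::u32string& in)
{
    std::string out;
    for (char32_t cp : in)
    {
        if (cp < 0x80)
            out += static_cast<char>(cp);
        else if (cp < 0x800)
        {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else
        {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Since std::toupper(c, loc) has no char32_t support, each code point would still have to be copied into a wchar_t before the call. That only works where wchar_t is 32 bits wide, which is typical on Unix-like systems but not on Windows.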