I've got a string I want to capitalize, but it might contain Polish special letters (ą, ć, ę, ł, ń, ó, ś, ż, ź). The call transform(string.begin(), string.end(), string.begin(), ::toupper); only capitalizes the Latin alphabet, so I wrote a function like this:
string to_upper(string nazwa)
{
transform(nazwa.begin(), nazwa.end(), nazwa.begin(), ::toupper);
for (int i = 0; i < (int)nazwa.size(); i++)
{
switch(nazwa[i])
{
case u'ą':
{
nazwa[i] = u'Ą';
break;
}
case u'ć':
{
nazwa[i] = u'Ć';
break;
}
case u'ę':
{
nazwa[i] = u'Ę';
break;
}
case u'ó':
{
nazwa[i] = u'Ó';
break;
}
case u'ł':
{
nazwa[i] = u'Ł';
break;
}
case u'ń':
{
nazwa[i] = u'Ń';
break;
}
case u'ś':
{
nazwa[i] = u'Ś';
break;
}
case u'ż':
{
nazwa[i] = u'Ż';
break;
}
case u'ź':
{
nazwa[i] = u'Ź';
break;
}
}
}
return nazwa;
}
I also tried using if instead of switch but it doesn't change anything.
In Qt Creator, every line that assigns a capital letter, apart from the one with u'Ó', shows a similar error: Implicit conversion from 'char16_t' to 'std::basic_string<char>::value_type' (aka 'char') changes value from 260 to 4 (this one is from u'Ą'). After running the program, the chars in the string aren't swapped.
The source of your issue
std::string stores characters as chars, which are one byte long, and therefore their value can only go from 0 to 255.
This makes it impossible to store u'ą' in one char, for example, as the Unicode value for ą is 0x105 (= 261 in decimal, which is higher than 255).
To avoid this problem, humans have invented UTF-8, which is a character encoding standard that lets you encode any Unicode character as bytes. Characters that have a higher value will of course take multiple bytes to encode.
It is very likely that your std::string has its characters encoded in UTF-8. (I say very likely because your code doesn't directly indicate it, but it is pretty much 100% certain that it is the case, because it's the only universal way to encode accented letters in char-based strings. To be absolutely 100% sure, you'd need to check Qt's code, since it seems to be what you are using.)
The result of this is that you can't just use a for to iterate through the chars of your std::string the way that you are, because you basically assume that one char equals one letter, which is simply not the case.
In the case of ą, for example, it'll be encoded as the bytes C4 85, so you will have one char with the value 0xC4 (= 196) followed by another char with the value 0x85 (= 133).
The specific case for the characters you want to capitalize
The Latin Extended-A part of the Unicode table (archive) fortunately shows us that these special capital letters come right before their lowercase counterparts.
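More than that, we can see that uppercase and lowercase letters alternate: from 0x100 to 0x137 and from 0x14A to 0x177 the lowercase letters have odd code points, while from 0x139 to 0x148 and from 0x179 to 0x17E they have even code points.
This will make it easier to convert lowercase code points to uppercase ones, since all we have to do is check if the code point of a character corresponds to a lowercase one, and if so, subtract one from it to make it uppercase.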
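As a rough sketch of that rule on bare code points (before any UTF-8 handling), the check could look like the following. The helper name latin_ext_a_to_upper and the exact range boundaries are my own reading of the Latin Extended-A chart, not something taken from the question.

```cpp
#include <cstdint>

// Hypothetical helper: returns the uppercase code point for a lowercase
// Latin Extended-A letter, and the input unchanged for anything else.
std::uint16_t latin_ext_a_to_upper(std::uint16_t cp)
{
    // Ranges where the lowercase letter is the odd code point (e.g. ą = 0x105).
    const bool odd_is_lower =
        (cp >= 0x100 && cp <= 0x137) || (cp >= 0x14A && cp <= 0x177);
    // Ranges where the lowercase letter is the even code point (e.g. ł = 0x142).
    const bool even_is_lower =
        (cp >= 0x139 && cp <= 0x148) || (cp >= 0x179 && cp <= 0x17E);

    if ((odd_is_lower && cp % 2 == 1) || (even_is_lower && cp % 2 == 0))
        return static_cast<std::uint16_t>(cp - 1); // uppercase sits right before
    return cp; // already uppercase, or outside the handled ranges (e.g. ĸ = 0x138)
}
```

For instance, latin_ext_a_to_upper(0x105) gives 0x104 (ą to Ą), while 0x104 itself is returned as is.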
Encoding one of those characters in UTF-8
To encode these in UTF-8 (source):
- First byte: 110xxxxx, replace xxxxx with the higher five bits of the binary code point of the character
- Second byte: 10xxxxxx, replace xxxxxx with the lower six bits of the binary code point of the character
So for ą, the value is 0x105 in hex, so 00100000101 in binary.
The first byte value is then 11000100 (= 0xC4).
The second byte value is then 10000101 (= 0x85).
Note that this encoding 'technique' works because the characters you want to capitalize have their value (code point) between 0x80 and 0x7FF. It changes depending on how high the value is, see documentation here.
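The two-byte scheme above can be sketched in a few lines. The helper names encode_utf8_2byte and decode_utf8_2byte are hypothetical, and the code assumes the code point is already known to lie between 0x80 and 0x7FF.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encodes a code point in [0x80, 0x7FF] as two UTF-8 bytes.
std::string encode_utf8_2byte(std::uint16_t cp)
{
    std::string out;
    out += static_cast<char>(0xC0 | (cp >> 6));   // 110xxxxx: higher five bits
    out += static_cast<char>(0x80 | (cp & 0x3F)); // 10xxxxxx: lower six bits
    return out;
}

// Hypothetical inverse: rebuilds the code point from the two bytes.
std::uint16_t decode_utf8_2byte(unsigned char first, unsigned char second)
{
    return static_cast<std::uint16_t>(((first & 0x1F) << 6) | (second & 0x3F));
}
```

With this, encode_utf8_2byte(0x105) yields the bytes C4 85 worked out above, and decode_utf8_2byte(0xC4, 0x85) gives back 0x105.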
Fixing your code
I have rewritten your to_upper function according to what I have written so far:
Note: don't forget to #include <cstdint> for uint16_t to work.
Note 2: I have added support for some Latin-1 Supplement (archive) letters because you asked for it in comments. Although we subtract 0x20 from lowercase code points to get the uppercase ones, it is pretty much the same principle as for the other letters I have covered in this answer.
I have included lots of comments in my code, please consider reading them for a better understanding.
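For reference, a minimal sketch of a function along these lines is shown here. It is my own reconstruction from the explanation in this answer, not the original listing: it assumes the input is valid UTF-8, handles only ASCII and two-byte sequences, and uses range boundaries read off the Unicode charts.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Sketch: upper-cases ASCII, Latin-1 Supplement and Latin Extended-A letters
// in a UTF-8 encoded std::string. Everything else is left untouched.
std::string to_upper(std::string nazwa)
{
    for (std::size_t i = 0; i < nazwa.size(); ++i)
    {
        const unsigned char first = static_cast<unsigned char>(nazwa[i]);

        if (first < 0x80)
        {
            // Plain ASCII: one byte per character.
            if (first >= 'a' && first <= 'z')
                nazwa[i] = static_cast<char>(first - 0x20);
            continue;
        }

        // Only two-byte sequences (110xxxxx 10xxxxxx) are handled here.
        if ((first & 0xE0) != 0xC0 || i + 1 >= nazwa.size())
            continue;
        const unsigned char second = static_cast<unsigned char>(nazwa[i + 1]);
        if ((second & 0xC0) != 0x80)
            continue;

        // Rebuild the code point from the two bytes.
        std::uint16_t cp =
            static_cast<std::uint16_t>(((first & 0x1F) << 6) | (second & 0x3F));

        // Latin Extended-A: lowercase is odd in some ranges, even in others.
        const bool odd_is_lower =
            (cp >= 0x100 && cp <= 0x137) || (cp >= 0x14A && cp <= 0x177);
        const bool even_is_lower =
            (cp >= 0x139 && cp <= 0x148) || (cp >= 0x179 && cp <= 0x17E);

        if ((odd_is_lower && cp % 2 == 1) || (even_is_lower && cp % 2 == 0))
            cp -= 1;    // uppercase sits right before lowercase
        else if (cp >= 0xE0 && cp <= 0xFE && cp != 0xF7)
            cp -= 0x20; // Latin-1 Supplement (0xF7 is the division sign)

        // Write the (possibly changed) code point back as two bytes.
        nazwa[i] = static_cast<char>(0xC0 | (cp >> 6));
        nazwa[i + 1] = static_cast<char>(0x80 | (cp & 0x3F));
        ++i; // skip the continuation byte we just processed
    }
    return nazwa;
}
```

Assuming the source file and the compiler's string literals are both UTF-8, a call such as to_upper("zażółć gęślą jaźń") should return "ZAŻÓŁĆ GĘŚLĄ JAŹŃ". The string is taken by value, as in the question, so the caller's copy is not modified.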
I have tested it with the string "ĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıIJijĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽž" and it converted it to "ĀĀĂĂĄĄĆĆĈĈĊĊČČĎĎĐĐĒĒĔĔĖĖĘĘĚĚĜĜĞĞĠĠĢĢĤĤĦĦĨĨĪĪĬĬĮĮİİIJIJĴĴĶĶĸĹĹĻĻĽĽĿĿŁŁŃŃŅŅŇŇŊŊŌŌŎŎŐŐŒŒŔŔŖŖŘŘŚŚŜŜŞŞŠŠŢŢŤŤŦŦŨŨŪŪŬŬŮŮŰŰŲŲŴŴŶŶŸŹŹŻŻŽŽ", so it works perfectly.
Note: Almost all terminals use UTF-8 by default, Qt labels as well; basically nearly everything uses UTF-8, except the Windows console. So if you are testing the above code in Windows CMD or PowerShell, you need to switch them to UTF-8 with the command chcp 65001, or by adding a Windows API call that changes the console encoding when you execute your code.
Note 2: When you write string literals directly in your code, your compiler will usually encode them in UTF-8 by default (GCC and Clang do, provided the source file itself is saved as UTF-8). That is why my version of the to_upper function works with Polish letters written directly in code without further modifications. When I say that nearly everything uses UTF-8, I mean it.
Note 3: I kept it to avoid causing problems with your current code, but you use string instead of std::string, implying that you have a using namespace std; somewhere in your code. In that case, please see Why is "using namespace std;" considered bad practice?
Note about the other answers
Please keep in mind that my answer is very specific to your case. It aims, as you asked, to capitalize Polish letters.
Other answers rely on std features which are apparently more universal and work with all languages, so I'd invite you to give them a look.
It's always better to rely on existing features rather than reinventing the wheel, but I think it's also good to have a self-made alternative that might be easier to understand and is sometimes more efficient.