How to get the characters from a UTF-8 string?

char *w = "Artîsté";
printf("%lu\n", strlen(w));
int z;
for(z=0; z<strlen(w); z++){
    //printf("%c", w[z]);  //prints as expected
    printf("%i: %c\n", z, w[z]);//doesn't print anything
}

If I run this, it fails at the î. How do I print a multibyte character, and how do I know when I've hit one?

There are 2 best solutions below

2
On

Use the wide char and multi-byte functions:

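#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

// Count the characters (not bytes) in a multibyte string,
// using the encoding of the current locale.
// Returns a negative value if the string contains an invalid sequence.
int utf8len(const char *str)
{
    int len, inc;

    // mblen(NULL, 0) is needed to reset internal conversion state
    for (len = 0, mblen(NULL, 0); *str; len++, str += inc)
        if ((inc = mblen(str, MB_CUR_MAX)) < 0)
            return inc;

    return len;
}

int main(void)
{
    // Take the encoding from the environment (it must be UTF-8 here)
    setlocale(LC_ALL, "");
    const char *w = "Artîsté";
    printf("%zu\n", strlen(w));  // byte count, not character count

    int z, len = utf8len(w);
    if (len < 0)
        return 1;  // invalid sequence for the current locale
    wchar_t wstr[len+1];
    mbstowcs(wstr, w, len+1);  // len+1 leaves room for the terminating L'\0'
    for(z=0; z<len; z++)
        printf("%i: %lc\n", z, wstr[z]);
}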

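You got lucky with the first printf: it writes the bytes unchanged and back to back, so the terminal still receives valid UTF-8. Once you print each byte separately with other output in between, the multibyte sequences are split up and the output is no longer valid UTF-8.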
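To make the byte splitting concrete, here is a minimal, locale-independent sketch of how a UTF-8 lead byte announces the length of its sequence. The helper names utf8_seq_len and utf8_count are hypothetical (they are not part of the answer above), and the code assumes the input is meant to be UTF-8; it checks lead bytes and truncation but does not fully validate continuation bytes.

```c
#include <stddef.h>

/* Hypothetical helper: number of bytes in the UTF-8 sequence that starts
   with `lead`, or 0 if `lead` is a continuation byte or not a valid lead. */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;  /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;                             /* 10xxxxxx or invalid */
}

/* Hypothetical helper: count characters (not bytes) in s.
   Returns (size_t)-1 on a bad lead byte or a truncated sequence. */
size_t utf8_count(const char *s)
{
    size_t count = 0;
    while (*s) {
        size_t n = utf8_seq_len((unsigned char)*s);
        if (n == 0)
            return (size_t)-1;
        /* make sure the sequence is not cut short by the terminator */
        for (size_t i = 1; i < n; i++)
            if (s[i] == '\0')
                return (size_t)-1;
        s += n;
        count++;
    }
    return count;
}
```

Assuming the source file is saved as UTF-8 with precomposed characters, î is the two bytes 0xC3 0xAE and é is 0xC3 0xA9, so utf8_count reports 7 characters for the question's string where strlen reports 9 bytes. Printing w[z] with %c emits those bytes one at a time, which is why the per-byte loop breaks.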

2
On

If your execution environment uses UTF-8 (Linux, for example), your code will work as-is, as long as you set a suitable locale, for example by calling setlocale(LC_ALL, "en_US.utf8"); before that printf.

demo: http://ideone.com/zFUYM

Otherwise, your best bet is probably to convert to wide string and print that. If you plan on doing something other than I/O with the individual characters of that string, you will have to do it anyway.
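One possible way to do the wide-string conversion without counting characters by hand is sketched below. The function name to_wide is hypothetical, and the sketch assumes the program has already called setlocale so that the multibyte encoding matches the input. It relies on the standard C behavior of mbsrtowcs, which only counts the required wide characters when the destination is NULL.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a multibyte string (in the current locale's
   encoding) to a newly allocated wide string. Returns NULL on an invalid
   sequence or allocation failure; the caller frees the result. */
wchar_t *to_wide(const char *s, size_t *out_len)
{
    mbstate_t st;
    const char *p = s;

    /* First pass: with a NULL destination, mbsrtowcs only counts. */
    memset(&st, 0, sizeof st);
    size_t n = mbsrtowcs(NULL, &p, 0, &st);
    if (n == (size_t)-1)
        return NULL;

    wchar_t *ws = malloc((n + 1) * sizeof *ws);
    if (ws == NULL)
        return NULL;

    /* Second pass: convert; n + 1 so the terminating L'\0' is written. */
    p = s;
    memset(&st, 0, sizeof st);
    mbsrtowcs(ws, &p, n + 1, &st);

    if (out_len != NULL)
        *out_len = n;
    return ws;
}
```

In a UTF-8 locale, calling to_wide on the question's string should give a length of 7, and each element can then be printed with the %lc conversion. In the default "C" locale the non-ASCII bytes may be rejected or treated as single characters, depending on the C library.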

As for detecting a multibyte char, the portable test is whether mblen() returns a value greater than 1.
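The mblen() test can be wrapped up roughly as follows. The helper names char_byte_len and is_multibyte are hypothetical, and the result depends on the locale in effect, so this is only a sketch that assumes setlocale has been called with a UTF-8 locale beforehand.

```c
#include <stdlib.h>

/* Hypothetical helper: byte length of the character starting at s in the
   current locale's encoding; 0 at the terminator, -1 if invalid. */
int char_byte_len(const char *s)
{
    mblen(NULL, 0);               /* reset any internal shift state */
    return mblen(s, MB_CUR_MAX);
}

/* Hypothetical helper: nonzero if the character at s takes more than one byte. */
int is_multibyte(const char *s)
{
    return char_byte_len(s) > 1;
}
```

A loop over the string would advance by char_byte_len(s) bytes each time and stop on 0 or a negative result. For the question's string in a UTF-8 locale, is_multibyte should be true at the î and the é and false everywhere else.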