Check if Unicode character is displayed or tofu

My question is similar to this one, but goes a step further.

In my Win32 program I have some menu buttons with Unicode characters above the BMP, such as U+1F5A4 (UTF-16 surrogate pair 0xD83D 0xDDA4).
In Windows 10 the system font Segoe UI doesn't have this glyph: it is automagically replaced with a glyph from the font Segoe UI Symbol and displayed correctly in the button, thanks to a process called font linking (or font fallback, still not clear to me).
But in Windows 7 font linking leads to a font that doesn't have this glyph either, and the surrogate pair appears as two empty boxes ▯▯. The same happens in Windows XP with the Tahoma font.
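For reference, here is a minimal sketch of how the surrogate pair quoted above follows from the code point. The function name `encode_surrogates` is made up for illustration, and `uint16_t` stands in for the 16-bit `wchar_t` that Windows uses.

```c
#include <stdint.h>

/* Split a supplementary-plane code point (U+10000..U+10FFFF) into its
   UTF-16 high and low surrogates. Returns 0 if cp is outside that range,
   1 on success. Hypothetical helper, not part of any Windows API. */
static int encode_surrogates(uint32_t cp, uint16_t *high, uint16_t *low)
{
    if (cp < 0x10000u || cp > 0x10FFFFu)
        return 0;
    cp -= 0x10000u;                              /* 20 significant bits remain */
    *high = (uint16_t)(0xD800u + (cp >> 10));    /* top 10 bits    */
    *low  = (uint16_t)(0xDC00u + (cp & 0x3FFu)); /* bottom 10 bits */
    return 1;
}
```

Calling it with 0x1F5A4 yields 0xD83D and 0xDDA4, the two code units that show up as two separate boxes when no glyph is found.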

I want to avoid these replacement boxes by checking the text before or after assigning it to the button and replacing any missing glyph with a common ASCII character.

I tried GetGlyphOutline, ScriptGetCMap, GetFontUnicodeRanges and GetGlyphIndices, but they don't support surrogate pairs.
I also tried GetCharacterPlacement and Uniscribe's ScriptItemize + ScriptShape, which do support surrogate pairs, but all these functions look only in the base font selected into the HDC (Segoe UI). They don't look in any fallback font (Segoe UI Symbol), which is the one that provides the glyph.

I also looked at HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\FontLink\SystemLink, but I really don't think that is where the system gets the fonts to link to.

The question is: how can I know whether the system's font linking produces the correct glyph or tofu boxes instead?


Edit

I found a partial solution by copying from this code and adding the final GetCharacterPlacement call.

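#include <windows.h>
#include <stdio.h>
#include <wchar.h>
#include <usp10.h>   // Uniscribe; link with usp10.lib

// Forward declaration: the callback is defined after checkGlyphExist
int CALLBACK metaFileEnumProc( HDC hdc, HANDLETABLE *table, const ENHMETARECORD *record,
                            int tableEntries, LPARAM logFont );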

wchar_t *checkGlyphExist( HWND hwnd, wchar_t *sUnicode, wchar_t *sLimited ) {

    // Create metafile
    HDC hdc = GetDC( hwnd );
    HDC metaFileDC = CreateEnhMetaFile( hdc, NULL, NULL, NULL );

    // Select menu font
    NONCLIENTMETRICSW ncm;
    ncm.cbSize = sizeof(ncm);
    SystemParametersInfoW( SPI_GETNONCLIENTMETRICS, ncm.cbSize, &ncm, 0 );
    HFONT hFont = CreateFontIndirectW( &(ncm.lfMenuFont) );
    SelectObject( metaFileDC, hFont );
    wprintf( L"%s\n", ncm.lfMenuFont.lfFaceName );  // 'Segoe UI' in Win 10 and 7 (ok)
                                                    // 'Tahoma' in Win XP (ok)

    // Use the meta file to intercept the fallback font chosen by Uniscribe
    SCRIPT_STRING_ANALYSIS ssa;
    ScriptStringAnalyse( metaFileDC, sUnicode, wcslen(sUnicode), 0, -1,
                      SSA_METAFILE | SSA_FALLBACK | SSA_GLYPHS | SSA_LINK,  
                      0, NULL, NULL, NULL, NULL, NULL, &ssa );
    ScriptStringFree( &ssa );
    HENHMETAFILE metaFile = CloseEnhMetaFile(metaFileDC);
    LOGFONTW logFont = {0};
    EnumEnhMetaFile( 0, metaFile, metaFileEnumProc, &logFont, NULL );
    DeleteEnhMetaFile( metaFile );
    wprintf( L"%s\n", logFont.lfFaceName );
        // 'Segoe UI Symbol' in Win 10 (ok)
        // 'Microsoft Sans Serif' in Win 7 (wrong, should be 'Segoe UI Symbol')
        // 'Tahoma' in Win XP for characters above 0xFFFF (wrong, should be 'Microsoft Sans Serif', I guess)
    
    // Get glyph indices for the 'sUnicode' string
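    DeleteObject( hFont );  // the menu font is no longer needed once the metafile is closed
    hFont = CreateFontIndirectW( &logFont );
    HGDIOBJ oldFont = SelectObject( hdc, hFont );
    GCP_RESULTSW infoStr = {0};
    infoStr.lStructSize = sizeof(GCP_RESULTSW);
    wchar_t tempStr[wcslen(sUnicode) + 1];  // +1 for the terminating null written by wcscpy
    wcscpy( tempStr, sUnicode );
    infoStr.lpGlyphs = tempStr;
    infoStr.nGlyphs = wcslen(tempStr);
    GetCharacterPlacementW( hdc, tempStr, wcslen(tempStr), 0, &infoStr, GCP_GLYPHSHAPE );
    SelectObject( hdc, oldFont );  // restore the previous font before deleting ours
    DeleteObject( hFont );
    ReleaseDC( hwnd, hdc );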

    // Return one string
    if( infoStr.lpGlyphs[0] == 3 || // for Windows 7 and 10
        infoStr.lpGlyphs[0] == 0 )  // for Windows XP
        return sLimited;
    else
        return sUnicode;
}

// Callback function to intercept font creation
int CALLBACK metaFileEnumProc( HDC hdc, HANDLETABLE *table, const ENHMETARECORD *record,
                            int tableEntries, LPARAM logFont ) {
    if( record->iType == EMR_EXTCREATEFONTINDIRECTW ) {
        const EMREXTCREATEFONTINDIRECTW* fontRecord = (const EMREXTCREATEFONTINDIRECTW *)record;
        *(LOGFONTW *)logFont = fontRecord->elfw.elfLogFont;
    }
    return 1;
}

You can call it with checkGlyphExist( hWnd, L"🖤", L"<3" ); where the first string is the character U+1F5A4.

I tested it on Windows 10 and on two virtual machines: Windows 7 Professional and Windows XP SP2.
It works quite well, but two problems remain with the fallback font that EnumEnhMetaFile retrieves when a glyph is missing from the base font:

  • In Windows 7 it is always Microsoft Sans Serif, but the real fallback font should be Segoe UI Symbol.
  • In Windows XP it is Tahoma instead of Microsoft Sans Serif, but only for surrogate-pair characters (for BMP characters it is Microsoft Sans Serif, which is correct, I guess).

Can someone help me to solve this?

There are 2 best solutions below

4
On

First, make sure you're using the same API on both Win7 and Win10. I think the lower-level gdi32 API is not supposed to support surrogate pairs in general, while the newer DirectWrite does, at every level. Also keep in mind that font fallback data (font linking is a different thing) differs from release to release; the user has no access to it and cannot modify it.

Second, check whether Win7 provides any font with a glyph for U+1F5A4 in the first place; it's possible one was introduced only in later versions.

Basically, if you're using system rendering functionality, older or newer, you're not supposed to control fallback most of the time; if it doesn't work for you, that usually means it won't work. DirectWrite allows custom fallback lists, where you can, for example, explicitly assign U+1F5A4 to any font that supports it, including custom fonts that you bundle with your application.

If you want a more detailed answer, you'll need to show some source excerpts that don't work for you.

2
On

I believe the high and low 16-bit words are well defined for surrogate pairs. You should be able to identify surrogate pairs by checking the range of values for each of the 16-bit words.

The high word should be in the range 0xd800 to 0xdbff. The low word should be in the range 0xdc00 to 0xdfff.

If two consecutive 16-bit "characters" meet these criteria, they are a surrogate pair.

See the Wikipedia article on UTF-16 for more information.
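The range check described above could be sketched as follows. The helper names are hypothetical, and `uint16_t` is used in place of the 16-bit Windows `wchar_t` so the code is portable. Note that this only identifies surrogate pairs in the text; it says nothing about whether a font can render them.

```c
#include <stddef.h>
#include <stdint.h>

/* Range tests for the two halves of a UTF-16 surrogate pair. */
static int is_high_surrogate(uint16_t u) { return u >= 0xD800u && u <= 0xDBFFu; }
static int is_low_surrogate(uint16_t u)  { return u >= 0xDC00u && u <= 0xDFFFu; }

/* Return the index of the first well-formed surrogate pair in s[0..len),
   or -1 if the buffer contains none. */
static long find_surrogate_pair(const uint16_t *s, size_t len)
{
    for (size_t i = 0; i + 1 < len; i++) {
        if (is_high_surrogate(s[i]) && is_low_surrogate(s[i + 1]))
            return (long)i;
    }
    return -1;
}

/* Combine a valid high/low pair back into the code point it encodes.
   The caller is expected to have checked both ranges first. */
static uint32_t decode_surrogates(uint16_t high, uint16_t low)
{
    return 0x10000u
         + ((uint32_t)(high - 0xD800u) << 10)
         + (uint32_t)(low - 0xDC00u);
}
```

For the pair 0xD83D 0xDDA4 from the question, `decode_surrogates` gives back 0x1F5A4.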