I'm creating a generic SNES tilemap editor (similar to NES Screen Tool), which means drawing a lot of 4bpp tiles. However, my drawing loop takes too long to run, even with CachedBitmaps. CachedBitmaps also can't have their palettes changed, and I may need to switch between 8 palettes. I can deal with the SNES format and the sizes of things, but I'm struggling with the Windows side.
// basically the entire graphics drawing routine
case(WM_PAINT):{
    PAINTSTRUCT ps;
    HDC hdc = BeginPaint(hwnd, &ps);
    
    Gdiplus::Graphics graphics(hdc);
    graphics.Clear(ARGB1555toARGB8888(CGRAM[0]));   // convert 1st 15-bit CGRAM color to 32-bit & clear bkgd
    
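    // tileset2[i]->SetPalette(colorpalette); // called in tileset loading to test 1 palette
    // all 1024 CachedBitmaps are rebuilt on every WM_PAINT; the previous ones are not deleted in this snippet
    for(uint16_t i = 0; i < 1024; i++){
        tilesetX[i] = new Gdiplus::CachedBitmap(tileset2[i], &graphics);
    }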
    
    /* struct SNES_Tile{
        uint16_t tileIndex: 10;
        uint16_t palette: 3;
        uint16_t priority: 1; // (irrelevant for this project)
        uint16_t horzFlip: 1;
        uint16_t vertFlip: 1;
    }; */
    // I can see each individual tile being drawn
    for(int y = 0; y < 32; y++){
        for(int x = 0; x < 32; x++){
            // assume tilemap is set to 32x32, and not 64x32 or 32x64 or 64x64
            graphics.DrawCachedBitmap(tilesetX[BG2[y * 32 + x] & 0x03FF], x * BG2CHRSize, y * BG2CHRSize);
            // BG2[y * 32 + x]  & 0x03FF    : get tile index from VRAM and strip attributes
            // tilesetX[...]                : get CachedBitmap to draw
        }
    }
    
    EndPaint(hwnd, &ps);
    break;
}
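For reference, the attribute bits that the loop above strips with `& 0x03FF` could be unpacked as shown below. This is only a minimal sketch: it assumes the usual SNES tilemap word layout (tile index in bits 0-9, palette in bits 10-12, priority in bit 13, horizontal flip in bit 14, vertical flip in bit 15), and `TileAttr` and `decodeTile` are hypothetical names that are not part of the routine above.

```cpp
#include <cstdint>

// Hypothetical helper type; field names mirror the commented-out struct above.
struct TileAttr {
    uint16_t tileIndex; // bits 0-9
    uint8_t  palette;   // bits 10-12
    bool     priority;  // bit 13
    bool     horzFlip;  // bit 14
    bool     vertFlip;  // bit 15
};

// Unpack one 16-bit tilemap word with explicit shifts and masks,
// which avoids relying on compiler-specific bit-field ordering.
inline TileAttr decodeTile(uint16_t entry) {
    TileAttr t;
    t.tileIndex = entry & 0x03FF;
    t.palette   = static_cast<uint8_t>((entry >> 10) & 0x07);
    t.priority  = ((entry >> 13) & 1) != 0;
    t.horzFlip  = ((entry >> 14) & 1) != 0;
    t.vertFlip  = ((entry >> 15) & 1) != 0;
    return t;
}
```

Explicit masks are used here instead of the bit-field struct because the order in which bit-fields are packed is implementation-defined.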
I am early enough in my program that rewriting the entire graphics routine wouldn't be too much of a hassle.
Should I give up on GDI+ and switch to Direct2D or something else? Is there a faster way to draw 4bpp bitmaps without having to create a copy for each palette?
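To make the last question concrete, the following is a rough sketch of the kind of approach being asked about: each tile is kept once as palette indices, and colors are resolved through a 16-entry sub-palette while writing into a 32-bit buffer. Everything here is an assumption for illustration only (`IndexedTile`, `blitTile`, the buffer layout, and the idea that tiles have already been decoded from the 4bpp bitplane format); it is not code from the project above and has not been measured.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical layout: each tile is stored once as 64 palette indices (0-15),
// already decoded from the SNES 4bpp bitplane format.
struct IndexedTile {
    uint8_t px[8 * 8];
};

// Write one 8x8 tile into a 32-bit ARGB buffer, resolving colors through a
// 16-entry sub-palette at draw time. `pitch` is the buffer width in pixels;
// tileX and tileY are in tile units.
inline void blitTile(uint32_t* dst, size_t pitch, int tileX, int tileY,
                     const IndexedTile& tile, const uint32_t* palette16,
                     bool hFlip, bool vFlip) {
    for (int y = 0; y < 8; ++y) {
        int sy = vFlip ? 7 - y : y; // source row, mirrored if flipped
        uint32_t* row = dst + (static_cast<size_t>(tileY) * 8 + y) * pitch
                            + static_cast<size_t>(tileX) * 8;
        for (int x = 0; x < 8; ++x) {
            int sx = hFlip ? 7 - x : x; // source column, mirrored if flipped
            uint8_t idx = tile.px[sy * 8 + sx] & 0x0F;
            if (idx != 0) {
                row[x] = palette16[idx]; // index 0 is treated as transparent
            }
        }
    }
}
```

With something like this, the whole 32x32 tilemap would be composed in memory and then presented in a single call (for example with GDI's `SetDIBitsToDevice` or `StretchDIBits`) instead of one draw call per tile. Whether that is actually faster than the CachedBitmap loop is exactly what is unclear here.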