Suggestions on optimizing a Z-buffer implementation?

Asked by kestrel on 24 November 2011 at 18:34

I'm writing a 3D graphics library as part of a project of mine, and I'm at the point where everything works, but not well enough.

In particular, my main headache is that my pixel fill-rate is horribly slow -- I can't even manage 30 FPS when drawing a triangle that spans half of an 800x600 window on my target machine (which is admittedly an older computer, but it should be able to manage this...).

I ran gprof on my executable, and I end up with the following interesting lines:

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 43.51      9.50     9.50                             vSwap
 34.86     17.11     7.61   179944     0.04     0.04  grInterpolateHLine
 13.99     20.17     3.06                             grClearDepthBuffer
<snip>
  0.76     21.78     0.17      624     0.27    12.46  grScanlineFill

The function vSwap is my double-buffer swapping function, and it also performs vsync, so it makes sense to me that the test program will spend much of its time waiting in there. grScanlineFill is my triangle-drawing function, which creates an edge list and then calls grInterpolateHLine to actually fill in the triangle.

My engine is currently using a Z-buffer to perform hidden surface removal. If we discount the (presumed) vsync overhead, then it turns out that the test program is spending something like 85% of its execution time either clearing the depth buffer, or writing pixels according to the values in the depth buffer. My depth buffer clearing function is simplicity itself: copy the maximum value of a float into each element. The function grInterpolateHLine is:

void grInterpolateHLine(int x1, int x2, int y, float z, float zstep, int colour) {
    for(; x1 <= x2; x1 ++, z += zstep) {
        if(z < grDepthBuffer[x1 + y*VIDEO_WIDTH]) {
            vSetPixel(x1, y, colour);
            grDepthBuffer[x1 + y*VIDEO_WIDTH] = z;
        }
    }
}

I really don't see how I can improve that, especially considering that vSetPixel is a macro.

My entire stock of ideas for optimization has been whittled down to precisely one:

Use an integer/fixed-point depth buffer.

The problem that I have with integer/fixed-point depth buffers is that interpolation can be very annoying, and I don't actually have a fixed-point number library yet. Any further thoughts out there? Any advice would be most appreciated.
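For reference, fixed-point interpolation along a single scanline can be sketched without a full library. The following is only a minimal sketch, assuming a 16.16 format and depth values whose magnitude stays below 32768; the names fixed16, fixed_from_float, fixed_step and depth_span_fixed are hypothetical and not part of the library described above. Whether this beats the float version depends on the target CPU and would need to be profiled.

```c
#include <stdint.h>

/* Hypothetical 16.16 fixed-point type: 16 integer bits, 16 fractional bits. */
typedef int32_t fixed16;

#define FIXED_SHIFT 16
#define FIXED_ONE   ((fixed16)1 << FIXED_SHIFT)

/* Convert a float to 16.16. Assumes |v| < 32768 so the result fits. */
fixed16 fixed_from_float(float v) {
    return (fixed16)(v * (float)FIXED_ONE);
}

/* Per-pixel depth step for a span running from x1 to x2 inclusive.
   A one-pixel span needs no step, so 0 is returned to avoid dividing by 0. */
fixed16 fixed_step(fixed16 z_start, fixed16 z_end, int x1, int x2) {
    return (x2 > x1) ? (z_end - z_start) / (x2 - x1) : 0;
}

/* Depth-test one span against a row of the depth buffer using only
   integer adds and compares. Returns how many pixels passed the test. */
int depth_span_fixed(fixed16 *depth_row, int x1, int x2,
                     fixed16 z, fixed16 zstep) {
    int written = 0;
    for (int x = x1; x <= x2; x++, z += zstep) {
        if (z < depth_row[x]) {
            depth_row[x] = z;   /* a real rasterizer would also set the pixel */
            written++;
        }
    }
    return written;
}
```

The depth buffer would be cleared to INT32_MAX instead of the maximum float. Because the step is truncated by the integer division, the last pixel of a long span can be off by a small fraction of a unit, which is the usual trade-off of this approach.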

Gerald On 24 November 2011 at 20:38

Hard to really tell what higher order optimizations can be done without seeing the rest of the code. I have a couple of minor observations, though.

There's no need to calculate x1 + y * VIDEO_WIDTH more than once in grInterpolateHLine. i.e.:

void grInterpolateHLine(int x1, int x2, int y, float z, float zstep, int colour) {
    int offset = x1 + (y * VIDEO_WIDTH);
    for(; x1 <= x2; x1 ++, z += zstep, offset++) {
        if(z < grDepthBuffer[offset]) {
            vSetPixel(x1, y, colour);
            grDepthBuffer[offset] = z;
        }
    }
}

Likewise, I'm guessing that your vSetPixel does a similar calculation, so you should be able to use the same offset there as well, and then you only need to increment offset and not x1 in each loop iteration. Chances are this can be extended back to the function that calls grInterpolateHLine, and you would then only need to do the multiplication once per triangle.
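The shared-offset idea can be sketched as follows. This is only an illustration under a stated assumption: the colour buffer is taken to be a linear array of int with the same row-major layout as the depth buffer, which may not match what the real vSetPixel macro does. The names sketchColourBuffer, sketchDepthBuffer, sketchClearDepth and sketchInterpolateHLine are hypothetical.

```c
#include <assert.h>
#include <float.h>

#define VIDEO_WIDTH  800
#define VIDEO_HEIGHT 600

/* Assumption: the colour buffer is a linear array laid out exactly like the
   depth buffer. The real vSetPixel macro may address memory differently. */
int   sketchColourBuffer[VIDEO_WIDTH * VIDEO_HEIGHT];
float sketchDepthBuffer[VIDEO_WIDTH * VIDEO_HEIGHT];

/* Reset every depth value to the farthest representable distance. */
void sketchClearDepth(void) {
    for (int i = 0; i < VIDEO_WIDTH * VIDEO_HEIGHT; i++)
        sketchDepthBuffer[i] = FLT_MAX;
}

/* Same depth test as grInterpolateHLine, but driven by one running offset
   that indexes both buffers, so x1 is never incremented in the loop. */
void sketchInterpolateHLine(int x1, int x2, int y,
                            float z, float zstep, int colour) {
    assert(x1 >= 0 && x2 < VIDEO_WIDTH && y >= 0 && y < VIDEO_HEIGHT);

    int row    = y * VIDEO_WIDTH;   /* the only multiplication */
    int offset = row + x1;
    int end    = row + x2;

    for (; offset <= end; offset++, z += zstep) {
        if (z < sketchDepthBuffer[offset]) {
            sketchColourBuffer[offset] = colour;
            sketchDepthBuffer[offset]  = z;
        }
    }
}
```

If the caller walks the triangle one scanline at a time, it could keep its own row offset and add VIDEO_WIDTH per line, which would remove the remaining multiplication from this function as well.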

There are some other things you could do with the depth buffer. Most of the time if the first pixel of the line either fails or passes the depth test, then the rest of the line will have the same result. So after the first test you can write a more efficient assembly block to test the entire line in one shot, then if it passes you can use a more efficient block memory setter to block-set the pixel and depth values instead of doing them one at a time. You would only need to test/set per pixel if the line is only partially occluded.
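A plain C sketch of that span-level test is below, in place of the assembly block mentioned above. It first runs a compare-only pass over the span, then either skips the span, fills it without any per-pixel branching, or falls back to the per-pixel test. The function name depth_span_bulk, the span_result enum, and the use of row pointers are assumptions made for illustration. Whether the extra pass pays off depends on the compiler and hardware, so it would need to be measured.

```c
/* Outcome of testing a whole span against the depth buffer. */
enum span_result { SPAN_REJECTED, SPAN_FILLED, SPAN_PARTIAL };

/* depth and colour_row point at the first pixel of the span; n is its length. */
enum span_result depth_span_bulk(float *depth, int *colour_row, int n,
                                 float z, float zstep, int colour) {
    int passed = 0;
    float zi = z;

    /* Pass 1: compare only, no stores, so the loop body has no branch. */
    for (int i = 0; i < n; i++, zi += zstep)
        passed += (zi < depth[i]);

    if (passed == 0)
        return SPAN_REJECTED;          /* fully occluded: nothing to write */

    if (passed == n) {
        /* Fully visible: unconditional fills, which compilers can vectorize. */
        for (int i = 0; i < n; i++)
            colour_row[i] = colour;
        zi = z;
        for (int i = 0; i < n; i++, zi += zstep)
            depth[i] = zi;
        return SPAN_FILLED;
    }

    /* Partially occluded: fall back to the per-pixel test. */
    zi = z;
    for (int i = 0; i < n; i++, zi += zstep) {
        if (zi < depth[i]) {
            colour_row[i] = colour;
            depth[i] = zi;
        }
    }
    return SPAN_PARTIAL;
}
```

Both passes accumulate z in the same way, so the values compared in the first pass are the same values stored in the second.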

Also, not sure what you mean by older computer, but if your target computer is multi-core then you can break it up among multiple cores. You can do this for the buffer clearing function as well. It can help quite a bit.

kestrel On 26 December 2011 at 20:40

I ended up solving this by replacing the Z-buffer with the Painter's Algorithm. I used SSE to write a Z-buffer implementation that created a bitmask with the pixels to paint (plus the range optimization suggested by Gerald), and it still ran far too slowly.

Thank you, everyone, for your input.

Brett Hale On 24 November 2011 at 22:50 (Accepted Answer)

You should have a look at the source code to something like Quake - considering what it could achieve on a Pentium, 15 years ago. Its z-buffer implementation used spans rather than per-pixel (or fragment) depth. Otherwise, you could look at the rasterization code in Mesa.
