I'm trying to implement frustum culling on the GPU. After reading a bit and stumbling on this very helpful repo: https://github.com/ellioman/Indirect-Rendering-With-Compute-Shaders , I've noticed that the go-to implementation seems to be:
- Test the bounding box (bbox) of every object and mark the ones that are inside the camera frustum with a 1 and the ones that are not with a 0.
- Save the result in a buffer.
- Run a scan (prefix sum) on this buffer.
- Use the scan results as write indices into a final buffer that stores the selected matrices, which are then used in the draw pass.
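To make sure I understand these steps, here is a minimal CPU-side sketch in Python of the mark / scan / scatter pipeline as I read it. It is not code from the repo: the function name `compact_with_scan` and the use of strings as stand-ins for matrices are my own assumptions, and on the GPU each loop would be a compute shader dispatch rather than a sequential loop.

```python
from itertools import accumulate

def compact_with_scan(matrices, visible):
    """Hypothetical helper: keep the visible matrices, preserving their order."""
    # Steps 1-2: one flag per object, 1 if it passed the frustum test, else 0.
    flags = [1 if v else 0 for v in visible]

    # Step 3: exclusive scan. offsets[i] is the number of visible
    # objects before index i, which is the output slot of object i.
    offsets = list(accumulate(flags, initial=0))[:-1]

    # Step 4: scatter. Every visible object writes to its own slot,
    # so no two writes can collide.
    out = [None] * sum(flags)
    for i, flag in enumerate(flags):
        if flag:
            out[offsets[i]] = matrices[i]
    return out

result = compact_with_scan(["A", "B", "C", "D"], [True, False, True, True])
```

With this input the flags are `[1, 0, 1, 1]` and the exclusive scan gives `[0, 1, 1, 2]`, so the visible objects land in slots 0, 1 and 2 in their original order.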
But I'm wondering: why use the scan, with all its complexity, instead of appending the matrices of the objects that pass the bbox test directly into an append buffer? My guess is that append buffer accesses are slow, but are they really that much slower than running a scan on the GPU (which can take 2 dispatch calls if the input array is bigger than the max threads per group)?
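For comparison, this is roughly what I mean by the append approach, again as a CPU-side Python sketch under my own assumptions. The function name `compact_with_append` is hypothetical, the plain integer `counter` stands in for the atomic counter behind an append buffer, and the shuffled loop order is only a stand-in for the fact that GPU threads do not run in a fixed order.

```python
import random

def compact_with_append(matrices, visible, seed=0):
    """Hypothetical helper: keep the visible matrices via a shared counter."""
    # Stand-in for arbitrary GPU thread scheduling (an assumption of
    # this sketch, not a model of any real scheduler).
    order = list(range(len(matrices)))
    random.Random(seed).shuffle(order)

    out = [None] * len(matrices)  # worst case: everything is visible
    counter = 0                   # on the GPU this would be an atomic counter
    for i in order:
        if visible[i]:
            slot = counter        # atomic increment: take the old value...
            counter += 1          # ...and bump the shared counter
            out[slot] = matrices[i]
    return out[:counter]

appended = compact_with_append(["A", "B", "C", "D"], [True, False, True, True])
```

Both versions end up with the same set of matrices. The difference I can see is that the append version depends on one shared counter and does not guarantee the output order, while the scan version keeps the objects in their original order.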
Thank you !
EDIT: I am on Unity, but I don't think this matters much for this question.