I am using Metal to process video frames in real time. At a high level, my code is structured as follows:
1. Obtain a video texture
2. Create a command buffer
3. Encode a number of compute shaders in series, each with its own command encoder created from the command buffer in (2). Each shader operates on the texture in (1) – reading from it, performing some calculation, and writing back to it (via `read_write` access).
4. Commit the command buffer
When profiling this setup, I noticed that the actual computations take up a very small proportion of the GPU work. The vast majority is spent in the "synchronization" category, on shader lines that read from or write to the texture.
The Metal documentation seems to imply that this is a product of processing large textures:
Large textures: Indicated by high synchronization time when profiling your fragment shader. The per-line profiling result shows a high percentage of time in wait memory.
That makes sense, but I can't reduce the size of my textures since I'm processing 4K video at full resolution. As an experiment, I tried making one massive shader that takes all the necessary input textures and buffers and performs all the computations in one function. Performance was greatly improved, and the total synchronization cost was reduced as expected. This is not surprising – it seems much better to read once at the start of processing and write once at the end. However, it is much more convenient (from the perspective of the rest of the code) to treat each compute step as its own compute kernel. That way, I can easily add or remove steps as necessary, and I don't have a ton of largely unrelated parameters in my kernel function.
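To make the "read once, write once" reasoning concrete, here is a rough CPU-side model in C++. It is not Metal code and says nothing about how the GPU actually schedules the work. `CountingTexture`, `Step`, `run_separate`, `run_fused`, and `example_steps` are all hypothetical names invented for this sketch, and the three arithmetic steps are placeholders for my real per-pixel computations. It only counts how often each pixel is loaded and stored under the two strategies.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for a texture: counts every load and store so the
// memory traffic of the two strategies can be compared.
struct CountingTexture {
    std::vector<float> pixels;
    std::size_t reads = 0;
    std::size_t writes = 0;

    explicit CountingTexture(std::size_t n, float value = 0.5f)
        : pixels(n, value) {}

    float read(std::size_t i) { ++reads; return pixels[i]; }
    void write(std::size_t i, float v) { ++writes; pixels[i] = v; }
};

// A per-pixel processing step (placeholder for one compute kernel's math).
using Step = std::function<float(float)>;

// Models one kernel per step: every step loads and stores every pixel.
void run_separate(CountingTexture& tex, const std::vector<Step>& steps) {
    for (const Step& step : steps) {
        for (std::size_t i = 0; i < tex.pixels.size(); ++i) {
            tex.write(i, step(tex.read(i)));
        }
    }
}

// Models the single large kernel: each pixel is loaded once, all steps run
// on the local value, and the result is stored once.
void run_fused(CountingTexture& tex, const std::vector<Step>& steps) {
    for (std::size_t i = 0; i < tex.pixels.size(); ++i) {
        float v = tex.read(i);
        for (const Step& step : steps) {
            v = step(v);
        }
        tex.write(i, v);
    }
}

// Three arbitrary placeholder steps, applied in order.
std::vector<Step> example_steps() {
    return {
        [](float v) { return v * 2.0f; },
        [](float v) { return v + 1.0f; },
        [](float v) { return v * 0.5f; },
    };
}
```

In this model, running `example_steps()` over a texture of N pixels costs 3N reads and 3N writes with `run_separate`, but only N reads and N writes with `run_fused`, while both produce the same pixel values. That matches the direction of what I measured, although the real GPU cost obviously depends on much more than a simple access count.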
Is there another way I can avoid this synchronization overhead without creating one massive compute kernel? It seems like there should be some way to allow Metal to elide (or at least optimize) a write immediately followed by a read from the same texture within the same command buffer. I thought that coalescing these shaders to use one encoder might help; however, I don't think I can do this – each shader has varying auxiliary input textures or buffers. Moreover, I saw in the debug tools that Metal seems to be coalescing these encoders automatically.
