Is there a better way to optimize Metal compute shaders that operate on the same texture in series?

102 Views Asked by tenuki At 09 January 2024 at 16:29

I am using Metal to process video frames in real time. At a high level, my code is structured as follows:

1. Obtain a video texture
2. Create a command buffer
3. Encode a number of compute shaders in series, each with its own command encoder created from the command buffer in (2). Each shader operates on the texture in (1): it reads from the texture, performs some calculation, and writes back to it (via read_write access).
4. Commit the command buffer
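
For reference, the structure described above could be sketched in Swift roughly as follows. This is a simplified illustration rather than my exact code: `ComputeStep` and `encodeSteps` are hypothetical names, and the binding layout (frame at texture index 0, auxiliary textures from index 1, auxiliary buffers from index 0) is an assumption made for the example.

```swift
import Metal

// Hypothetical description of one processing step: a pipeline plus the
// auxiliary resources that only this step needs.
struct ComputeStep {
    let pipeline: MTLComputePipelineState
    let auxTextures: [MTLTexture]   // bound at texture indices 1, 2, ... (assumed layout)
    let auxBuffers: [MTLBuffer]     // bound at buffer indices 0, 1, ... (assumed layout)
}

// Encodes every step in its own compute encoder on a single command buffer,
// then commits it. Each kernel is assumed to declare the frame at texture
// index 0 with access::read_write.
func encodeSteps(_ steps: [ComputeStep],
                 on frame: MTLTexture,
                 using queue: MTLCommandQueue) -> MTLCommandBuffer? {
    guard let commandBuffer = queue.makeCommandBuffer() else { return nil }

    for step in steps {
        guard let encoder = commandBuffer.makeComputeCommandEncoder() else { return nil }
        encoder.setComputePipelineState(step.pipeline)
        encoder.setTexture(frame, index: 0)
        for (i, texture) in step.auxTextures.enumerated() {
            encoder.setTexture(texture, index: i + 1)
        }
        for (i, buffer) in step.auxBuffers.enumerated() {
            encoder.setBuffer(buffer, offset: 0, index: i)
        }

        // One thread per pixel. The grid is rounded up to whole threadgroups,
        // so each kernel is assumed to bounds-check its thread position.
        let w = step.pipeline.threadExecutionWidth
        let h = max(1, step.pipeline.maxTotalThreadsPerThreadgroup / w)
        let threadsPerGroup = MTLSize(width: w, height: h, depth: 1)
        let groups = MTLSize(width: (frame.width + w - 1) / w,
                             height: (frame.height + h - 1) / h,
                             depth: 1)
        encoder.dispatchThreadgroups(groups, threadsPerThreadgroup: threadsPerGroup)
        encoder.endEncoding()
    }

    commandBuffer.commit()
    return commandBuffer
}
```

Every step binds the same `frame` texture, so each kernel reads the full 4K frame from memory and writes it back before the next one starts.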

When profiling this setup, I noticed that the actual computations take up a very small proportion of the GPU work. The vast majority is taken up by the "synchronization" category on shader lines that read from or write to the texture.

The Metal documentation seems to imply that this is a product of processing large textures:

Large textures: Indicated by high synchronization time when profiling your fragment shader. The per-line profiling result shows a high percentage of time in wait memory.

That makes sense, but I can't reduce the size of my textures since I'm processing 4K video at full resolution. As an experiment, I tried making one massive shader that takes all the necessary input textures and buffers and performs all the computations in one function. Performance was greatly improved, and the total synchronization cost was reduced as expected. Of course, this makes a lot of sense: it seems much better to read once at the start of processing and write once at the end. However, it is much more convenient (from the perspective of the rest of the code) to treat each compute step as its own compute kernel. That way, I can easily add or remove steps as necessary, and I don't have a ton of largely unrelated parameters in my kernel function.

Is there another way I can avoid this synchronization overhead without creating one massive compute kernel? It seems like there should be some way to allow Metal to elide (or at least optimize) a write immediately followed by a read from the same texture within the same command buffer. I thought that coalescing these shaders to use one encoder might help; however, I don't think I can do this – each shader has varying auxiliary input textures or buffers. Moreover, I saw in the debug tools that Metal seems to be coalescing these encoders automatically.
