What does storageBarrier in WebGPU actually do?


So I'm exploring WebGPU and figured it would be an interesting exercise to implement a basic neural network in it. Having little understanding of both GPU shader programming and neural networks, and with my only references for WebGPU (w3.org/TR/webgpu and w3.org/TR/WGSL) being highly technical, it has been really interesting indeed.

Anyway, somehow I've muddled my way to a point where I can actually perform feed-forward and back-propagation correctly on a small network. It's also blazingly fast compared to my JS CPU implementation, even though I'm sure I'm severely underutilizing the hardware.

I've come to a point where I want to try bigger networks, but I'm at a bit of a loss when it comes to workgroups and synchronizing execution. To keep it simple, I'll focus my problem on the feed-forward operation:

Currently, I'm dispatching exactly the number of threads that corresponds to the widest layer in the neural network. The idea is that each thread computes the value for a single neuron in the current layer, then hits a barrier, and then every thread moves on to the next layer together, and so on.
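Roughly, the idea looks like this (a simplified sketch with made-up names and buffer layout, not my actual code):

```wgsl
// Simplified sketch: one invocation per neuron of the widest layer,
// all invocations walk the layers together (names are illustrative).

struct Sizes {
  layer_count : u32,
  // per-layer sizes/offsets would live here in some form
}

@group(0) @binding(0) var<uniform> sizes : Sizes;
@group(0) @binding(1) var<storage, read> weights : array<f32>;
@group(0) @binding(2) var<storage, read_write> activations : array<f32>;

@compute @workgroup_size(64)
fn feed_forward(@builtin(global_invocation_id) gid : vec3<u32>) {
  let neuron = gid.x;
  for (var layer = 1u; layer < sizes.layer_count; layer = layer + 1u) {
    // ... compute this neuron's activation in 'layer' from the previous
    // layer's activations (indexing details omitted) ...

    // Wait for every thread to finish this layer before anyone reads it
    // as the previous layer of the next iteration.
    workgroupBarrier(); // but this only synchronizes within one workgroup!
  }
}
```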

The problem is, I only know of two ways to set a barrier: either workgroupBarrier(), or ending execution and dispatching a new pile of threads for the next layer.

The problem with the first one is that it only works within a workgroup, and I can only make workgroups so big before performance starts suffering, because from what I understand only a single CU can work on a workgroup due to the need to share memory. If I make my workgroup 256x256 then it would get cut into chunks that a single CU would have to chew through while the rest of the hardware sits idle. This limits how wide I can make my networks to how many threads a single CU can fit, which is pretty lame.

The problem with the second one is pretty obvious - a separate dispatch is just slow, much slower than a barrier from my testing.

As it is right now, I'm not using workgroup shared memory at all; all I want to do is dispatch an arbitrary number of threads and have a global barrier. As far as I understand, though, WebGPU doesn't have a global barrier... except maybe storageBarrier?

Even after reading the two sentences on w3.org about it, I still have no clue what it does, but I think it's something to do with memory access synchronization rather than a global barrier. I did test it and the results come out correct; however, even if I remove all barriers from my code the result still comes out correct (perks of the SIMT execution style of GPUs, I guess). But I don't need it to be "probably correct", I need guaranteed correct, so I need a global barrier. Is storageBarrier the thing? If not, then what is it?

Bonus question - why are there 3 dimensions to workgroups and dispatches, why not just have one?


1 Answer


Great questions.

Easy one first:

Bonus question - why are there 3 dimensions to workgroups and dispatches, why not just have one?

That's just how GPUs are structured internally. Compute shaders evolved out of straight graphics rendering: 2D dispatches correspond well to 2D image processing (e.g. convolutions), and graphics rendering has 3D textures as well.
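For example, with a 2D dispatch an invocation's global id maps directly to a pixel coordinate. Here's a sketch (the bindings and texture format are just illustrative):

```wgsl
// Illustrative sketch: a 2D dispatch, one invocation per pixel.
@group(0) @binding(0) var input_tex : texture_2d<f32>;
@group(0) @binding(1) var output_tex : texture_storage_2d<rgba8unorm, write>;

@compute @workgroup_size(8, 8)
fn copy_pixels(@builtin(global_invocation_id) gid : vec3<u32>) {
  let dims = textureDimensions(input_tex);
  if (gid.x >= dims.x || gid.y >= dims.y) {
    return; // the dispatch may overshoot the image size
  }
  // gid.xy is already a natural 2D pixel coordinate; no index math needed.
  let texel = textureLoad(input_tex, vec2<i32>(gid.xy), 0);
  textureStore(output_tex, vec2<i32>(gid.xy), texel);
}
```

On the host side you would then dispatch roughly ceil(width / 8) by ceil(height / 8) workgroups via dispatchWorkgroups.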

A barrier helps you coordinate access to read-write memory. The question is: which agents (invocations) are you coordinating, and which memory are you controlling access to?

Barriers coordinate across two dimensions:

  • different invocations.
  • different address spaces.

Invocations are hierarchically grouped:

  • workgroup: invocations that run in parallel and have shared access to variables in the 'workgroup' address space.
  • all the invocations in the dispatch, i.e. all the workgroups launched by the same dispatch. Different workgroups in the same dispatch might run concurrently, or they might run serially. The model therefore does not support well-defined coordination between workgroups in the same dispatch.

Address spaces:

  • 'workgroup' address space: holds variables that are shared within a single workgroup
  • 'storage': holds variables (buffers) shared across all the invocations in the dispatch, i.e. all the workgroups. These can be read-only or read-write.
  • 'uniform': like storage, but always read-only, so coordination is trivial.
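In WGSL, declarations in those address spaces look roughly like this (illustrative names only):

```wgsl
struct Params {
  layer_count : u32,
}

// 'uniform': read-only, visible to every invocation in the dispatch.
@group(0) @binding(0) var<uniform> params : Params;

// 'storage': buffers visible to every invocation in the dispatch,
// declared as either read-only or read-write.
@group(0) @binding(1) var<storage, read> weights : array<f32>;
@group(0) @binding(2) var<storage, read_write> activations : array<f32>;

// 'workgroup': shared only within a single workgroup.
var<workgroup> tile : array<f32, 64>;
```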

Given that, we can now say:

  • storageBarrier coordinates access by invocations in a single workgroup to buffers in the 'storage' address space.
  • workgroupBarrier coordinates access by invocations in a single workgroup to variables in the 'workgroup' address space.

In detail, a reasonable way to think about it is that a barrier for address space X (where X is 'workgroup' or 'storage') is a point in execution where:

  • all invocations in a workgroup wait for each other to reach the barrier
  • all in-flight writes to variables in address space 'X' complete
  • then all invocations become unblocked, and can continue executing after the barrier.
  • after the barrier, any reads from variables in address space 'X' will "see" the writes that were initiated before the barrier.

(This is not how it's described in the spec because it's overconstrained. But that's for the language lawyers.)
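Concretely, the usual pattern inside one workgroup is write, then barrier, then read. A sketch (names are illustrative):

```wgsl
@group(0) @binding(0) var<storage, read_write> data : array<f32>;

var<workgroup> scratch : array<f32, 64>;

@compute @workgroup_size(64)
fn example(@builtin(local_invocation_index) i : u32) {
  // Each invocation writes its own slot of workgroup memory...
  scratch[i] = data[i] * 2.0;

  // ...everyone waits here, and writes to 'workgroup' variables made
  // before the barrier are visible afterwards...
  workgroupBarrier();

  // ...so it is now safe to read a slot written by a different invocation.
  data[i] = scratch[(i + 1u) % 64u];

  // storageBarrier() plays the same role for the 'storage' buffer 'data',
  // but still only among the invocations of this one workgroup.
}
```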

You'll notice: you can only coordinate across invocations in the same workgroup. That means there is no supported way to do this with non-atomic operations:

  • write data to 'storage' buffers in one workgroup
  • read the same data back in a different workgroup, but in the same dispatch

Why? Metal Shading Language barriers don't support it. Sorry. For details, see https://github.com/gpuweb/gpuweb/pull/2297

(In case you're looking to follow up in discussions of memory model definition and testing, that pattern is called the "message passing" pattern.)
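To make that concrete, here is a sketch of the message-passing pattern that is not supported across workgroups (illustrative only; don't rely on it):

```wgsl
// One workgroup writes data and publishes a flag; another workgroup reads
// the flag and then the data. WGSL does NOT guarantee this works.
@group(0) @binding(0) var<storage, read_write> payload : array<f32>;
@group(0) @binding(1) var<storage, read_write> flag : atomic<u32>;

@compute @workgroup_size(1)
fn main(@builtin(workgroup_id) wg : vec3<u32>) {
  if (wg.x == 0u) {
    // "Producer" workgroup: write the data, then publish a flag.
    payload[0] = 42.0;
    storageBarrier(); // orders these accesses only within THIS workgroup
    atomicStore(&flag, 1u);
  } else {
    // "Consumer" workgroup: even if this read observes 1u, WGSL gives no
    // guarantee that payload[0] already holds 42.0.
    if (atomicLoad(&flag) == 1u) {
      _ = payload[0];
    }
  }
}
```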

Note: "CU" or "compute unit" is not a well-defined term in GPU language specs. It's how particular GPUs are organized and marketed, but that's a detail.

OK, about how to structure your workgroups. It's all easy if the shape of your data is the same as your workgroup. Otherwise you have to block your data, i.e. partition the problem to fit, or make a single invocation do a block of data at a time. That's the key to maximizing utilization and parallelism. There's a lot of literature and tutorials about how to do that, especially for things like matrix multiply.
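Here's a rough sketch of the blocking idea, where each invocation loops over a contiguous block of elements so a fixed dispatch can cover arbitrarily large data (the block size and names are just for illustration):

```wgsl
// Rough sketch: each invocation handles a contiguous block of elements,
// so a fixed number of invocations can cover arbitrarily large data.
const BLOCK_SIZE : u32 = 16u;

@group(0) @binding(0) var<storage, read_write> data : array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
  let start = gid.x * BLOCK_SIZE;
  for (var i = 0u; i < BLOCK_SIZE; i = i + 1u) {
    let idx = start + i;
    if (idx < arrayLength(&data)) {
      data[idx] = data[idx] * 2.0; // stand-in for real per-element work
    }
  }
}
```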