How do the nodes in a CUDA graph connect?

1.1k Views Asked by At

CUDA graphs are a new way to synthesize complex operations from multiple operations. With "stream capture", it appears that you can run a mix of operations, including CuBlas and similar library operations and capture them as a singe "meta-kernel".

What's unclear to me is how the data flow works for these graphs. In the capture phase, I allocate memory A for the input, memory B for the temporary values, and memory C for the output. But when I capture this in a graph, I don't capture the memory allocations. So when I then instantiate multiple copies of these graphs, they cannot share the input memory A, temporary workspace B or output memory C.

How then does this work? I.e. when I call cudaGraphLaunch, I don't see a way to provide input parameters. My captured graph basically starts with a cudaMemcpyHostToDevice, how does the graph know which host memory to copy and where to put it?

Background: I found that CUDA is heavily bottlenecked on kernel launches; my AVX2 code was 13x times slower when ported to CUDA. The kernels themselves seem fine (according to NSight), it's just the overhead of scheduling several hundred thousand kernel launches.

1

There are 1 best solutions below

4
On BEST ANSWER

A memory allocation would typically be done outside of a graph definition/instantiation or "capture".

However, graphs provide for "memory copy" nodes, where you would typically do cudaMemcpy type operations.

At the time of graph definition, you pass a set of arguments for each graph node (which will depend on the node type, e.g. arguments for the cudaMemcpy operation, if it is a memory copy node, or kernel arguments if it is a kernel node). These arguments determine the actual memory allocations that will be used when that graph is executed.

If you wanted to use a different set of allocations, one method would be to instantiate another graph with different arguments for the nodes where there are changes. This could be done by repeating the entire process, or by starting with an existing graph, making changes to node arguments, and then instantiating a graph with those changes.

Currently, in cuda graphs, it is not possible to perform runtime binding (i.e. at the point of graph "launch") of node arguments to a particular graph/node. It's possible that new features may be introduced in future releases, of course.

Note that there is a CUDA sample code called simpleCudaGraphs available in CUDA 10 which demonstrates the use of both memory copy nodes, and kernel nodes, and also how to create dependencies (effectively execution dependencies) between nodes.