How to design for reduced register usage in Numba CUDA kernels?


UPDATE: The original question below remains a good question about general design principles for GPU programming in Numba, and it is not quite the same as predicting register usage. I have also since discovered there is no need to predict register usage, because an undocumented command in Numba reports the register usage exactly. That allows one, by trial and error, to reduce register use, but converting that into general intuition and strategies for register use is what this question is asking. The links added before this question, claiming answers, contain incorrect information that does not answer it. Those links and the comments on this question simply claim, completely incorrectly, that register use can't be determined from Numba. This is absolutely incorrect: there is a not-yet-documented private method Numba supplies to do exactly that. This question isn't asking for that. It's asking how one strategically designs up front to minimize register use.

If a kernel uses more than 64 registers, then on many CUDA devices one can't use the maximum number of available threads. I find that my code also seems to use far more registers than I would guess from visually scanning it for intermediate results. Even loops add a lot to the register count.
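For context, the arithmetic behind that 64-register threshold can be sketched in plain Python. The constants below are assumptions for a typical recent device (65,536 registers per SM, 1024 threads per block); real hardware varies by compute capability and rounds register allocation in coarser granules, so treat this as a back-of-the-envelope model only.

```python
# Back-of-the-envelope occupancy arithmetic (no GPU needed).
# Assumed device limits; both vary by compute capability.
REGS_PER_SM = 65536
MAX_THREADS_PER_BLOCK = 1024
WARP_SIZE = 32

def max_resident_threads(regs_per_thread):
    """Threads per SM permitted by register pressure alone,
    rounded down to a whole warp."""
    limit = REGS_PER_SM // regs_per_thread
    limit = (limit // WARP_SIZE) * WARP_SIZE
    return min(limit, MAX_THREADS_PER_BLOCK)

print(max_resident_threads(64))  # 1024: still fits the full block
print(max_resident_threads(65))  # 992: one register too many drops whole warps
```

This is why crossing 64 registers per thread is a cliff: the register file, not the thread limit, becomes the binding constraint.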

So how can one see what lines of code are the culprits in using registers?

I'd settle for some rules of thumb, or, even better, a way to read this off the Numba IR available in the kernel.inspect_types() output.

Is there some equivalence between the $-sigil variables in the Numba IR and registers?

I realize I can get the total register count for a kernel by looking at ._func.info and _.fun.get().attr, and that's helpful. But it doesn't tell you which aspect of your code is causing the number of registers to balloon.

So I want a way to either be able to guess better or actually see it in the Numba IR.

Any insights?

For concreteness, here is a trivial example of this that someone posted: https://gist.github.com/sklam/0e750e0dea7571c68e94d99006ae8533

When I say rules of thumb, I imagine they might look like this:

  1. Add one for every fetch from global memory not going into shared memory.

  2. Add one for every binary operator like + or *.

  3. Add one for every input variable name (e.g. a pointer to global memory).

  4. Add one for every local variable.

But in practice I see more registers used than that would account for. And I also see the register count go up quite a lot when I include a loop or an if-statement. So I know I'm not doing it right.

Bottom line: how can I skillfully reduce register counts? I realize that optimizing compilers may reorder code or choose when to keep a variable in a register versus memory, but I still think there ought to be rules one can follow to try to reduce register usage.
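One mechanical lever worth knowing alongside any design rules: Numba's cuda.jit accepts a max_registers argument that asks the PTX assembler to cap register use per thread, spilling the excess to local memory (at a latency cost). This is a sketch assuming a reasonably recent Numba; decoration is lazy, so it runs even on a machine without a GPU, and the try/except keeps it harmless where Numba is absent.

```python
# Hedged sketch: cap register use via cuda.jit's max_registers option.
# The exact spill behaviour is decided by ptxas, not by Numba.
try:
    from numba import cuda

    @cuda.jit(max_registers=40)  # cap at 40 regs/thread; excess spills
    def capped(out):
        i = cuda.grid(1)
        if i < out.size:
            out[i] = i

    status = "decorated"
except Exception:                # Numba missing or CUDA unsupported
    status = "numba unavailable"

print(status)
```

Capping registers trades occupancy for spill traffic, so it is worth measuring both the register count and the runtime before and after.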
