In CUDA C++, I have a big array `Arr` of integers, and a function `F : int -> int`. I want to find the first index of an item in `Arr` that makes `F` maximal.

How can I write a kernel that always keeps the maximal value (of `F`) to compare with others, using atomic operations to avoid any race condition problems?

BTW, I wonder if I can use the functions in the Thrust library for this purpose instead.
Based on your description, including the usage of `int` and a desire to use atomics, I would suggest using a custom atomic. This should work for arrays up to 4 billion elements.

We are assembling and disassembling the 64-bit quantity as needed, and using the general method outlined in the programming guide for arbitrary atomics.
Yes, you can do this with a transform and a reduce operation in Thrust. In fact, Thrust can combine these into a single algorithm call. Here is an example:
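One possible sketch of that approach uses `thrust::transform_reduce` over a zip of the data and a counting iterator, so the transform (apply `f`, attach the index) and the reduce (argmax with first-index tie-break) happen in one call. As before, the `f` functor and the data are illustrative placeholders:

```cuda
#include <cstdio>
#include <climits>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

typedef thrust::tuple<int, unsigned> vi;  // (f(value), index)

// illustrative placeholder for the question's F (an assumption)
struct f {
  __host__ __device__ int operator()(int x) const { return -x * x + 100 * x; }
};

// unary transform: (element, index) -> (f(element), index)
struct apply_f {
  __host__ __device__ vi operator()(const vi &t) const {
    return vi(f()(thrust::get<0>(t)), thrust::get<1>(t));
  }
};

// binary reduction: keep the larger f-value; on ties, the smaller index
struct r {
  __host__ __device__ vi operator()(const vi &a, const vi &b) const {
    if (thrust::get<0>(a) != thrust::get<0>(b))
      return (thrust::get<0>(a) > thrust::get<0>(b)) ? a : b;
    return (thrust::get<1>(a) < thrust::get<1>(b)) ? a : b;
  }
};

int main() {
  const unsigned n = 1u << 20;
  thrust::host_vector<int> h(n);
  for (unsigned i = 0; i < n; i++) h[i] = (int)(i % 1000);  // placeholder data
  thrust::device_vector<int> arr = h;
  // transform and reduce combined into a single algorithm call
  vi result = thrust::transform_reduce(
      thrust::make_zip_iterator(thrust::make_tuple(
          arr.begin(), thrust::make_counting_iterator(0u))),
      thrust::make_zip_iterator(thrust::make_tuple(
          arr.end(), thrust::make_counting_iterator(n))),
      apply_f(), vi(INT_MIN, 0u), r());
  printf("first argmax index: %u\n", thrust::get<1>(result));
  return 0;
}
```

Because the reduction works on plain tuples rather than a packed 64-bit word, swapping in a wider value type (e.g. a 64-bit `f`) only requires changing the tuple's first element type.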
Notes:

- Comparing those 2 implementations/examples, I know which one I would choose. The atomic method is brittle, type-limited, and size-limited. The Thrust example could be adapted to handle types more flexibly (e.g. a function that returns a 64-bit type) and could be extended to handle beyond 4 billion elements.
- The code here is just intended to be a possible roadmap. It's not thoroughly tested; bugs are always possible.
- There is a strong correspondence between the two methods. The `main()` routines have almost a 1:1 correspondence, which hopefully you can identify. Furthermore, the `r()` functor corresponds to the `done()` function, and the `f()` functor corresponds to the `f()` function.
- Don't assume that you can readily/trivially increase my atomic example to 4 billion elements. The `f()` function I wrote would overflow/underflow an `int` variable. But with an appropriate data array and `f()` function, it should be possible to use up to 4 billion elements.

EDIT: As suggested in the comments below, we may be able to do a better job in the atomic case by doing a threadblock-level shared sweep reduction, followed by a single atomic per threadblock. Here is an example of that: