OpenACC 2.0 routine: data locality

Question

359 Views Asked by lodhb At 11 November 2014 at 20:19

The following code calls a simple routine on the accelerator. The routine is compiled for the device using OpenACC 2.0's routine directive:

#include <iostream>

#pragma acc routine
int function(int *ARRAY,int multiplier){
        int sum=0;

        #pragma acc loop reduction(+:sum)
        for(int i=0; i<10; ++i){
                sum+=multiplier*ARRAY[i];
        }

        return sum;
}

int main(){
        int *ARRAY = new int[10];
        int multiplier = 5;
        int out;

        for(int i=0; i<10; i++){
                ARRAY[i] = 1;
        }

        #pragma acc enter data create(out) copyin(ARRAY[0:10],multiplier)

        #pragma acc parallel present(out,ARRAY[0:10],multiplier)
        if (function(ARRAY,multiplier) == 50){
                out = 1;
        }else{
                out = 0;
        }

        #pragma acc exit data copyout(out) delete(ARRAY[0:10],multiplier)

        std::cout << out << std::endl;
}

How does function know to use the device copies of ARRAY[0:10] and multiplier when it is called from within a parallel region? How can we enforce the use of the device copies?

There are 3 best solutions below

**Guang Mo** · Answer 1 · 2015-03-09T03:59:47.170000

Basically, when you use a "data" clause, the runtime creates the data in device memory or copies it there, and the code marked with "acc routine" is then executed on the device. Note that host and device memory are not shared, unlike in multi-threading (OpenMP). So yes, "function" will use the device copies of ARRAY and multiplier as long as it is called inside the data region. Hope this helps! :)
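To make the data-region point concrete, here is a minimal sketch of one way to confirm from the host that an array has a device copy before launching a region that relies on it. The function name sum_with_presence_check is hypothetical and not from the question. acc_is_present is part of the OpenACC runtime API, and the _OPENACC guard is only there so the same file still builds (and runs serially on the host) with a compiler that ignores the pragmas.

```cpp
#include <cstddef>
#ifdef _OPENACC
#include <openacc.h>   // OpenACC runtime API (acc_is_present)
#endif

// Hypothetical helper: copies `array` to the device, verifies that the
// runtime reports it as present, sums multiplier * array[i] on the device,
// then releases the device copy. Returns -1 if the presence check fails.
int sum_with_presence_check(int *array, int n, int multiplier) {
    // Create a device copy of array[0:n] (unstructured data lifetime).
    #pragma acc enter data copyin(array[0:n])

#ifdef _OPENACC
    // Host-side check: is there a device copy of this host range?
    if (!acc_is_present(array, static_cast<std::size_t>(n) * sizeof(int))) {
        return -1;
    }
#endif

    int sum = 0;
    // present(...) asserts the data is already on the device, so no
    // implicit copy is made here.
    #pragma acc parallel loop present(array[0:n]) reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += multiplier * array[i];
    }

    // Drop the device copy without copying it back.
    #pragma acc exit data delete(array[0:n])
    return sum;
}
```

With an array of ten ones and a multiplier of 5, as in the question, this returns 50. Note that the check runs on the host. It does not verify anything from inside the device code.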

**Brian Yang** · Answer 2 · 2015-04-07T05:50:55.130000

You should give the routine an explicit level of parallelism, such as gang, worker, or vector. That is the more accurate way to declare it.

The routine will use the data in device memory.
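As a sketch of what that could look like for the code in the question, the routine below is declared with the vector level, and the loop inside it uses the matching vector clause. The names scaled_sum and run_scaled_sum are hypothetical, and the choice of vector (rather than gang, worker, or seq) is an assumption for illustration. The single-gang launch is just the simplest way to call it once.

```cpp
// `vector` tells the compiler that this routine contains vector-level
// parallelism, so the loop inside may be spread across vector lanes.
#pragma acc routine vector
int scaled_sum(const int *array, int n, int multiplier) {
    int sum = 0;
    #pragma acc loop vector reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += multiplier * array[i];
    }
    return sum;
}

// Hypothetical driver: launches one gang so the routine is called once,
// then copies the scalar result back to the host.
int run_scaled_sum(const int *array, int n, int multiplier) {
    int result = 0;
    #pragma acc parallel num_gangs(1) copyin(array[0:n]) copyout(result)
    {
        result = scaled_sum(array, n, multiplier);
    }
    return result;
}
```

Without an OpenACC compiler the pragmas are ignored and the same code runs serially on the host, which is handy for checking the arithmetic.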

**jefflarkin** · Answer 3 · 2015-08-10T20:41:15.233000

When your routine is called within a device region (the parallel in your code), it is called by threads on the device, so those threads only have access to arrays on the device. The compiler may choose to inline the function, or it may emit a device-side function call. Either way, when the function is called from the device it receives device copies of the data, because it essentially inherits the present data clause from the parallel region. If you still want to convince yourself that you're running on the device once inside the function, you could call acc_on_device, but that only tells you that you're running on the accelerator, not that you received a device pointer.
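A minimal sketch of the acc_on_device check mentioned above might look like this. The names running_on_device and probe_parallel_region are hypothetical. acc_on_device and acc_device_not_host come from the OpenACC runtime API, and the _OPENACC guard is an assumption added so the file also builds without OpenACC support.

```cpp
#ifdef _OPENACC
#include <openacc.h>   // acc_on_device, acc_device_not_host
#endif

// Returns 1 when executing on an accelerator, 0 when executing on the host.
// `seq` is used because the routine contains no parallel loops.
#pragma acc routine seq
int running_on_device() {
#ifdef _OPENACC
    return acc_on_device(acc_device_not_host) ? 1 : 0;
#else
    return 0;   // no OpenACC support: everything runs on the host
#endif
}

// Hypothetical probe: calls the routine from inside a parallel region and
// copies the answer back. One gang avoids redundant writes to `on_device`.
int probe_parallel_region() {
    int on_device = 0;
    #pragma acc parallel num_gangs(1) copyout(on_device)
    {
        on_device = running_on_device();
    }
    return on_device;
}
```

Calling running_on_device() directly from host code returns 0. The value returned by probe_parallel_region() depends on whether the region was actually offloaded, and, as noted above, it says nothing about the pointers the routine received.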

If you want to enforce the use of device copies more than that, you could make the routine nohost so that it would technically not be valid to call from the host, but that doesn't really do what you're asking, which is to do a check on the GPU that the array really is a device array.

Keep in mind though that any code inside a parallel region that is not inside a loop will be run gang-redundantly, so the write to out is likely a race condition, unless you happen to be running with one gang or you write to it using an atomic.
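One way to sidestep that race, sketched here under the assumption that a sequential routine is acceptable, is to compute the flag in a gang-private variable and store it with an atomic write. The names scaled_sum_seq and check_sum are hypothetical. Limiting the region to a single gang with num_gangs(1) would be the other option mentioned above.

```cpp
// Sequential device routine: each calling thread computes the full sum.
#pragma acc routine seq
int scaled_sum_seq(const int *array, int n, int multiplier) {
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += multiplier * array[i];
    }
    return sum;
}

// Hypothetical driver: returns 1 if the scaled sum equals `expected`,
// 0 otherwise.
int check_sum(const int *array, int n, int multiplier, int expected) {
    int out = 0;
    #pragma acc parallel copyin(array[0:n]) copy(out)
    {
        // Declared inside the region, so each gang has its own `flag`.
        int flag = (scaled_sum_seq(array, n, multiplier) == expected) ? 1 : 0;

        // Every gang stores the same value; the atomic makes each store
        // well defined instead of a data race.
        #pragma acc atomic write
        out = flag;
    }
    return out;
}
```

Every gang still does the same work redundantly here. The atomic only makes the concurrent stores to out safe.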
