I'm working with matrices that range in size from 2,000x2,000 up to 5,000x5,000, doing operations such as multiplication and QR decomposition. I'm curious whether, for example, I should align the stride to a multiple of 64 for all matrices for best performance. Also, should I avoid strides that are a multiple of some page size because of cache associativity, or does that not apply to GPU memory?
What stride should I use for matrices in CUDA for the fastest possible speed?
147 Views Asked by meisel At
I imagine most people trust `cudaMallocPitch` or `cudaMalloc3D` to provide the proper alignment, as this is their stated purpose. While not explicitly clarified in the runtime documentation, they align to `cudaDeviceProp::textureAlignment` (512 bytes on current hardware). There are also NPP's allocator functions, which seem to have different alignment strategies (or at least did so in the past). See "How does CUDA's nppiMalloc... function guarantee alignment?" for some discussion on that.
The lack of a pitched allocator function for the stream-ordered memory allocator suggests that alignment may not be as relevant today. Or it might be an oversight in the API, who knows?
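To make the effect of a pitched allocation concrete, here is a minimal host-side sketch of the arithmetic. It assumes the 512-byte alignment quoted above; the helper names `round_up`, `padded_pitch`, and `element_offset` are made up for illustration and are not part of the CUDA API. On a real device you would use the pitch value that `cudaMallocPitch` returns rather than computing your own.

```cpp
#include <cassert>
#include <cstddef>

// Round `bytes` up to the next multiple of `alignment`.
// `alignment` must be a power of two (512 is the figure quoted in the answer).
inline std::size_t round_up(std::size_t bytes, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (bytes + alignment - 1) & ~(alignment - 1);
}

// Pitch (row stride in bytes) for a row of `cols` elements of `elem_size` bytes.
// Hypothetical helper: mimics what a pitched allocator is assumed to do.
inline std::size_t padded_pitch(std::size_t cols, std::size_t elem_size,
                                std::size_t alignment = 512) {
    return round_up(cols * elem_size, alignment);
}

// Byte offset of element (row, col) inside a pitched allocation.
inline std::size_t element_offset(std::size_t row, std::size_t col,
                                  std::size_t pitch, std::size_t elem_size) {
    return row * pitch + col * elem_size;
}
```

For the sizes in the question, a row of 2,000 floats (8,000 bytes) pads to 8,192 bytes and a row of 5,000 floats (20,000 bytes) pads to 20,480 bytes under this assumption.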
What we do know from different parts of the programming guide is that `memcpy_async` requires 16-byte alignment for best performance. The best practices guide simply recommends 32-byte-aligned memory transactions.
I'm not aware of a list of cache parameters for each generation. Turing's L2 is 4 MiB, 16-way set associative with 64-byte lines, and the memory pages are 2 MiB. If I did the math right, each way covers 4 MiB / 16 = 256 KiB, so an alignment of 256 KiB would be pathological: every row would start in the same cache set. With these numbers I'd imagine you could start seeing effects with an alignment of 16 KiB or more, but I'm not aware of any official guidance on the subject.
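The cache arithmetic can be checked with a small host-side model. This is a sketch under a strong assumption: the set index is taken as a plain modulo of the line address, whereas real GPU L2 caches may hash addresses, so treat the results as an upper bound on how bad things could get. `CacheModel`, `kTuringL2`, and the function names are hypothetical; the numbers are the ones quoted above.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Simplified set-associative cache description (illustrative only).
struct CacheModel {
    std::size_t size_bytes;
    std::size_t ways;
    std::size_t line_bytes;
};

// Turing L2 figures as quoted in the answer: 4 MiB, 16-way, 64-byte lines.
const CacheModel kTuringL2{std::size_t{4} << 20, 16, 64};

// Bytes covered by one way: addresses this far apart share a set.
inline std::size_t way_bytes(const CacheModel& c) {
    assert(c.ways != 0);
    return c.size_bytes / c.ways;
}

inline std::size_t num_sets(const CacheModel& c) {
    assert(c.line_bytes != 0);
    return way_bytes(c) / c.line_bytes;
}

// ASSUMPTION: plain modulo indexing, no address hashing.
inline std::size_t set_index(const CacheModel& c, std::size_t addr) {
    return (addr / c.line_bytes) % num_sets(c);
}

// Number of distinct sets touched when walking down one column,
// i.e. visiting the first element of each of `rows` rows.
inline std::size_t distinct_sets_in_column(const CacheModel& c,
                                           std::size_t pitch,
                                           std::size_t rows) {
    std::set<std::size_t> seen;
    for (std::size_t r = 0; r < rows; ++r) {
        seen.insert(set_index(c, r * pitch));
    }
    return seen.size();
}
```

Under this model a 256 KiB pitch sends every row of a column to a single set, and a 16 KiB pitch spreads a column over only 16 of the 4,096 sets, which is consistent with the estimate that effects could begin around 16 KiB.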
Personally, I stick with the pitched allocators. When I don't use them, I use the texture alignment, except for smaller line sizes, where I just round up to the next power of two so as not to waste so much memory, unless I plan to use texture binding.
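One way to read that rule of thumb is sketched below. The cutoff for a "smaller line size" is not stated above, so this version assumes it means any row shorter than the texture alignment. `next_pow2` and `choose_pitch` are hypothetical helper names, and 512 stands in for `cudaDeviceProp::textureAlignment`, which you would query at run time.

```cpp
#include <cassert>
#include <cstddef>

// Smallest power of two that is >= n (n must be nonzero).
inline std::size_t next_pow2(std::size_t n) {
    assert(n != 0);
    std::size_t p = 1;
    while (p < n) {
        p <<= 1;
    }
    return p;
}

// Pick a row pitch in bytes.
// ASSUMPTION: "small" means shorter than the texture alignment.
inline std::size_t choose_pitch(std::size_t row_bytes,
                                std::size_t texture_alignment = 512) {
    assert(row_bytes != 0 && texture_alignment != 0);
    if (row_bytes < texture_alignment) {
        // Short rows: next power of two wastes less than a full 512-byte pad.
        return next_pow2(row_bytes);
    }
    // Long rows: round up to a multiple of the texture alignment.
    return (row_bytes + texture_alignment - 1) / texture_alignment
           * texture_alignment;
}
```

With these assumptions a 100-byte row gets a 128-byte pitch, while the 8,000-byte and 20,000-byte rows from the question get 8,192 and 20,480 bytes.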