How do GPUs handle random access?

I read some tutorials on how to implement a raytracer in OpenGL 4.3 compute shaders, and it made me think about something that has been bugging me for a while. How exactly do GPUs handle the massive number of random-access reads needed to implement something like that? Does every stream processor get its own copy of the data? It seems the system would become very congested with memory accesses, but that's just my own, probably incorrect, intuition.
The streaming multiprocessors (SMs) have caches, but they are relatively small and won't help with truly random access.
Instead, GPUs try to mask memory-access latency: each SM is assigned more threads to execute than it has cores. On every free clock cycle it schedules some of the threads that aren't blocked on memory access. When the data a thread needs isn't in the SM's cache, that thread stalls until the data arrives, letting other threads execute in the meantime.
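As a minimal sketch of the kind of access being discussed, here is a GLSL 4.30 compute shader that does one effectively random read per invocation (the buffer names and bindings are made up for illustration); the comment marks the point where the latency hiding described above kicks in:

```glsl
#version 430
layout(local_size_x = 256) in;

layout(std430, binding = 0) readonly  buffer Indices { uint  idx[];     };
layout(std430, binding = 1) readonly  buffer Src     { float srcData[]; };
layout(std430, binding = 2) writeonly buffer Dst     { float dstData[]; };

void main() {
    uint i = gl_GlobalInvocationID.x;

    // This load may miss the SM's caches and take hundreds of cycles to be
    // serviced from DRAM. The invocation (and its warp/wavefront) simply
    // waits here; meanwhile the SM's scheduler issues instructions from the
    // other warps resident on it, so the ALUs stay busy as long as enough
    // threads are in flight.
    float v = srcData[idx[i]];

    dstData[i] = v;
}
```

A dispatch like this typically launches far more invocations than the GPU has cores, and it is exactly that oversubscription that gives the scheduler something to run while other threads wait on memory.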
Note that this masking only works if the amount of computation per thread exceeds the time spent waiting for the data (e.g. per-pixel lighting calculations). If that's not the case (e.g. just summing lots of randomly scattered 32-bit floats), then you are likely to be bottlenecked by memory bandwidth: most of the time your threads will be stalled, waiting for their bits to arrive.
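As a rough, purely illustrative back-of-the-envelope figure: a GPU with about 500 GB/s of memory bandwidth can stream roughly 125 billion 32-bit floats per second, so if the same GPU delivers around 10 TFLOP/s it needs on the order of 80 arithmetic operations per float fetched before compute, rather than bandwidth, becomes the limit. A scattered sum does about one add per fetch, so it sits firmly on the bandwidth-bound side no matter how well latency is hidden.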
A related thing that can help with SM utilization is data locality. When multiple threads access nearby memory locations, one cache-line fetch brings in the data needed by several threads at once. For example, when texturing a perspectively warped triangle, even though each fragment's texture coordinates can be arbitrary, nearby fragments are still likely to read nearby texels from the texture. Consequently, there is a lot of data shared between the threads, and a single cache-line fetch unblocks many of them.
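The following sketch contrasts the two access patterns in a hypothetical compute shader (again, buffer names, bindings and the permutation buffer are assumptions made for the example, not anything prescribed by OpenGL):

```glsl
#version 430
layout(local_size_x = 64) in;

layout(std430, binding = 0) readonly  buffer InputData { float inData[];  };
layout(std430, binding = 1) readonly  buffer Permute   { uint  perm[];    };
layout(std430, binding = 2) writeonly buffer OutData   { float outData[]; };

void main() {
    uint i = gl_GlobalInvocationID.x;

    // Good locality: neighbouring invocations read neighbouring elements, so
    // the threads of one warp touch only one or two consecutive cache lines
    // and a single fetch unblocks all of them.
    float a = inData[i];

    // Poor locality: each invocation reads through a random permutation, so
    // every thread may land on a different cache line and each one waits on
    // its own memory transaction.
    float b = inData[perm[i]];

    outData[i] = a + b;
}
```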
Ray tracing, on the other hand, has terrible data locality. Secondary rays tend to diverge a lot and hit different surfaces at practically random locations throughout the entire scene. This makes it very hard to utilize the SM architecture well, for either ray-scene intersection or shading purposes.