According to the Kepler whitepage, the warp size for a Kepler based GPU is 32 and each multiprocessor contains 4 warp schedulars which select two independant instructions from a chosen warp. This means that each clock cycle, 32*4*2 = 256 calculations are to be performed, but a multiprocessor only contains 192 ALUs. How are these calculations performed then?
1
There are 1 best solutions below
Related Questions in CUDA
- direct global memory access using cuda
- Threads syncronization in CUDA
- Merge sort using CUDA: efficient implementation for small input arrays
- why cuda kernel function costs cpu?
- How to detect NVIDIA CUDA Architecture
- What is the optimal way to use additional data fields in functors in Thrust?
- cuda-memcheck fails to detect memory leak in an R package
- Understanding Dynamic Parallelism in CUDA
- C/CUDA: Only every fourth element in CudaArray can be indexed
- NVCC Cuda 5.0 on Ubuntu 12.04 /usr/lib/libudt.so file format not recognized
- Reduce by key on device array
- Does CUDA include a real c++ library?
- cuMemcpyDtoH yields CUDA_ERROR_INVALID_VALUE
- Different Kernels sharing SMx
- How many parallel threads i can run on my nvidia graphic card in cuda programming?
Related Questions in KEPLER
- Increasing achieved occupancy doesn't enhance computation speed linearly
- vpython) How to simulate kepler's 2nd law?
- How to implement help in Eclipse 4.3 (Kepler) application?
- On Double Precision Units (DPUs) on Kepler K20Xm
- Load/Store Units (LD/ST) and Special Function Units (SFUs) for the Kepler architecture
- Monitoring NVENC hardware (Active or idle)
- Error when visualizing data in kepler gl in jupyter notebook
- what is the L1 cache throughput in Nvidia's Kepler?
- Python Kepler´s law Plotting
- Coalesced access vs broadcast access to a global memory location on GPU
- Pretty print in Eclipse CDT: unable to look at any variable
- CUDA Kepler: not enough ALUs
- understanding Nvidia Kepler Assembly instructions
- Displaying orbit with vpython using kepler's equation but the planet won't orbit
- Error: External calls are not supported (found non-inlined call to cublasGetVersion_v2)
Related Questions in WARP-SCHEDULER
- What is the instruction issue time latency of the warp schedulers in CUDA?
- blocks, threads, warpSize
- In NVIDIA gpu, Can ld/st and arithmetic instruction(such as int32 fp32 )run simultaneously in same sm?
- CUDA Kepler: not enough ALUs
- Why there are two warp schedulers in a SM of GPU?
- Questions of resident warps of CUDA
- How do CUDA blocks/warps/threads map onto CUDA cores?
- cuda shared memory and block execution scheduling
- cuda: warp divergence overhead vs extra arithmetic
- CUDA Warps and Thread Divergence
- How to a warp cause another warp be in the Idle state?
- Is there a way to explicitly map a thread to a specific warp in CUDA?
Trending Questions
- UIImageView Frame Doesn't Reflect Constraints
- Is it possible to use adb commands to click on a view by finding its ID?
- How to create a new web character symbol recognizable by html/javascript?
- Why isn't my CSS3 animation smooth in Google Chrome (but very smooth on other browsers)?
- Heap Gives Page Fault
- Connect ffmpeg to Visual Studio 2008
- Both Object- and ValueAnimator jumps when Duration is set above API LvL 24
- How to avoid default initialization of objects in std::vector?
- second argument of the command line arguments in a format other than char** argv or char* argv[]
- How to improve efficiency of algorithm which generates next lexicographic permutation?
- Navigating to the another actvity app getting crash in android
- How to read the particular message format in android and store in sqlite database?
- Resetting inventory status after order is cancelled
- Efficiently compute powers of X in SSE/AVX
- Insert into an external database using ajax and php : POST 500 (Internal Server Error)
Popular Questions
- How do I undo the most recent local commits in Git?
- How can I remove a specific item from an array in JavaScript?
- How do I delete a Git branch locally and remotely?
- Find all files containing a specific text (string) on Linux?
- How do I revert a Git repository to a previous commit?
- How do I create an HTML button that acts like a link?
- How do I check out a remote Git branch?
- How do I force "git pull" to overwrite local files?
- How do I list all files of a directory?
- How to check whether a string contains a substring in JavaScript?
- How do I redirect to another webpage?
- How can I iterate over rows in a Pandas DataFrame?
- How do I convert a String to an int in Java?
- Does Python have a string 'contains' substring method?
- How do I check if a string contains a specific word?
The actual whitepaper wording is as follows:
The interpretation is that in any given cycle, at most 4 warps can be scheduled. For each of those 4 warps, (up to) 2 independent instructions per warp can be dispatched. "can be dispatched" is not the same as "will be dispatched".
The 192 ALUs you are referring to are related to single precision floating point arithmetic operations (SP units for the purpose of this discussion). However there are other functional units in the SM(X) such as double precision floating point arithmetic units (DP units), load/store units (LD/ST units), and other units. Refer to the diagram on page 8 of the whitepaper linked above. If a given set of instructions were all using the SP units, then 8 instructions could not be scheduled, at most 6 (32x6=192) could be scheduled. However, if the instruction mix contains independent instructions of different types (e.g. loads, stores, SP ops, etc.) then the limitation of 192 SP units will not necessarily be the determining factor in how many instructions actually get scheduled in any given cycle.
The bottom line is that 8 instructions (2 inst/scheduler x 4 schedulers) per cycle is the maximum possible instruction issue rate per SM(X). Real world codes do not necessarily achieve this. It's entirely possible that in a given cycle no instructions could get issued, due to stall/starvation conditions.