I recently discovered what I think is AMD's equivalent of CUDA's __byte_perm intrinsic: amdgcn_ds_swizzle. My problem is this: CUDA's __byte_perm takes two unsigned 32-bit integers and permutes their bytes according to a selector argument (supplied as a hex value). AMD's swizzle function, however, takes only a single unsigned 32-bit integer and one int named "pattern". How do I use AMD's swizzle intrinsic?
AMD HCC Swizzle Intrinsic
Asked by ligosan
ds_swizzle and __byte_perm are a little different. The former permutes a whole register across lanes; the latter permutes any four bytes from two 32-bit registers.
AMD's ds_swizzle_b32 GCN instruction actually swaps values with other lanes. You specify the 32-bit register to read from the other lane and the 32-bit register to place the value in. A hard-coded value in the instruction specifies how the lanes are to be swapped. A great explanation of ds_swizzle_b32 is here, as user3528438 pointed out.
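To make the cross-lane behaviour concrete, here is a minimal CPU-side sketch in plain C++, not device code. The names `swizzle_source_lane` and `swizzle_emulated` are made up for illustration. The pattern encoding is an assumption based on my reading of the GCN3 ISA's bit-mask mode: and_mask in bits 4:0, or_mask in bits 9:5, xor_mask in bits 14:10, applied within each group of 32 lanes. The quad-permute mode (bit 15 set) and inactive lanes are not modelled.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: the lane that `lane` reads from under a bit-mask
// swizzle pattern. ASSUMED encoding (bit-mask mode only, offset bit 15 == 0):
//   and_mask = offset[4:0], or_mask = offset[9:5], xor_mask = offset[14:10]
// The masks act on the lane index within its own group of 32 lanes.
constexpr unsigned swizzle_source_lane(unsigned lane, unsigned offset) {
    const unsigned and_mask = offset & 31u;
    const unsigned or_mask  = (offset >> 5) & 31u;
    const unsigned xor_mask = (offset >> 10) & 31u;
    const unsigned group_base = lane & ~31u;  // first lane of this 32-lane group
    const unsigned local = (((lane & 31u) & and_mask) | or_mask) ^ xor_mask;
    return group_base | local;
}

// Hypothetical emulation of one swizzle over a whole wavefront: every lane
// receives the full 32-bit register value held by its source lane.
template <std::size_t N>
std::array<std::uint32_t, N> swizzle_emulated(
        const std::array<std::uint32_t, N>& regs, unsigned offset) {
    static_assert(N % 32 == 0, "lane count must be a multiple of 32");
    std::array<std::uint32_t, N> out{};
    for (std::size_t lane = 0; lane < N; ++lane) {
        out[lane] = regs[swizzle_source_lane(static_cast<unsigned>(lane), offset)];
    }
    return out;
}
```

Under that assumed encoding, a pattern of 0x041F (and_mask 0x1F, or_mask 0, xor_mask 1) makes neighbouring lanes trade values, and a pattern of 0 broadcasts the first lane of each 32-lane group. Either way, whole registers move between lanes and no bytes are rearranged inside a register.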
__byte_perm does not exchange data with other lanes. It just gathers any four bytes from two 32-bit registers in its own lane and stores them in a register. There is no cross-lane traffic.
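For comparison, here is a rough host-side sketch of what __byte_perm computes, following the behaviour described in the CUDA documentation: the low three bits of each selector nibble pick one of the eight input bytes, with x supplying bytes 0-3 and y supplying bytes 4-7. The name `byte_perm_emulated` is made up, and this is plain C++ for illustration rather than the intrinsic itself.

```cpp
#include <cstdint>

// Hypothetical CPU emulation of the documented __byte_perm behaviour.
// The eight candidate bytes are x (indices 0..3) followed by y (indices 4..7).
// Nibble i of the selector s chooses the byte placed in result byte i;
// only the low 3 bits of each nibble are used here.
constexpr std::uint32_t byte_perm_emulated(std::uint32_t x, std::uint32_t y,
                                           std::uint32_t s) {
    const std::uint64_t pool = (static_cast<std::uint64_t>(y) << 32) | x;
    std::uint32_t result = 0;
    for (int i = 0; i < 4; ++i) {
        const unsigned index = (s >> (4 * i)) & 0x7u;
        const std::uint32_t byte =
            static_cast<std::uint32_t>((pool >> (8 * index)) & 0xFFu);
        result |= byte << (8 * i);
    }
    return result;
}
```

With x = 0x33221100 and y = 0x77665544, a selector of 0x3210 returns x unchanged, 0x0123 reverses the bytes of x to give 0x00112233, and 0x5410 combines the low halves of both inputs into 0x55441100. Everything happens inside one thread's registers.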
I'm guessing the next question is how to do a "byte permute" on AMD GCN hardware. The instruction for that is v_perm_b32 (see page 12-152 here). It selects any four bytes from two specified 32-bit registers.
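If you are porting a __byte_perm call, the selector needs re-encoding. The sketch below is a CPU emulation under assumptions drawn from my reading of the GCN3 ISA description: v_perm_b32 uses one selector byte per output byte, codes 0-3 pick bytes of the second source, and codes 4-7 pick bytes of the first source. Codes 8 and above (sign and constant fill) are not modelled. The names `v_perm_b32_emulated` and `expand_selector` are hypothetical.

```cpp
#include <cstdint>

// Hypothetical CPU emulation of v_perm_b32 for selector codes 0..7 only.
// ASSUMPTION: the byte pool is {s0, s1} with s1 in the low 32 bits, so
// codes 0..3 read bytes of s1 and codes 4..7 read bytes of s0.
constexpr std::uint32_t v_perm_b32_emulated(std::uint32_t s0, std::uint32_t s1,
                                            std::uint32_t sel) {
    const std::uint64_t pool = (static_cast<std::uint64_t>(s0) << 32) | s1;
    std::uint32_t result = 0;
    for (int i = 0; i < 4; ++i) {
        const unsigned code = (sel >> (8 * i)) & 0xFFu;
        std::uint32_t byte = 0;  // codes >= 8 are not modelled; left as 0 here
        if (code <= 7u) {
            byte = static_cast<std::uint32_t>((pool >> (8 * code)) & 0xFFu);
        }
        result |= byte << (8 * i);
    }
    return result;
}

// Hypothetical helper: widen a __byte_perm-style selector (one nibble per
// output byte) into the one-byte-per-output-byte form assumed above.
constexpr std::uint32_t expand_selector(std::uint32_t s) {
    std::uint32_t sel = 0;
    for (int i = 0; i < 4; ++i) {
        sel |= ((s >> (4 * i)) & 0x7u) << (8 * i);
    }
    return sel;
}
```

Under these assumptions, __byte_perm(x, y, s) corresponds to v_perm_b32 with the sources swapped (y first, x second) and the selector expanded, so a selector of 0x5410 becomes 0x05040100. Check the source order against the ISA document for your target before relying on it.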