I have a legacy Windows DLL (written in C++) for which I need to maintain a 32-bit version alongside the 64-bit version. I'm updating the heavy math code with SIMD using Agner Fog's vector class library, and I'm seeing little or no speed improvement in the 32-bit version when compiling with AVX compared to SSE4.2. I'm aware that 32-bit code only ever has 8 vector registers available, but after much searching I'm still not clear exactly what this means when compiling with AVX, AVX2 or AVX512. Are there compiler options (Microsoft or Clang) that will give me worthwhile speed improvements over SSE4.2 for simple loops of floating-point operations, or should I save myself some trouble and just compile the 32-bit version with SSE4.2?
Are there any real benefits to compiling a 32-bit version of my DLL with AVX or higher?
Asked by dts (294 views)
I'm answering this question myself even though the question should arguably just be deleted ... maybe it will help someone, sometime.
By the time I had tuned my SIMD code (aligning the memory made a big difference) and experimented with the MSVC compiler options, my 32-bit build was behaving exactly as expected when comparing no SIMD against SSE4.2, AVX and AVX512. Benchmarking the sample code below showed speed improvement ratios of 48%, 22% and 10% for SSE4.2, AVX and AVX512, respectively, for the 32-bit build.
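Since alignment was the change that mattered most, here is a minimal sketch of what "aligning the memory" can look like. It is not the benchmarked code: it uses raw SSE intrinsics rather than the vector class library so that it stands alone, it assumes C++17 (for aligned `operator new[]`), and the helper names `alloc_aligned_floats`, `free_aligned_floats`, `is_aligned` and `saxpy_aligned` are made up for illustration. With the vector class library the equivalent idea is to use the aligned `load_a`/`store_a` members instead of `load`/`store` once the buffers are known to be aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Use SSE intrinsics when the target has them; otherwise fall back to scalar code.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAS_SSE 1
#else
#define SKETCH_HAS_SSE 0
#endif

// Assumption: 64 bytes satisfies aligned loads for SSE (16), AVX (32) and AVX-512 (64).
constexpr std::size_t kAlign = 64;

// Hypothetical helper: allocate n floats on a 64-byte boundary (C++17 aligned new).
inline float* alloc_aligned_floats(std::size_t n) {
    return static_cast<float*>(
        ::operator new[](n * sizeof(float), std::align_val_t(kAlign)));
}

// Hypothetical helper: release memory obtained from alloc_aligned_floats.
inline void free_aligned_floats(float* p) {
    ::operator delete[](p, std::align_val_t(kAlign));
}

// True if p is a multiple of a bytes.
inline bool is_aligned(const void* p, std::size_t a) {
    return reinterpret_cast<std::uintptr_t>(p) % a == 0;
}

// y[i] = a * x[i] + y[i]. Requires n to be a multiple of 4 and both
// pointers to be 16-byte aligned, because aligned loads/stores are used.
inline void saxpy_aligned(float a, const float* x, float* y, std::size_t n) {
    assert(n % 4 == 0 && is_aligned(x, 16) && is_aligned(y, 16));
#if SKETCH_HAS_SSE
    const __m128 va = _mm_set1_ps(a);
    for (std::size_t i = 0; i < n; i += 4) {
        const __m128 vx = _mm_load_ps(x + i);   // aligned load
        const __m128 vy = _mm_load_ps(y + i);   // aligned load
        _mm_store_ps(y + i, _mm_add_ps(_mm_mul_ps(va, vx), vy));  // aligned store
    }
#else
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
#endif
}
```

Typical use would be to allocate both buffers with `alloc_aligned_floats`, fill them, call `saxpy_aligned`, and release them with `free_aligned_floats`. On MSVC, `_aligned_malloc`/`_aligned_free` are an alternative if C++17 is not available.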
Oddly, the 64-bit build runs much faster than the 32-bit build with no SIMD, but slightly SLOWER than the 32-bit build for all three SIMD options (a good subject for a new question).
I compiled the code without the /Qpar switch and with /Qvec-report:2 /Qpar-report:2 to verify, to the extent possible, that no auto-vectorization or auto-parallelization was going on.
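For anyone who wants a stronger guarantee for the "no SIMD" baseline than reading the vectorizer report, a complementary option is to switch auto-vectorization off on the loop itself. The following is only a sketch under that assumption, not what I actually benchmarked; the function name `scale_add_scalar` is made up. It uses `#pragma loop(no_vector)` for MSVC and `#pragma clang loop vectorize(disable)` for Clang (including clang-cl, which is why `__clang__` is checked first).

```cpp
#include <cstddef>

// Hypothetical scalar baseline: out[i] = a * in[i] + b.
// The pragma asks the compiler not to auto-vectorize this one loop, so the
// timing reflects scalar code even when optimizations are enabled.
inline void scale_add_scalar(const float* in, float* out, std::size_t n,
                             float a, float b) {
#if defined(__clang__)
#pragma clang loop vectorize(disable) interleave(disable)
#elif defined(_MSC_VER)
#pragma loop(no_vector)
#endif
    // Other compilers: no pragma is applied in this sketch.
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a * in[i] + b;
    }
}
```

Checking the generated assembly (or the /Qvec-report output on MSVC) is still the way to confirm the loop really stayed scalar.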