Is it possible that Fortran code compiled on one Windows machine with Visual Studio 2019, on a 2018 Intel processor, gives a slightly different result when the exe is copied to another machine (one with a 2022 Intel processor)? Could you please list the possible causes of this behavior?
1 Answer
Run-time dispatch to different versions of functions based on which CPU instruction sets are available, and/or auto-parallelization across all cores, are likely candidates.
Some compiler optimizations pretend that FP math is associative. In reality, FP math is not associative; grouping the operations into different temporaries introduces different rounding error.
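For instance, a tiny Fortran program (a sketch, unrelated to the asker's code) makes the non-associativity visible with nothing more than three double-precision constants:

    ! Minimal demo that FP addition is not associative: the two groupings
    ! round differently, so the stored double-precision results differ in
    ! their last bits.
    program assoc_demo
      implicit none
      real(kind=8) :: a, b, c
      a = 0.1d0
      b = 0.2d0
      c = 0.3d0
      print *, '(a+b)+c = ', (a + b) + c   ! roughly 0.60000000000000009
      print *, 'a+(b+c) = ', a + (b + c)   ! roughly 0.59999999999999998
    end program assoc_demo

(The exact digits printed depend on the compiler's list-directed formatting, but the two results really are different bit patterns.)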
Different choices in how to find parallelism (like auto-vectorization) in loops like a dot product or sum of an array can lead to different rounding, which can make the result numerically more accurate, but different from the source order of operations¹. Such loops can only be vectorized by pretending FP math is associative.
Auto-parallelization (e.g. OpenMP) with a different number of cores could break a problem up in a different way, into different sub-totals. If your program uses all cores, that's a likely candidate.
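As a sketch of how that plays out (not the asker's code; assume it is built with /Qopenmp under ifort or -fopenmp under gfortran), an OpenMP reduction gives each thread its own partial sum, so the grouping of additions, and hence the last-bit rounding of the total, can change with the thread count:

    ! Sketch: with an OpenMP reduction, each thread accumulates its own
    ! partial sum and the partials are combined at the end, so the grouping
    ! of the additions depends on how many threads run and how the
    ! iterations are divided among them.
    program omp_reduction_demo
      use omp_lib, only: omp_get_max_threads
      implicit none
      integer, parameter :: n = 1000000
      real(kind=8) :: x(n), total
      integer :: i

      do i = 1, n
         x(i) = 1.0d0 / real(i, kind=8)   ! deterministic test data
      end do

      total = 0.0d0
      !$omp parallel do reduction(+:total)
      do i = 1, n
         total = total + x(i)
      end do
      !$omp end parallel do

      print *, omp_get_max_threads(), ' threads, sum = ', total
    end program omp_reduction_demo

Running the same binary with OMP_NUM_THREADS=1 and then with OMP_NUM_THREADS=8 can print sums that differ in the last few digits.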
Intel compilers can also make code that dispatches to different versions of a function depending on which SIMD instruction sets are available. So you could have an SSE4 version, an AVX2+FMA version, and an AVX-512 version (perhaps even using 512-bit vectors.)
Different SIMD widths lead to a different number of accumulators, if the loop uses the same number of vector registers. So that's a different set of numbers getting added together into each subtotal, e.g. for a dot product.
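To make that concrete, here is a hand-written sketch (the function names are made up) of roughly what a vectorizer or unroller turns a reduction into: several interleaved accumulators instead of one, combined at the end, so the additions are grouped differently than in the plain serial loop:

    ! Sketch: a serial dot product vs. one with 4 accumulators, roughly what
    ! an auto-vectorizer/unroller produces. Both are reasonable, but they
    ! group the additions differently and can therefore round differently.
    function dot_serial(x, y, n) result(s)
      implicit none
      integer, intent(in) :: n
      real(kind=8), intent(in) :: x(n), y(n)
      real(kind=8) :: s
      integer :: i
      s = 0.0d0
      do i = 1, n
         s = s + x(i) * y(i)
      end do
    end function dot_serial

    function dot_unrolled4(x, y, n) result(s)
      implicit none
      integer, intent(in) :: n
      real(kind=8), intent(in) :: x(n), y(n)
      real(kind=8) :: s, s1, s2, s3, s4
      integer :: i
      s1 = 0.0d0; s2 = 0.0d0; s3 = 0.0d0; s4 = 0.0d0
      do i = 1, n - mod(n, 4), 4            ! 4 independent running sums
         s1 = s1 + x(i)   * y(i)
         s2 = s2 + x(i+1) * y(i+1)
         s3 = s3 + x(i+2) * y(i+2)
         s4 = s4 + x(i+3) * y(i+3)
      end do
      s = (s1 + s2) + (s3 + s4)             ! combine the partial sums
      do i = n - mod(n, 4) + 1, n           ! scalar cleanup of the leftovers
         s = s + x(i) * y(i)
      end do
    end function dot_unrolled4

A build that targets 512-bit vectors would effectively use wider (or more) such partial sums than an SSE build of the same source, which is exactly the kind of difference that shows up in the last digits.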
Does only one of the processors have AVX-512? Or is one of them a Pentium or Celeron without AVX? If so, that's also a likely factor.
Runtime dispatch in libraries like TBB or SVML could also be a factor, not just in code generated directly by the compiler.
Intel compiler FP-math options
Intel compilers default to -fp-model=fast. See the docs for the Intel Fortran and Intel C/C++ compilers (both classic, not the LLVM-based oneAPI compilers, although presumably those turn on -ffast-math by default as well). The C++ compiler docs seem more detailed in their descriptions.

(The other mainstream compilers, like LLVM and GCC (gfortran), default to -fno-fast-math. But if you use OpenMP, you can let the compiler treat the sum or product or whatever in a specific reduction loop as associative.)

Specifically, the Intel default is fast=1, and there's an even more aggressive level of optimization, -fp-model=fast=2, that's not on by default.

See also Intel's 2018 article Consistency of Floating-Point Results using the Intel® Compiler or Why doesn't my application always give the same answer?, covering the kinds of value-unsafe optimization Intel's compilers do and how the various FP-model switches affect that. And/or slides from a 2008 talk of the same title.
Quoting some descriptions from that article:
- -fp-model=consistent: not necessarily the same order of operations as the source, but the same on every machine, and run-to-run consistency when auto-parallelizing with OpenMP. Otherwise, with dynamic scheduling of a reduction, the way subtotals are added could depend on which threads finished which work first.
- -fp-model=precise: allows value-safe optimizations only (but still including contraction of a*b+c into an FMA).
- -fp-model=strict: enables access to the FPU environment (e.g. you can change the FP rounding mode and compiler optimizations will respect that possibility), disables floating-point contractions such as fused multiply-add (FMA) instructions, and implies "precise" and "except".
This part probably doesn't lead to optimizations that vary across machines, but it's interesting.
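If consistency between the two machines matters more than the last bit of speed, one experiment (the command lines are only a sketch; myprog.f90 is a placeholder name) is to rebuild with a stricter model and compare the two executables:

    rem Windows spelling for the classic ifort compiler:
    ifort /O2 /fp:precise    myprog.f90
    ifort /O2 /fp:consistent myprog.f90

    rem Linux spelling of the same options:
    rem   ifort -O2 -fp-model=precise    myprog.f90
    rem   ifort -O2 -fp-model=consistent myprog.f90

If the rebuilt executables then agree across both machines, the value-unsafe optimizations of the default fast model were the likely cause.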
Footnote 1: different isn't always worse, numerically
Floating point math always has rounding error (except in rare cases, e.g. adding 5.0 + 3.0, or other cases where the mantissas have enough trailing zeros that no 1 bits have to get rounded away after shifting the mantissas to align them). The number you'd get from doing FP math operations in source order with source-specified precision will have rounding error.

Using multiple accumulators (from vectorizing and unrolling a reduction) is usually better numerically, a step in the direction of pairwise summation; a simple sum += a[i] is the worst way to add an array if the elements are all or mostly positive and of uniform size. Adding a small number to a large number loses a lot of precision, so having 16 or 32 different buckets to sum into means the totals aren't quite as big until you add the buckets together. See also Simd matmul program gives different numerical results.

You can jokingly call this "wrong" because it makes it harder to verify / test that a program does exactly what you think it does, and because it's not what the source says to do.
But unless you're doing stuff like Kahan summation (error-compensation) or double-double (extended precision) or other numerical techniques that depend on precise FP rounding semantics, fast-math optimizations merely give you answers that are wrong in a different way than the source, and mathematically maybe less wrong.
(Unless it also introduces some approximations like rsqrtps + a Newton iteration instead of x / sqrt(y). That may only be on with fast=2, which isn't the default, but I'm not sure. Some optimizations also might not care about turning -0.0 into +0.0.)
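For example, the Kahan summation mentioned above only works because the compensation term is not algebraically simplified away; under a fast-math model the compiler is allowed to treat (t - total) - y as exactly zero and delete it. A minimal sketch (not from the question's code):

    ! Sketch of Kahan (compensated) summation. Under a value-unsafe model
    ! like -fp-model=fast the compiler may treat FP math as associative,
    ! simplify the compensation term to zero, and silently turn this back
    ! into a plain running sum.
    function kahan_sum(x, n) result(total)
      implicit none
      integer, intent(in) :: n
      real(kind=8), intent(in) :: x(n)
      real(kind=8) :: total, comp, y, t
      integer :: i
      total = 0.0d0
      comp  = 0.0d0                 ! running compensation for lost low-order bits
      do i = 1, n
         y = x(i) - comp
         t = total + y
         comp = (t - total) - y     ! algebraically zero; numerically, the lost bits
         total = t
      end do
    end function kahan_sum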