I need a version of the following fast exp2 function working in double precision, can you help me ? Please don't answer saying that it is an approximation so a double version is pointless and casting the result to (double) is enough..thanks. The function which I found somewhere and which works for me is the following and it is much faster than exp2f(), but unfortunately I could not find any double precision version:
inline float fastexp2(float p)
{
if(p<-126.f) p= -126.f;
int w=(int)p;
float z=p-(float)w;
if(p<0.f) z+= 1.f;
union {uint32_t i; float f;} v={(uint32_t)((1<<23)*(p+121.2740575f+27.7280233f/(4.84252568f -z)-1.49012907f * z)) };
return v.f;
}
My assumption is that the existing code from the question assumes IEEE-754 binary floating-point computation, in particular mapping C's
floattype to IEEE-754'sbinary32format.The existing code suggests that only floating-point results in the normal range are of interest: subnormal results are avoided by clamping the input from below and overflow is ignored. So for
floatcomputation valid inputs are in the interval [-126, 128). By exhaustive test, I found that the maximum relative error of the function in the question is 7.16e-5, and that the root-mean-square (RMS) error is 1.36e-5.My assumption is that the desired change to
doublecomputation should widen the range of allowed inputs to [-1022, 1024), and that identical relative accuracy should be maintained. The code is written in a fairly cryptic fashion. So as a first step, I rearranged it into a more readable version. In a second step, I tweaked the coefficients of the core approximation so as to minimize the maximum relative error. This results in the following ISO-C99 code:Refactoring and retuning of the coefficients resulted in slight improvements in accuracy, with maximum relative error of 5.04e-5 and RMS error of 1.03e-5. It should be noted that floating-point arithmetic is generally not associative, therefore any re-association of floating-point operations, either by manual code changes or compiler transformations could affect the stated accuracy and requires careful re-testing.
I do not expect any need to modify the code as it compiles into efficient machine code for common architectures, as can be seen from trial compilations with Compiler Explorer, e.g. gcc with x86-64 or gcc with ARM64.
At this point it is obvious what needs to be changed for switching to
doublecomputation. Change all instances offloattodouble, all instances ofint32_ttoint64_t, change type suffixes for literal constants and math functions, and change the floating-point format specific parameters for IEEE-754binary32to those for IEEE-754binary64. The core approximation needs re-tuning to make best possible use of double precision coefficients in the core approximation.Both maximum relative error and root-mean-square error decreases very slightly to 4.93e-5 and 9.91e-6, respectively. This is as expected, because for an approximation that is roughly accurate to 15 bits it matters little whether intermediate computation is performed with 24 bits of precision or 53 bits of precision. The computation uses a division, and this tends to be slower for
doublethan forfloaton all platforms I am familiar with, so the double-precision port doesn't seem to provide any significant advantages other than perhaps eliminating conversion overhead if the calling code uses double-precision computation.