For a game that I'm working on - need the ability to run million of "exp" calls on vectors. Basically
void vector_exp(const double *x, const double *result, int n)
{
for (int i=0 ; i<n ; i++) result[i] = exp(x[i]) ;
}
For my specific case, inputs are all -50..+50. Need double precision, with 8 decimals match current ‘exp’ - to pass test cases.
I have the same challenge also with 'log'. Input range is 1e-7 to 1e7.
Would like to utilize the AVX 512 instructions - which should be able to do (in theory) 8 double precision at a time. I've retrieved the glibc code (both the "C" version, and the ".S" version built for AVX) - but I'm not sure how to move forward from here.
https://github.com/bminor/glibc/tree/master/sysdeps/x86_64/fpu
I'm sure the other answers are better than mine - running a very quick and dirty polynomial approximation, I end up with these.
Out-of-order execution across independent
exp2()orlog2()operations can handle the FMA dependency chains from using Horner's rule for the 6th-order polynomials.See also Agner Fog's VCL implementations which aim for high precision, close to full precision of
double:double-precision
exp_dtemplate, supporting a few bases including2.0, andexp(x-1)vs.exp(x). (See theexp2caller for the right template params).Uses a 13th-order Taylor series, which comments in the code indicate is faster than the alternate version, using a Pade expansion: a ratio of two polynomials. One division per many FMAs isn't a disaster for throughput, especially if you have lots surrounding code that also does some non-division work with each result, but doing just this might be too much division per FMA.
double-precision
log_dtemplate. This does use a ratio of 5th-order polynomials for the mantissa. Template params supportlog(x)vs.log(x+1)to avoid losing precision. It only does natural log (basee), so you'll need to scale the result by1/ln(2).