Huge differences in implementations?


I am writing some functionality for distributions and used the normal distribution to run tests comparing my implementation against C++ Boost.

Given the probability density function (pdf: http://www.mathworks.com/help/stats/normpdf.html), which I wrote like this:

double NormalDistribution1D::prob(double x) {
    // -0.5, not (-1 / 2): the latter is integer division and evaluates to 0,
    // which would zero out the exponent entirely.
    double z = (x - mu) / sigma;
    return (1 / (sigma * std::sqrt(2 * boost::math::constants::pi<double>())))
           * std::exp(-0.5 * z * z);
}

To check my results, this is how it is done with C++ Boost:

    boost::math::normal_distribution <> d(mu, sigma);
    return boost::math::pdf(d, x);

I was not pleasantly surprised: my version took 44278 nanoseconds, Boost only 326.

So I played around a bit, wrote the method probboost in my NormalDistribution1D class, and compared all three:

void MATTest::runNormalDistribution1DTest1() {
    double mu = 0;
    double sigma = 1;
    double x = 0;

    std::chrono::high_resolution_clock::time_point tn_start = std::chrono::high_resolution_clock::now();
    NormalDistribution1D *n = new NormalDistribution1D(mu, sigma);
    double nres = n->prob(x);
    std::chrono::high_resolution_clock::time_point tn_end = std::chrono::high_resolution_clock::now();

    std::chrono::high_resolution_clock::time_point tdn_start = std::chrono::high_resolution_clock::now();
    NormalDistribution1D *n1 = new NormalDistribution1D(mu, sigma);
    double nres1 = n1->probboost(x);
    std::chrono::high_resolution_clock::time_point tdn_end = std::chrono::high_resolution_clock::now();

    std::chrono::high_resolution_clock::time_point td_start = std::chrono::high_resolution_clock::now();
    boost::math::normal_distribution <> d(mu, sigma);
    double dres = boost::math::pdf(d, x);
    std::chrono::high_resolution_clock::time_point td_end = std::chrono::high_resolution_clock::now();

    std::cout << "Mu: " << mu << "; Sigma: " << sigma << "; x: " << x << std::endl;
    if (nres == dres) {
        std::cout << "Result" << nres << std::endl;
    } else {
        std::cout << "\033[1;31mRes incorrect: " << nres << "; Correct: " << dres << "\033[0m" << std::endl;
    }


    auto duration_n = std::chrono::duration_cast<std::chrono::nanoseconds>(tn_end - tn_start).count();
    auto duration_d = std::chrono::duration_cast<std::chrono::nanoseconds>(td_end - td_start).count();
    auto duration_dn = std::chrono::duration_cast<std::chrono::nanoseconds>(tdn_end - tdn_start).count();

    std::cout << "own boost: " << duration_dn << std::endl;
    if (duration_n < duration_d) {
        std::cout << "Boost: " << (duration_d) << "; own implementation: " << duration_n << std::endl;
    } else {
        std::cout << "\033[1;31mBoost faster: " << (duration_d) << "; than own implementation: " << duration_n << "\033[0m" << std::endl;
    }
}

The results (I compiled and ran the checking method three times):

own boost: 1082 Boost faster: 326; than own implementation: 44278

own boost: 774 Boost faster: 216; than own implementation: 34291

own boost: 769 Boost faster: 230; than own implementation: 33456

Now this puzzles me: how can the method from the class take three times longer than the same statements called directly?

My compiling options:

g++ -O2   -c -g -std=c++11 -MMD -MP -MF "build/Debug/GNU-Linux-x86/main.o.d" -o build/Debug/GNU-Linux-x86/main.o main.cpp

g++ -O2    -o ***Classes***

There are 2 best solutions below


First of all, you are allocating your object dynamically, with new:

NormalDistribution1D *n = new NormalDistribution1D(mu, sigma);
double nres = n->prob(x);

If you did it just like you did with Boost, that alone would be sufficient to get the same (or comparable) speed:

NormalDistribution1D n(mu, sigma);
double nres = n.prob(x);

Now, I don't know whether the way you spelled out your expression in NormalDistribution1D::prob() is significant, but I am skeptical that writing it in a more "optimized" way would make any difference, because arithmetic expressions like that are exactly the kind of thing compilers optimize very well. It might get faster if you use the -ffast-math switch, which gives the compiler more optimization freedom.

Also, if the definition of double NormalDistribution1D::prob(double x) is in another compilation unit (another .cpp file), the compiler cannot inline it, and that adds noticeable overhead (perhaps a 2x slowdown or less). In Boost, almost everything is implemented inside headers, so inlining happens whenever the compiler sees fit. You can overcome this problem if you compile and link with GCC's -flto switch.
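As an illustration of the header-only layout described here (the file name and class body are a sketch, not the asker's actual code): a member function defined inside the class body is implicitly inline, so every translation unit that includes the header can inline the call without needing -flto.

```cpp
// NormalDistribution1D.h (illustrative) -- defining the member function in the
// class body makes it implicitly inline, so callers in other .cpp files can
// have the call inlined without link-time optimization.
#include <cmath>

class NormalDistribution1D {
    double mu, sigma;
public:
    NormalDistribution1D(double m, double s) : mu(m), sigma(s) {}
    double prob(double x) const {   // defined in the header: inlinable everywhere
        const double pi = 3.14159265358979323846;
        const double z = (x - mu) / sigma;
        return std::exp(-0.5 * z * z) / (sigma * std::sqrt(2.0 * pi));
    }
};
```

The alternative, keeping the definition in its own .cpp file, works too, but then cross-file inlining only happens when you compile and link with -flto.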


You didn't compile with the -ffast-math option. That means the compiler cannot (in fact, must not!) simplify your exponent expression into a form similar to that used in boost::math::pdf:

expo = (x - mu) / sigma
expo *= -(x - mu) / sigma
expo /= 2
result = std::exp(expo)
result /= sigma * std::sqrt(2 * boost::math::constants::pi<double>())

The above forces the compiler to do the fast (but potentially unsafe / inaccurate) calculation without resorting to -ffast-math.
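Written out as C++, the step-by-step form above might look like this (normal_pdf is a hypothetical free function; mu and sigma are passed as parameters):

```cpp
#include <cmath>

// Step-by-step evaluation mirroring the pseudocode above: build the exponent
// -((x - mu)/sigma)^2 / 2 explicitly, then apply the normalizing constant.
double normal_pdf(double x, double mu, double sigma) {
    const double pi = 3.14159265358979323846;
    double expo = (x - mu) / sigma;     // z
    expo *= -(x - mu) / sigma;          // -z*z
    expo /= 2.0;                        // -z*z / 2
    double result = std::exp(expo);
    result /= sigma * std::sqrt(2.0 * pi);
    return result;
}
```
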

Secondly, the time difference between the above and your code is minimal compared to the cost of allocating from the heap (new) versus the stack (a local variable). You are mostly timing the cost of allocating dynamic memory.