Copula implementation in Python super slow


In the context of a risk-management model that calculates VaR, I am implementing a routine to create multivariate distributions with user-defined marginals.

I'll provide an example of the code, using a Gaussian copula and gamma marginals:

import numpy as np
import scipy.stats

# generate a multivariate normal sample with a given input correlation:
mv = scipy.stats.multivariate_normal.rvs(mean=np.zeros(10), cov=correlation, size=(50000, 50, 10))

# transform the sample to uniform marginals (probability integral transform):
unif = scipy.stats.norm.cdf(mv)

# apply the target marginal (here a gamma with shape a=1):
mv_fin = scipy.stats.gamma.ppf(unif, a=1)

It works well and produces the expected results. The key problem is that it is very slow: these three steps take 90% of the overall model runtime. How can I speed it up? Thanks


2 Answers

Answer 1

Regarding scipy.stats.norm.cdf, the numba-stats package provides a function that computes the same thing. Unlike the SciPy version, the default parameters need to be provided explicitly: after from numba_stats import norm, call norm.cdf(mv, 0, 1). This is 2.2 times faster on my machine (an i5-9600KF). The speed-up mainly comes from the use of multiple threads (the SciPy implementation is sequential so far).
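As a quick sanity check, the following compares the two on a large 1-D array (this assumes the numba-stats API where loc and scale are mandatory positional arguments):

import numpy as np
import scipy.stats
from numba_stats import norm  # pip install numba-stats

x = np.random.default_rng(0).normal(size=1_000_000)

# numba-stats requires loc and scale explicitly (no defaults):
assert np.allclose(scipy.stats.norm.cdf(x), norm.cdf(x, 0.0, 1.0))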

Regarding scipy.stats.gamma.ppf, things are more complex, since numba-stats does not seem to provide an alternative (i.e. the gamma sub-module does not appear to be implemented so far). The SciPy implementation is sequential. The slow part appears to be the function gammaincinv (from scipy.special), which is called for every value of unif (with 1 as the first parameter). This function calls igami, which comes from the Cephes library; the source can be found here. It makes heavy use of expensive mathematical functions like pow, exp and log (as well as loops to evaluate power/asymptotic series). The library (and the SciPy functions built on it) seems to be optimized to provide very good accuracy at the expense of relatively poor performance.
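Indeed, for the unit-scale case the gamma PPF reduces to exactly this special-function call, which you can check directly:

import numpy as np
import scipy.stats
import scipy.special

q = np.linspace(0.001, 0.999, 999)

# gamma.ppf(q, a) reduces to gammaincinv(a, q) for loc=0, scale=1,
# which is where essentially all of the time is spent:
assert np.allclose(scipy.stats.gamma.ppf(q, a=1), scipy.special.gammaincinv(1, q))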

Interestingly, this function is surprisingly 2.6 times faster on Linux than on Windows on the same machine with the exact same version of SciPy (1.10.1). I think this is due to a more efficient implementation of the default math library on Linux than on Windows. You can try an optimized math library, like Intel's, to speed up the computation; note that rebuilding SciPy (and Cephes) on the target machine might be needed to get the performance boost. A simpler approach is certainly to try the Intel Distribution for Python, assuming you use an Intel processor (or possibly an AMD one, since most Intel tools work on AMD processors, though the performance might not be as good as on Intel ones).

Another possible way to speed up the computation is to use multiple threads to evaluate scipy.stats.gamma.ppf. Unfortunately, SciPy does not support this yet. That being said, one can reimplement the SciPy function and call gammaincinv directly, though this is certainly a significant amount of work. A simpler (ugly) alternative is to hook the function scipy.stats.gamma._ppf so that it runs on multiple threads: it is a pure-Python method, so it can simply be re-assigned. This strategy speeds up the computation by 3.75x on my machine with 6 cores, and it can be combined with the better math library.

Here is the final script:

from scipy.stats import multivariate_normal
import numpy as np
from numba_stats import norm
import scipy.stats
import scipy.special

import multiprocessing
from types import MethodType
from joblib import Parallel, delayed

def multithreaded_gamma_ppf(self, q, a):
    # Split the (flattened) q and a arrays into one chunk per core, then
    # evaluate scipy.special.gammaincinv on each chunk in a separate worker.
    cpu_count = multiprocessing.cpu_count()
    size = q.shape[0]
    q_parts = [q[size*i//cpu_count:size*(i+1)//cpu_count] for i in range(cpu_count)]
    a_parts = [a[size*i//cpu_count:size*(i+1)//cpu_count] for i in range(cpu_count)]
    tmp = Parallel(n_jobs=cpu_count)(delayed(scipy.special.gammaincinv)(a_part, q_part)
                                     for a_part, q_part in zip(a_parts, q_parts))
    return np.concatenate(tmp)

# See https://stackoverflow.com/questions/10374527
scipy.stats.gamma._ppf = MethodType(multithreaded_gamma_ppf, scipy.stats.gamma)


if __name__ == '__main__':
    # Fake parameter just for a simple test
    correlation = 0.5

    # Windows:  5.95 s
    # Linux:    5.25 s
    mv = multivariate_normal.rvs(mean = np.zeros(10), cov = correlation, size = (50000, 50, 10))

    # Windows:  7.83 s => 3.53 s
    # Linux:    5.75 s => 2.21 s
    unif = norm.cdf(mv, 0, 1)

    # Windows:  2 min 48 s => 44.9 s
    # Linux:    1 min  6 s => 17.2 s
    mv_fin = scipy.stats.gamma.ppf(unif, a=1)

As you can see, combining multiple threads with a fast math library (or just using Linux for the sake of simplicity) results in a significant speed-up: the total time dropped from 181.78 s on Windows for the initial implementation to 24.66 s on Linux for the optimized one. That is a 7.4x speed-up!

Answer 2

You can reduce the computation time by about 4x by replacing the gamma percent point function with -ln(1 - x).

I stumbled on this while trying to find a fast approximation for the inverse incomplete gamma function: scipy.stats.gamma.ppf(unif, a=1) is very nearly equal to -np.log(1 - unif). This is no coincidence: a gamma distribution with shape a=1 is the Exp(1) exponential distribution, whose quantile function is exactly -ln(1 - q), so the only remaining difference is floating-point error. Here's a plot of the two.

[Plot: scipy.stats.gamma.ppf(x, a=1) and -np.log(1 - x) overlaid, visually identical]

Here's a plot of the difference between the two functions. Note that the Y axis is in units of 10⁻¹⁵.

[Plot: difference between the two functions, on the order of 10⁻¹⁵]

In other words, this is generally accurate to 15 decimal places.
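A minimal check of that identity (the exact error bound will vary with hardware and SciPy version):

import numpy as np
import scipy.stats

q = np.linspace(0.0, 0.999, 100_000)
exact = scipy.stats.gamma.ppf(q, a=1)  # Exp(1) quantile, since Gamma(a=1) == Exp(1)
approx = -np.log(1 - q)
print(np.abs(exact - approx).max())    # on the order of 1e-15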

Using this fast approximation of the gamma PPF, you can edit the original code like so:

mv = scipy.stats.multivariate_normal.rvs(mean = np.zeros(10), cov = correlation, size = (50000, 50, 10))
unif = scipy.stats.norm.cdf(mv)
mv_fin = -np.log(1 - unif)  # exact Exp(1) quantile, replacing gamma.ppf(unif, a=1)

This was roughly 4x faster overall in my testing.
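A minimal way to reproduce that comparison on your own machine (timings will vary with hardware):

import time
import numpy as np
import scipy.stats

unif = np.random.default_rng(0).random(10_000_000)

t0 = time.perf_counter()
ref = scipy.stats.gamma.ppf(unif, a=1)
t1 = time.perf_counter()
fast = -np.log(1 - unif)
t2 = time.perf_counter()

print(f"gamma.ppf:   {t1 - t0:.2f} s")
print(f"-log(1 - x): {t2 - t1:.2f} s")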

The other answer also explains how to run the calculation in parallel. The two ideas can be combined: you can both use a less expensive calculation and run it in parallel, as in the sketch below.
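For instance, here is a sketch of such a combination, assuming numba-stats is installed and correlation is defined as in the question (-np.log1p(-unif) is a numerically stabler way to write -np.log(1 - unif)):

import numpy as np
import scipy.stats
from numba_stats import norm  # pip install numba-stats

correlation = 0.5  # placeholder, as in the other answer's test script

mv = scipy.stats.multivariate_normal.rvs(mean=np.zeros(10), cov=correlation,
                                         size=(50000, 50, 10))

# numba-stats evaluates the normal CDF with multiple threads; it expects a
# 1-D contiguous array, so flatten and restore the shape afterwards:
unif = norm.cdf(mv.ravel(), 0.0, 1.0).reshape(mv.shape)

# exact Exp(1) quantile instead of the expensive gamma.ppf(unif, a=1):
mv_fin = -np.log1p(-unif)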