Optimize tensor contraction in C++ compared to Python


I have three arrays, in both C++ and Python (NumPy arrays on the Python side), and I want to perform the following tensor contraction:

import numpy as np
import time

N_t, N = 400, 400
a = np.random.rand(N_t, 2, 2, N)
b = np.random.rand(2, 2, N)
c = np.random.rand(N_t, 2, 2, N)

start_time = time.time()
d = np.einsum('iacm, cdm, jbdm -> ijabm', a, b, c)
print(time.time() - start_time)
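Before reaching for C++, it is worth noting that `np.einsum` does not optimize the contraction order by default. Passing `optimize=True` lets NumPy pick a pairwise contraction path (using BLAS where it can), which on multi-operand contractions like this one can be substantially faster; the exact speedup depends on the NumPy version. A minimal sketch (small sizes for a quick check; 400 reproduces the benchmark):

```python
import numpy as np

N_t, N = 50, 50  # use 400, 400 to reproduce the benchmark
a = np.random.rand(N_t, 2, 2, N)
b = np.random.rand(2, 2, N)
c = np.random.rand(N_t, 2, 2, N)

# Default einsum contracts all three operands in one pass; optimize=True
# lets NumPy choose a pairwise contraction order instead.
d_naive = np.einsum('iacm, cdm, jbdm -> ijabm', a, b, c)
d_opt = np.einsum('iacm, cdm, jbdm -> ijabm', a, b, c, optimize=True)

assert d_naive.shape == (N_t, N_t, 2, 2, N)
assert np.allclose(d_naive, d_opt)  # same result, different evaluation order
```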

For simplicity, the three arrays are filled with random values. The Python version takes around 2 seconds.

Now, in C++, without any optimisation, I wrote the following (with the aid of ChatGPT to avoid typing out some of the boilerplate):

#include <iostream>
#include <vector>
#include <chrono>
#include <random>

using namespace std;

// Function to generate a random 4D vector with given shape
std::vector<std::vector<std::vector<std::vector<double> > > > generateRandom4DVector(int dim1, int dim2, int dim3, int dim4) {
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_real_distribution<double> dis(-1.0, 1.0); // Random numbers between -1 and 1
    
    std::vector<std::vector<std::vector<std::vector<double> > > > vec(dim1,
                                                                   std::vector<std::vector<std::vector<double> > >(dim2,
                                                                                                                 std::vector<std::vector<double> >(dim3,std::vector<double>(dim4))));
    
    // Populate the vector with random values
    for (int i = 0; i < dim1; ++i) {
        for (int j = 0; j < dim2; ++j) {
            for (int k = 0; k < dim3; ++k) {
                for (int l = 0; l < dim4; ++l) {
                    vec[i][j][k][l] = dis(gen); // Generate random number and assign to vector element
                }
            }
        }
    }
    
    return vec;
}

int main() {
    
    int dim1 = 400, dim2 = 2, dim3 = 2, dim4 = 400;
    
    std::vector<std::vector<std::vector<std::vector<double> > > > x = generateRandom4DVector(dim1, dim2, dim3, dim4);
    // Only the y[0] slice is used below, mirroring b's (2, 2, N) shape in the Python version
    std::vector<std::vector<std::vector<std::vector<double> > > > y = generateRandom4DVector(dim1, dim2, dim3, dim4);
    std::vector<std::vector<std::vector<std::vector<double> > > > z = generateRandom4DVector(dim1, dim2, dim3, dim4);
    // w must hold double, not int: the accumulated products are doubles
    std::vector<std::vector<std::vector<std::vector<std::vector<double> > > > > w(
            dim1, std::vector<std::vector<std::vector<std::vector<double> > > >(
                    dim1, std::vector<std::vector<std::vector<double> > >(
                            2, std::vector<std::vector<double> >(
                                    2, std::vector<double>(dim4)
                            )
                    )
            ));
    
    auto start = std::chrono::high_resolution_clock::now();
    
    for (int i = 0; i < dim1; i++) {
        for (int j = 0; j < dim1; j++) {
            for (int a = 0; a < 2; a++) {
                for (int b = 0; b < 2; b++) {
                    for (int m = 0; m < dim4; m++) {
                        for (int c = 0; c < 2; c++) {
                            for (int d = 0; d < 2; d++) {
                                w[i][j][a][b][m] += x[i][a][c][m] * y[0][c][d][m] * z[j][b][d][m];
                            }
                        }
                    }
                }
            }
        }
    }
    
    
    // Stop measuring time
    auto end = std::chrono::high_resolution_clock::now();
    
    // Calculate duration
    std::chrono::duration<double> duration = end - start;
    
    // Output duration in seconds
    std::cout << "Elapsed time: " << duration.count() << " seconds" << std::endl;
    
    
    return 0;
}

It takes around 16 seconds, which is very slow.

A naive solution would be to parallelise these loops, putting the big sum into a function and running it on multiple threads or processes. This exploits the fact that `np.einsum` leaves little room for further multiprocessing, while pure Python loops are very slow. With enough CPUs, my C++ version could then beat NumPy-accelerated Python, since plain loops are much faster in C++.

I am trying to find cleverer approaches to this problem. How can I also make judicious use of C++ libraries to speed up sums of this type?

Answered by sehe:

Holy smokes. Don't trust ChatGPT. The function could be as simple as

#include <boost/multi_array.hpp>
#include <algorithm>
#include <array>
#include <random>

static std::mt19937 gen(std::random_device{}());
using Matrix4 = boost::multi_array<double, 4>;

Matrix4 generateRandom4DVector(std::array<long, 4> shape) {
    std::uniform_real_distribution dis(-1.0, 1.0); // Random numbers between -1 and 1

    Matrix4 vec(shape);
    // A lambda rather than bind(dis, gen): bind copies gen, so repeated
    // calls would regenerate the identical sequence.
    std::generate_n(vec.origin(), vec.num_elements(), [&] { return dis(gen); });
    return vec;
}

However, I see you're going to do math with it. Use a matrix library such as Eigen or Armadillo; I believe Boost uBLAS supports this too. These libraries employ expression templates to heavily optimise subexpressions and recognise cases for special-case algorithms, which is what gives NumPy its performance.