I have 3 arrays (NumPy arrays in Python, nested vectors in C++), and I wish to perform the following tensor contraction:
import numpy as np
import time
N_t, N = 400, 400
a = np.random.rand(N_t, 2, 2, N)
b = np.random.rand(2, 2, N)
c = np.random.rand(N_t, 2, 2, N)
start_time = time.time()
d = np.einsum('iacm, cdm, jbdm -> ijabm', a, b, c)
print(time.time() - start_time)
For simplicity, the three arrays are generated randomly. Python takes around 2 seconds.
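In index notation, the einsum above computes (my restatement of the contraction, not part of the original code):

$$d_{ijabm} = \sum_{c,\,d} a_{iacm}\, b_{cdm}\, c_{jbdm}$$

i.e. for each fixed m, the indices c and d are summed out while i, j, a, b remain free.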
Now, in C++, without any optimisation, I can write the following (with the aid of ChatGPT, to save the labour of typing out some of the common boilerplate):
#include <iostream>
#include <vector>
#include <chrono>
#include <random>

using namespace std;

// Function to generate a random 4D vector with given shape
std::vector<std::vector<std::vector<std::vector<double> > > > generateRandom4DVector(int dim1, int dim2, int dim3, int dim4) {
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_real_distribution<double> dis(-1.0, 1.0); // Random numbers between -1 and 1

    std::vector<std::vector<std::vector<std::vector<double> > > > vec(dim1,
        std::vector<std::vector<std::vector<double> > >(dim2,
            std::vector<std::vector<double> >(dim3, std::vector<double>(dim4))));

    // Populate the vector with random values
    for (int i = 0; i < dim1; ++i) {
        for (int j = 0; j < dim2; ++j) {
            for (int k = 0; k < dim3; ++k) {
                for (int l = 0; l < dim4; ++l) {
                    vec[i][j][k][l] = dis(gen); // Generate random number and assign to vector element
                }
            }
        }
    }
    return vec;
}
int main() {
    int dim1 = 400, dim2 = 2, dim3 = 2, dim4 = 400;

    std::vector<std::vector<std::vector<std::vector<double> > > > x = generateRandom4DVector(dim1, dim2, dim3, dim4);
    std::vector<std::vector<std::vector<std::vector<double> > > > y = generateRandom4DVector(dim1, dim2, dim3, dim4);
    std::vector<std::vector<std::vector<std::vector<double> > > > z = generateRandom4DVector(dim1, dim2, dim3, dim4);

    // Result tensor; its elements must be double (not int), otherwise the
    // accumulated products are truncated to integers.
    std::vector<std::vector<std::vector<std::vector<std::vector<double> > > > > w(
        dim1, std::vector<std::vector<std::vector<std::vector<double> > > >(
            dim1, std::vector<std::vector<std::vector<double> > >(
                2, std::vector<std::vector<double> >(
                    2, std::vector<double>(dim4)
                )
            )
        ));

    auto start = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < dim1; i++) {
        for (int j = 0; j < dim1; j++) {
            for (int a = 0; a < 2; a++) {
                for (int b = 0; b < 2; b++) {
                    for (int m = 0; m < dim4; m++) {
                        for (int c = 0; c < 2; c++) {
                            for (int d = 0; d < 2; d++) {
                                // y plays the role of b in the Python code: only its first slice is used
                                w[i][j][a][b][m] += x[i][a][c][m] * y[0][c][d][m] * z[j][b][d][m];
                            }
                        }
                    }
                }
            }
        }
    }

    // Stop measuring time
    auto end = std::chrono::high_resolution_clock::now();

    // Calculate duration
    std::chrono::duration<double> duration = end - start;

    // Output duration in seconds
    std::cout << "Elapsed time: " << duration.count() << " seconds" << std::endl;

    return 0;
}
It takes around 16 seconds, which is very slow.
A very naive solution would be to parallelise those for loops by putting the big sum into a function and spreading it over multiple processes (as sketched below), taking advantage of the fact that there is no room left for multiprocessing inside np.einsum and that pure Python for loops are very slow. If I have many CPUs, I can always exploit this to make my C++ faster than NumPy-accelerated Python, since for loops are much faster in pure C++.
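For example, a minimal sketch of that idea with OpenMP (my choice of tool here; it assumes the x, y, z, w and dim variables declared in the code above and needs to be compiled with -fopenmp, e.g. g++ -O3 -fopenmp main.cpp):

// Each (i, j) pair writes to a disjoint part of w, so the two outer loops
// can be collapsed and split across threads without any synchronisation.
#pragma omp parallel for collapse(2)
for (int i = 0; i < dim1; i++) {
    for (int j = 0; j < dim1; j++) {
        for (int a = 0; a < 2; a++) {
            for (int b = 0; b < 2; b++) {
                for (int m = 0; m < dim4; m++) {
                    double sum = 0.0;
                    for (int c = 0; c < 2; c++) {
                        for (int d = 0; d < 2; d++) {
                            sum += x[i][a][c][m] * y[0][c][d][m] * z[j][b][d][m];
                        }
                    }
                    w[i][j][a][b][m] += sum;
                }
            }
        }
    }
}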
I am trying to find cleverer approaches to this problem. How can I also make judicious use of C++ libraries to speed up sums of this type?
Holy smokes. Don't trust ChatGPT. The function could be as simple as
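(A sketch of what I mean, using a single flat buffer instead of nested vectors; the row-major layout is one choice among several:)

#include <random>
#include <vector>

// Generate dim1*dim2*dim3*dim4 random doubles in one contiguous buffer.
// Element (i, j, k, l) lives at index ((i*dim2 + j)*dim3 + k)*dim4 + l.
std::vector<double> generateRandom4D(int dim1, int dim2, int dim3, int dim4) {
    std::mt19937 gen{std::random_device{}()};
    std::uniform_real_distribution<double> dis(-1.0, 1.0);
    std::vector<double> vec(static_cast<std::size_t>(dim1) * dim2 * dim3 * dim4);
    for (double& v : vec) v = dis(gen);
    return vec;
}

One contiguous allocation instead of roughly dim1*dim2*dim3 separate ones also keeps the data cache-friendly, which matters for the contraction loop itself.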
However, I see you're going to do math with it. Use a matrix library, like Eigen, Armadillo, etc.; I think Boost uBLAS also supports it. They employ expression templates to heavily optimise subexpressions, or recognise cases for special-case algorithms, which is what will give you the NumPy-level performance.
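For this particular contraction there is also a structural shortcut: for each fixed m, summing over c and d is just a chain of small matrix products. If you view a[:, :, :, m] as a (2*N_t) x 2 matrix A_m (rows indexed by the pair (i, a), columns by c), b[:, :, m] as the 2 x 2 matrix B_m, and c[:, :, :, m] as a (2*N_t) x 2 matrix C_m (rows indexed by (j, b), columns by d), then the result slice d[:, :, :, :, m] is A_m * B_m * C_m^T. A sketch of that with Eigen (my own illustration; it uses Eigen's Random() for the data rather than your exact inputs):

#include <Eigen/Dense>
#include <chrono>
#include <iostream>
#include <vector>

int main() {
    const int Nt = 400, N = 400;

    // One small matrix per value of m:
    //   A[m] is (2*Nt) x 2, row index = i*2 + a, column index = c
    //   B[m] is 2 x 2, indexed by (c, d)
    //   C[m] is (2*Nt) x 2, row index = j*2 + b, column index = d
    std::vector<Eigen::MatrixXd> A(N), B(N), C(N), D(N);
    for (int m = 0; m < N; ++m) {
        A[m] = Eigen::MatrixXd::Random(2 * Nt, 2);
        B[m] = Eigen::MatrixXd::Random(2, 2);
        C[m] = Eigen::MatrixXd::Random(2 * Nt, 2);
    }

    auto start = std::chrono::high_resolution_clock::now();
    for (int m = 0; m < N; ++m) {
        // D[m](i*2 + a, j*2 + b) = sum_{c,d} A[m](i*2+a, c) * B[m](c, d) * C[m](j*2+b, d),
        // i.e. exactly the einsum slice d[:, :, :, :, m].
        D[m] = A[m] * B[m] * C[m].transpose();
    }
    auto end = std::chrono::high_resolution_clock::now();

    std::cout << "Elapsed time: "
              << std::chrono::duration<double>(end - start).count()
              << " seconds" << std::endl;
    return 0;
}

The loop over m is embarrassingly parallel on top of that, so it combines well with the OpenMP idea from the question.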