I need to construct a large std::vector<std::shared_ptr<A>> many_ptr_to_A. Ideally, a non-default constructor with arguments is used for A. Several variants are defined in the code sample below:
#include <iostream>
#include <vector>
#include <memory>
#include <ctime>

class A
{
public:
    A(std::vector<double> data):
        data(data)
    {}
    A():
        data(std::vector<double>(3, 1.))
    {}
    std::vector<double> data;
};

int main()
{
    int n = 20000000;
    std::vector<std::shared_ptr<A>> many_ptr_to_A;

    // option 1
    std::clock_t start = std::clock();
    std::vector<A> many_A(n, std::vector<double>(3, 1.));
    std::cout << double(std::clock() - start) / CLOCKS_PER_SEC << std::endl;
    // end option 1

    many_ptr_to_A.clear();

    // option 2
    start = std::clock();
    many_ptr_to_A.reserve(n);
    for (int i = 0; i < n; i++) {
        many_ptr_to_A.push_back(std::shared_ptr<A>(new A(std::vector<double>(3, 1.))));
    }
    std::cout << double(std::clock() - start) / CLOCKS_PER_SEC << std::endl;
    // end option 2

    many_ptr_to_A.clear();

    // option 3
    start = std::clock();
    A* raw_ptr_to_A = new A[n];
    for (int i = 0; i < n; i++) {
        many_ptr_to_A.push_back(std::shared_ptr<A>(&raw_ptr_to_A[i]));
    }
    std::cout << double(std::clock() - start) / CLOCKS_PER_SEC << std::endl;
    // end option 3

    return 0;
}
Option 1
Rather fast, but unfortunately I need pointers instead of raw objects. A method to create pointers into the resulting allocated space while preventing the vector from deleting the objects would be great, but I can't think of one.
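One way to get what option 1 is reaching for is to keep the objects inside a single vector and hand out shared_ptrs created with the shared_ptr aliasing constructor, so that every element pointer shares ownership of the whole vector instead of trying to delete its element. A sketch (the function name make_shared_view is illustrative, not from the original code):

```cpp
#include <memory>
#include <vector>

struct A {
    explicit A(std::vector<double> d) : data(std::move(d)) {}
    std::vector<double> data;
};

// Allocate all A objects contiguously in one vector, then hand out
// shared_ptrs that alias the vector's control block. The vector (and
// with it every A) is destroyed exactly once, when the last shared_ptr
// gives up ownership -- no element is ever deleted individually.
std::vector<std::shared_ptr<A>> make_shared_view(int n) {
    auto storage = std::make_shared<std::vector<A>>(n, A(std::vector<double>(3, 1.)));
    std::vector<std::shared_ptr<A>> ptrs;
    ptrs.reserve(n);
    for (auto& a : *storage)
        ptrs.emplace_back(storage, &a);  // aliasing constructor
    return ptrs;  // 'storage' stays alive as long as any element pointer does
}
```

This keeps option 1's single bulk allocation while still yielding shared_ptr<A> handles; the trade-off is that no element can be freed before all of them.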
Option 2
This works and I can feed specific data into the constructor for every A. Unfortunately, it is rather slow, and using std::make_shared instead of new does not really improve the situation.
Even worse, this seems to be a big bottleneck when used in multiple threads. If I run option 2 in 10 threads with n_thread = n / 10, instead of being around ten times faster the whole thing is around four times slower. Why does this happen? Is it a problem when multiple threads try to allocate many small pieces of memory?
The number of cores on the server I'm using is larger than the number of threads. The rest of my application scales nicely with the number of cores, so this really is a bottleneck.
Unfortunately, I'm not very experienced when it comes to parallelization...
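For what it's worth, a pattern that at least removes the container itself as a point of contention is to pre-size the result vector and have each thread fill a disjoint range; any remaining slowdown then comes from the allocator serializing the many small allocations. A sketch (fill_parallel is an illustrative name, not from the original code):

```cpp
#include <memory>
#include <thread>
#include <vector>

struct A {
    explicit A(std::vector<double> d) : data(std::move(d)) {}
    std::vector<double> data;
};

// Pre-size the vector and let each thread write a disjoint index range,
// so the threads never synchronize on the container. Allocator contention
// inside make_shared may still limit scaling; this only rules out the
// vector as the source of it.
std::vector<std::shared_ptr<A>> fill_parallel(int n, int n_threads) {
    std::vector<std::shared_ptr<A>> v(n);
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        int begin = t * n / n_threads;
        int end = (t + 1) * n / n_threads;
        workers.emplace_back([&v, begin, end] {
            for (int i = begin; i < end; ++i)
                v[i] = std::make_shared<A>(std::vector<double>(3, 1.));
        });
    }
    for (auto& w : workers)
        w.join();
    return v;
}
```

If this still scales badly, the heap itself is the bottleneck, and a per-thread or arena allocator (e.g. tcmalloc/jemalloc as a drop-in) is the usual remedy.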
Option 3
With this approach I tried to combine the fast bulk allocation of a single raw new[] with shared_ptrs. It compiles, but unfortunately yields a segmentation fault when the destructor of the vector is called. I don't fully understand why this happens. Is it because A is not POD?
In this approach I would manually fill the object-specific data into the objects after their creation.
Question
How can I perform the allocation of a large number of shared_ptrs to A in an efficient way that also scales nicely when used on many threads/cores? Am I missing an obvious way to construct the std::vector<std::shared_ptr<A>> many_ptr_to_A in one go?
My system is a Linux/Debian server. I compile with g++ using the -O3 and -std=c++11 options.
Any help is highly appreciated :)
Option 3 is undefined behaviour: you have n shared_ptrs which will all try to delete a single A, but there must be only one delete[] for the whole array, not delete used n times.
You could instead create a single array, then create n shared_ptr objects which all share ownership of the array and which each point to a different element of it. This is done by creating one shared_ptr that owns the array (and a suitable deleter) and then creating n-1 shared_ptrs that alias the first one, i.e. share the same reference count, even though their get() member will return a different pointer.
A unique_ptr<A[]> is initialized with the array first, so that default_delete<A[]> will be used as the deleter; that deleter is transferred into the first shared_ptr, so that when the last shared_ptr gives up ownership the right delete[] will be used to free the whole array. To get the same effect you could also construct the first shared_ptr directly from the new[] expression, passing std::default_delete<A[]> (or a lambda that calls delete[]) as the deleter.