I have been trying to measure the performance of a std::variant
-based (pseudo-) static dispatch scheme using std::visit
.
The idea is that instead of a
struct A {
virtual unsigned f() const noexcept = 0;
virtual ~A() noexcept {}
};
struct B0 : A {
virtual unsigned f() const noexcept { return 0; }
};
// more types like B0 ...
std::unique_ptr<A> a = std::make_unique<B0>();
I have
struct C0 {
unsigned f() const noexcpet { return 0; }
};
std::variant<C0,C1,C2,/*...*/> c = C0();
and I want to measure how much faster constructing a series of such objects is and how fast the dispatch is. Note that the first example (A/Bs...) requires dynamic memory on top of dynamic dispatch while the second example (Cs...) has automatic storage.
To this end I generalised B0 and C0 into type templates:
template <unsigned X>
struct B : A {
virtual unsigned f() const noexcept override { return X; }
};
template <unsigned X>
struct C {
unsigned f() const noexcept { return X; }
};
and then wrote a (maybe slightly over-engineered) test harness to fill a std::vector
and read it back. The full code is attached below. I am running this with -O1
and -O3
as C++17.
Effectively it pseudo-randomly fills pre-grown vectors bs
with B<...>
and cs
with C<...>
respectively, and then either calls bs[i]->f()
or std::visit([](auto const& c) { return c.f(); },cs[i])
(for more details see the attached benchmark code).
What I would have expected is that a test-instance of variant<C<0>>
blows its dynamic counter part unique_ptr<A>
out of the water by orders of magnitude (it does), but as I grow the variant, say to variant<C<0>,...,C<127>>
the efficiency of visit
starts to go down significantly to the point where the dynamic dispatch is faster (it doesn't as expected).
With -O3
(the -O1
results are fairly similar) I see the following results, which vary slightly across runs, but seem relatively stable (the times mostly stay within 10% deviation).
[0,0] time ctor virtual: 35.0315 ns
[0,0] time ctor variant: 2.9425 ns
[0,0] time call virtual: 14.0037 ns (L1)
[0,0] time call variant: 1.44748 ns (L2)
[0,1] time ctor virtual: 34.8007 ns
[0,1] time ctor variant: 2.95368 ns
[0,1] time call virtual: 19.6874 ns
[0,1] time call variant: 7.04521 ns
[0,7] time ctor virtual: 39.6325 ns
[0,7] time ctor variant: 2.97607 ns
[0,7] time call virtual: 30.7592 ns
[0,7] time call variant: 9.22505 ns (L4.1)
[0,31] time ctor virtual: 35.0002 ns
[0,31] time ctor variant: 2.95473 ns
[0,31] time call virtual: 24.3198 ns
[0,31] time call variant: 9.72678 ns (L4.2)
[0,127] time ctor virtual: 36.5918 ns
[0,127] time ctor variant: 2.95542 ns
[0,127] time call virtual: 26.701 ns (L3)
[0,127] time call variant: 9.88592 ns (L4.3)
Discussion
The small time for (L1) is explainable, I think, by caching and/or branch prediction. (L2) is absolutely as expected: If the variant is trivial, dispatch is extremely fast. All times for construction also make sense: ctor variant
does not at any point malloc
anything, so it is clear why it is that much faster than ctor virtual
and that the time is roughly constant regardless of the number of dynamic types.
call virtual
stays roughly the same as the number of dynamic types goes up (L3), which should be expected. However, why is call variant
not going up (more) between (L4.1) and (L4.3).
Note: Given the limitations of template programming in my test harness, I can not increase the range much more without exploding g++ during compilation / exhausting my memory.
Anyway, given that the test function f
is as simple as possible should imply that the measurement records the incurred overhead as accurately as possible.
Validation
The questions are,
- how can I validate these results such that they are representative and
- make sure no relevant parts have been optimised out by the compiler?
- Do other benchmarks arrive at the same conclusion, namely that
std::variant
dispatch is always faster by about factor 2-3?
Full benchmark
// g++ -Wall -Wextra -pedantic -std=c++17 -O3 a.cpp
#include <random>
#include <memory>
#include <variant>
#include <chrono>
#include <iostream>
using chronores = std::nano;
static constexpr char const resstr[] = "ns";
namespace helper {
template <template <unsigned> typename T, unsigned X, unsigned UB, typename... Args>
struct mkvar {
using type = typename mkvar<T,X+1,UB,Args...,T<X>>::type;
};
template <template <unsigned> typename T, unsigned UB, typename... Args>
struct mkvar<T,UB,UB,Args...> {
using type = std::variant<Args...,T<UB>>;
};
template <template <unsigned> typename T, unsigned LB, unsigned UB>
using mkvar_t = typename mkvar<T,LB,UB>::type;
template <unsigned X>
struct Num {
static constexpr unsigned value = X;
using inc = Num<X+1>;
};
template <typename NumX, typename NumUB, template <unsigned> typename T, bool use_variant>
struct ctor_Num {
static constexpr auto X = NumX::value;
static constexpr auto UB = NumUB::value;
template <typename Container>
static void run(unsigned x, Container& container) {
if (x == X) {
if constexpr (use_variant) {
container.emplace_back(T<X>());
} else {
container.emplace_back(std::make_unique<T<X>>());
}
} else {
ctor_Num<typename NumX::inc,NumUB,T,use_variant>::run(x,container);
}
}
};
template <typename NumX, template <unsigned> typename T, bool use_variant>
struct ctor_Num<typename NumX::inc,NumX,T,use_variant> {
template <typename Container>
static void run(unsigned, Container&) { }
};
template <unsigned X, unsigned UB, template <unsigned> typename T, bool use_variant, typename Container>
inline void ctor(unsigned x, Container& container) {
return ctor_Num<Num<X>,Num<UB>,T,use_variant>::run(x,container);
}
struct Time {
double& time;
std::chrono::time_point<std::chrono::steady_clock> start;
Time(double& time) : time(time) {
start = std::chrono::steady_clock::now();
}
~Time() {
auto const finish = std::chrono::steady_clock::now();
time += std::chrono::duration<double,chronores>(finish-start).count();
}
};
}
template <unsigned LB, unsigned UB>
struct measure {
struct A {
virtual unsigned f() const noexcept = 0;
virtual ~A() noexcept {}
};
template <unsigned X>
struct B : A {
virtual unsigned f() const noexcept override { return X; }
};
template <unsigned X>
struct C {
unsigned f() const noexcept { return X; }
};
static void main(std::size_t const N, std::size_t const R = 10, bool warmup = false) {
if (!warmup) main(N,1,true);
using namespace helper;
std::vector<std::unique_ptr<A>> bs;
bs.reserve(N);
std::vector<mkvar_t<C,LB,UB>> cs;
cs.reserve(N);
std::uniform_int_distribution<unsigned> distr(LB,UB);
double time_ctor_virtual = 0;
double time_ctor_variant = 0;
double time_call_virtual = 0;
double time_call_variant = 0;
unsigned volatile sum = 0;
std::mt19937 mt(42); mt.discard(100);
for (std::size_t r = 0; r < R; ++r) {
bs.clear();
cs.clear();
{
Time scope(time_ctor_virtual);
for (std::size_t i = 0; i < N; ++i) {
bs.emplace_back(std::make_unique<B<UB>>());
}
}
{
Time scope(time_ctor_variant);
for (std::size_t i = 0; i < N; ++i) {
cs.emplace_back(C<UB>());
}
}
bs.clear();
cs.clear();
for (std::size_t i = 0; i < N; ++i) {
auto const rn = distr(mt);
// effectively calls bs.emplace_back(std::make_unique<B<rn>>())
ctor<LB,UB,B,false>(rn,bs);
// effectively calls cs.emplace_back(C<rn>())
ctor<LB,UB,C,true >(rn,cs);
}
{
Time scope(time_call_variant);
for (std::size_t i = 0; i < N; ++i) {
sum += std::visit([](auto const& c) { return c.f(); },cs[i]);
}
}
{
Time scope(time_call_virtual);
for (std::size_t i = 0; i < N; ++i) {
sum += bs[i]->f();
}
}
}
(void)sum;
if (!warmup) {
std::cout << "[" << LB << "," << UB << "] time ctor virtual: " << (time_ctor_virtual/N/R) << " " << resstr << "\n";
std::cout << "[" << LB << "," << UB << "] time ctor variant: " << (time_ctor_variant/N/R) << " " << resstr << "\n";
std::cout << "[" << LB << "," << UB << "] time call virtual: " << (time_call_virtual/N/R) << " " << resstr << "\n";
std::cout << "[" << LB << "," << UB << "] time call variant: " << (time_call_variant/N/R) << " " << resstr << "\n";
}
}
};
int main() {
static constexpr std::size_t N = 400000;
measure<0,0>::main(N);
std::cout << "\n";
measure<0,1>::main(N);
std::cout << "\n";
measure<0,7>::main(N);
std::cout << "\n";
measure<0,31>::main(N);
std::cout << "\n";
measure<0,127>::main(N);
std::cout << "\n";
}