I think every C++ programmer has at some point heard the statement "virtual functions are slow". So I decided to benchmark virtual functions against regular member functions.
Unfortunately, I don't have much experience benchmarking "low-level" C++ code, and I don't know how to properly avoid compiler or hardware (mostly branch predictor) optimization effects.
The problem I'm facing right now is that when I run my benchmarks (provided below), I almost always get the result that virtual and regular function calls take the same time, and sometimes virtual functions are even faster!
This is how I wrote my benchmark using google/benchmark library:
#include <benchmark/benchmark.h>
#include "util.h"

static void BM_MemberCall(benchmark::State& state) {
    Base* b = new Derived();
    for ([[maybe_unused]] auto _ : state) {
        benchmark::DoNotOptimize(b->Foo());
        benchmark::ClobberMemory();
    }
}
BENCHMARK(BM_MemberCall);

static void BM_VirtualCall(benchmark::State& state) {
    Base* b = new Derived();
    for ([[maybe_unused]] auto _ : state) {
        benchmark::DoNotOptimize(b->VirtualFoo());
        benchmark::ClobberMemory();
    }
}
BENCHMARK(BM_VirtualCall);
My class definitions are as follows, declared in util.h:
#pragma once
// generate 10 virtual functions
#define VFOO(NAME, ID, ...) \
__attribute__((noinline)) virtual int NAME ## ID ## _0() __VA_ARGS__; \
__attribute__((noinline)) virtual int NAME ## ID ## _1() __VA_ARGS__; \
__attribute__((noinline)) virtual int NAME ## ID ## _2() __VA_ARGS__; \
__attribute__((noinline)) virtual int NAME ## ID ## _3() __VA_ARGS__; \
__attribute__((noinline)) virtual int NAME ## ID ## _4() __VA_ARGS__; \
__attribute__((noinline)) virtual int NAME ## ID ## _5() __VA_ARGS__; \
__attribute__((noinline)) virtual int NAME ## ID ## _6() __VA_ARGS__; \
__attribute__((noinline)) virtual int NAME ## ID ## _7() __VA_ARGS__; \
__attribute__((noinline)) virtual int NAME ## ID ## _8() __VA_ARGS__; \
__attribute__((noinline)) virtual int NAME ## ID ## _9() __VA_ARGS__;
// generate 100 virtual functions
#define VFOO100(NAME, ...) \
VFOO(NAME, 0, __VA_ARGS__) \
VFOO(NAME, 1, __VA_ARGS__) \
VFOO(NAME, 2, __VA_ARGS__) \
VFOO(NAME, 3, __VA_ARGS__) \
VFOO(NAME, 4, __VA_ARGS__) \
VFOO(NAME, 5, __VA_ARGS__) \
VFOO(NAME, 6, __VA_ARGS__) \
VFOO(NAME, 7, __VA_ARGS__) \
VFOO(NAME, 8, __VA_ARGS__) \
VFOO(NAME, 9, __VA_ARGS__)
struct Base {
    int Foo();

    VFOO100(A);
    __attribute__((noinline)) virtual int VirtualFoo();
    VFOO100(Z);
};

struct Derived : public Base {
    VFOO100(A, override);
    __attribute__((noinline)) int VirtualFoo() override;
    VFOO100(Z, override);
};
With the help of VFOO100 I generate 200 virtual functions in order to make the virtual table big, and, just in case finding the right method involves some lexicographical sorting, I generate 100 methods lexicographically lower than VirtualFoo and 100 higher.
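For reference, VFOO100(A) in Base expands to 100 declarations of this form (the Derived versions additionally get override appended):

__attribute__((noinline)) virtual int A0_0();
__attribute__((noinline)) virtual int A0_1();
// ...
__attribute__((noinline)) virtual int A9_9();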
All method definitions are identical (both regular and virtual, defined in a separate TU named util.cpp):
{
    asm("");  // empty (implicitly volatile) asm statement so the compiler can't treat the body as completely empty
    return 0;
}
Sample benchmark results:
(same)
---------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------
BM_MemberCall 1.23 ns 1.23 ns 524344569
BM_VirtualCall 1.23 ns 1.23 ns 564752961
(virtual functions are worse)
---------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------
BM_MemberCall 1.22 ns 1.22 ns 522068585
BM_VirtualCall 1.48 ns 1.33 ns 557440234
(virtual functions are faster!)
---------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------
BM_MemberCall 1.33 ns 1.33 ns 475669505
BM_VirtualCall 1.25 ns 1.25 ns 555128195
My questions:
- If virtual functions are indeed slower than regular ones, how can I measure that and see it for myself? What mistakes did I make in my approach?
- Why do these tests sometimes show virtual functions being slower than regular ones, and sometimes even faster?
My setup:
2024-02-12T17:41:29+03:00
Running ./src/apps/bench/function_calls/calls
Run on (12 X 2600 MHz CPU s)
CPU Caches:
L1 Data 32 KiB
L1 Instruction 32 KiB
L2 Unified 256 KiB (x6)
L3 Unified 12288 KiB
Load Average: 3.04, 4.39, 4.52


Virtual functions in C++ can cause slowness in a few ways. In this they are similar to function pointers or std::function.
They add a memory access compared to a plain function pointer. A function pointer is one step away from the code; a virtual call first has to fetch the vtable pointer, then follow the function pointer stored there. This extra memory access can be a cache miss and adds to L1 cache pressure.
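To make the extra hop concrete, here is a rough hand-rolled emulation (a sketch only: Object, VTable, and derived_foo are invented names, and real compilers/ABIs differ in the details):

#include <cstdio>

struct Object;

struct VTable {
    int (*foo)(Object*);        // a single slot standing in for a real vtable
};

struct Object {
    const VTable* vptr;         // the hidden pointer every polymorphic object carries
    int data;
};

int derived_foo(Object* self) { return self->data; }
const VTable derived_vtable{&derived_foo};

int call_through_pointer(Object* o, int (*fp)(Object*)) {
    return fp(o);               // one indirection: follow the function pointer
}

int call_virtually(Object* o) {
    return o->vptr->foo(o);     // two dependent loads (vptr, then the slot) before the indirect call
}

int main() {
    Object obj{&derived_vtable, 42};
    std::printf("%d %d\n", call_through_pointer(&obj, &derived_foo), call_virtually(&obj));
}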
They discourage value types and local storage. Because vtable-based polymorphism doesn't support polymorphic value storage out of the box, you end up working with pointers to objects that are not stored adjacent to each other in cache-friendly ways.
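For example (a sketch with invented Shape/Circle names, purely to contrast the two storage patterns):

#include <memory>
#include <vector>

struct Shape { virtual ~Shape() = default; virtual double area() const = 0; };
struct Circle : Shape { double r = 1.0; double area() const override { return 3.14159 * r * r; } };

// vtable-friendly layout: owning pointers, one heap allocation per object,
// objects scattered around memory and reached through an extra indirection.
std::vector<std::unique_ptr<Shape>> scattered;

// value layout (only possible per concrete type): contiguous and cache-friendly.
std::vector<Circle> contiguous;

int main() {
    scattered.push_back(std::make_unique<Circle>());
    contiguous.push_back(Circle{});
}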
They encourage dynamic allocation.
malloc is hundreds of instructions and free is thread-unfriendly. Most uses of vtables use the heap, so each object's existence carries a significant time debt: you often end up either with overly big objects or with lots of wasted time on small ones.
They discourage polymorphism. Each distinct type of object has O(vtable size) global overhead, and an object with one tweaked method is its own distinct type. Making millions of combinatorially different types of objects is impractical given how C++ implements virtual dispatch, so past a certain point you end up with switch statements in your polymorphic code.
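That switch-heavy shape might look roughly like this (hypothetical names, purely to illustrate):

#include <cstdio>

// A tag plus a switch replaces the vtable: objects live by value, and millions of
// slightly different configurations don't each need their own type.
enum class Kind { Circle, Square };

struct Prim {
    Kind kind;     // explicit tag instead of a hidden vtable pointer
    double size;
};

double area(const Prim& p) {
    switch (p.kind) {                 // dispatch on the tag, no indirect call
        case Kind::Circle: return 3.14159 * p.size * p.size;
        case Kind::Square: return p.size * p.size;
    }
    return 0.0;
}

int main() {
    Prim prims[] = {{Kind::Circle, 1.0}, {Kind::Square, 2.0}};  // stored by value, contiguously
    double total = 0.0;
    for (const auto& p : prims) total += area(p);
    std::printf("%f\n", total);
}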
Each of these represents a reason I have avoided virtual-method-based dispatch, for performance reasons, in a shipping commercial application.
Now, in many cases the performance cost is relative. For example, don't use virtual dispatch on a per-pixel, per-frame basis for HD video running at 30 Hz.
Instead, lift the virtual dispatch to the per-scanline level. All of a sudden that huge overhead is 1000x less important.
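A minimal sketch of what that lift might look like, assuming a hypothetical Filter interface (all names here are invented for illustration, not taken from the question):

#include <cstddef>
#include <cstdint>

struct Filter {
    virtual ~Filter() = default;
    // Per-pixel dispatch: ~1920 indirect calls per HD scanline.
    virtual std::uint8_t applyPixel(std::uint8_t in) const = 0;
    // Per-scanline dispatch: one indirect call per row.
    virtual void applyRow(const std::uint8_t* in, std::uint8_t* out,
                          std::size_t width) const = 0;
};

struct Invert : Filter {
    std::uint8_t applyPixel(std::uint8_t in) const override { return 255 - in; }
    void applyRow(const std::uint8_t* in, std::uint8_t* out, std::size_t width) const override {
        for (std::size_t x = 0; x < width; ++x)
            out[x] = 255 - in[x];   // tight, non-virtual, inlinable inner loop
    }
};

int main() {
    std::uint8_t in[4] = {0, 64, 128, 255}, out[4];
    Invert f;
    f.applyRow(in, out, 4);         // one virtual call covers the whole row
}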
But some of these are tricky, in that the costs are not always fully localized (memory and cache pressure, for example). So thinking about it at a higher level is sometimes needed.