Benchmarking C++ virtual functions


I think that every C++ programmer has at some point heard the statement "virtual functions are slow". So I decided to benchmark virtual functions against regular member functions.

Unfortunately, I don't have much experience in benchmarking "low-level" C++ code, and I don't know how to properly avoid compiler or hardware (mostly branch-predictor) optimization effects.

The problem I'm facing right now is that when I run my benchmarks (provided below), I almost always get the result that calling virtual and regular functions takes the same time, and sometimes virtual functions are even faster!

This is how I wrote my benchmark using the google/benchmark library:

#include <benchmark/benchmark.h>

#include "util.h"

static void BM_MemberCall(benchmark::State& state) {
    Base* b = new Derived();

    for ([[maybe_unused]] auto _ : state) {
        benchmark::DoNotOptimize(b->Foo());
        benchmark::ClobberMemory();
    }
}
BENCHMARK(BM_MemberCall);

static void BM_VirtualCall(benchmark::State& state) {
    Base* b = new Derived();

    for ([[maybe_unused]] auto _ : state) {
        benchmark::DoNotOptimize(b->VirtualFoo());
        benchmark::ClobberMemory();
    }
}
BENCHMARK(BM_VirtualCall);

My class definitions are as follows, declared in util.h:

#pragma once

// generate 10 virtual functions
#define VFOO(NAME, ID, ...) \
    __attribute__((noinline)) virtual int NAME ## ID ## _0() __VA_ARGS__; \
    __attribute__((noinline)) virtual int NAME ## ID ## _1() __VA_ARGS__; \
    __attribute__((noinline)) virtual int NAME ## ID ## _2() __VA_ARGS__; \
    __attribute__((noinline)) virtual int NAME ## ID ## _3() __VA_ARGS__; \
    __attribute__((noinline)) virtual int NAME ## ID ## _4() __VA_ARGS__; \
    __attribute__((noinline)) virtual int NAME ## ID ## _5() __VA_ARGS__; \
    __attribute__((noinline)) virtual int NAME ## ID ## _6() __VA_ARGS__; \
    __attribute__((noinline)) virtual int NAME ## ID ## _7() __VA_ARGS__; \
    __attribute__((noinline)) virtual int NAME ## ID ## _8() __VA_ARGS__; \
    __attribute__((noinline)) virtual int NAME ## ID ## _9() __VA_ARGS__;

// generate 100 virtual functions
#define VFOO100(NAME, ...) \
    VFOO(NAME, 0, __VA_ARGS__) \
    VFOO(NAME, 1, __VA_ARGS__) \
    VFOO(NAME, 2, __VA_ARGS__) \
    VFOO(NAME, 3, __VA_ARGS__) \
    VFOO(NAME, 4, __VA_ARGS__) \
    VFOO(NAME, 5, __VA_ARGS__) \
    VFOO(NAME, 6, __VA_ARGS__) \
    VFOO(NAME, 7, __VA_ARGS__) \
    VFOO(NAME, 8, __VA_ARGS__) \
    VFOO(NAME, 9, __VA_ARGS__)

struct Base {
    int Foo();
    VFOO100(A);
    __attribute__((noinline)) virtual int VirtualFoo();
    VFOO100(Z);
};

struct Derived : public Base {
    VFOO100(A, override);
    __attribute__((noinline)) int VirtualFoo() override;
    VFOO100(Z, override);
};

With the help of VFOO100 I generate 200 virtual functions in order to make the virtual table big, and, just in case some lexicographical ordering is involved in finding the right method, I generate 100 methods that sort before VirtualFoo and 100 that sort after it.

All method definitions are the same (both regular and virtual, defined in a separate TU named util.cpp):

{
    asm("");
    return 0;
}
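
For reference, the definitions in util.cpp all follow this pattern (only Foo and VirtualFoo shown here; the generated A*/Z* methods are analogous):

// util.cpp
#include "util.h"

int Base::Foo() {
    asm("");  // empty asm statement so the compiler cannot treat the body as trivially empty
    return 0;
}

int Base::VirtualFoo() {
    asm("");
    return 0;
}

int Derived::VirtualFoo() {
    asm("");
    return 0;
}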

Sample benchmark results:

(same)
---------------------------------------------------------
Benchmark               Time             CPU   Iterations
---------------------------------------------------------
BM_MemberCall        1.23 ns         1.23 ns    524344569
BM_VirtualCall       1.23 ns         1.23 ns    564752961

(virtual functions are worse)
---------------------------------------------------------
Benchmark               Time             CPU   Iterations
---------------------------------------------------------
BM_MemberCall        1.22 ns         1.22 ns    522068585
BM_VirtualCall       1.48 ns         1.33 ns    557440234

(virtual functions are faster!)
---------------------------------------------------------
Benchmark               Time             CPU   Iterations
---------------------------------------------------------
BM_MemberCall        1.33 ns         1.33 ns    475669505
BM_VirtualCall       1.25 ns         1.25 ns    555128195

My questions:

  1. If virtual functions are indeed slower than regular ones, how can I measure it and see for myself? What mistakes did I make in my approach?
  2. Why do I sometimes see virtual functions being slower than regular ones in these tests?

My setup:

2024-02-12T17:41:29+03:00
Running ./src/apps/bench/function_calls/calls
Run on (12 X 2600 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB
  L1 Instruction 32 KiB
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB
Load Average: 3.04, 4.39, 4.52

There are 2 best solutions below

Answer by Yakk - Adam Nevraumont:

Virtual functions in C++ have caused slowness in a few ways.

  1. They can block inlining. Vtables are more complex than direct function calls. With a direct function call it is trivial to know what code you are invoking; sometimes you need LTO to cross translation units. With a vtable, innocuous changes can block devirtualization. This is difficult to profile, because your simple benchmark is, by definition, simple. This is similar to function pointers or std::function.

  2. They add a memory access over a function pointer. A function pointer is one step away from the code; a virtual call first has to fetch the vtable, then follow the pointer stored there (see the sketch below). This extra memory access can be a cache miss and adds to L1 cache pressure.

  3. They discourage value types and local storage. Because vtable-based polymorphism doesn't support polymorphic storage out of the box, you end up working with pointers to object data that is not stored adjacently in cache-friendly ways.

  4. They encourage dynamic allocation. malloc costs hundreds of instructions and free is thread-unfriendly, yet most uses of vtables go through the heap. This means each object's existence carries a significant time debt, so you often end up with overly big objects, or with lots of wasted time on small ones.

  5. They discourage polymorphism. Each distinct type of object has O(vtable size) global overhead, and an object with one tweaked method is its own type. Making millions of combinatorially different types of objects is impractical given how C++ implements virtual dispatch, so past a certain point you end up with switch statements in your polymorphic code.

Each of these represents a reason I have avoided virtual-method-based dispatch, for performance reasons, in a shipping commercial application.
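
To picture point 2, here is a rough hand-rolled sketch of the two levels of indirection (made-up names, not what any particular compiler literally emits):

struct VTable { int (*foo)(void* self); };

struct Object {
    const VTable* vptr;  // load #1: read the vtable pointer stored in the object
    int data;
};

int virtual_like_call(Object* o) {
    return o->vptr->foo(o);          // load #2: read the function pointer from the vtable, then call
}

int function_pointer_call(int (*fn)(void*), Object* o) {
    return fn(o);                    // single indirection: the function pointer is already in hand
}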

Now, in many cases, the performance cost is relative. For example, don't use virtual dispatch on a per-pixel, per-frame basis for HD video running at 30 Hz.

Instead, lift the virtual dispatch to the per-scanline level. Suddenly that huge overhead matters 1000x less.
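
For example, something along these lines (a hypothetical interface, purely to illustrate the lifting):

struct PixelOp {
    virtual ~PixelOp() = default;
    virtual void apply(unsigned char* pixel) = 0;              // per-pixel: one virtual call per pixel
    virtual void applyRow(unsigned char* row, int width) = 0;  // per-scanline: one virtual call per row
};

struct Invert : PixelOp {
    void apply(unsigned char* pixel) override { *pixel = 255 - *pixel; }
    void applyRow(unsigned char* row, int width) override {
        for (int x = 0; x < width; ++x)  // inner loop is non-virtual and can be inlined/vectorized
            row[x] = 255 - row[x];
    }
};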

But some of these are tricky, in that the costs are not always fully localized (memory and cache pressure, for example). So thinking about it at a higher level is sometimes needed.

Answer by Marek R:

Writing a correct performance test for tight code is very hard and full of subtle traps, both from the compiler and from the processor. There are numerous C++ talks by Chandler Carruth showing the scale of the problem.

Here is my attempt at writing a better benchmark demo, but I give no warranty that it is free of problems.

#include <memory>
#include <tuple>
#include <array>
#include <benchmark/benchmark.h>

class Base {
public:
  virtual ~Base() {}

  virtual int foo(int) = 0;
};

class A : public Base {
public:
  int foo(int x) override {
    return bar(x);
  }

  int bar(int x) {
    return x + 1;
  }
};

class B : public Base {
public:
  int foo(int x) override {
    return bar(x);
  }

  int bar(int x) {
    return x - 1;
  }
};

class C : public Base {
public:
  int foo(int x) override {
    return bar(x);
  }

  int bar(int x) {
    return x / 2;
  }
};

class D : public Base {
public:
  int foo(int x) override {
    return bar(x);
  }

  int bar(int x) {
    return 3*x + 1;
  }
};

auto makePointerTuple()
{
  return std::tuple{std::make_unique<A>(), std::make_unique<B>(), std::make_unique<C>(), std::make_unique<D>()};
}

auto makeValuesTuple()
{
  return std::tuple{A(), B(), C(), D()};
}

auto makeArray()
{
  return std::array<std::unique_ptr<Base>, 4>{std::make_unique<A>(), std::make_unique<B>(), std::make_unique<C>(), std::make_unique<D>()};
}

static void TestVirtualCall(benchmark::State& state) {
  auto arr = makeArray();
  int accumulator = 0;
  for (auto _ : state) {
    benchmark::DoNotOptimize(arr);
    std::apply([&accumulator](auto&... args) {
        ((accumulator = args->foo(accumulator)), ...);
      }, arr);
    benchmark::DoNotOptimize(accumulator);
  }
}
BENCHMARK(TestVirtualCall);

static void TestRegularCall(benchmark::State& state) {
  auto t = makePointerTuple();
  int accumulator = 0;
  for (auto _ : state) {
    benchmark::DoNotOptimize(t);
    std::apply([&accumulator](auto&... args) {
        ((accumulator = args->bar(accumulator)), ...);
      }, t);
    benchmark::DoNotOptimize(accumulator);
  }
}
BENCHMARK(TestRegularCall);


static void TestVirtualFuncWithKnownType(benchmark::State& state) {
  auto t = makePointerTuple();
  int accumulator = 0;
  for (auto _ : state) {
    benchmark::DoNotOptimize(t);
    std::apply([&accumulator](auto&... args) {
        ((accumulator = args->foo(accumulator)), ...);
      }, t);
    benchmark::DoNotOptimize(accumulator);
  }
}
BENCHMARK(TestVirtualFuncWithKnownType);

static void TestVirtualFuncWithKnownValuesType(benchmark::State& state) {
  auto t = makeValuesTuple();
  int accumulator = 0;
  for (auto _ : state) {
    benchmark::DoNotOptimize(t);
    std::apply([&accumulator](auto&... args) {
        ((accumulator = args.foo(accumulator)), ...);
      }, t);
    benchmark::DoNotOptimize(accumulator);
  }
}
BENCHMARK(TestVirtualFuncWithKnownValuesType);

In fact, I'm surprised that TestVirtualFuncWithKnownValuesType produced the worst outcome on clang with libstdc++.

Note that gcc was able to optimize away the virtual calls when it could determine the type of the object (this is what I was warning about in a comment under the question).
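
A minimal illustration of that devirtualization, reusing the classes above (my own sketch; check your compiler's output to confirm what it actually does):

int knownType() {
    A a;               // dynamic type is provably A here
    Base& b = a;
    return b.foo(41);  // likely devirtualized into a direct (possibly inlined) call to A::foo
}

int unknownType(Base& b) {
    return b.foo(41);  // genuine indirect call, unless LTO/PGO can prove the dynamic type
}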


My questions:

  1. If virtual functions are indeed slower than regular ones, how can I measure it and see for myself? What mistakes did I make in my approach?

I suspect the problem in your code is this line:

benchmark::DoNotOptimize(b->VirtualFoo());

The compiler can move the fetch of the virtual function's address outside the testing loop: b doesn't change, so the address of the virtual function doesn't change either, and fetching it can be done once before the loop. Adding this line can change a lot:

benchmark::DoNotOptimize(b);
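
In other words, the measuring loop becomes something like this:

for ([[maybe_unused]] auto _ : state) {
    benchmark::DoNotOptimize(b);                // compiler must now assume b may have changed,
    benchmark::DoNotOptimize(b->VirtualFoo());  // so the vtable lookup stays inside the loop
    benchmark::ClobberMemory();
}
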
  2. Why do I sometimes see virtual functions being slower than regular ones in these tests?

How many times did you run your test? Are you sure you didn't hit the limits of the timer's resolution, so that what you see is just statistical fluctuation?
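
If in doubt, run the benchmark with repetitions and let google/benchmark report the aggregates, for example:

./src/apps/bench/function_calls/calls --benchmark_repetitions=20 --benchmark_report_aggregates_only=true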