Cache a lot of callback, then call them all batch-ly without v-table cost

Question

Cache a lot of callback, then call them all batch-ly without v-table cost

349 Views Asked by cppBeginner At 11 August 2025 at 07:01

C1, C2,... are callback classes.
They derived from a common interface CBase with the callback CBase::f().
All of them override CBase::f() with final modifier.

I have to register ~50 instance of any class that derived from C1, and ~50 instance of any class that derived from C2.
(see @@ in the below code for example)

Main objective: When I call allF(), C1::f() / C2::f() of every registered instances have to be called.

Here is a simplified version, it works (Full demo) :-

#include <iostream>
#include <vector>
class CBase{
    public: virtual void f(){std::cout<<"CBase"<<std::endl;}
};
class C1 : public CBase{
    public: virtual void f() final{std::cout<<"C1"<<std::endl;}
};
class C2 : public CBase{
    public: virtual void f() final{std::cout<<"C2"<<std::endl;}
};

This is the callback registering :-

//-------- begin registering -----
std::vector<CBase*> cBase;   
void regis(CBase* c){
    cBase.push_back(c);
}
void allF(){ //must be super fast
    for(auto ele:cBase){
        ele->f();    //#
    }
}
int main() {
    C1 a;
    C1 b;
    C2 c;   //@@
    //or ... class C2Extend : public C2{};   C2Extend c;  
    regis(&a);
    regis(&b);
    regis(&c);
    allF();  //print C1 C1 C2
}

Problem

According to the profile result, if I can avoid the v-table cost at #, I would get significant performance gain.

How to do it elegantly?

My poor solution

A possible workaround is : create many arrays to store each CX (Full demo):-

//-------- begin registering -----
std::vector<C1*> c1s;
std::vector<C2*> c2s;

void regis(C1* c){
    c1s.push_back(c);
}
void regis(C2* c){
    c2s.push_back(c);
}
void allF(){ //must be super fast
    for(auto ele:c1s){
        ele->f();    //#
    }
    for(auto ele:c2s){
        ele->f();    //#
    }
}
int main() {
    C1 a;
    C1 b;
    C2 c;
    regis(&a);
    regis(&b);
    regis(&c);
    allF();  //print C1 C1 C2
}

It is very faster. However, it is not scale well.
After a few development cycle, C3,C4, etc were born.
I have to create std::vector<C3*>,std::vector<C4*>, ... manually
My approach lead to maintainability hell.

More information (edited)

In the worst case, there are at most 20 classes. (C1 to C20)

In real case, C1,C2,... are special type of data-structures.
All of them require special initialization (f()) at a precisely-correct time.

Their instances are constructed at various .cpp.
Thus, an array storage std::vector<CBase*> cBase; caching all of them would be useful.

For example, C1 is map 1:1, C2 is map 1:N, C3 is map N:N.
Together with a custom allocator, I can achieve unearthly data locality.

More note: I don't care about order of callback. (Thank Fire Lancer)

Original Q&A

There are 3 best solutions below

Caleth On 29 August 2017 at 09:16

You could pull the globals into a class templated over the derived type, and when that is instantiated, ensure that it is part of the overall call.

typedef void(*Action)(); // function pointer type, receives static call 
std::set<Action> allFs; 

template<typename T>
struct FRegistry 
{
    static std::vector<T*> ts;
    static void doF() 
    { 
        // loop over the derived type, so no need for virtual
        for (T * t : ts) { t->f(); } 
    }
    static void regis(T * item) 
    { 
        allFs.insert(&FRegistry::doF); // Ensure the global call includes this instantiation
        ts.push_back(t); // register the instance
    }
}

template<typename T>
std::vector<T*> FRegistry<T>::ts = {}; // member initialisation

template <typename T>
regis(T * t)
{ 
    FRegistry<T>::regis(t); 
}

void allF()
{ 
    for (Action a : allFs) { a(); } // call each doF
}

Usage is unchanged

int main() {
    C1 a;
    C1 b;
    C2 c;
    regis(&a);
    regis(&b);
    regis(&c);
    allF();  //print C1 C1 C2
}

Fire Lancer On 29 August 2017 at 09:18

A virtual call is already a very simple and fast implementation, so if thats a problem, nothing will be fast enough without a structural change. Notably I wouldnt expect a simple std::function or manually working with function pointers to be a great gain. Essentially a virtual call might look like:

class CBase{
    // Compiler generated
    struct Vtable
    {
        void (CBase::*f)();
    };
    public: virtual void f(){std::cout<<"CBase"<<std::endl;}

    // Compiler addded instance field
    Vtable *vtable;
};
class C1 : public CBase{
    public: virtual void f() final{std::cout<<"C1"<<std::endl;}
    // Compiler generated static data to initialise vtable member
    static Vtable C1::type_vtable = { &C1::f };
};


CBase *ptr = vector.front();
ptr->f();
// Gets compiled as
ptr->(*ptr->vtable->f)();

So at the code level, its a couple of extra memory reads and then calling a function via a function pointer. However, this prevents many optimisations. At the compiler level, it can no longer inline the function. At the CPU level you need ptr->vtable to be in the CPU cache and risk it failing branch prediction, both of which have far higher costs compared to a direct function call than than a few memory reads might imply. This is especially the case if you have many base classes and they are ordered fairly randomly in the container (the CPU is likely to keep guessing wrong which is next).

The most optimial solution without a design change is more like you showed. Get rid of the virtual/indirect functions entirely and store in separate containers. This lets the compiler inline the function call if it thinks its worth it, and makes it easy for the CPU. You could maybe use overloads or templates, so only have at worst one place to call (and with templates, depending on needs something even more clever).

class Register
{
    std::vector<C1*> c1;
    std::vector<C2*> c2;


    void regis(C1 *c1);
    void regis(C2 *c2);
    //etc.
};

Be aware that you changed the order the objects are invoked. You sorted them by class type, but before was in the same order as regis was called.

Just sorting by class type (could use typeid or such) would also likely help the CPU somewhat, but you still loose inlining.

"Profile Guided Optimisation" (PGO, look it up in the compiler, MSVC and GCC for example can do this) might also help, with some extra build time effort. It lets the compiler optimise based on the code is that is actually run. Ive not looked in detail, at the generated code for real projects, but I understand MSVC at least can "inline" common virtual calls using like a switch statement on the typeid, allowing for better optimisations and possibly working nicer with modern CPUs.

A more major design change is to avoid small virtual functions. Make the virtual functions at a higher level (e.g. pass an entire container to a virtual function, rather than each element).

**Quentin** · Accepted Answer

Your "poor solution" starts looking much better when you automate it using templates. Our goal: store c1s, c2s, etc in a single vector.

To do this, we need to map derived types to consecutive integers. A simple way to do that is to use a global counter, and a function template that increments and stores it every time it is instantiated.

static std::size_t typeIndexCounter = 0;

template <class>
std::size_t indexForType() {
    static std::size_t const index = typeIndexCounter++;
    return index;
}

The first call to indexForType<T>() will reserve a new index for T, and return the same one on subsequent calls.

Then, we need a way to erase enough information about our callback vectors so we can store them and call the correct f on them.

struct Group {
    using CbVec = std::vector<void *>;

    void (*call)(CbVec &);
    CbVec callbacks;
};

static std::vector<Group> groups;

call will hold a function that iterates over the pointers, downcasts them and calls f. Just like your solution, this factors out all of the calls to a single type into only one virtual call.

CbVec could hold CBase * instead of void *, but I'll explain that choice later.

Now we need a function to populate groups upon requesting a Groupfor some type:

template <class T>
Group &groupFor() {

    std::size_t const index = indexForType<T>();
    if(index < groups.size())
        // Group already exists, return it
        return groups[index];

    assert(
        index == groups.size() &&
        "Something went wrong... Did someone call detail_callbacks::indexForType?"
    );

    // Register the new group, with its downcasting function
    groups.push_back({
        [](Group::CbVec &callbacks) {
            for(void *p : callbacks)
                static_cast<T*>(p)->f();
        },
        {}
    });

    // Return the new group
    return groups.back();
}

Here you can see that we use a lambda expression to generate the downcasting functions. The reason I've chosen to store void *'s instead of CBase *'s is that the performance-sensitive downcast in there becomes a no-op, while a base-to-derived cast might have required pointer adjustments (and further complications in case of virtual inheritance).

Finally, the public API. All of the above has been defined inside namespace detail_callbacks, and we just need to put the pieces together:

template <
    class T,
    class = std::enable_if_t<std::is_base_of<CBase, T>::value>
>
void regis(T *callback) {
    detail_callbacks::groupFor<T>().callbacks.push_back(static_cast<void*>(callback));
}

void allF() {
    for(auto &group : detail_callbacks::groups)
        group.call(group.callbacks);
}

And there you go! New derived callbacks are now automatically registered.

See it live on Coliru

Cache a lot of callback, then call them all batch-ly without v-table cost

Problem

My poor solution

There are 3 best solutions below

Related Questions in C++

Related Questions in C++14

Related Questions in VTABLE

Trending Questions

Popular # Hahtags

Popular Questions