Theoretically can the Ackermann function be optimized?

6.1k Views Asked by At

I am wondering if there can be a version of Ackermann function with better time complexity than the standard variation.

This is not a homework and I am just curious. I know the Ackermann function doesn't have any practical use besides as a performance benchmark, because of the deep recursion. I know the numbers grow very large very quickly, and I am not interested in computing it.

Even though I use Python 3 and the integers won't overflow, I do have finite time, but I have implemented a version of it myself according to the definition found on Wikipedia, and computed the output for extremely small values, just to make sure the output is correct.

enter image description here

def A(m, n):
    if not m:
        return n + 1
    return A(m - 1, A(m, n - 1)) if n else A(m - 1, 1)

The above code is a direct translation of the image, and is extremely slow, I don't know how it can be optimized, is it impossible to optimize it?

One thing I can think of is to memoize it, but the recursion runs backwards, each time the function is recursively called the arguments were not encountered before, each successive function call the arguments decrease rather than increase, therefore each return value of the function needs to be calculated, memoization doesn't help when you call the function with different arguments the first time.

Memoization can only help if you call it with the same arguments again, it won't compute the results and will retrieve cached result instead, but if you call the function with any input with (m, n) >= (4, 2) it will crash the interpreter regardless.

I also implemented another version according to this answer:

def ack(x, y):
    for i in range(x, 0, -1):
        y = ack(i, y - 1) if y else 1
    return y + 1

But it is actually slower:

In [2]: %timeit A(3, 4)
1.3 ms ± 9.75 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [3]: %timeit ack(3, 4)
2 ms ± 59.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Theoretically can Ackermann function be optimized? If not, can it be definitely proven that its time complexity cannot decrease?


I have just tested A(3, 9) and A(4, 1) will crash the interpreter, and the performance of the two functions for A(3, 8):

In [2]: %timeit A(3, 8)
432 ms ± 4.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit ack(3, 8)
588 ms ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I did some more experiments:

from collections import Counter
from functools import cache

c = Counter()
def A1(m, n):
    c[(m, n)] += 1
    if not m:
        return n + 1
    return A(m - 1, A(m, n - 1)) if n else A(m - 1, 1)

def test(m, n):
    c.clear()
    A1(m, n)
    return c

The arguments indeed repeat.

But surprisingly caching doesn't help at all:

In [9]: %timeit Ackermann = cache(A); Ackermann(3, 4)
1.3 ms ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Caching only helps when the function is called with the same arguments again, as explained:

In [14]: %timeit Ackermann(3, 2)
101 ns ± 0.47 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

I have tested it with different arguments numerous times, and it always gives the same efficiency boost (which is none).

7

There are 7 best solutions below

10
On BEST ANSWER

Solution

I recently wrote a bunch of solutions based on the same paper that templatetypedef mentioned. Many use generators, one for each m-value, yielding the values for n=0, n=1, n=2, etc. This one might be my favorite:

def A_Stefan_generator_stack3(m, n):
    def a(m):
        if not m:
            yield from count(1)
        x = 1
        for i, ai in enumerate(a(m-1)):
            if i == x:
                x = ai
                yield x
    return next(islice(a(m), n, None))

Explanation

Consider the generator a(m). It yields A(m,0), A(m,1), A(m,2), etc. The definition of A(m,n) uses A(m-1, A(m, n-1)). So a(m) at its index n yields A(m,n), computed like this:

  • A(m,n-1) gets yielded by the a(m) generator itself at index n-1. Which is just the previous value (x) yielded by this generator.
  • A(m-1, A(m, n-1)) = A(m-1, x) gets yielded by the a(m-1) generator at index x. So the a(m) generator iterates over the a(m-1) generator and grabs the value at index i == x.

Benchmark

Here are times for computing all A(m,n) for m≤3 and n≤17, also including templatetypedef's solution:

 1325 ms  A_Stefan_row_class
 1228 ms  A_Stefan_row_lists
  544 ms  A_Stefan_generators
 1363 ms  A_Stefan_paper
  459 ms  A_Stefan_generators_2
  866 ms  A_Stefan_m_recursion
  704 ms  A_Stefan_function_stack
  468 ms  A_Stefan_generator_stack
  945 ms  A_Stefan_generator_stack2
  582 ms  A_Stefan_generator_stack3
  467 ms  A_Stefan_generator_stack4
 1652 ms  A_templatetypedef

Note: Even faster (much faster) solutions using math insights/formulas are possible, see my comment and pts's answer. I intentionally didn't do that, as I was interested in coding techniques, for avoiding deep recursion and avoiding re-calculation. I got the impression that that's also what the question/OP wanted, and they confirmed that now (under a deleted answer, visible if you have enough reputation).

Code

def A_Stefan_row_class(m, n):
    class A0:
        def __getitem__(self, n):
            return n + 1
    class A:
        def __init__(self, a):
            self.a = a
            self.n = 0
            self.value = a[1]
        def __getitem__(self, n):
            while self.n < n:
                self.value = self.a[self.value]
                self.n += 1
            return self.value
    a = A0()
    for _ in range(m):
        a = A(a)
    return a[n]


from collections import defaultdict

def A_Stefan_row_lists(m, n):
    memo = defaultdict(list)
    def a(m, n):
        if not m:
            return n + 1
        if m not in memo:
            memo[m] = [a(m-1, 1)]
        Am = memo[m]
        while len(Am) <= n:
            Am.append(a(m-1, Am[-1]))
        return Am[n]
    return a(m, n)


from itertools import count

def A_Stefan_generators(m, n):
    a = count(1)
    def up(a, x=1):
        for i, ai in enumerate(a):
            if i == x:
                x = ai
                yield x
    for _ in range(m):
        a = up(a)
    return next(up(a, n))


def A_Stefan_paper(m, n):
    next = [0] * (m + 1)
    goal = [1] * m + [-1]
    while True:
        value = next[0] + 1
        transferring = True
        i = 0
        while transferring:
            if next[i] == goal[i]:
                goal[i] = value
            else:
                transferring = False
            next[i] += 1
            i += 1
        if next[m] == n + 1:
            return value


def A_Stefan_generators_2(m, n):
    def a0():
        n = yield
        while True:
            n = yield n + 1
    def up(a):
        next(a)
        a = a.send
        i, x = -1, 1
        n = yield
        while True:
            while i < n:
                x = a(x)
                i += 1
            n = yield x
    a = a0()
    for _ in range(m):
        a = up(a)
    next(a)
    return a.send(n)


def A_Stefan_m_recursion(m, n):
    ix = [None] + [(-1, 1)] * m
    def a(m, n):
        if not m:
            return n + 1
        i, x = ix[m]
        while i < n:
            x = a(m-1, x)
            i += 1
        ix[m] = i, x
        return x
    return a(m, n)


def A_Stefan_function_stack(m, n):
    def a(n):
        return n + 1
    for _ in range(m):
        def a(n, a=a, ix=[-1, 1]):
            i, x = ix
            while i < n:
                x = a(x)
                i += 1
            ix[:] = i, x
            return x
    return a(n)


from itertools import count, islice

def A_Stefan_generator_stack(m, n):
    a = count(1)
    for _ in range(m):
        a = (
            x
            for a, x in [(a, 1)]
            for i, ai in enumerate(a)
            if i == x
            for x in [ai]
        )
    return next(islice(a, n, None))


from itertools import count, islice

def A_Stefan_generator_stack2(m, n):
    a = count(1)
    def up(a):
        i, x = 0, 1
        while True:
            i, x = x+1, next(islice(a, x-i, None))
            yield x
    for _ in range(m):
        a = up(a)
    return next(islice(a, n, None))


def A_Stefan_generator_stack3(m, n):
    def a(m):
        if not m:
            yield from count(1)
        x = 1
        for i, ai in enumerate(a(m-1)):
            if i == x:
                x = ai
                yield x
    return next(islice(a(m), n, None))


def A_Stefan_generator_stack4(m, n):
    def a(m):
        if not m:
            return count(1)
        return (
            x
            for x in [1]
            for i, ai in enumerate(a(m-1))
            if i == x
            for x in [ai]
        )
    return next(islice(a(m), n, None))


def A_templatetypedef(i, n):
    positions = [-1] * (i + 1)
    values = [0] + [1] * i
    
    while positions[i] != n:       
        values[0]    += 1
        positions[0] += 1
            
        j = 1
        while j <= i and positions[j - 1] == values[j]:
            values[j] = values[j - 1]
            positions[j] += 1
            j += 1

    return values[i]


funcs = [
    A_Stefan_row_class,
    A_Stefan_row_lists,
    A_Stefan_generators,
    A_Stefan_paper,
    A_Stefan_generators_2,
    A_Stefan_m_recursion,
    A_Stefan_function_stack,
    A_Stefan_generator_stack,
    A_Stefan_generator_stack2,
    A_Stefan_generator_stack3,
    A_Stefan_generator_stack4,
    A_templatetypedef,
]

N = 18
args = (
    [(0, n) for n in range(N)] +
    [(1, n) for n in range(N)] +
    [(2, n) for n in range(N)] +
    [(3, n) for n in range(N)]
)

from time import time

def print(*args, print=print, file=open('out.txt', 'w')):
    print(*args)
    print(*args, file=file, flush=True)
    
expect = none = object()
for _ in range(3):
  for f in funcs:
    t = time()
    result = [f(m, n) for m, n in args]
    # print(f'{(time()-t) * 1e3 :5.1f} ms ', f.__name__)
    print(f'{(time()-t) * 1e3 :5.0f} ms ', f.__name__)
    if expect is none:
        expect = result
    elif result != expect:
        raise Exception(f'{f.__name__} failed')
    del result
  print()
6
On

You are asking these two questions in your text:

Theoretically can Ackermann function be optimized?

Sure, you can implement it naively, and then optimize that in technical ways (e.g. memoization, or storing intermediate values, and so on). But I think this is not the actual question you are asking.

If not, can it be definitely proven that its time complexity cannot decrease?

Optimization has nothing to do with time complexity. The "O" notation abstracts away from constant multiplicative factors. You can have two algorithms where the one calculates the Ackermann function in 1 millionth of the time or space than the other, but they would still have the same O complexity.

To quote Wikipedia,

the Ackermann function, named after Wilhelm Ackermann, is one of the simplest1 and earliest-discovered examples of a total computable function that is not primitive recursive.

The function is not primitive recursive, and furthermore, it

grows faster than any primitive recursive function.

"Primitive recursive" means that you can implement the algorithm with loops where the bound is known beforehand; i.e. you do not need a possible infinitely repeating while loop. This is, granted, a bit of an abstract concept, and to quote Wikipedia again:

The importance of primitive recursive functions lies in the fact that most computable functions that are studied in number theory (and more generally in mathematics) are primitive recursive.

And yes, it has been proven that Ackermann is not primitive recursive. Discovering that it is actually not so would probably not make you any money, but certainly put your name on a pedestal, in the Theoretical Computer Science community.

You cannot optimize this kind of complexity away - consider your program just being a different format of writing a proof that the Ackerman is, indeed, primitive recursive; you'd just have to convert it back into mathematical/logical terms, and voila. The fact that this has not happened over the many years, together with the existence of "proof positive" like the one linked tells you that it is, in fact, behaving as advertised.

NB: finally, all of this has also to be seen in the light of the Ackerman function having likely been designed to have this property - as that Wikipedia page mentions, before it was discovered or created, some people thought that all computable functions were primitive recursive. While I don't know what drove Mr. Ackerman to do this back in the 1920's, the Ackerman function is now verily a corner-stone of complexity theory in Theoretical Computer Science; a very fascinating story.

2
On

v0 is pretty much your code:

def ack(m, n):
  if m == 0: return n + 1
  return ack(m - 1, 1) if n == 0 else ack(m - 1, ack(m, n - 1))

This calculates ack(3, 9) in 2.49s. ack() is called 11164370 times. Surely we can cache the values already calculated saving tonns of calls to the function.

v1 uses a dict for a cache and only calculates if the result is not in the cache:

c = {}

def ack(m, n):
  global c

  if "{}-{}".format(m, n) in c: return c["{}-{}".format(m, n)]
  else:
    if m == 0: ret = n + 1
    else: ret = ack(m - 1, 1) if n == 0 else ack(m - 1, ack(m, n - 1))

    c["{}-{}".format(m, n)] = ret
    return ret

This calculates ack(3, 9) in 0.03s and with that ack(3, 9) is no longer suitable for measuring execution time. This time ack() is called 12294 times, the saving is enormous. But we can do it better. From now on we use ack(3, 13) that currently runs for 0.23s.

v2 only cares for caching where m > 0, since the case of m == 0 is trivial. With that space complexity is somewhat reduced:

c = {}

def ack(m, n):
  global c

  if m == 0: return n + 1
  else:
    if "{}-{}".format(m, n) in c: return c["{}-{}".format(m, n)]
    else: ret = ack(m - 1, 1) if n == 0 else ack(m - 1, ack(m, n - 1))

    c["{}-{}".format(m, n)] = ret
    return ret

This finishes ack(3, 13) in 0.18s. But we can to even better.

v3 saves a bit of time by calculating the key in the cache only once per iteration:

c = {}

def ack(m, n):
  global c

  if m == 0: return n + 1
  else:
    key = "{}-{}".format(m, n)
    if key in c: return c[key]
    else: ret = ack(m - 1, 1) if n == 0 else ack(m - 1, ack(m, n - 1))

    c[key] = ret
    return ret

This time it runs for 0.14s. I surely can't do much more to it around midnight but I will think about it some more. Ez jó mulatság, férfi munka volt - for those who know what that means.

3
On

Here’s my stab at a Python implementation of the Grossman-Zeitman algorithm, which iteratively computes A(i, n) in O(A(i, n)) time. For a description of how this algorithm works, check the linked question.

def ackermann(i, n):
    positions = [-1] * (i + 1)
    values = [0] + [1] * i
    
    while positions[i] != n:       
        values[0]    += 1
        positions[0] += 1
            
        j = 1
        while j <= i and positions[j - 1] == values[j]:
            values[j] = values[j - 1]
            positions[j] += 1
            j += 1

    return values[i]

Given that the inner loop is very tight and there’s no recursion, I suspect this will likely outperform the basic recursive solution you initially posted. However, I haven’t correctness- or time-tested this implementation; it’s based on the Java code I wrote in the linked question.

0
On

But surprisingly caching doesn't help at all:

In [9]: %timeit Ackermann = cache(A); Ackermann(3, 4)
1.3 ms ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

That's because you did that wrong. Only your Ackermann is memoized, not your A. And the recursive calls all go to A.

Times for m,n = 3,8, including a properly memoized version B:

440.30 ms  A(m, n)
431.11 ms  Ackermann = cache(A); Ackermann(m, n)
  1.74 ms  B.cache_clear(); B(m, n)

About B:

  • Print B.cache_info()) afterwards to see how well the cache did: CacheInfo(hits=1029, misses=5119, maxsize=None, currsize=5119). So B had 5,119 "real" calls (where it had to do work) and 1,029 calls that were answered from cache. Without the memoization, it gets called 2,785,999 times.
  • For m,n = 3,12 it takes ~32 ms.
  • For m,n = 3,13 it crashes due to the deep recursion.

Code:

from timeit import repeat
import sys

sys.setrecursionlimit(999999)

setup = '''
from functools import cache

def A(m, n):
    if not m:
        return n + 1
    return A(m - 1, A(m, n - 1)) if n else A(m - 1, 1)

@cache
def B(m, n):
    if not m:
        return n + 1
    return B(m - 1, B(m, n - 1)) if n else B(m - 1, 1)

m, n = 3, 8
'''

codes = [
    'A(m, n)',
    'Ackermann = cache(A); Ackermann(m, n)',
    'B.cache_clear(); B(m, n)',
]

for code in codes:
    t = min(repeat(code, setup, number=1))
    print(f'{t*1e3:6.2f} ms ', code)
0
On

TL;DR The Ackermann function grows so rapidly that for (m >= 4 and n >= 3) the result won't fit to RAM. For smaller values of m or n, it's very easy to compute the values directly (without recursion) and quickly.

I know that this answer doesn't help optimizing recursive function calls, nevertheless it provides a fast solution (with analysis) for computing values of the Ackermann function on real, contemporary computers, and it provides a direct answer to the first paragraph of the question.

We want to store the result in an unsigned binary big integer variable on a computer. To store the value 2 ** b, we need (b + 1) bits of data, and (c1 + c2 * ceil(log2(b))) bits of header for storing the length b itself. (c1 is a nonnegative integer, c2 is a positive integer.) Thus we need more than b bits of RAM to store 2 ** b.

Computers with huge amounts of RAM:

  • In 2023 consumer-grade PCs have up to 128 GiB of RAM, and there are commercially available servers with 2 TiB of RAM (https://ramforyou.com/is-it-possible-for-a-computer-have-1tb-ram).
  • In 2020, a single-rack server with 6 TiB of RAM was built (https://www.youtube.com/watch?v=uHAfTty9UWY).
  • In 2017, a large server with 160 TiB of RAM was constructed (https://www.forbes.com/sites/aarontilley/2017/05/16/hpe-160-terabytes-memory/).
  • So we can say that in 2023 it's not feasible to build a computer with 1 PiB == 1024 TiB of RAM, and 1 EiB == 1024 PiB == 1048576 TiB == (2 ** 63) bits is completely impossible in 2023 and the near future.
  • The current estimate for the number of atoms in the universe is 10 ** 82 == 10 * 10 ** 81 < 16 * 2 ** 270 < 2 ** 274.
  • Let's imagine the biggest computer. Even if there are many more atoms, say 2 ** 300, and we can use all of them in a single computer, and we can store 1 EiB of data in a single atom, thus we have 2 ** 363 bits of memory; that's still too small to store the Big value 2 ** (2 ** 363).

Let's see which of Knuth's up-arrow notation (https://en.wikipedia.org/wiki/Knuth%27s_up-arrow_notation) values are less than Big:

  • 2 ^ b == 2 ** b: It works for 1 <= b <= (2 ** 363 - 1).

    2 ^ (2 ** 363) == 2 ** (2 ** 363) == Big >= Big.

  • 2 ^^ b: It works for 1 <= b <= 5.

    2 ^^ 1 == 2;

    2 ^^ 2 == 2 ** 2 == 4;

    2 ^^ 3 == 2 ** (2 ** 2) == 16;

    2 ^^ 4 == 2 ** (2 ** (2 ** 2)) == 65536;

    2 ^^ 5 == 2 ** (2 ** (2 ** (2 ** 2))) == 2 ** 65536 == 2 ** (2 ** 16) < Big;

    2 ^^ 6 == 2 ** (2 ** (2 ** (2 ** (2 ** 2)))) == 2 ** (2 ** 65536) >= Big.

  • 2 ^^^ b: It works for 1 <= b <= 3.

    2 ^^^ 1 == 2;

    2 ^^^ 2 == 2 ^^ 2 == 4;

    2 ^^^ 3 == 2 ^^ (2 ^^ 2) == 65536;

    2 ^^^ 4 == 2 ^^ (2 ^^ (2 ^^ 2)) == 2 ^^ 65536 >= 2 ^^ 6 >= Big.

  • 2 ^^^^ b: It works for 1 <= b <= 2.

    2 ^^^^ 1 == 2;

    2 ^^^^ 2 == 2 ^^^ 2 == 4;

    2 ^^^^ 3 == 2 ^^^ (2 ^^^ 2) == 2 ^^^ 4 == 2 ^^ 65536 >= 2 ^^ 6 >= Big.

  • 2 ^^^^^ b and even more arrows: It works for 1 <= b <= 2. The value is at least 2 ^^^^ b, see there.

Thus here is how to compute the feasible values in Python:

def up_arrow(a, b):
  if b <= 2:
    if b < 0:
      raise ValueError
    return (1, 2, 4)[b]
  elif a == 1:
    if b >> 363:
      raise ValueError
    return 1 << b  # This may run out of memory for large b.
  elif a == 2:
    if b > 5:
      raise ValueError
    if b == 5:
      return 1 << 65536
    return (16, 65536)[b - 3]
  elif a == 3:
    if b > 3:
      raise ValueError
    return 65536
  else:
    raise ValueError

Given that for m >= 3, ack(m, n) == up_arrow(m - 2, n + 3) - 3 (see also https://en.wikipedia.org/wiki/Ackermann_function#Table_of_values), we can compute the feasible values of the Ackermann function:

def ack(m, n):
  if n < 0:
    raise ValueError
  if m in (0, 1):
    return n + (m + 1)  # This may run out of memory for large n.
  elif m == 2:
    return (n << 1) + 3  # This may run out of memory for large n.
  elif m == 3:
    if n >> 360:
      raise ValueError
    return (1 << (n + 3)) - 3  # This may run out of memory for large n.
  elif m == 4:
    if n > 2:
      raise ValueError
    if n == 2:
      return (1 << 65536) - 3
    return (13, 65533)[n]
  elif m == 5:
    if n > 0:
      raise ValueError
    return 65533
  else:
    raise ValueError

print([ack(m, 0) for m in range(6)])
print([ack(m, 1) for m in range(5)])
print([ack(m, 2) for m in range(5)])
print([ack(m, 3) for m in range(4)])
print([ack(m, 4) for m in range(4)])
7
On

The question has tag Python, but using ctypes to call C++ code is also a Python feature. It contains 3 versions (I measure time cost of result = [ackermann(m,n) for m<=3, n <= 17], like in the top answer)

  • Use int64_t and recursion with memoization => 32ms. It'll only work for small inputs (just like all the algorithms in the top answer). It's focused on optimizing recursive function.
  • Use bignum and same algorithm as above => 32ms Since the bignum lib I used is implemented using 64 bit base type, if the result fit in small numbers it's no different from using int64_t directly. I added this because comments mentioned it's unfair to compare C++ int64_t vs Python unlimited length integer.
  • Use bignum + math formula => ~~0.14 ms This one will also work with bigger inputs.

Summary: > 3000x faster than top Python answer if you use math, or 13-15x faster if you don't use math. I use this single-header-file bignum library. Stackoverflow has answer length limit so I can't copy the file here.

// ackermann.cpp
#include <iostream>
#include <vector>
#include <chrono>
#include <string>
#include "num.hpp"
using namespace std;

class MyTimer {
    std::chrono::time_point<std::chrono::system_clock> start;

public:
    void startCounter() {
        start = std::chrono::system_clock::now();
    }

    int64_t getCounterNs() {
        return std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::system_clock::now() - start).count();
    }

    int64_t getCounterMs() {
        return std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now() - start).count();
    }

    double getCounterMsPrecise() {
        return std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::system_clock::now() - start).count()
                / 1000000.0;
    }
};

extern "C" {
  int64_t ackermann(int64_t m, int64_t n) {
    static std::vector<int64_t> cache[4];
    // special signal to clear cache
    if (m < 0 && n < 0) {      
      for (int i = 0; i < 4; i++) {
        cache[i].resize(0);
        cache[i].shrink_to_fit();      
      }
      return -1;
    }
    
    if (n >= cache[m].size()) {
      int cur = cache[m].size();
      cache[m].resize(n + 1);
      for (int i = cur; i < n; i++) cache[m][i] = ackermann(m, i);
    }

    if (cache[m][n]) return cache[m][n];
    if (m == 0) return cache[m][n] = n + 1;
    
    // These 3 lines are kinda cheating, since it uses math formula for special case
    // So I commented them out because the question is about optimizing recursion.
    // if (m == 1) return cache[m][n] = n + 2;
    // if (m == 2) return cache[m][n] = 2 * n + 3;
    // if (m == 3) return cache[m][n] = (1LL << (n + 3)) - 3;

    if (n == 0) return cache[m][n] = ackermann(m - 1, 1);

    return cache[m][n] = ackermann(m - 1, ackermann(m, n - 1));        
  }

  Num ackermann_bignum_smallres(int64_t m, int64_t n) {
    static std::vector<Num> cache[4];
    // special signal to clear cache
    if (m < 0 && n < 0) {      
      for (int i = 0; i < 4; i++) {
        cache[i].resize(0);
        cache[i].shrink_to_fit();      
      }
      return -1;
    }
    
    if (n >= cache[m].size()) {
      int cur = cache[m].size();
      cache[m].resize(n + 1);
      for (int i = cur; i < n; i++) cache[m][i] = ackermann(m, i);
    }

    if (cache[m][n] > 0) return cache[m][n];
    if (m == 0) return cache[m][n] = n + 1;

    if (n == 0) return cache[m][n] = ackermann(m - 1, 1);

    return cache[m][n] = ackermann(m - 1, ackermann(m, n - 1));
  }

  //-----
  Num bignum_pow(const Num& x, const Num& n) {
    if (n == 0) return 1;
    Num mid = bignum_pow(x, n / 2);
    if (n % 2 == 0) return mid * mid;
    else return mid * mid * x;
  }

  Num ackermann_bignum(Num m, Num n) {
    if (m <= 1) return n + (m + 1);
    else if (m == 2) return Num(2) * n + 3;
    else if (m == 3) return bignum_pow(2, n + 3) - 3;
    else {
      cout << "Don't put m >= 4\n";
      exit(1);
    } 
  }
}

Num dummy = 0;
int main(int argc, char* argv[])
{
  int test_type = 0;
  if (argc > 1) {
    try {
      test_type = std::stoi(string(argv[1]));
    } catch (...) {
      test_type = 0;
    }
  }
  int lim = (test_type == 0) ? 63 : 17;

  MyTimer timer;
  timer.startCounter();
    
  for (int m = 0; m <= 3; m++)
  for (int n = 0; n <= lim; n++) {
    if (test_type == 0) {
      dummy = ackermann_bignum(m, n);      
    } else if (test_type == 1) {
      dummy = ackermann_bignum_smallres(m, n);
    } else {
      dummy = ackermann(m, n);      
    }
    cout << "ackermann(" << m << ", " << n << ") = " << dummy << "\n";    
  }

  cout << "ackermann cost = " << timer.getCounterMsPrecise() << "\n";
}

Compile the above using g++ ackermann.cpp -shared -fPIC -O3 -std=c++17 -o ackermann.so

To run separately, use

g++ -o main_cpp ackermann.cpp -O3 -std=c++17
./main_cpp

Then in the same folder, run this script (modified from @Stefan Pochmann answer)

def A_Stefan_row_class(m, n):
    class A0:
        def __getitem__(self, n):
            return n + 1
    class A:
        def __init__(self, a):
            self.a = a
            self.n = 0
            self.value = a[1]
        def __getitem__(self, n):
            while self.n < n:
                self.value = self.a[self.value]
                self.n += 1
            return self.value
    a = A0()
    for _ in range(m):
        a = A(a)
    return a[n]


from collections import defaultdict

def A_Stefan_row_lists(m, n):
    memo = defaultdict(list)
    def a(m, n):
        if not m:
            return n + 1
        if m not in memo:
            memo[m] = [a(m-1, 1)]
        Am = memo[m]
        while len(Am) <= n:
            Am.append(a(m-1, Am[-1]))
        return Am[n]
    return a(m, n)


from itertools import count

def A_Stefan_generators(m, n):
    a = count(1)
    def up(a, x=1):
        for i, ai in enumerate(a):
            if i == x:
                x = ai
                yield x
    for _ in range(m):
        a = up(a)
    return next(up(a, n))


def A_Stefan_paper(m, n):
    next = [0] * (m + 1)
    goal = [1] * m + [-1]
    while True:
        value = next[0] + 1
        transferring = True
        i = 0
        while transferring:
            if next[i] == goal[i]:
                goal[i] = value
            else:
                transferring = False
            next[i] += 1
            i += 1
        if next[m] == n + 1:
            return value


def A_Stefan_generators_2(m, n):
    def a0():
        n = yield
        while True:
            n = yield n + 1
    def up(a):
        next(a)
        a = a.send
        i, x = -1, 1
        n = yield
        while True:
            while i < n:
                x = a(x)
                i += 1
            n = yield x
    a = a0()
    for _ in range(m):
        a = up(a)
    next(a)
    return a.send(n)


def A_Stefan_m_recursion(m, n):
    ix = [None] + [(-1, 1)] * m
    def a(m, n):
        if not m:
            return n + 1
        i, x = ix[m]
        while i < n:
            x = a(m-1, x)
            i += 1
        ix[m] = i, x
        return x
    return a(m, n)


def A_Stefan_function_stack(m, n):
    def a(n):
        return n + 1
    for _ in range(m):
        def a(n, a=a, ix=[-1, 1]):
            i, x = ix
            while i < n:
                x = a(x)
                i += 1
            ix[:] = i, x
            return x
    return a(n)


from itertools import count, islice

def A_Stefan_generator_stack(m, n):
    a = count(1)
    for _ in range(m):
        a = (
            x
            for a, x in [(a, 1)]
            for i, ai in enumerate(a)
            if i == x
            for x in [ai]
        )
    return next(islice(a, n, None))


from itertools import count, islice

def A_Stefan_generator_stack2(m, n):
    a = count(1)
    def up(a):
        i, x = 0, 1
        while True:
            i, x = x+1, next(islice(a, x-i, None))
            yield x
    for _ in range(m):
        a = up(a)
    return next(islice(a, n, None))


def A_Stefan_generator_stack3(m, n):
    def a(m):
        if not m:
            yield from count(1)
        x = 1
        for i, ai in enumerate(a(m-1)):
            if i == x:
                x = ai
                yield x
    return next(islice(a(m), n, None))


def A_Stefan_generator_stack4(m, n):
    def a(m):
        if not m:
            return count(1)
        return (
            x
            for x in [1]
            for i, ai in enumerate(a(m-1))
            if i == x
            for x in [ai]
        )
    return next(islice(a(m), n, None))


def A_templatetypedef(i, n):
    positions = [-1] * (i + 1)
    values = [0] + [1] * i
    
    while positions[i] != n:       
        values[0]    += 1
        positions[0] += 1
            
        j = 1
        while j <= i and positions[j - 1] == values[j]:
            values[j] = values[j - 1]
            positions[j] += 1
            j += 1

    return values[i]

import ctypes
mylib = ctypes.CDLL('./ackermann.so')
mylib.ackermann.argtypes = [ctypes.c_int64, ctypes.c_int64]
mylib.ackermann.restype = ctypes.c_int64

def c_ackermann(m, n):
    return mylib.ackermann(m,n)

funcs = [
    c_ackermann,
    A_Stefan_row_class,
    A_Stefan_row_lists,
    A_Stefan_generators,
    A_Stefan_paper,
    A_Stefan_generators_2,
    A_Stefan_m_recursion,
    A_Stefan_function_stack,
    A_Stefan_generator_stack,
    A_Stefan_generator_stack2,
    A_Stefan_generator_stack3,
    A_Stefan_generator_stack4,
    A_templatetypedef
]

N = 18
args = (
    [(0, n) for n in range(N)] +
    [(1, n) for n in range(N)] +
    [(2, n) for n in range(N)] +
    [(3, n) for n in range(N)]
)

from time import time

def print(*args, print=print, file=open('out.txt', 'w')):
    print(*args)
    print(*args, file=file, flush=True)
    
expect = none = object()
for _ in range(3):
  for f in funcs:
    t = time()
    result = [f(m, n) for m, n in args]
    # print(f'{(time()-t) * 1e3 :5.1f} ms ', f.__name__)
    print(f'{(time()-t) * 1e3 :5.0f} ms ', f.__name__)
    if expect is none:
        expect = result
    elif result != expect:
        raise Exception(f'{f.__name__} failed')
    del result
  print()

  c_ackermann(-1, -1)

Output:

   32 ms  c_ackermann
 1897 ms  A_Stefan_row_class
 1427 ms  A_Stefan_row_lists
  437 ms  A_Stefan_generators
 1366 ms  A_Stefan_paper
  479 ms  A_Stefan_generators_2
  801 ms  A_Stefan_m_recursion
  725 ms  A_Stefan_function_stack
  716 ms  A_Stefan_generator_stack
 1113 ms  A_Stefan_generator_stack2
  551 ms  A_Stefan_generator_stack3
  682 ms  A_Stefan_generator_stack4
 1622 ms  A_templatetypedef