Here is a simplified singly-linked list, where each node owns the next, along with a function for destroying the list:
struct Node {
Node* next = nullptr;
~Node() {
delete next;
}
};
void Destroy(Node* head) {
delete head;
}
Clang 15.0.0 with -O3 (Compiler Explorer) gives recursive code for this that uses call instructions, i.e. no tail call recursion through ~Node itself, only operator delete(void*):
Destroy(Node*): # @Destroy(Node*)
test rdi, rdi
je .LBB0_1
push rbx
mov rbx, rdi
call Node::~Node() [base object destructor]
mov rdi, rbx
pop rbx
jmp operator delete(void*)@PLT # TAILCALL
.LBB0_1:
ret
Node::~Node() [base object destructor]: # @Node::~Node() [base object destructor]
push rbx
mov rbx, qword ptr [rdi]
test rbx, rbx
je .LBB1_1
mov rdi, rbx
call Node::~Node() [base object destructor]
mov rdi, rbx
pop rbx
jmp operator delete(void*)@PLT # TAILCALL
.LBB1_1:
pop rbx
ret
Here is an open-coded version of something similar:
struct Node2 {
Node2* next = nullptr;
};
void Destroy2(Node2* head) {
auto* const next = head->next;
delete head;
if (next) {
Destroy2(next);
}
}
Even with -01 clang turns the tail call into an efficient loop with O(1) stack frames involved for an arbitrary number of list nodes:
Destroy2(Node2*): # @Destroy2(Node2*)
push rbx
.LBB2_1: # =>This Inner Loop Header: Depth=1
mov rbx, qword ptr [rdi]
call operator delete(void*)@PLT
mov rdi, rbx
test rbx, rbx
jne .LBB2_1
pop rbx
ret
I understand that compiler optimizations aren't guaranteed, but I'm surprised clang isn't able to do something more efficient with the basic Destroy case. It leads me to think that the key difference is in the fact that Destroy2 is able to free the memory for head before it deals with head->next. But it seems to me that shouldn't matter unless operator delete is allowed to have some visible side effect.
Is there an important semantic difference between these two from the point of view of the abstract machine, preventing clang from optimizing the first case? If so, is there a way to make ~Node more friendly to the optimizer so that I don't need to open-code a destroy function?
It's worth calling out explicitly something I alluded to in the question: that the operations in
~Nodeare in the opposite order that they are inDestroy2. The destructor is something like this:So the question is why the compiler can't reverse those two operations, putting them in the same order as in
Destroy2and allowing a tail call to~Nodefornext.I can't say definitively that this is the answer, but Richard Smith points out that this could be because the compiler is limited in its ability to prove that anything preventing this would be UB. For example if there is a cycle in the list then maybe this could go wrong—a cycle would cause UB even for the existing code, but the compiler may not be able to see that. Perhaps more relevantly without an assumption about finite memory/object count (that the compiler probably doesn't make), the transformation isn't legal because the list may be unbounded without UB in the existing code.
As for what to do about it, Richard also points out that in C++20 we can use a destroying
operator deleteto reorder the operations ourselves, giving the effect ofDestroy2with the convenience of just usingdeletelike normal:As of today the generated code is good for raw pointers and bad for
std::unique_ptr, but the latter gets better if you add[[gnu::flatten]]to the operator.