I was trying to read the Paxos Commit paper and am struggling with moving past the introduction. The initial section motivates a fault-tolerant transaction coordinator for the two-phase commit protocol by describing regular two-phase commit as "blocking" when the transaction coordinator fails:
The failure of that coordinator can cause the protocol to block, with no process knowing the outcome, until the coordinator is repaired.
My question is this: if the coordinator fails, and assuming the coordinator's state is a deterministic function of the responses of the resource managers (or the individual databases), why can't we simply have any of the other resource managers query every other resource manager for its response and "repair" progress, essentially taking over the role of the coordinator after a timeout period?
This assumes the individual resource managers are modelled as fault-tolerant black boxes (e.g. each is implemented with its own multi-Paxos implementation on a cluster of n machines).
What you propose is indeed what many people have done with 2PC. The very same paper you referenced explains in section 3 why that strategy is not correct. In Lamport's words:
In my words: imagine the original coordinator is not dead, but just stuck for a long time (GC, deadlock, whatever). After the timeout, another node would pick up the slack. Now the original coordinator could wake up and choose to commit, while the new coordinator could choose to abort. Depending on the interleaving of messages, some RMs would end up in the committed state and others in the aborted state, which is a system failure.
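To make the interleaving concrete, here is a minimal Python sketch of that scenario. It is only an illustration, not code from the paper: the `ResourceManager` class, the `run_interleaving` helper, and the rule that a prepared RM obeys the first decision it receives are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class ResourceManager:
    """Hypothetical RM that has voted yes and is waiting for the outcome."""
    name: str
    state: str = "prepared"

    def deliver(self, decision: str) -> None:
        # Assumption: an RM acts on the first decision it receives and
        # ignores later ones, since it cannot tell which coordinator is "real".
        if self.state == "prepared":
            self.state = decision


def run_interleaving(deliveries):
    """Apply (rm_name, decision) messages in arrival order; return final states."""
    rms = {name: ResourceManager(name) for name in ("rm1", "rm2")}
    for rm_name, decision in deliveries:
        rms[rm_name].deliver(decision)
    return {name: rm.state for name, rm in rms.items()}


# The stalled original coordinator wakes up and sends "committed", while the
# node that took over after the timeout sends "aborted".
deliveries = [
    ("rm1", "committed"),  # old coordinator's message reaches rm1 first
    ("rm2", "aborted"),    # new coordinator's message reaches rm2 first
    ("rm1", "aborted"),    # arrives too late, ignored
    ("rm2", "committed"),  # arrives too late, ignored
]

outcome = run_interleaving(deliveries)
print(outcome)  # {'rm1': 'committed', 'rm2': 'aborted'}
```

The two RMs finish in different states even though neither of them failed. The timeout cannot distinguish a dead coordinator from a slow one, so nothing stops two coordinators from being active at once.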