Cassandra node refuses to join cluster: "CompactionExecutor" error


We have a 3-node Cassandra cluster running the following version:

[cqlsh 5.0.1 | Cassandra 3.11.6 | CQL spec 3.4.4 | Native protocol v4]

Node1 stopped communicating with the rest of the cluster this morning; the logs showed this:

ERROR [CompactionExecutor:242] 2020-09-15 19:24:48,753 CassandraDaemon.java:235 - Exception in thread Thread[CompactionExecutor:242,1,main]
ERROR [MutationStage-2] 2020-09-15 19:24:54,749 AbstractLocalAwareExecutorService.java:169 - Uncaught exception on thread Thread[MutationStage-2,5,main]
ERROR [MutationStage-2] 2020-09-15 19:24:54,771 StorageService.java:466 - Stopping gossiper
ERROR [MutationStage-2] 2020-09-15 19:24:56,791 StorageService.java:476 - Stopping native transport
ERROR [CompactionExecutor:242] 2020-09-15 19:24:58,541 LogTransaction.java:277 - Transaction log [md_txn_compaction_c2dbca00-f780-11ea-95eb-cf88b1cae05a.log in /mnt/cass-a/data/system/local-7ad54392bcdd35a684174e047860b377] indicates txn was not completed, trying to abort it now
ERROR [CompactionExecutor:242] 2020-09-15 19:24:58,545 LogTransaction.java:280 - Failed to abort transaction log [md_txn_compaction_c2dbca00-f780-11ea-95eb-cf88b1cae05a.log in /mnt/cass-a/data/system/local-7ad54392bcdd35a684174e047860b377]
ERROR [CompactionExecutor:242] 2020-09-15 19:24:58,566 LogTransaction.java:225 - Unable to delete /mnt/cass-a/data/system/local-7ad54392bcdd35a684174e047860b377/md_txn_compaction_c2dbca00-f780-11ea-95eb-cf88b1cae05a.log as it does not exist, see debug log file for stack trace
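For reference, any leftover compaction transaction logs like the one in the errors above can be listed with a find over the data directory (path taken from the log lines):

# list leftover compaction transaction logs under the data directory
find /mnt/cass-a/data -name '*_txn_*.log'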

Cassandra starts up fine on the "broken node", but refuses to rejoin the cluster.

When I run nodetool status, I get this:

Error: The node does not have system_traces yet, probably still bootstrapping
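For completeness, the node's own view of the cluster and any streaming activity can be checked with the standard nodetool subcommands:

nodetool describecluster   # cluster name and schema versions as this node sees them
nodetool netstats          # bootstrap / streaming status on this node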

Gossip is not running; I've tried disabling and re-enabling it, no joy.
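The disable/re-enable cycle was the standard one, sketched here for completeness:

# run on the broken node
nodetool disablegossip
nodetool enablegossip
nodetool statusgossip   # should print "running" afterwards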

I've also tried both a repair and a rebuild (sketched below); both completed without any errors at all.
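For reference, the repair and rebuild were along these lines (the datacenter name below is a placeholder):

nodetool repair -full
nodetool rebuild -- <name-of-existing-dc>   # placeholder: source datacenter to stream from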

Any and all help would be appreciated.

Thanks.

1 Answer

The symptoms you described indicate to me that the node had some form of hardware failure and the disk holding the data/ directory is possibly inaccessible.

In instances like this, the disk failure policy in cassandra.yaml kicked in:

disk_failure_policy: stop

This would explain why gossip is unavailable (default port 7000) and why the node is not accepting any client connections either (default CQL port 9042).
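You can verify this from another host with a quick port probe (a sketch; node1 is a placeholder for the broken node's address):

nc -zv node1 7000   # inter-node gossip port
nc -zv node1 9042   # CQL native transport port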

If there is an impending hardware failure, there's a good chance the disk/volume is mounted read-only. There's also the possibility that the disk is full. Check the operating system logs for clues; you will likely need to escalate the issue to your sysadmin team.
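As a starting point, checks along these lines usually reveal a full or read-only volume (a sketch, assuming the /mnt/cass-a path from your logs is its own mount point):

df -h /mnt/cass-a                                   # is the volume full?
grep ' /mnt/cass-a ' /proc/mounts                   # "ro" in the options means mounted read-only
dmesg -T | grep -iE 'i/o error|read-only' | tail    # recent kernel-level disk errors

Cheers!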