We have a 3 node Cassandra Cluster running the following version
[cqlsh 5.0.1 | Cassandra 3.11.6 | CQL spec 3.4.4 | Native protocol v4]
Node1 stopped communicating with the rest of the cluster this morning, the logs showed this:
ERROR [CompactionExecutor:242] 2020-09-15 19:24:48,753 CassandraDaemon.java:235 - Exception in thread Thread[CompactionExecutor:242,1,main]
ERROR [MutationStage-2] 2020-09-15 19:24:54,749 AbstractLocalAwareExecutorService.java:169 - Uncaught exception on thread Thread[MutationStage-2,5,main]
ERROR [MutationStage-2] 2020-09-15 19:24:54,771 StorageService.java:466 - Stopping gossiper
ERROR [MutationStage-2] 2020-09-15 19:24:56,791 StorageService.java:476 - Stopping native transport
ERROR [CompactionExecutor:242] 2020-09-15 19:24:58,541 LogTransaction.java:277 - Transaction log [md_txn_compaction_c2dbca00-f780-11ea-95eb-cf88b1cae05a.log in /mnt/cass-a/data/system/local-7ad54392bcdd35a684174e047860b377] indicates txn was not completed, trying to abort it now
ERROR [CompactionExecutor:242] 2020-09-15 19:24:58,545 LogTransaction.java:280 - Failed to abort transaction log [md_txn_compaction_c2dbca00-f780-11ea-95eb-cf88b1cae05a.log in /mnt/cass-a/data/system/local-7ad54392bcdd35a684174e047860b377]
ERROR [CompactionExecutor:242] 2020-09-15 19:24:58,566 LogTransaction.java:225 - Unable to delete /mnt/cass-a/data/system/local-7ad54392bcdd35a684174e047860b377/md_txn_compaction_c2dbca00-f780-11ea-95eb-cf88b1cae05a.log as it does not exist, see debug log file for stack trace
Cassandra starts up fine on the "broken node", but refuses to rejoin the cluster.
When I do a nodetool status I get this:
**Error: The node does not have system_traces yet, probably still bootstrapping**
Gossip is not running, i've tried disabling and re-enabling, no joy.
I've also tried both a repair and a rebuild, both came back with no errors at all.
Any and all help would be appreciated.
Thanks.
The symptoms you described indicates to me that the node had some form of hardware failure and the
data/
disk is possibly inaccessible.In instances like this, the disk failure policy in
cassandra.yaml
kicked in:This would explain why gossip is unavailable (on default port
7000
) and the node would not be accepting any client connections either (on default CQL port9042
).If there is an impending hardware failure, there's a good chance the disk/volume is mounted as read-only. There's also the possibility that the disk is full. Check the operating system logs for clues and you will likely need to escalate the issue to your sysadmin team. Cheers!