Dealing with uncompactable/overlapping sstables in Cassandra


We have a new cluster running Cassandra 2.2.14, and have left compactions to "sort themselves out". This is in our UAT environment, so load is low. We run STCS.

We are seeing forever-growing tombstones. I understand that compactions will take care of the data eventually, once the sstable is eligible for compaction. This is not occurring often enough for us, so I applied some settings as a test (I am aware they are aggressive, this is purely for testing):

'tombstone_compaction_interval': '120', 
'unchecked_tombstone_compaction': 'true', 
'tombstone_threshold': '0.2', 
'min_threshold': '2'
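
(For reference, settings like these are typically applied with an ALTER TABLE; the keyspace and table names below are placeholders:)

    ALTER TABLE my_keyspace.my_table
    WITH compaction = {
        'class': 'SizeTieredCompactionStrategy',
        'tombstone_compaction_interval': '120',
        'unchecked_tombstone_compaction': 'true',
        'tombstone_threshold': '0.2',
        'min_threshold': '2'
    };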

This did result in some compactions occurring; however, the number of tombstones dropped was low, and the droppable-tombstone ratio did not fall below the threshold (0.2). After these settings were applied, this is what I can see from sstablemetadata:

Estimated droppable tombstones: 0.3514636277302944
Estimated droppable tombstones: 0.0
Estimated droppable tombstones: 6.007563159628437E-5
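
(These values come from running sstablemetadata against each Data.db file; the path below is a placeholder:)

    sstablemetadata /var/lib/cassandra/data/<keyspace>/<table>-<id>/<sstable>-Data.db | grep droppable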

Note that this is only one CF, and there are much worse CFs out there (90% tombstones, etc.). I am using this one as an example, but all CFs are suffering the same symptoms.

tablestats:

               SSTable count: 3
                Space used (live): 3170892738
                Space used (total): 3170892738
                Space used by snapshots (total): 3170892750
                Off heap memory used (total): 1298648
                SSTable Compression Ratio: 0.8020960426857765
                Number of keys (estimate): 506775
                Memtable cell count: 4
                Memtable data size: 104
                Memtable off heap memory used: 0
                Memtable switch count: 2
                Local read count: 2161
                Local read latency: 14.531 ms
                Local write count: 212
                Local write latency: NaN ms
                Pending flushes: 0
                Bloom filter false positives: 0
                Bloom filter false ratio: 0.00000
                Bloom filter space used: 645872
                Bloom filter off heap memory used: 645848
                Index summary off heap memory used: 192512
                Compression metadata off heap memory used: 460288
                Compacted partition minimum bytes: 61
                Compacted partition maximum bytes: 5839588
                Compacted partition mean bytes: 8075
                Average live cells per slice (last five minutes): 1.0
                Maximum live cells per slice (last five minutes): 1
                Average tombstones per slice (last five minutes): 124.0
                Maximum tombstones per slice (last five minutes): 124

The obvious answer here is that the tombstones were not eligible for removal.

gc_grace_seconds is set to 10 days and has not been changed. I dumped one of the sstables to JSON, and I can see tombstones dating back to April 2019:

{"key": "353633393435353430313436373737353036315f657370a6215211e68263740a8cc4fdec",
 "cells": [["d62cf4f420fb11e6a92baabbb43c0a93",1566793260,1566793260977489,"d"],
           ["d727faf220fb11e6a67702e5d23e41ec",1566793260,1566793260977489,"d"],
           ["d7f082ba20fb11e6ac99efca1d29dc3f",1566793260,1566793260977489,"d"],
           ["d928644a20fb11e696696e95ac5b1fdd",1566793260,1566793260977489,"d"],
           ["d9ff10bc20fb11e69d2e7d79077d0b5f",1566793260,1566793260977489,"d"],
           ["da935d4420fb11e6a960171790617986",1566793260,1566793260977489,"d"],
           ["db6617c020fb11e6925271580ce42b57",1566793260,1566793260977489,"d"],
           ["dc6c40ae20fb11e6b1163ce2bad9d115",1566793260,1566793260977489,"d"],
           ["dd32495c20fb11e68f7979c545ad06e0",1566793260,1566793260977489,"d"],
           ["ddd7d9d020fb11e6837dd479bf59486e",1566793260,1566793260977489,"d"]]},

So I do not believe gc_grace_seconds is the issue here. I have run a manual user-defined compaction over every Data.db file within the column family folder (a single Data.db file at a time). Compactions ran, but there was very little change to the tombstone values. The old data still remains.

I can confirm repairs have occurred, yesterday actually. I can also confirm repairs have been running regularly, with no issues showing in the logs.

So repairs are fine. Compactions are fine. All I can think of is overlapping SSTables.

The final test was to run a full compaction on the column family. I performed a user-defined compaction (not nodetool compact) on the 3 SSTables using JMXterm. This resulted in a single SSTable file, with the following:

Estimated droppable tombstones: 9.89886650537452E-6
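
(For reference, a user-defined compaction over specific SSTables is typically triggered through the CompactionManager MBean's forceUserDefinedCompaction operation; via JMXterm it looks roughly like this, with placeholder jar and file names:)

    java -jar jmxterm-uber.jar -l localhost:7199
    $> bean org.apache.cassandra.db:type=CompactionManager
    $> run forceUserDefinedCompaction "<sstable1>-Data.db,<sstable2>-Data.db,<sstable3>-Data.db"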

If I look for the example epoch above (1566793260), it is not visible, nor is the key, so it was compacted away. The total number of lines containing a tombstone ("d") flag is 1317, out of the 120-million-line dump, and the epoch values are all within the last 10 days. Good.

So I assume the E-6 value is simply a very small percentage shown in scientific notation (roughly 0.001%). So, success, right? But it took a full compaction to remove the old tombstones, and as far as I am aware, a full compaction is only a last-ditch maneuver.

My questions are -

  1. How can I determine whether overlapping sstables are my issue here? I can't see any other reason why the data would not compact out unless it is related to overlapping.
  2. How can I resolve overlapping sstables without performing a full compaction? I am afraid this is simply going to recur in a few weeks' time. I don't want to get stuck performing full compactions regularly to keep tombstones at bay.
  3. What are the reasons for the creation of overlapping sstables? Is this a data design problem, or some other issue?

Cheers.


Best answer:

To answer your questions:

How can I determine whether overlapping sstables are my issue here? I can't see any other reason why the data would not compact out unless it is related to overlapping.

If the tombstones weren't generated with TTL, most of the time the tombstones and the shadowed data end up in different sstables. When using STCS with a low volume of writes into the cluster, few compactions will be triggered, which causes the tombstones to stay around for an extended time. If you have the partition key of a tombstone, running nodetool getsstables -- <keyspace> <table> <key> on a node will return all sstables that contain that key on the local node. You can dump the sstable contents to confirm.
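
For example (keyspace, table, and key below are placeholders):

    nodetool getsstables -- my_keyspace my_table '<partition-key>'
    # then inspect each returned file, e.g.:
    sstablemetadata /var/lib/cassandra/data/my_keyspace/my_table-<id>/<returned-sstable>-Data.db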

How can I resolve overlapping sstables without performing a full compaction? I am afraid this is simply going to recur in a few weeks' time. I don't want to get stuck performing full compactions regularly to keep tombstones at bay.

There is a newer option, nodetool compact -s, which performs a major compaction but splits the output into several sstables of different sizes. This solves the problem with a plain major compaction, which creates a single large sstable. If the droppable-tombstone ratio is as high as 80-90%, the resulting sstables will be even smaller, as the majority of the tombstones have been purged.
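
For example (keyspace and table names are placeholders):

    nodetool compact -s my_keyspace my_table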

In newer versions of Cassandra (3.10+), there is a new tool, nodetool garbagecollect, to clean up tombstones. However, the tool has limitations: not all kinds of tombstones can be removed by it.
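
For example (placeholder names again; this requires Cassandra 3.10 or later):

    nodetool garbagecollect my_keyspace my_table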

All that being said, for your situation with overlapping sstables, a low volume of activity, and infrequent compactions, you either have to find all the related sstables and run a user-defined compaction on them, or do a major compaction with -s. https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/tools/nodetool/toolsCompact.html

What are the reasons for the creation of overlapping sstables? Is this a data design problem, or some other issue?

Fast-growing tombstones usually indicate a data modeling problem: the application may be inserting nulls, periodically deleting data, or updating collections instead of appending to them. If your data is a time series, check whether it makes sense to use TTL and TWCS.
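
For a time-series table, a sketch of what that could look like (placeholder names; TWCS requires a Cassandra version that ships it):

    ALTER TABLE my_keyspace.my_timeseries
    WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': '1'
    }
    AND default_time_to_live = 864000;  -- expire rows after 10 days so whole sstables can be dropped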