Undetectable tombstones in Apache Cassandra

One of the usual suspects for performance issues in the read path of Apache Cassandra is the presence of tombstones. Checking how many tombstones are accessed per read is one of the first steps we take to identify the possible cause of excessive GC pauses or high read latencies.
While trying to understand unexpectedly high read latencies for a customer a few months ago, we found out that one special (although fairly common) kind of tombstone was neither counted in the metrics nor traced in the logs: primary key deletes.

A tale of two tombstones

Tombstones in Cassandra fall into two main categories:

  • cell tombstones
  • range tombstones

The former shadows a single cell. This kind of tombstone is usually generated by TTLs: once a cell expires, it turns into a tombstone. It can also be the result of updating a cell to a null value or of deleting a single cell.

The latter can shadow multiple cells. This kind of tombstone is generated by DELETE statements, which can have different spans:

  • Partition deletes: a single tombstone shadows the whole partition
  • Multi-row deletes: a single tombstone shadows a subset of the partition
  • Single row deletes: a single tombstone shadows a single row only

To be fully accurate, range tombstones are always composed of two tombstones, one for each bound of the shadowed range.

Consider the following table:

CREATE TABLE test.tombstones (
    id int,
    clust1 text,
    clust2 text,
    val1 text,
    val2 text,
    PRIMARY KEY (id, clust1, clust2)
) WITH CLUSTERING ORDER BY (clust1 ASC, clust2 ASC);

Placing a partition-level range tombstone is achieved with the following query:

DELETE FROM test.tombstones WHERE id = ?

Placing a multi-row range tombstone requires the following query:

DELETE FROM test.tombstones WHERE id = ? AND clust1 = ?

Since Apache Cassandra 3.0.0, it is also possible to perform the following kind of range delete, thanks to CASSANDRA-6237:

DELETE FROM test.tombstones WHERE id = ? AND clust1 > ? AND clust1 <= ?

Creating a single row range tombstone is done as follows:

DELETE FROM test.tombstones WHERE id = ? AND clust1 = ? AND clust2 = ?

And we can create cell tombstones by deleting or nulling a single column, or by setting a TTL at write time, as shown in the statements below:

DELETE val1 from test.tombstones WHERE id = ? AND clust1 = ? AND clust2 = ? 

or:

UPDATE test.tombstones SET val1 = null WHERE id = ? AND clust1 = ? AND clust2 = ? 
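
or by writing the data with a TTL, which turns the cells into tombstones once they expire (the TTL value below is illustrative):

INSERT INTO test.tombstones (id, clust1, clust2, val1, val2) VALUES (?, ?, ?, ?, ?) USING TTL 86400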

Detecting the bug

Our customer had some pretty powerful hardware with top-notch SSD drives and more than enough CPU and RAM to get the best out of Apache Cassandra 3.11.1.

All tables but one were exhibiting correct read performance, and the cluster was going through a lot of long GC pauses (> 1s):

Keyspace     Table        p50 (µs)      p75 (µs)      p95 (µs)
ks1          table1       24985.15      62174.25     670692.26
ks1          table2         772.01       1965.33       9462.06
ks1          table3          39.03         56.19       2807.62

All three tables had the same data model, and the nodetool tablestats output showed no tombstones per read on any of them. Since what we were seeing made no sense, we suspected that one of the metrics we were using was not accurate.
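
For reference, these are the tablestats lines we were relying on, trimmed to the relevant metrics (the values shown here are illustrative):

		Average tombstones per slice (last five minutes): 0.0
		Maximum tombstones per slice (last five minutes): 0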

Although neither the logs nor the metrics were showing any tombstones being read, the customer confirmed that DELETE statements were regularly performed on the full primary key, so we experimented to find out why Cassandra behaved this way.

Reproducing the problem was fairly easy: we used CCM to test different versions of C*, with the test.tombstones table defined above.

We wrote a short Python script to populate the table with a configurable number of rows, all in a single partition spread over 10 clustering keys:

import sys
from cassandra.cluster import Cluster

# Connect to the local test cluster
session = Cluster(['127.0.0.1']).connect()
insert = session.prepare("""
    INSERT INTO test.tombstones (id, clust1, clust2, val1, val2)
    VALUES (1, ?, ?, 'test', 'test')""")

futures = []
for i in range(int(sys.argv[1])):
    futures.append(session.execute_async(insert, (str(i % 10), str(i))))
    if i % 1000 == 0 and i > 0:
        # Wait for all in-flight writes before queuing the next batch
        for future in futures:
            future.result()
        print(str(i + 1) + " rows...")
        futures = []

Then we wrote another script to delete a subset of the rows by full primary key (single row deletes):

# Imports and connection setup as in the populating script
delete = session.prepare("""
    DELETE FROM test.tombstones
    WHERE id = 1 AND clust1 = ? AND clust2 = ?""")

futures = []
for i in range(int(sys.argv[1])):
    futures.append(session.execute_async(delete, (str(i % 10), str(i))))
    if i % 1000 == 0 and i > 0:
        # Wait for all in-flight deletes before queuing the next batch
        for future in futures:
            future.result()
        print(str(i + 1) + " rows...")
        futures = []

The procedure was to run the first script to add 10,000 rows, flush all memtables (with nodetool flush), and then run the deletion script to remove 1,000 rows, so that the tombstones would be stored in different memtables/SSTables than the data they shadow.
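
For reference, the whole procedure only takes a few commands against a throwaway CCM cluster (a minimal sketch: the cluster name and the populate.py/delete.py script names are hypothetical, and the version can be swapped for each test):

ccm create tombstones-test -v 3.11.1 -n 1 -s
python populate.py 10000
ccm node1 nodetool flush
python delete.py 1000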

Reading the table in Apache Cassandra 2.0.13 gave the following traces:

cqlsh> tracing on
Now tracing requests.

cqlsh> select count(*) from test.tombstones where id = 1;

 count
-------
  9000

(1 rows)


Tracing session: 9849e2b0-7f8f-11e8-9faa-4f54a5a22ee7

 activity                                                                  | timestamp    | source    | source_elapsed
---------------------------------------------------------------------------+--------------+-----------+----------------
                                                        execute_cql3_query | 15:39:37,762 | 127.0.0.1 |              0
    Parsing select count(*) from test.tombstones where id = 1 LIMIT 10000; | 15:39:37,762 | 127.0.0.1 |             66
                                                       Preparing statement | 15:39:37,762 | 127.0.0.1 |            162
                            Executing single-partition query on tombstones | 15:39:37,763 | 127.0.0.1 |           1013
                                              Acquiring sstable references | 15:39:37,763 | 127.0.0.1 |           1035
                                               Merging memtable tombstones | 15:39:37,763 | 127.0.0.1 |           1080
                       Partition index with 16 entries found for sstable 3 | 15:39:37,763 | 127.0.0.1 |           1786
                               Seeking to partition beginning in data file | 15:39:37,763 | 127.0.0.1 |           1803
 Skipped 0/1 non-slice-intersecting sstables, included 0 due to tombstones | 15:39:37,764 | 127.0.0.1 |           2308
                                Merging data from memtables and 1 sstables | 15:39:37,764 | 127.0.0.1 |           2331
                                  Read 9000 live and 3000 tombstoned cells | 15:39:37,796 | 127.0.0.1 |          34053
                                                          Request complete | 15:39:37,831 | 127.0.0.1 |          69508                                                        

Although we deleted 1,000 rows, Apache Cassandra 2.0 reports that 3,000 tombstones were read: Read 9000 live and 3000 tombstoned cells.
Dumping one of the SSTables to JSON shows why: each row is composed of 3 cells (a row marker plus one cell for each of val1 and val2), so the 1,000 deleted rows account for 3,000 tombstoned cells:

[
	{"key": "00000001","columns": [
	...
		["0:100:","",1530711392080000], ["0:100:val1","test",1530711392080000], ["0:100:val2","test",1530711392080000],
		["0:1000:","",1530711392233000], ["0:1000:val1","test",1530711392233000], ["0:1000:val2","test",1530711392233000],
	...
	}
]

So C* 2.0 does not count the tombstones themselves, but rather the individual cells that they shadow (at least in this specific case).

Now with Apache Cassandra 2.1.18 to 3.11.1:

cqlsh> paging off
Disabled Query paging.
cqlsh> tracing on
Now Tracing is enabled
cqlsh> select count(*) from test.tombstones where id = 1;

 count
-------
  9000

(1 rows)

Tracing session: e39af870-7e14-11e8-a105-4f54a5a22ee7

 activity                                                                                        | timestamp                  | source    | source_elapsed
-------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------
                                                                              Execute CQL3 query | 2018-07-02 18:28:01.783000 | 127.0.0.1 |              0
                Parsing select count(*) from test.tombstones where id = 1; [SharedPool-Worker-1] | 2018-07-02 18:28:01.784000 | 127.0.0.1 |             53
...
...
                                      Read 9000 live and 0 tombstone cells [SharedPool-Worker-2] | 2018-07-02 18:28:01.810000 | 127.0.0.1 |          27004
                                                                                Request complete | 2018-07-02 18:28:01.825313 | 127.0.0.1 |          42313

From 2.1.x all the way to 3.11.1 (with 2.2.x and 3.0.x exhibiting the same behavior), no tombstones are reported anymore: Read 9000 live and 0 tombstone cells.

The issue did not affect the other kinds of range tombstones (partition and multi-row deletes), and cell tombstones were correctly counted as well.

It was then necessary to go down the read path to understand why some tombstones could be missed.

Merging mechanisms in Apache Cassandra

Being built on top of an LSM tree storage engine, C* has to merge cells that can be spread across memtables and several SSTables in order to return a consistent view of the queried data.
While the local node has to merge cells and tombstones together to get the current state of the rows, it also has to exchange tombstones with other nodes in case of digest mismatch, to perform read repair.
The output of the merge will be:

  • a list of rows, each containing a list of cells that can individually be live or not (a tombstoned cell is simply a cell that is no longer live)
  • a list of tombstones, which contains only range tombstones

Different kinds of tombstones get merged differently with the data they shadow, though:

  • Partition and multi-row tombstones: the shadowed data doesn’t survive the merge, and the range tombstones are returned as such so that they can be used for digest comparison and read repair.
  • Cell tombstones: they are returned as cells whose liveness is set to false.
  • Single row tombstones: they are merged upstream and aren’t returned as tombstones; the row is returned with no cells and its liveness set to false.

After a redesign of the read path in 2.1, which aimed at optimizing performance by filtering out unnecessary tombstoned cells during the merge phase, single row tombstones stopped being counted and became impossible to detect: only range tombstones that survived the merge and individually tombstoned cells were counted, while a row with no cells was silently skipped.

Patched for 3.11.2 and 4.0

CASSANDRA-8527 was committed for 3.11.2 and the upcoming 4.0 in order to count all empty (non-live) rows as tombstones:

cqlsh> SELECT count(*) from test.tombstones where id = 1;

 count
-------
  9000

(1 rows)

Tracing session: c0c19e20-7f91-11e8-90d3-b70479b8c91e

 activity                                                                                 | timestamp                  | source    | source_elapsed | client
------------------------------------------------------------------------------------------+----------------------------+-----------+----------------+-----------
                                                                       Execute CQL3 query | 2018-07-04 15:54:21.570000 | 127.0.0.1 |              0 | 127.0.0.1
 Parsing SELECT count(*) from test.tombstones where id = 1; [Native-Transport-Requests-1] | 2018-07-04 15:54:21.570000 | 127.0.0.1 |            173 | 127.0.0.1
                                        Preparing statement [Native-Transport-Requests-1] | 2018-07-04 15:54:21.570000 | 127.0.0.1 |            315 | 127.0.0.1
                             Executing single-partition query on tombstones [ReadStage-2] | 2018-07-04 15:54:21.571000 | 127.0.0.1 |            947 | 127.0.0.1
                                               Acquiring sstable references [ReadStage-2] | 2018-07-04 15:54:21.571000 | 127.0.0.1 |           1061 | 127.0.0.1
  Skipped 0/1 non-slice-intersecting sstables, included 0 due to tombstones [ReadStage-2] | 2018-07-04 15:54:21.571000 | 127.0.0.1 |           1271 | 127.0.0.1
                                                Key cache hit for sstable 1 [ReadStage-2] | 2018-07-04 15:54:21.572000 | 127.0.0.1 |           1698 | 127.0.0.1
                                  Merged data from memtables and 1 sstables [ReadStage-2] | 2018-07-04 15:54:21.613000 | 127.0.0.1 |          42574 | 127.0.0.1
                               Read 9000 live rows and 1000 tombstone cells [ReadStage-2] | 2018-07-04 15:54:21.613000 | 127.0.0.1 |          42746 | 127.0.0.1
                                                                         Request complete | 2018-07-04 15:54:21.620624 | 127.0.0.1 |          50624 | 127.0.0.1

This allows us to safely rely on traces and metrics when troubleshooting high read latencies in Apache Cassandra 3.11.2+.

Takeaways

Tombstones generated by full primary key deletes are neither reported in the logs nor counted in the metrics, from Apache Cassandra 2.1.0 up to 3.11.1.

You can only guess at their presence when you observe unexpected latencies that cannot be explained by anything else.
As a general rule, ask the dev teams whether they perform DELETE statements, and precisely which kind, even if the metrics suggest that there are no tombstones.

Upgrading to 3.11.2+ allows you to detect those tombstones in both logs and metrics. As a consequence, they now count toward the tombstone failure threshold (tombstone_failure_threshold in cassandra.yaml, which defaults to 100,000) above which C* aborts in-flight queries, so queries that succeeded before the upgrade may start failing.
