The LIMIT clause in Apache Cassandra might not work as you think

Apache Cassandra can store data on disk in an orderly fashion, which makes it great for time series. As you may have seen in numerous tutorials, to get the last 10 rows of a time series, just use a descending clustering order and add a LIMIT 10 clause. Simple and efficient!
If we take a closer look, though, it might not be as efficient as one would think, as we will see in this blog post.

Cassandra! Give me the last 10 rows for this partition

Let’s say you’re using the following data model:

CREATE TABLE time_series (
    id uuid,
    ts timestamp,
    value text,
    PRIMARY KEY(id, ts)
) WITH CLUSTERING ORDER BY (ts DESC)
    AND compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_unit': 'MINUTES', 'compaction_window_size': '1'}
    AND default_time_to_live = 600
;

The above will create a single SSTable for each minute’s worth of data. Using a Python script, we will insert one row per second into the same partition and flush the memtables every minute, so that each SSTable holds exactly 60 rows.
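
For reference, here is a minimal sketch of such a script using the DataStax Python driver. It is illustrative only: the contact point, keyspace name, and the nodetool-based flush are assumptions, not the exact script used for this post.

# Sketch of the data generation script (illustrative, not the exact script).
import subprocess
import time
import uuid
from datetime import datetime

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('test')

partition_id = uuid.uuid4()
insert = session.prepare(
    "INSERT INTO time_series (id, ts, value) VALUES (?, ?, ?)")

for second in range(600):  # 10 minutes worth of data, one row per second
    session.execute(insert, (partition_id, datetime.utcnow(), 'test'))
    time.sleep(1)
    if second % 60 == 59:  # flush once per minute so each SSTable holds 60 rows
        subprocess.call(['nodetool', 'flush', 'test', 'time_series'])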

Our use case here is that we only want the last 10 records each time we query the database, and we’re using TWCS in the hope that we will only hit a single SSTable each time we do this (or maybe even only the memtables). Since we have a descending time-based clustering key, we would expect that query to be fairly efficient, right?

If we activate tracing and run our LIMIT 10 query on an Apache Cassandra 3.0.8 cluster, here’s our output:

cqlsh> TRACING ON;

cqlsh> select * from test.time_series where id = 9de0d7cd-4108-41f9-8723-a5c54938af16 LIMIT 10 ;

 id                                   | ts                       | value
--------------------------------------+--------------------------+-------
 9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:15:22+0000 |  test
 9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:15:21+0000 |  test
 9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:15:20+0000 |  test
 9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:15:19+0000 |  test
 9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:15:18+0000 |  test
 9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:15:17+0000 |  test
 9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:15:16+0000 |  test
 9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:15:15+0000 |  test
 9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:15:14+0000 |  test
 9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:15:13+0000 |  test

(10 rows)

Tracing session: 94db1cf0-ed16-11e6-acd2-d74cd6c92d07

 activity                                                                                                               | timestamp                  | source    | source_elapsed
------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------
                                                                                                     Execute CQL3 query | 2017-02-07 10:19:52.383000 | 127.0.0.1 |              0
 Parsing select * from test.time_series where id = 9de0d7cd-4108-41f9-8723-a5c54938af16 LIMIT 10 ; [SharedPool-Worker-1] | 2017-02-07 10:19:52.383000 | 127.0.0.1 |            161
                                                                              Preparing statement [SharedPool-Worker-1] | 2017-02-07 10:19:52.383000 | 127.0.0.1 |            356
                                                                     reading data from /127.0.0.2 [SharedPool-Worker-1] | 2017-02-07 10:19:52.383000 | 127.0.0.1 |            762
                                           READ message received from /127.0.0.1 [MessagingService-Incoming-/127.0.0.1] | 2017-02-07 10:19:52.384000 | 127.0.0.2 |             21
                                              Sending READ message to /127.0.0.2 [MessagingService-Outgoing-/127.0.0.2] | 2017-02-07 10:19:52.384000 | 127.0.0.1 |           1107
                                                   Executing single-partition query on time_series [SharedPool-Worker-1] | 2017-02-07 10:19:52.384000 | 127.0.0.2 |            222
                                                                     Acquiring sstable references [SharedPool-Worker-1] | 2017-02-07 10:19:52.384000 | 127.0.0.2 |            255
                                              Partition index with 0 entries found for sstable 12 [SharedPool-Worker-1] | 2017-02-07 10:19:52.385000 | 127.0.0.2 |            426
                                                                     Key cache hit for sstable 11 [SharedPool-Worker-1] | 2017-02-07 10:19:52.385000 | 127.0.0.2 |            542
                                                                     Key cache hit for sstable 10 [SharedPool-Worker-1] | 2017-02-07 10:19:52.385000 | 127.0.0.2 |            596
                                                                      Key cache hit for sstable 9 [SharedPool-Worker-1] | 2017-02-07 10:19:52.385000 | 127.0.0.2 |            642
                                                                      Key cache hit for sstable 8 [SharedPool-Worker-1] | 2017-02-07 10:19:52.385000 | 127.0.0.2 |            687
                                                                      Key cache hit for sstable 7 [SharedPool-Worker-1] | 2017-02-07 10:19:52.385000 | 127.0.0.2 |            731
                                                                      Key cache hit for sstable 6 [SharedPool-Worker-1] | 2017-02-07 10:19:52.385001 | 127.0.0.2 |            777
                                                                      Key cache hit for sstable 5 [SharedPool-Worker-1] | 2017-02-07 10:19:52.385001 | 127.0.0.2 |            816
                                                                      Key cache hit for sstable 4 [SharedPool-Worker-1] | 2017-02-07 10:19:52.385001 | 127.0.0.2 |            858
                        Skipped 0/9 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-1] | 2017-02-07 10:19:52.385001 | 127.0.0.2 |            908
                                                       Merging data from memtables and 9 sstables [SharedPool-Worker-1] | 2017-02-07 10:19:52.385001 | 127.0.0.2 |            934
                                                               Read 10 live and 0 tombstone cells [SharedPool-Worker-1] | 2017-02-07 10:19:52.385001 | 127.0.0.2 |           1146
                                                                 Enqueuing response to /127.0.0.1 [SharedPool-Worker-1] | 2017-02-07 10:19:52.385001 | 127.0.0.2 |           1187
                                  Sending REQUEST_RESPONSE message to /127.0.0.1 [MessagingService-Outgoing-/127.0.0.1] | 2017-02-07 10:19:52.386000 | 127.0.0.2 |           1410
                               REQUEST_RESPONSE message received from /127.0.0.2 [MessagingService-Incoming-/127.0.0.2] | 2017-02-07 10:19:52.386000 | 127.0.0.1 |           3179
                                                              Processing response from /127.0.0.2 [SharedPool-Worker-5] | 2017-02-07 10:19:52.386000 | 127.0.0.1 |           3369
                                                                                                       Request complete | 2017-02-07 10:19:52.386584 | 127.0.0.1 |           3584

The trace shows that data was merged from 9 SSTables and the memtables before returning 10 rows: “Merging data from memtables and 9 sstables”.

Let’s try it with 300 rows now:

cqlsh> select * from test.time_series where id = 9de0d7cd-4108-41f9-8723-a5c54938af16 LIMIT 300 ;

 id                                   | ts                       | value
--------------------------------------+--------------------------+-------
 9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:19:53+0000 |  test
 9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:19:52+0000 |  test
 ...
 ...
 9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:12:05+0000 |  test

(300 rows)

Tracing session: 964e5c00-ed16-11e6-acd2-d74cd6c92d07

 activity                                                                                                                | timestamp                  | source    | source_elapsed
-------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------
                                                                                                      Execute CQL3 query | 2017-02-07 10:19:54.816000 | 127.0.0.1 |              0
 Parsing select * from test.time_series where id = 9de0d7cd-4108-41f9-8723-a5c54938af16 LIMIT 300 ; [SharedPool-Worker-1] | 2017-02-07 10:19:54.816000 | 127.0.0.1 |            162
                                                                               Preparing statement [SharedPool-Worker-1] | 2017-02-07 10:19:54.816000 | 127.0.0.1 |            304
                                                    Executing single-partition query on time_series [SharedPool-Worker-2] | 2017-02-07 10:19:54.817000 | 127.0.0.1 |            720
                                                                      Acquiring sstable references [SharedPool-Worker-2] | 2017-02-07 10:19:54.817000 | 127.0.0.1 |            750
                                                                      Key cache hit for sstable 12 [SharedPool-Worker-2] | 2017-02-07 10:19:54.817000 | 127.0.0.1 |            809
                                                                      Key cache hit for sstable 11 [SharedPool-Worker-2] | 2017-02-07 10:19:54.817000 | 127.0.0.1 |            904
                                                                      Key cache hit for sstable 10 [SharedPool-Worker-2] | 2017-02-07 10:19:54.817000 | 127.0.0.1 |            965
                                                                       Key cache hit for sstable 9 [SharedPool-Worker-2] | 2017-02-07 10:19:54.817000 | 127.0.0.1 |           1007
                                                                       Key cache hit for sstable 8 [SharedPool-Worker-2] | 2017-02-07 10:19:54.817001 | 127.0.0.1 |           1054
                                                                       Key cache hit for sstable 7 [SharedPool-Worker-2] | 2017-02-07 10:19:54.817001 | 127.0.0.1 |           1125
                                                                       Key cache hit for sstable 6 [SharedPool-Worker-2] | 2017-02-07 10:19:54.817001 | 127.0.0.1 |           1170
                                                                       Key cache hit for sstable 5 [SharedPool-Worker-2] | 2017-02-07 10:19:54.817001 | 127.0.0.1 |           1209
                                                                       Key cache hit for sstable 4 [SharedPool-Worker-2] | 2017-02-07 10:19:54.817001 | 127.0.0.1 |           1251
                         Skipped 0/9 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-2] | 2017-02-07 10:19:54.817001 | 127.0.0.1 |           1287
                                                        Merging data from memtables and 9 sstables [SharedPool-Worker-2] | 2017-02-07 10:19:54.817001 | 127.0.0.1 |           1310
                                                               Read 300 live and 0 tombstone cells [SharedPool-Worker-2] | 2017-02-07 10:19:54.818000 | 127.0.0.1 |           2356
                                                                                                        Request complete | 2017-02-07 10:19:54.818704 | 127.0.0.1 |           2704

Once again, data was merged from 9 SSTables plus the memtables to get our 300 rows.

Worse, since we’ve set a 10-minute TTL on our data, once cells start to expire we end up reading a lot of tombstones:

cqlsh> select * from test.time_series where id = 9de0d7cd-4108-41f9-8723-a5c54938af16 LIMIT 80 ;

 id                                   | ts                       | value
--------------------------------------+--------------------------+-------
 9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:20:43+0000 |  test
 9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:20:42+0000 |  test
 9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:20:41+0000 |  test
 ...
 ...
 9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:19:44+0000 |  test
 9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:19:43+0000 |  test

(60 rows)

Tracing session: e7078800-ed17-11e6-acd2-d74cd6c92d07

 activity                                                                                                               | timestamp                  | source    | source_elapsed
------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------
                                                                                                     Execute CQL3 query | 2017-02-07 10:29:19.744000 | 127.0.0.1 |              0
 Parsing select * from test.time_series where id = 9de0d7cd-4108-41f9-8723-a5c54938af16 LIMIT 80 ; [SharedPool-Worker-1] | 2017-02-07 10:29:19.744000 | 127.0.0.1 |            124
                                                                              Preparing statement [SharedPool-Worker-1] | 2017-02-07 10:29:19.744000 | 127.0.0.1 |            298
                                                   Executing single-partition query on time_series [SharedPool-Worker-2] | 2017-02-07 10:29:19.745000 | 127.0.0.1 |            823
                                                                     Acquiring sstable references [SharedPool-Worker-2] | 2017-02-07 10:29:19.745000 | 127.0.0.1 |            901
                                              Partition index with 0 entries found for sstable 13 [SharedPool-Worker-2] | 2017-02-07 10:29:19.745000 | 127.0.0.1 |           1091
                                                                     Key cache hit for sstable 12 [SharedPool-Worker-2] | 2017-02-07 10:29:19.745000 | 127.0.0.1 |           1195
                                                                     Key cache hit for sstable 11 [SharedPool-Worker-2] | 2017-02-07 10:29:19.745000 | 127.0.0.1 |           1243
                                                                     Key cache hit for sstable 10 [SharedPool-Worker-2] | 2017-02-07 10:29:19.745000 | 127.0.0.1 |           1317
                                                                      Key cache hit for sstable 9 [SharedPool-Worker-2] | 2017-02-07 10:29:19.745001 | 127.0.0.1 |           1352
                                                                      Key cache hit for sstable 8 [SharedPool-Worker-2] | 2017-02-07 10:29:19.745001 | 127.0.0.1 |           1382
                                                                      Key cache hit for sstable 7 [SharedPool-Worker-2] | 2017-02-07 10:29:19.745001 | 127.0.0.1 |           1413
                                                                      Key cache hit for sstable 6 [SharedPool-Worker-2] | 2017-02-07 10:29:19.745001 | 127.0.0.1 |           1449
                                                                      Key cache hit for sstable 5 [SharedPool-Worker-2] | 2017-02-07 10:29:19.745001 | 127.0.0.1 |           1483
                                                                      Key cache hit for sstable 4 [SharedPool-Worker-2] | 2017-02-07 10:29:19.745001 | 127.0.0.1 |           1513
                       Skipped 0/10 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-2] | 2017-02-07 10:29:19.745001 | 127.0.0.1 |           1543
                                                      Merging data from memtables and 10 sstables [SharedPool-Worker-2] | 2017-02-07 10:29:19.745001 | 127.0.0.1 |           1567
                                                             Read 60 live and 540 tombstone cells [SharedPool-Worker-2] | 2017-02-07 10:29:19.748000 | 127.0.0.1 |           3968
                                                                                                       Request complete | 2017-02-07 10:29:19.749004 | 127.0.0.1 |           5004

The trace above reads 540 tombstone cells for only 60 live ones, which might explain why you are hitting far more SSTables and reading far more tombstones than you would expect.

Why?

Until recently, Cassandra was not able to determine whether a row partially exists in multiple SSTables. On top of that, remember that compaction strategies can mix old and new data into one SSTable, so reading only the “latest” SSTables would give no guarantee that we are reading everything that exists for a specific row. Cassandra must therefore read the entire partition, merge the data, and only then return the first 10 rows, to make sure no cells were missed. Limiting the number of rows in the result set happens very late in the read path, so Cassandra still performs a lot of disk seeks that can negatively affect latencies.

Leveled Compaction Strategy (LCS) usually mitigates this issue because it works hard at keeping a specific partition in at most one SSTable per level, at the expense of high I/O and CPU consumption.

The optimization that allows Cassandra to read only a subset of the SSTables came in 3.4 with CASSANDRA-8180.
Since that release, Cassandra uses the min/max clustering key values recorded in each SSTable’s metadata, orders the SSTables accordingly, and processes them one at a time until it reaches the requested number of rows.
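
To illustrate the principle, here is a rough sketch in Python. It is not Cassandra’s actual implementation: each “SSTable” is modeled as a plain list of clustering values, with max() standing in for the max clustering value stored in the real SSTable metadata.

# Rough illustration of the CASSANDRA-8180 idea, not Cassandra's actual code.
def read_limited(sstables, limit):
    # Process SSTables newest-first, ordered by their max clustering value,
    # to match our DESC clustering order.
    ordered = sorted(sstables, key=max, reverse=True)
    merged = []
    for rows in ordered:
        # Once we hold `limit` rows that are all strictly newer than anything
        # the next SSTable could contain, no further SSTable needs to be read.
        if len(merged) >= limit and min(merged[:limit]) > max(rows):
            break
        merged = sorted(set(merged) | set(rows), reverse=True)
    return merged[:limit]

# Three one-minute "SSTables" of per-second timestamps, oldest first:
tables = [list(range(0, 60)), list(range(60, 120)), list(range(120, 180))]
print(read_limited(tables, 10))  # stops after reading only the newest table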

But I’m still using 2.x!

And this is a totally valid choice, as the 3.x tick-tock releases still lack production testing at scale to be considered stable.
If you’re running queries that use the LIMIT clause on clustering keys with a descending order, the only way to avoid reading all the SSTables that contain the target partition is to add a range condition on the clustering key.

Based on our query above, and since we’re using TWCS with 1-minute time windows, we can add a condition on the ts column to query only the last minute:

cqlsh:test> select * from time_series where id = 4ba7505d-e4e5-4000-8454-f7cc919bafab and ts >= '2017-02-07 14:13:00' limit 10;

 id                                   | ts                       | value
--------------------------------------+--------------------------+-------
 4ba7505d-e4e5-4000-8454-f7cc919bafab | 2017-02-07 13:14:31+0000 |  test
 4ba7505d-e4e5-4000-8454-f7cc919bafab | 2017-02-07 13:14:30+0000 |  test
 4ba7505d-e4e5-4000-8454-f7cc919bafab | 2017-02-07 13:14:29+0000 |  test
 4ba7505d-e4e5-4000-8454-f7cc919bafab | 2017-02-07 13:14:28+0000 |  test
 4ba7505d-e4e5-4000-8454-f7cc919bafab | 2017-02-07 13:14:27+0000 |  test
 4ba7505d-e4e5-4000-8454-f7cc919bafab | 2017-02-07 13:14:26+0000 |  test
 4ba7505d-e4e5-4000-8454-f7cc919bafab | 2017-02-07 13:14:25+0000 |  test
 4ba7505d-e4e5-4000-8454-f7cc919bafab | 2017-02-07 13:14:24+0000 |  test
 4ba7505d-e4e5-4000-8454-f7cc919bafab | 2017-02-07 13:14:23+0000 |  test
 4ba7505d-e4e5-4000-8454-f7cc919bafab | 2017-02-07 13:14:22+0000 |  test

(10 rows)

Tracing session: 5ce2cb10-ed37-11e6-9747-fd68e1eea510

 activity                                                                                                                                         | timestamp                  | source    | source_elapsed
--------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------
                                                                                                                               Execute CQL3 query | 2017-02-07 14:14:31.873000 | 127.0.0.1 |              0
 Parsing select * from time_series where id = 4ba7505d-e4e5-4000-8454-f7cc919bafab and ts >= '2017-02-07 14:13:00' limit 10; [SharedPool-Worker-1] | 2017-02-07 14:14:31.873000 | 127.0.0.1 |            155
                                                                                                        Preparing statement [SharedPool-Worker-1] | 2017-02-07 14:14:31.873000 | 127.0.0.1 |            321
                                                                                               reading data from /127.0.0.2 [SharedPool-Worker-1] | 2017-02-07 14:14:31.874000 | 127.0.0.1 |           1224
                                                                        Sending READ message to /127.0.0.2 [MessagingService-Outgoing-/127.0.0.2] | 2017-02-07 14:14:31.874000 | 127.0.0.1 |           1463
                                                                     READ message received from /127.0.0.1 [MessagingService-Incoming-/127.0.0.1] | 2017-02-07 14:14:31.875000 | 127.0.0.2 |             25
                                                                             Executing single-partition query on time_series [SharedPool-Worker-2] | 2017-02-07 14:14:31.875000 | 127.0.0.2 |            186
                                                                                               Acquiring sstable references [SharedPool-Worker-2] | 2017-02-07 14:14:31.875000 | 127.0.0.2 |            239
                                                                                               Key cache hit for sstable 38 [SharedPool-Worker-2] | 2017-02-07 14:14:31.875000 | 127.0.0.2 |            370
                                                                                               Key cache hit for sstable 37 [SharedPool-Worker-2] | 2017-02-07 14:14:31.875000 | 127.0.0.2 |            609
                                                                                               Key cache hit for sstable 36 [SharedPool-Worker-2] | 2017-02-07 14:14:31.875000 | 127.0.0.2 |            721
                                                                                               Key cache hit for sstable 35 [SharedPool-Worker-2] | 2017-02-07 14:14:31.876000 | 127.0.0.2 |            794
                                                                                               Key cache hit for sstable 34 [SharedPool-Worker-2] | 2017-02-07 14:14:31.876000 | 127.0.0.2 |            869
                                                                                               Key cache hit for sstable 33 [SharedPool-Worker-2] | 2017-02-07 14:14:31.876000 | 127.0.0.2 |            941
                                                  Skipped 5/6 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-2] | 2017-02-07 14:14:31.876000 | 127.0.0.2 |           1002
                                                                                 Merging data from memtables and 1 sstables [SharedPool-Worker-2] | 2017-02-07 14:14:31.876000 | 127.0.0.2 |           1041
                                                                                         Read 10 live and 0 tombstone cells [SharedPool-Worker-2] | 2017-02-07 14:14:31.876000 | 127.0.0.2 |           1350
                                                                                           Enqueuing response to /127.0.0.1 [SharedPool-Worker-2] | 2017-02-07 14:14:31.876001 | 127.0.0.2 |           1416
                                                            Sending REQUEST_RESPONSE message to /127.0.0.1 [MessagingService-Outgoing-/127.0.0.1] | 2017-02-07 14:14:31.876001 | 127.0.0.2 |           1517
                                                         REQUEST_RESPONSE message received from /127.0.0.2 [MessagingService-Incoming-/127.0.0.2] | 2017-02-07 14:14:31.877000 | 127.0.0.1 |           3655
                                                                                        Processing response from /127.0.0.2 [SharedPool-Worker-4] | 2017-02-07 14:14:31.877000 | 127.0.0.1 |           3945
                                                                                                                                 Request complete | 2017-02-07 14:14:31.877250 | 127.0.0.1 |           4250

Here Cassandra skips all the SSTables that do not intersect our timestamp range and effectively reads a single SSTable instead of 10.
If we do not know for certain that all 10 rows fall in the latest time window, we can loop over time windows, going further back in time until we get the required number of rows, as sketched below. This effectively reproduces, on the client side, part of what CASSANDRA-8180 added internally.
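
Here is a minimal sketch of such a loop with the Python driver. The 1-minute window arithmetic, the max_windows bound (matching our 10-minute TTL), and all names are illustrative assumptions:

# Client-side fallback for pre-3.4 clusters: walk time windows, newest first,
# until enough rows have been collected (illustrative sketch).
from datetime import datetime, timedelta

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('test')
query = session.prepare(
    "SELECT * FROM time_series WHERE id = ? AND ts >= ? AND ts < ? LIMIT ?")

def last_n_rows(partition_id, limit, window=timedelta(minutes=1), max_windows=10):
    rows, upper = [], datetime.utcnow()
    for _ in range(max_windows):  # bounded by our 10-minute TTL horizon
        lower = upper - window
        rows += list(session.execute(
            query, (partition_id, lower, upper, limit - len(rows))))
        if len(rows) >= limit:
            break
        upper = lower  # step back one time window
    return rows[:limit]

Each iteration restricts the query to a single TWCS window, so Cassandra can skip every SSTable outside that window, and the loop stops as soon as enough rows have been collected.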

Takeaways

Read latency in Apache Cassandra is closely related to the number of SSTables your queries have to read from.
If you’re using a version prior to 3.4, reading a partition whose rows are spread over multiple SSTables will read from all of those SSTables, whether you add a LIMIT clause or not. This is true regardless of the clustering order your table uses.
To reduce read latencies with compaction strategies that produce a large number of SSTables, you must add a restrictive clause on the clustering key. This allows Cassandra to exclude the SSTables whose clustering range falls outside the requested one.

cassandra read path limit twcs