The LIMIT clause in Apache Cassandra might not work as you think
Apache Cassandra stores rows on disk sorted by clustering key, which makes it great for time series. As you may have seen in numerous tutorials, to get the last 10 rows of a time series, just use a descending clustering order and add a LIMIT 10 clause. Simple and efficient!
Well, if we take a closer look, it might not be as efficient as one would think, as we will see in this blog post.
Cassandra! Give me the last 10 rows for this partition
Let’s say you’re using the following data model:
CREATE TABLE time_series (
id uuid,
ts timestamp,
value text,
PRIMARY KEY(id, ts)
) WITH CLUSTERING ORDER BY(ts DESC)
AND compaction = {'class':'TimeWindowCompactionStrategy', 'compaction_window_unit':'MINUTES','compaction_window_size':'1'}
AND default_time_to_live=600
;
The above will create a single SSTable for each minute’s worth of data. Using a Python script, we will insert one row per second into the same partition and flush the memtables every minute, so that each SSTable holds exactly 60 rows.
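The loading script itself isn’t shown in this post, but a minimal sketch of it could look like the following, assuming a single local node, the DataStax Python driver, and nodetool available on the PATH:

import subprocess
import time
import uuid
from datetime import datetime

from cassandra.cluster import Cluster  # DataStax Python driver

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('test')

partition_id = uuid.uuid4()
insert = session.prepare(
    "INSERT INTO time_series (id, ts, value) VALUES (?, ?, ?)")

for second in range(600):  # 10 minutes of data, one row per second
    session.execute(insert, (partition_id, datetime.utcnow(), 'test'))
    if second % 60 == 59:  # flush the memtables once per minute
        subprocess.call(['nodetool', 'flush', 'test', 'time_series'])
    time.sleep(1)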
Our use case is that we only want the last 10 records each time we query the database, and we’re using TWCS in the hope of hitting a single SSTable each time we do so (or maybe even only the memtables). Since we have a descending time-based clustering key, we would expect that query to be fairly efficient, right?
If we activate tracing and run our LIMIT 10 query on an Apache Cassandra 3.0.8 cluster, here’s our output:
cqlsh> TRACING ON;
cqlsh> select * from test.time_series where id = 9de0d7cd-4108-41f9-8723-a5c54938af16 LIMIT 10 ;
id | ts | value
--------------------------------------+--------------------------+-------
9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:15:22+0000 | test
9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:15:21+0000 | test
9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:15:20+0000 | test
9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:15:19+0000 | test
9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:15:18+0000 | test
9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:15:17+0000 | test
9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:15:16+0000 | test
9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:15:15+0000 | test
9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:15:14+0000 | test
9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:15:13+0000 | test
(10 rows)
Tracing session: 94db1cf0-ed16-11e6-acd2-d74cd6c92d07
activity | timestamp | source | source_elapsed
------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------
Execute CQL3 query | 2017-02-07 10:19:52.383000 | 127.0.0.1 | 0
Parsing select * from test.time_series where id = 9de0d7cd-4108-41f9-8723-a5c54938af16 LIMIT 10 ; [SharedPool-Worker-1] | 2017-02-07 10:19:52.383000 | 127.0.0.1 | 161
Preparing statement [SharedPool-Worker-1] | 2017-02-07 10:19:52.383000 | 127.0.0.1 | 356
reading data from /127.0.0.2 [SharedPool-Worker-1] | 2017-02-07 10:19:52.383000 | 127.0.0.1 | 762
READ message received from /127.0.0.1 [MessagingService-Incoming-/127.0.0.1] | 2017-02-07 10:19:52.384000 | 127.0.0.2 | 21
Sending READ message to /127.0.0.2 [MessagingService-Outgoing-/127.0.0.2] | 2017-02-07 10:19:52.384000 | 127.0.0.1 | 1107
Executing single-partition query on time_series [SharedPool-Worker-1] | 2017-02-07 10:19:52.384000 | 127.0.0.2 | 222
Acquiring sstable references [SharedPool-Worker-1] | 2017-02-07 10:19:52.384000 | 127.0.0.2 | 255
Partition index with 0 entries found for sstable 12 [SharedPool-Worker-1] | 2017-02-07 10:19:52.385000 | 127.0.0.2 | 426
Key cache hit for sstable 11 [SharedPool-Worker-1] | 2017-02-07 10:19:52.385000 | 127.0.0.2 | 542
Key cache hit for sstable 10 [SharedPool-Worker-1] | 2017-02-07 10:19:52.385000 | 127.0.0.2 | 596
Key cache hit for sstable 9 [SharedPool-Worker-1] | 2017-02-07 10:19:52.385000 | 127.0.0.2 | 642
Key cache hit for sstable 8 [SharedPool-Worker-1] | 2017-02-07 10:19:52.385000 | 127.0.0.2 | 687
Key cache hit for sstable 7 [SharedPool-Worker-1] | 2017-02-07 10:19:52.385000 | 127.0.0.2 | 731
Key cache hit for sstable 6 [SharedPool-Worker-1] | 2017-02-07 10:19:52.385001 | 127.0.0.2 | 777
Key cache hit for sstable 5 [SharedPool-Worker-1] | 2017-02-07 10:19:52.385001 | 127.0.0.2 | 816
Key cache hit for sstable 4 [SharedPool-Worker-1] | 2017-02-07 10:19:52.385001 | 127.0.0.2 | 858
Skipped 0/9 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-1] | 2017-02-07 10:19:52.385001 | 127.0.0.2 | 908
Merging data from memtables and 9 sstables [SharedPool-Worker-1] | 2017-02-07 10:19:52.385001 | 127.0.0.2 | 934
Read 10 live and 0 tombstone cells [SharedPool-Worker-1] | 2017-02-07 10:19:52.385001 | 127.0.0.2 | 1146
Enqueuing response to /127.0.0.1 [SharedPool-Worker-1] | 2017-02-07 10:19:52.385001 | 127.0.0.2 | 1187
Sending REQUEST_RESPONSE message to /127.0.0.1 [MessagingService-Outgoing-/127.0.0.1] | 2017-02-07 10:19:52.386000 | 127.0.0.2 | 1410
REQUEST_RESPONSE message received from /127.0.0.2 [MessagingService-Incoming-/127.0.0.2] | 2017-02-07 10:19:52.386000 | 127.0.0.1 | 3179
Processing response from /127.0.0.2 [SharedPool-Worker-5] | 2017-02-07 10:19:52.386000 | 127.0.0.1 | 3369
Request complete | 2017-02-07 10:19:52.386584 | 127.0.0.1 | 3584
The trace shows that data was merged from the memtables and 9 SSTables before returning 10 rows: “Merging data from memtables and 9 sstables”.
Let’s try it on 300 rows now:
cqlsh> select * from test.time_series where id = 9de0d7cd-4108-41f9-8723-a5c54938af16 LIMIT 300 ;
id | ts | value
--------------------------------------+--------------------------+-------
9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:19:53+0000 | test
9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:19:52+0000 | test
...
...
9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:12:05+0000 | test
(300 rows)
Tracing session: 964e5c00-ed16-11e6-acd2-d74cd6c92d07
activity | timestamp | source | source_elapsed
-------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------
Execute CQL3 query | 2017-02-07 10:19:54.816000 | 127.0.0.1 | 0
Parsing select * from test.time_series where id = 9de0d7cd-4108-41f9-8723-a5c54938af16 LIMIT 300 ; [SharedPool-Worker-1] | 2017-02-07 10:19:54.816000 | 127.0.0.1 | 162
Preparing statement [SharedPool-Worker-1] | 2017-02-07 10:19:54.816000 | 127.0.0.1 | 304
Executing single-partition query on time_series [SharedPool-Worker-2] | 2017-02-07 10:19:54.817000 | 127.0.0.1 | 720
Acquiring sstable references [SharedPool-Worker-2] | 2017-02-07 10:19:54.817000 | 127.0.0.1 | 750
Key cache hit for sstable 12 [SharedPool-Worker-2] | 2017-02-07 10:19:54.817000 | 127.0.0.1 | 809
Key cache hit for sstable 11 [SharedPool-Worker-2] | 2017-02-07 10:19:54.817000 | 127.0.0.1 | 904
Key cache hit for sstable 10 [SharedPool-Worker-2] | 2017-02-07 10:19:54.817000 | 127.0.0.1 | 965
Key cache hit for sstable 9 [SharedPool-Worker-2] | 2017-02-07 10:19:54.817000 | 127.0.0.1 | 1007
Key cache hit for sstable 8 [SharedPool-Worker-2] | 2017-02-07 10:19:54.817001 | 127.0.0.1 | 1054
Key cache hit for sstable 7 [SharedPool-Worker-2] | 2017-02-07 10:19:54.817001 | 127.0.0.1 | 1125
Key cache hit for sstable 6 [SharedPool-Worker-2] | 2017-02-07 10:19:54.817001 | 127.0.0.1 | 1170
Key cache hit for sstable 5 [SharedPool-Worker-2] | 2017-02-07 10:19:54.817001 | 127.0.0.1 | 1209
Key cache hit for sstable 4 [SharedPool-Worker-2] | 2017-02-07 10:19:54.817001 | 127.0.0.1 | 1251
Skipped 0/9 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-2] | 2017-02-07 10:19:54.817001 | 127.0.0.1 | 1287
Merging data from memtables and 9 sstables [SharedPool-Worker-2] | 2017-02-07 10:19:54.817001 | 127.0.0.1 | 1310
Read 300 live and 0 tombstone cells [SharedPool-Worker-2] | 2017-02-07 10:19:54.818000 | 127.0.0.1 | 2356
Request complete | 2017-02-07 10:19:54.818704 | 127.0.0.1 | 2704
Once again, data was merged from 9 SSTables plus the memtables to get our 300 rows.
Worse, as we’ve set a 10-minute TTL on our data, once cells start to expire we are going to be reading a lot of tombstones:
cqlsh> select * from test.time_series where id = 9de0d7cd-4108-41f9-8723-a5c54938af16 LIMIT 80 ;
id | ts | value
--------------------------------------+--------------------------+-------
9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:20:43+0000 | test
9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:20:42+0000 | test
9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:20:41+0000 | test
...
...
9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:19:44+0000 | test
9de0d7cd-4108-41f9-8723-a5c54938af16 | 2017-02-07 09:19:43+0000 | test
(60 rows)
Tracing session: e7078800-ed17-11e6-acd2-d74cd6c92d07
activity | timestamp | source | source_elapsed
------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------
Execute CQL3 query | 2017-02-07 10:29:19.744000 | 127.0.0.1 | 0
Parsing select * from test.time_series where id = 9de0d7cd-4108-41f9-8723-a5c54938af16 LIMIT 80 ; [SharedPool-Worker-1] | 2017-02-07 10:29:19.744000 | 127.0.0.1 | 124
Preparing statement [SharedPool-Worker-1] | 2017-02-07 10:29:19.744000 | 127.0.0.1 | 298
Executing single-partition query on time_series [SharedPool-Worker-2] | 2017-02-07 10:29:19.745000 | 127.0.0.1 | 823
Acquiring sstable references [SharedPool-Worker-2] | 2017-02-07 10:29:19.745000 | 127.0.0.1 | 901
Partition index with 0 entries found for sstable 13 [SharedPool-Worker-2] | 2017-02-07 10:29:19.745000 | 127.0.0.1 | 1091
Key cache hit for sstable 12 [SharedPool-Worker-2] | 2017-02-07 10:29:19.745000 | 127.0.0.1 | 1195
Key cache hit for sstable 11 [SharedPool-Worker-2] | 2017-02-07 10:29:19.745000 | 127.0.0.1 | 1243
Key cache hit for sstable 10 [SharedPool-Worker-2] | 2017-02-07 10:29:19.745000 | 127.0.0.1 | 1317
Key cache hit for sstable 9 [SharedPool-Worker-2] | 2017-02-07 10:29:19.745001 | 127.0.0.1 | 1352
Key cache hit for sstable 8 [SharedPool-Worker-2] | 2017-02-07 10:29:19.745001 | 127.0.0.1 | 1382
Key cache hit for sstable 7 [SharedPool-Worker-2] | 2017-02-07 10:29:19.745001 | 127.0.0.1 | 1413
Key cache hit for sstable 6 [SharedPool-Worker-2] | 2017-02-07 10:29:19.745001 | 127.0.0.1 | 1449
Key cache hit for sstable 5 [SharedPool-Worker-2] | 2017-02-07 10:29:19.745001 | 127.0.0.1 | 1483
Key cache hit for sstable 4 [SharedPool-Worker-2] | 2017-02-07 10:29:19.745001 | 127.0.0.1 | 1513
Skipped 0/10 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-2] | 2017-02-07 10:29:19.745001 | 127.0.0.1 | 1543
Merging data from memtables and 10 sstables [SharedPool-Worker-2] | 2017-02-07 10:29:19.745001 | 127.0.0.1 | 1567
Read 60 live and 540 tombstone cells [SharedPool-Worker-2] | 2017-02-07 10:29:19.748000 | 127.0.0.1 | 3968
Request complete | 2017-02-07 10:29:19.749004 | 127.0.0.1 | 5004
This might explain why you are reading way more tombstones and hitting way more SSTables than you would expect.
Why?
So far, Cassandra has not been optimized to determine whether a row exists partially in multiple SSTables or not. On top of that, remember that compaction strategies can mix old and new data into one SSTable, so reading only the “latest” SSTables gives no guarantee that we’re reading everything that exists for a specific row. Thus Cassandra must read the entire partition, merge the data, and only then return the first 10 rows to make sure no cells are missed. Limiting the number of rows in the result set happens very late in the read path, and Cassandra still performs a lot of disk seeks that can negatively affect latencies.
Leveled Compaction Strategy (LCS) usually mitigates this issue because it works hard at keeping a specific partition in at most a single SSTable per level, at the expense of high I/O and CPU consumption.
The optimization that allows Cassandra to read only a subset of the SSTables came in 3.4 with CASSANDRA-8180.
Since that release, Cassandra uses the min/max clustering values stored for each SSTable, orders the SSTables accordingly, and processes them one at a time until it reaches the requested number of rows.
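To illustrate the idea, here is a deliberately simplified Python sketch (not Cassandra’s actual implementation: it ignores cell-level reconciliation and tombstones, and the max_ts and rows attributes are assumptions made for the example). SSTables are visited from newest to oldest, and reading stops as soon as the remaining SSTables cannot contain anything newer than the rows already collected:

import heapq

def read_last_n(sstables, limit):
    # Each sstable is assumed to expose `max_ts` (its highest clustering value)
    # and `rows`, a list of (ts, value) tuples sorted in descending ts order.
    ordered = sorted(sstables, key=lambda s: s.max_ts, reverse=True)
    merged = []
    for i, sstable in enumerate(ordered):
        # Merge this SSTable in and keep only the `limit` most recent rows.
        merged = list(heapq.merge(merged, sstable.rows, reverse=True))[:limit]
        # Early exit: the next SSTable cannot hold anything newer than the
        # oldest row we already kept, so it cannot change the result.
        if len(merged) == limit and (i + 1 == len(ordered)
                                     or ordered[i + 1].max_ts <= merged[-1][0]):
            break
    return merged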
But I’m still using 2.x!
And this is a totally valid choice, as the 3.x tick-tock releases still lack production testing at scale to be considered stable.
If you’re running queries that use the LIMIT clause on DESC-ordered clustering keys, the only way to avoid reading all the SSTables that contain the target partition is to add a range clause on the clustering key.
Based on our query above, and since we’re using TWCS with 1-minute time windows, we can add a condition on the ts column to only query the last minute:
cqlsh:test> select * from time_series where id = 4ba7505d-e4e5-4000-8454-f7cc919bafab and ts >= '2017-02-07 14:13:00' limit 10;
id | ts | value
--------------------------------------+--------------------------+-------
4ba7505d-e4e5-4000-8454-f7cc919bafab | 2017-02-07 13:14:31+0000 | test
4ba7505d-e4e5-4000-8454-f7cc919bafab | 2017-02-07 13:14:30+0000 | test
4ba7505d-e4e5-4000-8454-f7cc919bafab | 2017-02-07 13:14:29+0000 | test
4ba7505d-e4e5-4000-8454-f7cc919bafab | 2017-02-07 13:14:28+0000 | test
4ba7505d-e4e5-4000-8454-f7cc919bafab | 2017-02-07 13:14:27+0000 | test
4ba7505d-e4e5-4000-8454-f7cc919bafab | 2017-02-07 13:14:26+0000 | test
4ba7505d-e4e5-4000-8454-f7cc919bafab | 2017-02-07 13:14:25+0000 | test
4ba7505d-e4e5-4000-8454-f7cc919bafab | 2017-02-07 13:14:24+0000 | test
4ba7505d-e4e5-4000-8454-f7cc919bafab | 2017-02-07 13:14:23+0000 | test
4ba7505d-e4e5-4000-8454-f7cc919bafab | 2017-02-07 13:14:22+0000 | test
(10 rows)
Tracing session: 5ce2cb10-ed37-11e6-9747-fd68e1eea510
activity | timestamp | source | source_elapsed
--------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------
Execute CQL3 query | 2017-02-07 14:14:31.873000 | 127.0.0.1 | 0
Parsing select * from time_series where id = 4ba7505d-e4e5-4000-8454-f7cc919bafab and ts >= '2017-02-07 14:13:00' limit 10; [SharedPool-Worker-1] | 2017-02-07 14:14:31.873000 | 127.0.0.1 | 155
Preparing statement [SharedPool-Worker-1] | 2017-02-07 14:14:31.873000 | 127.0.0.1 | 321
reading data from /127.0.0.2 [SharedPool-Worker-1] | 2017-02-07 14:14:31.874000 | 127.0.0.1 | 1224
Sending READ message to /127.0.0.2 [MessagingService-Outgoing-/127.0.0.2] | 2017-02-07 14:14:31.874000 | 127.0.0.1 | 1463
READ message received from /127.0.0.1 [MessagingService-Incoming-/127.0.0.1] | 2017-02-07 14:14:31.875000 | 127.0.0.2 | 25
Executing single-partition query on time_series [SharedPool-Worker-2] | 2017-02-07 14:14:31.875000 | 127.0.0.2 | 186
Acquiring sstable references [SharedPool-Worker-2] | 2017-02-07 14:14:31.875000 | 127.0.0.2 | 239
Key cache hit for sstable 38 [SharedPool-Worker-2] | 2017-02-07 14:14:31.875000 | 127.0.0.2 | 370
Key cache hit for sstable 37 [SharedPool-Worker-2] | 2017-02-07 14:14:31.875000 | 127.0.0.2 | 609
Key cache hit for sstable 36 [SharedPool-Worker-2] | 2017-02-07 14:14:31.875000 | 127.0.0.2 | 721
Key cache hit for sstable 35 [SharedPool-Worker-2] | 2017-02-07 14:14:31.876000 | 127.0.0.2 | 794
Key cache hit for sstable 34 [SharedPool-Worker-2] | 2017-02-07 14:14:31.876000 | 127.0.0.2 | 869
Key cache hit for sstable 33 [SharedPool-Worker-2] | 2017-02-07 14:14:31.876000 | 127.0.0.2 | 941
Skipped 5/6 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-2] | 2017-02-07 14:14:31.876000 | 127.0.0.2 | 1002
Merging data from memtables and 1 sstables [SharedPool-Worker-2] | 2017-02-07 14:14:31.876000 | 127.0.0.2 | 1041
Read 10 live and 0 tombstone cells [SharedPool-Worker-2] | 2017-02-07 14:14:31.876000 | 127.0.0.2 | 1350
Enqueuing response to /127.0.0.1 [SharedPool-Worker-2] | 2017-02-07 14:14:31.876001 | 127.0.0.2 | 1416
Sending REQUEST_RESPONSE message to /127.0.0.1 [MessagingService-Outgoing-/127.0.0.1] | 2017-02-07 14:14:31.876001 | 127.0.0.2 | 1517
REQUEST_RESPONSE message received from /127.0.0.2 [MessagingService-Incoming-/127.0.0.2] | 2017-02-07 14:14:31.877000 | 127.0.0.1 | 3655
Processing response from /127.0.0.2 [SharedPool-Worker-4] | 2017-02-07 14:14:31.877000 | 127.0.0.1 | 3945
Request complete | 2017-02-07 14:14:31.877250 | 127.0.0.1 | 4250
Here Cassandra skips all SSTables that fall outside our timestamp range and effectively reads from a single SSTable instead of all the SSTables containing the partition.
If we do not know with certainty that all 10 rows will be in the latest time window, we can loop over time windows, going back in time until we get the required number of rows. This effectively reproduces, on the client side, the behavior that CASSANDRA-8180 added internally.
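A minimal sketch of such a client-side loop, assuming the same table, 1-minute windows matching the TWCS configuration above, and the DataStax Python driver (the function and constant names are made up for the example):

import uuid
from datetime import datetime, timedelta

from cassandra.cluster import Cluster

WINDOW = timedelta(minutes=1)   # matches the 1-minute TWCS windows above
MAX_WINDOWS = 10                # no point looking further back than the 600s TTL

def last_rows(session, partition_id, limit):
    query = session.prepare(
        "SELECT ts, value FROM time_series "
        "WHERE id = ? AND ts >= ? AND ts < ? LIMIT ?")
    rows = []
    upper = datetime.utcnow()
    for _ in range(MAX_WINDOWS):
        lower = upper - WINDOW
        # Rows come back newest first thanks to the DESC clustering order.
        rows += list(session.execute(
            query, (partition_id, lower, upper, limit - len(rows))))
        if len(rows) >= limit:
            break
        upper = lower           # step back one time window
    return rows

session = Cluster(['127.0.0.1']).connect('test')
print(last_rows(session, uuid.UUID('9de0d7cd-4108-41f9-8723-a5c54938af16'), 10))

Each iteration restricts the slice to a single time window, so Cassandra only has to touch the SSTables for that window before the loop moves on to the previous one.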
Takeaways
Read latency in Apache Cassandra is closely related to the number of SSTables your queries will have to read from.
If you’re using a version prior to 3.4, reading a partition with rows spread over multiple SSTables will actually read from all those SSTables whether you add a LIMIT clause or not. This is true regardless of the clustering order your table uses.
To reduce read latencies with compaction strategies that produce a large number of SSTables, you must add a restrictive clause on the clustering key. This allows Cassandra to exclude SSTables whose clustering range does not intersect the query.