Compaction in Apache Cassandra isn’t usually the first (or second) topic that gets discussed when it’s time to start optimizing your system. Most of the time we focus on data modeling and query patterns. An incorrect data model can turn a single query into hundreds of queries, resulting in increased latency, decreased throughput, and missed SLAs. If you’re using spinning disks the problem is magnified by time consuming disk seeks.
That said, compaction is also an incredibly important process. Understanding how a compaction strategy complements your data model can have a significant impact on your application’s performance. For instance, in Alex Dejanovski’s post on TimeWindowCompactionStrategy, he shows how a simple change to the compaction strategy can significantly decrease disk usage. As he demonstrated, a cluster mainly concerned with high rates of TTL’ed time series data can achieve major space savings and significantly improved performance. Knowing how each compaction strategy works in detail will help you make the right choice for your data model and access patterns. Likewise, knowing the nuance of compaction in general can help you understand why the system isn’t behaving as you’d expect when there’s a problem. In this post we’ll discuss some of the nuance of compaction, which will help you better know your database.
Apache Cassandra can store data on disk in an orderly fashion, which makes it great for time series. As you may have seen in numerous tutorials, to get the last 10 rows of a time series, just use a descending clustering order and add a LIMIT 10 clause. Simple and efficient!
Well if we take a closer look, it might not be as efficient as one would think, which we will cover in this blog post.
The de-facto tool to model and test workloads on Cassandra is cassandra-stress. It is a widely known tool, appearing in numerous blog posts to illustrate performance testing on Cassandra and often recommended for stress testing specific data models. Theoretically there is no reason why cassandra-stress couldn’t fit your performance testing needs. But cassandra-stress has some caveats when modeling real workloads, the most important of which we will cover in this blog post.
In our first post about TimeWindowCompactionStrategy, Alex Dejanovski discussed use cases and the reasons for its introduction in 3.0.8 as a replacement for DateTieredCompactionStrategy. In our experience switching production environments storing time series data to TWCS, we have seen the performance of many production systems improve dramatically.