One of the usual suspects for performance issues in the read path of Apache Cassandra is the presence of tombstones. We are used to check how many tombstones are accessed per read early in the process, to identify the possible cause of excessive GC pauses or high read latencies.
While trying to understand unexpected high read latencies for a customer a few months ago, we found out that one special (although fairly common) kind of tombstone was not counted in the metrics nor traced in the logs : primary key deletes.
Apache Cassandra versions 3.x and below have an all or nothing approach when it comes the datacenter user authorization security model. That is, a user has access to all datacenters in the cluster or no datacenters in the cluster. This has changed to something a little more fine grained for versions 4.0 and above, all thanks to Blake Eggleston and the work he has done on CASSANDRA-13985.
This is our second post in our series on performance tuning with Apache Cassandra. In the first post, we examined a fantastic tool for helping with performance analysis, the flame graph. We specifically looked at using Swiss Java Knife to generate them.
Data is critical to modern business and operational teams need to have a Disaster Recovery Plan (DRP) to deal with the risks of potential data loss. At TLP, we are regularly involved in the data recovery and restoration process and in this post we will share information we believe will be useful for those interested in initiating or improving their backup and restore strategy for Apache Cassandra. We will consider some common solutions, and detail the solution we consider the most efficient in AWS + EBS environments as it allows the best Recovery Time Objective (RTO) and is relatively easy to implement.
There are many knobs to turn in Apache Cassandra. Finding the right value for all of them is hard. Yet even with all values finely tuned unexpected things happen. In this post we will see how
gc_grace_seconds can break the promises of the Hinted Handoff.
We are happy to announce the release of Cassandra Reaper 1.1.0.
Apache Cassandra provides tools to replace nodes in a cluster, however these methods generally involve obtaining data from other nodes in the cluster via the bootstrap process. In this post we will step through a method to replace a node without bootstrapping in order to speed up the process.
We’ve been a bit quiet on the blog for the last couple months with regard to Reaper status updates. That’s not to say we haven’t been busy, in fact it’s been quite the opposite. We’ve been putting Reaper through it’s paces on fairly large deployments, improving stability, squashing bugs, and exposing more information about the state of the cluster.
After having spent quite a bit of time learning Docker and after hearing strong community interest for the technology even though few have played with it, I figured it’d be be best to share what I’ve learned. Hopefully the knowledge transfer helps newcomers get up and running with Cassandra in a concise, yet deeply informed manner.
One of the challenges of running large scale distributed systems is being able to pinpoint problems. It’s all too common to blame a random component (usually a database) whenever there’s a hiccup even when there’s no evidence to support the claim. We’ve already discussed the importance of monitoring tools, graphing and alerting metrics, and using distributed tracing systems like ZipKin to correctly identify the source of a problem in a complex system.
What impact on latency should you expect from applying the kernel patches for the Meltdown security vulnerability?
After seeing a lot of questions surrounding incremental repair on the mailing list and after observing several outages caused by it, we figured it would be good to write down our advices in a blog post.
In celebration of National Pickle Day, we’re proud to announce the 1.0 version of Reaper for Apache Cassandra. This release is a huge milestone for us. We’d like to start by thanking everyone who’s reported bugs and helped us test. We’d especially love to give a huge thank you to the teams which have sponsored development of the project along the way.
In this blog post we will take a look at consistency mechanisms in Apache Cassandra. There are three reasonably well documented features serving this purpose:
We’re delighted to introduce cassandra-reaper.io, the dedicated site for the open source Reaper project! Since we adopted Reaper from the incredible folks at Spotify, we’ve added a significant number of features, expanded the supported versions past 2.0, added support for incremental repair, and added a Cassandra backend to simplify operations.
A handy feature was silently added to Apache Cassandra’s
nodetool just over a year ago. The feature added was the
-j (jobs) option. This little gem controls the number of compaction threads to use when running either a
upgradesstables. The option was added to
nodetool via CASSANDRA-11179 to version 3.5. It has been back ported to Apache Cassandra versions 2.1.14, 2.2.6, and 3.5.
One of the big challenges people face when starting out working with Cassandra and time series data is understanding the impact of how your write workload will affect your cluster. Writing too quickly to a single partition can create hot spots that limit your ability to scale out. Partitions that get too large can lead to issues with repair, streaming, and read performance. Reading from the middle of a large partition carries a lot of overhead, and results in increased GC pressure. Cassandra 4.0 should improve the performance of large partitions, but it won’t fully solve the other issues I’ve already mentioned. For the foreseeable future, we will need to consider their performance impact and plan for them accordingly.
Since we created our hard fork of Spotify’s great repair tool, Reaper, we’ve been committed to make it the “de facto” community tool to manage repairing Apache Cassandra clusters.
This required Reaper to support all versions of Apache Cassandra (starting from 1.2) and some features it lacked like incremental repair.
Another thing we really wanted to bring in was to remove the dependency on a Postgres database to store Reaper data. As Apache Cassandra users, it felt natural to store these in our favorite database.
Auto bootstrapping is a handy feature when it comes to growing an Apache Cassandra cluster. There are some unknowns about how this feature works which can lead to data inconsistencies in the cluster. In this post I will go through a bit about the history of the feature, the different knobs and levers available to operate it, and resolving some of the common issues that may arise.
The Last Pickle (TLP) intends to hire a project manager in the US to work directly with customers in the US and around the world. You will be part of the TLP team, coordinating and managing delivery of high-quality consulting services including expert advice, documentation and run books, diagnostics and troubleshooting, and proof-of-concept code.
This blog post describes how to monitor Apache Cassandra using the Intel Snap open source telemetry framework. The document also covers some introductory knowledge on how monitoring in Cassandra works. It will use Apache Cassandra 3.0.10 and the resulting monitoring metrics will be visualised using Grafana. Docker containers will be used for Intel Snap and Grafana.
The amount of metrics your platform is collecting can overwhelm a metrics system. This is a common problem as many of today’s metrics solutions like Graphite do not scale successfully. If you don’t have the option to use a metrics backend that can scale, like DataDog, you’re left trying to find a way to cut back the number of metrics you’re collecting. This blog goes through some customisations that provide improvements alleviating Graphite’s inability to scale. It describes how to install and use customisations made to the metrics and metrics-reporter-config libraries used in Cassandra.
Compaction in Apache Cassandra isn’t usually the first (or second) topic that gets discussed when it’s time to start optimizing your system. Most of the time we focus on data modeling and query patterns. An incorrect data model can turn a single query into hundreds of queries, resulting in increased latency, decreased throughput, and missed SLAs. If you’re using spinning disks the problem is magnified by time consuming disk seeks.
That said, compaction is also an incredibly important process. Understanding how a compaction strategy complements your data model can have a significant impact on your application’s performance. For instance, in Alex Dejanovski’s post on TimeWindowCompactionStrategy, he shows how a simple change to the compaction strategy can significantly decrease disk usage. As he demonstrated, a cluster mainly concerned with high rates of TTL’ed time series data can achieve major space savings and significantly improved performance. Knowing how each compaction strategy works in detail will help you make the right choice for your data model and access patterns. Likewise, knowing the nuance of compaction in general can help you understand why the system isn’t behaving as you’d expect when there’s a problem. In this post we’ll discuss some of the nuance of compaction, which will help you better know your database.
Apache Cassandra can store data on disk in an orderly fashion, which makes it great for time series. As you may have seen in numerous tutorials, to get the last 10 rows of a time series, just use a descending clustering order and add a LIMIT 10 clause. Simple and efficient!
Well if we take a closer look, it might not be as efficient as one would think, which we will cover in this blog post.
The de-facto tool to model and test workloads on Cassandra is cassandra-stress. It is a widely known tool, appearing in numerous blog posts to illustrate performance testing on Cassandra and often recommended for stress testing specific data models. Theoretically there is no reason why cassandra-stress couldn’t fit your performance testing needs. But cassandra-stress has some caveats when modeling real workloads, the most important of which we will cover in this blog post.
In our first post about TimeWindowCompactionStrategy, Alex Dejanovski discussed use cases and the reasons for its introduction in 3.0.8 as a replacement for DateTieredCompactionStrategy. In our experience switching production environments storing time series data to TWCS, we have seen the performance of many production systems improve dramatically.
In this post we’ll explore a new compaction strategy available in Apache Cassandra. We’ll dig into it’s use cases, limitations, and share our experiences of using it with various production clusters.
In this post I’ll introduce you to an advanced option in Apache Cassandra called user defined compaction. As the name implies, this is a process by which we tell Cassandra to create a compaction task for one or more tables explicitly. This task is then handed off to the Cassandra runtime to be executed like any other compaction.
Two weeks ago marked another Cassandra summit. As usual I submitted a handful of talks, and surprisingly they all got accepted. The first talk (video linked) I gave was an introduction to a tool I started back at DataStax called Dataset Manager for Apache Cassandra, further referred to as CDM. CDM started as a a simple question - what can we do to help people learn how to use Apache Cassandra? How can new users avoid the headaches of incorrect data modeling, repeated production deployments, and costly schema migrations.
As explained “in extenso” by Alain in his installment on how Apache Cassandra deletes data, removing rows or cells from a table is done by a special kind of write called a tombstone. But did you know that inserting a null value into a field from a CQL statement also generates a tombstone? This happens because Cassandra cannot decide whether inserting a null value means that we are trying to void a field that previously had a value or that we do not want to insert a value for that specific field.
There’s nothing fun about installing some software for the first time and feeling like it doesn’t work at all. Unfortunately if you’ve just installed Cassandra 2.2 or 3.0 on a recent Linux distribution, you may run into a not-so-friendly
Connection error when trying to use
Deleting distributed and replicated data from a system such as Apache Cassandra is far trickier than in a relational database. The process of deletion becomes more interesting when we consider that Cassandra stores its data in immutable files on disk. In such a system, to record the fact that a delete happened, a special value called a “tombstone” needs to be written as an indicator that previous values are to be considered deleted. Though this may seem quite unusual and/or counter-intuitive (particularly when you realize that a delete actually takes up space on disk), we’ll use this blog post to explain what is actually happening along side examples that you can follow on your own.
Mesophere recently opened sourced their DataCenter Operating System (DC/OS), a platform to manage Data Center resources. DCOS is built on Apache Mesos which provides tooling to “Program against your datacenter like it’s a single pool of resources”. While Mesos provides primitives to request resources from a pool, DC/OS provides common applications as packages in a repository called the Universe. It also provides a web UI and a CLI to manage these resources. One helpful way to understand Mesos and DC/OS is imagine if you were to package an application as a container: You want tools to deploy and configure this container without having to deal directly with provisioning and system level configuration.
The Last Pickle was born out of our passion for the open source community and firmly held belief that Cassandra would become the ubiquitous database platform of the next generation. We have maintained our focus on Apache Cassandra since starting in March 2011 as the first pure Apache Cassandra consultancy in the world. In the last five years we have been part of the success of customers big and small around the globe as they harness the power of Cassandra to grow their companies.
Managing Cassandra effectively often means managing multiple nodes the exact same way. As Cassandra is a peer to peer system where all the nodes are equals, there is no master or slaves. This Cassandra property allows us to easily manage any cluster by simply running the same command on all the nodes to have a change applied cluster-wide.
Last week Sudip Chakrabarti from Lightspeed Venture Partners published the blog post ‘In the land of microservices, the network is the king(maker)’. The writeup captures the Zeitgeist; foreseeing the pain and solution many large systems around us will face as they migrate to distributed architectures. Written so well I’ve felt compelled to do more than just retweet.
Have you ever relived a past experience and found it better the second time round? For example I used to love Monkey Magic when I was a child, but am not such a big fan now. When oh when will Pigsy change his piggish ways? The first time I got to see how a Database actually worked was in 2010 when I started playing with Apache Cassandra. It was a revelation after years of using platforms such as Microsoft SQL Server, and I decided to make a career out of working with Cassandra. It’s now 2016 and the Cassandra storage engine recently went through some big changes with the release of 3.x. Once again I have a chance to dive into the storage engine, and this time it is even more enjoyable as I have been able to see how the project has improved. Which has resulted in this post on how the 3.x storage engine ecodes Partitions on disk.
I recently had to remove a disk from all the Apache Cassandra instances of a production cluster. This post purpose is to share the full process, optimising the overall operation time and reducing the down time for each node.
Someone recently asked me: What happens when I drop a column in CQL? With all the recent changes in the storage engine I took the opportunity to explore the new code. In short, we will continue to read the dropped column from disk until the files are rewritten by compaction or you force Cassandra to rewrite the files.
Since versions of Cassandra dating back to
0.4, the ability to set logging levels dynamically has been available. Before I go any further, I want to make it clear that dynamic log level adjustment is A Very Good Thing. Unfortunately for security conscious installations, this can cause issues with information exposure. For many, this may seem trivial, but it is minor issues like this that can put an enterprise in violation of industry regulations, potentially creating serious liability concerns.
Integrating Zipkin tracing into Cassandra, it’s possible to create one tracing view across an entire platform. The article is a write-up of the Distributed Tracing from Application to Database presentation at Cassandra Summit 2015, Santa Clara.
Benchmarking schemas and configuration changes using the
cassandra-stress tool, before pushing such changes out to production is one of the things every Cassandra developer should know and regularly practice.
Timeout errors in Apache Cassandra occur when less than
Consistency Level number of replicas return to the coordinator. It’s the distributed systems way of shrugging and saying “not my problem mate”. From the Coordinator’s perspective the request may have been lost, the replica may have failed while doing the work, or the response may have been lost. Recently we wanted to test how write timeouts were handled as part of back porting CASSANDRA-8819 for a client. To do so we created a network partition that dropped response messages from a node.
This is a tutorial extracted from part of a presentation I gave at Cassandra Summit 2015 titled Hardening Cassandra for Compliance (or Paranoia). The slides are available and the “SSL Certificates: a brief interlude” section is probably the most expedient route if you are impatient. We build on that process here by actually installing everything on a local three node cluster. I’ll provide a link to the video of the presentation as soon as it is posted.
Connecting to Cassandra’s JMX service through firewalls can be tricky. JMX connects you through one port (7199 by default), and then opens up a dynamic port with the application. This makes remote JMX connections difficult to set up securely.
I’m happy to announce that Patricia Gorla has joined The Last Pickle. We first noticed Patricia due to her community involvement, including talking at Hadoop World New York and the Cassandra Summit in London. Prior to joining TLP she spent her time at OpenSource Connections helping customers with Solr and Cassandra.
I recently found an afternoon to play with Riemann, something I’ve been wanting to do for a while. A lot of people have said nice things about it and it was time we got in on the action. One of the catalysts for action was the addition of configurable reporters for the Metrics library in Cassandra v2.0.2. If you’ve not heard about this head over to the DataStax blog and read the guest post by Chris Burroughs.
A lot of folks have been having issues lately with the performance of insert-heavy workloads via CQL. Though batch statements are available in the new 2.0 release, the general consensus in the community has been that switching back to the Thrift API is the most immediate and well understand path for alleviating mutation performance issues with CQL.
We got our first press release quote!
I’m extremely happy today to be announcing that Nate McCall and I have started working together. We’ll be continuing the journey I started over two years ago; working with clients to deliver and improve Apache Cassandra based solutions. We both feel that continuing to contribute to the Cassandra Community is a critical part of our success, and that of our clients. And we are looking forward to combining our efforts on both these fronts.
Here in ATX (Austin, Texas) we have the distinct pleasure of having the majority of DataStax’s engineering team in-town. This includes Apache Cassandra project chair Jonathan Ellis, who last night gave a presentation at our monthly Cassandra meetup entitled: “You got your transactions in my NoSQL - An Introduction to Cassandra 2.0.” For those of you not based in Austin I’ve uploaded a video of the presentation to You Tube.
The final version of CQL 3 that ships with Cassandra v1.2 adds some new features to the
PRIMARY KEY clause. It overloads the concept in ways that differ from the standard SQL definition, and in some places shares ideas with Hive. But from a Cassandra point of view it allows for the same flexibility as the Thrift API.
I like humans. Many of my friends a humans; my wife is a human. And I admire their ability to arbitrarily order items in a list. Applying a manual order to a list of items has been discussed a few times on the Cassandra user list. And I’ve been thinking about it recently.
Row Level Isolation in Cassandra 1.1 is the most important new feature in Cassandra so far. There have been a lot of great improvements since version 0.5, but row level isolation adds an entirely new feature. One that opens the door to new use cases.
Some recent Zoo Keeper reading.
Recent Cassandra reading.
Recently I was working on a Cassandra cluster and experienced a strange situation that resulted in a partition of sorts between the nodes. Whether you actually call it a partition or not is a matter for discussion (see ” You Can’t Sacrifice Partition Tolerance - Updated October 22, 2010”]. But weird stuff happened, Cassandra remained available, and it was fixed with zero site down time. It was also a good example of how and why Cassandra is a Highly Available data store, and fun to fix. So here are all the nerdy details…
Cassandra 0.8.1 added support for Composite Types and Reversed Types, through a handy little type composition language. Hopefully I’ll get to show some of the things you can do with Composite Types later, for now the best resource I know is Ed Anuff’s presentation on Cassandra Indexing Techniques at Cassandra SF 2011.
Recently, like 2 hours ago, I was planning some work to rebalance a Cassandra cluster and I wanted to see how the steps involved would effect the range ownership of the nodes. So I replicated the logic from RandomPartitioner.describeOwnership() in a handy python script.
I’ve had a few conversations about query performance on wide rows recently, so it seemed about time to dig into how the different slice queries work.
For a read or write request to start in Cassandra at least as many nodes must be seen as
UP by the coordinator node as the request has specified via the ConsistencyLevel. Otherwise the client will get an
UnavailableException and the cluster will appear down for that request. That may not necessarily mean it is down for all keys or all requests.
There’s been a few good AWS discussions recently on the Cassandra User List, and some interesting blog posts.
Deletes in Cassandra rely on Tombstones to support the Eventual Consistency model. Tombstones are markers that can exist at different levels of the data model and let the cluster know that a delete was recored on a replica, and when it happened. Tombstones then play a role in keeping deleted data hidden and help with freeing space used by deleted columns on disk.
Updated: I’ve added information on the new
memtable_total_space_in_mb setting in version 0.8 and improved the information about
memtable_throughput. Thanks for the feedback.
Requests that write and read data in Cassandra, like any data base, have competing characteristics that need to be balanced. This post compares the approach taken by Cassandra to traditional Relation Database Systems.
I’ve tidied up a previous Introduction to Cassandra presentation with better diagrams and improved explanations. It’s available as Keynote, PDF or plain old web pages. It’s a basic introduction to the way Cassandra works as a clustered system. It covers Partitioning, Replication, Hinted Handoff, Read Repair and Anti Entrophy. It does not cover the data model.