Phantom Consistency Mechanisms

In this blog post we will take a look at consistency mechanisms in Apache Cassandra. There are three reasonably well documented features serving this purpose (a quick sketch of how each one is controlled follows the list):

  • Read repair gives the option to sync data on read requests.
  • Hinted handoff is a buffering mechanism for situations when nodes are temporarily unavailable.
  • Anti-entropy repair (or simply just repair) is a process of synchronizing data across the board.
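
For reference, each of these three mechanisms can be controlled from the command line. Here is a minimal sketch using the ccm cluster and the sneaky.repair table created in the next section (the exact options can vary between Cassandra versions):

# hinted handoff can be toggled per node
ccm node1 nodetool disablehandoff
ccm node1 nodetool enablehandoff

# read repair is tuned per table through its read_repair_chance properties
ccm node1 cqlsh -e "ALTER TABLE sneaky.repair WITH read_repair_chance = 0.1;"

# anti-entropy repair has to be started explicitly by an operator
ccm node1 nodetool repair sneaky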

What is far less known, and what we will explore in detail in this post, is a fourth mechanism Apache Cassandra uses to ensure data consistency. We are going to see Cassandra perform another flavour of read repair, but in a far sneakier way.

Setting things up

In order to see this sneaky repair happening, we need to orchestrate a few things. Let’s just blaze through some initial setup using Cassandra Cluster Manager (ccm, available on GitHub).

# create a cluster of 2x3 nodes
ccm create sneaky-repair -v 2.1.15
ccm updateconf 'num_tokens: 32'
ccm populate --vnodes -n 3:3

# start nodes in one DC only
ccm node1 start --wait-for-binary-proto
ccm node2 start --wait-for-binary-proto
ccm node3 start --wait-for-binary-proto

# create keyspace and table
ccm node1 cqlsh -e "CREATE KEYSPACE sneaky WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};"
ccm node1 cqlsh -e "CREATE TABLE sneaky.repair (k TEXT PRIMARY KEY , v TEXT);"

# insert some data
ccm node1 cqlsh -e "INSERT INTO sneaky.repair (k, v) VALUES ('firstKey', 'firstValue');"

The familiar situation

At this point, we have a cluster up and running. Suddenly, “the requirements change” and we need to expand the cluster by adding one more data center. So we will do just that and observe what happens to the consistency of our data.

Before we proceed, we need to ensure some determinism and turn off Cassandra’s known consistency mechanisms (we will not be disabling anti-entropy repair as that process must be initiated by an operator anyway):

# disable hinted handoff
ccm node1 nodetool disablehandoff
ccm node2 nodetool disablehandoff
ccm node3 nodetool disablehandoff

# disable read repairs
ccm node1 cqlsh -e "ALTER TABLE sneaky.repair WITH read_repair_chance = 0.0 AND dclocal_read_repair_chance = 0.0"

Now we expand the cluster:

# start nodes
ccm node4 start --wait-for-binary-proto
ccm node5 start --wait-for-binary-proto
ccm node6 start --wait-for-binary-proto

# alter keyspace
ccm node1 cqlsh -e "ALTER KEYSPACE sneaky WITH replication ={'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2':3 };"

With these commands, we have effectively added a new DC to the cluster. From this point on, Cassandra can start using the new DC to serve client requests. However, there is a catch: we have not populated the new nodes with any data. Typically, we would do that with nodetool rebuild. For this blog post we will skip that step, because leaving the new nodes empty is exactly what lets the sneakiness be observed.
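
For completeness, the rebuild we are deliberately skipping would normally be run on every node of the new data center, with the existing data center given as the streaming source. Roughly, it would look like this:

# stream data to the new nodes from the existing data center (skipped on purpose in this post)
ccm node4 nodetool rebuild dc1
ccm node5 nodetool rebuild dc1
ccm node6 nodetool rebuild dc1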

Sneakiness: blocking read repairs

Since no data has been put on the new nodes, we should expect no data to be readable from the new DC. We will go to one of the new nodes (node4) and do a read request with LOCAL_QUORUM consistency to ensure only the new DC participates in the request. After the read request we will also check the read repair statistics from nodetool, but we will set that information aside for later:

ccm node4 cqlsh -e "CONSISTENCY LOCAL_QUORUM; SELECT * FROM sneaky.repair WHERE k ='firstKey';"
ccm node4 nodetool netstats | grep -A 3 "Read Repair"

 k | v
---+---

(0 rows)

No rows are returned as expected. Now, let’s do another read request (again from node4), this time involving at least one replica from the old DC thanks to QUORUM consistency:

ccm node4 cqlsh -e "CONSISTENCY QUORUM; SELECT * FROM sneaky.repair WHERE k ='firstKey';"
ccm node4 nodetool netstats | grep -A 3 "Read Repair"

 k        | v
----------+------------
 firstKey | firstValue

(1 rows)

This time we got a hit! That is quite unexpected, because we have not run a rebuild or a repair in the meantime, and both hinted handoff and read repairs have been disabled. How come Cassandra went ahead and fixed our data anyway?

In order to shed some light on this, let’s examine the nodetool netstats output from before. We should see something like this:

# after first SELECT using LOCAL_QUORUM
ccm node4 nodetool netstats  | grep -A 3 "Read Repair"
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0

# after second SELECT using QUORUM
ccm node4 nodetool netstats  | grep -A 3 "Read Repair"
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 1
Mismatch (Background): 0

# after third SELECT using LOCAL_QUORUM
ccm node4 nodetool netstats  | grep -A 3 "Read Repair"
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 1
Mismatch (Background): 0

From this output we can tell that:

  • The Attempted counter stayed at 0.
  • Exactly one blocking read repair happened (Mismatch (Blocking) went from 0 to 1), and it happened during the QUORUM read.
  • No background read repair happened (Mismatch (Background) stayed at 0).

It turns out there are two kinds of read repair that can happen (we will also double check the table settings right after this list):

  • A blocking read repair happens when a query cannot complete at the desired consistency level without actually repairing the data. read_repair_chance has no impact on this kind of repair.
  • A background read repair happens when a query succeeds but inconsistencies are found among the replicas. This happens with read_repair_chance (or dclocal_read_repair_chance) probability.
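
If you want to confirm that read_repair_chance really played no role here, the table settings can be double checked straight from cqlsh; both chances show up in the DESCRIBE output and should still read 0.0:

# verify that read_repair_chance and dclocal_read_repair_chance are both 0.0
ccm node4 cqlsh -e "DESCRIBE TABLE sneaky.repair;"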

The take-away

To sum things up, it is not possible to entirely disable read repairs, and Cassandra will sometimes fix inconsistent data for us whether we ask it to or not. While this is pretty convenient, it also has some inconvenient implications. The best way to avoid any surprises is to keep the data consistent by running regular repairs.
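
As a reminder, such a repair is a single nodetool call; for the keyspace used in this post it could look like the following (in a real cluster this would typically be scheduled rather than run by hand):

# kick off an anti-entropy repair of the sneaky keyspace on one node;
# repeat on the other nodes (or use -pr on every node) to cover the whole ring
ccm node1 nodetool repair sneaky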

In situations featuring non-negligible amounts of inconsistent data, this sneakiness can cause a lot of unexpected load on the nodes, as well as on the cross-DC network links. Having to do cross-DC reads can also introduce additional latency. Read-heavy workloads and workloads with large partitions are particularly susceptible to problems caused by blocking read repair.

One situation in which a lot of inconsistent data is guaranteed is when a new data center gets added to the cluster. In that case, LOCAL_QUORUM is necessary to avoid blocking read repairs until a rebuild or a full repair is done. Using LOCAL_QUORUM is doubly important when the data center expansion happens for the first time: in a single data center scenario QUORUM and LOCAL_QUORUM have virtually the same semantics, and it is easy to forget which one is actually being used.
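
One simple guard against that mix-up is to check the session’s consistency level explicitly. In cqlsh, the CONSISTENCY command with no argument prints the current level, and with an argument it pins it:

# show the current consistency level, pin it to LOCAL_QUORUM, then confirm
ccm node4 cqlsh -e "CONSISTENCY; CONSISTENCY LOCAL_QUORUM; CONSISTENCY;"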
