Apache Cassandra - Data Center Switch

Have you ever wondered how to change the hardware efficiently on your Apache Cassandra cluster? Or did you read Anthony's blog post last week about how to set up a balanced cluster, find it as exciting as we did, but remain unsure how to change the number of vnodes?

A Data Center “Switch” might be what you need.

We will go through this process together hereafter. Additionally, if you are just looking at adding or removing a data center from an existing cluster, these operations are described as part of the data center switch. I hope the structure of the post will make it easy for you to find the part you need!

Warning/Note: If you are unable to get hardware easily, this won’t work for you. The Data Center (DC) Switch is well adapted to cloud environments or to teams with spare/available hardware, as we will need to create a new data center before we can actually free up the previously used machines. Generally we use the same number of nodes in the new data center, but this is not mandatory. If the hardware changes, you’ll want to use a proportional number of machines to keep performance unchanged.

Definition

First things first, what is a “Data Center Switch” in our Apache Cassandra context?

The idea is to transition to a new data center, freshly added for this operation, and then to remove the old one. In between, clients need to be switched to the new data center.

Logical isolation / topology between data centers in Cassandra helps keep this operation safe and allows you to roll back the operation at almost any stage and with little effort. This technique does not generate any downtime when it is well executed.

Use Cases

A DC Switch can be used for changes that cannot be easily or safely performed within an existing data center, through JMX, or even with a rolling restart. It can also allow you to make changes such as modifying the number of vnodes in use for a cluster, or to change the hardware without creating unwanted transitional states where some nodes would have distinct hardware (or operating system).

It could also be used for a very critical upgrade. Note that with an upgrade it’s important to keep in mind that streaming in a cluster running mixed versions of Cassandra is not recommended. In this case, it’s better to add the new data center using the same Cassandra version, feed it with the data from the old data center, and only then upgrade the Cassandra version in the new data center. When the new data center is considered stable, the clients can be routed to the new data center and the old one can be removed.

It might even fit your needs in some cases I cannot think of right now. Once you know and understand this process, you might find yourself making original use of it.

Limitations and Risks

  • Again, this is a good fit for a cloud environment or when getting new servers. In some cases it might not be possible (or is not worth it) to double the hardware in use, even for a short period of time.
  • You cannot use the DC Switch to change global settings like the cluster_name, as this value is unique for the whole cluster.

How-to - DC Switch

This section provides a detailed runbook to perform a Data Center Switch.

The switch is broken into four phases, numbered 0 to 3 and described below.

The phases are mostly independent, and at the end of each phase the cluster should be in a stable state. You can run only Phase 1 (add a DC) or Phase 3 (remove a DC), for example. You should always read and pay attention to Phase 0, though.

Rollback

Before we start, it is important to note that until the very last step, anything can be easily rolled back. You can always safely and quickly go back to the previous state during the procedure. Simply stated, you should be able to run the commands opposite to the ones shown here, in reverse order, until the cluster is back in an acceptable state. It’s good to check (and store!) the values of the configuration we are about to change before each step.
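
For example, if you had just finished Phase 1 (the new data center is added, but no client has switched yet) and needed to roll back, a minimal sketch would be to remove the new data center from every keyspace’s replication settings and then decommission its nodes, essentially running Phase 3 against the new data center instead of the old one. The names below are placeholders:

# Drop the new DC from the replication settings of every keyspace that references it...
$ cqlsh -e "ALTER KEYSPACE <my_ks> WITH replication = {'class': 'NetworkTopologyStrategy', '<old_dc_name>': '<replication_factor>'};"
# ...then, on each node of the new data center, one at a time:
$ nodetool decommission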

Keep in mind that after each phase below the cluster is in a stable state and can remain like this for a long time, or forever, without problems.

Phase 0 - Prepare Configuration for Multiple Data Centers

First, the preparation phase. We want to make sure Cassandra and the clients will react as expected in a Multi-DC environment.

Server Side (Cassandra)

Step 1: Make sure all the keyspaces use NetworkTopologyStrategy (NTS)

  • Confirm that each user keyspace is using the NetworkTopologyStrategy:
$ cqlsh -e "DESCRIBE KEYSPACES;"
$ cqlsh -e "DESCRIBE KEYSPACE <my_ks>;" | grep replication
$ cqlsh -e "DESCRIBE KEYSPACE <my_other_ks>;" | grep replication
...

If any user keyspace is not, change it to NetworkTopologyStrategy.

  • Confirm that all the keyspaces (including system_*, possibly opscenter, …) also use either the NetworkTopologyStrategy or the LocalStrategy. To be clear:
    • keyspaces using SimpleStrategy (i.e. possibly system_auth, system_distributed, …) should be switched to NetworkTopologyStrategy.
    • the system keyspace, or any other keyspace using LocalStrategy, should not be changed.

Note: SimpleStrategy is not a good fit in most cases and none of the keyspaces should use it in a Multi-DC context, as it ignores the data center layout: client operations would touch distinct data centers to answer reads, breaking the expected isolation between the data centers.

$ cqlsh -e "DESCRIBE KEYSPACE system_auth;" | grep replication
...

In case you need to change this, do something like:

-- Custom Keyspace
ALTER KEYSPACE tlp_labs WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'old-dc': '3'
};
[...]

-- system_* keyspaces from SimpleStrategy to NetworkTopologyStrategy
ALTER KEYSPACE system_auth WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'old-dc': '3' -- Or more... But it's not today's topic ;-)
};
ALTER KEYSPACE system_distributed WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'old-dc': '3'
};
ALTER KEYSPACE system_traces WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'old-dc': '3'
};
[...]

Warning:

This process might have consequences on data availability. Be sure to understand its impact on token ownership so as not to break service availability.

You can avoid this problem by mirroring the previous distribution that SimpleStrategy was producing. By using NetworkTopologyStrategy in combination with GossipingPropertyFileSnitch to copy the previous SimpleStrategy/SimpleSnitch behaviour (data center and rack names), you should be able to make this change a ‘non-event’, where actually nothing happens from the Cassandra topology perspective.

In other words, if both topologies result in the same logical placement of the nodes, then there is no movement and no risk. If the operation results in a topology change (for example, two data centers that were previously considered as one), it’s good to consider the consequences ahead of time and to run a full repair after the transition.
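
If you want extra confidence that the strategy change is indeed a ‘non-event’, one hedged way to check is to capture the replica placement of a few sample partition keys before and after the ALTER and compare the output. The keyspace, table and partition key below are placeholders:

# Replicas for a sample partition key, before the ALTER
$ nodetool getendpoints <my_ks> <my_table> <sample_partition_key> > endpoints_before.txt
# Run the ALTER KEYSPACE statement, then capture the placement again
$ nodetool getendpoints <my_ks> <my_table> <sample_partition_key> > endpoints_after.txt
# No output from diff means no replica moved for this key
$ diff endpoints_before.txt endpoints_after.txt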

Step 2: Use a data center-aware snitch

Make sure the existing nodes are not using the SimpleSnitch.

Instead, the snitch must be one that considers the data center (and racks).

For example:

 endpoint_snitch: Ec2Snitch

or

 endpoint_snitch: GossipingPropertyFileSnitch

If your cluster is currently using the SimpleSnitch, be careful changing this value, as you might induce a change in the topology or in where the data belongs. It is worth reading in detail about this specific topic if that is the case.
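
A quick, hedged way to confirm which snitch is currently in use (the configuration path below assumes a package install and may differ on your systems):

# Ask a live node which snitch the cluster runs with (some versions report the dynamic snitch wrapper here)
$ nodetool describecluster | grep -i snitch
# Or check the setting directly in the configuration of each node
$ grep endpoint_snitch /etc/cassandra/cassandra.yaml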

That’s it for now on Cassandra; we are done configuring the server side.

Client Side (Cassandra)

Since there are many clients out there, you might have to adapt this to your specific case or driver. I will take the DataStax Java driver as an example here.

All the clients using the cluster should go through the checks/changes below.

Step 4: Use a DCAware policy and disable remote connections

Cluster cluster = Cluster.builder()
        .addContactPoint(<ip_contact_list_from_old_dc>)
        .withLoadBalancingPolicy(
            DCAwareRoundRobinPolicy.builder()
                .withLocalDc("<old_dc_name>")
                .withUsedHostsPerRemoteDc(0)
                .build()
        ).build();

Step 5: Pin connections to the existing data center

It’s important to use a consistency level that aims at retrieving the data from the current data center and not across the whole cluster. In general, use a consistency of the form: LOCAL_*. If you were using QUORUM before, change it to LOCAL_QUORUM for example.

At this point, all the clients should be ready for the new data center to be added: they will ignore the new data center’s nodes and only contact the local data center as defined above.

Phase 1 - Add a New Data Center

Step 6: Create and configure new Cassandra nodes

Choose the right hardware and number of nodes for the new data center, then bring the machines up.

Configure the Cassandra nodes exactly like the old nodes, except for the configuration you intend to change with the new DC, along with the data center name. How the data center name is defined depends on the snitch you picked: it can be determined by the IP address, a properties file, or the AWS region name.

To perform this change using GossipingPropertyFileSnitch, edit the cassandra-rackdc.properties file on all nodes:

dc=<new_dc>
...

To create a new data center in the same region in AWS, you have to set the dc_suffix option in the cassandra-rackdc.properties file on all nodes:

# to have us-east-1-awesome-cluster for example
dc_suffix=-awesome-cluster
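
Before starting any of the new nodes, it can be worth double-checking that each of them advertises the intended data center (and rack) name. A minimal sketch, assuming the configuration lives under /etc/cassandra/ (adjust the path to your install):

# On each new node: the dc/rack (or dc_suffix) values drive the data center name seen by the cluster
$ grep -E "^(dc|rack|dc_suffix)" /etc/cassandra/cassandra-rackdc.properties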

Step 7: Add new nodes to the cluster

Nodes can now be added, one at a time. Just start Cassandra using the service start method that is specific to your operating system. For example, on Linux systems we can run:

service cassandra start

Notes:

  • Start with the seed nodes for this new data center. Using two or three nodes as seeds per DC is a standard recommendation.
    -seeds: "<old_ip1>, <old_ip2>, <old_ip3>, <new_ip1>, <new_ip2>, <new_ip3>"
    
  • There should be no streaming; adding a node should be quick. Check the logs to make sure of it: tail -fn 100 /var/log/cassandra/system.log
  • Due to the previous point, the nodes should quickly join as part of the new data center. Check nodetool status and make sure a node appears as UN before moving to the next node (see the checks below).
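
A minimal way to verify each node as it joins, assuming nodetool is run locally on the new node:

# The freshly started node should show up as UN under the new data center...
$ nodetool status | grep -A 15 "Datacenter: <new_dc_name>"
# ...and report the expected data center and rack for itself
$ nodetool info | grep -E "Data Center|Rack"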

Step 8: Start accepting writes on the new data center

The next step is to start accepting writes on this data center by changing the topology so the new DC is also part of the replication strategy:

ALTER KEYSPACE <my_ks> WITH replication = {
  'class': 'NetworkTopologyStrategy',
  '<old_dc_name>': '<replication_factor>',
  '<new_dc_name>': '<replication_factor>'
};
ALTER KEYSPACE <my_other_ks> WITH replication = {
  'class': 'NetworkTopologyStrategy',
  '<old_dc_name>': '<replication_factor>',
  '<new_dc_name>': '<replication_factor>'
};
[...]

Include system keyspaces that should now be using the NetworkTopologyStrategy for replication:

ALTER KEYSPACE system_auth WITH replication = {
  'class': 'NetworkTopologyStrategy',
  '<old_dc_name>': '<replication_factor>',
  '<new_dc_name>': '<replication_factor>'
};
ALTER KEYSPACE system_distributed WITH replication = {
  'class': 'NetworkTopologyStrategy',
  '<old_dc_name>': '<replication_factor>',
  '<new_dc_name>': '<replication_factor>'
};
ALTER KEYSPACE system_traces WITH replication = {
  'class': 'NetworkTopologyStrategy',
  '<old_dc_name>': '<replication_factor>',
  '<new_dc_name>': '<replication_factor>'
};
[...]

Note: Here again, do not modify keyspaces using LocalStrategy.

To make sure that the keyspaces were altered as expected, you can see the ownership with:

nodetool status <my_ks>

Note: this is a good moment to detect any imbalances in ownership and fix any issues, before actually streaming the data to the new data center. We detailed this part in the post “How To Set Up A Cluster With Even Token Distribution” that Anthony wrote. You should really check this if you plan to use a low number of vnodes; otherwise, you might go through an operational nightmare trying to handle imbalances.

Step 9: Stream historical data to the new data center

Stream the historical data to the new data center to fill the gap: the new data center is now receiving writes but is still missing all the past data. This is done by running the following on all the nodes of the new data center:

nodetool rebuild <old_dc_name>

The output (in the logs) should look like:

INFO  [RMI TCP Connection(8)-192.168.1.31] ... - rebuild from dc: ..., ..., (All tokens)
INFO  [RMI TCP Connection(8)-192.168.1.31] ... - [Stream ...] Executing streaming plan for Rebuild
INFO  [StreamConnectionEstablisher:1] ... - [Stream ...] Starting streaming to ...
...
INFO  [StreamConnectionEstablisher:2] ... - [Stream ...] Beginning stream session with ...

Note:

  • nodetool setstreamthroughput X can help reduce the burden caused by streaming on the nodes answering requests or, the other way around, make the transfer faster (see the sketch below).
  • A good way to know when the rebuild has finished is to run the command above from a screen or tmux session, for example: screen -R rebuild.
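
To keep an eye on the rebuild while it runs, a hedged option is to watch the streaming sessions and adjust the throughput limit if needed. The 200 Mb/s value below is only an example:

# Watch ongoing streaming sessions on a node; once the rebuild is done, no active streams should remain
$ watch -d "nodetool netstats"
# Raise or lower the streaming throughput cap (in Mb/s) depending on how much live traffic must be protected
$ nodetool setstreamthroughput 200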

Phase 2 - Switch Clients to the new DC

At this point the new data center can be tested and should be a mirror of the previous one, except for the things you changed of course.

Step 10: Client Switch

The clients can now be routed to the new data center. To do so, change the contact points and the data center name. Doing this one client at a time, while observing the impact, is probably the safest way when there are many clients connected to a single cluster. Back to our Java driver example, it would now look like this:

Cluster cluster = Cluster.builder()
        .addContactPoint(<ip_contact_list_from_new_dc>)
        .withLoadBalancingPolicy(
            DCAwareRoundRobinPolicy.builder()
                .withLocalDc("<new_dc_name>")
                .withUsedHostsPerRemoteDc(0)
                .build()
        ).build();

Note: Before going forward you can (and probably should) make sure that no client is connected to old nodes anymore. You can do this in a number of ways:

  • Look at netstats for opened (native or thrift) connections:
    netstat -tupawn | grep -e 9042 -e 9160
    
  • Check that the node does not receive local reads (i.e. ReadStage should not increase) and does not act as a coordinator (i.e. RequestResponseStage should not increase).
    watch -d "nodetool tpstats"
    
  • Monitoring system/Dashboards: Ensure there are no local reads heading to the old data center.

Phase 3 - Remove the old DC

Step 11: Stop replicating data in the old data center

Alter the keyspaces so they no longer reference the old data center:

ALTER KEYSPACE <my_ks> WITH replication = {
  'class': 'NetworkTopologyStrategy',
  '<new_dc_name>': '<replication_factor>'
};
ALTER KEYSPACE <my_other_ks> WITH replication = {
  'class': 'NetworkTopologyStrategy',
  '<new_dc_name>': '<replication_factor>'
};

[...]

ALTER KEYSPACE system_auth WITH replication = {
  'class': 'NetworkTopologyStrategy',
  '<new_dc_name>': '<replication_factor>'
};
ALTER KEYSPACE system_distributed WITH replication = {
  'class': 'NetworkTopologyStrategy',
  '<new_dc_name>': '<replication_factor>'
};
ALTER KEYSPACE system_traces WITH replication = {
  'class': 'NetworkTopologyStrategy',
  '<new_dc_name>': '<replication_factor>'
};

Step 12: Decommission old nodes

Finally, we need to get rid of the old nodes. Stopping the nodes is not enough, as Cassandra will continue to expect them to come back to life at any time. We want them out, safely, but once and for all. This is what a decommission does. To cleanly remove the old data center we need to decommission all the nodes in this data center, one by one.

On the bright side, the operation should be almost instantaneous, or at least very quick, because this data center no longer owns any data from a Cassandra perspective. Thus, there should be no data to stream to other nodes. If streaming happens, you probably forgot about a keyspace using SimpleStrategy, or a keyspace whose NetworkTopologyStrategy still references the old data center.
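
A hedged pre-check before decommissioning is to confirm that no keyspace still references the old data center in its replication settings, for example by querying the schema tables (Cassandra 3.0 and later expose them under system_schema):

# Any row printed here still replicates to the old data center and should be altered first
$ cqlsh -e "SELECT keyspace_name, replication FROM system_schema.keyspaces;" | grep "<old_dc_name>"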

Sequentially, on each node of the old data center, run:

$ nodetool decommission

This should be fast, if not immediate, as this command should trigger no streaming at all thanks to the changes we made to the keyspace replication settings. The old data center should not own any token ranges anymore, as we removed it from all the keyspaces in the previous step.
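
After each decommission, a couple of hedged checks can confirm the node really left the ring (the IP below is a placeholder):

# On the node that was just decommissioned, the mode should no longer be NORMAL
$ nodetool netstats | grep -i mode
# From any remaining node, the decommissioned node's IP should be gone from the output
$ nodetool status | grep <old_node_ip>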

Step 13: Remove old nodes from the seeds

To remove any reference to the old data center, we need to update the cassandra.yaml file.

-seeds: "<new_ip1>, <new_ip2>, <new_ip3>"

And that’s it! You have now successfully switched over to a new data center. Of course, during the process, ensure that the changes you just made are actually acknowledged in the new data center.
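
As a final sanity check, you can confirm that only the new data center remains in the ring and that all remaining nodes agree on the schema, for example:

# Only the new data center should be listed, with every node shown as UN
$ nodetool status
# The "Schema versions" section should show a single schema version shared by all nodes
$ nodetool describecluster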
