Anthony Grasso

Node Replacement without Bootstrapping

21 Feb 2018

Apache Cassandra provides tools to replace nodes in a cluster, however these methods generally involve obtaining data from other nodes in the cluster via the bootstrap process. In this post we will step through a method to replace a node without bootstrapping in order to speed up the process.

Just Typical

The typical action to take when a node is to be replaced in the cluster is to first ensure it is in the DN (Down and Normal) state as shown using the following command.

nodetool status

Once the state of the node to replace is confirmed, a new replacement node can be bootstrapped into the cluster. This is typically done using the cassandra.replace_address_first_boot option in cassandra-env.sh file and ensuring that auto_bootstrap is enabled by omitting it from the cassandra.yaml file of the new node. This process is covered in the Auto Bootstrapping Part 1 blog post.

When a node is added in this manner, it will acquire the token range of the node it is replacing and obtain data for those token ranges from other nodes. Bootstrapping a node is the process undertaken when the inserted node streams data to join the cluster, which is also covered in the Auto Bootstrapping Part 1 blog post. The down side to this process is that when using a version of Cassandra before 2.1 or cassandra.consistent.rangemovement=false, the new node may stream data from a replica containing data that is old and or inconsistent with other nodes. In this case, once the new node completes the bootstrap process and joins the cluster, there will be more than one node with inconsistent data. In a cluster with a Replication Factor of 3, inconsistent data could be returned on QUORUM reads. To fix such data inconsistencies in the cluster, a repair must be run on the newly inserted node for the token ranges it owns after it successfully joins the cluster.

The bootstrap process is convenient, however there may be cases where we want the data as is on the old node copied to the replacement node. Whilst performing such an operation requires a number of manual steps similar to those defined in the Removing a disk mapping from Cassandra blog post, there are a number of advantages to this process:

It is potentially faster than bootstrapping a node because data is copied from node to another node, rather than copied from multiple sources. This in turn reduces network traffic in the cluster.
It is far more difficult for data inconsistencies to occur during the replacement process as the same data is being transferred between nodes.

The trade off with this operation is that the node being replaced must be available or at least its volumes must be accessible in order for the copy to occur. Thus, this restriction rules the process out for a disk failure scenario. However, on the flip side, the process lends itself well to physical hardware and AWS instance upgrades.

The Process

Before we dive into detailed steps, a summary of the process to replace an existing node in the cluster with a new node without bootstrapping is as follows.

Check the status of the old node.
Copy data from the old node to the new node multiple times. If the node is in a DN state skip to step 5.
Gracefully shutdown Cassandra the old node.
Update the copied data on new node to account for any new SSTables flushed to disk when the old node was shutdown.
Configure the new node to replace the old node.
Insert new node into cluster.

The Devil is in the Detailed Steps

The above process makes the following assumptions:

The new node has been provisioned with a version of Cassandra that is the same as the old node that will be replaced.
The Cassandra process on the new node is in stopped state.

It is worth noting that the initial copy in step 2 is likely to take in the order of tens of hours depending on how much data is stored on the node.

Detailed steps to perform the above process will use the following terms.

old node - The node to be replaced in the cluster.

new node - The node that will be replacing the old node in the cluster.

The steps are as follows.

1. Check the status of the old node

We will need to check the status of the old node to determine how we proceed with the data copying to the new node. Perform this check by running the following command.

nodetool status

Note down the Status and Host ID of the old node as we will use this information in the coming steps.

2. Copy data from the old node to the new node

We will need to locate the directory that contains the data on the old node before copying it. This can easily be done by inspecting the cassandra.yaml file for the data_file_directories setting. We will need to repeat the remaining parts to this step for each data directory listed in the data_file_directories setting.

Use rsync to make an initial copy of the data on the old node to the new node. The following rsync command is performed on the new node and effectively “pulls” the data from the old node.

rsync -azP --delete-before --bwlimit $BANDWIDTH_LIMIT \
$USER_NAME@$OLD_NODE_IP_ADDRESS:$OLD_NODE_DATA_DIRECTORY/ \
$NEW_NODE_DATA_DIRECTORY/

Where the options used are:

-a - Archive mode which includes recursion and permission preservation.
-z - Compress data during transfer.
-P - Shows progress and verbose output.
–delete-before - Removes any existing file in the destination folder that is not present in the source folder.
–bwlimit - Maximum bandwidth in kilobytes per second to use for file transfer. See the $BANDWIDTH_LIMIT description for further information on what this value should be set to.

Where the values for the options and arguments are:

$BANDWIDTH_LIMIT - A good starting point for this value could be the value of the stream_throughput_outbound_megabits_per_sec setting in the cassandra.yaml file which defaults to 200. Note that there is a unit difference here. The Cassandra setting is in Mega Bits per Second where as the rsync option is in Kilo Bytes per Second. Hence, we will need to take the stream_throughput_outbound_megabits_per_sec value, divide it by 8 and then multiply it by 1,000 to convert it to the value to use for --bwlimit. Depending on the network and the bandwidth available, it is possible to stop the command, tune the value and restart the rsync.
$USER_NAME - user name used when accessing the node via SSH.
$OLD_NODE_IP_ADDRESS - IP address of the old node to be replaced.
$OLD_NODE_DATA_DIRECTORY - Absolute path of the data directory containing all keyspace data; this includes all the system keyspaces on the old node. This will be the value defined for data_file_directories setting in the cassandra.yaml file on the old node. Be sure to include the trailing / at the end of the data directory path.
$NEW_NODE_DATA_DIRECTORY - Location of the data directory that will contain a copy of the all the data from the old node. This will be the value defined for data_file_directories setting in the cassandra.yaml file on the new node. Be sure to include the trailing / at the end of the data directory path.

The reason we want to copy all the keyspace data including the system keyspaces is because the system keyspaces contain information about the tokens allocated to a node. Further to this, the system keyspace tells a node that it has data for the tokens allocated to it. Without this information the new node will assume it has no tokens allocated to it and proceed to acquire new tokens.

As mentioned previously, it is likely that the very first rsync operation will take in the order of tens of hours depending on how much data is on the old node. This being the case it is probably a good idea to execute the rsync within a screen session or using nohup.

Once the first rysnc has completed, if Status noted down in step 1 was DN (Down/Normal) then only the single rsync copy operation will be needed. This is because the old node will never accept any further writes while in this state. In this case skip over to step 5. However, if the node is in a UN (Up/Normal) state it is worth repeating this step a few times to the point where the data on the new node is almost identical to that of the old node. At that point the rsync operation will take only a few minutes to complete.

3. Gracefully shutdown Cassandra the old node

Shutdown the Cassandra process on the old node when there is a minimal data difference between the old node and the new node. This is to prevent the old node from accepting any more writes and to flush any remaining write mutations it accepted to disk. The old node can be shutdown gracefully using the following commands.

nodetool drain
sleep 5
sudo cassandra stop

Once the Cassandra process on the old node has stopped, change the cluster_name setting in the cassandra.yaml file to something other than its current value. e.g. Non-Existent-Cluster. This is to prevent the old node from ever joining the cluster again, in the event the Cassandra process was started by accident.

4. Update the copied data on new node again

The final rsync will need to be executed once the node has been shutdown gracefully. This is so we can copy any new SSTables that were written to disk during the previous copy and as part of the shutdown. Execute the update rsync command as follows.

rsync -azP --delete-before --bwlimit $BANDWIDTH_LIMIT \
$USER_NAME@$OLD_NODE_IP_ADDRESS:$OLD_NODE_DATA_DIRECTORY/ \
$NEW_NODE_DATA_DIRECTORY/

If there are minimal data differences between the old node and the new node, this update should take no longer than a few minutes to complete.

Note that this step will need to be repeated for each data directory listed in the data_file_directories setting.

5. Configure the new node to replace the old node

Configure the cassandra.yaml file such that the new node can join the cluster by ensuring that the cluster_name setting is identical to the remaining nodes in the cluster.

Finally, ensure that the commitlogs and the saved caches folders are empty on the new node. The location of the commitlog and save caches can be found in the cassandra.yaml file under the commitlog_directory and saved_caches_directory settings respectively.

The reason for ensuring that the commitlog and saved caches are empty is we want to avoid having out of date information present when the node starts. In step 3 we ran the nodetool drain command on the old node before performing a final copy of the data to the new node in step 4. All the changes in the commitlog will have been written to SSTables on disk when the nodetool drain command ran, and as such the commitlog information is of no use. If it is replayed, the new node may use out of date data in the commitlog.

Note that if the commitlog, saved caches and data directories are missing when Cassandra starts, it will create them. On first installing Apache Cassandra, the process may start automatically potentially causing issues when inserting a new node. One way to prevent this from occurring is to add the following policy on the node.

echo exit 101 > /usr/sbin/policy-rc.d
chmod +x /usr/sbin/policy-rc.d

Now we are ready to insert our replacement node!

6. Insert new node into cluster

The new node can be inserted by simply starting the Cassandra process as follows.

service cassandra start

The following messages will be observed in the logs which indicate that the old node has been replaced by the new node. Furthermore, the last log message below will be repeated for each token that the old node owned.

INFO  [main] 2018-02-15 21:37:51,126 StorageService.java:900 - Using saved tokens [...]
WARN  [GossipStage:1] 2018-02-15 21:37:53,040 StorageService.java:1970 - Not updating host ID $HOST_ID for $OLD_NODE_IP_ADDRESS because it's mine
INFO  [GossipStage:1] 2018-02-16 13:23:56,421 StorageService.java:2028 - Nodes $OLD_NODE_IP_ADDRESS and $NEW_NODE_IP_ADDRESS have the same token ... Ignoring

Note that in the above message the $HOST_ID will be the Host ID noted down in Step 1. In addition, when nodetool status is executed, the IP address of the new node ($NEW_NODE_IP_ADDRESS) will be associated with the $HOST_ID in the output.

If the old node was a seed node, update the seed_provider setting in the cassandra.yaml file on all nodes. Update this setting by removing the IP address of the old node and replacing it with the IP address of the new node.

Conclusion

As can be seen from the steps it is possible to replace a node in the cluster without the need to use the Apache Cassandra bootstrap functionality. The advantage of this process is that it isolates the movement of data in the cluster to just two nodes; the old node and the new node. Thus, no other nodes in the cluster need to stream data to the new node as part of its insertion. For cluster that contains nodes requiring a hardware upgrade it could be advantageous to use these steps. Please feel free to leave comments or questions on this particular process.

Last updated: 2019-02-21

cassandra operational bootstrapping