Node Replacement without Bootstrapping
Apache Cassandra provides tools to replace nodes in a cluster; however, these methods generally involve obtaining data from other nodes in the cluster via the bootstrap process. In this post we will step through a method to replace a node without bootstrapping in order to speed up the process.
Just Typical
The typical action to take when a node is to be replaced in the cluster is to first ensure it is in the DN (Down and Normal) state, as shown by the following command.
nodetool status
Once the state of the node to replace is confirmed, a new replacement node can be bootstrapped into the cluster. This is typically done using the cassandra.replace_address_first_boot option in the cassandra-env.sh file and ensuring that auto_bootstrap is enabled by omitting it from the cassandra.yaml file of the new node. This process is covered in the Auto Bootstrapping Part 1 blog post.
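For reference, the conventional replacement described above usually amounts to one extra JVM option on the new node. The sketch below shows roughly what that line in cassandra-env.sh could look like. The IP address is a placeholder assumed for illustration and does not come from this post.

```shell
# Hypothetical address of the dead node being replaced; substitute your own.
DEAD_NODE_IP="10.0.0.12"

# Line added to cassandra-env.sh on the replacement node before its first start.
JVM_OPTS="${JVM_OPTS:-} -Dcassandra.replace_address_first_boot=${DEAD_NODE_IP}"
```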
When a node is added in this manner, it will acquire the token range of the node it is replacing and obtain data for those token ranges from other nodes. Bootstrapping a node is the process undertaken when the inserted node streams data to join the cluster, which is also covered in the Auto Bootstrapping Part 1 blog post. The downside to this process is that when using a version of Cassandra before 2.1, or when cassandra.consistent.rangemovement=false is set, the new node may stream data from a replica containing data that is old and/or inconsistent with other nodes. In this case, once the new node completes the bootstrap process and joins the cluster, there will be more than one node with inconsistent data. In a cluster with a Replication Factor of 3, inconsistent data could be returned on QUORUM reads. To fix such data inconsistencies in the cluster, a repair must be run on the newly inserted node for the token ranges it owns after it successfully joins the cluster.
The bootstrap process is convenient; however, there may be cases where we want the data copied as is from the old node to the replacement node. Whilst performing such an operation requires a number of manual steps similar to those defined in the Removing a disk mapping from Cassandra blog post, there are a number of advantages to this process:
- It is potentially faster than bootstrapping a node because data is copied from one node to another, rather than from multiple sources. This in turn reduces network traffic in the cluster.
- It is far more difficult for data inconsistencies to occur during the replacement process as the same data is being transferred between nodes.
The trade-off with this operation is that the node being replaced must be available, or at least its volumes must be accessible, in order for the copy to occur. This restriction rules the process out for a disk failure scenario. On the flip side, the process lends itself well to physical hardware and AWS instance upgrades.
The Process
Before we dive into detailed steps, a summary of the process to replace an existing node in the cluster with a new node without bootstrapping is as follows.
1. Check the status of the old node.
2. Copy data from the old node to the new node multiple times. If the node is in a DN state, skip to step 5.
3. Gracefully shut down Cassandra on the old node.
4. Update the copied data on the new node to account for any new SSTables flushed to disk when the old node was shut down.
5. Configure the new node to replace the old node.
6. Insert the new node into the cluster.
The Devil is in the Detailed Steps
The above process makes the following assumptions:
- The new node has been provisioned with a version of Cassandra that is the same as the old node that will be replaced.
- The Cassandra process on the new node is in a stopped state.
It is worth noting that the initial copy in step 2 is likely to take in the order of tens of hours depending on how much data is stored on the node.
Detailed steps to perform the above process will use the following terms.
old node - The node to be replaced in the cluster.
new node - The node that will be replacing the old node in the cluster.
The steps are as follows.
1. Check the status of the old node
We will need to check the status of the old node to determine how we proceed with the data copying to the new node. Perform this check by running the following command.
nodetool status
Note down the Status and Host ID of the old node as we will use this information in the coming steps.
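As a convenience, the two values can be pulled out of the nodetool status output with a small filter. This is a minimal sketch that assumes the usual column layout (status code first, address second) and treats the only UUID-shaped field on the row as the Host ID. The function name node_status_and_host_id is illustrative.

```shell
# Print "<status> <host id>" for the node with the given IP address.
# Reads `nodetool status` output on stdin.
node_status_and_host_id() {
  awk -v ip="$1" '
    $2 == ip {
      # Assumption: the Host ID is the only UUID-shaped field on the row.
      for (i = 3; i <= NF; i++) {
        if ($i ~ /^[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+$/) {
          print $1, $i
        }
      }
    }
  '
}

# Usage on a live node:
#   nodetool status | node_status_and_host_id "$OLD_NODE_IP_ADDRESS"
```

Matching on the shape of the Host ID rather than on a fixed column number avoids depending on how many fields the Load and Owns columns occupy.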
2. Copy data from the old node to the new node
We will need to locate the directory that contains the data on the old node before copying it. This can easily be done by inspecting the cassandra.yaml file for the data_file_directories setting. We will need to repeat the remaining parts of this step for each data directory listed in the data_file_directories setting.
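One possible way to list those directories from the command line is sketched below. It assumes data_file_directories is written as a plain block list directly under the key, which is the layout of the stock cassandra.yaml. The function name list_data_directories is illustrative.

```shell
# Print each entry of the data_file_directories list in a cassandra.yaml file.
# Assumes the usual block-list layout:
#   data_file_directories:
#       - /var/lib/cassandra/data
list_data_directories() {
  awk '
    /^data_file_directories:/ { in_list = 1; next }
    in_list && /^[ \t]*-/ {
      sub(/^[ \t]*-[ \t]*/, "")   # strip the list marker
      sub(/[ \t]+$/, "")          # strip trailing whitespace
      print
      next
    }
    # Any other top-level, non-comment line ends the list.
    in_list && /^[^ \t#]/ { in_list = 0 }
  ' "$1"
}

# Usage (the path to cassandra.yaml depends on the installation):
#   list_data_directories /etc/cassandra/cassandra.yaml
```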
Use rsync to make an initial copy of the data on the old node to the new node. The following rsync command is performed on the new node and effectively “pulls” the data from the old node.
rsync -azP --delete-before --bwlimit $BANDWIDTH_LIMIT \
$USER_NAME@$OLD_NODE_IP_ADDRESS:$OLD_NODE_DATA_DIRECTORY/ \
$NEW_NODE_DATA_DIRECTORY/
Where the options used are:
- -a - Archive mode, which includes recursion and permission preservation.
- -z - Compress data during transfer.
- -P - Shows progress and keeps partially transferred files (equivalent to --partial --progress).
- --delete-before - Removes, before the transfer starts, any existing file in the destination folder that is not present in the source folder.
- --bwlimit - Maximum bandwidth in kilobytes per second to use for file transfer. See the $BANDWIDTH_LIMIT description for further information on what this value should be set to.
Where the values for the options and arguments are:
- $BANDWIDTH_LIMIT - A good starting point for this value could be the value of the stream_throughput_outbound_megabits_per_sec setting in the cassandra.yaml file, which defaults to 200. Note that there is a unit difference here. The Cassandra setting is in megabits per second whereas the rsync option is in kilobytes per second. Hence, we will need to take the stream_throughput_outbound_megabits_per_sec value, divide it by 8 and then multiply it by 1,000 to convert it to the value to use for --bwlimit. Depending on the network and the bandwidth available, it is possible to stop the command, tune the value and restart the rsync.
- $USER_NAME - User name used when accessing the node via SSH.
- $OLD_NODE_IP_ADDRESS - IP address of the old node to be replaced.
- $OLD_NODE_DATA_DIRECTORY - Absolute path of the data directory containing all keyspace data; this includes all the system keyspaces on the old node. This will be the value defined for the data_file_directories setting in the cassandra.yaml file on the old node. Be sure to include the trailing / at the end of the data directory path.
- $NEW_NODE_DATA_DIRECTORY - Location of the data directory that will contain a copy of all the data from the old node. This will be the value defined for the data_file_directories setting in the cassandra.yaml file on the new node. Be sure to include the trailing / at the end of the data directory path.
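The unit conversion for $BANDWIDTH_LIMIT can be worked through in a couple of lines. The sketch below follows the rule stated above (divide by 8, multiply by 1,000). The function name mbits_to_bwlimit is illustrative.

```shell
# Convert a throughput in megabits per second to the kilobytes-per-second
# value used for --bwlimit: divide by 8, then multiply by 1,000.
mbits_to_bwlimit() {
  echo $(( $1 * 1000 / 8 ))
}

# 200 is the default for stream_throughput_outbound_megabits_per_sec.
BANDWIDTH_LIMIT=$(mbits_to_bwlimit 200)
echo "$BANDWIDTH_LIMIT"   # prints 25000
```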
The reason we want to copy all the keyspace data including the system keyspaces is because the system keyspaces contain information about the tokens allocated to a node. Further to this, the system keyspace tells a node that it has data for the tokens allocated to it. Without this information the new node will assume it has no tokens allocated to it and proceed to acquire new tokens.
As mentioned previously, it is likely that the very first rsync operation will take in the order of tens of hours depending on how much data is on the old node. This being the case, it is probably a good idea to execute the rsync within a screen session or using nohup.
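A minimal sketch of the nohup approach is shown below. It assumes key-based SSH authentication is in place, since a background job cannot answer a password prompt. The names run_detached and BACKGROUND_PID are illustrative.

```shell
# Start a long-running command in the background, immune to hangups, with its
# output captured in a log file. The PID is stored in BACKGROUND_PID.
run_detached() {
  log_file="$1"
  shift
  nohup "$@" > "$log_file" 2>&1 &
  BACKGROUND_PID=$!
}

# Usage with the rsync command from this step:
#   run_detached rsync.log rsync -azP --delete-before --bwlimit "$BANDWIDTH_LIMIT" \
#     "$USER_NAME@$OLD_NODE_IP_ADDRESS:$OLD_NODE_DATA_DIRECTORY/" \
#     "$NEW_NODE_DATA_DIRECTORY/"
#   tail -f rsync.log
```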
Once the first rsync has completed, if the Status noted down in step 1 was DN (Down/Normal) then only the single rsync copy operation will be needed. This is because the old node will never accept any further writes while in this state. In this case skip over to step 5. However, if the node is in a UN (Up/Normal) state, it is worth repeating this step a few times to the point where the data on the new node is almost identical to that of the old node. At that point the rsync operation will take only a few minutes to complete.
3. Gracefully shut down Cassandra on the old node
Shut down the Cassandra process on the old node when there is a minimal data difference between the old node and the new node. This is to prevent the old node from accepting any more writes and to flush any remaining write mutations it accepted to disk. The old node can be shut down gracefully using the following commands.
nodetool drain
sleep 5
sudo service cassandra stop
Once the Cassandra process on the old node has stopped, change the cluster_name setting in the cassandra.yaml file to something other than its current value, e.g. Non-Existent-Cluster. This is to prevent the old node from ever joining the cluster again, in the event the Cassandra process was started by accident.
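The rename can be scripted if preferred. The following is a minimal sketch that assumes cluster_name appears once, at the start of a line, as in the stock cassandra.yaml. The function name set_cluster_name is illustrative.

```shell
# Rewrite the cluster_name setting in a cassandra.yaml file.
# The new content is staged in a temporary file and then written back over
# the original so the file keeps its existing ownership and permissions.
set_cluster_name() {
  yaml_file="$1"
  new_name="$2"
  tmp_file=$(mktemp)
  awk -v name="$new_name" '
    /^cluster_name:/ { print "cluster_name: \047" name "\047"; next }
    { print }
  ' "$yaml_file" > "$tmp_file" &&
    cat "$tmp_file" > "$yaml_file"
  status=$?
  rm -f "$tmp_file"
  return "$status"
}

# Usage on the old node (the path depends on the installation):
#   set_cluster_name /etc/cassandra/cassandra.yaml "Non-Existent-Cluster"
```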
4. Update the copied data on the new node again
The final rsync will need to be executed once the node has been shut down gracefully. This is so we can copy any new SSTables that were written to disk during the previous copy and as part of the shutdown. Execute the update rsync command as follows.
rsync -azP --delete-before --bwlimit $BANDWIDTH_LIMIT \
$USER_NAME@$OLD_NODE_IP_ADDRESS:$OLD_NODE_DATA_DIRECTORY/ \
$NEW_NODE_DATA_DIRECTORY/
If there are minimal data differences between the old node and the new node, this update should take no longer than a few minutes to complete.
Note that this step will need to be repeated for each data directory listed in the data_file_directories setting.
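When several data directories are configured, the repetition can be wrapped in a loop. The sketch below assumes each directory has the same path on both nodes, which the steps above do not require. The RSYNC override and the function name pull_data_directories are illustrative additions that allow a dry run.

```shell
# Command used for the copy; override (e.g. RSYNC="echo rsync") for a dry run.
RSYNC="${RSYNC:-rsync}"

# Pull every listed data directory from the old node, stopping at the first
# failure. Assumes each directory has the same path on both nodes.
pull_data_directories() {
  for dir in "$@"; do
    # ${dir%/}/ guarantees exactly one trailing slash on each path.
    $RSYNC -azP --delete-before --bwlimit "$BANDWIDTH_LIMIT" \
      "$USER_NAME@$OLD_NODE_IP_ADDRESS:${dir%/}/" "${dir%/}/" || return 1
  done
}

# Usage:
#   pull_data_directories /var/lib/cassandra/data1 /var/lib/cassandra/data2
```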
5. Configure the new node to replace the old node
Configure the cassandra.yaml file such that the new node can join the cluster by ensuring that the cluster_name setting is identical to that of the remaining nodes in the cluster.
Finally, ensure that the commitlog and saved caches folders are empty on the new node. The location of the commitlog and saved caches can be found in the cassandra.yaml file under the commitlog_directory and saved_caches_directory settings respectively.
The reason for ensuring that the commitlog and saved caches are empty is that we want to avoid having out-of-date information present when the node starts. In step 3 we ran the nodetool drain command on the old node before performing a final copy of the data to the new node in step 4. All the changes in the commitlog will have been written to SSTables on disk when the nodetool drain command ran, and as such the commitlog information is of no use. If it is replayed, the new node may use out-of-date data in the commitlog.
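Emptying the two folders could be done along the following lines. This is a sketch only: the function name empty_directory is illustrative, and the paths in the usage comment are assumed defaults that should be replaced with the values from cassandra.yaml.

```shell
# Remove everything inside a directory while keeping the directory itself.
# Refuses to run on an empty argument or on a path that is not a directory.
empty_directory() {
  dir="$1"
  [ -n "$dir" ] && [ -d "$dir" ] || return 1
  find "$dir" -mindepth 1 -delete
}

# Usage on the new node (assumed paths; use the commitlog_directory and
# saved_caches_directory values from cassandra.yaml):
#   empty_directory /var/lib/cassandra/commitlog
#   empty_directory /var/lib/cassandra/saved_caches
```

Keeping the directories themselves in place preserves their ownership and permissions for the Cassandra process.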
Note that if the commitlog, saved caches and data directories are missing when Cassandra starts, it will create them. On first installing Apache Cassandra, the process may start automatically, potentially causing issues when inserting a new node. One way to prevent this from occurring on Debian-based systems is to add the following policy on the node.
echo exit 101 > /usr/sbin/policy-rc.d
chmod +x /usr/sbin/policy-rc.d
Now we are ready to insert our replacement node!
6. Insert new node into cluster
The new node can be inserted by simply starting the Cassandra process as follows.
service cassandra start
The following messages will be observed in the logs which indicate that the old node has been replaced by the new node. Furthermore, the last log message below will be repeated for each token that the old node owned.
INFO [main] 2018-02-15 21:37:51,126 StorageService.java:900 - Using saved tokens [...]
WARN [GossipStage:1] 2018-02-15 21:37:53,040 StorageService.java:1970 - Not updating host ID $HOST_ID for $OLD_NODE_IP_ADDRESS because it's mine
INFO [GossipStage:1] 2018-02-16 13:23:56,421 StorageService.java:2028 - Nodes $OLD_NODE_IP_ADDRESS and $NEW_NODE_IP_ADDRESS have the same token ... Ignoring
Note that in the above message the $HOST_ID will be the Host ID noted down in step 1. In addition, when nodetool status is executed, the IP address of the new node ($NEW_NODE_IP_ADDRESS) will be associated with the $HOST_ID in the output.
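That final check can also be expressed as a small filter over the nodetool status output. The sketch below assumes the usual column layout (status code first, address second). The function name new_node_has_host_id is illustrative.

```shell
# Succeeds when `nodetool status` output on stdin shows the given IP address
# as Up/Normal (UN) and paired with the given Host ID.
new_node_has_host_id() {
  awk -v ip="$1" -v host_id="$2" '
    $1 == "UN" && $2 == ip {
      for (i = 3; i <= NF; i++) if ($i == host_id) found = 1
    }
    END { exit(found ? 0 : 1) }
  '
}

# Usage:
#   nodetool status | new_node_has_host_id "$NEW_NODE_IP_ADDRESS" "$HOST_ID" \
#     && echo "new node has taken over the Host ID"
```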
If the old node was a seed node, update the seed_provider setting in the cassandra.yaml file on all nodes. Update this setting by removing the IP address of the old node and replacing it with the IP address of the new node.
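For clusters with many nodes, the seed swap could be scripted. The sketch below assumes the seeds are written on a single line as a quoted, comma-separated list, as in the stock cassandra.yaml. It prints the modified file to standard output rather than editing in place, and the function name replace_seed is illustrative.

```shell
# Print a cassandra.yaml file with one seed address swapped for another.
# Assumes the common single-line layout:   - seeds: "10.0.0.1,10.0.0.2"
replace_seed() {
  awk -v old="$1" -v new="$2" '
    /^[ \t]*-[ \t]*seeds:/ {
      # parts[1] is the text before the opening quote, parts[2] the list.
      n = split($0, parts, "\"")
      if (n >= 3) {
        m = split(parts[2], ips, ",")
        joined = ""
        for (i = 1; i <= m; i++) {
          ip = ips[i]
          gsub(/[ \t]/, "", ip)
          # Compare whole addresses so 10.0.0.1 never matches 10.0.0.12.
          if (ip == old) ip = new
          joined = joined (i > 1 ? "," : "") ip
        }
        $0 = parts[1] "\"" joined "\"" parts[3]
      }
    }
    { print }
  ' "$3"
}

# Usage (review the output before replacing the real file):
#   replace_seed "$OLD_NODE_IP_ADDRESS" "$NEW_NODE_IP_ADDRESS" cassandra.yaml > cassandra.yaml.new
```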
Conclusion
As can be seen from the steps, it is possible to replace a node in the cluster without the need to use the Apache Cassandra bootstrap functionality. The advantage of this process is that it isolates the movement of data in the cluster to just two nodes: the old node and the new node. Thus, no other nodes in the cluster need to stream data to the new node as part of its insertion. For a cluster that contains nodes requiring a hardware upgrade, it could be advantageous to use these steps. Please feel free to leave comments or questions on this particular process.
Last updated: 2019-02-21