During a cluster’s lifespan, there will be scenarios where a node has been offline for longer than the gc_grace_seconds window or has entered an unrecoverable state. Since CASSANDRA-6961 landed in Cassandra 2.0.7, the process for reviving nodes that have been offline for longer than gc_grace_seconds has been dramatically shortened in cases where the cluster does not ingest deletion mutations.
Before we visit the new workflows, let’s gain a firm understanding of what occurs when a node is offline.
If a node has been offline for longer than gc_grace_seconds, a few things happen:
- Some hints will never be delivered to the revived node.
- Some tombstones will never be delivered to the revived node.
- The revived node is not consistent since it has missed recent mutation events.
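To make the threshold concrete, here is a minimal sketch that compares a node's downtime against gc_grace_seconds. The function name is hypothetical, and the fallback of 864000 seconds (10 days) assumes the table keeps Cassandra's default setting.

```shell
# Succeeds when the downtime (in seconds) exceeds gc_grace_seconds.
# The second argument is optional and falls back to 864000 (10 days),
# which is Cassandra's default gc_grace_seconds for a table.
exceeds_gc_grace() {
    downtime_seconds=$1
    gc_grace_seconds=${2:-864000}
    [ "$downtime_seconds" -gt "$gc_grace_seconds" ]
}

# Hypothetical usage: a node that was down for 12 days.
if exceeds_gc_grace $((12 * 86400)); then
    echo "offline longer than gc_grace_seconds"
fi
```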
If the downed node is able to be revived, it may experience the following problems:
- Read requests can serve deleted data that tombstones were meant to remove, because those tombstones may not yet have been delivered.
- If a tombstone expired and was forgotten by the cluster by exceeding the gc_grace_seconds window, the matching data has no recollection of a deletion event and can continue to propagate around the cluster as “live data”. This previously-deleted data is called “zombie data”.
- Read requests can serve stale data that has not been updated during the time the node has been offline.
The procedure for solving the problems listed above changed in Cassandra 2.0.7 and may not be common knowledge. This post will explain this new way to resolve the problems caused by a node being offline for a longer than ideal period, as well as the new streamlined solution for clusters that do not ingest deletions.
While the following instructions are meant to highlight an older procedure for restoring a node that has been dead for longer than gc_grace_seconds, this procedure also works for nodes that are within clusters that ingest deletions.
Stop the Node
This procedure assumes at least one of the following scenarios is true:
- The node is unrecoverable.
- The node has been offline for longer than
- The formerly online node will be swapped out and has now been shut down.
Start a New Node
A new node should be launched with the same settings as the previous node, but not allowed to auto-start and join the existing Cassandra cluster.
Note that CASSANDRA-7356 introduced a more friendly approach to the previous replace_address flag that was introduced via CASSANDRA-5916. The new replace_address_first_boot flag should always be preferred since it allows an operator to forget to remove the flag after the node starts up, without going through another replace_address process upon an unwitting restart.
Once the node has been launched without auto-starting the Cassandra process, we’ll want to enable the replace_address_first_boot flag by adding the following line to the bottom of the cassandra-env.sh on the new node:
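A minimal sketch of such a line, assuming the flag is passed as a JVM system property through JVM_OPTS. The address is a hypothetical placeholder for the IP of the node being replaced.

```shell
# Hypothetical address of the dead node; substitute the real one.
DEAD_NODE_IP="10.0.0.12"

# Appended to the bottom of cassandra-env.sh on the replacement node.
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=$DEAD_NODE_IP"
```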
With the flag, the Cassandra node will:
- Join the gossip ring.
- Report itself as the new owner of the previous IP’s token ranges thereby avoiding any token shuffling.
- Begin a bootstrap process with those token ranges.
- Move into the UN (Up/Normal) state as shown via
- At this point it will begin accepting reads and writes.
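One way to watch for that state change is to read it from nodetool status. The sketch below assumes the usual output layout, with the two-letter state in the first column and the node address in the second. The function name and the address are hypothetical.

```shell
# Print the two-letter state (for example UN, UJ, DN) that
# `nodetool status` reports for the given address.
# Assumes the state is in column 1 and the address in column 2.
node_state() {
    nodetool status | awk -v ip="$1" '$2 == ip { print $1 }'
}

# Hypothetical usage, run on any live node in the cluster:
#   node_state 10.0.0.12
```

The node is ready to serve traffic once the reported state is UN rather than UJ (Up/Joining).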
As of Cassandra 2.0.7, the recommendation changes in cases where deletion mutations are not ingested by the cluster.
WARNING: This section is only relevant if the cluster does not ingest deletion mutations. If the cluster ingests deletions and a node has been offline and cannot complete a full repair before the gc_grace_seconds window expires, resurrecting zombie data is still a concern. If zombie data is a concern, please follow the Post-1.2.17/2.0.2 Resolution.
If zombie data is not a concern, but ensuring highly consistent nodes is a priority, following these instructions can ensure the node is fully consistent before allowing the node to respond to read requests.
Set the Node to Hibernate Mode
Add the following line to
The above line ensures that the node starts up in a hibernation state.
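As a sketch, assuming the hibernation state refers to Cassandra's cassandra.join_ring=false startup option and that JVM options are set in cassandra-env.sh, the line would look like this:

```shell
# Assumed location: the bottom of cassandra-env.sh on the node being revived.
# With join_ring=false the node starts and gossips but does not join the ring.
JVM_OPTS="$JVM_OPTS -Dcassandra.join_ring=false"
```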
After the above JVM option has been added, start the node that will be revived. If Cassandra was installed using Debian or RHEL packages, use the command:
sudo service cassandra start
Cassandra will then start in hibernation mode. During this time the node will be able to communicate with the Cassandra cluster without accepting client requests. This is important because we will be able to perform maintenance operations while not serving stale data.
Repair Missing Data
The next step will be to repair any missing data on the revived node using:
The repair process, similar to the bootstrap process, will:
- Build Merkle trees on each of the nodes holding replicas.
- Find all missing information required to make each Merkle leaf identical.
- Stream all missing information into the revived node.
- Compact all new SSTables as they are being written.
Because the repair process does not have to start from a 0-bytes baseline, we should effectively stream far less information when repairing a node than when bootstrapping or using the replace_address_first_boot procedure on a clean node.
The status for the repair can be confirmed in a few ways:
- The nodetool repair command will return when the repair is complete.
- system.log will output a statement upon repair completion.
- nodetool compactionstats will have far fewer pending compactions.
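The repair step and the completion checks above can be sketched as a small script. The function names and the POLL_SECONDS variable are hypothetical, and the parsing assumes that nodetool compactionstats prints a line of the form "pending tasks: N".

```shell
# Print the pending compaction count reported by `nodetool compactionstats`.
# Assumes a line of the form "pending tasks: N".
pending_compactions() {
    nodetool compactionstats | awk '/^pending tasks:/ { print $3 }'
}

# Run a repair on the local (hibernating) node, then poll until no
# compactions are pending. The loop will not terminate if the output
# format differs from the assumption above.
repair_and_wait() {
    nodetool repair || return 1
    while [ "$(pending_compactions)" != "0" ]; do
        sleep "${POLL_SECONDS:-30}"
    done
}

# Hypothetical usage on the revived node:
#   repair_and_wait && echo "repair and compactions finished"
```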
Join the Repaired Node to the Cluster
Once the repair and subsequent pending compactions have been completed, have the revived node join the cluster using:
The above command is analogous to the last hidden step of the bootstrap process:
- Announce to the cluster that an existing node has re-joined the cluster.
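As a sketch, assuming nodetool is on the PATH of the revived node, the join and a follow-up check could look like this (the function name is hypothetical):

```shell
# Ask the hibernating node to join the ring, then print the cluster
# view so the operator can confirm the node is listed as UN.
rejoin_ring() {
    nodetool join && nodetool status
}

# Hypothetical usage on the revived node:
#   rejoin_ring
```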
Follow-up Task: Don’t Go into Hibernate Mode Upon Restart
The only follow-up required with this procedure is to remove the following line within
Removing the above line will ensure that the node does not go into hibernation mode upon the next restart of the Cassandra process.
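If the hibernation setting was added as a -Dcassandra.join_ring=false JVM option (an assumption here), the cleanup can be scripted. The function name and the file path in the usage comment are hypothetical.

```shell
# Delete every line that sets -Dcassandra.join_ring=false in the given
# file. sed keeps a backup copy next to it with a .bak suffix.
remove_hibernate_flag() {
    sed -i.bak '/-Dcassandra\.join_ring=false/d' "$1"
}

# Hypothetical usage:
#   remove_hibernate_flag /etc/cassandra/cassandra-env.sh
```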
The process for reviving nodes that have been offline for longer than gc_grace_seconds has been shortened dramatically in situations where the cluster does not ingest deletion mutations, thanks to the introduction of CASSANDRA-6961 in Cassandra 2.0.7.
Instead of requiring the legacy process of:
- Removing the downed node from the cluster.
- Bootstrapping a clean node into the cluster.
- Performing a rolling repair.
- Performing a rolling cleanup.
Or using the newer process of:
- Replacing the downed node with a new, clean node by its IP address.
We can now follow the simplified process of:
- Starting the downed node in hibernation mode.
- Repairing all missing data.
- Allowing the revived node to serve client requests.
- Ensuring the node will not enter hibernation mode upon restart.
While the number of steps seems roughly the same between the legacy process and the newest process, the newer process:
- Avoids streaming a full node’s worth of data by making use of previously held data.
- Does not require repairing all nodes within the affected data center.
- Does not require cleaning up all nodes within the affected data center.
Hopefully this guide proves useful in shortening an operator’s maintenance window by making use of previously held data, without shifting token ownership or creating time-consuming follow-up tasks.