Re-Bootstrapping Without Bootstrapping
During a cluster’s lifespan, there will be scenarios where a node has been offline for longer than the gc_grace_seconds window or has entered an unrecoverable state. Due to CASSANDRA-6961’s introduction in Cassandra 2.0.7, the process for reviving nodes that have been offline for longer than gc_grace_seconds has been dramatically shortened in cases where the cluster does not ingest deletion mutations.
Before we visit the new workflows, let’s gain a firm understanding of what occurs when a node is offline.
If a node has been offline for longer than gc_grace_seconds, a few things happen:
- Some hints will never be delivered to the revived node.
- Some tombstones will never be delivered to the revived node.
- The revived node is not consistent since it has missed recent mutation events.
If the downed node is able to be revived, it may experience the following problems:
- Read requests can serve data that tombstones were meant to delete, because those tombstones may not yet have been delivered to the revived node.
- If a tombstone expired and was forgotten by the cluster after exceeding the gc_grace_seconds window, the matching data has no record of the deletion event and can continue to propagate around the cluster as “live data”. This previously-deleted data is called “zombie data”.
- Read requests can serve stale data that has not been updated during the time the node has been offline.
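Since all of these risks hinge on how long the node has been down relative to gc_grace_seconds, it helps to know what that window actually is for your tables. Here is a minimal sketch using cqlsh; the keyspace and table names are placeholders:
# gc_grace_seconds is a per-table option and defaults to 864000 seconds (10 days).
# The value is printed alongside the rest of the table definition.
cqlsh -e "DESCRIBE TABLE my_keyspace.my_table;"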
The procedure for solving the problems listed above changed in Cassandra 2.0.7 and may not be common knowledge. This post explains the new way to resolve the problems caused by a node being offline for longer than ideal, as well as the streamlined solution for clusters that do not ingest deletions.
Post-1.2.17/2.0.9 Resolution
While the following instructions are meant to highlight an older procedure for restoring a node that has been dead for longer than gc_grace_seconds, this procedure also works for nodes that are within clusters that ingest deletions.
Stop the Node
This procedure assumes at least one of the following scenarios is true:
- The node is unrecoverable.
- The node has been offline for longer than gc_grace_seconds.
- The formerly online node will be swapped out and has now been shut down.
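If the node is still reachable and is simply being swapped out, it can be shut down cleanly first. A minimal sketch, assuming a package-based install where Cassandra runs as a system service (the service name may differ in your environment):
# Flush memtables and stop accepting new requests, then stop the process.
nodetool drain
sudo service cassandra stop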
Start a New Node
A new node should be launched with the same settings as the previous node, but not allowed to auto-start and join the existing Cassandra cluster. Once the node has been launched without auto-starting the Cassandra process, we’ll want to enable the replace_address_first_boot flag.
Note that CASSANDRA-7356 introduced a more friendly approach than the previous replace_address flag, which was introduced via CASSANDRA-5916. The new replace_address_first_boot flag should always be preferred, since an operator who forgets to remove the flag after the node starts up will not trigger another replace_address process upon an unwitting restart.
Activate this replace_address_first_boot flag by adding the following line to the bottom of cassandra-env.sh on the new node:
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=<dead_node_ip>"
With the flag, the Cassandra node will:
- Join the gossip ring.
- Report itself as the new owner of the previous IP’s token ranges, thereby avoiding any token shuffling.
- Begin a bootstrap process with those token ranges.
- Move into the UN (Up/Normal) state as shown via nodetool status.
- At this point it will begin accepting reads and writes.
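While the replacement node is streaming its token ranges, progress can be checked with the standard nodetool commands, run from any node in the cluster:
# Watch streaming progress while the replacement node bootstraps its ranges.
nodetool netstats
# Confirm the node eventually reports UN (Up/Normal) at the old node's IP.
nodetool status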
Post-2.0.7 Resolution
A previous version of this blog post included a workflow which used -Dcassandra.join_ring=false post-Cassandra 2.0.7 to repair a hibernating node and save on repair time costs. However, after a good mailing list discussion and follow up with Alex Dejanovski and Alain Rodriquez, it turns out that starting a node up with -Dcassandra.join_ring=false would:
- Prevent writes from being received on the hibernating node.
- Allow clients to contact the hibernating node directly, even if requests could not be routed to the node through other coordinators which saw the node as down.
- Have other nodes collect hints for the hibernating node. However, if the node wasn’t fully repaired and did not join the ring within the hint window, which defaults to 3 hours, the writes would once again not be delivered to the hibernating node without running another repair.
Because of all of the above caveats, we have decided to remove the previous section and instead recommend the replace_address_first_boot method for all versions of Cassandra that support this feature.
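For reference, the 3-hour hint window mentioned above is controlled by the max_hint_window_in_ms setting in cassandra.yaml. A quick way to check it, assuming a typical package-install path:
# max_hint_window_in_ms defaults to 10800000 ms (3 hours); hints stop being collected
# for a node that stays down longer than this, and a repair is then required.
grep max_hint_window_in_ms /etc/cassandra/cassandra.yaml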
Recap
The legacy process to replace a node that has been offline for longer than gc_grace_seconds used to require multiple time-consuming steps (sketched below):
- Removing the downed node from the cluster.
- Bootstrapping a clean node into the cluster.
- Performing a rolling repair.
- Performing a rolling cleanup.
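Roughly, that legacy sequence mapped to the following commands; this is only an illustrative sketch, and the host ID is a placeholder:
# Remove the dead node's ranges from the cluster (host ID comes from nodetool status).
nodetool removenode <host-id-of-dead-node>
# Bootstrap a fresh node normally, then on each remaining node:
nodetool repair -pr   # rolling repair, one node at a time
nodetool cleanup      # rolling cleanup, one node at a time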
We can now perform one step to consistently bring a node that has been down for longer than gc_grace_seconds online without changing any previous replica ownership by:
- Replacing the downed node with a new, clean node via the downed node’s IP address.
Hopefully this guide proves useful in shortening an operator’s maintenance window by avoiding shifts in token ownership and the time-consuming follow-up tasks they create.