Re-Bootstrapping Without Bootstrapping
During a cluster’s lifespan, there will be scenarios where a node has been offline for longer than the gc_grace_seconds window or has entered an unrecoverable state. Due to CASSANDRA-6961’s introduction in Cassandra 2.0.7, the process for reviving nodes that have been offline for longer than gc_grace_seconds has been dramatically shortened in cases where the cluster does not ingest deletion mutations.
Before we visit the new workflows, let’s gain a firm understanding of what occurs when a node is offline.
If a node has been offline for longer than gc_grace_seconds, a few things happen:
- Some hints will never be delivered to the revived node.
- Some tombstones will never be delivered to the revived node.
- The revived node is not consistent since it has missed recent mutation events.
If the downed node is able to be revived, it may experience the following problems:
- Read requests can serve data that tombstones were meant to delete, because those tombstones may not yet have been delivered to the revived node.
- If a tombstone expired and was forgotten by the cluster after exceeding the gc_grace_seconds window, the matching data has no record of the deletion event and can continue to propagate around the cluster as “live data”. This previously-deleted data is called “zombie data”.
- Read requests can serve stale data that has not been updated during the time the node has been offline.
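Since all of these risks hinge on how long the node has been down relative to gc_grace_seconds, it helps to know what that window actually is for your tables. Here is a minimal sketch using cqlsh; the keyspace and table names are placeholders:
# gc_grace_seconds is a per-table option and defaults to 864000 seconds (10 days).
# The value is printed alongside the rest of the table definition.
cqlsh -e "DESCRIBE TABLE my_keyspace.my_table;"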
The procedure for solving the problems listed above changed in Cassandra 2.0.7 and may not be common knowledge. This post explains the new way to resolve the problems caused by a node being offline for longer than ideal, as well as the streamlined solution for clusters that do not ingest deletions.
Post-1.2.17/2.0.9 Resolution
While the following instructions are meant to highlight an older procedure for restoring a node that has been dead for longer than gc_grace_seconds, this procedure also works for nodes that are within clusters that ingest deletions.
Stop the Node
This procedure assumes at least one of the following scenarios is true:
- The node is unrecoverable.
- The node has been offline for longer than gc_grace_seconds.
- The formerly online node will be swapped out and has now been shut down.
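If the node is still reachable and is simply being swapped out, it can be shut down cleanly first. A minimal sketch, assuming a package-based install where Cassandra runs as a system service (the service name may differ in your environment):
# Flush memtables and stop accepting new requests, then stop the process.
nodetool drain
sudo service cassandra stop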
Start a New Node
A new node should be launched with the same settings as the previous node, but not allowed to auto-start and join the existing Cassandra cluster. Once the node has been launched without auto-starting the Cassandra process, we’ll want to enable the replace_address_first_boot flag.
Note that CASSANDRA-7356 introduced a more friendly approach than the previous replace_address flag, which was introduced via CASSANDRA-5916. The new replace_address_first_boot flag should always be preferred, since an operator who forgets to remove the flag after the node starts up will not trigger another replace_address process upon an unwitting restart.
Activate this replace_address_first_boot flag by adding the following line to the bottom of cassandra-env.sh on the new node:
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=<dead_node_ip>"
With the flag, the Cassandra node will:
- Join the gossip ring.
- Report itself as the new owner of the previous IP’s token ranges, thereby avoiding any token shuffling.
- Begin a bootstrap process with those token ranges.
- Move into the UN (Up/Normal) state as shown via nodetool status.
- At this point it will begin accepting reads and writes.
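While the replacement node is streaming its token ranges, progress can be checked with the standard nodetool commands, run from any node in the cluster:
# Watch streaming progress while the replacement node bootstraps its ranges.
nodetool netstats
# Confirm the node eventually reports UN (Up/Normal) at the old node's IP.
nodetool status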
Post-2.0.7 Resolution
A previous version of this blog post included a workflow which used -Dcassandra.join_ring=false post-Cassandra 2.0.7 to repair a hibernating node and save on repair time costs. However, after a good mailing list discussion and follow up with Alex Dejanovski and Alain Rodriquez, it turns out that starting a node up with -Dcassandra.join_ring=false would:
- Prevent writes from being received on the hibernating node.
- Allow clients to contact the hibernating node directly, even if requests could not be routed to the node through other coordinators which saw the node as down.
- Have other nodes collect hints for the hibernating node. However, if the node wasn’t fully repaired and did not join the ring within the hint window, which defaults to 3 hours, the writes would once again not be delivered to the hibernating node without running another repair.
Because of all of the above caveats, we have decided to remove the previous section and instead recommend the replace_address_first_boot method for all versions of Cassandra that support this feature.
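For reference, the 3-hour hint window mentioned above is controlled by the max_hint_window_in_ms setting in cassandra.yaml. A quick way to check it, assuming a typical package-install path:
# max_hint_window_in_ms defaults to 10800000 ms (3 hours); hints stop being collected
# for a node that stays down longer than this, and a repair is then required.
grep max_hint_window_in_ms /etc/cassandra/cassandra.yaml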
Recap
The legacy process to replace a node that has been offline for longer than gc_grace_seconds used to require multiple time-consuming steps (sketched below):
- Removing the downed node from the cluster.
- Bootstrapping a clean node into the cluster.
- Performing a rolling repair.
- Performing a rolling cleanup.
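Roughly, that legacy sequence mapped to the following commands; this is only an illustrative sketch, and the host ID is a placeholder:
# Remove the dead node's ranges from the cluster (host ID comes from nodetool status).
nodetool removenode <host-id-of-dead-node>
# Bootstrap a fresh node normally, then on each remaining node:
nodetool repair -pr   # rolling repair, one node at a time
nodetool cleanup      # rolling cleanup, one node at a time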
We can now perform one step to consistently bring a node that has been down for longer than gc_grace_seconds online without changing any previous replica ownership by:
- Replacing the downed node with a new, clean node via the downed node’s IP address.
Hopefully this guide proves useful in shortening an operator’s maintenance window by avoiding shifts in token ownership and the time-consuming follow-up tasks they create.