Re-Bootstrapping Without Bootstrapping

During a cluster’s lifespan, there will be scenarios where a node has been offline for longer than the gc_grace_seconds window or has entered an unrecoverable state. Due to CASSANDRA-6961’s introduction in Cassandra 2.0.7, the process for reviving nodes that have been offline for longer than gc_grace_seconds has been dramatically shortened in cases where the cluster does not ingest deletion mutations.

Before we visit the new workflows, let’s gain a firm understanding of what occurs when a node is offline.

If a node has been offline for longer than gc_grace_seconds a few things happen:

  • Some hints will never be delivered to the revived node.
  • Some tombstones will never be delivered to the revived node.
  • The revived node is not consistent since it has missed recent mutation events.

If the downed node is able to be revived, it may experience the following problems:

  • Read requests can serve deleted data that tombstones were meant to remove, because those tombstones may not yet have been delivered to the revived node.
  • If a tombstone expired and was forgotten by the cluster by exceeding the gc_grace_seconds window, the matching data has no recollection of a deletion event and can continue to propagate around the cluster as “live data”. This previously-deleted data is called “zombie data”.
  • Read requests can serve stale data that has not been updated during the time the node has been offline.

The procedure for solving the problems listed above changed in Cassandra 2.0.7 and may not be common knowledge. This post will explain this new way to resolve the problems caused by a node being offline for a longer than ideal period, as well as the new streamlined solution for clusters that do not ingest deletions.

Post-1.2.17/2.0.9 Resolution

While the following instructions describe the older procedure for restoring a node that has been dead for longer than gc_grace_seconds, this procedure also works for nodes in clusters that ingest deletions.

Stop the Node

This procedure assumes at least one of the following scenarios is true:

  • The node is unrecoverable.
  • The node has been offline for longer than gc_grace_seconds.
  • The formerly online node will be swapped out and has now been shut down.

Start a New Node

A new node should be launched with the same settings as the previous node, but not allowed to auto-start and join the existing Cassandra cluster. Once the node has been launched without auto-starting the Cassandra process, we’ll want to enable the replace_address_first_boot flag.

Note that CASSANDRA-7356 introduced a friendlier alternative to the earlier replace_address flag, which came from CASSANDRA-5916. The new replace_address_first_boot flag should always be preferred: if an operator forgets to remove it after the node starts up, an unwitting restart will not trigger another replace_address process.

Activate the replace_address_first_boot flag by adding the following line to the bottom of cassandra-env.sh on the new node:

JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=<dead_node_ip>"

With the flag, the Cassandra node will:

  • Join the gossip ring.
  • Report itself as the new owner of the previous IP’s token ranges, thereby avoiding any token shuffling.
  • Begin a bootstrap process with those token ranges.
  • Move into the UN (Up/Normal) state as shown via nodetool status.
  • Begin accepting reads and writes at that point.
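
While the node streams its token ranges, progress can be watched from the new node itself. The commands below are a minimal sketch using standard nodetool subcommands; the exact output varies between Cassandra versions:

# Watch active streams while the node rebuilds its token ranges
nodetool netstats

# Confirm the node eventually reports UN (Up/Normal)
nodetool status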

Post-2.0.7 Resolution

Post-Cassandra 2.0.7, the recommended procedure changes in cases where deletion mutations are not ingested by the cluster.

WARNING: This section is only relevant if the cluster does not ingest deletion mutations. If the cluster ingests deletions and a node has been offline and cannot complete a full repair before the gc_grace_seconds window expires, resurrecting zombie data is still a concern. If zombie data is a concern, please follow the Post-1.2.17/2.0.9 Resolution above.

If zombie data is not a concern, but ensuring highly consistent nodes is a priority, following these instructions can ensure the node is fully consistent before it is allowed to respond to read requests.

Set the Node to Hibernate Mode

Add the following line to cassandra-env.sh:

JVM_OPTS="$JVM_OPTS -Dcassandra.join_ring=false"

The above line ensures that the node starts up in a hibernation state.

Start Cassandra

After the above JVM option has been added, start the node that will be revived. If Cassandra was installed using Debian or RHEL packages, use the command:

sudo service cassandra start
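
If Cassandra was installed from a tarball and is started directly via the bin/cassandra script, the hibernation property can alternatively be passed on the command line at start time (a sketch; the property only needs to be supplied once, either here or in cassandra-env.sh):

bin/cassandra -Dcassandra.join_ring=false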

Cassandra will then start in hibernation mode. During this time the node will be able to communicate with the Cassandra cluster without accepting client requests. This is important because we will be able to perform maintenance operations while not serving stale data.
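
Before starting maintenance, it is worth confirming the process came up cleanly and is gossiping. The commands below are a sketch; the log path assumes a package install, and field names can vary slightly between versions:

# Watch the startup sequence; the log path assumes a package install
tail -f /var/log/cassandra/system.log

# "Gossip active" should report true on the hibernating node
nodetool info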

Repair Missing Data

The next step will be to repair any missing data on the revived node using:

nodetool repair

The repair process, similar to the bootstrap process, will:

  • Build Merkle trees on each of the nodes holding replicas.
  • Find all missing information required to make each Merkle leaf identical.
  • Stream all missing information into the revived node.
  • Compact all new SSTables as they are being written.

Because the repair process does not have to start from a 0-bytes baseline, far less information should be streamed when repairing a node than when bootstrapping a clean node or using the replace_address_first_boot procedure.

The status for the repair can be confirmed in a few ways:

  • The nodetool repair command will return when the repair is complete.
  • The system.log will output a statement upon repair completion.
  • nodetool compactionstats will show far fewer pending compactions.
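
The second and third checks can be scripted; a minimal sketch, assuming a package install with the default log location:

# Look for the repair completion statements in the log
grep -i repair /var/log/cassandra/system.log | tail

# Watch the pending compactions drain as the streamed data is compacted
nodetool compactionstats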

Join the Repaired Node to the Cluster

Once the repair and subsequent pending compactions have been completed, have the revived node join the cluster using:

nodetool join

The above command is analogous to the last hidden step of the bootstrap process:

  • Announce to the cluster that an existing node has re-joined the cluster.
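
Once the join completes, the result can be verified from any node in the cluster; a brief sketch:

# The revived node should now report UN (Up/Normal)
nodetool status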

Follow-up Task: Don’t Go into Hibernate Mode Upon Restart

The only follow-up required with this procedure is to remove the following line within cassandra-env.sh:

JVM_OPTS="$JVM_OPTS -Dcassandra.join_ring=false"

Removing the above line will ensure that the node does not go into hibernation mode upon the next restart of the Cassandra process.
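
If preferred, the removal can be scripted. The sketch below assumes a Debian-style package layout where cassandra-env.sh lives under /etc/cassandra; adjust the path for other installs. A backup of the original file is kept:

# Remove the hibernate flag from cassandra-env.sh (path is an assumption)
sudo sed -i.bak '/cassandra.join_ring=false/d' /etc/cassandra/cassandra-env.sh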

Recap

Due to the introduction of CASSANDRA-6961 in Cassandra 2.0.7, the process for reviving nodes that have been offline for longer than gc_grace_seconds has been shortened dramatically in situations where the cluster does not ingest deletion mutations.

Instead of requiring the legacy process of:

  • Removing the downed node from the cluster.
  • Bootstrapping a clean node into the cluster.
  • Performing a rolling repair.
  • Performing a rolling cleanup.

Or using the newer process of:

  • Replacing the downed node with a new, clean node via its IP address.

We can now follow the simplified process of:

  • Starting the downed node in hibernation mode.
  • Repairing all missing data.
  • Joining the revived node back into the ring so it can serve client requests.
  • Ensuring the node will not enter hibernation mode upon restart.

While the number of steps is roughly the same between the legacy process and the newest process, the newer process:

  • Avoids streaming an entire node’s worth of data by making use of previously held data.
  • Does not require repairing all nodes within the affected data center.
  • Does not require cleaning up all nodes within the affected data center.

Hopefully this guide proves useful in shortening an operator’s maintenance window by making use of previously held data, without shifting token ownership or creating time-consuming follow-up tasks.

cassandra bootstrapping removenode repair bootstrap join_ring streaming gc_grace_seconds tombstones