Joaquin Casares

Assassinate - A Command of Last Resort within Apache Cassandra

18 Sep 2018

The nodetool assassinate command is meant specifically to remove cosmetic issues after nodetool decommission or nodetool removenode commands have been properly run and at least 72 hours have passed. It is not a command that should be run under most circumstances nor included in your regular toolbox. Rather the lengthier nodetool decommission process is preferred when removing nodes to ensure no data is lost. Note that you can also use the nodetool removenode command if cluster consistency is not the primary concern.

This blog post will explain:

How gossip works and why assassinate can disrupt it.
How to properly remove nodes.
When and how to assassinate nodes.
How to resolve issues when assassination attempts fail.

Gossip: Cassandra’s Decentralized Topology State

Since all topological changes happen within Cassandra’s gossip layer, before we discuss how to manipulate the gossip layer, let’s discuss the fundamentals of how gossip works.

From Wikipedia:

A gossip (communication) protocol is a procedure or process of computer-computer communication that is based on the way social networks disseminate information or how epidemics spread… Modern distributed systems often use gossip protocols to solve problems that might be difficult to solve in other ways, either because the underlying network has an inconvenient structure, is extremely large, or because gossip solutions are the most efficient ones available.

The gossip state within Cassandra is the decentralized, eventually consistent, agreed upon topological state of all nodes within a Cassandra cluster. Cassandra gossip heartbeats keep the topological gossip state updated, are emitted via each node in the cluster, and contain the following information:

What that node’s status is, and
What its neighbors statuses’ are.

When a node goes offline the gossip heartbeat will not be emitted and the node’s neighbors will detect that the node is offline (with help from an algorithm which is tuned by the phi_convict_threshold parameter as defined within the cassandra.yaml), and the neighbor will broadcast an updated status saying that the neighbor node is unavailable until further notice.

However, as soon as the node comes online, two things will happen:

The revived node will:
- Ask a neighbor node what the current gossip state is.
- Modify the received gossip state to include its own status.
- Assume the modified state as its own.
- Broadcast the new gossip state across the network.
A neighbor node will:
- Discover the revived node is back online, either by:
  - First-hand discovery, or
  - Second-hand gossiping.
- Update the received gossip state with the new information.
- Modify the received gossip state to include its own status.
- Assume the modified state as its own.
- Broadcast the new gossip state across the network.

The above gossip protocol is responsible for the UN/DN, or Up|Down/Normal, statuses seen within nodetool status and is responsible for ensuring requests and replicas are properly routed to the available and responsible nodes, among other tasks.

Differences Between `Assassination`, `Decommission`, and `Removenode`

There are three main commands used to take a node offline: assassination, decommission, and removenode. Having the node be in the LEFT state ensures that each node’s gossip state will eventually be consistent and agree that:

The deprecated node has in fact been deprecated.
The deprecated node was deprecated after a given timestamp.
The deprecated token ranges are now owned by a new node.
Ideally, the deprecated LEFT stage will be automatically purged in 72 hours.

Underlying Actions of `Decommission` and `Removenode` on the Gossip Layer

When nodetool decommission and nodetool removenode commands are run, we are changing the state of the gossip layer to the LEFT state for the deprecated node.

Following the gossip protocol procedure in the previous section, the LEFT status will spread across the cluster as the the new truth since the LEFT status has a more recent timestamp than the previous status.

As more nodes begin to assimilate the LEFT status, the cluster will ultimately reach consensus that the deprecated node has LEFT the cluster.

Underlying Actions of `Assassination`

Unlike nodetool decommission and nodetool removenode above, when nodetool assassinate is used we are updating the gossip state to the LEFT state, then forcing an incrementation of the gossip generation number, and updating the application state to the LEFT state explicitly, which will then propagate as normal.

Removing Nodes: The “Proper” Way

When clusters grow large, an operator may need to remove a node, either due to hardware faults or horizontally scaling down the cluster. At that time, the operator will need to modify the topological gossip state with either a nodetool decommission command for online nodes or nodetool removenode for offline nodes.

Decommissioning a Node: While Saving All Replicas

The typical command to run on a live node that will be exiting the cluster is:

nodetool decommission

The nodetool decommission command will:

Extend certain token ranges within the gossip state.
Stream all of the decommissioned node’s data to the new replicas in a consistent manner (the opposite of bootstrap).
Report to the gossip state that the node has exited the ring.

While this command may take a while to complete, the extra time spent on this command will ensure that all owned replicas are streamed off the node and towards the new replica owners.

Removing a Node: And Losing Non-Replicated Replicas

Sometimes a node may be offline due to hardware issues and/or has been offline for longer than gc_grace_seconds within a cluster that ingests deletion mutations. In this case, the node needs to be removed from the cluster while remaining offline to prevent “zombie data” from propagating around the cluster due to already expired tombstones, as defined by the gc_grace_seconds window. In the case where the node will remain offline, the following command should be run on a neighbor node:

nodetool removenode $HOST_ID

The nodetool removenode command will:

Extend certain token ranges within the gossip state.
Report to the gossip state that the node has exited the ring.
Will NOT stream any of the decommissioned node’s data to the new replicas.
Streaming will occur between the other existing replicas to the new replica.

Increasing Consistency After Removing a Node

Typically a follow up repair is required in a rolling fashion around the data center to ensure each new replica has the required information:

nodetool repair -pr

Note that:

The above command will only repair replica consistencies if the replication factor is greater than 1 and one of the surviving nodes contains a replica of the data.
Running a rolling repair will generate disk, CPU, and network load proportional to the amount of data needing to be repaired.
Throttling a rolling repair by repairing only one node at a time may be ideal.
Using Reaper for Apache Cassandra can schedule, manage, and load balance the repair operations throughout the lifetime of the cluster.

How We Can Detect Assassination is Needed

In either of the above cases, sometimes the gossip state will continue to be out of sync. There will be echoes of past statuses that claim not only the node is still part of the cluster, but it may still be alive. And then missing. Intermittently.

When the gossip truth is continuously inconsistent, nodetool assassinate will resolve these inconsistencies, but should only be run after nodetool decommission or nodetool removenode have been run and at least 72 hours have passed.

These issues are typically cosmetic and appear as similar lines within the system.log:

2014-09-26 01:26:41,901 DEBUG [Reconnection-1] Cluster - Failed reconnection to /172.x.y.zzz:9042 ([/172.x.y.zzz:9042] Cannot connect), scheduling retry in 600000 milliseconds

Or may appear as UNREACHABLE within the nodetool describecluster output:

Cluster Information:
	Name: Production Cluster
       Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
       Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
       Schema versions:
              65e78f0e-e81e-30d8-a631-a65dff93bf82: [172.x.y.z]
              UNREACHABLE: [172.x.y.zzz]

Sometimes you may find yourself looking even deeper and spot the deprecated node within nodetool gossipinfo months after removing the node:

/172.x.y.zzz
generation:1486750581
heartbeat:9999
STATUS:131099:LEFT,-5747706879722151680,1487011227852
TOKENS: not present

Note that the LEFT status should stick around for 72 hours to ensure all nodes come to the consensus that the node has been removed. So please don’t rush things if that’s the case. Again, it’s only cosmetic.

In all of these cases the truth may be slightly outdated and an operator may want to set the record straight with truth-based gossip states instead of cosmetic rumor-filled gossip states that include offline deprecated nodes.

How to Run the Assassination Command

Pre-2.2.0, operators used to have to use Java MBeans to assassinate a token (see below). Post-2.2.0, operators can use the nodetool assassinate method.

From an online node, run the command:

nodetool assassinate $IP_ADDRESS

Internally, the nodetool assassinate command will execute the unsafeAssassinateEndpoint command over JMX on the Gossiper MBean.

Java Mbeans Assassination

If using a version of Cassandra that does not yet have the nodetool assassinate command, we’ll have to rely on jmxterm.

You can use the following command to download jmxterm:

wget http://downloads.sourceforge.net/cyclops-group/jmxterm-1.0.0-uber.jar

Then we’ll want to use the Gossiper MBean and run the unsafeAssassinateEndpoint command:

echo "run -b org.apache.cassandra.net:type=Gossiper unsafeAssassinateEndpoint $IP_TO_ASSASSINATE" \
    | java -jar jmxterm-1.0.0-uber.jar -l $IP_OF_LIVE_NODE:7199

Both of the assassination commands will trigger the same MBean command over JMX, however the nodetool assassinatecommand is preferred for its ease of use without additional dependencies.

Resolving Failed Assassination Attempts: And Why the First Attempts Failed

When clusters grow large enough, are geospatially distant enough, or are under intense load, the gossip state may become a bit out of sync with reality. Sometimes this causes assassination attempts to fail and while the solution may sound unnerving, it’s relatively simple once you consider how gossip states act and are maintained.

Because gossip states can be decentralized across high latency nodes, sometimes gossip state updates can be delayed and cause a variety of race conditions that may show offline nodes as still being online. Most times these race conditions will be corrected in relatively short-time periods, as tuned by the phi_convict_threshold within the cassandra.yaml (between a value of 8 for hardware and 12 for virtualized instances). In almost all cases the gossip state will converge into a global truth.

However, because gossip state from nodes that are no longer participating in gossip heartbeat rounds do not have an explicit source and are instead fueled by rumors, dead nodes may sometimes continue to live within the gossip state even after calling the assassinate command.

To solve these issues, you must ensure all race conditions are eliminated.

If a gossip state will not forget a node that was removed from the cluster more than a week ago:

Login to each node within the Cassandra cluster.
Download jmxterm on each node, if nodetool assassinate is not an option.
Run nodetool assassinate, or the unsafeAssassinateEndpoint command, multiple times in quick succession.
- I typically recommend running the command 3-5 times within 2 seconds.
- I understand that sometimes the command takes time to return, so the “2 seconds” suggestion is less of a requirement than it is a mindset.
- Also, sometimes 3-5 times isn’t enough. In such cases, shoot for the moon and try 20 assassination attempts in quick succession.

What we are trying to do is to create a flood of messages requesting all nodes completely forget there used to be an entry within the gossip state for the given IP address. If each node can prune its own gossip state and broadcast that to the rest of the nodes, we should eliminate any race conditions that may exist where at least one node still remembers the given IP address.

As soon as all nodes come to agreement that they don’t remember the deprecated node, the cosmetic issue will no longer be a concern in any system.logs, nodetool describecluster commands, nor nodetool gossipinfo output.

Recap: How To Properly Remove Nodes Completely

Operators shouldn’t opt for the assassinate command as a first resort for taking a Cassandra node out since it is sledgehammer and most of the time operators are dealing with a screw.

However, when operators follow best practices and perform a nodetool decommission for live nodes or nodetool removenode for offline nodes, sometimes lingering cosmetic issues may lead the operator to want to keep the gossip state consistent.

After at least a week of inconsistent gossip state, nodetool assassinate or the unsafeAssassinateEndpoint command may be used to remove deprecated nodes from the gossip state.

When a single assassination attempt does not work across an entire cluster, sometimes the assassination attempt needs to be attempted multiple times on all node within the cluster simultaneously. Doing so will ensure that each node modifies its own gossip state to accurately reflect the deprecated node’s absence within the gossip state as well as ensuring no node will further broadcast rumors of a false gossip state.

cassandra assassinate unsafeAssassinateEndpoint removenode decommission streaming gossip gossip state phi_convict_threshold race condition UNREACHABLE jmxterm