Assassinate - A Command of Last Resort within Apache Cassandra
The `nodetool assassinate` command is meant specifically to remove cosmetic issues after the `nodetool decommission` or `nodetool removenode` commands have been properly run and at least 72 hours have passed. It is not a command that should be run under most circumstances, nor should it be part of your regular toolbox. Rather, the lengthier `nodetool decommission` process is preferred when removing nodes, to ensure no data is lost. Note that you can also use the `nodetool removenode` command if cluster consistency is not the primary concern.
This blog post will explain:
- How gossip works and why `assassinate` can disrupt it.
- How to properly remove nodes.
- When and how to assassinate nodes.
- How to resolve issues when assassination attempts fail.
Gossip: Cassandra’s Decentralized Topology State
Since all topological changes happen within Cassandra’s gossip layer, let’s first cover the fundamentals of how gossip works before discussing how to manipulate it.
From Wikipedia:
A gossip (communication) protocol is a procedure or process of computer-computer communication that is based on the way social networks disseminate information or how epidemics spread… Modern distributed systems often use gossip protocols to solve problems that might be difficult to solve in other ways, either because the underlying network has an inconvenient structure, is extremely large, or because gossip solutions are the most efficient ones available.
The gossip state within Cassandra is the decentralized, eventually consistent, agreed-upon topological state of all nodes within a Cassandra cluster. Cassandra gossip heartbeats keep the topological gossip state updated, are emitted by each node in the cluster, and contain the following information:
- What that node’s status is, and
- What its neighbors’ statuses are.
When a node goes offline, its gossip heartbeat will no longer be emitted and the node’s neighbors will detect that the node is offline (with help from a failure-detection algorithm that is tuned by the `phi_convict_threshold` parameter defined within `cassandra.yaml`), and a neighbor will broadcast an updated status saying that the node is unavailable until further notice.
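For reference, `phi_convict_threshold` lives in `cassandra.yaml`; the path below assumes a typical package installation and is only an illustration:

```bash
# Hypothetical package-install path; adjust for your environment.
grep -n 'phi_convict_threshold' /etc/cassandra/cassandra.yaml
# The setting is usually commented out at its default of 8; higher values
# make the failure detector slower to convict a node as down.
```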
However, as soon as the node comes online, two things will happen:
- The revived node will:
  - Ask a neighbor node what the current gossip state is.
  - Modify the received gossip state to include its own status.
  - Assume the modified state as its own.
  - Broadcast the new gossip state across the network.
- A neighbor node will:
  - Discover the revived node is back online, either by:
    - First-hand discovery, or
    - Second-hand gossiping.
  - Update the received gossip state with the new information.
  - Modify the received gossip state to include its own status.
  - Assume the modified state as its own.
  - Broadcast the new gossip state across the network.
The above gossip protocol is responsible for the `UN`/`DN`, or Up|Down/Normal, statuses seen within `nodetool status`, and for ensuring requests and replicas are properly routed to the available and responsible nodes, among other tasks.
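For reference, a healthy node appears as `UN` and a downed node as `DN` in `nodetool status`; the output below is a rough illustration with placeholder addresses and host IDs:

```
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load      Tokens  Owns    Host ID                               Rack
UN  172.x.y.z    1.2 GiB   256     66.7%   5c3c81a6-...                          rack1
DN  172.x.y.zzz  1.1 GiB   256     33.3%   c8a0c2d1-...                          rack1
```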
Differences Between Assassination, Decommission, and Removenode
There are three main commands used to take a node offline: `assassinate`, `decommission`, and `removenode`. All three ultimately place the deprecated node into the `LEFT` state within the gossip layer. Having the node be in the `LEFT` state ensures that each node’s gossip state will eventually be consistent and agree that:
- The deprecated node has in fact been deprecated.
- The deprecated node was deprecated after a given timestamp.
- The deprecated token ranges are now owned by a new node.
Ideally, the deprecated node’s `LEFT` state will be automatically purged after 72 hours.
Underlying Actions of Decommission and Removenode on the Gossip Layer
When the `nodetool decommission` and `nodetool removenode` commands are run, we are changing the state of the gossip layer to the `LEFT` state for the deprecated node.
Following the gossip protocol procedure in the previous section, the `LEFT` status will spread across the cluster as the new truth since the `LEFT` status has a more recent timestamp than the previous status.
As more nodes begin to assimilate the `LEFT` status, the cluster will ultimately reach consensus that the deprecated node has `LEFT` the cluster.
Underlying Actions of Assassination
Unlike `nodetool decommission` and `nodetool removenode` above, when `nodetool assassinate` is used we update the gossip state to the `LEFT` state, then force an increment of the gossip generation number and explicitly update the application state to `LEFT`, which will then propagate as normal.
Removing Nodes: The “Proper” Way
When clusters grow large, an operator may need to remove a node, either due to hardware faults or to horizontally scale down the cluster. At that time, the operator will need to modify the topological gossip state with either a `nodetool decommission` command for online nodes or `nodetool removenode` for offline nodes.
Decommissioning a Node: While Saving All Replicas
The typical command to run on a live node that will be exiting the cluster is:
nodetool decommission
The `nodetool decommission` command will:
- Extend certain token ranges within the gossip state.
- Stream all of the decommissioned node’s data to the new replicas in a consistent manner (the opposite of bootstrap).
- Report to the gossip state that the node has exited the ring.
While this command may take a while to complete, the extra time spent on this command will ensure that all owned replicas are streamed off the node and towards the new replica owners.
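Decommissions can take hours on dense nodes, so it helps to watch the streaming progress while the command runs. A minimal sketch, run on the node that is leaving the cluster:

```bash
# Terminal 1: start the decommission and leave it running.
nodetool decommission

# Terminal 2: watch the outbound streaming sessions until they complete.
watch -n 30 nodetool netstats

# When the decommission has finished, the node reports it in its mode line.
nodetool netstats | grep -i mode
# Mode: DECOMMISSIONED
```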
Removing a Node: And Losing Non-Replicated Replicas
Sometimes a node may be offline due to hardware issues and/or may have been offline for longer than `gc_grace_seconds` within a cluster that ingests deletion mutations. In this case, the node needs to be removed from the cluster while remaining offline to prevent “zombie data” from propagating around the cluster due to already-expired tombstones, as defined by the `gc_grace_seconds` window. In the case where the node will remain offline, the following command should be run on a neighbor node:
nodetool removenode $HOST_ID
The `nodetool removenode` command will:
- Extend certain token ranges within the gossip state.
- Report to the gossip state that the node has exited the ring.
- NOT stream any of the removed node’s data to the new replicas.
  - Instead, streaming will occur from the other existing replicas to the new replica owners.
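A minimal sketch of the lookup-and-remove flow, run from any live node (the host ID below is a placeholder):

```bash
# Find the Host ID of the dead node; it will be listed with a DN status.
nodetool status | grep '^DN'
# DN  172.x.y.zzz  1.1 GiB  256  33.3%  c8a0c2d1-0000-0000-0000-000000000000  rack1

# Remove the node by Host ID (not by IP address).
nodetool removenode c8a0c2d1-0000-0000-0000-000000000000

# Check on a long-running removal, and only force it as a last resort,
# since forcing skips the streaming between the surviving replicas.
nodetool removenode status
nodetool removenode force
```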
Increasing Consistency After Removing a Node
Typically a follow-up repair is required, run in a rolling fashion around the data center, to ensure each new replica has the required information:
nodetool repair -pr
Note that:
- The above command will only repair replica consistencies if the replication factor is greater than 1 and one of the surviving nodes contains a replica of the data.
- Running a rolling repair will generate disk, CPU, and network load proportional to the amount of data needing to be repaired.
- Throttling a rolling repair by repairing only one node at a time may be ideal.
- Reaper for Apache Cassandra can schedule, manage, and load balance the repair operations throughout the lifetime of the cluster.
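A rough sketch of a rolling repair, assuming a hypothetical list of node addresses and that JMX is reachable from wherever the script runs; otherwise run `nodetool repair -pr` locally on each node, one node at a time:

```bash
# Hypothetical list of every node in the data center.
NODES="172.x.y.1 172.x.y.2 172.x.y.3"

for host in $NODES; do
  echo "Repairing primary ranges on $host ..."
  # -pr repairs only the primary token ranges owned by each node, so running
  # it once per node covers the full token range exactly once.
  nodetool -h "$host" repair -pr
done
```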
How to Detect That Assassination Is Needed
In either of the above cases, sometimes the gossip state will continue to be out of sync. There will be echoes of past statuses claiming not only that the node is still part of the cluster, but that it may still be alive. And then missing. Intermittently.
When the gossip truth is continuously inconsistent, `nodetool assassinate` will resolve these inconsistencies, but it should only be run after `nodetool decommission` or `nodetool removenode` have been run and at least 72 hours have passed.
These issues are typically cosmetic and appear as lines similar to the following within the `system.log`:
2014-09-26 01:26:41,901 DEBUG [Reconnection-1] Cluster - Failed reconnection to /172.x.y.zzz:9042 ([/172.x.y.zzz:9042] Cannot connect), scheduling retry in 600000 milliseconds
Or they may appear as `UNREACHABLE` within the `nodetool describecluster` output:
Cluster Information:
Name: Production Cluster
Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
65e78f0e-e81e-30d8-a631-a65dff93bf82: [172.x.y.z]
UNREACHABLE: [172.x.y.zzz]
Sometimes you may find yourself looking even deeper and spot the deprecated node within `nodetool gossipinfo` months after removing the node:
/172.x.y.zzz
generation:1486750581
heartbeat:9999
STATUS:131099:LEFT,-5747706879722151680,1487011227852
TOKENS: not present
Note that the `LEFT` status should stick around for 72 hours to ensure all nodes come to the consensus that the node has been removed. So please don’t rush things if that’s the case. Again, it’s only cosmetic.
In all of these cases the truth may be slightly outdated and an operator may want to set the record straight with truth-based gossip states instead of cosmetic rumor-filled gossip states that include offline deprecated nodes.
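When you suspect a lingering entry, a quick sweep across the cluster will show whether any node still gossips about the deprecated address. A sketch assuming a hypothetical node list, a placeholder address, and remote JMX access:

```bash
# Hypothetical values; substitute your own node list and the deprecated address.
NODES="172.x.y.1 172.x.y.2 172.x.y.3"
DEAD_IP="172.x.y.zzz"

for host in $NODES; do
  echo "== $host =="
  nodetool -h "$host" describecluster | grep 'UNREACHABLE' || true
  nodetool -h "$host" gossipinfo | grep -A3 "$DEAD_IP" || true
done
```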
How to Run the Assassination Command
Pre-2.2.0, operators had to use Java MBeans to assassinate a token (see below). Post-2.2.0, operators can use the `nodetool assassinate` command.
From an online node, run the command:
nodetool assassinate $IP_ADDRESS
Internally, the `nodetool assassinate` command will execute the `unsafeAssassinateEndpoint` command over JMX on the `Gossiper` MBean.
Java MBeans Assassination
If using a version of Cassandra that does not yet have the `nodetool assassinate` command, we’ll have to rely on jmxterm.
You can use the following command to download jmxterm:
wget http://downloads.sourceforge.net/cyclops-group/jmxterm-1.0.0-uber.jar
Then we’ll want to use the `Gossiper` MBean and run the `unsafeAssassinateEndpoint` command:
echo "run -b org.apache.cassandra.net:type=Gossiper unsafeAssassinateEndpoint $IP_TO_ASSASSINATE" \
| java -jar jmxterm-1.0.0-uber.jar -l $IP_OF_LIVE_NODE:7199
Both of the assassination commands will trigger the same MBean operation over JMX; however, the `nodetool assassinate` command is preferred for its ease of use without additional dependencies.
Resolving Failed Assassination Attempts: And Why the First Attempts Failed
When clusters grow large enough, are geospatially distant enough, or are under intense load, the gossip state may become a bit out of sync with reality. Sometimes this causes assassination attempts to fail and while the solution may sound unnerving, it’s relatively simple once you consider how gossip states act and are maintained.
Because gossip states can be decentralized across high-latency nodes, gossip state updates can sometimes be delayed and cause a variety of race conditions that may show offline nodes as still being online. Most times these race conditions will be corrected within relatively short time periods, as tuned by the `phi_convict_threshold` parameter within `cassandra.yaml` (between a value of 8 for physical hardware and 12 for virtualized instances). In almost all cases the gossip state will converge into a global truth.
However, because the gossip state for nodes that are no longer participating in gossip heartbeat rounds does not have an explicit source and is instead fueled by rumors, dead nodes may sometimes continue to live within the gossip state even after the assassinate command has been called.
To solve these issues, you must ensure all race conditions are eliminated.
If a gossip state will not forget a node that was removed from the cluster more than a week ago:
- Log in to each node within the Cassandra cluster.
- Download jmxterm on each node, if `nodetool assassinate` is not an option.
- Run `nodetool assassinate`, or the `unsafeAssassinateEndpoint` command, multiple times in quick succession.
  - I typically recommend running the command 3-5 times within 2 seconds.
  - I understand that sometimes the command takes time to return, so the “2 seconds” suggestion is less of a requirement than it is a mindset.
  - Also, sometimes 3-5 times isn’t enough. In such cases, shoot for the moon and try 20 assassination attempts in quick succession.
What we are trying to do is create a flood of messages requesting that all nodes completely forget there used to be an entry within the gossip state for the given IP address. If each node can prune its own gossip state and broadcast that to the rest of the nodes, we should eliminate any race conditions where at least one node still remembers the given IP address.
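Here is a minimal sketch of that approach, assuming a hypothetical node list, that remote JMX is enabled, and that your version ships `nodetool assassinate`; on older versions substitute the jmxterm invocation shown earlier:

```bash
# Hypothetical values; substitute your own node list and the stubborn address.
NODES="172.x.y.1 172.x.y.2 172.x.y.3"
DEAD_IP="172.x.y.zzz"

for host in $NODES; do
  (
    # Fire several attempts in quick succession against each node so that
    # every node purges its own gossip entry at roughly the same time.
    for attempt in 1 2 3 4 5; do
      nodetool -h "$host" assassinate "$DEAD_IP" || true
    done
  ) &
done
wait
```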
As soon as all nodes come to agreement that they no longer remember the deprecated node, the cosmetic issue will no longer be a concern in any `system.log`, `nodetool describecluster`, or `nodetool gossipinfo` output.
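The same loop style used earlier can confirm this (reusing the hypothetical `NODES` and `DEAD_IP` variables from above):

```bash
for host in $NODES; do
  if nodetool -h "$host" gossipinfo | grep -q "$DEAD_IP"; then
    echo "$host still remembers $DEAD_IP"
  else
    echo "$host has forgotten $DEAD_IP"
  fi
done
```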
Recap: How To Properly Remove Nodes Completely
Operators shouldn’t opt for the `assassinate` command as a first resort for taking a Cassandra node out of the cluster, since it is a sledgehammer and most of the time operators are dealing with a screw.
However, when operators follow best practices and perform a `nodetool decommission` for live nodes or `nodetool removenode` for offline nodes, sometimes lingering cosmetic issues may lead the operator to want to keep the gossip state consistent.
After at least a week of inconsistent gossip state, `nodetool assassinate` or the `unsafeAssassinateEndpoint` command may be used to remove deprecated nodes from the gossip state.
When a single assassination attempt does not work across an entire cluster, sometimes the assassination needs to be attempted multiple times on all nodes within the cluster simultaneously. Doing so will ensure that each node modifies its own gossip state to accurately reflect the deprecated node’s absence, as well as ensuring no node will further broadcast rumors of a false gossip state.