Rolling Reboots with cstarpar

Welcome to the third post in our cstar series. So far, the first post gave an introduction to cstar, while the second post explained how to extend cstar with custom commands. In this post we will look at cstar’s cousin cstarpar. Both utilities deliver the same topology-aware orchestration, yet cstarpar executes commands locally, allowing operations cstar is not capable of.

Using ssh

cstarpar relies heavily on ssh working smoothly and without any user prompts. When we run a command with cstar, it will take the command, ssh into the remote host, and execute the command on our behalf. For example, we can run hostname on each node of a 3-node cluster:

$ cstar run --seed-host 172.16.0.1 --command hostname
...
$ cat ~/.cstar/jobs/8ff6811e-31e7-4975-bec4-260eae885ef6/ec2-*/out
ip-172-16-0-1
ip-172-16-0-3
ip-172-16-0-2

If we switch to cstarpar, it will execute the hostname command locally and we will see something different:

$ cstarpar --seed-host 172.16.0.1 hostname
...
$ cat ~/.cstar/jobs/a1735406-ae58-4e44-829b-9e8d4a90fd06/ec2-*/out
radovan-laptop
radovan-laptop
radovan-laptop

To make cstarpar execute commands on remote machines we just need to make the command explicitly use ssh:

$ cstarpar --seed-host 172.16.0.1 "ssh {} hostname"
...
cat ~/.cstar/jobs/2c54f7a1-8982-4f2e-ada4-8b45cde4c4eb/ec2-*/out
ip-172-16-0-1
ip-172-16-0-3
ip-172-16-0-2

Here we can see the hostname was executed on the remote hosts.

Reboots

The true advantage of local execution is that there is no need for interaction with the remote host. This approach allows operations that would normally prevent that interaction, such as reboots. For example, the following command reboots the entire cluster in a topology-aware fashion, albeit very roughly because it gracelessly kills all processes, including Cassandra:

$ cstarpar --seed-host 172.16.0.1 -- "ssh {} sudo reboot &"

Note that this example used the sudo reboot & command. The reboot command on its own causes the reboot immediately. This is so drastic that it causes Python’s subprocess module to think an error occured. Placing the & after the command, directing to run the command in the background, allows the shell execution return back to Python cleanly. Once the host is down, cstarpar will mark the host as such in the job status report.

It is important to ensure the hosts are configured to start the Cassandra process automatically after the reboot, because just like cstar, cstartpar will proceed with next hosts only if all hosts are up and will otherwise wait indefinitely for the rebooted host to come back.

Since cstarpar can execute local commands and scripts, it need not support complex commands in the same way cstar does. To run a complex command with cstarpar, we can use a script file. To illustrate this, the script below will add a graceful shutdown of Cassandra before executing the actual reboot:

$ cat ~/gentle_reboot.sh
FQDN=$1

echo "Draining Cassandra"
ssh ${FQDN} nodetool drain && sleep 5

echo "Stopping Cassandra process"
ssh ${FQDN} sudo service cassandra stop && sleep 5

echo "Rebooting"
ssh ${FQDN} sudo reboot &

The reboot command then runs like this:

$ cstarpar --seed-host 172.16.0.1 -- "bash /absolute/path/to/gentle_reboot.sh {}"

Replication and Conclusion

For this post, I used a simple three node cluster provisioned with tlp-cluster. cstarpar relies heavily on ssh working smoothly and without user prompts. Initially, I attempted the connection without any specific ssh configuration on my laptop or the AWS hosts, the ssh calls looked like this:

$ cstarpar --seed-host ${SEED_IP} --ssh-identity-file=${PATH_TO_KEY}  --ssh-username ubuntu "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no ubuntu@{} hostname"

In the gentle_reboot.sh I also had to add some options:

ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i ${PATH_TO_KEY} ubuntu@${FQDN} sudo reboot &

Once configured, I was able to harness the full power of cstarpar, which supplements cstar functionality by executing commands locally. This was demonstrated to be useful for operations for which the cstar’s mode of operation is not well suited, such as reboots. Importantly, to leverage the most value from cstarpar, it is critical to have ssh configured to run smoothly and without any user prompts.

cassandra cstar distributed shell operations topology