Creating Flamegraphs with Apache Cassandra in Kubernetes (cass-operator)

In a previous blog post recommending that read repair chance be disabled, some flamegraphs were generated to demonstrate the effect read repair chance had on a cluster. Let's go through how those flamegraphs were captured, step by step, using Apache Cassandra 3.11.6, Kubernetes with the cass-operator, NoSqlBench, and the async-profiler.

In previous blog posts we would have used existing tools such as tlp-cluster or ccm, tlp-stress or cassandra-stress, and sjk. Here we take a new approach that is a lot more fun, as with k8s the same workflow can be used locally or in the cloud. There is no need to switch between ccm clusters for local testing and tlp-cluster for cloud testing, nor are you bound to AWS for big-instance testing. That's right: no vendor lock-in. Cass-operator and K8ssandra are getting a ton of momentum from DataStax, so it is only fitting, and exciting, to introduce them to as much of the open source world as we can.

This blog post is not an in-depth dive into using cass-operator, but rather a simple teaser demonstrating how to grab some flamegraphs as quickly as possible. The blog post is split into three sections:

  • Setting up Kubernetes and getting Cassandra running
  • Getting access to Cassandra from outside Kubernetes
  • Stress testing and creating flamegraphs

Setup

Let’s go through a quick demonstration using Kubernetes, the cass-operator, and some flamegraphs.

First, download the four yaml configuration files that will be used. This is not strictly necessary for the latter three, as kubectl can reference them by their URLs, but having local copies makes it easy to inspect and edit them if and when desired.

wget https://thelastpickle.com/files/2021-01-31-cass_operator/01-kind-config.yaml
wget https://thelastpickle.com/files/2021-01-31-cass_operator/02-storageclass-kind.yaml
wget https://thelastpickle.com/files/2021-01-31-cass_operator/11-install-cass-operator-v1.1.yaml
wget https://thelastpickle.com/files/2021-01-31-cass_operator/13-cassandra-cluster-3nodes.yaml

The next steps use kind and kubectl to create a local cluster we can test against. kind requires Docker to be running locally, and it is recommended to give Docker 4 CPUs and 12GB RAM for this exercise.
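For reference, a minimal kind config for one control-plane and three worker nodes looks something like the sketch below; the downloaded 01-kind-config.yaml may differ in details such as extra port mappings or mounts.

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
  - role: worker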

kind create cluster --name read-repair-chance-test --config 01-kind-config.yaml

kubectl create ns cass-operator
kubectl -n cass-operator apply -f 02-storageclass-kind.yaml
kubectl -n cass-operator apply -f 11-install-cass-operator-v1.1.yaml

# watch and wait until the pod is running
watch kubectl -n cass-operator get pod

# create 3 node C* cluster
kubectl -n cass-operator apply -f 13-cassandra-cluster-3nodes.yaml

# again, wait for pods to be running
watch kubectl -n cass-operator get pod
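
# optional: check provisioning progress via the CassandraDatacenter status
# (assuming the datacenter is named dc1, as in the yaml above; this prints
# "Ready" once the operator has finished)
kubectl -n cass-operator get cassdc dc1 -o "jsonpath={.status.cassandraOperatorProgress}"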

# test the three nodes are up
kubectl -n cass-operator exec -it cluster1-dc1-default-sts-0 -- nodetool status
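
All going well, nodetool will report all three nodes as UN (Up and Normal), along these lines (addresses, load, and host IDs will differ):

Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load   Tokens  Owns  Host ID  Rack
UN  10.244.1.5   ...    ...     ...   ...      default
UN  10.244.2.4   ...    ...     ...   ...      default
UN  10.244.3.4   ...    ...     ...   ...      default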

Access

For this example we are going to run NoSqlBench from outside the k8s cluster, so we need access to a pod's Native Protocol interface via port-forwarding. This approach is practical here because we want the benchmark to connect to just one coordinator. In many situations you would instead run NoSqlBench from a separate dedicated pod inside the k8s cluster.

# get the cql username
kubectl -n cass-operator get secret cluster1-superuser -o yaml | grep " username" | awk -F" " '{print $2}' | base64 -d && echo ""

# get the cql password
kubectl -n cass-operator get secret cluster1-superuser -o yaml | grep " password" | awk -F" " '{print $2}' | base64 -d && echo ""
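
# equivalently, kubectl's jsonpath output can pull the same values
# without the grep/awk plumbing
kubectl -n cass-operator get secret cluster1-superuser -o jsonpath='{.data.username}' | base64 -d && echo ""
kubectl -n cass-operator get secret cluster1-superuser -o jsonpath='{.data.password}' | base64 -d && echo ""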

# port forward the native protocol (CQL)
kubectl -n cass-operator port-forward --address 0.0.0.0 cluster1-dc1-default-sts-0 9042:9042 

The above sets up the k8s cluster, a k8s storageClass, and the cass-operator with a three-node Cassandra cluster. For a more in-depth look at this setup, check out this tutorial.
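
With the port-forward running, a quick sanity check from another terminal is to point a locally installed cqlsh (if you have it) at the forwarded port, using the credentials fetched above:

cqlsh -u <cql_username> -p <cql_password> 127.0.0.1 9042 -e "DESCRIBE KEYSPACES;"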

Stress Testing and Flamegraphs

With a cluster to play with, let’s generate some load and then go grab some flamegraphs.

Instead of using SJK (Swiss Java Knife), as our previous blog posts have done, we will use the async-profiler. The async-profiler does not suffer from the safepoint bias problem, an issue we see more often than we would like in Cassandra nodes (protip: make sure you configure ParallelGCThreads and ConcGCThreads to the same value).
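As an aside on that protip: in Cassandra 3.11 those GC thread counts are set in conf/jvm.options. The values below are purely illustrative, size them to the host's cores:

# illustrative values only, size to the host's cores
-XX:ParallelGCThreads=4
-XX:ConcGCThreads=4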

Open a new terminal window and do the following.

# get the latest NoSqlBench jarfile
wget https://github.com/nosqlbench/nosqlbench/releases/latest/download/nb.jar

# generate some load, use credentials as found above

java -jar nb.jar cql-keyvalue username=<cql_username> password=<cql_password> whitelist=127.0.0.1 rampup-cycles=10000 main-cycles=500000 rf=3 read_cl=LOCAL_ONE
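
# optional: NoSqlBench bundles docs for each workload; this should list
# the parameters cql-keyvalue accepts (assuming the standard nb help command)
java -jar nb.jar help cql-keyvalue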

# while the load is still running,
# open a shell in the coordinator pod, download async-profiler and generate a flamegraph
kubectl -n cass-operator exec -it cluster1-dc1-default-sts-0 -- /bin/bash

wget https://github.com/jvm-profiling-tools/async-profiler/releases/download/v1.8.3/async-profiler-1.8.3-linux-x64.tar.gz

tar xvf async-profiler-1.8.3-linux-x64.tar.gz
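
# find the Cassandra JVM's pid to pass to profiler.sh below (assuming ps is
# available in the container; otherwise try: grep -l CassandraDaemon /proc/[0-9]*/cmdline)
ps -ef | grep CassandraDaemon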

async-profiler-1.8.3-linux-x64/profiler.sh -d 300 -f /tmp/flame_away.svg <CASSANDRA_PID>

exit

# copy the flamegraph out of the pod
kubectl -n cass-operator cp cluster1-dc1-default-sts-0:/tmp/flame_away.svg flame_away.svg

Keep It Clean

After everything is done, it is time to clean up after yourself.

Delete the CassandraDatacenters first, otherwise Kubernetes will block their deletion because they use a finalizer. Note that this will delete all data in the cluster.

kubectl delete cassdcs --all-namespaces --all
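
Deletion can take a minute while the pods and their persistent volumes are reclaimed; it is complete once the following reports no resources:

kubectl get cassdcs --all-namespaces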

Remove the operator Deployment, CRD, etc.

# this command can take a while, be patient

kubectl delete -f https://raw.githubusercontent.com/datastax/cass-operator/v1.5.1/docs/user/cass-operator-manifests-v1.16.yaml

# if troubleshooting, the following forcibly removes all namespaced resources,
# though this should not be necessary; take care, as this will wipe everything

kubectl delete "$(kubectl api-resources --namespaced=true --verbs=delete -o name | tr "\n" "," | sed -e 's/,$//')" --all

To remove the local Kubernetes cluster altogether…

kind delete cluster --name read-repair-chance-test

To stop and remove the Docker containers that are left running…

docker stop $(docker ps | grep kindest | cut -d" " -f1)
docker rm $(docker ps -a | grep kindest | cut -d" " -f1)

More… the cass-operator tutorials

There is a ton of documentation and tutorials being released on how to use the cass-operator. If you are keen to learn more, the following is highly recommended: Managing Cassandra Clusters in Kubernetes Using Cass-Operator.
