The Fine Print When Using Multiple Data Directories

One of the longest lived features in Cassandra is the ability to allow a node to store data on more than one than one directory or disk. This feature can help increase cluster capacity or prevent a node from running out space if bootstrapping a new one will take too long to complete. Recently I was working on a cluster and saw how this feature has the potential to silently cause problems in a cluster. In this post we will go through some fine print when configuring Cassandra to use multiple disks.

Jay… what?

The feature which allows Cassandra to store data on multiple disks is commonly referred to as JBOD [pronounced jay-bod] which stands for “Just a Bunch Of Disks/Drives”. In Cassandra this feature is controlled by the data_file_directories setting in the cassandra.yaml file. In relation to this setting, Cassandra also allows its behaviour on disk failure to be controlled using the disk_failure_policy setting. For now I will leave the details of the setting alone, so we can focus exclusively on the data_file_directories setting.

Simple drives, simple pleasures

The data_file_directories feature is fairly straight forward in that it allows Cassandra to use multiple directories to store data. To use it just specify the list of directories you want Cassandra to use for data storage. For example.

data_file_directories:
    - /var/lib/cassandra/data
    - /var/lib/cassandra/data-extra

The feature has been around from day one of Cassandra’s life and the way in which Cassandra uses multiple directories has mostly stayed the same. There are no special restrictions to the directories, they can be on the same volume/disk or a different volume/disk. As far as Cassandra is concerned, the paths specified in the setting are just the directories it has available to read and write data.

At a high level, the way the feature works is Cassandra tries to evenly split data into each of the directories specified in the data_file_directories setting. No two directories will ever have an identical SSTable file name in them. Below is an example of what you could expect to see if you inspected each data directory when using this feature. In this example the node is configured to use two directories: …/data0/ and …/data1/

$ ls .../data0/music/playlists-3b90f8a0a50b11e881a5ad31ff0de720/
backups                      mc-5-big-Digest.crc32  mc-5-big-Statistics.db
mc-5-big-CompressionInfo.db  mc-5-big-Filter.db     mc-5-big-Summary.db
mc-5-big-Data.db             mc-5-big-Index.db      mc-5-big-TOC.txt

$ ls .../data1/music/playlists-3b90f8a0a50b11e881a5ad31ff0de720/
backups                      mc-6-big-Digest.crc32  mc-6-big-Statistics.db
mc-6-big-CompressionInfo.db  mc-6-big-Filter.db     mc-6-big-Summary.db
mc-6-big-Data.db             mc-6-big-Index.db      mc-6-big-TOC.txt

Data resurrection

One notable change which modified how Cassandra uses the data_file_directories setting was CASSANDRA-6696. The change was implemented in Cassandra version 3.2. To explain this problem and how it was fixed, consider the case where a node has two data directories A and B. Prior to this change in Cassandra, you could have a node that had data for a specific token in one SSTable that was on disk A. The node could also have a tombstone associated with that token in another SSTable on disk B. If the gc_grace_seconds passed, and no compactions were processed to reclaim the data tombstone there would be an issue if disk B failed. In this case if disk B did fail, the tombstone is lost and the data on disk A is still present! Running a repair in this case would resurrect the data by propagating it to other replicas! To fix this issue, CASSANDRA-6696 changed Cassandra so that a token range was always stored on a single disk.

This change did make Cassandra more robust when using the data_file_directories setting, however this change was no silver bullet and caution still needs to be taken when it is used. Most notably, consider the case where each data directory is mounted to a dedicated disk and the cluster schema results in wide partitions. In this scenario one of the disks could easily reach its maximum capacity due to the wide partitions while the other disk still has plenty of storage capacity.

How to lose a volume and influence a node

For a node running Cassandra version less than 3.2 and using the data_file_directories setting there are a number vulnerabilities to watch out for. If each data directory is mounted to a dedicated disk, and one of the disk dies or the mount disappears then this can silently cause problems. To explain this problem, consider the case where we installed and the data is located in /var/lib/cassandra/data. Say we want to add another directory to store data in only this time the data will be on another volume. It makes sense to have the data directories in the same location, so we create the directory /var/lib/cassandra/data-extra. We then mount our volume so that /var/lib/cassandra/data-extra points to it. If the disk backing /var/lib/cassandra/data-extra died or we forgot to put the mount information in fstab and lose the mount on a restart, then we will effectively lose system table data. Cassandra will start because the directory /var/lib/cassandra/data-extra exists however it will be empty.

Similarly, I have seen cases where a directory was manually added to a node that was managed by chef. In this case the node was running out disk space and there was no time to wait for new node to bootstrap. To avoid a node going down an additional volume was attached, mounted, and the data_file_directories setting in the cassandra.yaml modified to include the new data directory. Some time later chef was executed on the node to deploy an update, and as a result it reset cassandra.yaml configuration. Resetting the cassandra.yaml cleared the additional data directory that was listed under data_file_directories setting. When the node was restarted, the Cassandra process never knew that there was another data directory it had to read from.

Either of these cases can lead to more problems in the cluster. Remember how earlier I mentioned that a complete SSTable file will be always stored in a single data directory when using the data_file_directories setting? This behaviour applies to all data stored by Cassandra including its system data! So that means, in the above two scenarios Cassandra could potentially lose system table data. This is a problem because the system data table stores information about what data the node owns, what the schema is, and whether the node has bootstrapped. If the system table is lost and the node restarted the node will think it is a new node, take on a new identity and new token ranges. This results in a token range movement in the cluster. We have covered this topic in more detail in our auto bootstrapping blog post. This problem gets worse when a seed node loses its system table and comes back as a new node. This is because seed nodes never stream data and if a cleanup is run cluster wide, data is then lost.

Testing the theory

We can test these above scenarios for different versions of Cassandra using ccm. I created the following script to setup a three node cluster in ccm with each node configured to be a seed node and use two data directories. We use seed nodes to show the worst case scenario that can occur when a node using multiple data directories loses one of the directories.

#!/bin/bash

# This script creates a three node CCM cluster to demo the data_file_directories
# feature in different versions of Cassandra.

set -e

CLUSTER_NAME="${1:-TestCluster}"
NUMBER_NODES="3"
CLUSTER_VERSION="${2:-3.0.15}"

echo "Cluster Name: ${CLUSTER_NAME}"
echo "Cluster Version: ${CLUSTER_VERSION}"
echo "Number nodes: ${NUMBER_NODES}"

ccm create ${CLUSTER_NAME} -v ${CLUSTER_VERSION}

# Modifies the configuration of a node in the CCM cluster.
function update_node_config {
  CASSANDRA_YAML_SETTINGS="cluster_name:${CLUSTER_NAME} \
                          num_tokens:32 \
                          endpoint_snitch:GossipingPropertyFileSnitch \
                          seeds:127.0.0.1,127.0.0.2,127.0.0.3"

  for key_value_setting in ${CASSANDRA_YAML_SETTINGS}
  do
    setting_key=$(echo ${key_value_setting} | cut -d':' -f1)
    setting_val=$(echo ${key_value_setting} | cut -d':' -f2)
    sed -ie "s/${setting_key}\:\ .*/${setting_key}:\ ${setting_val}/g" \
      ~/.ccm/${CLUSTER_NAME}/node${1}/conf/cassandra.yaml
  done

  # Create and configure the additional data directory
  extra_data_dir="/Users/anthony/.ccm/${CLUSTER_NAME}/node${1}/data1"

  sed -ie '/data_file_directories:/a\'$'\n'"- ${extra_data_dir}
    " ~/.ccm/${CLUSTER_NAME}/node${1}/conf/cassandra.yaml

  mkdir ${extra_data_dir}

  sed -ie "s/dc=.*/dc=datacenter1/g" \
    ~/.ccm/${CLUSTER_NAME}/node${1}/conf/cassandra-rackdc.properties
  sed -ie "s/rack=.*/rack=rack${1}/g" \
    ~/.ccm/${CLUSTER_NAME}/node${1}/conf/cassandra-rackdc.properties

  # Tune Cassandra memory usage so we can run multiple nodes on the one machine
  sed -ie 's/\#MAX_HEAP_SIZE=\"4G\"/MAX_HEAP_SIZE=\"500M\"/g' \
    ~/.ccm/${CLUSTER_NAME}/node${1}/conf/cassandra-env.sh
  sed -ie 's/\#HEAP_NEWSIZE=\"800M\"/HEAP_NEWSIZE=\"120M\"/g' \
    ~/.ccm/${CLUSTER_NAME}/node${1}/conf/cassandra-env.sh

  # Allow remote access to JMX without authentication. This is for
  # demo purposes only - Never do this in production
  sed -ie 's/LOCAL_JMX=yes/LOCAL_JMX=no/g' \
    ~/.ccm/${CLUSTER_NAME}/node${1}/conf/cassandra-env.sh
  sed -ie 's/com\.sun\.management\.jmxremote\.authenticate=true/com.sun.management.jmxremote.authenticate=false/g' \
    ~/.ccm/${CLUSTER_NAME}/node${1}/conf/cassandra-env.sh
}

for node_num in $(seq ${NUMBER_NODES})
do
  echo "Adding 'node${node_num}'"
  ccm add node${node_num} \
    -i 127.0.0.${node_num} \
    -j 7${node_num}00 \
    -r 0 \
    -b

  update_node_config ${node_num}

  # Set localhost aliases - Mac only
  echo "ifconfig lo0 alias 127.0.0.${node_num} up"
  sudo ifconfig lo0 alias 127.0.0.${node_num} up
done

sed -ie 's/use_vnodes\:\ false/use_vnodes:\ true/g' \
  ~/.ccm/${CLUSTER_NAME}/cluster.conf

I first tested Cassandra version 2.1.20 using the following process.

Run the script and check the nodes were created.

$ ccm status
Cluster: 'mutli-dir-test'
-------------------------
node1: DOWN (Not initialized)
node3: DOWN (Not initialized)
node2: DOWN (Not initialized)

Start the cluster.

$ for i in $(seq 1 3); do echo "Starting node${i}"; ccm node${i} start; sleep 10; done
Starting node1
Starting node2
Starting node3

Check the cluster is up and note the Host IDs.

$  ccm node1 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  127.0.0.1  47.3 KB    32      73.5%             4682088e-4a3c-4fbc-8874-054408121f0a  rack1
UN  127.0.0.2  80.35 KB   32      71.7%             b2411268-f168-485d-9abe-77874eef81ce  rack2
UN  127.0.0.3  64.33 KB   32      54.8%             8b55a1c6-f971-4e01-a34b-bb37dd55bb89  rack3

Insert some test data into the cluster.

$ ccm node1 cqlsh
Connected to TLP-578-2120 at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 2.1.20 | CQL spec 3.2.1 | Native protocol v3]
Use HELP for help.
cqlsh> CREATE KEYSPACE music WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3 };
cqlsh> CREATE TABLE music.playlists (
   ...  id uuid,
   ...  song_order int,
   ...  song_id uuid,
   ...  title text,
   ...  artist text,
   ...  PRIMARY KEY (id, song_id));
cqlsh> INSERT INTO music.playlists (id, song_order, song_id, artist, title)
   ...  VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 1,
   ...  a3e64f8f-bd44-4f28-b8d9-6938726e34d4, 'Of Monsters and Men', 'Little Talks');
cqlsh> INSERT INTO music.playlists (id, song_order, song_id, artist, title)
   ...  VALUES (62c36092-82a1-3a00-93d1-46196ee77205, 2,
   ...  8a172618-b121-4136-bb10-f665cfc469eb, 'Birds of Tokyo', 'Plans');
cqlsh> INSERT INTO music.playlists (id, song_order, song_id, artist, title)
   ...  VALUES (62c36092-82a1-3a00-93d1-46196ee77206, 3,
   ...  2b09185b-fb5a-4734-9b56-49077de9edbf, 'Lorde', 'Royals');
cqlsh> exit

Write the data to disk by running nodetool flush on all the nodes.

$ for i in $(seq 1 3); do echo "Flushing node${i}"; ccm node${i} nodetool flush; done
Flushing node1
Flushing node2
Flushing node3

Check we can retrieve data from each node.

$ for i in $(seq 1 3); do ccm node${i} cqlsh -e "SELECT id, song_order, song_id, artist, title FROM music.playlists"; done

 id          | song_order | song_id     | artist              | title
-------------+------------+-------------+---------------------+--------------
 62c36092... |          2 | 8a172618... |      Birds of Tokyo |        Plans
 62c36092... |          3 | 2b09185b... |               Lorde |       Royals
 62c36092... |          1 | a3e64f8f... | Of Monsters and Men | Little Talks

(3 rows)

  id          | song_order | song_id     | artist              | title
 -------------+------------+-------------+---------------------+--------------
  62c36092... |          2 | 8a172618... |      Birds of Tokyo |        Plans
  62c36092... |          3 | 2b09185b... |               Lorde |       Royals
  62c36092... |          1 | a3e64f8f... | Of Monsters and Men | Little Talks

(3 rows)

  id          | song_order | song_id     | artist              | title
 -------------+------------+-------------+---------------------+--------------
  62c36092... |          2 | 8a172618... |      Birds of Tokyo |        Plans
  62c36092... |          3 | 2b09185b... |               Lorde |       Royals
  62c36092... |          1 | a3e64f8f... | Of Monsters and Men | Little Talks

(3 rows)

Look for a node that has all of the system.local SSTable files in a single directory. In this particular test, there were no SSTable files in data directory data0 of node1.

$ ls .../node1/data0/system/local-7ad54392bcdd35a684174e047860b377/
$
$ ls .../node1/data1/system/local-7ad54392bcdd35a684174e047860b377/
system-local-ka-5-CompressionInfo.db  system-local-ka-5-Summary.db          system-local-ka-6-Index.db
system-local-ka-5-Data.db             system-local-ka-5-TOC.txt             system-local-ka-6-Statistics.db
system-local-ka-5-Digest.sha1         system-local-ka-6-CompressionInfo.db  system-local-ka-6-Summary.db
system-local-ka-5-Filter.db           system-local-ka-6-Data.db             system-local-ka-6-TOC.txt
system-local-ka-5-Index.db            system-local-ka-6-Digest.sha1
system-local-ka-5-Statistics.db       system-local-ka-6-Filter.db

Stop node1 and simulate a disk or volume mount going missing by removing the data1 directory entry from the data_file_directories setting.

$ ccm node1 stop

Before the change the setting entry was:

data_file_directories:
- .../node1/data0
- .../node1/data1

After the change the setting entry was:

data_file_directories:
- .../node1/data0

Start node1 again and check the logs. From the logs we can see the messages where the node has generated a new Host ID and took ownership of new tokens.

WARN  [main] 2018-08-21 12:34:57,111 SystemKeyspace.java:765 - No host ID found, created c62c54bf-0b85-477d-bb06-1f5d696c7fef (Note: This should happen exactly once per node).
INFO  [main] 2018-08-21 12:34:57,241 StorageService.java:844 - This node will not auto bootstrap because it is configured to be a seed node.
INFO  [main] 2018-08-21 12:34:57,259 StorageService.java:959 - Generated random tokens. tokens are [659824738410799181, 501008491586443415, 4528158823720685640, 3784300856834360518, -5831879079690505989, 8070398544415493492, -2664141538712847743, -303308032601096386, -553368999545619698, 5062218903043253310, -8121235567420561418, 935133894667055035, -4956674896797302124, 5310003984496306717, -1155160853876320906, 3649796447443623633, 5380731976542355863, -3266423073206977005, 8935070979529248350, -4101583270850253496, -7026448307529793184, 1728717941810513773, -1920969318367938065, -8219407330606302354, -795338012034994277, -374574523137341910, 4551450772185963221, -1628731017981278455, -7164926827237876166, -5127513414993962202, -4267906379878550578, -619944134428784565]

Check the cluster status again. From the output we can see that the Host ID for node1 changed from 4682088e-4a3c-4fbc-8874-054408121f0a to c62c54bf-0b85-477d-bb06-1f5d696c7fef

$ ccm node2 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  127.0.0.1  89.87 KB   32      100.0%            c62c54bf-0b85-477d-bb06-1f5d696c7fef  rack1
UN  127.0.0.2  88.69 KB   32      100.0%            b2411268-f168-485d-9abe-77874eef81ce  rack2
UN  127.0.0.3  106.66 KB  32      100.0%            8b55a1c6-f971-4e01-a34b-bb37dd55bb89  rack3

Check we can retrieve data from each node again.

for i in $(seq 1 3); do ccm node${i} cqlsh -e "SELECT id, song_order, song_id, artist, title FROM music.playlists"; done

Cassandra 2.1.20 Results

When we run the above test against a cluster using Apache Cassandra version 2.1.20 and remove the additional data directory data1 from node1, we can see that our cql statement fails when retrieving data from node1. The error produced shows that the song_order column is unknown to the node.

$ for i in $(seq 1 3); do ccm node${i} cqlsh -e "SELECT id, song_order, song_id, artist, title FROM music.playlists"; done

<stdin>:1:InvalidRequest: code=2200 [Invalid query] message="Undefined name song_order in selection clause"

  id          | song_order | song_id     | artist              | title
 -------------+------------+-------------+---------------------+--------------
  62c36092... |          2 | 8a172618... |      Birds of Tokyo |        Plans
  62c36092... |          3 | 2b09185b... |               Lorde |       Royals
  62c36092... |          1 | a3e64f8f... | Of Monsters and Men | Little Talks

(3 rows)

  id          | song_order | song_id     | artist              | title
 -------------+------------+-------------+---------------------+--------------
  62c36092... |          2 | 8a172618... |      Birds of Tokyo |        Plans
  62c36092... |          3 | 2b09185b... |               Lorde |       Royals
  62c36092... |          1 | a3e64f8f... | Of Monsters and Men | Little Talks

(3 rows)

An interesting side note, if nodetool drain is run node1 before it is shut down then the above error never occurs. Instead the following output appears when we run our cql statement to retrieve data from the nodes. As we can see below the query that failed now returns no rows of data.

$ for i in $(seq 1 3); do ccm node${i} cqlsh -e "SELECT id, song_order, song_id, artist, title FROM music.playlists"; done

 id | song_order | song_id | artist | title
----+------------+---------+--------+-------

(0 rows)

 id          | song_order | song_id     | artist              | title
-------------+------------+-------------+---------------------+--------------
 62c36092... |          2 | 8a172618... |      Birds of Tokyo |        Plans
 62c36092... |          3 | 2b09185b... |               Lorde |       Royals
 62c36092... |          1 | a3e64f8f... | Of Monsters and Men | Little Talks

(3 rows)

 id          | song_order | song_id     | artist              | title
-------------+------------+-------------+---------------------+--------------
 62c36092... |          2 | 8a172618... |      Birds of Tokyo |        Plans
 62c36092... |          3 | 2b09185b... |               Lorde |       Royals
 62c36092... |          1 | a3e64f8f... | Of Monsters and Men | Little Talks

(3 rows)

Cassandra 2.2.13 Results

When we run the above test against a cluster using Apache Cassandra version 2.1.20 and remove the additional data directory data1 from node1, we can see that the cql statement fails retrieving data from node1. The error produced is similar to that produced in version 2.1.20 where id column name is unknown.

$ for i in $(seq 1 3); do ccm node${i} cqlsh -e "SELECT id, song_order, song_id, artist, title FROM music.playlists"; done

<stdin>:1:InvalidRequest: Error from server: code=2200 [Invalid query] message="Undefined name id in selection clause"

  id          | song_order | song_id     | artist              | title
 -------------+------------+-------------+---------------------+--------------
  62c36092... |          2 | 8a172618... |      Birds of Tokyo |        Plans
  62c36092... |          3 | 2b09185b... |               Lorde |       Royals
  62c36092... |          1 | a3e64f8f... | Of Monsters and Men | Little Talks

(3 rows)

  id          | song_order | song_id     | artist              | title
 -------------+------------+-------------+---------------------+--------------
  62c36092... |          2 | 8a172618... |      Birds of Tokyo |        Plans
  62c36092... |          3 | 2b09185b... |               Lorde |       Royals
  62c36092... |          1 | a3e64f8f... | Of Monsters and Men | Little Talks

(3 rows)

Unlike Cassandra version 2.1.20, node1 never generated a new Host ID or calculated new tokens. This is because it replayed the commitlog and recovered most of the writes that had gone missing.

INFO  [main] ... CommitLog.java:160 - Replaying .../node1/commitlogs/CommitLog-5-1534865605274.log, .../node1/commitlogs/CommitLog-5-1534865605275.log
WARN  [main] ... CommitLogReplayer.java:149 - Skipped 1 mutations from unknown (probably removed) CF with id 5bc52802-de25-35ed-aeab-188eecebb090
...
INFO  [main] ... StorageService.java:900 - Using saved tokens [-1986809544993962272, -2017257854152001541, -2774742649301489556, -5900361272205350008, -5936695922885734332, -6173514731003460783, -617557464401852062, -6189389450302492227, -6817507707445347788, -70447736800638133, -7273401985294399499, -728761291814198629, -7345403624129882802, -7886058735316403116, -8499251126507277693, -8617790371363874293, -9121351096630699623, 1551379122095324544, 1690042196927667551, 2403633816924000878, 337128813788730861, 3467690847534201577, 419697483451380975, 4497811278884749943, 4783163087653371572, 5213928983621160828, 5337698449614992094, 5502889505586834056, 6549477164138282393, 7486747913914976739, 8078241138082605830, 8729237452859546461]

Cassandra 3.0.15 Results

When we run the above test against a cluster using Apache Cassandra version 3.0.15 and remove the additional data directory data1 from node1, we can see that the cql statement returns no data from node1.

$ for i in $(seq 1 3); do ccm node${i} cqlsh -e "SELECT id, song_order, song_id, artist, title FROM music.playlists"; done

 id | song_order | song_id | artist | title
----+------------+---------+--------+-------

(0 rows)

 id          | song_order | song_id     | artist              | title
-------------+------------+-------------+---------------------+--------------
 62c36092... |          2 | 8a172618... |      Birds of Tokyo |        Plans
 62c36092... |          3 | 2b09185b... |               Lorde |       Royals
 62c36092... |          1 | a3e64f8f... | Of Monsters and Men | Little Talks

(3 rows)

 id          | song_order | song_id     | artist              | title
-------------+------------+-------------+---------------------+--------------
 62c36092... |          2 | 8a172618... |      Birds of Tokyo |        Plans
 62c36092... |          3 | 2b09185b... |               Lorde |       Royals
 62c36092... |          1 | a3e64f8f... | Of Monsters and Men | Little Talks

(3 rows)

Cassandra 3.11.3 Results

When we run the above test against a cluster using Apache Cassandra version 3.11.3 and remove the additional data directory data1 from node1, the node fails to start and we can see the following error message in the logs.

ERROR [main] 2018-08-21 16:30:53,489 CassandraDaemon.java:708 - Exception encountered during startup
java.lang.RuntimeException: A node with address /127.0.0.1 already exists, cancelling join. Use cassandra.replace_address if you want to replace this node.
    at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:558) ~[apache-cassandra-3.11.3.jar:3.11.3]
    at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:804) ~[apache-cassandra-3.11.3.jar:3.11.3]
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:664) ~[apache-cassandra-3.11.3.jar:3.11.3]
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:613) ~[apache-cassandra-3.11.3.jar:3.11.3]
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:379) [apache-cassandra-3.11.3.jar:3.11.3]
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:602) [apache-cassandra-3.11.3.jar:3.11.3]
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:691) [apache-cassandra-3.11.3.jar:3.11.3]

In this case, the cluster reports node1 as down and still shows its original Host ID.

$ ccm node2 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
DN  127.0.0.1  191.96 KiB  32      100.0%            35a3c8ff-fa20-4f10-81cd-7284caeb00bd  rack1
UN  127.0.0.2  191.82 KiB  32      100.0%            2ebe4f0b-dc8f-4f46-93cd-37c410174a49  rack2
UN  127.0.0.3  170.46 KiB  32      100.0%            0384793e-7f59-40aa-a487-97f410dded4b  rack3

After inspecting the SSTables in both data directories we can see that a new Host ID 7d910c98-f69b-41b4-988a-f432b2e54b38 has been assigned to the node even though it failed to start.

$ ./tools/bin/sstabledump .../node1/data0/system/local-7ad54392bcdd35a684174e047860b377/mc-12-big-Data.db | grep host_id
      { "name" : "host_id", "value" : "35a3c8ff-fa20-4f10-81cd-7284caeb00bd", "tstamp" : "2018-08-21T06:24:08.106Z" },


$ ./tools/bin/sstabledump .../node1/data1/system/local-7ad54392bcdd35a684174e047860b377/mc-10-big-Data.db | grep host_id
      { "name" : "host_id", "value" : "7d910c98-f69b-41b4-988a-f432b2e54b38" },

Take away messages

As we have seen from testing there are potential dangers with using multiple directories in Cassandra. By simply removing one of the data directories in the setting a node can become a brand new node and affect the rest of the cluster. The JBOD feature can be useful in emergencies where disk space is urgently needed, however its usage in this case should be temporary.

The use of multiple disks in a Cassandra node, I feel is better done at the OS or hardware layer. Systems like LVM and RAID were designed to allow multiple disks to be used together to make up a volume. Using something like LVM or RAID rather than Cassandra’s JBOD feature reduces the complexity of the Cassandra configuration and the number of moving parts on the Cassandra side that can go wrong. By using the JBOD feature in Cassandra, it subtlety increases operational complexity and reduces the nodes ability to fail fast. In most cases I feel it is more useful for a node to fail out right rather than limp on and potentially impact the cluster in a negative way.

As a final thought, I think one handy feature that could be added to Apache Cassandra to help prevent issues associated with JBOD is the ability to check if the data, commitlog, saved_caches and hints are all empty prior to bootstrapping. If they are empty, then the node proceeds as normal. If they contain data, then perhaps the node could fail to start and print an error message in the logs.

cassandra operational
blog comments powered by Disqus