How To Set Up A Cluster With Even Token Distribution
Apache Cassandra is fantastic for storing large amounts of data and being flexible enough to scale out as the data grows. This is all fun and games until the data distributed in the cluster becomes unbalanced. In this post we will go through how to set up a cluster with predictive token allocation using the allocate_tokens_for_keyspace setting, which will help to evenly distribute the data as it grows.
Unbalanced clusters are bad mkay
An unbalanced load on a cluster means that some nodes will contain more data than others. An unbalanced cluster can be caused by the following:
- Hot spots - by random chance one node ends up responsible for a higher percentage of the token space than the other nodes in the cluster.
- Wide rows - due to data modelling issues, for example a partition that grows significantly larger than the other partitions in the table.
The above issues can have a number of impacts on individual nodes in the cluster, however this is a completely different topic and requires a more detailed post. In summary though, a node that contains disproportionately more tokens and/or data than other nodes in the cluster may experience one or more of the following issues:
- Run out of storage more quickly than the other nodes.
- Serve more requests than the other nodes.
- Suffer from higher read and write latencies than the other nodes.
- Take longer to run repairs than the other nodes.
- Take longer to run compactions than the other nodes.
- Take longer to replace than the other nodes if it fails.
What about vnodes, don’t they help?
Both issues that cause data imbalance in the cluster (hot spots, wide rows) can be prevented by manual control. That is, specify the tokens using the initial_token setting in the cassandra.yaml file for each node and ensure your data model evenly distributes data across the cluster. The second control measure (data modelling) is something we always need to do when adding data to Cassandra. The first point however, defining the tokens manually, is cumbersome to do when maintaining a cluster, especially when growing or shrinking it. As a result, token management was automated early on in Cassandra (version 1.2 - CASSANDRA-4119) through the introduction of Virtual Nodes (vnodes).
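For illustration, a minimal sketch of what the manual calculation looks like for the default Murmur3Partitioner, which distributes tokens over the range -2^63 to 2^63 - 1 (the six-node cluster size here is just an example):

# Minimal sketch: one evenly spaced token per node for an N-node cluster
# using the Murmur3Partitioner token range [-2**63, 2**63 - 1].
num_nodes = 6  # example cluster size

for i in range(num_nodes):
    token = i * (2**64 // num_nodes) - 2**63
    print('[Node {}] initial_token: {}'.format(i + 1, token))

Every time the cluster grows or shrinks, tokens like these need to be recalculated and data moved around accordingly, which is exactly the maintenance burden vnodes were introduced to remove.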
Vnodes break up the available range of tokens into smaller ranges, defined by the num_tokens setting in the cassandra.yaml file. The vnode ranges are randomly distributed across the cluster and are generally non-contiguous. If we use a large number for num_tokens to break up the token ranges, the random distribution means it is less likely that we will have hot spots. Statistical computation showed that 256 vnodes gave a good token range balance for clusters of practically any size. Hence, the num_tokens default value of 256 was recommended by the community to prevent hot spots in a cluster. The problem here is that the performance of operations requiring token-range scans (e.g. repairs, Spark operations) will tank big time. It can also cause problems with bootstrapping due to the large number of SSTables generated. Furthermore, as Joseph Lynch and Josh Snyder pointed out in a paper they wrote, the higher the value of num_tokens in large clusters, the higher the risk of data unavailability.
Token allocation gets smart
This paints a pretty grim picture of vnodes, and as far as operators are concerned, they are caught between a rock and a hard place when selecting a value for num_tokens. That was until Cassandra version 3.0 was released, which brought with it a more intelligent token allocation algorithm thanks to CASSANDRA-7032. Using a ranking system, the algorithm takes the replication factor of a keyspace, the number of tokens, and the partitioner as inputs, and derives token ranges that are evenly distributed across the nodes in the cluster.
The algorithm is configured by settings in the cassandra.yaml configuration file. Prior to this algorithm being added, the configuration file already contained the settings for the number of tokens and the partitioner; the missing piece was a way to specify a keyspace. Hence, when the algorithm was added, the allocate_tokens_for_keyspace setting was introduced into the configuration file. The setting allows a keyspace name to be specified so that, during the bootstrap of a node, we query the keyspace for its replication factor and pass that to the token allocation algorithm.
However, therein lies the problem: for existing clusters updating this setting is easy, as a keyspace already exists, but for a cluster starting from scratch we have a chicken-and-egg situation. How do we specify a keyspace that doesn't exist!? And there are other caveats, too…
- It works for only a single replication factor. As long as all the other keyspaces use the same replication factor as the one specified for allocate_tokens_for_keyspace, all is fine. However, keyspaces with a different replication factor can potentially cause hot spots.
- It works only when nodes are added to the cluster. The process for token distribution when a node is removed from the cluster remains unchanged, and hence can still cause hot spots.
- It works with only the default partitioner, Murmur3Partitioner.
Additionally, this is no silver bullet for all unbalanced clusters; we still need to make sure we have a data model that evenly distributes data across partitions. Wide partitions can still be an issue and no amount of token shuffling will fix this.
Despite these drawbacks, this feature gives us the ability to allocate tokens in a more predictable way whilst leveraging the advantage of vnodes. This means we can specify a small value for vnodes (e.g. 4) and still be able to avoid hot spots. The question then becomes, in the case of starting a brand new cluster from scratch, which comes first: the chicken or the egg?
One does not simply start a cluster… with evenly distributed tokens
While it might be possible to rectify an unbalanced cluster caused by unfortunate token allocations, it is better for the token allocation to be set up correctly when the cluster is created. To set up a brand new cluster that takes advantage of the allocate_tokens_for_keyspace setting we need to use the following steps. The method below takes into account a cluster with nodes spread across multiple racks. The examples used in each step assume that our cluster will be configured as follows:
- 4 vnodes (num_tokens = 4).
- 3 racks with a single seed node in each rack.
- A replication factor of 3, i.e. one replica per rack.
I have chosen 4 vnodes for demonstration purposes. Specifically it makes:
- The uneven token distribution in a small cluster very obvious.
- Identifying the endpoints displayed in nodetool ring easy.
- The initial_token setup less verbose and easier to follow.
This blog post in no way recommends 4 as the default number of vnodes for new clusters. The value of vnodes (num_tokens) should be chosen according to the initial target size of your cluster.
1. Calculate and set tokens for the seed node in each rack
We will need to set the tokens for the seed node in each rack manually. This is to prevent each node from randomly calculating its own token ranges. We can calculate the token ranges that we will use for the initial_token setting using the following Python code:
$ python
Python 2.7.13 (default, Dec 18 2016, 07:03:39)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> num_tokens = 4
>>> num_racks = 3
>>> print "\n".join(['[Node {}] initial_token: {}'.format(r + 1, ','.join([str(((2**64 / (num_tokens * num_racks)) * (t * num_racks + r)) - 2**63) for t in range(num_tokens)])) for r in range(num_racks)])
[Node 1] initial_token: -9223372036854775808,-4611686018427387905,-2,4611686018427387901
[Node 2] initial_token: -7686143364045646507,-3074457345618258604,1537228672809129299,6148914691236517202
[Node 3] initial_token: -6148914691236517206,-1537228672809129303,3074457345618258600,7686143364045646503
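For readability, an equivalent Python 3 sketch of the same calculation: it divides the Murmur3 token range into num_tokens * num_racks evenly spaced tokens and interleaves them across the racks, so that the seed node in each rack gets every third token:

# Python 3 equivalent of the one-liner above: evenly spaced tokens over the
# Murmur3 range [-2**63, 2**63 - 1], interleaved across the racks.
num_tokens = 4
num_racks = 3
total_slots = num_tokens * num_racks

for rack in range(num_racks):
    tokens = [
        (2**64 // total_slots) * (t * num_racks + rack) - 2**63
        for t in range(num_tokens)
    ]
    print('[Node {}] initial_token: {}'.format(
        rack + 1, ','.join(str(token) for token in tokens)))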
We can then uncomment the initial_token setting in the cassandra.yaml file on each of the seed nodes, set it to the value generated by our Python command, and set the num_tokens setting to the number of vnodes. When a node first starts, the value of the initial_token setting will be used; subsequent restarts will use the num_tokens setting.
Note that we need to manually calculate and specify the initial tokens for only the seed node in each rack. All other nodes will be configured differently.
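For example, using the tokens generated for Node 1, the relevant settings in the cassandra.yaml file of the seed node in rack1 would look like this:

num_tokens: 4
...
initial_token: -9223372036854775808,-4611686018427387905,-2,4611686018427387901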
2. Start the seed node in each rack
We can start the seed nodes one at a time using the following command:
$ sudo service cassandra start
When we watch the logs, we should see messages similar to the following appear:
...
INFO [main] ... - This node will not auto bootstrap because it is configured to be a seed node.
INFO [main] ... - tokens manually specified as [-9223372036854775808,-4611686018427387905,-2,4611686018427387901]
...
After starting the first of the seed nodes, we can use nodetool status to verify that 4 tokens are being used:
$ nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 172.31.36.11 99 KiB 4 100.0% 5d7e200d-ba1a-4297-a423-33737302e4d5 rack1
We will wait for this message to appear in the logs, then start the next seed node in the cluster.
INFO [main] ... - Starting listening for CQL clients on ...
Once all seed nodes in the cluster are up, we can use nodetool ring to verify the token assignments in the cluster. It should look something like this:
$ nodetool ring
Datacenter: dc1
==========
Address Rack Status State Load Owns Token
7686143364045646503
172.31.36.11 rack1 Up Normal 65.26 KiB 66.67% -9223372036854775808
172.31.36.118 rack2 Up Normal 65.28 KiB 66.67% -7686143364045646507
172.31.43.239 rack3 Up Normal 99.03 KiB 66.67% -6148914691236517206
172.31.36.11 rack1 Up Normal 65.26 KiB 66.67% -4611686018427387905
172.31.36.118 rack2 Up Normal 65.28 KiB 66.67% -3074457345618258604
172.31.43.239 rack3 Up Normal 99.03 KiB 66.67% -1537228672809129303
172.31.36.11 rack1 Up Normal 65.26 KiB 66.67% -2
172.31.36.118 rack2 Up Normal 65.28 KiB 66.67% 1537228672809129299
172.31.43.239 rack3 Up Normal 99.03 KiB 66.67% 3074457345618258600
172.31.36.11 rack1 Up Normal 65.26 KiB 66.67% 4611686018427387901
172.31.36.118 rack2 Up Normal 65.28 KiB 66.67% 6148914691236517202
172.31.43.239 rack3 Up Normal 99.03 KiB 66.67% 7686143364045646503
We can then move to the next step.
3. Create only the keyspace for the cluster
On any one of the seed nodes we will use cqlsh to create the cluster keyspace using the following commands:
$ cqlsh NODE_IP_ADDRESS -u ***** -p *****
Connected to ...
[cqlsh 5.0.1 | Cassandra 3.11.3 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cassandra@cqlsh>
cassandra@cqlsh> CREATE KEYSPACE keyspace_with_replication_factor_3
WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}
AND durable_writes = true;
Note that this keyspace can have any name; it can even be the keyspace that will contain the tables for our data.
4. Set the number of tokens and the keyspace for all remaining nodes
We will set the num_tokens and allocate_tokens_for_keyspace settings in the cassandra.yaml file on all of the remaining nodes as follows:
num_tokens: 4
...
allocate_tokens_for_keyspace: keyspace_with_replication_factor_3
We have assigned the allocate_tokens_for_keyspace value to be the name of the keyspace created in the previous step. Note that at this point the Cassandra service on all other nodes is still down.
5. Start the remaining nodes in the cluster, one at a time
We can start the remaining nodes in the cluster using the following command:
$ sudo service cassandra start
When we watch the logs, we should see messages similar to the following appear, indicating that the new token allocation algorithm is being used:
INFO [main] ... - JOINING: waiting for ring information
...
INFO [main] ... - Using ReplicationAwareTokenAllocator.
WARN [main] ... - Selected tokens [...]
...
INFO ... - JOINING: Finish joining ring
As per step 2 when we started the seed nodes, we will wait for this message to appear in the logs before starting the next node in the cluster.
INFO [main] ... - Starting listening for CQL clients on ...
Once all the nodes are up, our shiny, new, evenly-distributed-tokens cluster is ready to go!
Proof is in the token allocation
While we can learn a fair bit from talking about the theory behind the allocate_tokens_for_keyspace setting, it is still good to put it to the test and see what difference it makes when used in a cluster. I decided to create two clusters running Apache Cassandra 3.11.3 and compare the load distribution after inserting some data. For this test, I provisioned both clusters with 9 nodes using tlp-cluster and generated load using tlp-stress. Both clusters used 4 vnodes, but one of the clusters was set up using the even token distribution method described above.
Cluster using random token allocation
I started with a cluster that uses the traditional random token allocation system. For this cluster I set num_tokens: 4 and endpoint_snitch: GossipingPropertyFileSnitch in the cassandra.yaml on all the nodes. Nodes were split across three racks by specifying the rack in the cassandra-rackdc.properties file.
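For reference, the cassandra-rackdc.properties file on a node in rack1 would contain entries along these lines (the datacenter name is assumed to match the dc1 reported by nodetool):

dc=dc1
rack=rack1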
Once the cluster instances were up and Cassandra was installed, I started each node one at a time. After all nodes were started, the cluster looked like this:
ubuntu@ip-172-31-39-54:~$ nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 172.31.36.95 65.29 KiB 4 16.1% 4ada2c52-0d1b-45cd-93ed-185c92038b39 rack1
UN 172.31.39.79 65.29 KiB 4 20.4% c282ef62-430e-4c40-a1d2-47e54c5c8685 rack2
UN 172.31.47.155 65.29 KiB 4 21.2% 48d865d7-0ad0-4272-b3c1-297dce306a34 rack1
UN 172.31.43.170 87.7 KiB 4 24.5% 27aa2c78-955c-4ea6-9ea0-3f70062655d9 rack1
UN 172.31.39.54 65.29 KiB 4 30.8% bd2d745f-d170-4fbf-bf9c-be95259597e3 rack3
UN 172.31.35.165 70.36 KiB 4 25.5% 056e2472-c93d-4275-a334-e82f87c4b53a rack3
UN 172.31.35.149 70.37 KiB 4 24.8% 06b0e1e4-5e73-46cb-bf13-626eb6ce73b3 rack2
UN 172.31.35.33 65.29 KiB 4 23.8% 137602f0-3248-459f-b07c-c0b3e647fa48 rack2
UN 172.31.37.129 99.03 KiB 4 12.9% cd92c974-b32e-4181-9e14-fb52dd27b09e rack3
I ran tlp-stress against the cluster using the command below. This generated a write-only load that randomly inserted 10 million unique key-value pairs into the cluster. tlp-stress inserted the data into a newly created keyspace and table called tlp_stress.keyvalue.
tlp-stress run KeyValue --replication "{'class':'NetworkTopologyStrategy','dc1':3}" --cl LOCAL_QUORUM --partitions 10M --iterations 100M --reads 0 --host 172.31.43.170
After running tlp-stress, the cluster load distribution for the tlp_stress keyspace looked like this:
ubuntu@ip-172-31-39-54:~$ nodetool status tlp_stress
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 172.31.36.95 1.29 GiB 4 20.8% 4ada2c52-0d1b-45cd-93ed-185c92038b39 rack1
UN 172.31.39.79 2.48 GiB 4 39.1% c282ef62-430e-4c40-a1d2-47e54c5c8685 rack2
UN 172.31.47.155 1.82 GiB 4 35.1% 48d865d7-0ad0-4272-b3c1-297dce306a34 rack1
UN 172.31.43.170 3.45 GiB 4 44.1% 27aa2c78-955c-4ea6-9ea0-3f70062655d9 rack1
UN 172.31.39.54 2.16 GiB 4 54.3% bd2d745f-d170-4fbf-bf9c-be95259597e3 rack3
UN 172.31.35.165 1.71 GiB 4 29.1% 056e2472-c93d-4275-a334-e82f87c4b53a rack3
UN 172.31.35.149 1.14 GiB 4 26.2% 06b0e1e4-5e73-46cb-bf13-626eb6ce73b3 rack2
UN 172.31.35.33 2.61 GiB 4 34.7% 137602f0-3248-459f-b07c-c0b3e647fa48 rack2
UN 172.31.37.129 562.15 MiB 4 16.6% cd92c974-b32e-4181-9e14-fb52dd27b09e rack3
I verified the data load distribution by checking the disk usage on all nodes using pssh (parallel ssh).
ubuntu@ip-172-31-39-54:~$ pssh -ivl ... -h hosts.txt "du -sh /var/lib/cassandra/data"
[1] ... [SUCCESS] 172.31.35.149
1.2G /var/lib/cassandra/data
[2] ... [SUCCESS] 172.31.43.170
3.5G /var/lib/cassandra/data
[3] ... [SUCCESS] 172.31.36.95
1.3G /var/lib/cassandra/data
[4] ... [SUCCESS] 172.31.39.79
2.5G /var/lib/cassandra/data
[5] ... [SUCCESS] 172.31.35.33
2.7G /var/lib/cassandra/data
[6] ... [SUCCESS] 172.31.35.165
1.8G /var/lib/cassandra/data
[7] ... [SUCCESS] 172.31.37.129
564M /var/lib/cassandra/data
[8] ... [SUCCESS] 172.31.39.54
2.2G /var/lib/cassandra/data
[9] ... [SUCCESS] 172.31.47.155
1.9G /var/lib/cassandra/data
As we can see from the above results, there was a large variation in the load distribution across nodes. Node 172.31.37.129 held the smallest amount of data (roughly 560 MB), whilst node 172.31.43.170 held six times that amount (roughly 3.5 GB). Effectively, the difference between the smallest and largest data load is roughly 3.0 GB!!
Cluster using predictive token allocation
I then moved on to setting up the cluster with predictive token allocation. Similar to the previous cluster, I set num_tokens: 4 and endpoint_snitch: GossipingPropertyFileSnitch in the cassandra.yaml on all the nodes. These settings were common to all nodes in this cluster. Nodes were again split across three racks by specifying the rack in the cassandra-rackdc.properties file.
I set the initial_token setting for each of the seed nodes and started the Cassandra process on them one at a time. One seed node was allocated to each rack in the cluster.
The initial keyspace that would be specified in the allocate_tokens_for_keyspace setting was created via cqlsh using the following command:
CREATE KEYSPACE keyspace_with_replication_factor_3 WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '3'} AND durable_writes = true;
I then set allocate_tokens_for_keyspace: keyspace_with_replication_factor_3 in the cassandra.yaml file for the remaining non-seed nodes and started the Cassandra process on them one at a time. After all nodes were started, the cluster looked like this:
ubuntu@ip-172-31-36-11:~$ nodetool status keyspace_with_replication_factor_3
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 172.31.36.47 65.4 KiB 4 32.3% 5ece457c-a3af-4173-b9c8-e937f8b63d3b rack2
UN 172.31.43.239 117.45 KiB 4 33.3% 55a94591-ad6a-48a5-b2d7-eee7ea06912b rack3
UN 172.31.37.44 70.49 KiB 4 33.3% 93054390-bc83-487c-8940-b99e7b85e5c2 rack3
UN 172.31.36.11 104.3 KiB 4 35.4% 5d7e200d-ba1a-4297-a423-33737302e4d5 rack1
UN 172.31.39.186 65.41 KiB 4 31.2% ecd00ff5-a90a-4d33-b7ab-bdd22e3e50b8 rack1
UN 172.31.38.137 65.39 KiB 4 33.3% 64802174-885a-4c04-b530-a9b4685b1b96 rack1
UN 172.31.40.56 65.39 KiB 4 33.3% 0846effa-e4ac-4a19-845e-2162cd2b7680 rack3
UN 172.31.36.118 104.32 KiB 4 35.4% 5ad47bc0-9bcc-4fc5-b5b0-0a15ad63345f rack2
UN 172.31.41.196 65.4 KiB 4 32.3% 4128ca20-b4fa-4173-88b2-aac62539a6d8 rack2
I ran tlp-stress against the cluster using the same command that was used to test the cluster with random token allocation. After running tlp-stress, the cluster load distribution for the tlp_stress keyspace looked like this:
ubuntu@ip-172-31-36-11:~$ nodetool status tlp_stress
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 172.31.36.47 2.16 GiB 4 32.3% 5ece457c-a3af-4173-b9c8-e937f8b63d3b rack2
UN 172.31.43.239 2.32 GiB 4 33.3% 55a94591-ad6a-48a5-b2d7-eee7ea06912b rack3
UN 172.31.37.44 2.32 GiB 4 33.3% 93054390-bc83-487c-8940-b99e7b85e5c2 rack3
UN 172.31.36.11 1.84 GiB 4 35.4% 5d7e200d-ba1a-4297-a423-33737302e4d5 rack1
UN 172.31.39.186 2.01 GiB 4 31.2% ecd00ff5-a90a-4d33-b7ab-bdd22e3e50b8 rack1
UN 172.31.38.137 2.32 GiB 4 33.3% 64802174-885a-4c04-b530-a9b4685b1b96 rack1
UN 172.31.40.56 2.32 GiB 4 33.3% 0846effa-e4ac-4a19-845e-2162cd2b7680 rack3
UN 172.31.36.118 1.83 GiB 4 35.4% 5ad47bc0-9bcc-4fc5-b5b0-0a15ad63345f rack2
UN 172.31.41.196 2.16 GiB 4 32.3% 4128ca20-b4fa-4173-88b2-aac62539a6d8 rack2
I again verified the data load distribution by checking the disk usage on all nodes using pssh.
ubuntu@ip-172-31-36-11:~$ pssh -ivl ... -h hosts.txt "du -sh /var/lib/cassandra/data"
[1] ... [SUCCESS] 172.31.36.11
1.9G /var/lib/cassandra/data
[2] ... [SUCCESS] 172.31.43.239
2.4G /var/lib/cassandra/data
[3] ... [SUCCESS] 172.31.36.118
1.9G /var/lib/cassandra/data
[4] ... [SUCCESS] 172.31.37.44
2.4G /var/lib/cassandra/data
[5] ... [SUCCESS] 172.31.38.137
2.4G /var/lib/cassandra/data
[6] ... [SUCCESS] 172.31.36.47
2.2G /var/lib/cassandra/data
[7] ... [SUCCESS] 172.31.39.186
2.1G /var/lib/cassandra/data
[8] ... [SUCCESS] 172.31.40.56
2.4G /var/lib/cassandra/data
[9] ... [SUCCESS] 172.31.41.196
2.2G /var/lib/cassandra/data
As we can see from the above results, there was little variation in the load distribution across nodes compared to the cluster that used random token allocation. Node 172.31.36.118 held the smallest amount of data (roughly 1.83 GB) and nodes 172.31.43.239, 172.31.37.44, 172.31.38.137, and 172.31.40.56 held the largest amount of data (roughly 2.32 GB each). The difference between the smallest and largest data load is therefore roughly 0.5 GB, which is significantly less than the 3.0 GB difference seen in the cluster that used random token allocation.
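As a quick sanity check, a small Python 3 sketch that computes the spread of the per-node loads reported by nodetool status for the tlp_stress keyspace in each cluster, using the GiB figures from the outputs above:

# Per-node loads (GiB) for the tlp_stress keyspace, copied from the
# nodetool status outputs above (562.15 MiB is roughly 0.55 GiB).
random_alloc = [1.29, 2.48, 1.82, 3.45, 2.16, 1.71, 1.14, 2.61, 0.55]
predictive_alloc = [2.16, 2.32, 2.32, 1.84, 2.01, 2.32, 2.32, 1.83, 2.16]

for name, loads in [('random', random_alloc), ('predictive', predictive_alloc)]:
    spread = max(loads) - min(loads)
    print('{:>10}: min={:.2f} GiB, max={:.2f} GiB, spread={:.2f} GiB'.format(
        name, min(loads), max(loads), spread))

The spread drops from roughly 2.9 GiB with random token allocation to roughly 0.5 GiB with predictive token allocation.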
Conclusion
Having a perfectly balanced cluster takes a bit of work and planning. While there are some steps to set up and caveats to using the allocate_tokens_for_keyspace setting, predictive token allocation is a definite must-use when setting up a new cluster. As we have seen from testing, it allows us to take advantage of num_tokens being set to a low value without having to worry about hot spots developing in the cluster.