Anthony Grasso

How To Set Up A Cluster With Even Token Distribution

21 Feb 2019

Apache Cassandra is fantastic for storing large amounts of data and being flexible enough to scale out as the data grows. This is all fun and games until the data that is distributed in the cluster becomes unbalanced. In this post we will go through how to set up a cluster with predictive token allocation using the allocate_tokens_for_keyspace setting, which will help to evenly distribute the data as it grows.

Unbalanced clusters are bad mkay

An unbalanced load on a cluster means that some nodes will contain more data than others. An unbalanced cluster can be caused by the following:

Hot spots - by random chance one node ends up responsible for a higher percentage of the token space than the other nodes in the cluster.
Wide rows - due to data modelling issues, for example a partition row which grows significantly larger than the other rows in the data.

The above issues can have a number of impacts on individual nodes in the cluster, however this is a completely different topic and requires a more detailed post. In summary though, a node that contains disproportionately more tokens and/or data than other nodes in the cluster may experience one or more of the following issues:

Run out storage more quickly than the other nodes.
Serve more requests than the other nodes.
Suffer from higher read and write latencies than the other nodes.
Time to run repairs is longer than other nodes.
Time to run compactions is longer than other nodes.
Time to replace the node if it fails is longer than other nodes.

What about vnodes, don’t they help?

Both issues that cause data imbalance in the cluster (hot spots, wide rows) can be prevented by manual control. That is, specify the tokens using the initial_token setting in the casandra.yaml file for each node and ensure your data model evenly distributes data across the cluster. The second control measure (data modelling) is something we always need to do when adding data to Cassandra. The first point however, defining the tokens manually, is cumbersome to do when maintaining a cluster, especially when growing or shrinking it. As a result, token management was automated early on in Cassandra (version 1.2 - CASSANDRA-4119) through the introduction of Virtual Nodes (vnodes).

Vnodes break up the available range of tokens into smaller ranges, defined by the num_tokens setting in the cassandra.yaml file. The vnode ranges are randomly distributed across the cluster and are generally non-contiguous. If we use a large number for num_tokens to break up the token ranges, the random distribution means it is less likely that we will have hot spots. Using statistical computation, the point where all clusters of any size always had a good token range balance was when 256 vnodes were used. Hence, the num_tokens default value of 256 was the recommended by the community to prevent hot spots in a cluster. The problem here is that the performance for operations requiring token-range scans (e.g. repairs, Spark operations) will tank big time. It can also cause problems with bootstrapping due to large numbers of SSTables generated. Furthermore, as Joseph Lynch and Josh Snyder pointed out in a paper they wrote, the higher the value of num_tokens in large clusters, the higher the risk of data unavailability .

Token allocation gets smart

This paints a pretty grim picture of vnodes, and as far as operators are concerned, they are caught between a rock and hard place when selecting a value for num_tokens. That was until Cassandra version 3.0 was released, which brought with it a more intelligent token allocation algorithm thanks to CASSANDRA-7032. Using a ranking system, the algorithm feeds in the replication factor of a keyspace, the number of tokens, and the partitioner, to derive token ranges that are evenly distributed across the cluster of nodes.

The algorithm is configured by settings in the cassandra.yaml configuration file. Prior to this algorithm being added, the configuration file contained the necessary settings to configure the algorithm with the exception of the one to specify the keyspace name. When the algorithm was added, the allocate_tokens_for_keyspace setting was introduced into the configuration file. The setting allows a keyspace name to be specified so that during the bootstrap of a node we query the keyspace for its replication factor and pass that to the token allocation algorithm.

However, therein lies the problem, for existing clusters updating this setting is easy, as a keyspace already exists, but for a cluster starting from scratch we have a chicken and egg situation. How do we specify a keyspace that doesn’t exist!? And there are other caveats, too…

It works for only a single replication factor. As long as all the other keyspaces are using the same replication as the one specified for allocate_tokens_for_keyspace all is fine. However, if you have keyspaces with a different replication factor they can potentially cause hot spots.
It works when nodes are only added to the cluster. The process for token distribution when a node is removed from the cluster remains unchanged, and hence can cause hot spots.
It works with only the default partitioner, Murmur3Partitioner.

Additionally, this is no silver bullet for all unbalanced clusters; we still need to make sure we have a data model that evenly distributes data across partitions. Wide partitions can still be an issue and no amount of token shuffling will fix this.

Despite these drawbacks, this feature gives us the ability to allocate tokens in a more predictable way whilst leveraging the advantage of vnodes. This means we can specify a small value for vnodes (e.g. 4) and still be able to avoid hot spots. The question then becomes, in the case of starting a brand new cluster from scratch, which comes first the chicken or the egg?

One does not simply start a cluster… with evenly distributed tokens

While it might be possible to rectify an unbalance cluster due to unfortunate token allocations, it is better for the token allocation to be set up correctly when the cluster is created. To set up a brand new cluster that takes advantage of the allocate_tokens_for_keyspace setting we need to use the following steps. The method below takes into account a cluster with nodes that spread across multiple racks. The examples used in each step, assumes that our cluster will be configured as follows:

4 vnodes (num_tokens = 4).
3 racks with a single seed node in each rack.
A replication factor of 3, i.e. one replica per rack.

I have chosen 4 vnodes for demonstration purposes. Specifically it makes:

The uneven token distribution in a small cluster very obvious.
Identifying the endpoints displayed in nodetool ring easy.
The initial_token setup less verbose and easier to follow.

This blog post in no way recommends 4 as the default number of vnodes for new clusters. The value of vnodes (num_tokens) should be chosen according to the initial target size of your cluster.

1. Calculate and set tokens for the seed node in each rack

We will need to set the tokens for the seed nodes in each rack manually. This is to prevent each node from randomly calculating its own token ranges. We can calculate the token ranges that we will use for the initial_token setting using the following python code:

$ python

Python 2.7.13 (default, Dec 18 2016, 07:03:39)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> num_tokens = 4
>>> num_racks = 3
>>> print "\n".join(['[Node {}] initial_token: {}'.format(r + 1, ','.join([str(((2**64 / (num_tokens * num_racks)) * (t * num_racks + r)) - 2**63) for t in range(num_tokens)])) for r in range(num_racks)])
[Node 1] initial_token: -9223372036854775808,-4611686018427387905,-2,4611686018427387901
[Node 2] initial_token: -7686143364045646507,-3074457345618258604,1537228672809129299,6148914691236517202
[Node 3] initial_token: -6148914691236517206,-1537228672809129303,3074457345618258600,7686143364045646503

We can then uncomment the initial_token setting in the cassandra.yaml file in each of the seed nodes, set it to value generated by our python command, and set the num_tokens setting to the number of vnodes. When the node first starts the value for the initial_token setting will used, subsequent restarts will use the num_tokens setting.

Note that we need to manually calculate and specify the initial tokens for only the seed node in each rack. All other nodes will be configured differently.

2. Start the seed node in each rack

We can start the seed nodes one at a time using the following command:

$ sudo service cassandra start

When we watch the logs, we should see messages similar to the following appear:

...
INFO  [main] ... - This node will not auto bootstrap because it is configured to be a seed node.
INFO  [main] ... - tokens manually specified as [-9223372036854775808,-4611686018427387905,-2,4611686018427387901]
...

After starting the first of the seed nodes, we can use nodetool status to verify that 4 tokens are being used:

$ nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns (effective)  Host ID                               Rack
UN  172.31.36.11  99 KiB     4            100.0%            5d7e200d-ba1a-4297-a423-33737302e4d5  rack1

We will wait for this message appear in logs, then start the next seed node in the cluster.

INFO  [main] ... - Starting listening for CQL clients on ...

Once all seed nodes in the cluster are up, we can use nodetool ring to verify the token assignments in the cluster. It should look something like this:

$ nodetool ring

Datacenter: dc1
==========
Address        Rack        Status State   Load            Owns                Token
                                                                              7686143364045646503
31.36.11   rack1       Up     Normal  65.26 KiB       66.67%              -9223372036854775808
31.36.118  rack2       Up     Normal  65.28 KiB       66.67%              -7686143364045646507
31.43.239  rack3       Up     Normal  99.03 KiB       66.67%              -6148914691236517206
31.36.11   rack1       Up     Normal  65.26 KiB       66.67%              -4611686018427387905
31.36.118  rack2       Up     Normal  65.28 KiB       66.67%              -3074457345618258604
31.43.239  rack3       Up     Normal  99.03 KiB       66.67%              -1537228672809129303
31.36.11   rack1       Up     Normal  65.26 KiB       66.67%              -2
31.36.118  rack2       Up     Normal  65.28 KiB       66.67%              1537228672809129299
31.43.239  rack3       Up     Normal  99.03 KiB       66.67%              3074457345618258600
31.36.11   rack1       Up     Normal  65.26 KiB       66.67%              4611686018427387901
31.36.118  rack2       Up     Normal  65.28 KiB       66.67%              6148914691236517202
31.43.239  rack3       Up     Normal  99.03 KiB       66.67%              7686143364045646503

We can then move to the next step.

3. Create only the keyspace for the cluster

On any one of the seed nodes we will use cqlsh to create the cluster keyspace using the following commands:

$ cqlsh NODE_IP_ADDRESS -u ***** -p *****

Connected to ...
[cqlsh 5.0.1 | Cassandra 3.11.3 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cassandra@cqlsh>
cassandra@cqlsh> CREATE KEYSPACE keyspace_with_replication_factor_3
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}
    AND durable_writes = true;

Note that this keyspace can be any name, it can even be the keyspace that contains the tables we will use for our data.

4. Set the number of tokens and the keyspace for all remaining nodes

We will set the num_tokens and allocate_tokens_for_keyspace settings in the cassandra.yaml file on all of the remaining nodes as follows:

num_tokens: 4
...
allocate_tokens_for_keyspace: keyspace_with_replication_factor_3

We have assigned the allocate_tokens_for_keyspace value to be the name of keyspace created in the previous step. Note that at this point the Cassandra service on all other nodes is still down.

5. Start the remaining nodes in the cluster, one at a time

We can start the remaining nodes in the cluster using the following command:

$ sudo service cassandra start

When we watch the logs we should see messages similar to the following appear to say that we are using the new token allocation algorithm:

INFO  [main] ... - JOINING: waiting for ring information
...
INFO  [main] ... - Using ReplicationAwareTokenAllocator.
WARN  [main] ... - Selected tokens [...]
...
INFO  ... - JOINING: Finish joining ring

As per step 2 when we started the seed nodes, we will wait for this message to appear in the logs before starting the next node in the cluster.

INFO  [main] ... - Starting listening for CQL clients on ...

Once all the nodes are up, our shiny, new, evenly-distributed-tokens cluster is ready to go!

Proof is in the token allocation

While we can learn a fair bit from talking about the theory for the allocate_tokens_for_keyspace setting, it is still good to put it to the test and see what difference it makes when used in a cluster. I decided to create two clusters running Apache Cassandra 3.11.3 and compare the load distribution after inserting some data. For this test, I provisioned both clusters with 9 nodes using tlp-cluster and generated load using tlp-stress. Both clusters used 4 vnodes, but one of the clusters was setup using the even token distribution method described above.

Cluster using random token allocation

I started with a cluster that uses the traditional random token allocation system. For this cluster I set num_tokens: 4 and endpoint_snitch: GossipingPropertyFileSnitch in the cassandra.yaml on all the nodes. Nodes were split across three racks by specifying the rack in the cassandra-rackdc.properties file.

Once the cluster instances were up and Cassandra was installed, I started each node one at a time. After all nodes were started, the cluster looked like this:

ubuntu@ip-172-31-39-54:~$ nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
UN  172.31.36.95   65.29 KiB  4            16.1%             4ada2c52-0d1b-45cd-93ed-185c92038b39  rack1
UN  172.31.39.79   65.29 KiB  4            20.4%             c282ef62-430e-4c40-a1d2-47e54c5c8685  rack2
UN  172.31.47.155  65.29 KiB  4            21.2%             48d865d7-0ad0-4272-b3c1-297dce306a34  rack1
UN  172.31.43.170  87.7 KiB   4            24.5%             27aa2c78-955c-4ea6-9ea0-3f70062655d9  rack1
UN  172.31.39.54   65.29 KiB  4            30.8%             bd2d745f-d170-4fbf-bf9c-be95259597e3  rack3
UN  172.31.35.165  70.36 KiB  4            25.5%             056e2472-c93d-4275-a334-e82f87c4b53a  rack3
UN  172.31.35.149  70.37 KiB  4            24.8%             06b0e1e4-5e73-46cb-bf13-626eb6ce73b3  rack2
UN  172.31.35.33   65.29 KiB  4            23.8%             137602f0-3248-459f-b07c-c0b3e647fa48  rack2
UN  172.31.37.129  99.03 KiB  4            12.9%             cd92c974-b32e-4181-9e14-fb52dd27b09e  rack3

I ran tlp-stress against the cluster using the command below. This generated a write-only load that randomly inserted 10 million unique key value pairs into the cluster. tlp-stress inserted data into a newly created keyspace and tabled called tlp_stress.keyvalue.

tlp-stress run KeyValue --replication "{'class':'NetworkTopologyStrategy','dc1':3}" --cl LOCAL_QUORUM --partitions 10M --iterations 100M --reads 0 --host 172.31.43.170

After running tlp-stress the cluster load distribution for the tlp_stress keyspace looked like this:

ubuntu@ip-172-31-39-54:~$ nodetool status tlp_stress
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
UN  172.31.36.95   1.29 GiB   4            20.8%             4ada2c52-0d1b-45cd-93ed-185c92038b39  rack1
UN  172.31.39.79   2.48 GiB   4            39.1%             c282ef62-430e-4c40-a1d2-47e54c5c8685  rack2
UN  172.31.47.155  1.82 GiB   4            35.1%             48d865d7-0ad0-4272-b3c1-297dce306a34  rack1
UN  172.31.43.170  3.45 GiB   4            44.1%             27aa2c78-955c-4ea6-9ea0-3f70062655d9  rack1
UN  172.31.39.54   2.16 GiB   4            54.3%             bd2d745f-d170-4fbf-bf9c-be95259597e3  rack3
UN  172.31.35.165  1.71 GiB   4            29.1%             056e2472-c93d-4275-a334-e82f87c4b53a  rack3
UN  172.31.35.149  1.14 GiB   4            26.2%             06b0e1e4-5e73-46cb-bf13-626eb6ce73b3  rack2
UN  172.31.35.33   2.61 GiB   4            34.7%             137602f0-3248-459f-b07c-c0b3e647fa48  rack2
UN  172.31.37.129  562.15 MiB  4            16.6%             cd92c974-b32e-4181-9e14-fb52dd27b09e  rack3

I verified the data load distribution by checking the disk usage on all nodes using pssh (parallel ssh).

ubuntu@ip-172-31-39-54:~$ pssh -ivl ... -h hosts.txt "du -sh /var/lib/cassandra/data"
[1] ... [SUCCESS] 172.31.35.149
1.2G    /var/lib/cassandra/data
[2] ... [SUCCESS] 172.31.43.170
3.5G    /var/lib/cassandra/data
[3] ... [SUCCESS] 172.31.36.95
1.3G    /var/lib/cassandra/data
[4] ... [SUCCESS] 172.31.39.79
2.5G    /var/lib/cassandra/data
[5] ... [SUCCESS] 172.31.35.33
2.7G    /var/lib/cassandra/data
[6] ... [SUCCESS] 172.31.35.165
1.8G    /var/lib/cassandra/data
[7] ... [SUCCESS] 172.31.37.129
564M    /var/lib/cassandra/data
[8] ... [SUCCESS] 172.31.39.54
2.2G    /var/lib/cassandra/data
[9] ... [SUCCESS] 172.31.47.155
1.9G    /var/lib/cassandra/data

As we can see from the above results, there was large load distribution across nodes. Node 172.31.37.129 held the smallest amount of data (roughly 560 MB), whilst node 172.31.43.170 held six times that amount of data (~ roughly 3.5 GB). Effectively the difference between the smallest and largest data load is 3.0 GB!!

Cluster using predictive token allocation

I then moved on to setting up the cluster with predictive token allocation. Similar to the previous cluster, I set num_tokens: 4 and endpoint_snitch: GossipingPropertyFileSnitch in the cassandra.yaml on all the nodes. These settings were common to all nodes in this cluster. Nodes were again split across three racks by specifying the rack in the cassandra-rackdc.properties file.

I set the initial_token setting for each of the seed nodes and started the Cassandra process on them one at a time. One seed node allocated to each rack in the cluster.

The initial keyspace that would be specified in the allocate_tokens_for_keyspace setting was created via cqlsh using the following command:

CREATE KEYSPACE keyspace_with_replication_factor_3 WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '3'} AND durable_writes = true;

I then set allocate_tokens_for_keyspace: keyspace_with_replication_factor_3 in the cassandra.yaml file for the remaining non-seed nodes and started the Cassandra process on them one at a time. After all nodes were started, the cluster looked like this:

ubuntu@ip-172-31-36-11:~$ nodetool status keyspace_with_replication_factor_3
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
UN  172.31.36.47   65.4 KiB   4            32.3%             5ece457c-a3af-4173-b9c8-e937f8b63d3b  rack2
UN  172.31.43.239  117.45 KiB  4            33.3%             55a94591-ad6a-48a5-b2d7-eee7ea06912b  rack3
UN  172.31.37.44   70.49 KiB  4            33.3%             93054390-bc83-487c-8940-b99e7b85e5c2  rack3
UN  172.31.36.11   104.3 KiB  4            35.4%             5d7e200d-ba1a-4297-a423-33737302e4d5  rack1
UN  172.31.39.186  65.41 KiB  4            31.2%             ecd00ff5-a90a-4d33-b7ab-bdd22e3e50b8  rack1
UN  172.31.38.137  65.39 KiB  4            33.3%             64802174-885a-4c04-b530-a9b4685b1b96  rack1
UN  172.31.40.56   65.39 KiB  4            33.3%             0846effa-e4ac-4a19-845e-2162cd2b7680  rack3
UN  172.31.36.118  104.32 KiB  4            35.4%             5ad47bc0-9bcc-4fc5-b5b0-0a15ad63345f  rack2
UN  172.31.41.196  65.4 KiB   4            32.3%             4128ca20-b4fa-4173-88b2-aac62539a6d8  rack2

I ran tlp-stress against the cluster using the same command that was used to test the cluster with random token allocation. After running tlp-stress the cluster load distribution for the tlp_stress keyspace looked like this:

ubuntu@ip-172-31-36-11:~$ nodetool status tlp_stress
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
UN  172.31.36.47   2.16 GiB   4            32.3%             5ece457c-a3af-4173-b9c8-e937f8b63d3b  rack2
UN  172.31.43.239  2.32 GiB   4            33.3%             55a94591-ad6a-48a5-b2d7-eee7ea06912b  rack3
UN  172.31.37.44   2.32 GiB   4            33.3%             93054390-bc83-487c-8940-b99e7b85e5c2  rack3
UN  172.31.36.11   1.84 GiB   4            35.4%             5d7e200d-ba1a-4297-a423-33737302e4d5  rack1
UN  172.31.39.186  2.01 GiB   4            31.2%             ecd00ff5-a90a-4d33-b7ab-bdd22e3e50b8  rack1
UN  172.31.38.137  2.32 GiB   4            33.3%             64802174-885a-4c04-b530-a9b4685b1b96  rack1
UN  172.31.40.56   2.32 GiB   4            33.3%             0846effa-e4ac-4a19-845e-2162cd2b7680  rack3
UN  172.31.36.118  1.83 GiB   4            35.4%             5ad47bc0-9bcc-4fc5-b5b0-0a15ad63345f  rack2
UN  172.31.41.196  2.16 GiB   4            32.3%             4128ca20-b4fa-4173-88b2-aac62539a6d8  rack2

I again verified the data load distribution by checking the disk usage on all nodes using pssh.

ubuntu@ip-172-31-36-11:~$ pssh -ivl ... -h hosts.txt "du -sh /var/lib/cassandra/data"
[1] ... [SUCCESS] 172.31.36.11
1.9G    /var/lib/cassandra/data
[2] ... [SUCCESS] 172.31.43.239
2.4G    /var/lib/cassandra/data
[3] ... [SUCCESS] 172.31.36.118
1.9G    /var/lib/cassandra/data
[4] ... [SUCCESS] 172.31.37.44
2.4G    /var/lib/cassandra/data
[5] ... [SUCCESS] 172.31.38.137
2.4G    /var/lib/cassandra/data
[6] ... [SUCCESS] 172.31.36.47
2.2G    /var/lib/cassandra/data
[7] ... [SUCCESS] 172.31.39.186
2.1G    /var/lib/cassandra/data
[8] ... [SUCCESS] 172.31.40.56
2.4G    /var/lib/cassandra/data
[9] ... [SUCCESS] 172.31.41.196
2.2G    /var/lib/cassandra/data

As we can see from the above results, there was little variation in the load distribution across nodes compared to a cluster that used random token allocation. Node 172.31.36.118 held the smallest amount of data (roughly 1.83 GB) and nodes 172.31.43.239, 172.31.37.44, 172.31.38.137, and 172.31.40.56 held the largest amount of data (roughly 2.32 GB each). The difference between the smallest and largest data load being roughly 400 MB which is significantly less than the data size difference in the cluster that used random token allocation.

Conclusion

Having a perfectly balanced cluster takes a bit of work and planning. While there are some steps to set up and caveats to using the allocate_tokens_for_keyspace setting, the predictive token allocation is a definite must use when setting up a new cluster. As we have seen from testing, it allows us to take advantage of num_tokens being set to a low value without having to worry about hot spots developing in the cluster.

cassandra operations configuration