Modeling real life workloads with cassandra-stress is hard

The de facto tool for modeling and testing workloads on Cassandra is cassandra-stress. It is a widely known tool, appearing in numerous blog posts to illustrate performance testing on Cassandra, and it is often recommended for stress testing specific data models. In theory, there is no reason why cassandra-stress couldn't fit your performance testing needs. In practice, though, it has some caveats when modeling real workloads, the most important of which we will cover in this blog post.

Using cassandra-stress

When the time comes to run performance tests on Cassandra, you have the choice between writing your own code and using cassandra-stress. Saving time by relying on a tool that ships with your favourite database is the obvious choice for many.

Predefined mode

Using cassandra-stress in predefined mode is very convenient. With a simple command line, you can supposedly test the limits of your cluster, for example in terms of write ingestion:

cassandra-stress write n=100000 cl=one -mode native cql3 -node 10.0.1.24

This will output the following:

Created keyspaces. Sleeping 1s for propagation.
Sleeping 2s...
Warming up WRITE with 50000 iterations...
Connected to cluster: c228
Datatacenter: datacenter1; Host: /127.0.0.1; Rack: rack1
Datatacenter: datacenter1; Host: /127.0.0.2; Rack: rack1
Datatacenter: datacenter1; Host: /127.0.0.3; Rack: rack1
Failed to connect over JMX; not collecting these stats
Running WRITE with 200 threads for 100000 iteration
Failed to connect over JMX; not collecting these stats
type,      total ops,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr, errors,  gc: #,  max ms,  sum ms,  sdv ms,      mb
total,         14889,   14876,   14876,   14876,    13,4,     5,5,    49,4,    82,1,    98,0,   110,9,    1,0,  0,00000,      0,      0,       0,       0,       0,       0
total,         25161,    9541,    9541,    9541,    21,0,     8,4,    75,2,    84,7,    95,8,   107,4,    2,1,  0,15451,      0,      0,       0,       0,       0,       0
total,         33715,    8249,    8249,    8249,    24,5,    10,3,    91,9,   113,5,   135,1,   137,6,    3,1,  0,15210,      0,      0,       0,       0,       0,       0
total,         45773,   11558,   11558,   11558,    17,1,     8,8,    55,3,    90,6,    97,7,   105,3,    4,2,  0,11311,      0,      0,       0,       0,       0,       0
total,         57748,   11315,   11315,   11315,    17,6,     7,9,    67,2,    93,1,   118,0,   124,1,    5,2,  0,09016,      0,      0,       0,       0,       0,       0
total,         69434,   11039,   11039,   11039,    18,2,     8,5,    63,1,    95,4,   113,5,   122,8,    6,3,  0,07522,      0,      0,       0,       0,       0,       0
total,         83486,   13345,   13345,   13345,    15,1,     5,5,    56,8,    71,1,    91,3,   101,1,    7,3,  0,06786,      0,      0,       0,       0,       0,       0
total,         97386,   12358,   12358,   12358,    16,1,     7,1,    63,6,    92,4,   105,8,   138,6,    8,5,  0,05954,      0,      0,       0,       0,       0,       0
total,        100000,    9277,    9277,    9277,    19,6,     7,5,    64,2,    86,0,    96,6,    97,6,    8,7,  0,05802,      0,      0,       0,       0,       0,       0


Results:
op rate                   : 11449 [WRITE:11449]
partition rate            : 11449 [WRITE:11449]
row rate                  : 11449 [WRITE:11449]
latency mean              : 17,4 [WRITE:17,4]
latency median            : 7,4 [WRITE:7,4]
latency 95th percentile   : 63,6 [WRITE:63,6]
latency 99th percentile   : 91,8 [WRITE:91,8]
latency 99.9th percentile : 114,8 [WRITE:114,8]
latency max               : 138,6 [WRITE:138,6]
Total partitions          : 100000 [WRITE:100000]
Total errors              : 0 [WRITE:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:00:08
END

So at CL.ONE, a ccm cluster running on my laptop can go up to 9200 writes/s.
Is that interesting? Probably not…

So far, cassandra-stress has run in self-driving mode, creating its own schema and generating its own data. Taking a close look at the table that was actually generated, here is what we get:

CREATE TABLE keyspace1.standard1 (
    key blob PRIMARY KEY,
    "C0" blob,
    "C1" blob,
    "C2" blob,
    "C3" blob,
    "C4" blob
) WITH COMPACT STORAGE
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
    AND compression = {}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = 'NONE';

A compact storage table with blob-only columns and no clustering columns: it is probably not representative of the data models you are running in production. Furthermore, keyspace1 is created at RF=1 by default. That is understandable, since cassandra-stress should run on any cluster, even a single node one, but you will probably want an RF=3 keyspace to model real life workloads.
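If you want to keep the predefined mode but avoid the RF=1 default, the -schema switch should let you override the replication settings. Something along these lines ought to create keyspace1 at RF=3 (check cassandra-stress help -schema for the exact syntax in your version):

cassandra-stress write n=100000 cl=one -schema "replication(factor=3)" -mode native cql3 -node 10.0.1.24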

Even though this is most likely nothing like your use case, it is a works-everywhere, out-of-the-box solution that is useful for evaluating hardware configurations (tuning I/O for example) or for directly comparing different versions and/or configurations of Cassandra.

That said, running a mixed workload will prove misleading, even when simply comparing the raw performance of different hardware configurations. Unless run for a sufficient number of iterations, your read workload might exclusively be hitting memtables, and not a single SSTable:

adejanovski$ cassandra-stress mixed n=100000 cl=one -mode native cql3 -node 10.0.1.24

Sleeping for 15s
Running with 16 threadCount
Running [WRITE, READ] with 16 threads for 100000 iteration
Failed to connect over JMX; not collecting these stats
type,      total ops,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr, errors,  gc: #,  max ms,  sum ms,  sdv ms,      mb
READ,          10228,   10211,   10211,   10211,     0,8,     0,6,     1,8,     2,8,     7,8,     8,5,    1,0,  0,00000,      0,      0,       0,       0,       0,       0
WRITE,         10032,   10018,   10018,   10018,     0,8,     0,6,     1,7,     2,7,     7,6,     8,5,    1,0,  0,00000,      0,      0,       0,       0,       0,       0
total,         20260,   20226,   20226,   20226,     0,8,     0,6,     1,7,     2,8,     7,7,     8,5,    1,0,  0,00000,      0,      0,       0,       0,       0,       0
READ,          21766,   11256,   11256,   11256,     0,7,     0,5,     1,6,     2,4,     7,1,    14,2,    2,0,  0,03979,      0,      0,       0,       0,       0,       0
WRITE,         21709,   11387,   11387,   11387,     0,7,     0,5,     1,5,     2,5,     9,0,    14,1,    2,0,  0,03979,      0,      0,       0,       0,       0,       0
total,         43475,   22639,   22639,   22639,     0,7,     0,5,     1,5,     2,4,     8,9,    14,2,    2,0,  0,03979,      0,      0,       0,       0,       0,       0
READ,          32515,   10600,   10600,   10600,     0,7,     0,5,     1,8,     4,0,     7,9,    11,8,    3,0,  0,02713,      0,      0,       0,       0,       0,       0
WRITE,         32311,   10454,   10454,   10454,     0,7,     0,5,     1,9,     4,5,     7,8,    10,5,    3,0,  0,02713,      0,      0,       0,       0,       0,       0
total,         64826,   21050,   21050,   21050,     0,7,     0,5,     1,9,     4,2,     7,9,    11,8,    3,0,  0,02713,      0,      0,       0,       0,       0,       0
READ,          42743,   10065,   10065,   10065,     0,8,     0,6,     2,0,     3,2,     8,4,     8,7,    4,1,  0,02412,      0,      0,       0,       0,       0,       0
WRITE,         42502,   10031,   10031,   10031,     0,8,     0,6,     1,9,     2,9,     7,4,     8,9,    4,1,  0,02412,      0,      0,       0,       0,       0,       0
total,         85245,   20095,   20095,   20095,     0,8,     0,6,     2,0,     3,0,     7,7,     8,9,    4,1,  0,02412,      0,      0,       0,       0,       0,       0
READ,          50019,   10183,   10183,   10183,     0,8,     0,6,     1,8,     2,7,     8,0,    11,1,    4,8,  0,01959,      0,      0,       0,       0,       0,       0
WRITE,         49981,   10473,   10473,   10473,     0,8,     0,5,     1,8,     2,7,     7,2,    12,3,    4,8,  0,01959,      0,      0,       0,       0,       0,       0
total,        100000,   20651,   20651,   20651,     0,8,     0,6,     1,8,     2,7,     7,6,    12,3,    4,8,  0,01959,      0,      0,       0,       0,       0,       0


Results:
op rate                   : 20958 [READ:10483, WRITE:10476]
partition rate            : 20958 [READ:10483, WRITE:10476]
row rate                  : 20958 [READ:10483, WRITE:10476]
latency mean              : 0,7 [READ:0,8, WRITE:0,7]
latency median            : 0,5 [READ:0,6, WRITE:0,5]
latency 95th percentile   : 1,8 [READ:1,8, WRITE:1,8]
latency 99th percentile   : 2,8 [READ:2,9, WRITE:3,0]
latency 99.9th percentile : 6,5 [READ:7,8, WRITE:7,7]
latency max               : 14,2 [READ:14,2, WRITE:14,1]
Total partitions          : 100000 [READ:50019, WRITE:49981]
Total errors              : 0 [READ:0, WRITE:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:00:04 

Here we can see that reads have been almost as fast as writes, which even with the help of SSDs is still unexpected.
Checking nodetool cfhistograms confirms that we didn't hit a single SSTable:

adejanovski$ ccm node1 nodetool cfhistograms keyspace1 standard1
No SSTables exists, unable to calculate 'Partition Size' and 'Cell Count' percentiles

keyspace1/standard1 histograms
Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                              (micros)          (micros)           (bytes)                  
50%             0,00             11,86             17,08               NaN               NaN
75%             0,00             14,24             20,50               NaN               NaN
95%             0,00             20,50             29,52               NaN               NaN
98%             0,00             29,52             35,43               NaN               NaN
99%             0,00             42,51             51,01               NaN               NaN
Min             0,00              2,76              4,77               NaN               NaN
Max             0,00          25109,16          17436,92               NaN               NaN

As stated here, “No SSTables exists”.

After a few runs with different thread counts (the mixed workload runs several times, changing the number of threads), we finally get some flushes, which mildly change the distribution:

keyspace1/standard1 histograms
Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                              (micros)          (micros)           (bytes)                  
50%             0,00             11,86             17,08               310                 5
75%             0,00             14,24             20,50               310                 5
95%             1,00             20,50             61,21               310                 5
98%             1,00             42,51             88,15               310                 5
99%             1,00             88,15            105,78               310                 5
Min             0,00              1,92              3,97               259                 5
Max             1,00         155469,30          17436,92               310                 5

There we can see that we're reading 310-byte partitions, and still hitting only memtables in at least 75% of our reads.
Anyone operating Cassandra in production would love to have such outputs when running cfhistograms :)

As a best practice, always run separate write and read commands, each with an appropriate fixed throttle rate. This gives you a properly specified read-to-write ratio, mitigates the memtable and flushing issue, and avoids coordinated omission.
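For instance, assuming a 2:1 read-to-write production ratio, you could first load the data with a write-only run, then drive reads and writes as two separate commands running in parallel (e.g. from two shells), each with its own fixed rate. The figures below are purely illustrative; pick them from your production metrics:

cassandra-stress write n=10000000 cl=one -mode native cql3 -node 10.0.1.24 -rate threads=50 fixed="5000/s"
cassandra-stress read duration=30m cl=one -mode native cql3 -node 10.0.1.24 -rate threads=100 fixed="10000/s"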

Clearly, the predefined mode is of limited interest, and you will want to use a very high number of iterations to ensure data is accessed both in memory and on disk.

Let’s now dive into user defined workloads, which should prove way more useful.

User defined mode

The user defined mode of cassandra-stress allows running performance tests on custom data models, using yaml files for configuration.
I will not detail how that works here and invite you to check this Datastax blog post and this Instaclustr blog post, which should give you a good understanding of how to model your custom workload in cassandra-stress.

Our use case for cassandra-stress was to test how a customer workload would behave on different instance sizes in AWS. The plan was to model every table in yaml files, try to replicate the partition size distribution and mean row size, and use the rate limiter to stay as close as possible to the observed production workload.

No support for maps or UDTs

The first problem we ran into is that cassandra-stress doesn't support maps, and we had a map in a high traffic table. Changing that would have meant going deep into the cassandra-stress code to add support, which was not feasible in the time we had allocated to this project.
The poor man's solution we chose was to replace the map with a text column. While not the same thing, we were lucky enough that the map was written immutably in one go and never partially updated over time. Getting the size right for that field was then a matter of trying different settings and checking the mean row size.
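As a rough illustration, here is what that stand-in column could look like in the yaml profile. The keyspace, table and column names are made up, the profile is trimmed down (a real one would also define keyspace_definition, insert and queries sections), and the size distribution is exactly the parameter we iterated on until the mean row size matched production:

keyspace: stress_ks
table: events
table_definition: |
  CREATE TABLE events (
    event_id bigint,
    bucket int,
    payload text,
    PRIMARY KEY ((event_id), bucket)
  )
columnspec:
  - name: payload
    size: GAUSSIAN(200..2000)   # tuned by trial and error against the production mean row size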

Getting partition size distribution is hard

To be fair, cassandra-stress does a great job of letting you define a specific distribution for partition sizes. As shown in the links above, you can use GAUSSIAN, EXTREME, UNIFORM, EXP and FIXED as distribution patterns.
Getting the right distribution will require several attempts though, and any change to the mean row size will require tweaking the distribution parameters again to get a cfhistograms output that looks like your production one.

To batch or not to batch, it should be up to you…

There is no straightforward way to tell cassandra-stress not to batch queries and to send them individually instead. You have to take the upper bound of your cluster distribution and use it as the divider of the select distribution ratio:

...
  - name: event_id
    population: UNIFORM(1..100B)
    cluster: EXTREME(10..500, 0.4)  
...
insert:
  partitions: FIXED(1)
  select: FIXED(1)/500
  batchtype: UNLOGGED
...

Unfortunately, and as reported in this JIRA ticket, the current implementation prevents this from working as one would expect.
Diving into the code, it is clear that all rows in a partition will get batched together no matter what (from SchemaInsert.java):

public boolean run() throws Exception
{
    List<BoundStatement> stmts = new ArrayList<>();
    partitionCount = partitions.size();

    for (PartitionIterator iterator : partitions)
        while (iterator.hasNext())
            stmts.add(bindRow(iterator.next()));

    rowCount += stmts.size();

    // 65535 is max number of stmts per batch, so if we have more, we need to manually batch them
    for (int j = 0 ; j < stmts.size() ; j += 65535)
    {
        List<BoundStatement> substmts = stmts.subList(j, Math.min(j + stmts.size(), j + 65535));
        Statement stmt;
        if (stmts.size() == 1)
        {
            stmt = substmts.get(0);
        }
        else
        {
            BatchStatement batch = new BatchStatement(batchType);
            batch.setConsistencyLevel(JavaDriverClient.from(cl));
            batch.addAll(substmts);
            stmt = batch;
        }

        client.getSession().execute(stmt);
    }
    return true;
}

All the partitions in an operation will get batched together, in chunks of at most 65,535 statements, with no trace of the batchSize argument that exists in the SchemaInsert constructor.

Since in our specific case we were using Apache Cassandra 2.1, we patched that code path in order to get proper batch sizes (in our case, one row per batch).
Sadly, the patch does not apply on trunk, as other changes made in between prevent it from working correctly.

The code here should also be further optimized by using asynchronous queries instead of batches, as this has been considered a best practice for a long time now, especially if your operations contain several partitions.

I’ve submitted a new patch on CASSANDRA-11105 to fix the issue on the 4.0 trunk and switch to asynchronous queries instead of synchronous ones. You can still use batches of course if that fits your use case.
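To give an idea of the direction, here is a simplified sketch (not the actual patch) of what that code path can look like when each row is sent as its own asynchronous query with the DataStax Java driver instead of being wrapped in a batch. bindRow, partitions, client and cl are the same members used in the original snippet above; ResultSetFuture and Futures.allAsList come from the Java driver and its Guava dependency:

public boolean run() throws Exception
{
    // Send each row as an individual asynchronous query instead of batching them.
    List<ResultSetFuture> futures = new ArrayList<>();
    partitionCount = partitions.size();

    for (PartitionIterator iterator : partitions)
    {
        while (iterator.hasNext())
        {
            BoundStatement stmt = bindRow(iterator.next());
            stmt.setConsistencyLevel(JavaDriverClient.from(cl));
            futures.add(client.getSession().executeAsync(stmt));
            rowCount += 1;
        }
    }

    // Wait for all in-flight queries before reporting the operation as complete.
    Futures.allAsList(futures).get();
    return true;
}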

You can also consider the visits switch on inserts. It lets you spread the inserts for each partition over several chunks, breaking partitions into several operations over the course of the stress test. The difficult part is that you will have to closely correlate the number of iterations, the number of possible distinct partition keys, the number of visits per partition and the select ratio in order to get a representative distribution of inserts; otherwise the resulting partition sizes won't match reality.

On the bright side, the visits switch can bring you closer to real life scenarios by spreading the inserts of a partition over time instead of doing them all at once (batched or not). The effect is that your rows will get spread over multiple SSTables instead of living in a single one.

You can achieve something similar by running several write tests before starting your read test.

Using our example from above, we could make sure each partition spreads every row into a different operation with the following command line:

cassandra-stress user profile=my_stress_config.yaml n=100000 'ops(insert=1)' cl=LOCAL_QUORUM truncate=always no-warmup -node 10.0.1.24 -mode native cql3 compression=lz4 -rate threads=20 -insert visits=FIXED\(500\)

Note that if you allow a great number of distinct partition keys in your yaml, using 100k iterations as above will probably create small partitions, because:

  • you won’t run enough operations to create enough distinct partitions
  • some partitions will be visited way more times than they have rows, which will generate a lot of overwrites
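To put rough numbers on this: with visits=FIXED(500), each partition gets visited 500 times, so 100,000 iterations only touch about 100,000 / 500 = 200 distinct partitions out of the 100 billion keys allowed by UNIFORM(1..100B); and since EXTREME(10..500, 0.4) means most partitions have far fewer than 500 rows, many of those visits will simply rewrite rows that already exist.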

You could try using the same distribution as your clustering key, but the two won't be correlated and you'll end up with batches being generated.

You can find more information on the -insert switch in this JIRA comment.

Rate limiter FTW!

Once the yaml files are ready, you'll have to find the right arguments for your cassandra-stress command line, which is not for the faint of heart. There are lots of switches, and the documentation is not always as thorough as one would hope.
One feature we were particularly interested in was the rate limiter. Since we had to model a production load spanning more than 10 tables, it was clear that all of this would be useless if each table wasn't rate limited to match its specific production rate.
The -rate switch lets you specify both the number of threads used by cassandra-stress and the rate limit:

cassandra-stress ... -rate threads=${THREADS} fixed="${RATE}/s"

We quickly observed during testing that the rate limiter was not limiting queries but operations, and an operation can contain several partitions, which can in turn contain several rows. This makes sense if you account for the batch size bug above, as one operation would most often be carried out by a single query. Since being able to mimic the production rate was mandatory, we had to patch cassandra-stress to rate limit each individual query.

Make sure you understand coordinated omission before running cassandra-stress with the rate limiter (which you should almost always do), and use fixed instead of throttle on the rate limiter.
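As far as I understand the difference: throttle only caps the rate and measures latency from the moment an operation actually starts, so queueing delays are silently dropped, while fixed measures latency against the intended schedule, so those delays show up in the reported percentiles. The syntax is otherwise identical:

cassandra-stress ... -rate threads=50 throttle="10000/s"
cassandra-stress ... -rate threads=50 fixed="10000/s"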

The problem we couldn’t fix: compression

With our fancy patched stress.jar, we went and ran the full stress test suite: rate limits were respected, the mean row size was good and each query was sent individually.
Alas, our nodes soon backed up on compactions, which was totally unexpected as we were using SSDs that could definitely keep up with our load.

The good news was that our biggest table was using TWCS with a 1 day time window, which allowed us to get the exact size on disk generated for each node per day. After letting the stress test run for a full day and waiting for compaction to catch up, it appeared we had been generating 4 to 6 times the size of what we had in production.
It took some time to realize that the problem came from compression. While in production we had a compression ratio ranging from 0.15 to 0.25, the stress test environment showed ratios going from 0.6 up to 0.9 in the worst case.
Our explanation for this behavior is that cassandra-stress generates highly randomized sequences of characters (many outside the [a-zA-Z0-9] range) that compress poorly, for lack of repetition. Data in a real table usually follows real life patterns that are prone to repeated character sequences, and those compress well.
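For what it's worth, the compression ratio is easy to compare between environments: nodetool cfstats (tablestats on more recent versions) reports an "SSTable Compression Ratio" line per table, for example:

adejanovski$ ccm node1 nodetool cfstats keyspace1.standard1 | grep -i "compression ratio"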

Such a difference in compression ratio is a big deal when it comes to I/O. For the same uncompressed data size, we couldn't keep up with compactions and read latencies were way higher.
The only workaround we found, a poor man's one once again, was to reduce our mean row size to match the size on disk. This is clearly unsatisfactory, as the uncompressed size in memory then no longer matches production, which can change the heap usage of our workload.

Conclusion

There are a few deal breakers right now when using cassandra-stress to model a real life workload.
If you've tried to use it this way, it is highly probable that you've developed a love/hate relationship with the tool: it is very powerful in theory and is the natural solution to the problem, but in real life it can be difficult and time-consuming to get close to a production stress case against an existing, well-defined data model.

On the bright side, there are no fundamental flaws, and with a coordinated effort, without even redesigning the tool from the ground up, we could easily make it better.

Beware of short stress sessions, as they have very limited value: you must be accessing data from disk in order to have realistic reads, and you must wait for serious compaction to kick in in order to get something that looks like the real world.

I invite you to watch Chris Batey’s talk at Cassandra Summit 2016 for more insights on cassandra-stress.
