<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
 
 <title>Aaron Morton - All About Cassandra</title>
 <link href="http://thelastpickle.com//topics/Cassandra/index.xml" rel="self"/>
 <link href="http://thelastpickle.com/"/>
 <updated>2012-01-12T13:49:14-08:00</updated>
 <id>http://thelastpickle.com/topics/Cassandra</id>
 <author>
   <name>Aaron Morton</name>
   <email>aaron@thelastpickle.com</email>
 </author>

 
 <entry>
   <title>Cassandra Reading 02-01-2012</title>
   <link href="http://thelastpickle.com/2012/01/02/Cassandra-Reading"/>
   <updated>2012-01-02T00:00:00-08:00</updated>
   <id>http://thelastpickle.com/2012/01/02/Cassandra-Reading</id>
   <content type="html">&lt;h2 id='cql_sql_in_cassandra'&gt;&lt;a href='http://www.slideshare.net/jericevans/cql-sql-in-cassandra'&gt;CQL: SQL In Cassandra&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;And a &lt;a href='http://www.acunu.com/blogs/eric-evans/cql-benchmarking/'&gt;blog post&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;By Eric Evans&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Good examples of simple statements.&lt;/li&gt;

&lt;li&gt;Performance comparisons show CQL has 5% to 10% lower throughput and higher latency.&lt;/li&gt;

&lt;li&gt;List of drivers and &lt;a href='http://code.google.com/a/apache-extras.org/hosting/search?q=label%3ACassandra'&gt;where&lt;/a&gt; to get them.&lt;/li&gt;

&lt;li&gt;More improvements coming.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='storm_cassandra_integration'&gt;&lt;a href='https://github.com/ptgoetz/storm-cassandra'&gt;Storm Cassandra Integration&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;By P. Taylor Goetz&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;a href='https://github.com/nathanmarz/storm'&gt;storm&lt;/a&gt; bolt to persist data to Cassandra.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='cassandra_in_online_advertising_real_time_bidding'&gt;&lt;a href='http://www.slideshare.net/edwardcapriolo/m6d-cassandrapresentation'&gt;Cassandra in Online Advertising: Real Time Bidding&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;By Edward Capriolo&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10TB of data and growing.&lt;/li&gt;

&lt;li&gt;Latency requirements of less than 120ms.&lt;/li&gt;

&lt;li&gt;&amp;#8220;Distributed not Duplicated&amp;#8221; is a great phrase.&lt;/li&gt;

&lt;li&gt;Ed still loves the Cacti :)&lt;/li&gt;

&lt;li&gt;Good idea to use JMX to modify the (Cassandra) cache settings for a single node to compare against the others.&lt;/li&gt;

&lt;li&gt;Pay attention to how the cache hit rate varies with regard to the cache size, at some point may be better to accept a (say) 90% hit rate and give more memory to another CF.&lt;/li&gt;

&lt;li&gt;Tuning IO performance for peak and off-peak by modifying &lt;code&gt;nodetool setcompactionlimit&lt;/code&gt; to improve compaction performance.&lt;/li&gt;

&lt;li&gt;Running night time major compactions like Urban Airship. I wonder how leveled compaction would work with the mixed workload?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='replication_and_the_latencyconsistency_tradeoff'&gt;&lt;a href='http://dbmsmusings.blogspot.com/2011/12/replication-and-latency-consistency.html'&gt;Replication and the latency-consistency tradeoff&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;By Daniel Abadi&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A discussion about latency and consistency.&lt;/li&gt;

&lt;li&gt;&amp;#8220;there&amp;#8217;s no way to perform consistent replication across database replicas without some level of synchronous network communication.&amp;#8221;&lt;/li&gt;

&lt;li&gt;Not sure I agree that in Dynamo / Cassandra (not sure about Riak) &amp;#8220;updates generally go to the same node, and are then propagated synchronously to W other nodes (case (2)(c))&amp;#8221;.&lt;/li&gt;

&lt;li&gt;&lt;a href='http://dbmsmusings.blogspot.com/2011/12/replication-and-latency-consistency.html?showComment=1325454055745#c307358897668058915'&gt;I think&lt;/a&gt; how inconsistencies are handled during read requests is another source of latency.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='cassandra_for_sys_admins'&gt;&lt;a href='http://www.slideshare.net/nmilford/cassandra-for-sysadmins'&gt;Cassandra for sys admins&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;By Nathan Milford&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;14 nodes in two DC&amp;#8217;s&lt;/li&gt;

&lt;li&gt;In productions since Cassandra version 0.4, awesome!&lt;/li&gt;

&lt;li&gt;A good list of things to monitor in a cluster.&lt;/li&gt;

&lt;li&gt;Good best practices for shutting down a node that give the fastest startup, also &lt;a href='http://blog.milford.io/2011/11/rolling-upgrades-for-cassandra'&gt;see&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='cassandra_for_lobs'&gt;&lt;a href='http://ruby.dzone.com/articles/cassandra-lobs'&gt;Cassandra for LOBS&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;By Dan Pritchett&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expensive SAN storage bombshell.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='expedia_hotel_price_cache'&gt;&lt;a href='http://www.slideshare.net/clibou/seattle-scalability-meetup-10505322/25'&gt;Expedia Hotel Price Cache&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;By B. Todd Burruss&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pre-calculate, trading space for time.&lt;/li&gt;

&lt;li&gt;A rolling window of 2.8 billion data points.&lt;/li&gt;

&lt;li&gt;Test, measure, tune.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='data_modeling_examples'&gt;&lt;a href='http://www.slideshare.net/mattdennis/cassandra-nyc-2011-data-modeling'&gt;Data Modeling Examples&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;By Matthew Dennis&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I always check out Matthew&amp;#8217;s data model presentations to see what the best practices are.&lt;/li&gt;

&lt;li&gt;&amp;#8220;Usually better to keep a record that something happened as opposed to changing a value&amp;#8221;.&lt;/li&gt;

&lt;li&gt;Good advice on time series and the XACT_LOG.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='cassandra_in_production_things_we_learned'&gt;&lt;a href='http://devblog.seomoz.org/2011/11/cassandra-in-production-things-we-learned/'&gt;Cassandra In Production: Things We Learned&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;By Walt Jones&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using Cassandra 0.7, there are a &lt;em&gt;lot&lt;/em&gt; of improvements in Cassandra 1.0.&lt;/li&gt;

&lt;li&gt;12 AWS EC2 m1.xlarge nodes with 5TB of data.&lt;/li&gt;

&lt;li&gt;S3 archive&lt;/li&gt;

&lt;li&gt;The memory footprint issues has been eliminated in Cassandra 1.0.&lt;/li&gt;

&lt;li&gt;Please avoid using Super Columns.&lt;/li&gt;

&lt;li&gt;Please use the Random Partitioner.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='data_modeling_with_cassandra'&gt;&lt;a href='http://www.acunu.com/blogs/sam-overton/cassandra-data-modelling/'&gt;Data Modeling with Cassandra&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;By Sam Overton&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;De-normalize for a brighter future.&lt;/li&gt;

&lt;li&gt;No SQL &amp;#8220;Hello World&amp;#8221; twitter example.&lt;/li&gt;
&lt;/ul&gt;</content>
 </entry>
 
 <entry>
   <title>Anatomy of a Cassandra Partition</title>
   <link href="http://thelastpickle.com/2011/12/15/Anatomy-of-a-Cassandra-Partition"/>
   <updated>2011-12-15T00:00:00-08:00</updated>
   <id>http://thelastpickle.com/2011/12/15/Anatomy-of-a-Cassandra-Partition</id>
   <content type="html">&lt;p&gt;Recently I was working on a Cassandra cluster and experienced a strange situation that resulted in a partition of sorts between the nodes. Whether you actually call it a partition or not is a matter for discussion (see &lt;a href='http://codahale.com/you-cant-sacrifice-partition-tolerance/'&gt;&amp;#8221; You Can’t Sacrifice Partition Tolerance - Updated October 22, 2010&amp;#8221;&lt;/a&gt;]. But weird stuff happened, Cassandra remained available, and it was fixed with zero site down time. It was also a good example of how and why Cassandra is a Highly Available data store, and fun to fix. So here are all the nerdy details&amp;#8230;&lt;/p&gt;

&lt;p&gt;The cluster was running Cassandra 1.0.5 on 6 nodes, 3 in the East DC (nodes 20, 21 and 22) and 3 in the West DC (nodes 23, 24 and 25), and the Keyspace was configured with RF 3 in each DC. Application servers were running in each DC and it was important to keep a &lt;code&gt;LOCAL_QUOURM&lt;/code&gt; in each data centre. The code would drop back to QUORUM if necessary, but that was something we wanted to avoid.&lt;/p&gt;

&lt;p&gt;I was deploying some configuration changes that required a rolling restart. The first few nodes worked fine, after restarting Cassandra I watched the other nodes detect it was &lt;code&gt;UP&lt;/code&gt; and start sending hints to it. However when node #23 restarted the rest of the cluster continued to see it as &lt;code&gt;DOWN&lt;/code&gt;, even though node #23 could see the rest of the cluster as &lt;code&gt;UP&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id='is_cassandra_available_'&gt;Is Cassandra available ?&lt;/h2&gt;

&lt;p&gt;This was the first question to answer. For this situation available meant all keys could be written to or read from using the &lt;code&gt;LOCAL_QUOURM&lt;/code&gt; CL. Technically as I still had at least 2 nodes in each data centre I still had a Quorum, but strange stuff was happening so I wanted to confirm this.&lt;/p&gt;

&lt;p&gt;Node #23 appeared to be happy:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cassandra23# bin/nodetool -h localhost info
Token            : 28356863910078205288614550619314017621
Gossip active    : true
Load             : 275.44 GB
Generation No    : 1762556151
Uptime (seconds) : 67548
Heap Memory (MB) : 2926.44 / 8032.00
Data Center      : DC1
Rack             : RAC_unknown  
Exceptions       : 0&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;(Note: this is from a little later, so the Uptime is not correct)&lt;/p&gt;

&lt;p&gt;When I looked at the ring from any node other than #23 all nodes were &lt;code&gt;UP&lt;/code&gt; other than node #23 (10.29.60.10):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cassandra20# nodetool -h localhost ring
Address         DC          Rack        Status State   Load            Owns    Token                                       
                                                                               141784319550391026443072753096570088106     
10.37.114.8     DC1         RAC20       Up     Normal  285.86 GB       16.67%  0                                           
10.29.60.10     DC2         RAC23       Down   Normal  277.86 GB       16.67%  28356863910078205288614550619314017621      
10.6.130.70     DC1         RAC21       Up     Normal  244.9 GB        16.67%  56713727820156410577229101238628035242      
10.29.60.14     DC2         RAC24       Up     Normal  296.85 GB       16.67%  85070591730234615865843651857942052864      
10.37.114.10    DC1         RAC22       Up     Normal  255.81 GB       16.67%  113427455640312821154458202477256070485     
10.29.60.12     DC2         RAC25       Up     Normal  316.88 GB       16.67%  141784319550391026443072753096570088106  &lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;But when I looked at the ring from node #23 all nodes were &lt;code&gt;UP&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cassandra23# nodetool -h localhost ring
    Address         DC          Rack        Status State   Load            Owns    Token                                       
                                                                                   141784319550391026443072753096570088106     
    10.37.114.8     DC1         RAC20       Up     Normal  285.86 GB       16.67%  0                                           
    10.29.60.10     DC2         RAC23       Up     Normal  277.86 GB       16.67%  28356863910078205288614550619314017621      
    10.6.130.70     DC1         RAC21       Up     Normal  244.9 GB        16.67%  56713727820156410577229101238628035242      
    10.29.60.14     DC2         RAC24       Up     Normal  296.85 GB       16.67%  85070591730234615865843651857942052864      
    10.37.114.10    DC1         RAC22       Up     Normal  255.81 GB       16.67%  113427455640312821154458202477256070485     
    10.29.60.12     DC2         RAC25       Up     Normal  316.88 GB       16.67%  141784319550391026443072753096570088106  &lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So it did not appear to be a cross DC networking issue, and technically we have a Quorum of replicas in each data centre.&lt;/p&gt;

&lt;p&gt;To confirm the consistency levels available to the application I opened the &lt;code&gt;cassandra-cli&lt;/code&gt; on nodes #20 and #23 and did some &lt;code&gt;get&lt;/code&gt;&amp;#8217;s changing the consistency level with &lt;code&gt;consistencylevel as&lt;/code&gt;. It was still working but things were strange, this is what I found.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Node #23 could serve requests at &lt;code&gt;LOCAL_QUORUM&lt;/code&gt;, &lt;code&gt;QUORUM&lt;/code&gt; and &lt;code&gt;ALL&lt;/code&gt; consistency.&lt;/li&gt;

&lt;li&gt;All other nodes could serve requests at &lt;code&gt;LOCAL_QUOURM&lt;/code&gt; and &lt;code&gt;QUORUM&lt;/code&gt; but not &lt;code&gt;ALL&lt;/code&gt; consistency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also I could see requests been processed by node #23 using &lt;code&gt;nodetool tpstats&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;So from the point of view of the application the cluster was up, and the application guys confirmed this. However we had lost redundancy in the West DC, if we lost another node for any reason we would lose the &lt;code&gt;LOCAL_QUORUM&lt;/code&gt;. Any reason can include things like a planned restart.&lt;/p&gt;

&lt;p&gt;To be sure this had to be fixed, but we had time to fix it. This is part of the ops story with Cassandra, if something goes wrong you can normally fix it while the application keeps working.&lt;/p&gt;

&lt;h2 id='orientate'&gt;Orientate&lt;/h2&gt;

&lt;p&gt;My initial thought was that either the Gossip heartbeat from node #23 was not getting through, or it was been ignored. So I started looking around to get an idea of what view the nodes in the cluster had of node #23.&lt;/p&gt;

&lt;p&gt;Nodes in the rest of the cluster could telnet to node #23 on port 7000 the default &lt;code&gt;storage_port&lt;/code&gt;. A simple test but it helps to know the basic assumptions are still valid.&lt;/p&gt;

&lt;p&gt;All the nodes (including #23) had a consistent view of the state of the other nodes propagated via Gossip, for example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cassandra20# bin/nodetool -h localhost gossipinfo
...
/10.29.60.10
  LOAD:2.98347080902E11
  STATUS:NORMAL,28356863910078205288614550619314017621
  RPC_ADDRESS:10.29.60.10
  SCHEMA:fe933880-19bd-11e1-0000-5ff37d368cb6
  RELEASE_VERSION:1.0.5 &lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &amp;#8220;Application State&amp;#8221; transmitted via Gossip is not directly related to a node been marked as &lt;code&gt;DOWN&lt;/code&gt; (more later), but at least I could see that Gossip was distributing the same information about the node around the cluster. The properties listed above have the following meaning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;LOAD&lt;/code&gt;: The total live disk space used, in bytes.&lt;/li&gt;

&lt;li&gt;&lt;code&gt;STATUS&lt;/code&gt;: The ring state of the node, it&amp;#8217;s token (if any) and some other information. &lt;code&gt;NORMAL&lt;/code&gt; indicates the node is a normal member of the ring, the other options are &lt;code&gt;BOOT&lt;/code&gt;, &lt;code&gt;LEAVING&lt;/code&gt;, &lt;code&gt;LEFT&lt;/code&gt;, &lt;code&gt;MOVING&lt;/code&gt;.&lt;/li&gt;

&lt;li&gt;&lt;code&gt;RPC_ADDRESS&lt;/code&gt;: IP address to connect to this node on.&lt;/li&gt;

&lt;li&gt;&lt;code&gt;SCHEMA&lt;/code&gt;: UUID of the schema this node is using.&lt;/li&gt;

&lt;li&gt;&lt;code&gt;RELEASE_VERSION&lt;/code&gt;: Cassandra version.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If Application State was responsible &lt;code&gt;nodetool ring&lt;/code&gt; would have said something other than &lt;code&gt;DOWN&lt;/code&gt; for node #23.&lt;/p&gt;

&lt;p&gt;Node #23 should have been trying to tell the other nodes that it is here and happy to answer queries. So the next step was to look at the Gossip traffic and see if that was happening. I started by enabling &lt;code&gt;TRACE&lt;/code&gt; debugging on the main &lt;code&gt;org.apache.cassandra.gms.Gossiper&lt;/code&gt; class by adding the line below to &lt;code&gt;conf/log4j-server.properties&lt;/code&gt;, log4j is configured to watch for changes so you can just change the file and wait a few seconds.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;log4j.logger.org.apache.cassandra.gms.Gossiper=TRACE&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To disable a live change to the logging config like this it&amp;#8217;s not enough to just comment out the line. You will need to a provide a new value for the &lt;code&gt;Gossiper&lt;/code&gt; logger. So after I saw what I wanted in &lt;code&gt;/var/log/cassandra/system.log&lt;/code&gt; on node #20 I changed the line the logging config to be:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;log4j.logger.org.apache.cassandra.gms.Gossiper=INFO&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This was the first thing I looked at in the logs on node #20:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;TRACE [GossipStage:1] 2011-12-13 00:58:49,636 Gossiper.java (line 647) local heartbeat version 526912 greater than 7951 for /10.29.60.10
TRACE [GossipStage:1] 2011-12-13 00:58:49,636 Gossiper.java (line 661) Adding state LOAD: 2.98347080902E11
TRACE [GossipStage:1] 2011-12-13 00:58:49,636 Gossiper.java (line 647) local heartbeat version 1231 greater than 1229 for /10.37.114.10
TRACE [GossipStage:1] 2011-12-13 00:58:49,636 Gossiper.java (line 647) local heartbeat version 4892 greater than 4890 for /10.29.60.12&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first line was the important one, something in the cluster was telling node #20 about node #23. The &lt;code&gt;version&lt;/code&gt; number referred to in the message is part of the heartbeat, a higher version number means the heartbeat is more recent. So it looked like node #20 was remembering something about node #23 from one of the machines that had been restarted, that would explain the higher local version number. But that didn&amp;#8217;t make sense, Cassandra can handle nodes restarting.&lt;/p&gt;

&lt;p&gt;This was not the right place to be looking. The log message could have been logged in response to information about node #23 that was sent from any other node. I needed to find proof of Gossip messages been received directly from node #23.&lt;/p&gt;

&lt;p&gt;Gossip uses a 3 step protocol, the initiating node sends an &lt;code&gt;Syn&lt;/code&gt; message, the target replies with an &lt;code&gt;Ack&lt;/code&gt; and the initiator replies with an &lt;code&gt;Ack2&lt;/code&gt;. So I went hunting information about &lt;code&gt;Syn&lt;/code&gt;&amp;#8217;s received from node #23 by enabling &lt;code&gt;TRACE&lt;/code&gt; logging on the verb handler:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;log4j.logger.org.apache.cassandra.gms.GossipDigestSynVerbHandler=TRACE&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This is what I found in the node #20 logs:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;TRACE [GossipStage:1] 2011-12-13 02:04:56,165 GossipDigestSynVerbHandler.java (line 46) Received a GossipDigestSynMessage from /10.29.60.10
TRACE [GossipStage:1] 2011-12-13 02:04:56,166 GossipDigestSynVerbHandler.java (line 76) Gossip syn digests are : /10.6.130.70:1323732220:9792 /10.37.114.10:1323736718:5242 /10.29.60.12:1323733099:8904 /10.29.60.10:1762556151:11964 /10.29.60.14:1323732392:9619 /10.37.114.8:1323731527:10494 
TRACE [GossipStage:1] 2011-12-13 02:04:56,166 GossipDigestSynVerbHandler.java (line 90) Sending a GossipDigestAckMessage to /10.29.60.10&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It got a &lt;code&gt;Syn&lt;/code&gt; from node #23 and replied with an &lt;code&gt;Ack&lt;/code&gt;. It was receiving and responding to messages from node #23 but it still thought it was &lt;code&gt;DOWN&lt;/code&gt;. Things were getting weirder, time to go check the code.&lt;/p&gt;

&lt;p&gt;Right after logging that message &lt;code&gt;GossipDigestSynVerbHandler&lt;/code&gt; &lt;a href='https://github.com/apache/cassandra/blob/3d5d9a40720034368feddd2f442cd6851ce6a233/src/java/org/apache/cassandra/gms/GossipDigestSynVerbHandler.java#L79'&gt;notifies the &lt;code&gt;FailureDetector&lt;/code&gt;&lt;/a&gt; that it had heard from the endpoint. The &lt;code&gt;FailureDetector&lt;/code&gt; is ultimately responsible for deciding if a node is UP or DOWN, the clue is in the name. &lt;code&gt;GossipDigestSynVerbHandler&lt;/code&gt; actually calls the main &lt;code&gt;Gossiper&lt;/code&gt; class which does a few things and then decides if it wants to report to the &lt;code&gt;FailureDetector&lt;/code&gt;. I just wanted to know if the call was made, so I enabled the logging again:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;log4j.logger.org.apache.cassandra.gms.GossipDigestSynVerbHandler=TRACE
log4j.logger.org.apache.cassandra.gms.FailureDetector=TRACE&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And found this in the logs on node #20:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;TRACE [GossipStage:1] 2011-12-13 02:14:37,033 GossipDigestSynVerbHandler.java (line 46) Received a GossipDigestSynMessage from /10.29.60.10
TRACE [GossipStage:1] 2011-12-13 02:14:37,033 GossipDigestSynVerbHandler.java (line 76) Gossip syn digests are : /10.29.60.10:1762556151:12552 /10.29.60.14:1323732392:10208 /10.37.114.8:1323731527:11082 /10.37.114.10:1323736718:5830 /10.6.130.70:1323732220:10379 /10.29.60.12:1323733099:9493 
TRACE [GossipStage:1] 2011-12-13 02:14:37,033 GossipDigestSynVerbHandler.java (line 90) Sending a GossipDigestAckMessage to /10.29.60.10
TRACE [GossipStage:1] 2011-12-13 02:14:37,037 GossipDigestSynVerbHandler.java (line 46) Received a GossipDigestSynMessage from /10.37.114.10
TRACE [GossipStage:1] 2011-12-13 02:14:37,037 GossipDigestSynVerbHandler.java (line 76) Gossip syn digests are : /10.37.114.10:1323736718:5831 /10.29.60.12:1323733099:9493 /10.6.130.70:1323732220:10379 /10.29.60.14:1323732392:10208 /10.37.114.8:1323731527:11082 /10.29.60.10:1762556151:526912 
TRACE [GossipStage:1] 2011-12-13 02:14:37,037 GossipDigestSynVerbHandler.java (line 90) Sending a GossipDigestAckMessage to /10.37.114.10
TRACE [GossipStage:1] 2011-12-13 02:14:37,038 FailureDetector.java (line 164) reporting /10.37.114.10
TRACE [GossipStage:1] 2011-12-13 02:14:37,038 FailureDetector.java (line 164) reporting /10.29.60.12
TRACE [GossipTasks:1] 2011-12-13 02:14:37,447 FailureDetector.java (line 185) PHI for /10.29.60.12 : 0.1740028340787402
TRACE [GossipTasks:1] 2011-12-13 02:14:37,447 FailureDetector.java (line 185) PHI for /10.6.130.70 : 0.17769446955863263
TRACE [GossipTasks:1] 2011-12-13 02:14:37,447 FailureDetector.java (line 185) PHI for /10.29.60.14 : 0.3766607052738764
TRACE [GossipTasks:1] 2011-12-13 02:14:37,447 FailureDetector.java (line 185) PHI for /10.37.114.10 : 0.16292431576785454
TRACE [GossipTasks:1] 2011-12-13 02:14:37,448 FailureDetector.java (line 185) PHI for /10.29.60.10 : 9511.26282656631
TRACE [GossipTasks:1] 2011-12-13 02:14:37,448 FailureDetector.java (line 189) notifying listeners that /10.29.60.10 is down
TRACE [GossipTasks:1] 2011-12-13 02:14:37,448 FailureDetector.java (line 190) intervals: 500.0 mean: 500.0
TRACE [GossipTasks:1] 2011-12-13 02:14:38,453 FailureDetector.java (line 185) PHI for /10.29.60.12 : 0.6019902450401402
TRACE [GossipTasks:1] 2011-12-13 02:14:38,453 FailureDetector.java (line 185) PHI for /10.6.130.70 : 0.5983077316197724
TRACE [GossipTasks:1] 2011-12-13 02:14:38,453 FailureDetector.java (line 185) PHI for /10.29.60.14 : 0.7697319392007641
TRACE [GossipTasks:1] 2011-12-13 02:14:38,453 FailureDetector.java (line 185) PHI for /10.37.114.10 : 0.5636623638423329
TRACE [GossipTasks:1] 2011-12-13 02:14:38,454 FailureDetector.java (line 185) PHI for /10.29.60.10 : 9512.137495652863
TRACE [GossipTasks:1] 2011-12-13 02:14:38,454 FailureDetector.java (line 189) notifying listeners that /10.29.60.10 is down
TRACE [GossipTasks:1] 2011-12-13 02:14:38,454 FailureDetector.java (line 190) intervals: 500.0 mean: 500.0
TRACE [GossipStage:1] 2011-12-13 02:14:38,454 FailureDetector.java (line 164) reporting /10.29.60.14
TRACE [GossipStage:1] 2011-12-13 02:14:38,454 FailureDetector.java (line 164) reporting /10.37.114.10
TRACE [GossipStage:1] 2011-12-13 02:14:38,454 FailureDetector.java (line 164) reporting /10.6.130.70
TRACE [GossipStage:1] 2011-12-13 02:14:38,454 FailureDetector.java (line 164) reporting /10.29.60.12
TRACE [GossipStage:1] 2011-12-13 02:14:39,030 GossipDigestSynVerbHandler.java (line 46) Received a GossipDigestSynMessage from /10.6.130.70&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first line shows a &lt;code&gt;Syn&lt;/code&gt; message received from node #23, later a response &lt;code&gt;Ack&lt;/code&gt; was sent and the next &lt;code&gt;Syn&lt;/code&gt; handled without reporting to the &lt;code&gt;FailureDetector&lt;/code&gt;. By contrast look at the next &lt;code&gt;Syn&lt;/code&gt; message from 10.37.114.10, it results in the &lt;code&gt;FailureDetector&lt;/code&gt; been told (via the &lt;code&gt;report()&lt;/code&gt; function) that we&amp;#8217;ve head from both 10.37.114.10 and 10.29.60.12.&lt;/p&gt;

&lt;p&gt;You can also see that the &lt;code&gt;FailureDetector&lt;/code&gt; has logged some information about node #23 (10.29.60.10). At the &lt;a href='https://github.com/apache/cassandra/blob/3d5d9a40720034368feddd2f442cd6851ce6a233/src/java/org/apache/cassandra/gms/Gossiper.java#L167'&gt;end of a Gossip round&lt;/a&gt; the node &lt;a href='https://github.com/apache/cassandra/blob/3d5d9a40720034368feddd2f442cd6851ce6a233/src/java/org/apache/cassandra/gms/Gossiper.java#L559'&gt;asks the &lt;code&gt;FailureDetector&lt;/code&gt;&lt;/a&gt; to &lt;a href='https://github.com/apache/cassandra/blob/3d5d9a40720034368feddd2f442cd6851ce6a233/src/java/org/apache/cassandra/gms/FailureDetector.java#L175'&gt;interpret&lt;/a&gt; the information it has about each other node. The &lt;code&gt;FailureDetector&lt;/code&gt; tracks the intervals between when it is told about a node, these values are interpreted every second to &lt;a href='https://issues.apache.org/jira/browse/CASSANDRA-2597'&gt;calculate the PHI value&lt;/a&gt; used to determine the nodes liveness. By default a node with a &lt;code&gt;PHI&lt;/code&gt; higher than 8 is considered &lt;code&gt;DOWN&lt;/code&gt;, as you can see node #23 was way &lt;code&gt;DOWN&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;At the time I did not pay any attention to the other log messages from the &lt;code&gt;FailureDetector&lt;/code&gt;, I just wanted to know why &lt;code&gt;report()&lt;/code&gt; was not been called.&lt;/p&gt;

&lt;h2 id='stop_and_think'&gt;Stop and think&lt;/h2&gt;

&lt;p&gt;This is what I knew:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All nodes in the cluster other than node #23 had a consistent view that #23 was down.&lt;/li&gt;

&lt;li&gt;New &amp;#8220;Application State&amp;#8221; about node #23 transmitted via Gossip was been ignored as the nodes thought it was from the past.&lt;/li&gt;

&lt;li&gt;Direct Gossip messages from node #23 were not been reported to the FailureDetector.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So I put out a call for help on the #Cassandra IRC channel and started poking around &lt;a href='https://github.com/apache/cassandra/blob/3d5d9a40720034368feddd2f442cd6851ce6a233/src/java/org/apache/cassandra/gms/Gossiper.java#L694'&gt;Gossiper.notifyFailureDetector()&lt;/a&gt; to see why it was not reporting messages from node #23.&lt;/p&gt;

&lt;p&gt;My assumption was that the high heartbeat version number node #20 had locally for node #23 was the reason for not reporting to the &lt;code&gt;FailureDetector&lt;/code&gt;, and &lt;a href='https://github.com/apache/cassandra/blob/3d5d9a40720034368feddd2f442cd6851ce6a233/src/java/org/apache/cassandra/gms/Gossiper.java#L724'&gt;line 724&lt;/a&gt; reinforce this view of the world. The &lt;code&gt;FailureDetector&lt;/code&gt; was not called because the local version number was higher than the remote one. Done. I saw what I wanted to see, the version number is the problem.&lt;/p&gt;

&lt;p&gt;But I still didn&amp;#8217;t know why this was happening, it was unlikely to be a bug in the Gossip code. Rolling restarts are standard practice, we do them all the time. About this time I got some help from Brandon Williams (driftx) from &lt;a href='http://www.datastax.com'&gt;Data Stax&lt;/a&gt; in the IRC room. Getting help is always good, and Gossip is Brandon&amp;#8217;s thing.&lt;/p&gt;

&lt;h2 id='generational_issues'&gt;Generational issues&lt;/h2&gt;

&lt;p&gt;Heartbeat versions are only part of how Gossip works out if new information it receives really is new, or is old information delivered out of order. Sitting above the version number is the Generation. When a major change occurs on a node, such as a restart or a changing tokens, the Generation number is increased&lt;/p&gt;

&lt;p&gt;When a node detects a higher Generation for another endpoint in a &lt;code&gt;Syn&lt;/code&gt; Gossip message it &lt;a href='https://github.com/apache/cassandra/blob/3d5d9a40720034368feddd2f442cd6851ce6a233/src/java/org/apache/cassandra/gms/Gossiper.java#L706'&gt;clears the&lt;/a&gt; the &lt;code&gt;FailureDetector&lt;/code&gt; information for endpoint. More importantly it takes action to get the most recent information about the node. First the node &lt;a href='https://github.com/apache/cassandra/blob/3d5d9a40720034368feddd2f442cd6851ce6a233/src/java/org/apache/cassandra/gms/Gossiper.java#L951'&gt;includes a request&lt;/a&gt; in the &lt;code&gt;Ack&lt;/code&gt; it returns to the gossiper for all the information it has about the endpoint where the generation has increased. Second when &lt;a href='https://github.com/apache/cassandra/blob/3d5d9a40720034368feddd2f442cd6851ce6a233/src/java/org/apache/cassandra/gms/Gossiper.java#L844'&gt;processing&lt;/a&gt; the digests in the returned &lt;code&gt;Ack2&lt;/code&gt; message the local endpoint state is &lt;a href='https://github.com/apache/cassandra/blob/3d5d9a40720034368feddd2f442cd6851ce6a233/src/java/org/apache/cassandra/gms/Gossiper.java#L783'&gt;replaced&lt;/a&gt; with the new state.&lt;/p&gt;

&lt;p&gt;When node #23 restarted it should have picked up a new Generation number and the other nodes in the cluster should have noticed the change and updated their local state. The simple way to test if this was happening was do another restart on node #23 and see what happened, yay for redundant clusters with no single point of failure.&lt;/p&gt;

&lt;p&gt;While that was going on Brandon noticed that the Generation number for node #23 was strange. The number is initialized to the current UTC time in seconds when the server is first started. After than it is increased by 1 for every major change, such as a restart. The number shown in the logs and other places for node #23 was &lt;code&gt;1762556151&lt;/code&gt;, which equates to &lt;strong&gt;Fri, 07 Nov 2025 22:55:51 GMT&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This server was from the future.&lt;/p&gt;

&lt;p&gt;Still is should have incremented the Generation on restart and maintained the forward trajectory of the space time continuum. The restart completed. Node #23 came back on line with the same generation number &lt;code&gt;1762556151&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It was a server from the future that was stuck in present which was really it&amp;#8217;s past!&lt;/p&gt;

&lt;h2 id='back_to_the_future'&gt;Back to the future&lt;/h2&gt;

&lt;p&gt;The current Generation for a node is stored in the &lt;code&gt;LocationInfo&lt;/code&gt; CF in the &lt;code&gt;System&lt;/code&gt; KS. Every time the Generation is bumped it&amp;#8217;s read from there, incremented and written back. So the next step was to see what was actually stored in the CF using the &lt;code&gt;cassandra-cli&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[default@system] list LocationInfo;                      
Using default limit of 100
-------------------
RowKey: Ring
=&amp;gt; (column=00, value=0a257208, timestamp=1323730075935)
=&amp;gt; (column=2aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa, value=0a068246, timestamp=1323730075923)
=&amp;gt; (column=40000000000000000000000000000000, value=0a1d3c0e, timestamp=1323730075897)
=&amp;gt; (column=55555555555555555555555555555555, value=0a25720a, timestamp=1323730075907)
=&amp;gt; (column=6aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa, value=0a1d3c0c, timestamp=1323730075944)
-------------------
RowKey: L
=&amp;gt; (column=436c75737465724e616d65, value=737069, timestamp=1320437246450000)
=&amp;gt; (column=47656e65726174696f6e, value=690e78f6, timestamp=1762556150811000)
=&amp;gt; (column=50617274696f6e6572, value=6f72672e6170616368652e63617373616e6472612e6468742e52616e646f6d506172746974696f6e6572, timestamp=1320437246450000)
=&amp;gt; (column=546f6b656e, value=15555555555555555555555555555555, timestamp=1762556150877)
-------------------
RowKey: Cookies
=&amp;gt; (column=5072652d312e302068696e747320707572676564, value=6f68207965732c2074686579207765726520707572676564, timestamp=1320437246470)
-------------------
RowKey: Bootstrap
=&amp;gt; (column=42, value=01, timestamp=1762556150877)

4 Rows Returned.
Elapsed time: 37 msec(s).&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;There were columns in the &amp;#8220;L&amp;#8221; row (the local node state) and the &amp;#8220;Cookies&amp;#8221; row that were written with a timestamp from the future (the ones that start with 17&amp;#8230;). Any attempt to overwrite these columns would fail unless it used a higher time stamp.&lt;/p&gt;

&lt;p&gt;Here&amp;#8217;s the &amp;#8220;L&amp;#8221; row with proper formatting for the column names:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[default@system] assume LocationInfo comparator as ascii; 
Assumption for column family &amp;#39;LocationInfo&amp;#39; added successfully.
[default@system] get LocationInfo[&amp;#39;L&amp;#39;];
=&amp;gt; (column=ClusterName, value=737069, timestamp=1320437246450000)
=&amp;gt; (column=Generation, value=690e78f6, timestamp=1762556150811000)
=&amp;gt; (column=Partioner, value=6f72672e6170616368652e63617373616e6472612e6468742e52616e646f6d506172746974696f6e6572, timestamp=1320437246450000)
=&amp;gt; (column=Token, value=15555555555555555555555555555555, timestamp=1762556150877)
Returned 4 results.
Elapsed time: 2 msec(s).&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Generation is hex &lt;code&gt;690e78f6&lt;/code&gt;, which is &lt;code&gt;1762556150&lt;/code&gt; binary. When node #23 started it &lt;a href='https://github.com/apache/cassandra/blob/3d5d9a40720034368feddd2f442cd6851ce6a233/src/java/org/apache/cassandra/db/SystemTable.java#L312'&gt;read this value&lt;/a&gt;, added 1 to it to get &lt;code&gt;1762556151&lt;/code&gt; and wrote it back. However the timestamp used to write it back was the current time, which was less than the one on disk so the write was ignored during reads. So the next time it started it got the same Generation number.&lt;/p&gt;

&lt;h2 id='back_to_the_present'&gt;Back to the present&lt;/h2&gt;

&lt;p&gt;The fix was to somehow get the Generation number to be increased. The approach I took was to delete the &lt;code&gt;Generation&lt;/code&gt; column with the high time stamp, and purge it&amp;#8217;s existence out of the SSTables. It wasn&amp;#8217;t enough to just delete the column, I needed to delete it with a higher time stamp and then see that Tombstone was purged so that future writes with current time stamps would work as expected.&lt;/p&gt;

&lt;p&gt;This approach was possible because the &lt;code&gt;GC_GRACE_SECONDS&lt;/code&gt; on the &lt;code&gt;LocationInfo&lt;/code&gt; CF is 0. So tombstones are committed to disk, but will be purged the first time compaction processes them. If &lt;code&gt;GC_GRACE_SECONDS&lt;/code&gt; was higher I would have to either wait for the time to pass or temporarily reduce it.&lt;/p&gt;

&lt;p&gt;The other possible options were a full cluster stop and restart passing the &lt;code&gt;Dcassandra.load_ring_state=false&lt;/code&gt; option. Or some form of monkeying with the token and/or IP address.&lt;/p&gt;

&lt;p&gt;Here are the steps I took to get there, I&amp;#8217;ve included a few additional ones to illustrate what was happening with the tombstone.&lt;/p&gt;

&lt;h3 id='step_1__snapshot'&gt;Step 1 - Snapshot&lt;/h3&gt;

&lt;p&gt;Always backup data before you start exploring it&amp;#8217;s limits, think of it as safe word. If it all get&amp;#8217;s a little to scary just say the safe word and we can stop.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cassandra23# bin/nodetool -h localhost snapshot system -t pre-generation-delete
Requested snapshot for: system 
Snapshot directory: pre-generation-delete&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I snap shot a lot when doing stuff like this so that I can roll back to any point.&lt;/p&gt;

&lt;h3 id='step_1__delete_the_generation_column'&gt;Step 1 - Delete the Generation column&lt;/h3&gt;

&lt;p&gt;To specify a timestamp for the delete I used the &lt;a href='http://pycassa.github.com/pycassa/assorted/pycassa_shell.html'&gt;pycassaShell&lt;/a&gt; bundled with the &lt;a href='https://github.com/pycassa/pycassa'&gt;pycassa&lt;/a&gt; client, this is an interactive Python session with some handy tools.&lt;/p&gt;

&lt;p&gt;Remember I wanted to delete the Generation column:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[default@system] get LocationInfo[&amp;#39;L&amp;#39;];
=&amp;gt; (column=ClusterName, value=737069, timestamp=1320437246450000)
=&amp;gt; (column=Generation, value=690e78f6, timestamp=1762556150811000)
=&amp;gt; (column=Partioner, value=6f72672e6170616368652e63617373616e6472612e6468742e52616e646f6d506172746974696f6e6572, timestamp=1320437246450000)
=&amp;gt; (column=Token, value=15555555555555555555555555555555, timestamp=1323805490467000)
Returned 4 results.
Elapsed time: 2 msec(s).&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So I needed to specify a timestamp higher than 1762556150811000:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cassandra23# ./pycassaShell -H 10.29.60.10 -k system
[I]: IPython not found, falling back to default interpreter.
----------------------------------
Cassandra Interactive Python Shell
----------------------------------
Keyspace: system
Host: 10.29.60.10:9160

Available ColumnFamily instances:
 * HINTSCOLUMNFAMILY          ( HintsColumnFamily )
 * VERSIONS                   ( Versions )
 * INDEXINFO                  ( IndexInfo )
 * LOCATIONINFO               ( LocationInfo )
 * MIGRATIONS                 ( Migrations )
 * NODEIDINFO                 ( NodeIdInfo )
 * SCHEMA                     ( Schema )

Schema definition tools and cluster information are available through SYSTEM_MANAGER.
&amp;gt;&amp;gt;&amp;gt; LOCATIONINFO.remove(&amp;quot;L&amp;quot;, columns=[&amp;quot;Generation&amp;quot;], timestamp=1762556150811001)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The Generation column was no longer returned in query results:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[default@system] get LocationInfo[&amp;#39;L&amp;#39;];
=&amp;gt; (column=ClusterName, value=737069, timestamp=1320437246450000)
=&amp;gt; (column=Partioner, value=6f72672e6170616368652e63617373616e6472612e6468742e52616e646f6d506172746974696f6e6572, timestamp=1320437246450000)
=&amp;gt; (column=Token, value=15555555555555555555555555555555, timestamp=1323805490467000)
Returned 4 results.
Elapsed time: 2 msec(s).&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id='step_3__snapshot'&gt;Step 3 - Snapshot&lt;/h3&gt;

&lt;p&gt;Another snapshot for good luck. This one will also flush the &lt;code&gt;LocationInfo&lt;/code&gt; CF so all of our changes are on disk.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cassandra23# bin/nodetool -h localhost snapshot system -t post-generation-delete
Requested snapshot for: system 
Snapshot directory: post-generation-delete&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id='step_4__check_the_delete'&gt;Step 4 - Check the delete&lt;/h3&gt;

&lt;p&gt;This was not necessary, I knew what was going to happen later but I like to confirm that my expectations are correct and it helps illustrate how tombstones work. What I expected to see was two SSTables for &lt;code&gt;LocationInfo&lt;/code&gt;, the first would have all the columns we saw originally and the second would have the tombstone for the &lt;code&gt;Generation&lt;/code&gt; column delete.&lt;/p&gt;

&lt;p&gt;First check which SSTables are on disk:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cassandra23# ls -lah data/system
total 9428
...
-rw-r--r--  2 root  wheel   617B Dec 13 20:56 LocationInfo-hb-52-Data.db
-rw-r--r--  2 root  wheel    68B Dec 13 20:56 LocationInfo-hb-52-Digest.sha1
-rw-r--r--  2 root  wheel   1.9K Dec 13 20:56 LocationInfo-hb-52-Filter.db
-rw-r--r--  2 root  wheel    61B Dec 13 20:56 LocationInfo-hb-52-Index.db
-rw-r--r--  2 root  wheel   4.2K Dec 13 20:56 LocationInfo-hb-52-Statistics.db
-rw-r--r--  2 root  wheel    80B Dec 13 21:18 LocationInfo-hb-54-Data.db
-rw-r--r--  2 root  wheel    68B Dec 13 21:18 LocationInfo-hb-54-Digest.sha1
-rw-r--r--  2 root  wheel    16B Dec 13 21:18 LocationInfo-hb-54-Filter.db
-rw-r--r--  2 root  wheel    11B Dec 13 21:18 LocationInfo-hb-54-Index.db
-rw-r--r--  2 root  wheel   4.2K Dec 13 21:18 LocationInfo-hb-54-Statistics.db&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Excellent, there were two. So I used &lt;code&gt;sstable2json&lt;/code&gt; to take a look in the oldest one:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cassandra23# bin/sstable2json data/system/LocationInfo-hb-52-Data.db
{&amp;#39;426f6f747374726170&amp;#39;: [[&amp;#39;42&amp;#39;, &amp;#39;01&amp;#39;, 1323805212929000]],
 &amp;#39;436f6f6b696573&amp;#39;: [[&amp;#39;5072652d312e302068696e747320707572676564&amp;#39;,
                     &amp;#39;6f68207965732c2074686579207765726520707572676564&amp;#39;,
                     1320437246470]],
 &amp;#39;4c&amp;#39;: [[&amp;#39;436c75737465724e616d65&amp;#39;, &amp;#39;737069&amp;#39;, 1320437246450000],
        [&amp;#39;47656e65726174696f6e&amp;#39;, &amp;#39;690e78f6&amp;#39;, 1762556150811000],
        [&amp;#39;50617274696f6e6572&amp;#39;,
         &amp;#39;6f72672e6170616368652e63617373616e6472612e6468742e52616e646f6d506172746974696f6e6572&amp;#39;,
         1320437246450000],
        [&amp;#39;546f6b656e&amp;#39;, &amp;#39;15555555555555555555555555555555&amp;#39;, 1323805490467000]],
 &amp;#39;52696e67&amp;#39;: [[&amp;#39;00&amp;#39;, &amp;#39;0a257208&amp;#39;, 1323730075935],
              [&amp;#39;2aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa&amp;#39;, &amp;#39;0a068246&amp;#39;, 1323730075923],
              [&amp;#39;40000000000000000000000000000000&amp;#39;, &amp;#39;0a1d3c0e&amp;#39;, 1323730075897],
              [&amp;#39;55555555555555555555555555555555&amp;#39;, &amp;#39;0a25720a&amp;#39;, 1323730075907],
              [&amp;#39;6aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa&amp;#39;, &amp;#39;0a1d3c0c&amp;#39;, 1323730075944]]}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I&amp;#8217;ve reformatted the output a little. The row key &lt;code&gt;4c&lt;/code&gt; is the &amp;#8220;L&amp;#8221; row from above, and the &amp;#8220;Generation&amp;#8221; column is the second column in the list with &lt;code&gt;47656e65726174696f6e&lt;/code&gt; as the column name (it&amp;#8217;s just the hex encoded ASCII values). The column value and timestamp are the next two values.&lt;/p&gt;

&lt;p&gt;Now the newer SSTable:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cassandra23# bin/sstable2json data/system/LocationInfo-hb-54-Data.db
{
&amp;quot;4c&amp;quot;: [[&amp;quot;47656e65726174696f6e&amp;quot;,&amp;quot;4ee7c086&amp;quot;,1762556150811001,&amp;quot;d&amp;quot;]]
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;There was a single row with the &lt;code&gt;4c&lt;/code&gt; / &amp;#8220;L&amp;#8221; row key. That had a single column with &amp;#8220;Generation&amp;#8221; as the name, the local time of the deletion as the value, a sensible timestamp and &amp;#8220;d&amp;#8221; as a flag to indicate it&amp;#8217;s a &lt;code&gt;DeletedColumn&lt;/code&gt;.&lt;/p&gt;

&lt;h3 id='step_5__compact_away_the_tombstone'&gt;Step 5 - Compact away the Tombstone.&lt;/h3&gt;

&lt;p&gt;It was time to purge the Tombstone by running a manual compaction on the &lt;code&gt;LocationInfo&lt;/code&gt; CF.&lt;/p&gt;

&lt;p&gt;cassandra23# bin/nodetool -h localhost compact system LocationInfo&lt;/p&gt;

&lt;h3 id='step_6__confirm_the_purge'&gt;Step 6 - Confirm the purge&lt;/h3&gt;

&lt;p&gt;I now expected to see a single &lt;code&gt;LocationInfo&lt;/code&gt; SSTable:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cassandra23# ls -lah data/system/
...
-rw-r--r--  1 root  wheel    68B Dec 13 21:24 LocationInfo-hb-55-Digest.sha1
-rw-r--r--  1 root  wheel   976B Dec 13 21:24 LocationInfo-hb-55-Filter.db
-rw-r--r--  1 root  wheel    61B Dec 13 21:24 LocationInfo-hb-55-Index.db
-rw-r--r--  1 root  wheel   4.2K Dec 13 21:24 LocationInfo-hb-55-Statistics.db&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In the SSTable the &lt;code&gt;4c&lt;/code&gt; row should be present but the &lt;code&gt;47656e65726174696f6e&lt;/code&gt; &amp;#8220;Generation&amp;#8221; column should not:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cassandra23# bin/sstable2json data/system/LocationInfo-hb-55-Data.db
{&amp;#39;426f6f747374726170&amp;#39;: [[&amp;#39;42&amp;#39;, &amp;#39;01&amp;#39;, 1323805212929000]],
 &amp;#39;436f6f6b696573&amp;#39;: [[&amp;#39;5072652d312e302068696e747320707572676564&amp;#39;,
                     &amp;#39;6f68207965732c2074686579207765726520707572676564&amp;#39;,
                     1320437246470]],
 &amp;#39;4c&amp;#39;: [[&amp;#39;436c75737465724e616d65&amp;#39;, &amp;#39;737069&amp;#39;, 1320437246450000],
        [&amp;#39;50617274696f6e6572&amp;#39;,
         &amp;#39;6f72672e6170616368652e63617373616e6472612e6468742e52616e646f6d506172746974696f6e6572&amp;#39;,
         1320437246450000],
        [&amp;#39;546f6b656e&amp;#39;, &amp;#39;15555555555555555555555555555555&amp;#39;, 1323805490467000]],
 &amp;#39;52696e67&amp;#39;: [[&amp;#39;00&amp;#39;, &amp;#39;0a257208&amp;#39;, 1323730075935],
              [&amp;#39;2aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa&amp;#39;, &amp;#39;0a068246&amp;#39;, 1323730075923],
              [&amp;#39;40000000000000000000000000000000&amp;#39;, &amp;#39;0a1d3c0e&amp;#39;, 1323730075897],
              [&amp;#39;55555555555555555555555555555555&amp;#39;, &amp;#39;0a25720a&amp;#39;, 1323730075907],
              [&amp;#39;6aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa&amp;#39;, &amp;#39;0a1d3c0c&amp;#39;, 1323730075944]]}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And just like that node #23 has no record of ever having a value for the &amp;#8220;Generation&amp;#8221; column.&lt;/p&gt;

&lt;h3 id='step_7__rewrite_the_generation_number'&gt;Step 7 - Re-write the Generation number&lt;/h3&gt;

&lt;p&gt;The next time I restarted node #23 I needed it to get a Generation number greater than the one all the other nodes have for it. And for all restarts after that it should continue to increase the Generation.&lt;/p&gt;

&lt;p&gt;So I set the Generation column to &lt;code&gt;1762556151&lt;/code&gt; decimal or &lt;code&gt;690e78f7&lt;/code&gt; hex using the &lt;code&gt;cassandra-cli&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[default@system] set LocationInfo[&amp;#39;L&amp;#39;][&amp;#39;Generation&amp;#39;] = bytes(&amp;#39;690e78f7&amp;#39;);
Value inserted.
Elapsed time: 1 msec(s).&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And then just to check:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[default@system] get LocationInfo[&amp;#39;L&amp;#39;];                                  
=&amp;gt; (column=ClusterName, value=737069, timestamp=1320437246450000)
=&amp;gt; (column=Generation, value=690e78f7, timestamp=1323811677675000)
=&amp;gt; (column=Partioner, value=6f72672e6170616368652e63617373616e6472612e6468742e52616e646f6d506172746974696f6e6572, timestamp=1320437246450000)
=&amp;gt; (column=Token, value=15555555555555555555555555555555, timestamp=1323805490467000)
Returned 4 results.
Elapsed time: 1 msec(s).
[default@system] &lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Looks good, got the next Generation number in there and the timestamp is sane.&lt;/p&gt;

&lt;h3 id='step_8__confirm_the_generation_rewrite'&gt;Step 8 - Confirm the Generation re-write&lt;/h3&gt;

&lt;p&gt;To get the change on disk I flushed the CF:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cassandra23# bin/nodetool -h localhost flush system LocationInfo&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I now expected to see two SSTables for &lt;code&gt;LocationInfo&lt;/code&gt; on disk:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cassandra23# ls -lah data/system/
total 9428
-rw-r--r--  1 root  wheel   588B Dec 13 21:24 LocationInfo-hb-55-Data.db
-rw-r--r--  1 root  wheel    68B Dec 13 21:24 LocationInfo-hb-55-Digest.sha1
-rw-r--r--  1 root  wheel   976B Dec 13 21:24 LocationInfo-hb-55-Filter.db
-rw-r--r--  1 root  wheel    61B Dec 13 21:24 LocationInfo-hb-55-Index.db
-rw-r--r--  1 root  wheel   4.2K Dec 13 21:24 LocationInfo-hb-55-Statistics.db
-rw-r--r--  1 root  wheel    80B Dec 13 21:32 LocationInfo-hb-57-Data.db
-rw-r--r--  1 root  wheel    68B Dec 13 21:32 LocationInfo-hb-57-Digest.sha1
-rw-r--r--  1 root  wheel    16B Dec 13 21:32 LocationInfo-hb-57-Filter.db
-rw-r--r--  1 root  wheel    11B Dec 13 21:32 LocationInfo-hb-57-Index.db
-rw-r--r--  1 root  wheel   4.2K Dec 13 21:32 LocationInfo-hb-57-Statistics.db&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And if we poke into the new one:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cassandra23# bin/sstable2json data/system/LocationInfo-hb-57-Data.db
{
&amp;quot;4c&amp;quot;: [[&amp;quot;47656e65726174696f6e&amp;quot;,&amp;quot;690e78f7&amp;quot;,1323811677675000]]
}&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id='step_9__restart_node_23'&gt;Step 9 - Restart node #23&lt;/h3&gt;

&lt;p&gt;Before the restart node #23 was using 1762556151 as the Generation. This was because the previous on disk value was 1762556150, and when it started it read this and added 1. I had now set the on disk value to be 1762556151 so after the restart I expected the Generation to be 1762556152:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cassandra23# bin/nodetool -h localhost info
Token            : 28356863910078205288614550619314017621
Gossip active    : true
Load             : 275.44 GB
Generation No    : 1762556152
Uptime (seconds) : 495
Heap Memory (MB) : 959.72 / 8032.00
Data Center      : DC1
Rack             : RAC_unknown  
Exceptions       : 0&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Aces. And finally on disk the value should be 1762556152 decimal or 690E78F8 hex:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[default@system] get LocationInfo[&amp;#39;L&amp;#39;];            
=&amp;gt; (column=ClusterName, value=737069, timestamp=1320437246450000)
=&amp;gt; (column=Generation, value=690e78f8, timestamp=1323812553690000)
=&amp;gt; (column=Partioner, value=6f72672e6170616368652e63617373616e6472612e6468742e52616e646f6d506172746974696f6e6572, timestamp=1320437246450000)
=&amp;gt; (column=Token, value=15555555555555555555555555555555, timestamp=1323805490467000)
Returned 4 results.
Elapsed time: 22 msec(s).&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;After the node came back online it started using the new Generation number and the other nodes saw it as UP.&lt;/p&gt;

&lt;h2 id='wrapping_up'&gt;Wrapping up&lt;/h2&gt;

&lt;p&gt;I think in theory there is a danger that if we removed node #23 from the ring and added another node with the same IP in less than 4 days it will have problems with it&amp;#8217;s Generation not been seen as new. But thats a pretty small danger.&lt;/p&gt;

&lt;p&gt;The important thing in all this was that the site stayed up. It&amp;#8217;s one of the nicest things about working on Cassandra, you can fix problems while it&amp;#8217;s working.&lt;/p&gt;

&lt;p&gt;Oh and I fixed the other columns with wacky time stamps and am about to write a patch to log when things like this happen.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Reversed Comparators</title>
   <link href="http://thelastpickle.com/2011/10/03/Reverse-Comparators"/>
   <updated>2011-10-03T00:00:00-07:00</updated>
   <id>http://thelastpickle.com/2011/10/03/Reverse-Comparators</id>
   <content type="html">&lt;p&gt;Cassandra 0.8.1 &lt;a href='https://issues.apache.org/jira/browse/CASSANDRA-2355'&gt;added&lt;/a&gt; support for Composite Types and Reversed Types, through a handy little type composition language. Hopefully I&amp;#8217;ll get to show some of the things you can do with Composite Types later, for now the best resource I know is Ed Anuff&amp;#8217;s presentation on &lt;a href='http://www.slideshare.net/edanuff/indexing-in-cassandra'&gt;Cassandra Indexing Techniques&lt;/a&gt; at Cassandra SF 2011.&lt;/p&gt;

&lt;p&gt;The Reversed Comparator wraps another Comparator (.e.g. &lt;code&gt;AsciiType&lt;/code&gt;) and well, there is easy way to say this, &lt;a href='https://github.com/apache/cassandra/blob/cassandra-0.8.6/src/java/org/apache/cassandra/db/marshal/ReversedType.java#L60'&gt;reverses&lt;/a&gt; the order of the comparisons so that columns are stored in &lt;strong&gt;descending&lt;/strong&gt; order. If you are using something like a time stamp for a column name, the most recent value will be the first column on the row rather than the last. In a Relational DB with tree based indexes there is not much difference between getting the first or last row in an index, but Cassandra does not use tree indexes.&lt;/p&gt;

&lt;p&gt;Recall from my post on &lt;a href='http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/'&gt;Cassandra Query Plans&lt;/a&gt; that once rows get to a certain size they include an index of the columns. And that the entire index must be read whenever any part of the index needs to be used, which is the case when using a Slice Range that specifies start or reversed. So the fastest slice query to run against a row was one that retrieved the first X columns in a row by only specifying a column count.&lt;/p&gt;

&lt;p&gt;To get an idea of how different the performance is I took the test script from &lt;a href='http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/'&gt;Cassandra Query Plans&lt;/a&gt; and modified it to test two query patterns using both (regular) Ascending and (reversed) Descending comparators.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Getting the 10 most recent columns using a start column and a column count.&lt;/li&gt;

&lt;li&gt;Getting the 10 most recent columns using only a column count.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the second case the query would specify &lt;code&gt;reversed&lt;/code&gt; when using (regular) Ascending ordering so Cassandra would get the last 10 columns. However when using (reversed) Descending columns it would only specify the column count.&lt;/p&gt;

&lt;h2 id='in_motion__the_setup'&gt;In Motion - The Setup&lt;/h2&gt;

&lt;p&gt;The &lt;a href='https://gist.github.com/1258711'&gt;reverse_query_profile.py gist&lt;/a&gt; contains the Python code I used to setup the data and profile queries using the pycassa library. All tests were done locally on a 2011 Mac Book Pro with 8GB of RAM and spinning disk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; you will need to edit the file and set the CASSANDRA_PATH at the top of the file.&lt;/p&gt;

&lt;p&gt;The tests used columns with a 10 byte name and 25 bytes of data. Together with the &lt;a href='https://github.com/apache/cassandra/blob/cassandra-0.8.6/src/java/org/apache/cassandra/db/Column.java#L116'&gt;15 bytes of column meta data&lt;/a&gt; this will create columns that use 50 bytes on disk. For the standard 64KB column page this should give 1,310 columns per column page.&lt;/p&gt;

&lt;p&gt;To test the latency of various queries I used the (recent) &amp;#8216;ReadLatency: &amp;#8217; metric available via &lt;code&gt;nodetool cfstats&lt;/code&gt;. This metric wraps the &lt;a href='https://github.com/apache/cassandra/blob/cassandra-0.8.6/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1295'&gt;local processing&lt;/a&gt; of a query for a single CF. The value presented by &lt;code&gt;nodetool&lt;/code&gt; is the most recent value.&lt;/p&gt;

&lt;p&gt;All tests were run 10 times and the min, max 80th and 95th percentiles of latency were recorded.&lt;/p&gt;

&lt;p&gt;The Reversed Comparator is specified by using a string to specify the comparator and appending &lt;code&gt;(reversed=true)&lt;/code&gt;. I created the schema using a clean 0.8.6 install and the following &lt;code&gt;bin/cassandra-cli&lt;/code&gt; script:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;create keyspace reverse
    with strategy_options=[{replication_factor:1}]
    and placement_strategy = &amp;#39;org.apache.cassandra.locator.SimpleStrategy&amp;#39;;

use reverse;

create column family NoCache_Ascending
    with comparator = AsciiType
    and default_validation_class = AsciiType
    and key_validation_class = AsciiType
    and keys_cached = 0
    and rows_cached = 0;

create column family NoCache_Descending
    with comparator = &amp;#39;AsciiType(reversed=true)&amp;#39;
    and default_validation_class = AsciiType
    and key_validation_class = AsciiType
    and keys_cached = 0
    and rows_cached = 0;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; You can use the Reversed Comparator anywhere, but it only makes sense to do so for the &lt;code&gt;comparator&lt;/code&gt;, &lt;code&gt;subcomparator&lt;/code&gt; and any secondary indexes.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;reverse_query_profile&lt;/code&gt; module inserts the following rows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&amp;#8220;small-row&amp;#8221; with 100 columns, 5K of data&lt;/li&gt;

&lt;li&gt;&amp;#8220;no-col-index&amp;#8221; with 1200 columns, 60K of data&lt;/li&gt;

&lt;li&gt;&amp;#8220;five-thousand&amp;#8221; with 5000 columns, 244K of data&lt;/li&gt;

&lt;li&gt;&amp;#8220;ten-thousand&amp;#8221; with 10000 columns, 488K of data&lt;/li&gt;

&lt;li&gt;&amp;#8220;hundred-thousand&amp;#8221; with 100000 columns, 4.8M of data&lt;/li&gt;

&lt;li&gt;&amp;#8220;one-million&amp;#8221; with 1000000 columns, 48M of data&lt;/li&gt;

&lt;li&gt;&amp;#8220;ten-million&amp;#8221; with 10000000 columns, 480M of data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Insert the data by executing:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ python reverse_query_profile insert_rows&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To keep things simple flush and compact the database so we only have one SSTable:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ bin/nodetool -h localhost flush
$ bin/nodetool -h localhost compact query&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;At the end of this my &lt;code&gt;/var/lib/cassandra/data/query&lt;/code&gt; directory contained the following SSTables (ignoring the &lt;code&gt;-Compacted&lt;/code&gt; SSTables):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-rw-r--r--   1 aaron  wheel   536M  3 Oct 21:49 NoCache_Ascending-g-184-Data.db
-rw-r--r--   1 aaron  wheel   2.4K  3 Oct 21:49 NoCache_Ascending-g-184-Filter.db
-rw-r--r--   1 aaron  wheel   154B  3 Oct 21:49 NoCache_Ascending-g-184-Index.db
-rw-r--r--   1 aaron  wheel   4.2K  3 Oct 21:49 NoCache_Ascending-g-184-Statistics.db
-rw-r--r--   1 aaron  wheel   536M  3 Oct 21:49 NoCache_Descending-g-149-Data.db
-rw-r--r--   1 aaron  wheel   2.4K  3 Oct 21:49 NoCache_Descending-g-149-Filter.db
-rw-r--r--   1 aaron  wheel   154B  3 Oct 21:49 NoCache_Descending-g-149-Index.db
-rw-r--r--   1 aaron  wheel   4.2K  3 Oct 21:49 NoCache_Descending-g-149-Statistics.db&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;reverse_query_profile&lt;/code&gt; contains a warm up function that will slice through all the columns in all the rows, warm up the database by executing:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ python reverse_query_profile warm_up&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id='in_motion__start_column'&gt;In Motion - Start Column&lt;/h2&gt;

&lt;p&gt;I was not expecting much difference between the Ascending and Descending CF&amp;#8217;s for this test as both use a start column. For the Ascending CF the script specifies the name of the tenth last column as the column name. For the Descending CF it uses the name of the last column because it will be the first column on the row and the others will follow it.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ./reverse_query_profile.py recent_100_start
Latency is min, 80th percentile, 95th percentile and max.
100 most recent columns, using start

Testing CF: NoCache_Ascending
Row            small-row latency in ms       0.14     0.1512     0.1542      0.157
Row         no-col-index latency in ms      0.361     0.5082     0.6311      0.751
Row        five-thousand latency in ms      0.337      0.368     0.5269      0.721
Row         ten-thousand latency in ms      0.307       0.32     0.3214      0.323
Row     hundred-thousand latency in ms      0.205     0.2362     0.2506      0.266
Row          one-million latency in ms      0.429      0.476     0.5234       0.58
Row          ten-million latency in ms      1.247      1.432       4.71       8.71
Testing CF: NoCache_Descending
Row            small-row latency in ms      0.138     0.1526     0.1697       0.19
Row         no-col-index latency in ms       0.32     0.3532     0.3572       0.36
Row        five-thousand latency in ms      0.329      0.374     0.3863        0.4
Row         ten-thousand latency in ms      0.351     0.3868     0.3875      0.388
Row     hundred-thousand latency in ms      0.357     0.4094     0.4955        0.6
Row          one-million latency in ms      0.451      0.496     0.5069      0.519
Row          ten-million latency in ms      1.312      1.401      1.415      1.424&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Looks like there was a couple of outliers that mucked up the 95th percentile, but if we look at the 80th percent there is not much difference between the two CF&amp;#8217;s.&lt;/p&gt;

&lt;h2 id='in_motion__count_only'&gt;In Motion - Count Only&lt;/h2&gt;

&lt;p&gt;This is where the differences are. When a query does not specify a start column (and does not specify reversed) the server can just start reading columns from the start without having to worry about finding the right place to start. This is exactly what we can do for the Descending CF.&lt;/p&gt;

&lt;p&gt;For the regular Ascending CF we need to specify reversed, so the server must read the row index and work out which column is column count from the end of the row.&lt;/p&gt;

&lt;p&gt;There is no comparison really.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ./reverse_query_profile.py recent_100_count
Latency is min, 80th percentile, 95th percentile and max.
100 most recent columns, using no start

Testing CF: NoCache_Ascending
Row            small-row latency in ms      0.119     0.1496     0.1577      0.167
Row         no-col-index latency in ms      0.313     0.3504     0.3879      0.433
Row        five-thousand latency in ms      0.315      0.328     0.3361      0.346
Row         ten-thousand latency in ms      0.264      0.291     0.4032      0.539
Row     hundred-thousand latency in ms      0.181     0.2038     0.2049      0.206
Row          one-million latency in ms      0.397     0.4308     0.4315      0.432
Row          ten-million latency in ms       1.29      1.421      4.538      8.341
Testing CF: NoCache_Descending
Row            small-row latency in ms      0.133     0.1508     0.1514      0.152
Row         no-col-index latency in ms      0.139      0.157     0.1633      0.171
Row        five-thousand latency in ms      0.136     0.1506     0.1532      0.156
Row         ten-thousand latency in ms      0.142     0.1558     0.1565      0.157
Row     hundred-thousand latency in ms      0.139     0.1498     0.1518      0.154
Row          one-million latency in ms      0.139     0.1592     0.1661      0.171
Row          ten-million latency in ms      0.134     0.1602     0.1678      0.175&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For this type of query the Reversed Comparator provided consistent query performance no matter how many columns the row container. From now on if you are modeling time series data, such as in our old friend the Twitter clone, you should probably use the Reversed Comparator.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Token Range Ownership</title>
   <link href="http://thelastpickle.com/2011/09/30/Token-Range-Ownership"/>
   <updated>2011-09-30T00:00:00-07:00</updated>
   <id>http://thelastpickle.com/2011/09/30/Token-Range-Ownership</id>
   <content type="html">&lt;p&gt;Recently, like 2 hours ago, I was planning some work to rebalance a Cassandra cluster and I wanted to see how the steps involved would effect the range ownership of the nodes. So I replicated the logic from &lt;a href='https://github.com/apache/cassandra/blob/cassandra-0.8.6/src/java/org/apache/cassandra/dht/RandomPartitioner.java#L152'&gt;RandomPartitioner.describeOwnership()&lt;/a&gt; in a handy python script.&lt;/p&gt;

&lt;p&gt;The script is available at &lt;a href='https://gist.github.com/1250496'&gt;git hub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In my case I had an unbalanced cluster that looked like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Address         DC          Rack        Status State   Load            Owns    Token                                       
                                                                              154009024815050802110273337963779530663     
127.0.0.3    datacenter1 rack1       Up     Normal  430.51 GB       69.95%  	102889564695022956386161396156024583904     
127.0.0.2     datacenter1 rack1       Up     Normal  430.26 GB       22.81%   	141704132449535340642001248672108470009     
127.0.0.1    datacenter1 rack1       Up     Normal  725.61 GB       7.23%   	154009024815050802110273337963779530663&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So I wanted to change the nodes to use the balanced tokens:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;0
56713727820156410577229101238628035242
113427455640312821154458202477256070484&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As a side affect I also want to make the token order match the node IP / host name order, which makes things a little more confusing than they need to be. So each node, together with it&amp;#8217;s old and new token looks like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;127.0.0.1
    154009024815050802110273337963779530663
    0
127.0.0.2
    141704132449535340642001248672108470009
    56713727820156410577229101238628035242
127.0.0.3
    102889564695022956386161396156024583904
    113427455640312821154458202477256070484&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To get an idea of what the cluster balance would look like as I performed the &lt;code&gt;nodetool move&lt;/code&gt; operations I passed the initial tokens as a space separated list to the script and used the &lt;code&gt;--interactive&lt;/code&gt; arg so I could easily enter the changes I wanted to make.&lt;/p&gt;

&lt;p&gt;Here are the initial tokens, in the order of the host names.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;./token_range.py --interactive 154009024815050802110273337963779530663 141704132449535340642001248672108470009 102889564695022956386161396156024583904
154009024815050802110273337963779530663 141704132449535340642001248672108470009 102889564695022956386161396156024583904
69.95% - 102889564695022956386161396156024583904
22.81% - 141704132449535340642001248672108470009
 7.23% - 154009024815050802110273337963779530663&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;First change is to wrap the token for &lt;code&gt;0.1&lt;/code&gt; around to 0:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Next tokens: 0 141704132449535340642001248672108470009 102889564695022956386161396156024583904
0 141704132449535340642001248672108470009 102889564695022956386161396156024583904
16.71% - 0
60.47% - 102889564695022956386161396156024583904
22.81% - 141704132449535340642001248672108470009&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Next center the middle token:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Next tokens: 0 56713727820156410577229101238628035242 102889564695022956386161396156024583904
0 56713727820156410577229101238628035242 102889564695022956386161396156024583904
39.53% - 0
33.33% - 56713727820156410577229101238628035242
27.14% - 102889564695022956386161396156024583904&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Finally sort out the little piggy at the end:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Next tokens: 0 56713727820156410577229101238628035242 113427455640312821154458202477256070484
0 56713727820156410577229101238628035242 113427455640312821154458202477256070484
33.33% - 0
33.33% - 56713727820156410577229101238628035242
33.33% - 113427455640312821154458202477256070484&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The only testing this had was that it worked for me and matched what a live cluster was saying.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Cassandra Query Plans</title>
   <link href="http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans"/>
   <updated>2011-07-04T00:00:00-07:00</updated>
   <id>http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans</id>
   <content type="html">&lt;p&gt;I&amp;#8217;ve had a few conversations about query performance on wide rows recently, so it seemed about time to dig into how the different slice queries work.&lt;/p&gt;

&lt;h2 id='lotsogarbage'&gt;Lots-o-garbage&lt;/h2&gt;

&lt;p&gt;Clearly lots of garbage in an SSTable in the form of deleted, expired, overwritten or tombstoned columns will have an impact on reads. As they increase the amount of data that will be read, examined and then discarded.&lt;/p&gt;

&lt;p&gt;Like any database one part of query evaluation involves reading data from disk and another involves filtering the candidate data into the final result set. The filtering part of a Cassandra read query involves collating the various row fragments, reconciling multiple columns with the same name, and handling tombstones and expiring columns.&lt;/p&gt;

&lt;p&gt;For now I&amp;#8217;m going to focus on the reading part and I&amp;#8217;m going to restrict it further by only considering cases where all the data is in a single SSTable and there are no garbage columns.&lt;/p&gt;

&lt;h2 id='whats_in_a_row'&gt;What&amp;#8217;s in a row?&lt;/h2&gt;

&lt;p&gt;Aside from the columns there are a few other things stored in a row in an SSTable that are of use when performing read queries.&lt;/p&gt;

&lt;p&gt;A row will be written to an SSTable with zero columns if it has a row level Tombstone. The Tombstone will be used when reconciling the columns for a query as it may delete columns in the row that have a lower timestamp.&lt;/p&gt;

&lt;p&gt;Each row also contains a &lt;a href='http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html'&gt;Bloom Filter&lt;/a&gt; of the column names in the row. This can be used to test if the row &lt;em&gt;may&lt;/em&gt; contain any of the columns a query is looking for. Bloom Filters may give false positives, just like the SSTable bloom filter for row keys, so the only way to know if the column really exists is to read it.&lt;/p&gt;

&lt;p&gt;Finally if the serialised size of the columns in the row is greater than the &lt;code&gt;column_index_size_in_kb&lt;/code&gt; config setting (default is 64) a column index is also written. The index is a sampling of first and last column names in each &lt;code&gt;column_index_size_in_kb&lt;/code&gt; (or more) chunk of column data, along with the offset and width of the chunk. Queries can de-serialise the index and work out how may pages of columns to skip before dropping down to the data file to scan for matching columns.&lt;/p&gt;

&lt;h2 id='slice_by_names'&gt;Slice By Names&lt;/h2&gt;

&lt;p&gt;A slice query that specifies a list of columns names in the &lt;a href='http://wiki.apache.org/cassandra/API'&gt;SlicePredicate&lt;/a&gt; is turned into a &lt;a href='https://github.com/apache/cassandra/blob/cassandra-0.8.1/src/java/org/apache/cassandra/db/SliceByNamesReadCommand.java'&gt;SliceByNamesReadCommand&lt;/a&gt; internally. This is the command you will see logged by the server when &lt;code&gt;DEBUG&lt;/code&gt; level logging is enabled. It uses the &lt;a href='https://github.com/apache/cassandra/blob/cassandra-0.8.1/src/java/org/apache/cassandra/db/columniterator/SSTableNamesIterator.java'&gt;SSTableNamesIterator&lt;/a&gt; to read bytes off disk and create the &lt;a href='https://github.com/apache/cassandra/blob/cassandra-0.8.1/src/java/org/apache/cassandra/db/Column.java'&gt;Column&lt;/a&gt; objects that will be collated and reconciled to arrive at the query result.&lt;/p&gt;

&lt;p&gt;The first thing the Iterator does is determine if the SSTable contains the row fragment we are interested in, this follows the read path described in &lt;a href='http://thelastpickle.com/2011/04/28/Forces-of-Write-and-Read/'&gt;The forces of Write and Read&lt;/a&gt;. When the file is opened a buffer of size &lt;code&gt;column_index_size_in_kb&lt;/code&gt; (specified in cassandra.yaml, default is 64K) will be allocated if Standard (non memory mapped) file access is been used. If the default &lt;a href='http://en.wikipedia.org/wiki/Memory-mapped_file'&gt;Memory-mapped file access&lt;/a&gt; is used the buffer size is ignored.&lt;/p&gt;

&lt;p&gt;If the row exists in the SSTable the Iterator will seek to the start of the row in the &lt;code&gt;Data&lt;/code&gt; component of the SSTable and take the following initial steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read the row key.&lt;/li&gt;

&lt;li&gt;Read the row size.&lt;/li&gt;

&lt;li&gt;Read the row level Bloom Filter.&lt;/li&gt;

&lt;li&gt;Read the row level column index, if present.&lt;/li&gt;

&lt;li&gt;Read the row level Tombstone.&lt;/li&gt;

&lt;li&gt;Filter the requested columns using the row level Bloom Filter.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;At the end of these initial steps the Iterator will know if the SSTable &lt;em&gt;may&lt;/em&gt; contain any of the columns requested in the query. Even if the SSTable table does not contain any columns of interest it may contain a row level Tombstone which will need to be reconciled with columns contained in other SSTables.&lt;/p&gt;

&lt;p&gt;If the row may contain columns for the query the path the Iterator will take depends on presence of the column index.&lt;/p&gt;

&lt;p&gt;If the row does not contain a column index the iterator will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read the number of columns in the row.&lt;/li&gt;

&lt;li&gt;Read each column fully (including it&amp;#8217;s value) from the SSTable.&lt;/li&gt;

&lt;li&gt;Add the column to the result set if it&amp;#8217;s name is in the list of columns queried for.&lt;/li&gt;

&lt;li&gt;Stop the scan early if the number of filtered columns (from the Bloom Filter) is reached.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is a basic scan operation over potentially all columns in the row. However the code knows it will only have to scan at most one page of columns that will contain at most &lt;code&gt;column_index_size_in_kb&lt;/code&gt; (64K) of columns. The number of columns in the column page will vary with regard to the size of the columns, so the amount of de-serialisation and object creation will vary for each row.&lt;/p&gt;

&lt;p&gt;When using Standard disk access with the default file buffer size is 64K the entire row may have already been read from disk. When using Memory-mapped file access the row will be backed by a &lt;a href='http://download.oracle.com/javase/6/docs/api/java/nio/MappedByteBuffer.html'&gt;MappedByteBuffer&lt;/a&gt; of up to 2GB. The MappedByteBuffer will use the native &lt;a href='http://en.wikipedia.org/wiki/Page_(computer_memory'&gt;page size&lt;/a&gt;) when it asks for data, on Mac Book this is 4096 bytes (&lt;code&gt;getconf PAGESIZE&lt;/code&gt;). The &lt;code&gt;MappedByteBuffer&lt;/code&gt; has the advantage of not needing to copy the bytes read from disk, and it will also keep the data in memory until the OS needs to page it out.&lt;/p&gt;

&lt;p&gt;In both cases it&amp;#8217;s reasonable to assume there maybe no additional IO to read all the bytes for the column page.&lt;/p&gt;

&lt;p&gt;If the row does contain a column index the Iterator will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use the the column index to work out the distinct ordered set of column pages that contain the columns which may exist in the row (filtered columns from the Bloom Filter).&lt;/li&gt;

&lt;li&gt;Seek to the start position for each page of columns identified in step 1.&lt;/li&gt;

&lt;li&gt;Scan through the entire page of columns, de-serialise each column fully (including it&amp;#8217;s value) and add it to the result set if it&amp;#8217;s name is in the list of columns queried for.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The search path is reduced by only scanning the pages that contain columns which passed through the bloom filter. However there is no facility to stop a particular page scan until the end of the page is reached.&lt;/p&gt;

&lt;p&gt;For both indexed and non indexed rows all the columns which match the query are eagerly read from disk for each SSTable and held in the Iterator.&lt;/p&gt;

&lt;h2 id='slice_from_read'&gt;Slice From Read&lt;/h2&gt;

&lt;p&gt;A slice query that does not specify a list of columns is turned into a &lt;a href='https://github.com/apache/cassandra/blob/cassandra-0.8.1/src/java/org/apache/cassandra/db/SliceFromReadCommand.java'&gt;SliceFromReadCommand&lt;/a&gt;. It will contain optional start and finish column names and a column count. Again the work for reading columns from disk is done by an Iterator, this time the &lt;a href='https://github.com/apache/cassandra/blob/cassandra-0.8.1/src/java/org/apache/cassandra/db/columniterator/SSTableSliceIterator.java'&gt;SSTableSliceIterator&lt;/a&gt; which wraps either a &lt;a href='https://github.com/apache/cassandra/blob/cassandra-0.8.1/src/java/org/apache/cassandra/db/columniterator/SimpleSliceReader.java'&gt;SimpleSliceReader&lt;/a&gt; or an &lt;a href='https://github.com/apache/cassandra/blob/cassandra-0.8.1/src/java/org/apache/cassandra/db/columniterator/IndexedSliceReader.java'&gt;IndexedSliceReader&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;SSTableSliceIterator&lt;/code&gt; is responsible for:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ensuring the key exists in the SSTable and seeking to it&amp;#8217;s position.&lt;/li&gt;

&lt;li&gt;Reading the key from the row.&lt;/li&gt;

&lt;li&gt;Reading the row size.&lt;/li&gt;

&lt;li&gt;Deciding which Iterator to use.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;code&gt;SimpleSliceReader&lt;/code&gt; is used when the query does not specify a start column and uses the default ascending column order (i.e. does not specify reversed). Otherwise the &lt;code&gt;IndexedSliceReader&lt;/code&gt; is used.&lt;/p&gt;

&lt;p&gt;When the &lt;code&gt;SimpleSliceReader&lt;/code&gt; is used it will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Skip and ignore the row level Bloom Filter.&lt;/li&gt;

&lt;li&gt;Skip and ignore the column index.&lt;/li&gt;

&lt;li&gt;Read the row level Tombstone.&lt;/li&gt;

&lt;li&gt;Read the column count.&lt;/li&gt;

&lt;li&gt;De-serialise each column in turn on demand as requested through the Iterator interface.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;code&gt;SimpleSliceReader&lt;/code&gt; will only stop the scan when either the end of the row is reached, or a column with a name greater than the finish column is reached.&lt;/p&gt;

&lt;p&gt;It&amp;#8217;s the responsibility of the &lt;a href='https://github.com/apache/cassandra/blob/cassandra-0.8.1/src/java/org/apache/cassandra/db/filter/SliceQueryFilter.java'&gt;SliceQueryFilter&lt;/a&gt; to only request the columns it needs for the query. This involves determining which columns are &lt;code&gt;live&lt;/code&gt; and contribute to the column count specified in the &lt;code&gt;SliceRange&lt;/code&gt;. The query will be stopped by either:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reading &lt;code&gt;count&lt;/code&gt; live columns from the SSTables.&lt;/li&gt;

&lt;li&gt;Reading past the finish column if specified.&lt;/li&gt;

&lt;li&gt;Exhausting all SSTables involved in the read.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;code&gt;IndexedSliceReader&lt;/code&gt; has a much tougher life. It is used when there is a start column or the columns must be returned in reverse order. It will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Skip and ignore the row level Bloom Filter.&lt;/li&gt;

&lt;li&gt;Read the column index.&lt;/li&gt;

&lt;li&gt;De-serialise the row and read the row level tombstone if present.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Depending on the presence of the column index one of two algorithms will be used to read the columns from disk. The &lt;a href='https://github.com/apache/cassandra/blob/cassandra-0.8.1/src/java/org/apache/cassandra/db/columniterator/IndexedSliceReader.java#L200'&gt;SimpleBlockFetcher&lt;/a&gt; is used when the row does not contain a column index. It de-serialises the columns in the row, and builds a list that is either disk ordered or reverse disk ordered depending on the query. The scan is stopped by either:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reading all columns in the row.&lt;/li&gt;

&lt;li&gt;Reading a column beyond the finish column, if a finish column was specified and reverse is false.&lt;/li&gt;

&lt;li&gt;Reading a column beyond the start column, if a start column was specified and reverse is true.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Unlike the &lt;code&gt;SimpleSliceReader&lt;/code&gt; when the &lt;code&gt;IndexedSliceReader&lt;/code&gt; is used on a small row it will eagerly read all columns from the row which match the query. This only happens in the context of a row which has less then &lt;code&gt;column_index_size_in_kb&lt;/code&gt; of column data .&lt;/p&gt;

&lt;p&gt;The &lt;a href='https://github.com/apache/cassandra/blob/cassandra-0.8.1/src/java/org/apache/cassandra/db/columniterator/IndexedSliceReader.java#L140'&gt;IndexedBlockFetcher&lt;/a&gt; is used for all other cases, it will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use the column index to find the first column page to read from.&lt;/li&gt;

&lt;li&gt;Step forwards or backwards (for reversed) through the column pages for the row on demand as each page of columns is exhausted.&lt;/li&gt;

&lt;li&gt;Stop the iteration if the &lt;a href='https://github.com/apache/cassandra/blob/cassandra-0.8.1/src/java/org/apache/cassandra/io/sstable/IndexHelper.java#L164'&gt;IndexInfo&lt;/a&gt; for the next column page does not contain columns inside the start or finish column range, if a start or finish column are specified.&lt;/li&gt;

&lt;li&gt;De-serialise all columns in each page, building a list of columns in either disk order or reverse disk order.&lt;/li&gt;

&lt;li&gt;Stop a page scan using the same criteria as the &lt;code&gt;SimpleBlockFetcher&lt;/code&gt; above.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;code&gt;IndexedBlockFetcher&lt;/code&gt; greedily reads the columns in each page just as the &lt;code&gt;SimpleBlockFetcher&lt;/code&gt; does. But will only advance to the next page when the query requests more columns and the previous block of columns has been exhausted.&lt;/p&gt;

&lt;h2 id='so_what_'&gt;So What ?&lt;/h2&gt;

&lt;p&gt;Not much.&lt;/p&gt;

&lt;p&gt;Cassandra works like other databases by using a seek and partial scan approach to finding data. Read requests that require fewer seek+scan operations will be faster than those the require more.&lt;/p&gt;

&lt;p&gt;There are a couple of things we can say about what will make some queries go faster than others though.&lt;/p&gt;

&lt;h3 id='name_locality'&gt;Name Locality&lt;/h3&gt;

&lt;p&gt;Queries by column name that select columns on fewer column pages should be faster that those which are spread out over more column pages. Querying more column pages means reading more data and creating more Column objects. The pathological case is selecting n columns from n column pages.&lt;/p&gt;

&lt;p&gt;Note that queries by column name must also de-serialise the column index, and will pay a constant cost for every query regardless of the number of columns requested or their distribution in the row.&lt;/p&gt;

&lt;h3 id='start_position'&gt;Start Position&lt;/h3&gt;

&lt;p&gt;The entire column index must be de-serialised whenever an offset into the row needs to be found. Queries by that slice columns without specifying a start column and use (default) ascending order will perform better than those that either use a start column or reverse order. However the cost of de-serialising the column index is constant and the position of the start column should have little affect on query performance.&lt;/p&gt;

&lt;p&gt;The overhead should naturally increase with the width of the rows.&lt;/p&gt;

&lt;h2 id='in_motion__the_setup'&gt;In Motion - The Setup&lt;/h2&gt;

&lt;p&gt;The &lt;a href='https://gist.github.com/1074715'&gt;query_profile gist&lt;/a&gt; contains the Python code I used to setup the data and profile queries using the &lt;a href='https://github.com/pycassa/pycassa'&gt;pycassa&lt;/a&gt; library. All tests were done locally on a 2011 Mac Book Pro with 8GB of RAM and spinning disk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: you will need to edit the file and set the &lt;code&gt;CASSANDRA_PATH&lt;/code&gt; at the top of the file.&lt;/p&gt;

&lt;p&gt;The tests used columns with a 10 byte name and 25 bytes of data. Together with the &lt;a href='https://github.com/apache/cassandra/blob/cassandra-0.8.1/src/java/org/apache/cassandra/db/Column.java#L116'&gt;15 bytes of column meta data&lt;/a&gt; this will create columns that use 50 bytes on disk. For the standard 64KB column page this should give 1,310 columns per column page.&lt;/p&gt;

&lt;p&gt;To verify the column paging I recompiled the 0.8.1 source to include a change to the &lt;a href='https://gist.github.com/1068855'&gt;SSTableNamesIterator that logged&lt;/a&gt; the number of index entries each row had and how many column pages were used during a read.&lt;/p&gt;

&lt;p&gt;The output made the paging look &amp;#8220;good enough&amp;#8221;, for example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#selecting from 1 page
INFO 13:21:52,662 Column Page Count 4/1
INFO 13:22:03,309 Column Page Count 8/1
INFO 13:22:14,070 Column Page Count 77/1
INFO 13:22:26,140 Column Page Count 763/1
INFO 13:22:49,362 Column Page Count 7628/1

#selecting from 50 pages 
INFO 13:32:15,501 Column Page Count 4/4
INFO 13:32:25,696 Column Page Count 8/8
INFO 13:32:36,790 Column Page Count 77/50
INFO 13:32:54,012 Column Page Count 763/50
INFO 13:33:37,511 Column Page Count 7628/50

#sometimes it was off
INFO 13:35:02,987 Column Page Count 77/53
INFO 13:35:04,153 Column Page Count 77/50
INFO 13:35:05,325 Column Page Count 77/51
INFO 13:35:06,489 Column Page Count 77/52
INFO 13:35:15,812 Column Page Count 763/71
INFO 13:35:18,182 Column Page Count 763/63
INFO 13:35:20,536 Column Page Count 763/69
INFO 13:35:22,898 Column Page Count 763/70
INFO 13:35:51,997 Column Page Count 7628/67
INFO 13:36:07,069 Column Page Count 7628/65
INFO 13:36:22,012 Column Page Count 7628/66&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To test the latency of various queries I used the (recent) &amp;#8216;ReadLatency: &amp;#8217; metric available via &lt;code&gt;nodetool cfstats&lt;/code&gt;. This metric wraps the &lt;a href='https://github.com/apache/cassandra/blob/cassandra-0.8.1/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1180'&gt;local processing&lt;/a&gt; of a query for a single CF. The value presented by &lt;code&gt;nodetool&lt;/code&gt; is the most recent value.&lt;/p&gt;

&lt;p&gt;All tests were run 10 times and the min, max 80th and 95th percentiles of latency were recorded.&lt;/p&gt;

&lt;p&gt;Using a clean 0.8.1 install create the following keyspace using &lt;code&gt;bin/cassandra-cli&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;create keyspace query
    with strategy_options=[{replication_factor:1}]
    and placement_strategy = &amp;#39;org.apache.cassandra.locator.SimpleStrategy&amp;#39;;

use query;

create column family NoCache
    with comparator = AsciiType
    and default_validation_class = AsciiType
    and key_validation_class = AsciiType
    and keys_cached = 0
    and rows_cached = 0;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;query_profile&lt;/code&gt; module will insert the following rows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&amp;#8220;small-row&amp;#8221; with 100 columns, 5K of data&lt;/li&gt;

&lt;li&gt;&amp;#8220;no-col-index&amp;#8221; with 1200 columns, 60K of data&lt;/li&gt;

&lt;li&gt;&amp;#8220;five-thousand&amp;#8221; with 5000 columns, 244K of data&lt;/li&gt;

&lt;li&gt;&amp;#8220;ten-thousand&amp;#8221; with 10000 columns, 488K of data&lt;/li&gt;

&lt;li&gt;&amp;#8220;hundred-thousand&amp;#8221; with 100000 columns, 4.8M of data&lt;/li&gt;

&lt;li&gt;&amp;#8220;one-million&amp;#8221; with 1000000 columns, 48M of data&lt;/li&gt;

&lt;li&gt;&amp;#8220;ten-million&amp;#8221; with 10000000 columns, 480M of data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Insert the data by executing:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ python query_profile insert_rows&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To keep things simple flush and compact the database so we only have one SSTable:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ bin/nodetool -h localhost flush
$ bin/nodetool -h localhost compact query&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;At the end of this my &lt;code&gt;/var/lib/cassandra/data/query&lt;/code&gt; directory contained the following SSTables (ignoring the &lt;code&gt;-Compacted&lt;/code&gt; SSTables):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-rw-r--r--  1 aaron  wheel   536M  6 Jul 13:03 NoCache-g-80-Data.db
-rw-r--r--  1 aaron  wheel   1.4K  6 Jul 13:03 NoCache-g-80-Filter.db
-rw-r--r--  1 aaron  wheel   154B  6 Jul 13:03 NoCache-g-80-Index.db
-rw-r--r--  1 aaron  wheel   4.2K  6 Jul 13:03 NoCache-g-80-Statistics.db&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;query_profile&lt;/code&gt; contains a warm up function that will slice through all the columns in all the rows, warm up the database by executing:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ python query_profile warm_up&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id='in_motion__name_locality'&gt;In Motion - Name Locality&lt;/h2&gt;

&lt;p&gt;We want to test that selecting columns by name when they are tightly grouped has better performance than selecting widely distributed columns.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;name_locality&lt;/code&gt; test runs through several tests that select up to 100 columns by name from each of the rows. The numbers show a reasonably stable latency for queries that touch the same number of pages, and an increase when the number of pages increases. Which matches the theory.&lt;/p&gt;

&lt;p&gt;Note: the code outputs a WARN message when the test selects less than 100 columns from a row. For various tests this happened on the &amp;#8220;small-row&amp;#8221;, &amp;#8220;no-col-index&amp;#8221;, &amp;#8220;five-thousand&amp;#8221; and &amp;#8220;ten-thousand&amp;#8221; rows. Mostly when selecting a set number of columns from each column page.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ python query_profile.py name_locality
Latency is min, 80th percentile, 95th percentile and max.
Test name locality...

100 columns by name, start of the row.
Row            small-row had latency in ms      0.285      0.362     0.3828      0.396
Row         no-col-index had latency in ms      0.848     0.9146     0.9159      0.917
Row        five-thousand had latency in ms      1.257      1.321      1.324      1.324
Row         ten-thousand had latency in ms      1.226      1.311      1.333      1.358
Row     hundred-thousand had latency in ms      1.406      1.483      1.489      1.492
Row          one-million had latency in ms      2.848      2.957      3.019      3.089
Row          ten-million had latency in ms      17.75      18.72      18.76      18.78

100 columns by name, end of the row.
Row            small-row had latency in ms      0.273     0.2846     0.2891      0.294
Row         no-col-index had latency in ms      0.674     0.7078     0.7107      0.714
Row        five-thousand had latency in ms      0.928      0.962     0.9638      0.966
Row         ten-thousand had latency in ms      0.882      0.901     0.9172      0.937
Row     hundred-thousand had latency in ms      0.868     0.9138     0.9154      0.917
Row          one-million had latency in ms      2.553       2.72       2.73      2.739
Row          ten-million had latency in ms      16.95      17.67      17.82      17.97

100 columns by name, middle of row.
Row            small-row had latency in ms      0.273     0.3134     0.3422      0.368
Row         no-col-index had latency in ms      0.835     0.9052     0.9092      0.913
Row        five-thousand had latency in ms      1.719      1.772      1.839      1.915
Row         ten-thousand had latency in ms      1.726      1.802      1.818      1.837
Row     hundred-thousand had latency in ms      1.853      1.978      6.987       13.1
Row          one-million had latency in ms      2.682      2.832      2.927      3.041
Row          ten-million had latency in ms      17.27      18.59      18.72      18.75

100 columns by name, first 2 cols from 50 random pages
Row            small-row had latency in ms      0.115      0.119     0.1195       0.12
Row         no-col-index had latency in ms       0.37     0.4008     0.4019      0.403
Row        five-thousand had latency in ms      1.504      1.596      1.614      1.637
Row         ten-thousand had latency in ms       3.14      3.445      3.493      3.547
Row     hundred-thousand had latency in ms      25.25      27.63      36.62       47.5
Row          one-million had latency in ms      26.68      28.47      28.92      29.47
Row          ten-million had latency in ms      40.95      43.55      43.96      44.13

100 columns by name, last 2 cols from 50 random pages
Row            small-row had latency in ms      0.109      0.115     0.1218       0.13
Row         no-col-index had latency in ms      0.349     0.3536     0.3553      0.357
Row        five-thousand had latency in ms      1.778      1.854      1.877      1.897
Row         ten-thousand had latency in ms       3.39      3.664      3.715      3.765
Row     hundred-thousand had latency in ms      25.17      26.77       28.0      29.35
Row          one-million had latency in ms      26.44      28.04      28.19      28.35
Row          ten-million had latency in ms      40.57      44.26      44.69      44.74

100 columns by name, random 2 cols from 50 random pages
Row            small-row had latency in ms      0.107      0.117     0.1174      0.118
Row         no-col-index had latency in ms       0.36     0.4356     0.4445      0.445
Row        five-thousand had latency in ms       1.72      1.808      2.795      3.998
Row         ten-thousand had latency in ms      3.367      3.682      3.705      3.715
Row     hundred-thousand had latency in ms      25.45      27.23      28.93      30.95
Row          one-million had latency in ms       32.6      38.43      38.91      39.45
Row          ten-million had latency in ms      48.76      53.88      54.09      54.19

100 columns by name, first col from 100 random pages
Row            small-row had latency in ms      0.106      0.113     0.1306      0.152
Row         no-col-index had latency in ms      0.326     0.3508      0.351      0.351
Row        five-thousand had latency in ms      1.361      1.451      1.483      1.518
Row         ten-thousand had latency in ms      2.853      3.071      3.102      3.136
Row     hundred-thousand had latency in ms      37.65      40.08      40.92      41.84
Row          one-million had latency in ms      51.23      54.41      55.93       57.6
Row          ten-million had latency in ms      64.67      68.56      69.24       69.6

100 columns by name, last col from 100 random pages
Row            small-row had latency in ms      0.105      0.113     0.1153      0.118
Row         no-col-index had latency in ms      0.325      0.349     0.3553      0.363
Row        five-thousand had latency in ms      1.601      1.695      1.698        1.7
Row         ten-thousand had latency in ms      3.058      3.404      4.317      5.429
Row     hundred-thousand had latency in ms      37.34      39.52      40.54      41.72
Row          one-million had latency in ms      50.76      53.92      54.58      55.14
Row          ten-million had latency in ms       64.2      68.46      69.28      70.07

100 columns by name, random col from 100 random pages
Row            small-row had latency in ms       0.11     0.1156     0.1169      0.118
Row         no-col-index had latency in ms      0.327     0.3628     0.3721      0.382
Row        five-thousand had latency in ms      1.541      1.656      1.664      1.674
Row         ten-thousand had latency in ms      2.719      3.272      3.279      3.282
Row     hundred-thousand had latency in ms      35.51      37.52       38.8      40.35
Row          one-million had latency in ms      50.11      52.91       53.8       54.7
Row          ten-million had latency in ms      65.06      68.67       69.3      69.88&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id='in_motion__start_position'&gt;In Motion - Start Position&lt;/h2&gt;

&lt;p&gt;We want to show that any start position in a (forward) slice query has worse performance than one that does not. And that the position of the start column has little impact on the query performance.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;start_position&lt;/code&gt; test runs through several tests that select a slice of up to 100 columns with different start positions.&lt;/p&gt;

&lt;p&gt;The test for &amp;#8220;100 columns from the start of the row with a start col&amp;#8221; takes significantly longer than &amp;#8220;100 columns from with no start column&amp;#8221; for all rows other than &amp;#8220;small-row&amp;#8221;. For all rows other than &amp;#8220;no-col-index&amp;#8221; I attribute this to the column index. For &amp;#8220;no-col-index&amp;#8221; row I attribute the increased latency to eagerly reading the entire page of columns. Performance was then reasonably similar for the other tests which used a start column.&lt;/p&gt;

&lt;p&gt;I&amp;#8217;m not sure why the test for &amp;#8220;100 columns from the start of the second page&amp;#8221; was an outlier. Hopefully I&amp;#8217;ll get to take a longer look, but it seemed to consistency perform worse than other tests.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ python query_profile.py start_position
Test start position...

Latency is min, 80th percentile, 95th percentile and max.
100 columns from with no start column
Row            small-row latency in ms      0.181     0.1934     0.2134      0.237
Row         no-col-index latency in ms       0.18     0.1918     0.1956        0.2
Row        five-thousand latency in ms      0.169      0.185     0.1864      0.188
Row         ten-thousand latency in ms      0.187     0.1936     0.1972      0.201
Row     hundred-thousand latency in ms      0.183     0.1936     0.1963      0.199
Row          one-million latency in ms      0.181     0.1948     0.1959      0.197
Row          ten-million latency in ms      0.178     0.1952     0.1965      0.197

Latency is min, 80th percentile, 95th percentile and max.
100 columns from the start of the row with a start col
Row            small-row latency in ms      0.135      0.142     0.1424      0.143
Row         no-col-index latency in ms       0.34     0.3498       0.35       0.35
Row        five-thousand latency in ms      0.358     0.3676     0.3684      0.369
Row         ten-thousand latency in ms       0.36      0.376      0.511      0.676
Row     hundred-thousand latency in ms      0.358        0.4     0.5005      0.622
Row          one-million latency in ms      0.456      0.495     0.5006      0.505
Row          ten-million latency in ms      1.345      1.418       1.44      1.467

Latency is min, 80th percentile, 95th percentile and max.
100 columns from the start of the second page
Row            small-row latency in ms      0.128     0.1412     0.1424      0.143
Row         no-col-index latency in ms      0.327      0.338     0.3385      0.339
Row        five-thousand latency in ms      0.637     0.6488     0.6634      0.681
Row         ten-thousand latency in ms      0.615      0.651     0.6519      0.653
Row     hundred-thousand latency in ms      0.628     0.6726     0.6762       0.68
Row          one-million latency in ms      0.706     0.7636     0.7671      0.771
Row          ten-million latency in ms      1.569      1.706      3.148       4.91

Latency is min, 80th percentile, 95th percentile and max.
100 columns starting half way through the row
Row            small-row latency in ms      0.099     0.1028     0.1125      0.124
Row         no-col-index latency in ms      0.354     0.3626      0.363      0.363
Row        five-thousand latency in ms      0.397     0.4068     0.4318      0.462
Row         ten-thousand latency in ms      0.389     0.4138     0.4167       0.42
Row     hundred-thousand latency in ms      0.358     0.3828      0.383      0.383
Row          one-million latency in ms      0.456     0.5048     0.5135      0.524
Row          ten-million latency in ms      1.347       1.47      1.489      1.504

Latency is min, 80th percentile, 95th percentile and max.
100 columns starting from the last page 
Row            small-row latency in ms      0.126      0.143     0.1434      0.144
Row         no-col-index latency in ms      0.325     0.3406     0.3425      0.343
Row        five-thousand latency in ms      0.595      0.634     0.6646      0.696
Row         ten-thousand latency in ms      0.556     0.5694     0.5795      0.591
Row     hundred-thousand latency in ms      0.441     0.4942     0.5027      0.512
Row          one-million latency in ms      0.405     0.4468     0.4507      0.454
Row          ten-million latency in ms      1.272      1.372      1.393      1.417&lt;/code&gt;&lt;/pre&gt;</content>
 </entry>
 
 <entry>
   <title>Down For Me?</title>
   <link href="http://thelastpickle.com/2011/06/13/Down-For-Me"/>
   <updated>2011-06-13T00:00:00-07:00</updated>
   <id>http://thelastpickle.com/2011/06/13/Down-For-Me</id>
   <content type="html">&lt;p&gt;For a read or write request to &lt;em&gt;start&lt;/em&gt; in Cassandra at least as many nodes must be seen as &lt;code&gt;UP&lt;/code&gt; by the coordinator node as the request has specified via the ConsistencyLevel. Otherwise the client will get an &lt;code&gt;UnavailableException&lt;/code&gt; and the cluster will appear down for that request. That may not necessarily mean it is down for all keys or all requests.&lt;/p&gt;

&lt;p&gt;The Replication Factor, number of nodes, the Consistency Level and luck all play a part of determining how many nodes can be lost in a Cassandra cluster before it is unavailable for 100% of the keys. Before it reaches that point though the cluster may go through a period of partial failure where some keys will not be available at some CL levels.&lt;/p&gt;

&lt;p&gt;The partial failure support baked into the system is nice thing to have. But most people will be interested in keeping 100% of the keys available at the required Consistency Level. So most of discussion below talks about keeping the cluster up for 100% of the keys at the &lt;code&gt;QUORUM&lt;/code&gt; CL.&lt;/p&gt;

&lt;h2 id='which_nodes_'&gt;Which nodes ?&lt;/h2&gt;

&lt;p&gt;When it comes to counting the &lt;code&gt;UP&lt;/code&gt; nodes for a request we only consider the &lt;em&gt;Natural Endpoints&lt;/em&gt; for a key. These are the nodes identified by the &lt;code&gt;placement_strategy&lt;/code&gt; (set when the Keyspace was created) as the replicas for a key, and they never change. All read and write operations for a key, using the same partitioner, will select those same endpoints. Otherwise write operations could plonk down data that reads could never find.&lt;/p&gt;

&lt;p&gt;The row key is first &lt;em&gt;Decorated&lt;/em&gt; by the &lt;code&gt;partitioner&lt;/code&gt; (specified in &lt;code&gt;conf/cassandra.yaml&lt;/code&gt;) to create the token used to locate the row in the cluster. For example the &lt;code&gt;RandomPartitioner&lt;/code&gt; uses an MD5 transform to turn the key into a 128bit token.&lt;/p&gt;

&lt;p&gt;Is using the &lt;code&gt;SimpleStrategy&lt;/code&gt; it will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Order the nodes in the ring by their &lt;code&gt;initial_token&lt;/code&gt;.&lt;/li&gt;

&lt;li&gt;Select the node whose token range includes the token as the first replica.&lt;/li&gt;

&lt;li&gt;Select the next RF-1 nodes in the ordered ring as the remaining replicas.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The nodes loop around, so a row may be replicated on the last 2 nodes and the first 1.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;NetworkTopologyStrategy&lt;/code&gt; use a more involved algorithm that considers the Data Centre and Rack the node is assigned to by the Snitch. There is a good discussion of the process from &lt;a href='http://www.mail-archive.com/user@cassandra.apache.org/msg12092.html'&gt;Peter Schuller&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id='the_weird_consistency_level'&gt;The weird Consistency Level&lt;/h2&gt;

&lt;p&gt;There is one Consistency Level that does not behave like the others, so lets just get it out of the way first. CL &lt;code&gt;ANY&lt;/code&gt; for write requests allows the coordinator node to store the mutation in the form of a &lt;a href='http://wiki.apache.org/cassandra/HintedHandoff'&gt;Hinted Handoff&lt;/a&gt; on &lt;strong&gt;any node in the cluster&lt;/strong&gt;, which in practice means on the coordinator itself.&lt;/p&gt;

&lt;p&gt;This is useful in cases where extreme write uptime is needed. The sort of extreme where the write &lt;strong&gt;cannot&lt;/strong&gt; be reliably read until a &lt;code&gt;nodetool repair&lt;/code&gt; operation has been completed. &lt;code&gt;Hinted Handoffs&lt;/code&gt; must be delivered to Natural Endpoints before they can be included in a read operation.&lt;/p&gt;

&lt;p&gt;If you write at CL &lt;code&gt;ANY&lt;/code&gt; and some of the Natural Endpoints are up, the write and the Hints will be sent to them. The coordinator will only be used to store the Hints in cases where all the Natural Endpoints are down.&lt;/p&gt;

&lt;p&gt;For more information on Hinted Handoff see &lt;a href='http://www.datastax.com/dev/blog/understanding-hinted-handoff'&gt;Jonathan&amp;#8217;s recent post&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id='consistency_levels'&gt;Consistency Levels&lt;/h2&gt;

&lt;p&gt;For all the other Consistency Levels the read or write request is directed to one of the Natural Endpoints. Hinted Handoffs may be used as part of the request but are do not considered when determining if the cluster is available for the request.&lt;/p&gt;

&lt;p&gt;The named Consistency Levels &lt;code&gt;ONE&lt;/code&gt;, &lt;code&gt;TWO&lt;/code&gt; and &lt;code&gt;THREE&lt;/code&gt; are pretty easy to understand. Once, two or three replicas for the key must be seen as &lt;code&gt;UP&lt;/code&gt; before the operation will start. CL &lt;code&gt;ONE&lt;/code&gt; is the most often used of these.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;QUORUM&lt;/code&gt; is calculated as &lt;code&gt;floor(RF \ 2) + 1&lt;/code&gt;. This is most used CL level and in my opinion should be the starting CL for all applications until a reason is found to change (performance is &lt;em&gt;not&lt;/em&gt; a reason).&lt;/p&gt;

&lt;p&gt;For RF levels below 3 the &lt;code&gt;QUORUM&lt;/code&gt; is the same as the RF level so:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RF 1 - QUORUM = 1&lt;/li&gt;

&lt;li&gt;RF 2 - QUORUM = 2&lt;/li&gt;

&lt;li&gt;RF 3 - QUORUM = 2&lt;/li&gt;

&lt;li&gt;RF 4 - QUORUM = 3&lt;/li&gt;

&lt;li&gt;RF 5 - QUORUM = 3&lt;/li&gt;

&lt;li&gt;RF 6 - QUORUM = 4&lt;/li&gt;

&lt;li&gt;RF 7 - QUORUM = 4&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the &lt;code&gt;NetworkTopologyStrategy&lt;/code&gt; is used each data centre has it&amp;#8217;s own RF, and the standard &lt;code&gt;QUORUM&lt;/code&gt; is calculated using the total RF for the cluster.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;LOCAL_QUORUM&lt;/code&gt; and &lt;code&gt;EACH_QUORUM&lt;/code&gt; can be used with the &lt;code&gt;NetworkTopologyStrategy&lt;/code&gt; and they instruct the coordinator to also consider the Data Centre the nodes are located in. A write is always sent to all &lt;code&gt;UP&lt;/code&gt; replicas, this&lt;/p&gt;

&lt;p&gt;For &lt;code&gt;LOCAL_QUORUM&lt;/code&gt; only the RF of the local data centre is considered when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Calculating how many nodes to block for.&lt;/li&gt;

&lt;li&gt;Checking if enough nodes are &lt;code&gt;UP&lt;/code&gt; for the request.&lt;/li&gt;

&lt;li&gt;Counting if CL nodes have responded to the request.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;EACH_QUORUM&lt;/code&gt; works in a similar way but the tests apply to every DC in the cluster.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ALL&lt;/code&gt; requires that all replicas for the row be &lt;code&gt;UP&lt;/code&gt; before the request will start.&lt;/p&gt;

&lt;h2 id='a_failing_range'&gt;A failing Range&lt;/h2&gt;

&lt;p&gt;A simple, but incomplete, way to think about the cluster been available is to focus a one key range and it&amp;#8217;s replicas.&lt;/p&gt;

&lt;p&gt;Consider RF 3, at QUORUM if one node is lost the range will still be available. If two nodes are lost the range will not be available for QUORUM operations, but will still be available for ONE and ANY requests.&lt;/p&gt;

&lt;p&gt;For any number of nodes in the cluster, a range will become unavailable it more than (RF - CL) nodes are &lt;code&gt;DOWN&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now consider a cluster with 50 nodes, RF 5, QUORUM operations and the SimpleStrategy. If 3 adjacent nodes nodes go down the range assigned to the first one will no longer be available, as their will only be 2 replicas &lt;code&gt;UP&lt;/code&gt;. The cluster will be down for 2% of the possible keys, not terrible but it&amp;#8217;s no longer up for 100%. The range assigned to the second down node will have 3 &lt;code&gt;UP&lt;/code&gt; replicas and the range assigned to the third will have 4 &lt;code&gt;UP&lt;/code&gt; replicas. The nodes do not have to be adjacent in the ring for this occur, it could be any nodes in the replica set for the range. It&amp;#8217;s just easier to think about when they are adjacent.&lt;/p&gt;

&lt;p&gt;Spreading replicas for a key across nodes with different physical infrastructure is a good way to mitigate this risk. The &lt;code&gt;NetworkTopologyStrategy&lt;/code&gt; distributes the replicas for a DC across the available Racks. As defined by either the &lt;code&gt;RackInferringSnitch&lt;/code&gt;, &lt;code&gt;PropertyFileSnitch&lt;/code&gt; or the &lt;code&gt;EC2Snitch&lt;/code&gt; which uses AWS Availability Zones as racks. The &lt;code&gt;SimpleSnitch&lt;/code&gt; puts all nodes into &lt;code&gt;rack1&lt;/code&gt; in &lt;code&gt;datacenter1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In the worst case failure scenario the cluster can sustain up to (RF - CL) failures and still remain available for 100% of the keys.&lt;/p&gt;

&lt;h2 id='a_failing_cluster'&gt;A failing Cluster&lt;/h2&gt;

&lt;p&gt;The best case scenario for failure is when the node failures are evenly divided amongst the replicas for a range. So that every RF number of failures only removes one node from the available replica set for each range. To know how many nodes we can lose for a Consistency Level to still be available for 100% of the keys multiply by RF-CL.&lt;/p&gt;

&lt;p&gt;For a 5 node cluster with RF 3 at &lt;code&gt;QUORUM&lt;/code&gt; this is (5 / 3) * (3 - 2) or 1. For other cluster sizes the number is:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def max_failure(num_nodes, rf, block_for):
    print &amp;quot;For up to %s nodes with RF %s and blocking for %s nodes...&amp;quot; % (
        num_nodes, rf, block_for)
    print &amp;quot;Number Nodes / max_failure&amp;quot;;
    for n in range(1, num_nodes + 1):
         print &amp;quot;%s / %s&amp;quot; % (n, ( int(n/rf) * (rf - block_for)))

max_failure(10, 3, 2)

For up to 10 nodes with RF 3 and blocking for 2 nodes...
Number Nodes / max_failure
1 / 0
2 / 0
3 / 1
4 / 1
5 / 1
6 / 2
7 / 2
8 / 2
9 / 3
10 / 3&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In the best case failure scenario the cluster can sustain up to floor(number of nodes / RF) * (RF - CL) failed nodes and still remain up for 100% of the keys.&lt;/p&gt;

&lt;h2 id='the_view_from_one_node'&gt;The view from one node&lt;/h2&gt;

&lt;p&gt;Considering which nodes are &lt;code&gt;UP&lt;/code&gt; or &lt;code&gt;DOWN&lt;/code&gt; is always from the perspective of the coordinator node. Network Partitions also play a part in the deciding if the cluster is available, as the nodes must be both running and contactable by the coordinator.&lt;/p&gt;

&lt;p&gt;At the small scale if a client connects to a node in the cluster that has lost connectivity to other nodes in the cluster it will consider them all &lt;code&gt;DOWN&lt;/code&gt; and be unavailable for all &lt;code&gt;QUORUM&lt;/code&gt; requests. The client will receive a &lt;code&gt;UnavailableException&lt;/code&gt; and should connect to another node and try request. Other nodes in the cluster may be in a bigger partition that contains enough &lt;code&gt;UP&lt;/code&gt; replicas for the request to complete.&lt;/p&gt;

&lt;p&gt;At a bigger scale when using Amazon AWS it&amp;#8217;s more likely that nodes an Availability Zone will lose connectivity with nodes from a different AZ then from nodes in the same AZ.&lt;/p&gt;

&lt;p&gt;With two AZ&amp;#8217;s operations at &lt;code&gt;QUORUM&lt;/code&gt; will require nodes in both AZ&amp;#8217;s as neither will hold &lt;code&gt;QUORUM&lt;/code&gt; replicas. So a network partition between the two would result in 100% of the keys been down in both AZ&amp;#8217;s.&lt;/p&gt;

&lt;p&gt;With three AZ&amp;#8217;s each AZ will hold one third of the replicas, and any two together may provide enough &lt;code&gt;UP&lt;/code&gt; replicas to support &lt;code&gt;QUORUM&lt;/code&gt; operations. The cluster could sustain a network partition so long as each AZ can talk to at least one other AZ.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Recent Cassandra AWS Reading</title>
   <link href="http://thelastpickle.com/2011/05/21/Recent-Cassandra-AWS-Reading"/>
   <updated>2011-05-21T00:00:00-07:00</updated>
   <id>http://thelastpickle.com/2011/05/21/Recent-Cassandra-AWS-Reading</id>
   <content type="html">&lt;p&gt;There&amp;#8217;s been a few good AWS discussions recently on the Cassandra User List, and some interesting blog posts.&lt;/p&gt;

&lt;h2 id='user_list_discussions'&gt;User List discussions&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href='http://www.mail-archive.com/user@cassandra.apache.org/msg12831.html'&gt;best way to backup&lt;/a&gt; on EC2 and the &lt;a href='https://github.com/simplegeo/tablesnap'&gt;SimpleGeo.com table snap&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href='http://www.mail-archive.com/user@cassandra.apache.org/msg12586.html'&gt;EC2 Stress Tests&lt;/a&gt; with lots of hard numbers&lt;/li&gt;

&lt;li&gt;&lt;a href='http://www.mail-archive.com/user@cassandra.apache.org/msg12502.html'&gt;Multi DC Deployment&lt;/a&gt; discussion&lt;/li&gt;

&lt;li&gt;Work on a new &lt;a href='http://www.mail-archive.com/user@cassandra.apache.org/msg13214.html'&gt;EC2 Snitch&lt;/a&gt; to make multi region deployments easier using &lt;a href='https://issues.apache.org/jira/browse/CASSANDRA-2452'&gt;CASSANDRA-2452&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Using &lt;a href='http://www.vyatta.com/'&gt;Vyatta&lt;/a&gt; and &lt;a href='http://www.mail-archive.com/user@cassandra.apache.org/msg12733.html'&gt;IP address resolution in MultiDC setup (EC2)/VIP&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Token selection for a balanced &lt;a href='http://www.mail-archive.com/user@cassandra.apache.org/msg12975.html'&gt;Replica data distributing between racks&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='blogs_and_such'&gt;Blogs and such&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href='http://www.datastax.com/dev/blog/setting-up-a-cassandra-cluster-with-the-datastax-ami'&gt;Setting up a Cassandra cluster with the DataStax AMI&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href='http://www.datastax.com/docs/0.8/brisk/install_brisk_ami'&gt;Installing the Brisk AMI on Amazon EC2&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Amazon documentation on &lt;a href='http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/index.html?instance-storage-concepts.html'&gt;EC2 Instance Storage&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href='http://www.theserverlabs.com/blog/2010/07/08/ec2-persistence-strategies/'&gt;Persistence Strategies for Amazon EC2&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href='http://aws.amazon.com/message/65648/'&gt;Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href='http://www.gabrielweinberg.com/blog/2011/05/raid0-ephemeral-storage-on-aws-ec2.html'&gt;RAID0 ephemeral storage on AWS EC2&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href='http://amistrongeryet.blogspot.com/2010/04/three-latency-anomalies.html'&gt;Three Latency Anomalies&lt;/a&gt; a look at EC2 networking&lt;/li&gt;
&lt;/ul&gt;</content>
 </entry>
 
 <entry>
   <title>Deletes and Tombstones</title>
   <link href="http://thelastpickle.com/2011/05/15/Deletes-and-Tombstones"/>
   <updated>2011-05-15T00:00:00-07:00</updated>
   <id>http://thelastpickle.com/2011/05/15/Deletes-and-Tombstones</id>
   <content type="html">&lt;p&gt;Deletes in Cassandra rely on Tombstones to support the Eventual Consistency model. Tombstones are markers that can exist at different levels of the data model and let the cluster know that a delete was recored on a replica, and when it happened. Tombstones then play a role in keeping deleted data hidden and help with freeing space used by deleted columns on disk.&lt;/p&gt;

&lt;p&gt;It&amp;#8217;s possible to delete columns, super columns and entire rows in Cassandra. However in this post I&amp;#8217;m going to look at the simple case of deleting a single column in a Standard Column Family.&lt;/p&gt;

&lt;h2 id='remembering_things_forgotten'&gt;Remembering things forgotten&lt;/h2&gt;

&lt;p&gt;In a simple case of a single node traditional RDBMS deleting data is relatively straight forward process. Rows in Index and Data pages are marked as unused and other than recording the transaction in the commit log, the fact that the delete happened is forgotten.&lt;/p&gt;

&lt;p&gt;As the &lt;a href='http://wiki.apache.org/cassandra/DistributedDeletes'&gt;Distributed Deletes&lt;/a&gt; wiki page points out, things are a bit more complicated in Cassandra. From the perspective of the coordinator node (the one the client is connected to) one or more nodes may be down when the delete is executed. So long as the requested Consistency Level is achieved the delete can still proceed, but if we forget that the delete happened it will not be possible to reach the &lt;em&gt;correct&lt;/em&gt; consistent view of the data later. The nodes that were offline will say &amp;#8220;foo == bar&amp;#8221; and the nodes that did the delete will have nothing to say.&lt;/p&gt;

&lt;p&gt;When a column is deleted a &lt;code&gt;DeletedColumn&lt;/code&gt; aka Tombstone is created in Cassandra. The DeletedColumn will have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;name: Name of the column deleted&lt;/li&gt;

&lt;li&gt;value: Current server time as seconds since the unix epoch (integer). This is known as the &lt;code&gt;localDeleteTime&lt;/code&gt; and is used during the (cassandra) GC process.&lt;/li&gt;

&lt;li&gt;timestamp: As provided by the client&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The mutation is then &lt;a href='http://thelastpickle.com/2011/04/28/Forces-of-Write-and-Read/'&gt;applied&lt;/a&gt; to the memtable in one of two ways. If the memtable does not contain the named column for the row it is simply added to the memtable. If there is an existing column it is &lt;a href='https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/Column.java#L179'&gt;reconcile() &amp;#8216;d&lt;/a&gt; with the Deleted Column. The new &lt;code&gt;DeletedColumn&lt;/code&gt; will replace the existing column if it has a higher (client provided) timestamp. The localDeleteTime is not used for reconciliation. At this point any previous column value in the memtable is lost and will not be persisted to disk.&lt;/p&gt;

&lt;p&gt;We now have a tombstone. If there are no other mutations the &lt;code&gt;DeletedColumn&lt;/code&gt; will later be persisted to the SSTable just like any other column.&lt;/p&gt;

&lt;h2 id='local_reads_for_local_queries'&gt;Local Reads for &lt;a href='http://www.bbc.co.uk/comedy/clips/p006vm6j/the_league_of_gentlemen_a_local_shop_for_local_people/'&gt;Local&lt;/a&gt; Queries&lt;/h2&gt;

&lt;p&gt;During a local read for a row value the same reconciliation process that was used during the delete request runs. Multiple row fragments are retrieved from the current memtable, memtables pending flush and SSTables on disk. The fragments are reduced and the columns with the same name reconciled to arrive at the current value.&lt;/p&gt;

&lt;p&gt;For example if there is a row fragment in an SSTable for key &amp;#8220;foo&amp;#8221; that says columns &amp;#8220;bar&amp;#8221; is &amp;#8220;baz&amp;#8221;, and a DeletedColumn in another SSTable with a higher time stamp when they are reconciled the DeletedColumn will &amp;#8220;win&amp;#8221;. The current view of the row will be that the &amp;#8220;bar&amp;#8221; column is deleted.&lt;/p&gt;

&lt;p&gt;When a read query is filtering the reconciled candidate columns it will include the &lt;code&gt;DeletedColumn&lt;/code&gt; in the result set if the localDeletionTime recorded for it is beyond the current &lt;code&gt;gcBefore&lt;/code&gt; time. &lt;code&gt;gcBefore&lt;/code&gt; is determined when the query starts as the current time (to 1 second resolution) less the &lt;code&gt;GCGraceSeconds&lt;/code&gt; value specified for the Column Family. If the deletion is before &lt;code&gt;gcBefore&lt;/code&gt; it is totally ignored, more on the GC process below. When a slice based query (e.g. get the first 10 columns) executes it adds the &lt;code&gt;DeletedColumn&lt;/code&gt; to the result set, but does not count it towards the limit of columns the query has asked for.&lt;/p&gt;

&lt;p&gt;The deleted columns are filtered out of the result set as the last step before returning it to the client. In the simple case of a local read at CL ONE the result of the local query is filtered and returned. For cluster reads the result of the query must be reconciled with CL replicas before the coordinator can say it has a consistent result. And this is why the &lt;code&gt;DeletedColumn&lt;/code&gt;&amp;#8217;s are still in the result set.&lt;/p&gt;

&lt;h2 id='readers_digest'&gt;Readers Digest&lt;/h2&gt;

&lt;p&gt;If a read request involves more than one replica only the &amp;#8220;closest&amp;#8221; (as determined by the &lt;a href='http://wiki.apache.org/cassandra/ArchitectureInternals?highlight=%28snitch%29'&gt;snitch&lt;/a&gt;) is asked to return the full result set. The result set it returns is will include any &lt;code&gt;DeletedColumn&lt;/code&gt;&amp;#8217;s read during the local read.&lt;/p&gt;

&lt;p&gt;The other nodes in the cluster are asked to return &lt;a href='http://wiki.apache.org/cassandra/DigestQueries'&gt;digest&lt;/a&gt; of their local read. The digest is a &lt;a href='http://en.wikipedia.org/wiki/Md5'&gt;MD5&lt;/a&gt; hash of the columns, their values, timestamps and other meta data. Once the coordinator node has received CL read responses, including the data response, it compares the digests with each other and the digest of the data response.&lt;/p&gt;

&lt;p&gt;If they all match then the read is consistent at the CL requested. Otherwise there is an inconsistency that needs to be repaired. The inconsistency could come from one replica including a &lt;code&gt;DeletedColumn&lt;/code&gt; in it&amp;#8217;s digest while another includes the previously deleted value.&lt;/p&gt;

&lt;h2 id='read_repair_sort_of'&gt;Read Repair (sort of)&lt;/h2&gt;

&lt;p&gt;Once a &lt;a href='https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/DigestMismatchException.java'&gt;&lt;code&gt;DigestMismatch&lt;/code&gt;&lt;/a&gt; is detected the differences have to be reconciled before the read response is returned to the client. The process that does this is part of the &lt;a href='http://wiki.apache.org/cassandra/ReadRepair'&gt;ReadRepair&lt;/a&gt; feature, but depending on the circumstances it may not be considered a full Read Repair.&lt;/p&gt;

&lt;p&gt;Read Repair is considered to be happening when the coordinator requests a response from all replicas for a row, no just those that are needed to meet the Consistency Level for the request. This read repair will run for normal get / multi_get / indexed get operations but not range scans, and can be controlled with the &lt;code&gt;read_repair_chance&lt;/code&gt; config setting. Even though all replicas receive the request, the request will only block of Consistency Level replicas. For now I&amp;#8217;m going to ignore this process and just consider what happens if replicas required for the Consistency Level do not agree.&lt;/p&gt;

&lt;p&gt;Once the &lt;code&gt;DigestMismatch&lt;/code&gt; is detected all the replicas that were involved in the read are asked to do the read again and return a full data response to the coordinator. Their responses are then reconciled using the same process as a normal mutation request to get a consistent result for the query.&lt;/p&gt;

&lt;p&gt;This is where the tombstone do their work of remembering that the delete happened. For example if replica 1 says &amp;#8216;foo&amp;#8217; is a normal column, and replica 2 says &amp;#8216;foo&amp;#8217; is a &lt;code&gt;DeletedColumn&lt;/code&gt; with a higher time stamp the value from replica 2 will be used.&lt;/p&gt;

&lt;p&gt;Once a reconciled view of the row has been created each replica is asynchronously sent a mutation with the difference between it&amp;#8217;s data and the reconciled data. The read can then return the reconciled view of the data to the client while the repair of the replicas that participated in the request is going on in the background.&lt;/p&gt;

&lt;h2 id='free'&gt;Free&lt;/h2&gt;

&lt;p&gt;Existing data on disk for a column is not deleted when the delete mutation is processed. Cassandra never mutates on disk data.&lt;/p&gt;

&lt;p&gt;Instead the &lt;a href='http://wiki.apache.org/cassandra/MemtableSSTable?highlight=%28compaction%29'&gt;compaction&lt;/a&gt; process reconciles the data in multiple SSTables on disk. The row fragments from each SSTable are collated and columns with the same name reconciled using the process we&amp;#8217;ve already seen. The result of the compaction is a single SSTable that contains the same &amp;#8220;truth&amp;#8221; as the input files, but may be considerably smaller due to reconciling overwrites and deletions.&lt;/p&gt;

&lt;p&gt;For example, there could be three SSTables that contain a value for the &amp;#8220;foo&amp;#8221; column. In the first the value is &amp;#8220;bar&amp;#8221;, in the second the value is a 16KB string, and in the third it&amp;#8217;s a &lt;code&gt;DeletedColumn&lt;/code&gt;. Before compaction runs the value of the column is nothing, however on disk it uses at least 16KB. After compaction the value will still be nothing, but it will be stored in a single SSTable and use only a few bytes.&lt;/p&gt;

&lt;p&gt;Minor compaction typically runs frequently, so data that is created and deleted reasonably quickly will be deleted from disk quickly. Data that has been through several generations of compaction before it is deleted will not be deleted from disk as quickly. The &lt;code&gt;DeletedColumn&lt;/code&gt; will continue to be written into the new, compacted, SStables until it&amp;#8217;s &lt;code&gt;localDeletionTime&lt;/code&gt; occurs before the current (server) time less the &lt;code&gt;GCGraceSeconds&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id='the_reincarnation_of_paul_reveres_horse'&gt;The reincarnation of &lt;a href='http://www.bobdylan.com/songs/tombstone-blues'&gt;Paul Revere’s horse&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;Compactions run locally and by default automatically based on load. Once &lt;code&gt;GCGraceSeconds&lt;/code&gt; has elapsed since the delete a new compaction on the SSTable will purge the tombstone from disk and the delete will be forgotten. But how do we guarantee that other nodes in the cluster have seen the delete before it&amp;#8217;s deleted?&lt;/p&gt;

&lt;p&gt;If a node goes down for longer than &lt;code&gt;max_hint_window_in_ms&lt;/code&gt; it will no longer have hints recorded for it. If the column was never read a Read Repair could not have run. If the column was read but Read Repair was not active and the node was not included in CL nodes it would not have received a repair.&lt;/p&gt;

&lt;p&gt;Deletes operate under Eventual Consistency just like writing a value, with the added complication that they have an built in expiry time (&lt;code&gt;GCGraceSeconds&lt;/code&gt;). If the replicas for a value have not seen the delete before that time there is a risk of deleted data reappearing. The stop that happening &lt;code&gt;nodetool reapir&lt;/code&gt; &lt;a href='http://wiki.apache.org/cassandra/Operations#Dealing_with_the_consequences_of_nodetool_repair_not_running_within_GCGraceSeconds'&gt;needs to be run&lt;/a&gt; at least every &lt;code&gt;GCGraceSeconds&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id='in_motion_on_a_single_node'&gt;In Motion on a single node&lt;/h2&gt;

&lt;p&gt;There is not a lot to look at on a single node, but it&amp;#8217;s pretty easy to see the tombstones and the column value persisted into an SSTable.&lt;/p&gt;

&lt;p&gt;One a fresh 0.7 install create the sample schema:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ bin/cassandra-cli -h localhost -f conf/schema-sample.txt&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The jump into the &lt;code&gt;cassandra-cli&lt;/code&gt; and insert one column:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ bin/cassandra-cli -h localhost
Connected to: &amp;quot;Test Cluster&amp;quot; on localhost/9160
Welcome to cassandra CLI.

Type &amp;#39;help;&amp;#39; or &amp;#39;?&amp;#39; for help. Type &amp;#39;quit;&amp;#39; or &amp;#39;exit;&amp;#39; to quit.
[default@unknown] use Keyspace1;                                    
Authenticated to keyspace: Keyspace1
[default@Keyspace1] set Standard1[&amp;#39;foo&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz&amp;#39;;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Flush the data from the memtable so our delete cannot be applied in memory:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ bin/nodetool -h localhost flush Keyspace1&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now delete the column using the cli:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[default@Keyspace1] del Standard1[&amp;#39;foo&amp;#39;][&amp;#39;bar&amp;#39;];        
column removed.&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Flush the delete to disk:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ bin/nodetool -h localhost flush Keyspace1&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Finally use the &lt;code&gt;sstable2json&lt;/code&gt; dump the data from the SSTables:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ bin/sstable2json /var/lib/cassandra/data/Keyspace1/Standard1-f-1-Data.db 
{
&amp;quot;666f6f&amp;quot;: [[&amp;quot;626172&amp;quot;, &amp;quot;62617a&amp;quot;, 1305412876934000, false]]
}
$ bin/sstable2json /var/lib/cassandra/data/Keyspace1/Standard1-f-2-Data.db 
{
&amp;quot;666f6f&amp;quot;: [[&amp;quot;626172&amp;quot;, &amp;quot;4dcf05ab&amp;quot;, 1305413035092000, true]]
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The top level key in the output is the row key and each column is formatted as: name, value, timestamp, delete flag. The first SSTable still contains the &amp;#8220;bar&amp;#8221; column and the &amp;#8220;baz&amp;#8221; data written in the first set operation. The second SSTable also has the &amp;#8220;bar&amp;#8221; column, however the deleted flag is &lt;code&gt;true&lt;/code&gt; and the value is now the (server) time stamp of deletion. This is the timestamp used with &lt;code&gt;GCGraceSeconds&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Read the data back:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[default@Keyspace1] get Standard1[&amp;#39;foo&amp;#39;][&amp;#39;bar&amp;#39;];
Value was not found&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id='in_motion_on_a_cluster'&gt;In Motion on a cluster&lt;/h2&gt;

&lt;p&gt;It&amp;#8217;s a bit more involved but it&amp;#8217;s also possible to see a Read Repair happening on a cluster.&lt;/p&gt;

&lt;p&gt;I normally run a 2 node cluster on my mac book using the direction from &lt;a href='http://www.onemanclapping.org/2010/03/running-multiple-cassandra-nodes-on.html'&gt;Gary Dusbabek&lt;/a&gt;. There is also this really handy tool from &lt;a href='https://github.com/pcmanus/ccm'&gt;Sylvain&lt;/a&gt; that I&amp;#8217;ve been meaning to try, or you can just run a normal two node cluster.&lt;/p&gt;

&lt;p&gt;Edit &lt;code&gt;conf/cassandra.yaml&lt;/code&gt; for both nodes to disable Hinted Handoff:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# See http://wiki.apache.org/cassandra/HintedHandoff
hinted_handoff_enabled: true&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This will prevent the nodes telling each other about mutations that happen while the other is down.&lt;/p&gt;

&lt;p&gt;I&amp;#8217;ve also edited conf/log4j-server.properties to set logging at &lt;code&gt;DEBUG&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Start the nodes and create and populate schema below using &lt;code&gt;bin/cassandra-cli&lt;/code&gt; (schema definition is for a 0.7.5 cluster):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;create keyspace dev 
  with placement_strategy = &amp;#39;org.apache.cassandra.locator.SimpleStrategy&amp;#39;
  and replication_factor = 2;

use dev;

create column family data 
  with comparator = AsciiType;

set data[ascii(&amp;#39;foo&amp;#39;)][&amp;#39;bar&amp;#39;] = &amp;#39;baz&amp;#39;;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Flush the data on both nodes:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ bin/nodetool -h your-node1 flush dev
$ bin/nodetool -h your-node2 flush dev&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Shutdown node 1, connect to node 2 and delete the column using the cli:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ bin/cassandra-cli -h your-node2
Connected to: &amp;quot;Test Cluster&amp;quot; on 127.0.0.2/9160
Welcome to cassandra CLI.

Type &amp;#39;help;&amp;#39; or &amp;#39;?&amp;#39; for help. Type &amp;#39;quit;&amp;#39; or &amp;#39;exit;&amp;#39; to quit.
[default@unknown] use dev;
Authenticated to keyspace: dev
[default@dev] del data[&amp;#39;foo&amp;#39;][&amp;#39;bar&amp;#39;];
column removed.&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Restart node 1, connect via the cli, set the consistency level to ALL to ensure Read Repair runs and read the deleted column:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ bin/cassandra-cli -h your-node1
Connected to: &amp;quot;Test Cluster&amp;quot; on 127.0.0.1/9160
Welcome to cassandra CLI.

Type &amp;#39;help;&amp;#39; or &amp;#39;?&amp;#39; for help. Type &amp;#39;quit;&amp;#39; or &amp;#39;exit;&amp;#39; to quit.
[default@unknown] use dev;
Authenticated to keyspace: dev
[default@dev] consistencylevel as ALL;
Consistency level is set to &amp;#39;ALL&amp;#39;.
[default@dev] get data[&amp;#39;foo&amp;#39;][&amp;#39;bar&amp;#39;]; 
Value was not found&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This would have also worked at QUORUM (which is 2 for RF 2).&lt;/p&gt;

&lt;p&gt;Digging through the logs on node1 you should see that the read will block for 2 nodes and Read Repair is enabled:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;ReadCallback.java (line 84) Blockfor/repair is 2/true; setting up requests
    to localhost/127.0.0.1,/127.0.0.2&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Next should be some messages about reading the data locally and handling the response from your-node2. Then the &lt;code&gt;RowDigestResolver&lt;/code&gt; will say it&amp;#8217;s resolving 2 responses before detecting the mismatch and raising an error:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;RowDigestResolver.java (line 62) resolving 2 responses
StorageProxy.java (line 398) Digest mismatch:
    org.apache.cassandra.service.DigestMismatchException: Mismatch for key
        DecoratedKey(110673303387115207421586718101067225896, 666f6f)
        (34ec2eb2ec21eb3d05fb6f97cbf84c51 vs ba720207d87132da833ae2579487b172)
    at org.apache.cassandra.service.RowDigestResolver.resolve(
        RowDigestResolver.java:106)
    at org.apache.cassandra.service.RowDigestResolver.resolve(
        RowDigestResolver.java:30)
...&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;There should now be a few messages about doing the read again, and receiving the response from your-node2 again. Once the data responses have been received by node1 it will resolve the differences and send out mutations to the nodes than need them. In this case it&amp;#8217;s node1 and you can see the mutation logged from &lt;code&gt;Table&lt;/code&gt; (aka Keyspace):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;RowRepairResolver.java (line 50) resolving 2 responses
RowRepairResolver.java (line 76) versions merged
RowRepairResolver.java (line 85) resolve: 1 ms.
Table.java (line 337) applying mutation of row 666f6f&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If you run the read again you should see the RowRepairResolver log &amp;#8216;digests verified&amp;#8217; to say the data matched and there are no repairs to run.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>How are Memtables measured?</title>
   <link href="http://thelastpickle.com/2011/05/04/How-are-Memtables-measured"/>
   <updated>2011-05-04T00:00:00-07:00</updated>
   <id>http://thelastpickle.com/2011/05/04/How-are-Memtables-measured</id>
   <content type="html">&lt;p&gt;&lt;strong&gt;Updated:&lt;/strong&gt; I&amp;#8217;ve added information on the new &lt;code&gt;memtable_total_space_in_mb&lt;/code&gt; setting in version 0.8 and improved the information about &lt;code&gt;memtable_throughput&lt;/code&gt;. Thanks for the feedback.&lt;/p&gt;

&lt;p&gt;In version 0.7 of Cassandra there are 3 configuration settings that can trigger flushing a memtable to disk. Version 0.8 adds support for a global &lt;code&gt;memtable_total_space_in_mb&lt;/code&gt; which may replace the previous 3 settings.&lt;/p&gt;

&lt;p&gt;First the 0.7 settings.&lt;/p&gt;

&lt;h2 id='memtable_flush_after_minutes'&gt;memtable_flush_after (minutes)&lt;/h2&gt;

&lt;p&gt;This is the maximum number of minutes a memtable should stay in memory for if it has received writes. When the memtable is created the current time is recorded which is then checked every 10 seconds. If after the time span either the primary memtable for the Column Family or any secondary index memtables have received writes they are replaced and flushed to disk.&lt;/p&gt;

&lt;p&gt;Unfortunately as the &lt;a href='http://wiki.apache.org/cassandra/StorageConfiguration'&gt;wiki&lt;/a&gt; points out, there is a good reason to make this value small and a good reason to make it large.&lt;/p&gt;

&lt;p&gt;A log file cannot be deleted until all of the segments / records it contains have been marked as completed. This &lt;a href='http://thelastpickle.com/2011/04/28/Forces-of-Write-and-Read/'&gt;happens&lt;/a&gt; when a memtable is flushed to disk. Because the log file is shared by all Column Familes, one Column Family that has long living memtables can prevent log files from been deleted and their disk space freed up.&lt;/p&gt;

&lt;p&gt;However smaller values can cause multiple memtables to expire at the same time, prior to version 0.7 this could cause flush requests to block. Version 0.7 added &lt;a href='https://issues.apache.org/jira/browse/CASSANDRA-1099'&gt;memtable_flush_writers&lt;/a&gt; and &lt;a href='https://issues.apache.org/jira/browse/CASSANDRA-2333'&gt;memtable_flush_queue_size&lt;/a&gt; ,but it can still slow down the IO system (see conf/cassandra.yaml for more info). The best approach is to tune the other memtable thresholds to trigger when you want them, and leave this setting as a backup.&lt;/p&gt;

&lt;p&gt;The wiki recommends setting a default of 1440 minutes, or 24 hours which is also the default.&lt;/p&gt;

&lt;h2 id='memtable_operations_millions'&gt;memtable_operations (millions)&lt;/h2&gt;

&lt;p&gt;The Memtable tracks the number of operations applied to it by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;counting the number of top level columns (i.e. Columns for a Standard CF or Super Columns for a Super CF) in a mutation.&lt;/li&gt;

&lt;li&gt;considering a row level deletion as a single operation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deletions and Insertions are both considered a &lt;code&gt;mutation&lt;/code&gt;, inserting 3 columns increases the count by 3 and deleting 3 named columns increases the count by 3. Note that the number of sub columns in a Super Column is ignored. So inserting 10 sub columns into 1 super column increases the count by one, and deleting a super column by name that has 10 sub column increases the count by one.&lt;/p&gt;

&lt;p&gt;The operations threshold (and the size threshold) is checked before applying a mutation and the flush is not requested until after the mutation has completed. So it&amp;#8217;s possible for the memtable to contain more than &lt;code&gt;memtable_operations&lt;/code&gt; when it is flushed to disk.&lt;/p&gt;

&lt;p&gt;A different way to think about this setting, and &lt;code&gt;memtable_throughput&lt;/code&gt;, is as &lt;code&gt;sstable_min_operations&lt;/code&gt; and &lt;code&gt;sstable_min_bytes&lt;/code&gt;. In general operation new sstables are created after at least &lt;code&gt;sstable_min_operations&lt;/code&gt; operations have occurred or at most &lt;code&gt;sstable_min_bytes&lt;/code&gt; bytes will be written.&lt;/p&gt;

&lt;p&gt;If no value is provided when the CF is created it is set to the default memtable throughput in MB (below) / 64 * 0.3, so it&amp;#8217;s 300k ops per 64MB of throughput. If you have a CF that contains many small columns it&amp;#8217;s a good idea to look at the log entries for memtable flushes to see if the ops threshold is triggering early and causing small memtables to be frequently written.&lt;/p&gt;

&lt;h2 id='memtable_throughput'&gt;memtable_throughput&lt;/h2&gt;

&lt;p&gt;Throughput for the memtable is tracked and tested at the same time as the operation count. But counting the byte size of the data is more involved and depends on the type of the column. The size of the data when serialised to disk is counted as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;standard column byte size is&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;length of the key byte array plus 2 bytes to store the length&lt;/li&gt;

&lt;li&gt;1 byte to indicate if the column has been deleted&lt;/li&gt;

&lt;li&gt;8 bytes for the timestamp&lt;/li&gt;

&lt;li&gt;length of the value byte array plus 4 bytes to store the length&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;li&gt;
&lt;p&gt;expiring columns (those with a TTL) add another 8 bytes to the length of a standard column.&lt;/p&gt;
&lt;/li&gt;

&lt;li&gt;
&lt;p&gt;deleted columns (tombstones) are the same as standard columns but the value is always 4 bytes long.&lt;/p&gt;
&lt;/li&gt;

&lt;li&gt;
&lt;p&gt;counter columns (in Cassandra v0.8) add another 8 bytes to the length of a standard column. Note that for a counter column the value will always be an 8 byte long.&lt;/p&gt;
&lt;/li&gt;

&lt;li&gt;
&lt;p&gt;super columns sum the size of all contained columns and then add&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;length of the name byte array plus 2 bytes to store the length&lt;/li&gt;

&lt;li&gt;4 bytes to indicate when it was deleted&lt;/li&gt;

&lt;li&gt;8 bytes to store the timestamp for the deletion&lt;/li&gt;

&lt;li&gt;4 bytes to store the number of sub columns&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;(Currently the calculation for the super column only includes the sum of the sub columns. I think this needs to be changed.)&lt;/p&gt;

&lt;p&gt;A row deletion will add zero bytes to the throughput counter.&lt;/p&gt;

&lt;p&gt;The byte size of the mutation is always added to the counter, if one mutation replaces columns in the memtable their byte size &lt;em&gt;is not&lt;/em&gt; subtracted from the counter.&lt;/p&gt;

&lt;p&gt;Getting this setting wrong is a very easy way to run out of memory. From version 0.7 onwards the worse case scenario is up to CF Count + Secondary Index Count + &lt;code&gt;memtable_flush_queue_size&lt;/code&gt; (defaults to 4) + &lt;code&gt;memtable_flush_writers&lt;/code&gt; (defaults to 1 per data directory) memtables in memory the JVM at once. It&amp;#8217;s best to be conservative, follow the &lt;a href='http://wiki.apache.org/cassandra/MemtableThresholds'&gt;wiki advice&lt;/a&gt; and consider that the JVM may take up to 10 times as much memory as it takes to serialise the data to disk.&lt;/p&gt;

&lt;p&gt;And that&amp;#8217;s the problem with this threshold. It&amp;#8217;s &lt;strong&gt;not&lt;/strong&gt; measuring how much memory a memtable is using in the JVM Heap, it&amp;#8217;s measuring the maximum amount of bytes it could take to serialise the data (excluding the index and bloom filter) to disk. Which makes it a difficult knob to use when tuning how much memory Cassandra uses.&lt;/p&gt;

&lt;p&gt;If a value is not provided when the Column Family is created it will default to 1/16th the maximum size of the JVM Heap at the time. This value stored with the Column Family meta data and will not change again. Typical values are around 128MB to 256MB.&lt;/p&gt;

&lt;h2 id='memtable_total_space_in_mb'&gt;memtable_total_space_in_mb&lt;/h2&gt;

&lt;p&gt;Version 0.8 &lt;a href='https://issues.apache.org/jira/browse/CASSANDRA-2006'&gt;adds&lt;/a&gt; the per node &lt;code&gt;memtable_total_space_in_mb&lt;/code&gt; setting which makes life easier and may eventually &lt;a href='https://issues.apache.org/jira/browse/CASSANDRA-2449'&gt;replace&lt;/a&gt; the 3 previous settings. While it&amp;#8217;s fun to play with the per CF settings, it can also be a pain when building real systems that need to stay up.&lt;/p&gt;

&lt;p&gt;If no value is set in &lt;code&gt;conf/cassandra.yaml&lt;/code&gt; the setting will default to one third of the JVM max Heap size. If it is set to zero the setting is disabled and only the old per CF thresholds will be used. If the global setting is enabled and there are per CF settings &lt;strong&gt;both&lt;/strong&gt; of them will be used.&lt;/p&gt;

&lt;p&gt;There are two parts to the global memtable size, measuring the real memory usage of the memtable and flushing. First the measuring.&lt;/p&gt;

&lt;p&gt;Rather than track every byte allocated the server periodically works out the ratio between the throughput as measured above and the real in memory bytes as measured by JVM. The in memory byte count is worked out using the &lt;a href='http://download.oracle.com/javase/6/docs/api/java/lang/instrument/package-summary.html'&gt;Instrumentation Java Package&lt;/a&gt; and code from &lt;a href='https://github.com/jbellis/jamm'&gt;Jonathan Ellis&lt;/a&gt;. After a mutation has been applied to the memtable, but before a flush is requested, Cassandra calculates the &amp;#8220;Live Ratio&amp;#8221; if more than twice as many operations (as calculated above) have been processed since the last time it was calculated.&lt;/p&gt;

&lt;p&gt;Measuring the Live Ratio is done asynchronously and involves measuring the real memory size of all the keys, super columns and columns in the memtable and dividing it by the throughput as measured above. For sanity the ratio is clamped between 1.0 and 64.0, if the value is outside of this range a &lt;code&gt;WARN&lt;/code&gt; level log message will let you know. Finally the ratio for the Column Family is updated to the new ratio if and only if the new ratio is higher than the previous one. An &lt;code&gt;INFO&lt;/code&gt; level message will let you know when the ratio is calculated, how long it took and if it changed.&lt;/p&gt;

&lt;p&gt;Next the &lt;code&gt;MeteredFlusher&lt;/code&gt; runs every second and uses a two phase approach to keeping the live memory use under the setting. First it looks at the total live bytes for each Column Family, including it&amp;#8217;s secondary indexes, and flushes CF&amp;#8217;s that could potentially fill the memory if &lt;a href='https://issues.apache.org/jira/browse/CASSANDRA-2006?focusedCommentId=13010860&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13010860'&gt;allowed to create&lt;/a&gt; memtables of this size. Live bytes are calculated by multiplying the throughput as perviously measured by the Live Ratio. The Flusher considers the Column Family to be using too much memory if it&amp;#8217;s current live size is more than &lt;code&gt;memtable_total_space_in_mb&lt;/code&gt; divided by the maximum number of memtables the Column Family could have in memory. The calculation for this is similar to the one presented above for &lt;code&gt;memtable_throughput&lt;/code&gt; but it includes secondary indexes and a fudge factor that takes into account how the live size is measured.&lt;/p&gt;

&lt;p&gt;For example if &lt;code&gt;memtable_total_space_in_mb&lt;/code&gt; is 100MB, and &lt;code&gt;memtable_flush_writers&lt;/code&gt; is the default 1 (with one data directory), and &lt;code&gt;memtable_flush_queue_size&lt;/code&gt; is the default 4, and a Column Family has no secondary indexes. The CF will not be allowed to get above one seventh of 100MB or 14MB, as if the CF filled the flush pipeline with 7 memtables of this size it would take 98MB. At a more sensible 2GB for &lt;code&gt;memtable_total_space_in_mb&lt;/code&gt; (1/3 of a 6GB JVM Heap) the CF will be flushed if it is using 292MB of live memory.&lt;/p&gt;

&lt;p&gt;(I&amp;#8217;ve skipped a couple of things here such as considering the bytes currently been flushed.)&lt;/p&gt;

&lt;p&gt;The flusher process will end there if the number of bytes that were flushing when it started plus the bytes for all the CF&amp;#8217;s that were not flushed in the first phase is less than &lt;code&gt;memtable_total_space_in_mb&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The second phase flushes the CF&amp;#8217;s in order of largest to smallest until the total live size (including the bytes currently been flushed) gets down below the target setting.&lt;/p&gt;

&lt;p&gt;This new setting (and the existing &lt;code&gt;flush_largest_memtables_at&lt;/code&gt;) should make it harder to shot yourself in the foot with memory management and easier for new users to feel comfortable with the server.&lt;/p&gt;

&lt;h2 id='in_motion'&gt;In Motion&lt;/h2&gt;

&lt;p&gt;You can check the per CF thresholds as well as the current tracked values for a memtable using &lt;code&gt;bin/nodetool&lt;/code&gt;, &lt;code&gt;bin/cassandra-cli&lt;/code&gt; or JConsole. I&amp;#8217;m not aware of any current features to check the &lt;code&gt;Live Ratio&lt;/code&gt; or &lt;code&gt;Live Size&lt;/code&gt; of a CF.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;bin/nodetool cfstats&lt;/code&gt; can tell you the current operation count (&amp;#8216;Memtable Columns Count&amp;#8217;) and throughput (&amp;#8216;Memtable Data Size&amp;#8217;):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ./bin/nodetool -h localhost cfstats
Keyspace: dev
    Read Count: 1
    Read Latency: 0.897 ms.
    Write Count: 2
    Write Latency: 0.051 ms.
    Pending Tasks: 0
        Column Family: data
        SSTable count: 2
        Space used (live): 9530
        Space used (total): 9530
        Memtable Columns Count: 1
        Memtable Data Size: 26
        Memtable Switch Count: 1
        Read Count: 1
        Read Latency: 0.897 ms.
        Write Count: 2
        Write Latency: 0.020 ms.
        Pending Tasks: 0
        Key cache capacity: 200000
        Key cache size: 2
        Key cache hit rate: 0.0
        Row cache: disabled
        Compacted row minimum size: 51
        Compacted row maximum size: 86
        Compacted row mean size: 73&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&amp;#8216;bin/cassandra-cli&amp;#8217; can tell you the current thresholds using either &lt;code&gt;describe keyspace&lt;/code&gt; or &lt;code&gt;show keyspaces&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[default@dev] describe keyspace;
Keyspace: dev:
  Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
    Options: [datacenter1:2]
  Column Families:
    ColumnFamily: data
      Key Validation Class: org.apache.cassandra.db.marshal.BytesType
      Default column value validator: org.apache.cassandra.db.marshal.BytesType
      Columns sorted by: org.apache.cassandra.db.marshal.AsciiType
      Row cache size / save period in seconds: 0.0/0
      Key cache size / save period in seconds: 200000.0/14400
      Memtable thresholds: 0.29062499999999997/62/1440 (millions of ops/MB/minutes)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Or using JConsole connect to the server, select MBeans and then navigate to org.apacge.cassandra.db.ColumnFamilies.&amp;#60;your-keyspace&amp;#62;.&amp;#60;your-column-family&amp;#62;. There you can find the current thresholds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MemtableFlushAfterMins&lt;/li&gt;

&lt;li&gt;MemtableOperationsInMillions&lt;/li&gt;

&lt;li&gt;MemtableThroughputInMB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the running values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MemtableColumnCount&lt;/li&gt;

&lt;li&gt;MemtableDataSize&lt;/li&gt;
&lt;/ul&gt;</content>
 </entry>
 
 <entry>
   <title>The forces of Write and Read</title>
   <link href="http://thelastpickle.com/2011/04/28/Forces-of-Write-and-Read"/>
   <updated>2011-04-28T00:00:00-07:00</updated>
   <id>http://thelastpickle.com/2011/04/28/Forces-of-Write-and-Read</id>
   <content type="html">&lt;p&gt;Requests that write and read data in Cassandra, like any data base, have competing characteristics that need to be balanced. This post compares the approach taken by Cassandra to traditional Relation Database Systems.&lt;/p&gt;

&lt;p&gt;When writing data the most efficient approach to splat it somewhere on disk with the minimum of fuss. Preferably at the end of a file, or even in a new file. The write is going to take longer if the data is carefully placed in the correct ordered position in an existing file.&lt;/p&gt;

&lt;p&gt;Problem is reading data is going to be much more efficient if the data is in the correct ordered location, and preferably in just one location. If the data is unordered and spread out in a file, or multiple files, the read is going to take longer.&lt;/p&gt;

&lt;p&gt;Traditional Relational databases such as SQL Server are optimised for read requests, so write requests ensure that data is written to the correct ordered location. However the &lt;a href='http://en.wikipedia.org/wiki/Database_log'&gt;transaction log / write ahead log&lt;/a&gt; lets the database delay writing the data to the correct location on disk, providing a handy performance boost while still supporting &lt;a href='http://en.wikipedia.org/wiki/ACID'&gt;ACID&lt;/a&gt; transactions.&lt;/p&gt;

&lt;h2 id='roughly_speaking_write_requests'&gt;Roughly speaking Write requests&lt;/h2&gt;

&lt;p&gt;(very) Roughly speaking a write request in a RDBMS follows this path:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Append the write request to the end of the transaction log.&lt;/li&gt;

&lt;li&gt;Locate the rows modified by the request or those adjacent to new rows, by reading index, data or other &lt;a href='http://msdn.microsoft.com/en-us/library/ms190969.aspx'&gt;pages&lt;/a&gt; from disk if not already in memory.&lt;/li&gt;

&lt;li&gt;Modify the index and data pages in memory to insert or update rows.&lt;/li&gt;

&lt;li&gt;Acknowledge the write to the client.&lt;/li&gt;

&lt;li&gt;&lt;a href='http://en.wikipedia.org/wiki/Transaction_checkpoint#Recovery_process'&gt;Checkpoint&lt;/a&gt; the modified pages by flushing them to disk in the correct ordered location, and mark the log records associated with the changes as no longer needed. The Checkpoint process is typically executed asynchronously (&lt;a href='http://msdn.microsoft.com/en-us/library/ms189573.aspx'&gt;SQL Server&lt;/a&gt; &lt;a href='http://dev.mysql.com/doc/refman/5.0/en/innodb-checkpoints.html'&gt;MySQL&lt;/a&gt;).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;(I think in reality the log record is written once a statement completes inserting/updating an individual row. Also a record needs to be written to say if the transaction was committed or aborted, so that during log file recovery the system knows if the transaction should be &lt;a href='http://en.wikipedia.org/wiki/Transaction_processing#Rollforward'&gt;replayed&lt;/a&gt; or &lt;a href='http://en.wikipedia.org/wiki/Transaction_processing#Rollback'&gt;ignored&lt;/a&gt;. I&amp;#8217;m also ignoring all the locking.)&lt;/p&gt;

&lt;p&gt;The write ahead log lets the write request dump the data &lt;em&gt;somewhere&lt;/em&gt; on disk fast, then work on structures which are &lt;em&gt;hopefully&lt;/em&gt; in memory. For updates care is taken to check for existing data, which is then read and mutated in memory. The read request ignores the transaction log, it reads the single source of truth from the correctly ordered index and data pages that are hopefully still in memory.&lt;/p&gt;

&lt;p&gt;If the RDBMS uses &lt;a href='http://en.wikipedia.org/wiki/Multiversion_concurrency_control'&gt;Multi Version Concurrency Control&lt;/a&gt; life is a bit more difficult for the read request, so lets ignore that for now.&lt;/p&gt;

&lt;p&gt;On the other hand Cassandra is optimised for write requests, so read requests need to do more work to ensure they have the correct data.&lt;/p&gt;

&lt;p&gt;Again roughly speaking a write request for Cassandra follows the path:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Appended the write to the end of the write ahead log.&lt;/li&gt;

&lt;li&gt;Modify an in memory structure called a &lt;a href='http://wiki.apache.org/cassandra/MemtableSSTable'&gt;Memtable&lt;/a&gt; to store the row changes present in the request.&lt;/li&gt;

&lt;li&gt;Queue the Memtable to be flushed to disk (in the background) if per Memtable &lt;a href='http://wiki.apache.org/cassandra/MemtableThresholds'&gt;thresholds&lt;/a&gt; on operation count, data size or time to be violated. A new empty Memtable is immediately put in it&amp;#8217;s place.&lt;/li&gt;

&lt;li&gt;Acknowledge the write to the client.&lt;/li&gt;

&lt;li&gt;Flush the Memtable to disk by creating a new &lt;a href='http://wiki.apache.org/cassandra/ArchitectureSSTable'&gt;SSTable (Sorted String Table)&lt;/a&gt; on disk, and mark the log records as no longer needed. Typically the flush task is executed asynchronously.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;(This is the write process on a single node, I&amp;#8217;m ignoring how the write is applied to the cluster and other features required for &amp;#8216;Eventual Consistency&amp;#8217;.)&lt;/p&gt;

&lt;p&gt;Again the write ahead log lets the write process dump the data &lt;em&gt;somewhere&lt;/em&gt; on disk fast, and then work on structures which are &lt;em&gt;guaranteed&lt;/em&gt; to be in memory. No effort is taken to read the existing data, if a row was updated it may now partially exist in memory and in multiple SSTables on disk.&lt;/p&gt;

&lt;p&gt;It&amp;#8217;s probably important to point out that the RDBMS Checkpoint process works with an &lt;em&gt;extent&lt;/em&gt; which defines it&amp;#8217;s on disk allocation unit. Sql Server uses &lt;a href='http://msdn.microsoft.com/en-us/library/ms190969.aspx'&gt;extents&lt;/a&gt; that contain 8 8KB pages. MySQL uses a &lt;a href='http://dev.mysql.com/doc/refman/5.1/en/create-tablespace.html'&gt;default EXTENT_SIZE&lt;/a&gt; of 1MB for the extent, and the InnoDB engine has a &lt;a href='http://dev.mysql.com/doc/refman/5.0/en/innodb-restrictions.html'&gt;hard coded&lt;/a&gt; 16KB page size. Once one extent has been written it may be necessary to seek to anther random part of the disk to write to next extent. Though I&amp;#8217;m sure they do all sorts of clever things to make it nice and fast, this is the basic unit they deal with.&lt;/p&gt;

&lt;p&gt;In Cassandra the maximum Memtable size is configured per &lt;a href='http://www.datastax.com/docs/0.7/data_model/column_families'&gt;Column Family&lt;/a&gt; (sort of like a table but I dislike that analogy). Memory thresholds of 128MB, 256MB or higher are &lt;a href='http://wiki.apache.org/cassandra/MemtableThresholds'&gt;common&lt;/a&gt;, though Memtables may flush at lower sizes due violating other thresholds or in response to certain system events. The Flush task is able to write the entire contents of the Memtable to disk as a new file without having to perform random IO seeks.&lt;/p&gt;

&lt;h2 id='dear_readers'&gt;Dear Readers&lt;/h2&gt;

&lt;p&gt;In Cassandra read requests potentially have a lot more to do that those in a traditional RDBMS. A typical read request in a RDBMS to get a row by Primary Key seeks through the nicely ordered &lt;a href='http://en.wikipedia.org/wiki/B_tree'&gt;b-tree&lt;/a&gt; index pages to find the right data page, scans the page to find the appropriate row and reads the complete row (ignoring off page large data). Reading the index and data pages may require disk access, and a &lt;a href='http://dev.mysql.com/doc/refman/5.1/en/query-cache.html'&gt;query cache&lt;/a&gt; may provide direct access to the results of a previous query without needing to negotiate the index.&lt;/p&gt;

&lt;p&gt;Cassandra has two caches that can expedite a read request, a key cache (discussed below) and a &lt;a href='http://www.datastax.com/dev/blog/maximizing-cache-benefit-with-cassandra'&gt;row cache&lt;/a&gt;. The row cache contains all the current data for the row, and a query that needs to read any column from the row can be fully resolved by reading the cache.&lt;/p&gt;

&lt;p&gt;If the requested row cannot be served from the row cache it must be read from disk, however the write path may have left fragments of the row in various SSTables. The read request follows the path:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check the in memory &lt;a href='http://en.wikipedia.org/wiki/Bloom_filter'&gt;Bloom Filter&lt;/a&gt; for each SSTable on disk, to build a list of candidate SSTables where the row &lt;em&gt;may&lt;/em&gt; exist.&lt;/li&gt;

&lt;li&gt;If enabled probe the key cache for each candidate SSTable from step 1 to get the position of the row in the data file. The key cache may hold only a sample of keys from the file so misses are possible.&lt;/li&gt;

&lt;li&gt;If the key cache missed, probe the SSTable index summary held in memory which contains a regular sampling of the keys (and their positions) in the SSTable index file to find the preceding (sampled) key. Seek to the position of the preceding sampled key in the SSTable index file, and then scan the index file until the requested key and it&amp;#8217;s data file position is found.&lt;/li&gt;

&lt;li&gt;Seek to the row position in the SSTable data file and then scan the row data to read the columns that match the read request.&lt;/li&gt;

&lt;li&gt;Reduce (potentially) multiple values for each column provided by each SSTable and the current Memtable to a single value by considering column &lt;a href='http://wiki.apache.org/cassandra/API?highlight=%28timestamp%29'&gt;timestamps&lt;/a&gt;, Time To Live settings and &lt;a href='http://wiki.apache.org/cassandra/DistributedDeletes'&gt;deletes&lt;/a&gt;.&lt;/li&gt;

&lt;li&gt;Return the result to the client.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;(This is the read path on a single node, I&amp;#8217;m ignoring how the read is processed by the cluster, features required for &amp;#8216;Eventual Consistency&amp;#8217;, and have not mentioned that memory mapped file access is used by default on 64 bit platforms.)&lt;/p&gt;

&lt;p&gt;The read request must piece together the row from potentially many files, which may each involve random IO. Each SSTable which contains any columns for a row must be read to determine if it contains columns that match the request criteria. Hot rows may be present in present in memory in the form of Cassandra key and row caches or OS caches.&lt;/p&gt;

&lt;p&gt;Rows which receive many writes (including overwriting the same column) over time have the potential to perform worse than those that are written to within a short time span. Multiple writes to the same row and column are reconciled in memory by the Memtable so that only the current values for a row are written to disk in the SSTable. The &lt;a href='http://wiki.apache.org/cassandra/MemtableSSTable'&gt;Compaction Process&lt;/a&gt; provides an opposing force to the propensity of write requests to spread data out over multiple files.&lt;/p&gt;

&lt;p&gt;With all this going on it&amp;#8217;s still possible to get millisecond or better read performance. The import thing to note is that it&amp;#8217;s possible to have poorly performing reads due to spread out rows.&lt;/p&gt;

&lt;h2 id='compacting_for_a_sustainable_future'&gt;Compacting for a sustainable future&lt;/h2&gt;

&lt;p&gt;The &lt;a href='http://wiki.apache.org/cassandra/MemtableSSTable'&gt;Compaction Process&lt;/a&gt; is there to help the read requests by reducing the number of SSTables they may have visit. Compaction is also responsible for finalising &lt;a href='http://wiki.apache.org/cassandra/DistributedDeletes'&gt;Distributed Deletes&lt;/a&gt; and Time To Live, but I will ignore those for now.&lt;/p&gt;

&lt;p&gt;Major Compactions are triggered manually via the &lt;a href='http://wiki.apache.org/cassandra/NodeTool'&gt;nodetool&lt;/a&gt; utility and compact all SSTables for a Column Family into one file. Owing to the way compaction chooses which files to process this may result in the new file not been compacted for a very long time. As a result Major compactions are no longer recommended as Minor compactions can do the same things. I&amp;#8217;ve mentioned them here just for completeness.&lt;/p&gt;

&lt;p&gt;Cassandra checks to see if a Minor compaction is needed whenever an SSTable is written to disk. The process groups the files into buckets where every file in a bucket is within 50% of the average size of files in the bucket, small files (less than 50MB) are put in the first bucket. If the bucket contains more than than the Column Family defined min_compaction_threshold files the compaction process will compact up to max_compaction_threshold files together.&lt;/p&gt;

&lt;p&gt;During compaction the row fragments from the SSTables are reconciled to create a single row. When merging columns the one with the highest timestamp is used, if the timestamps are equal the column value is used as a tie breaker.&lt;/p&gt;

&lt;p&gt;The result is a single SSTable that contains an aggregated view of the row that was present in the mulitple input SStables. And a shorter path for any read request that previously needed to potentially read from several SSTables.&lt;/p&gt;

&lt;h2 id='in_motion'&gt;In motion&lt;/h2&gt;

&lt;p&gt;It&amp;#8217;s reasonably easy to see this process happening in slow motion by constructing a contrived schema in Cassandra and playing with some of the command line tools.&lt;/p&gt;

&lt;p&gt;This schema creates two Column Families. The OneOp CF sets the Memtable to flush to disk after just one operation is written, and FiveOp after 5 operations (&lt;code&gt;memtable_operations&lt;/code&gt; is expressed in millions of operations). The &lt;code&gt;min_compaction_threshold&lt;/code&gt; tells compaction to start after we see 4 files. The script then inserts some sample data, all against the same row and the same columns to demonstrate overwrites.&lt;/p&gt;

&lt;p&gt;Note that the Memtable thresholds are checked before the operation starts, but it is not flushed to disk until after the operation completes. For writes it counts the each columns in the request as an operation. In our case it means the SSTables will be created with after 2 or 6 operations rather than the expected 1 or 5.&lt;/p&gt;

&lt;p&gt;Start the cassandra cli and paste the script into the cli.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;create keyspace ReadWrite;

use ReadWrite;

create column family OneOp 
  with memtable_operations = 0.000001
  and min_compaction_threshold = 4;
create column family FiveOp 
  with memtable_operations = 0.000005
  and min_compaction_threshold = 4;

set OneOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz1&amp;#39;;
set OneOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz2&amp;#39;;
set OneOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz3&amp;#39;;
set OneOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz3&amp;#39;;
set OneOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz1&amp;#39;;
set OneOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz1&amp;#39;;


set FiveOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz&amp;#39;;
set FiveOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz&amp;#39;;
set FiveOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz&amp;#39;;
set FiveOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz&amp;#39;;
set FiveOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz&amp;#39;;
set FiveOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz&amp;#39;;
set FiveOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz&amp;#39;;
set FiveOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz&amp;#39;;
set FiveOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz&amp;#39;;
set FiveOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz&amp;#39;;
set FiveOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz&amp;#39;;
set FiveOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz&amp;#39;;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The data directory for the ReadWrite keyspace should now look like.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;/var/lib/cassandra/data/ReadWrite:
total 200
drwxr-xr-x  22 aaron  wheel   748B 27 Apr 10:42 .
drwxr-xr-x   4 aaron  wheel   136B 27 Apr 10:42 ..
-rw-r--r--   1 aaron  wheel    75B 27 Apr 10:42 FiveOp-f-1-Data.db
-rw-r--r--   1 aaron  wheel    16B 27 Apr 10:42 FiveOp-f-1-Filter.db
-rw-r--r--   1 aaron  wheel    14B 27 Apr 10:42 FiveOp-f-1-Index.db
-rw-r--r--   1 aaron  wheel   4.2K 27 Apr 10:42 FiveOp-f-1-Statistics.db
-rw-r--r--   1 aaron  wheel    75B 27 Apr 10:42 FiveOp-f-2-Data.db
-rw-r--r--   1 aaron  wheel    16B 27 Apr 10:42 FiveOp-f-2-Filter.db
-rw-r--r--   1 aaron  wheel    14B 27 Apr 10:42 FiveOp-f-2-Index.db
-rw-r--r--   1 aaron  wheel   4.2K 27 Apr 10:42 FiveOp-f-2-Statistics.db
-rw-r--r--   1 aaron  wheel    76B 27 Apr 10:42 OneOp-f-1-Data.db
-rw-r--r--   1 aaron  wheel    16B 27 Apr 10:42 OneOp-f-1-Filter.db
-rw-r--r--   1 aaron  wheel    14B 27 Apr 10:42 OneOp-f-1-Index.db
-rw-r--r--   1 aaron  wheel   4.2K 27 Apr 10:42 OneOp-f-1-Statistics.db
-rw-r--r--   1 aaron  wheel    76B 27 Apr 10:42 OneOp-f-2-Data.db
-rw-r--r--   1 aaron  wheel    16B 27 Apr 10:42 OneOp-f-2-Filter.db
-rw-r--r--   1 aaron  wheel    14B 27 Apr 10:42 OneOp-f-2-Index.db
-rw-r--r--   1 aaron  wheel   4.2K 27 Apr 10:42 OneOp-f-2-Statistics.db
-rw-r--r--   1 aaron  wheel    76B 27 Apr 10:42 OneOp-f-3-Data.db
-rw-r--r--   1 aaron  wheel    16B 27 Apr 10:42 OneOp-f-3-Filter.db
-rw-r--r--   1 aaron  wheel    14B 27 Apr 10:42 OneOp-f-3-Index.db
-rw-r--r--   1 aaron  wheel   4.2K 27 Apr 10:42 OneOp-f-3-Statistics.db&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;There are 3 SSTables for OneOp and 2 for FiveOp, and all of them contain the a fragment of the &amp;#8216;foo1&amp;#8217; row.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;nodetool cfhistograms&lt;/code&gt; utility shows recent statistics for requests against a Column Family, where recent means &amp;#8220;since the last time it was run&amp;#8221;. The same statistics are also available via the org.apache.cassandra.db.ColumnFamilies.&amp;#60;keyspace&amp;#62;.&amp;#60;column_family&amp;#62; MBean in JConsole.&lt;/p&gt;

&lt;p&gt;Clear the stats for both Column Families on the command line.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ bin/nodetool -h localhost cfhistograms ReadWrite OneOp
$ bin/nodetool -h localhost cfhistograms ReadWrite FiveOp&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now execute two reads via the cli.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[default@ReadWrite] get OneOp[&amp;#39;foo1&amp;#39;];
=&amp;gt; (column=626172, value=62617a31, timestamp=1303875486224000)
Returned 1 results.
[default@ReadWrite] get FiveOp[&amp;#39;foo1&amp;#39;];
=&amp;gt; (column=626172, value=62617a, timestamp=1303875486248000)
Returned 1 results.
[default@ReadWrite] &lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Check the stats for the OneOp Column Family.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ bin/nodetool -h localhost cfhistograms ReadWrite OneOp
ReadWrite/OneOp histograms
Offset      SSTables     Write Latency      Read Latency          Row Size      Column Count
1                  0                 0                 0                 0                 3
2                  0                 0                 0                 0                 0
3                  1                 0                 0                 0                 0
4                  0                 0                 0                 0                 0&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The histogram is saying that 1 request used 3 SSTables, and there are an estimated 3 columns in all the SSTables the Column Family is tracking. The Column Family should be tracking 3 SSTables, and each should have only one column for the &amp;#8216;foo1&amp;#8217; row. The insert script overwrote the same column and the Memtable absorbed one overwrite before been flushed to disk.&lt;/p&gt;

&lt;p&gt;The stats for the FiveOp Column Family also make sense.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ bin/nodetool -h localhost cfhistograms ReadWrite FiveOp
ReadWrite/FiveOp histograms
Offset      SSTables     Write Latency      Read Latency          Row Size      Column Count
1                  0                 0                 0                 0                 2
2                  1                 0                 0                 0                 0&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;There are only 2 SSTables for the Column Family, both contain a fragment of the &amp;#8216;foo1&amp;#8217; row and both contain only one column.&lt;/p&gt;

&lt;p&gt;A minor compaction on the OneOp Column Family can be triggered by adding just two more &lt;a href='http://www.youtube.com/watch?v=rXH_12QWWg8'&gt;wafer thin&lt;/a&gt; rows via the cli.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[default@ReadWrite] set OneOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz1&amp;#39;;
Value inserted.
[default@ReadWrite] set OneOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz2&amp;#39;;
Value inserted.
[default@ReadWrite]&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This should trigger a minor compaction as there is now 4 SSTables of similar size. The data directory should now look something like this.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ls -l /var/lib/cassandra/data/ReadWrite/
total 280
-rw-r--r--  1 aaron  wheel    75 27 Apr 15:38 FiveOp-f-1-Data.db
-rw-r--r--  1 aaron  wheel    16 27 Apr 15:38 FiveOp-f-1-Filter.db
-rw-r--r--  1 aaron  wheel    14 27 Apr 15:38 FiveOp-f-1-Index.db
-rw-r--r--  1 aaron  wheel  4264 27 Apr 15:38 FiveOp-f-1-Statistics.db
-rw-r--r--  1 aaron  wheel    75 27 Apr 15:38 FiveOp-f-2-Data.db
-rw-r--r--  1 aaron  wheel    16 27 Apr 15:38 FiveOp-f-2-Filter.db
-rw-r--r--  1 aaron  wheel    14 27 Apr 15:38 FiveOp-f-2-Index.db
-rw-r--r--  1 aaron  wheel  4264 27 Apr 15:38 FiveOp-f-2-Statistics.db
-rw-r--r--  1 aaron  wheel     0 27 Apr 16:16 OneOp-f-1-Compacted
-rw-r--r--  1 aaron  wheel    76 27 Apr 15:38 OneOp-f-1-Data.db
-rw-r--r--  1 aaron  wheel    16 27 Apr 15:38 OneOp-f-1-Filter.db
-rw-r--r--  1 aaron  wheel    14 27 Apr 15:38 OneOp-f-1-Index.db
-rw-r--r--  1 aaron  wheel  4264 27 Apr 15:38 OneOp-f-1-Statistics.db
-rw-r--r--  1 aaron  wheel     0 27 Apr 16:16 OneOp-f-2-Compacted
-rw-r--r--  1 aaron  wheel    76 27 Apr 15:38 OneOp-f-2-Data.db
-rw-r--r--  1 aaron  wheel    16 27 Apr 15:38 OneOp-f-2-Filter.db
-rw-r--r--  1 aaron  wheel    14 27 Apr 15:38 OneOp-f-2-Index.db
-rw-r--r--  1 aaron  wheel  4264 27 Apr 15:38 OneOp-f-2-Statistics.db
-rw-r--r--  1 aaron  wheel     0 27 Apr 16:16 OneOp-f-3-Compacted
-rw-r--r--  1 aaron  wheel    76 27 Apr 15:38 OneOp-f-3-Data.db
-rw-r--r--  1 aaron  wheel    16 27 Apr 15:38 OneOp-f-3-Filter.db
-rw-r--r--  1 aaron  wheel    14 27 Apr 15:38 OneOp-f-3-Index.db
-rw-r--r--  1 aaron  wheel  4264 27 Apr 15:38 OneOp-f-3-Statistics.db
-rw-r--r--  1 aaron  wheel     0 27 Apr 16:16 OneOp-f-4-Compacted
-rw-r--r--  1 aaron  wheel    76 27 Apr 16:16 OneOp-f-4-Data.db
-rw-r--r--  1 aaron  wheel    16 27 Apr 16:16 OneOp-f-4-Filter.db
-rw-r--r--  1 aaron  wheel    14 27 Apr 16:16 OneOp-f-4-Index.db
-rw-r--r--  1 aaron  wheel  4264 27 Apr 16:16 OneOp-f-4-Statistics.db
-rw-r--r--  1 aaron  wheel    76 27 Apr 16:16 OneOp-f-5-Data.db
-rw-r--r--  1 aaron  wheel  1936 27 Apr 16:16 OneOp-f-5-Filter.db
-rw-r--r--  1 aaron  wheel    14 27 Apr 16:16 OneOp-f-5-Index.db
-rw-r--r--  1 aaron  wheel  4264 27 Apr 16:16 OneOp-f-5-Statistics.db&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The SSTables for OneOp numbered 1 through 4 have been compacted, this is tracked in the server and a compacted marker file (e.g. &amp;#8220;OneOp-f-4-Compacted&amp;#8221;) is written to disk to preserve the information across system restarts. The unused files will be physically deleted during &lt;a href='http://wiki.apache.org/cassandra/ArchitectureInternals?highlight=%28delete%29'&gt;JVM Garbage Collection&lt;/a&gt;. If Cassandra detects it is low on disk space when about to write data to disk, it will trigger GC in an effort to reclaim unused space.&lt;/p&gt;

&lt;p&gt;There is also a new SSTable &lt;code&gt;OneOp-f-5-Data.db&lt;/code&gt; that contains the single reconciled row from the other 4 SSTables. A read request against that row should now only use one SSTable:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[default@ReadWrite] get OneOp[&amp;#39;foo1&amp;#39;];
=&amp;gt; (column=626172, value=62617a32, timestamp=1303877811057000)
Returned 1 results.
[default@ReadWrite] &lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Check the stats.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ bin/nodetool -h localhost cfhistograms ReadWrite OneOp
ReadWrite/OneOp histograms
Offset      SSTables     Write Latency      Read Latency          Row Size      Column Count
1                  1                 0                 0                 0                 1
2                  0                 0                 0                 0                 0
3                  0                 0                 0                 0                 0
4                  0                 0                 0                 0                 0&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The request used one SSTable and the Column Family has only 1 column in all the SSTables it is tracking.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Introduction to Cassandra</title>
   <link href="http://thelastpickle.com/2011/02/07/Introduction-to-Cassandra"/>
   <updated>2011-02-07T00:00:00-08:00</updated>
   <id>http://thelastpickle.com/2011/02/07/Introduction-to-Cassandra</id>
   <content type="html">&lt;p&gt;I&amp;#8217;ve tidied up a previous Introduction to Cassandra presentation with better diagrams and improved explanations. It&amp;#8217;s available as &lt;a href='/files/2011-02-07-introduction-to-cassandra/introduction-to-cassandra.key'&gt;Keynote&lt;/a&gt;, &lt;a href='/files/2011-02-07-introduction-to-cassandra/introduction-to-cassandra.pdf'&gt;PDF&lt;/a&gt; or plain old &lt;a href='/files/2011-02-07-introduction-to-cassandra/introduction-to-cassandra.html'&gt;web pages&lt;/a&gt;. It&amp;#8217;s a basic introduction to the way Cassandra works as a clustered system. It covers Partitioning, Replication, Hinted Handoff, Read Repair and Anti Entrophy. It does not cover the data model.&lt;/p&gt;

&lt;p&gt;Presentation classic is still &lt;a href='http://www.slideshare.net/aaronmorton/well-railedcassandra24112010-5901169'&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you have any questions or suggestions let me know. You can also &lt;a href='http://creativecommons.org/licenses/by-nc/3.0/nz/deed.en'&gt;use it&lt;/a&gt; as a starting point if you are creating a Cassandra presentation of your own.&lt;/p&gt;</content>
 </entry>
 
 
</feed>
