There are many knobs to turn in Apache Cassandra. Finding the right value for all of them is hard. Yet even with all values finely tuned, unexpected things happen. In this post we will see how
gc_grace_seconds can break the promises of the Hinted Handoff.
gc_grace_seconds defines the time Cassandra keeps tombstones around. Tombstones are special values Cassandra writes instead of the actual data whenever the data is deleted or its TTL expires. Alain covered tombstones in detail in a previous post, About Deletes and Tombstones in Cassandra.
Hinted Handoff is one of the data consistency mechanisms built into Cassandra. When a node in the cluster goes down, the remaining nodes save mutations (writes) for this node locally as hints. They keep doing so for the period given by the
max_hint_window_ms setting in cassandra.yaml. Once this window expires, nodes will stop saving hints.
Prior to version 3.0, Cassandra stored hints in the
system.hints table. In version 3.0 and later, hint storage received a redesign and Cassandra now stores hints in flat files on disk. Regardless of the version, hints have an expiration time. There is a slight difference in how Cassandra sets the hint expiration time depending on the version:
- Up until 2.2, Cassandra picks the smaller of gc_grace_seconds (a table property) and cassandra.maxHintTTL (a JVM runtime argument). This value then becomes the TTL of the row in the system.hints table.
- In 3.0 and later, Cassandra uses only gc_grace_seconds for hint expiration. This value goes into the hint file together with the hint.
When a node replays hints, it sends out only those that have not yet expired. Moreover, the hint data itself must still be live, meaning the TTL of the data, if used, has not yet expired.
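The expiration rules above can be sketched in a few lines of Python. This is an illustrative model of the behavior described in this post, not Cassandra's actual code: the helper names `hint_ttl_seconds` and `is_replayable`, the version tuple, and the assumption that a hint and its data are written at the same moment are all made up for the example.

```python
from typing import Optional, Tuple


def hint_ttl_seconds(version: Tuple[int, int], gc_grace_seconds: int,
                     max_hint_ttl: Optional[int] = None) -> int:
    """Return the hint lifetime in seconds, following the rules in this post.

    Hypothetical helper: `version` is a (major, minor) tuple and
    `max_hint_ttl` stands in for the cassandra.maxHintTTL JVM argument.
    """
    if version < (3, 0) and max_hint_ttl is not None:
        # Up until 2.2: the smaller of the two values wins.
        return min(gc_grace_seconds, max_hint_ttl)
    # 3.0 and later: only gc_grace_seconds is used.
    return gc_grace_seconds


def is_replayable(hint_age: int, hint_ttl: int,
                  data_ttl: Optional[int] = None) -> bool:
    """A hint is sent only if neither the hint nor its data has expired.

    Assumption: the data is as old as the hint that carries it.
    """
    hint_alive = hint_age < hint_ttl
    data_alive = data_ttl is None or hint_age < data_ttl
    return hint_alive and data_alive
```

For instance, under this model a one-hour-old hint with a TTL of 0 is never replayable, no matter how long the data itself would live.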
As a mid-post recap, we have several concepts influencing the lifetime of a hint:
- max_hint_window_ms controlling how long to collect hints for.
- gc_grace_seconds indicating hint expiration time.
- Data TTL determining duration of data validity.
Next we will explore combinations of values for these settings and their impact on Hinted Handoff.
First, let’s consider only the
gc_grace_seconds. By default,
gc_grace_seconds is set to 10 days and
max_hint_window_ms is 3 hours.
With the default settings, the first hint will expire long after Cassandra stops collecting hints. There is plenty of time (9 days and 21 hours exactly) for the unavailable nodes to come back and receive their hints.
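The 9 days and 21 hours figure is simply the difference between the two defaults. A quick check in Python, using the default values expressed in the units of each setting (the variable names are only for illustration):

```python
from datetime import timedelta

gc_grace = timedelta(seconds=864000)             # gc_grace_seconds default: 10 days
hint_window = timedelta(milliseconds=10800000)   # max_hint_window_ms default: 3 hours

# Hint collection stops when the window closes; the first hint, written
# right after the node went down, expires gc_grace after that moment.
slack = gc_grace - hint_window
print(slack)  # 9 days, 21:00:00
```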
Let’s now suppose we lower the
gc_grace_seconds so that it falls below
max_hint_window_ms. This move is acceptable when Cassandra faces too many tombstones and we want it to drop them sooner.
A GC grace shorter than the hint window means the very first hints start to expire even before hint collection stops. This breaks the guarantee of Hinted Handoff to deliver the full
max_hint_window_ms span of missed data.
Taking this to the extreme, we can set
gc_grace_seconds to 0. In other words, we tell Cassandra to drop tombstones immediately.
With gc_grace_seconds set to 0, the hints expire immediately as well. Before the Hinted Handoff rewrite, Cassandra did not even store the hint. Either way, with hints expiring immediately or not being stored in the first place, Hinted Handoff is effectively disabled.
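To make the three scenarios concrete, here is a small Python model of how much of the missed data is still deliverable when the node comes back. It is a simplification built on assumptions rather than Cassandra's implementation: every hint is written at the instant of the mutation, expires exactly gc_grace_seconds later, and data TTL is ignored. The function name `deliverable_span` is hypothetical.

```python
def deliverable_span(downtime: int, hint_window: int, gc_grace: int) -> int:
    """Seconds of missed writes whose hints are still alive at replay time.

    All arguments are in seconds; t=0 is the moment the node went down.
    """
    # Hints are collected until the node returns or the hint window closes.
    collected_until = min(downtime, hint_window)
    # A hint written at time t expires at t + gc_grace, so at replay time
    # (t = downtime) only hints written after downtime - gc_grace survive.
    oldest_live = max(0, downtime - gc_grace)
    return max(0, collected_until - oldest_live)


HOUR = 3600
two_hours_down = 2 * HOUR
print(deliverable_span(two_hours_down, 3 * HOUR, 10 * 24 * HOUR))  # 7200
print(deliverable_span(two_hours_down, 3 * HOUR, 1 * HOUR))        # 3600
print(deliverable_span(two_hours_down, 3 * HOUR, 0))               # 0
```

With the defaults, a node that was down for two hours gets all two hours of writes back. With a one-hour GC grace only the last hour survives, and with a GC grace of 0 nothing does.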
So far we have not considered data TTL. Let’s now see what adding data TTL to the mix implies for Hinted Handoff.
As long as
gc_grace_seconds is bigger than data TTL, hints will always expire only after the data they contain.
However, the GC grace needs to stay higher than the data TTL. Otherwise, we end up with the following:
With data TTL longer than GC grace, hints expire before the data they contain. This means we once again break the reliability of Hinted Handoff because not all collected hints get replayed.
Data TTL allows setting the
gc_grace_seconds lower than
max_hint_window_ms because the data expires before the hints do.
Flipping the setup and making data TTL longer than gc_grace_seconds leads to problems once again, as hints expire before their data does.
In this post we took a close look at how
gc_grace_seconds can prevent the Hinted Handoff from delivering all hints it collected within the
max_hint_window_ms period. The relationship between the data’s TTL, gc_grace_seconds and
max_hint_window_ms is nuanced and can be a little confusing at first glance.
When making decisions regarding these settings, keep these suggestions in mind:
- When not using data TTL, gc_grace_seconds should be (far) longer than max_hint_window_ms.
- When using data TTL, gc_grace_seconds should be (reasonably) larger than the smaller of max_hint_window_ms and data TTL.
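These two suggestions can be turned into a simple sanity check. The sketch below is a hypothetical helper, not part of any Cassandra tooling. It only tests the strict inequality, because how much "far" or "reasonably" larger is enough depends on the cluster.

```python
from typing import List, Optional


def check_settings(gc_grace_seconds: int, max_hint_window_ms: int,
                   data_ttl_seconds: Optional[int] = None) -> List[str]:
    """Return warnings for settings that go against the suggestions above."""
    window_seconds = max_hint_window_ms / 1000  # compare everything in seconds
    warnings = []
    if data_ttl_seconds is None:
        # No data TTL: GC grace must outlast the hint window.
        if gc_grace_seconds <= window_seconds:
            warnings.append(
                "gc_grace_seconds should be longer than max_hint_window_ms")
    else:
        # With data TTL: GC grace must outlast the smaller of the two.
        if gc_grace_seconds <= min(window_seconds, data_ttl_seconds):
            warnings.append(
                "gc_grace_seconds should be larger than the smaller of "
                "max_hint_window_ms and data TTL")
    return warnings


print(check_settings(864000, 10800000))                      # []
print(check_settings(3600, 10800000, data_ttl_seconds=600))  # []
print(len(check_settings(3600, 10800000)))                   # 1
```

The second call passes because the data, with its 10-minute TTL, expires before the one-hour-old hints do. The third call trips the first rule because no TTL is in play.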