Hinted Handoff and GC Grace Demystified
There are many knobs to turn in Apache Cassandra. Finding the right value for all of them is hard. Yet even with all values finely tuned, unexpected things happen. In this post we will see how gc_grace_seconds can break the promises of Hinted Handoff.
The gc_grace_seconds table property defines how long Cassandra keeps tombstones around. Tombstones are special values Cassandra writes instead of the actual data whenever the data is deleted or its TTL expires. Alain covered tombstones in detail in a previous post, About Deletes and Tombstones in Cassandra.
Hinted Handoff is one of the data consistency mechanisms built into Cassandra. When a node in the cluster goes down, the remaining nodes save mutations (writes) for this node locally as hints. They keep doing so for the period given by the max_hint_window_ms setting in cassandra.yaml. Once this window expires, nodes stop saving hints.
Prior to version 3.0, Cassandra stored hints in the system.hints table. In version 3.0 and later, hint storage received a redesign and Cassandra now stores hints in flat files on disk. Regardless of the version, hints have an expiration time. There is a slight difference in how Cassandra sets the hint expiration time depending on the version:
- Up until 2.2, Cassandra picks the smaller of gc_grace_seconds (a table property) and cassandra.maxHintTTL (a JVM runtime argument). This value then becomes the TTL of the row in the system.hints table.
- In 3.0 and later, Cassandra uses only gc_grace_seconds for hint expiration. This value goes into the hint file together with the hint.
When a node replays hints, it sends out only those that have not yet expired. Moreover, the hint data itself must still be live, meaning the TTL of the data, if used, has not yet expired.
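The expiration and replay rules above can be sketched in a few lines of Python. This is an illustrative model, not Cassandra's actual code: the function names are made up for this post, `max_hint_ttl=None` stands for "the JVM argument was not set", and the hint is assumed to be exactly as old as the data it carries.

```python
from typing import Optional, Tuple


def hint_ttl_seconds(version: Tuple[int, int], gc_grace_seconds: int,
                     max_hint_ttl: Optional[int] = None) -> int:
    """Expiration time of a hint, following the two version rules above."""
    if version < (3, 0) and max_hint_ttl is not None:
        # Up until 2.2: the smaller of gc_grace_seconds and cassandra.maxHintTTL.
        return min(gc_grace_seconds, max_hint_ttl)
    # 3.0 and later (or no maxHintTTL given): gc_grace_seconds alone.
    return gc_grace_seconds


def is_replayable(hint_age: int, hint_ttl: int,
                  data_ttl: Optional[int] = None) -> bool:
    """A hint is sent only if it has not expired and its data is still live."""
    hint_live = hint_age < hint_ttl
    # Assumption: data without a TTL never expires on its own.
    data_live = data_ttl is None or hint_age < data_ttl
    return hint_live and data_live
```

For example, `is_replayable(100, hint_ttl_seconds((3, 0), 864000), data_ttl=60)` is false: the hint is still live, but the data it carries expired 40 seconds earlier.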
As a mid-post recap, we have several concepts influencing the lifetime of a hint:
- max_hint_window_ms controlling how long to collect hints for.
- gc_grace_seconds indicating hint expiration time.
- Data TTL determining duration of data validity.
Next we will explore combinations of values for these settings and their impact on Hinted Handoff.
First, let’s consider only max_hint_window_ms and gc_grace_seconds. By default, gc_grace_seconds is set to 10 days and max_hint_window_ms is 3 hours.
With the default settings, the first hint will expire long after Cassandra stops collecting hints. There is plenty of time (9 days and 21 hours exactly) for the unavailable nodes to come back and receive their hints.
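The 9 days and 21 hours figure is simply the gap between the moment hint collection stops and the moment the very first hint expires. A quick check in Python, using the default values quoted above (the variable names are only for illustration):

```python
from datetime import timedelta

gc_grace = timedelta(days=10)         # default gc_grace_seconds (864000 s)
max_hint_window = timedelta(hours=3)  # default max_hint_window_ms (10800000 ms)

# The first hint is written at t = 0 and expires at t = gc_grace.
# Hint collection stops at t = max_hint_window.
slack = gc_grace - max_hint_window

print(slack)  # 9 days, 21:00:00
```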
Let’s now suppose we lower the gc_grace_seconds so that it falls below max_hint_window_ms. This move is acceptable in situations when Cassandra faces too many tombstones and we want it to drop them quicker.
A GC grace shorter than the hint window means the very first hints start to expire even before hint collection stops. This breaks the guarantee of Hinted Handoff to deliver the full max_hint_window_ms span of missed data.
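To get a feel for how much of the outage is still covered by live hints, here is a small model. It is a simplification with its own assumptions: all values are in seconds, the function name is hypothetical, and the node is assumed to receive every live hint the instant it comes back (real replay takes time).

```python
def deliverable_span(outage_seconds: int, max_hint_window: int,
                     gc_grace: int) -> int:
    """Seconds of missed writes still covered by live hints when the node returns.

    Hints are collected for the first max_hint_window seconds of the outage.
    A hint written at time w is still live at time outage_seconds only if
    outage_seconds - w < gc_grace.
    """
    collected_until = min(outage_seconds, max_hint_window)
    oldest_live_write = max(0, outage_seconds - gc_grace)
    return max(0, collected_until - oldest_live_write)


# Hint window of 3 hours, GC grace of 1 hour, node down for 2 hours:
# only the last hour of writes can still be replayed.
print(deliverable_span(7200, 10800, 3600))    # 3600
# Same outage with the default 10 days of GC grace: everything is replayed.
print(deliverable_span(7200, 10800, 864000))  # 7200
```

With `gc_grace` set to 0 the function returns 0 for any outage, which corresponds to Hinted Handoff delivering nothing at all.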
Taking this to the extreme, we can set gc_grace_seconds to 0. In other words, we tell Cassandra to drop tombstones immediately.
With gc_grace_seconds set to 0, the hints expire immediately as well. Before the Hinted Handoff rewrite, Cassandra did not even store the hint. Hints expiring immediately, or not being stored in the first place, effectively disable Hinted Handoff.
So far we have not considered data TTL. Let’s now see what adding data TTL to the mix implies for Hinted Handoff.
As long as gc_grace_seconds is bigger than data TTL, hints will always expire only after the data they contain.
This holds only while GC grace stays higher than data TTL. Otherwise, we end up in the following situation.
With data TTL longer than GC grace, hints expire before the data they contain. This means we once again break the reliability of Hinted Handoff because not all collected hints get replayed.
A data TTL shorter than gc_grace_seconds makes it safe to set gc_grace_seconds lower than max_hint_window_ms, because the data expires before the hints do.
Flipping the setup and making data TTL longer than gc_grace_seconds leads to problems once again, as hints expire before their data does.
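The difference between the two setups comes down to what is lost when a hint is not replayed. A minimal sketch, assuming the hint and its data were written at the same moment and all values are in seconds (the function name and the outcome labels are made up for this illustration):

```python
from typing import Optional


def hint_outcome(age: int, gc_grace: int, data_ttl: Optional[int]) -> str:
    """Classify a hint of the given age at replay time."""
    data_live = data_ttl is None or age < data_ttl
    hint_live = age < gc_grace
    if not data_live:
        # The data would be dead on the target node anyway.
        return "data expired, nothing lost"
    if hint_live:
        return "replayed"
    # The hint is gone while the data it carried is still valid.
    return "hint expired, live data missed"


# Data TTL (30 min) shorter than GC grace (1 h): the old hint is harmless.
print(hint_outcome(age=2000, gc_grace=3600, data_ttl=1800))
# Data TTL (2 h) longer than GC grace (1 h): the node misses live data.
print(hint_outcome(age=4000, gc_grace=3600, data_ttl=7200))
```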
In this post we took a close look at how gc_grace_seconds can prevent Hinted Handoff from delivering all the hints it collected within the max_hint_window_ms period. The relationship between the data’s TTL, gc_grace_seconds and max_hint_window_ms is nuanced and can be a little confusing at first glance.
When making decisions regarding these settings, keep these suggestions in mind:
- When not using data TTL, gc_grace_seconds should be (far) longer than max_hint_window_ms.
- When using data TTL, gc_grace_seconds should be (reasonably) larger than the smaller of max_hint_window_ms and data TTL.
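These two suggestions can be turned into a quick sanity check. The helper below is a hypothetical sketch, not part of any Cassandra tooling. It flags only the hard violation (GC grace not larger than the threshold) and leaves the "far" and "reasonably" margins to your judgment.

```python
from typing import List, Optional


def check_hint_settings(gc_grace_seconds: int, max_hint_window_ms: int,
                        data_ttl_seconds: Optional[int] = None) -> List[str]:
    """Return warnings for settings that let hints expire too early."""
    window_seconds = max_hint_window_ms / 1000
    if data_ttl_seconds is None:
        threshold, label = window_seconds, "max_hint_window_ms"
    else:
        threshold = min(window_seconds, data_ttl_seconds)
        label = "the smaller of max_hint_window_ms and data TTL"
    warnings = []
    if gc_grace_seconds <= threshold:
        warnings.append(
            f"gc_grace_seconds ({gc_grace_seconds}s) should be larger than "
            f"{label} ({threshold:g}s)"
        )
    return warnings


# Defaults (10 days, 3 hours) pass without warnings.
print(check_hint_settings(864000, 10800000))
# A 1 hour GC grace with a 2 hour data TTL triggers a warning.
print(check_hint_settings(3600, 10800000, data_ttl_seconds=7200))
```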