-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TTL cache retries more frequently on failures #806
base: main
Are you sure you want to change the base?
Conversation
* Test with current thread executor
8396cae
to
db2126d
Compare
refreshIntervalMillis: Long = 8 * 1000 // 8 seconds | ||
refreshIntervalMillis: Long = 8 * 1000, // 8 seconds | ||
// same as ttlMillis, so behavior is unchanged barring an override | ||
failureTTLMillis: Long = 2 * 60 * 60 * 1000 // 2 hours |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should lower this to 30 seconds - I am no longer at airbnb, but I think it would benefit airbnb too. we have had incidents in the past, similar to yours where this could have helped.
cc: @pengyu-hou who is familiar with the Airbnb incident.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I feel 30 seconds may be too frequent as default. Some of these TTLCache is used to cache metadata, which doesn't get updated frequently anyway.
But I can see why for things like BatchIr cache, a more frequent failure refresh is desired.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @nikhilsimha. Our incident was caused by a stale but valid metadata. To mitigate it, we would have to flush the TTL cache. This should be addressed with @yuli-han 's recent work that we will only fetch active configs.
I am curious what is the failureTTLMillis
from Stripe side? @yizzlez
For failure cases, I agree that we should use a lower TTL.
val minFailureUpdateTTL = Math.min(intervalMillis, failureTTLMillis) | ||
val shouldUpdate = entry.value match { | ||
// Encountered a failure, update according to failure TTL. | ||
case Failure(_) => nowFunc() - entry.updatedAtMillis > minFailureUpdateTTL | ||
case _ => nowFunc() - entry.updatedAtMillis > intervalMillis | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
val minFailureUpdateTTL = Math.min(intervalMillis, failureTTLMillis) | |
val shouldUpdate = entry.value match { | |
// Encountered a failure, update according to failure TTL. | |
case Failure(_) => nowFunc() - entry.updatedAtMillis > minFailureUpdateTTL | |
case _ => nowFunc() - entry.updatedAtMillis > intervalMillis | |
} | |
val effectiveExpiry = entry.map(_ => intervalMillis).getOrElse(Math.min(intervalMillis, failureTTLMillis)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor simplification.
if ( | ||
(nowFunc() - entry.updatedAtMillis > intervalMillis) && | ||
shouldUpdate && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldUpdate && | |
(nowFunc() - entry.updatedAtMillis > effectiveExpiry) && |
@@ -0,0 +1,138 @@ | |||
package ai.chronon.online.test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for adding this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the change!
Summary
This PR adds a new parameter to the TTL cache --
failureTTLMillis
. This is a custom TTL for entries with typeFailure
. CurrentlyfailureTTLMillis == ttlMillis
, which means this should have no behavior change. At Stripe, we will be changing some of the caches to have a significantly shorterfailureTTLMillis
.Why / Goal
At Stripe, we ran into an incident involving this particular piece of caching code for groupByServing info.
Our internal KV store was returning a handful of timeout errors for some requests. The issue is that this
Failure
is stored in a TTL Cache with a timeout of 2 hours. This caused hosts to enter a bad state and error repeatedly, as it was fetching the previously stored error from the TTLCache instead of retrying against our KVStore.With this code change, it's possible to configure this TTL cache to have a
failureTTLMillis == 5s
, in which case we will automatically retry after 5s if we encounter a KV store timeout error.Test Plan
Checklist
Reviewers