Provide for detection of Stuck Projectors in StreamsProjector #125
Comments
Do you foresee a config type being passed in when Propulsion is initialized, or some sort of callback that is passed in that takes the number of attempts so far as a parameter and controls the amount of time to sleep / back off?
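For concreteness, one shape such a callback could take (purely illustrative; neither the name nor the exponential policy below is an actual Propulsion API):

```ocaml
(* Hypothetical backoff callback: given the number of attempts so far,
   return how many seconds to sleep before the next retry. *)
let backoff ~attempts =
  (* exponential backoff, capped at 30s; purely illustrative numbers *)
  min 30.0 (0.5 *. (2.0 ** float_of_int (attempts - 1)))
```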
Yeah, of note, the system we worked on has a poor man's version that checks whether there's a non-zero amount in the Propulsion buffers (bytes or event count) and whether the number of handler successes is zero 😆
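Roughly, that heuristic amounts to something like the following (illustrative only; `buffered_events` and `successes_in_window` stand in for whatever stats the ingester/scheduler already exposes, not a specific Propulsion API):

```ocaml
(* "Poor man's" stuck check: work is buffered but nothing succeeded
   in the current monitoring window => possibly stuck. *)
let looks_stuck ~buffered_events ~successes_in_window =
  buffered_events > 0 && successes_in_window = 0
```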
I'm presuming you meant Unless the per-stream level would be simultaneously pumped into a different kind of sink, like logs? And if that log entry gets huge, probably some top X offenders. I would imagine in the worst case: having the stream ID visible in some form would be useful for manual investigation of the stream (but maybe not in Prometheus)
For this version of stalling, do you foresee that one may have set the
Could it be defined in the
This discussion around measuring handler performance made me think of a (maybe) tangentially related subject, but let me know if we should discuss elsewhere:
I did, but your later comment puts me on what may be a better path - having such stuff configurable in the
Interesting to hear - that's actually not bad - kudos to the inventor!
edited in some detail/qualifiers - definitely had your concerns in mind
Yes, though it needs to be opt-in and pluggable - need to limit the number of log entries a retry loop can spew, probably influenced by the retry count and/or time stuck; @alexeyzimarev made the point that the event types are often pretty important/relevant too. I'm also wondering whether such a hook should provide context across all currently failing/stuck streams
This is a very good point (that I'll have to ponder). In general, whether a projection is caught up, falling behind or catching up is an orthogonal concern. The central issue, which I see as critical from an alerting point of view, is whether zero progress is being made. But the point is not lost - if I can reasonably efficiently produce a count of streams that have been scheduled with residual events after handling, I may do so. It should be called out, though, that in some particular cases this can be quite a healthy and normal condition - e.g. if replicating from one store to another (or e.g. sending to Kafka), then it's not abnormal for one stream to be dominant ('the long pole') and always have work outstanding, while the others slipstream in alongside it 'for free'
Well, things like handlers that have random bugs/issues due to stale reads and/or not being able to read their writes sufficiently quickly - in some cases watchdog-like things with complex logic that intentionally hold off on declaring things complete can have logic or assumption bugs which can lead to a lack of progress. In other words, yes the known unknowns, but also the unknown ones ;)
wrt your second comment:
For a second I thought you meant within Propulsion - in the template it is, but it's definitely a case-by-case thing - maybe a comment there outlining some of the considerations might help; I don't envisage there being something that one would want to bake into Propulsion given the various tools in the toolbox like e.g. slicing the spans, using
At present, if a handler continually fails to make progress on a given stream, the Scheduler will continually retry.

In order to be able to define a clear alertable condition, it is proposed to maintain relevant state on a per-stream basis.
While the necessary data may be maintained at the stream level, it's problematic to surface these either as:
Metrics

- timeFailing: now - oldest failing-since (tags: app, category)
- countFailing: number of streams whose last outcome was a failure (tags: app, category)
- timeStalled: now - oldest stalled (tags: app, category)
- countStalled: number of streams whose last handler invocation did not make progress (and has messages waiting) (tags: app, category)
- 🤔 longestRunning: oldest dispatched call in flight that has yet to yield a success/fail
- 🤔 countRunning: number of handler invocations in flight
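For illustration, a minimal sketch of the per-stream bookkeeping the first four gauges could be derived from (all names are made up for the example; they are not Propulsion types):

```ocaml
(* Hypothetical per-stream state; timestamps are Unix seconds. *)
type stream_state = {
  mutable failing_since : float option;  (* when the current run of failures started *)
  mutable stalled_since : float option;  (* when the stream last stopped making progress, with work still queued *)
}

let streams : (string, stream_state) Hashtbl.t = Hashtbl.create 64

(* Derive timeFailing / countFailing / timeStalled / countStalled in one pass *)
let snapshot ~now =
  Hashtbl.fold
    (fun _stream s (time_failing, count_failing, time_stalled, count_stalled) ->
      let time_failing, count_failing =
        match s.failing_since with
        | Some since -> (max time_failing (now -. since), count_failing + 1)
        | None -> (time_failing, count_failing)
      in
      let time_stalled, count_stalled =
        match s.stalled_since with
        | Some since -> (max time_stalled (now -. since), count_stalled + 1)
        | None -> (time_stalled, count_stalled)
      in
      (time_failing, count_failing, time_stalled, count_stalled))
    streams (0.0, 0, 0.0, 0)
```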
Example Alerts

- max timeFailing > 5m as a reasonable default for the average projector that is writing to a rate-limited store
- max timeStalled > 2m for a watchdog that's responsible for cancelling and/or pumping workflows that have not reached a conclusion within 1m
- 🤔 max timeRunning > 1h for a workflow engine processing step sanity check

Pseudocode Logic
When a success happens:
When a fail or happens:
When a success with lack of progress happens:
When a dispatch or completion of a call happens:
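A rough sketch of what those transitions could look like, reusing the hypothetical stream_state table from the sketch above (again illustrative, not the actual implementation):

```ocaml
(* Look up or create the per-stream record *)
let state_of id =
  match Hashtbl.find_opt streams id with
  | Some s -> s
  | None ->
    let s = { failing_since = None; stalled_since = None } in
    Hashtbl.add streams id s;
    s

(* success (with progress): clear both clocks *)
let on_success id =
  let s = state_of id in
  s.failing_since <- None;
  s.stalled_since <- None

(* failure: start the failing clock if it isn't already running *)
let on_failure id ~now =
  let s = state_of id in
  if s.failing_since = None then s.failing_since <- Some now

(* success without progress (messages still waiting): start the stalled clock *)
let on_no_progress id ~now =
  let s = state_of id in
  s.failing_since <- None;
  if s.stalled_since = None then s.stalled_since <- Some now

(* dispatch / completion: maintain the in-flight count behind countRunning *)
let running = ref 0
let on_dispatch () = incr running
let on_complete () = decr running
```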
🤔 While there should probably be a set of callbacks the projector provides that can be used to hook in metrics, we also want the system to log summaries out of the box
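One possible (hypothetical) shape for such a hook, alongside the built-in summary logging; none of these names exist in Propulsion:

```ocaml
(* Hypothetical monitoring hooks: the projector would invoke on_metrics
   periodically, while still emitting its own summary log lines. *)
type monitoring_hooks = {
  on_metrics :
    time_failing:float -> count_failing:int ->
    time_stalled:float -> count_stalled:int -> unit;
  summary_log_interval : float;  (* seconds between built-in summary log entries *)
}

let default_hooks = {
  on_metrics = (fun ~time_failing:_ ~count_failing:_ ~time_stalled:_ ~count_stalled:_ -> ());
  summary_log_interval = 60.0;
}
```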
Other ideas/questions
tagging @ameier38 @belcher-rok @deviousasti @dunnry @enricosada @ragiano215 @swrhim @wantastic84, who have been party to discussions in this space (and may be able to extend or, hopefully, simplify the requirements above, or link to a better writeup or concept regarding this)