Provide for detection of Stuck Projectors in StreamsProjector #125

Closed
bartelink opened this issue Nov 25, 2021 · 5 comments

@bartelink
Collaborator

bartelink commented Nov 25, 2021

At present, if a handler continually fails to make progress on a given stream, the Scheduler will continually retry, resulting in:

  • running 'hot'; there are no backoffs and/or anything else to ameliorate the impact of things failing
  • no direct way to determine that such a state has been entered from a programmatic, alerting or dashboards perspective. At present, a number of secondary effects will hint at the problem:
    • lack of progress when observing the read and/or checkpoint positions on the source, e.g. for the CFP or Feed readers
    • increase in exception outcomes on dashboards
    • reduction in successful outcomes on dashboards

In order to be able to define a clear, alertable condition, it is proposed to maintain, on a per-stream basis (see the sketch below):

  • number of consecutive failures, timestamp of first failure
  • number of consecutive successes without progress, timestamp of first attempt
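
A rough sketch of the per-stream state this implies follows; the type and field names here are illustrative only, not the eventual implementation:

```fsharp
open System

// Hypothetical per-stream health state; names are illustrative only
type StreamHealth =
    { consecutiveFailures : int              // consecutive handler failures
      failingSince : DateTimeOffset option   // when the first failure of the current run occurred
      consecutiveStalled : int               // consecutive successes that made no progress
      stalledSince : DateTimeOffset option } // when the first stalled attempt occurred

// the state of a stream with no recorded issues
let healthy =
    { consecutiveFailures = 0; failingSince = None
      consecutiveStalled = 0; stalledSince = None }
```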

While the necessary data may be maintained at the stream level, it's problematic to surface it either as:

  • a log record per invocation - Handlers can receive extremely high traffic and adding this overhead as a fixed cost is not likely to work well
  • metrics tagged/instanced at the stream level - this will lead to excessive cardinality as it's unbounded, while likely making querying more complex (though more metrics give more scope for alerting, that does not ease the task of determining which ones are relevant to someone coming to a set of metrics fresh)

Metrics

  • timeFailing: now - the oldest 'failing since' timestamp (tags: app, category)
  • countFailing: number of streams whose last outcome was a failure (tags: app, category)
  • timeStalled: now - the oldest 'stalled since' timestamp (tags: app, category)
  • countStalled: number of streams whose last handler invocation did not make progress (and which have messages waiting) (tags: app, category)

🤔 - longestRunning: oldest dispatched call in flight that has yet to yield a success/fail
🤔 - countRunning: number of handler invocations in flight
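
To illustrate how such metrics might be surfaced, the sketch below assumes the prometheus-net package and placeholder metric names; it is not Propulsion.Prometheus's actual wiring or naming:

```fsharp
open System
open Prometheus // prometheus-net; its use here is an assumption, not Propulsion's actual implementation

// Gauges tagged by app and category, mirroring the proposed metrics (names are placeholders)
let timeFailing  = Metrics.CreateGauge("propulsion_streams_failing_oldest_seconds", "now - oldest 'failing since'", "app", "category")
let countFailing = Metrics.CreateGauge("propulsion_streams_failing_count", "streams whose last outcome was a failure", "app", "category")
let timeStalled  = Metrics.CreateGauge("propulsion_streams_stalled_oldest_seconds", "now - oldest 'stalled since'", "app", "category")
let countStalled = Metrics.CreateGauge("propulsion_streams_stalled_count", "streams whose last invocation made no progress", "app", "category")

// intended to be driven periodically from the scheduler's stats loop
let observe (app : string) (category : string) (oldestFailing : TimeSpan) failingCount (oldestStalled : TimeSpan) stalledCount =
    timeFailing.WithLabels(app, category).Set(oldestFailing.TotalSeconds)
    countFailing.WithLabels(app, category).Set(float failingCount)
    timeStalled.WithLabels(app, category).Set(oldestStalled.TotalSeconds)
    countStalled.WithLabels(app, category).Set(float stalledCount)
```

The example alerts below then reduce to simple threshold expressions over these gauges.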

Example Alerts

  • max timeFailing > 5m as a reasonable default for the average projector that is writing to a rate-limited store
  • max timeStalled > 2m for a watchdog that's responsible for cancelling and/or pumping workflows that have not reached a conclusion within 1m

🤔 - max timeRunning > 1h for a workflow engine processing step sanity check

Pseudocode Logic

When a success happens:

  • consecutive failures/failing since is cleared
  • (if progress is made) consecutive stalled is cleared

When a failure happens:

  • consecutive failures/failing since is either initialized to (1,now) or incremented

When a success with lack of progress happens:

  • consecutive stalled is either initialized to (1,now) or incremented

When a dispatch or completion of a call happens:

  • update the longest running task start time metric
  • record the latency in the metrics immediately vs waiting to surface it every n ms?
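
A sketch of the success/failure/stalled transitions, continuing the hypothetical StreamHealth record above (again illustrative rather than the eventual implementation):

```fsharp
// outcome of a single handler invocation, as far as health tracking is concerned (illustrative)
type AttemptOutcome =
    | Failed
    | SucceededWithProgress
    | SucceededWithoutProgress

// applies the transitions described above to a stream's StreamHealth
let update (now : DateTimeOffset) outcome (s : StreamHealth) =
    match outcome with
    | SucceededWithProgress ->
        healthy // a success clears the failure run; progress clears the stalled run
    | SucceededWithoutProgress ->
        { s with
            consecutiveFailures = 0; failingSince = None
            consecutiveStalled = s.consecutiveStalled + 1
            stalledSince = s.stalledSince |> Option.orElse (Some now) }
    | Failed ->
        { s with
            consecutiveFailures = s.consecutiveFailures + 1
            failingSince = s.failingSince |> Option.orElse (Some now) }
```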

🤔 while there should probably be a set of callbacks the projector provides that can be used to hook in metrics, we also want the system to log summaries out of the box

Other ideas/questions

  • is being able to inject backoffs based on these metrics for a given stream important?
  • how (and would) one want to be able to internally exit/restart the projector host app based on these values?
  • is there some more important intermittent failure pattern this will be useless for?
  • I'm excluding higher-level lag-based metrics, e.g. the sort of thing https://github.com/linkedin/Burrow does

tagging @ameier38 @belcher-rok @deviousasti @dunnry @enricosada @ragiano215 @swrhim @wantastic84 who have been party to discussions in this space (and may be able to extend or, hopefully, simplify the requirements above, or link to a better writeup or concept regarding this)

@ragiano215

ragiano215 commented Nov 25, 2021

there are no backoffs and/or anything else to ameliorate the impact of things

Do you foresee a config type being passed in when Propulsion is initialized, or some sort of callback that takes the number of attempts so far as a parameter and controls the amount of time to sleep / back off?

no direct way to determine that such a state has been entered

Yeah, of note, the system we worked on has a poor man's version that checks whether there's a non-zero amount in the Propulsion buffers (bytes or event count) and whether the number of handler successes is zero 😆

on a per-stream basis

I'm presuming you meant on a per-category basis, given the labels you indicated in the metrics. (Also, I was thinking of the Prometheus cardinality thing.)

Unless the per-stream level would be simultaneously pumped into a different kind of sink, like logs? And if that log entry gets huge, probably limit it to some top X offenders. I would imagine in the worst case: having the stream ID visible in some form would be useful for manual investigation of the stream (but maybe not in Prometheus)

number of consecutive successes without progress

For this version of stalling, do you foresee that one may have set the OverrideWritePosition erroneously? Or are there other processing scenarios that you have in mind? At least in my mind, the consecutive-failures scenario seems more interesting to distinguish than the consecutive successes. I'm wondering, for the latter, whether simply a metric indicating that there are additional events for a stream in the buffer but no progress is being made would suffice -- regardless of whether it's due to successes or exceptions

while there should probably be a set of callbacks the projector provides

Could it be defined in the Stats type? A Handle*?

@ragiano215

This discussion around measuring handler performance made me think of a (maybe) tangentially related subject, but let me know if we should discuss elsewhere:
Presently, for our usage of the generic feed reader, there can be a large number of events presented in one span to the handler, so the present code truncates to the first 1000 events (with a SpanResult.PartiallyProcessed; sketched below). But that 1000 is hardcoded (and the actual count could be anywhere between 0 and 1000).

  • I can't think of a good solution at the moment; it could potentially be misleading as to how long a certain handler invocation takes?
  • the main form of throttling is MaxWriters (the -w option), plus semaphores if they are introduced into the handler invocations; do you suppose Propulsion could have a config to control the max size of spans?
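
For concreteness, the truncation pattern described above looks roughly like the following; the SpanResult type here is a paraphrase for illustration, and the real handler signature and processing logic differ:

```fsharp
// paraphrase of Propulsion.Streams.SpanResult for illustration; the real type has further cases
type SpanResult =
    | AllProcessed
    | PartiallyProcessed of count : int

let maxEvents = 1000 // the hardcoded limit being discussed

// 'processEvents' stands in for the real projection logic; the actual handler shape differs
let handle (processEvents : 'E[] -> Async<unit>) (events : 'E[]) = async {
    // handle at most maxEvents; the scheduler retains the remainder for a subsequent invocation
    let batch = Array.truncate maxEvents events
    do! processEvents batch
    // report how many events were actually consumed so the write position only advances that far
    return PartiallyProcessed batch.Length }
```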

@bartelink
Collaborator Author

Do you foresee a config type being passed in when Propulsion is initialized, or some sort of callback that takes the number of attempts so far as a parameter and controls the amount of time to sleep / back off?

I did, but your later comment puts me on what may be a better path - having such stuff configurable in the type Stats. I'll be aiming to provide most hooks internally as CLI Events and/or functions, but in general the Stats will be where it's surfaced - it's not like any of the top level Propulsion APIs are short of required parameters ;) Will ping here when I actually get beyond preconceived notions though!
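
For what it's worth, the shape of hook being discussed might be as simple as the following (purely hypothetical; nothing like this exists in Propulsion at this point):

```fsharp
open System

// Hypothetical: given the consecutive failure count for a stream, decide how long the
// scheduler should hold off before redispatching it (None = retry immediately)
type BackoffPolicy = int -> TimeSpan option

// example policy: exponential backoff capped at one minute, kicking in after the second failure
let exampleBackoff : BackoffPolicy = fun attempts ->
    if attempts < 2 then None
    else Some (TimeSpan.FromSeconds(min 60. (2. ** float attempts)))
```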

Yeah, of note, the system we worked on has a poor man's version that checks whether there's a non-zero amount in the Propulsion buffers (bytes or event count) and whether the number of handler successes is zero 😆

Interesting to hear - that's actually not bad - kudos to the inventor!

on a per-stream basis

edited in some detail/qualifiers - definitely had your concerns in mind

I would imagine in the worst case: having the stream ID visible in some form would be useful for manual investigation of the stream (but maybe not in Prometheus)

Yes, though it needs to be opt-in and pluggable - we need to limit the number of log entries a retry loop can spew, probably influenced by the retry count and/or time stuck; @alexeyzimarev made the point that the event types are often pretty important/relevant too. I'm also wondering whether such a hook should provide context across all currently failing/stuck streams

I'm wondering, for the latter, whether simply a metric indicating that there are additional events for a stream in the buffer but no progress is being made would suffice -- regardless of whether it's due to successes or exceptions

This is a very good point (that I'll have to ponder). In general, whether a projection is caught up, falling behind or catching up is an orthogonal concern. The central issue, which I see as critical from an alerting point of view, is whether zero progress is being made. But the point is not lost - if I can reasonably efficiently produce a count of streams that have been scheduled with residual events after handling, I may do so. It should be called out, though, that in some particular cases this can be quite a healthy and normal condition - e.g. if replicating from one store to another (or e.g. sending to Kafka), it's not abnormal for one stream to be dominant ('the long pole') and always have work outstanding, with the others slipstreaming in alongside it 'for free'

Or are there other processing scenarios that you have in mind?

Well, things like handlers that have random bugs/issues due to stale reads and/or not being able to read their own writes sufficiently quickly - in some cases, watchdog-like things with complex logic that intentionally hold off on declaring things complete can have logic or assumption bugs which lead to a lack of progress. In other words, yes, the known unknowns, but also the unknown ones ;)

@bartelink
Collaborator Author

wrt your second comment:

But that 1000 is hardcoded

For a second I thought you meant within Propulsion - in the template it is, but it's definitely a case-by-case thing - maybe a comment there outlining some of the considerations might help; I don't envisage there being something one would want to bake into Propulsion, given the various tools in the toolbox, e.g. slicing the spans, using AsyncBatchingGate, using Semaphores etc. Ultimately, while workloads with uniform event sizes, distribution and cost to handle exist, you won't get far by assuming them to be uniform - it's about being able to identify when things are off. The main concern with large batches of events is that the evidence of throughput, in terms of events having been handled, is going to be lumpy. The timeRunning metric would become relevant if and only if one can bound the work by doing things like this.

@bartelink
Collaborator Author

Closing, as 2.0.12-rc.1 provides the basic function and #129 is a prototype of backoff logic that can function as a placeholder.
Any further discussion should be posted as fresh issues, expressed in terms of the impl completed in #126.
