heartbeat for runners for better execution in environments with autoscaling #16142
Labels: enhancement (An improvement of an existing feature), great writeup (This is a wonderful example of our standards), performance (Related to an optimization or performance improvement)
Describe the current behavior
Prefect does not seem to work well with a Kubernetes autoscaling strategy, or more generally in any context where compute resources are elastic and can be scaled up and down. Because we run our Prefect workloads on these highly volatile, scalable clusters, our jobs/pods get rescheduled frequently.
This graph is a view of our cluster autoscaler; it shows the Instance Group - Instance Group Size metric in GCP, which is a direct reflection of the number of autoscale events in our production Kubernetes clusters.
What you're seeing here is a consequence of the rate of Prefect job submission. Essentially, we have applications that submit a number of flows, and these flows in turn launch a number of subflows via run_deployment. This results in a spike in the number of jobs submitted to Kubernetes. In response, our autoscaler dramatically scales up the number of nodes. As these jobs start to finish, the autoscaler scales back down, evicting many of the jobs that are still in flight. We therefore end up with far more failures than we otherwise would, which is both frustrating for the developers of these flows and harmful to our bottom line: after retries, we have essentially paid twice to process the same workload.
When this happens, Prefect often loses track of the workers that had been executing its workloads: either the run is marked Failed while the runner continues executing, or the runner dies but the Prefect server thinks it is still executing. These "ghost runs" can remain in the Prefect UI for hours until a human notices the issue and reschedules or cancels the run in question.
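For concreteness, the fan-out that produces this spike looks roughly like the sketch below. run_deployment is the actual Prefect API in play; the flow name, deployment name, parameters, and fan-out count are placeholders for illustration only.

```python
from prefect import flow
from prefect.deployments import run_deployment


@flow
def fan_out_driver():
    # Placeholder names: "etl/process-partition" stands in for one of our real
    # deployments. Each call schedules a separate flow run, which the Kubernetes
    # worker turns into its own Job/pod, so hundreds of pods land at roughly the
    # same time and trigger the autoscaler spike shown in the graph above.
    for partition in range(200):
        run_deployment(
            name="etl/process-partition",
            parameters={"partition": partition},
            timeout=0,  # schedule the subflow and return immediately, don't wait
        )
```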
We have often been told by support that the way around this issue is to set up timeout automations that set the flow run state to Failed, schedule retries, and so on. This is workable, but it doesn't address the fundamental problem. There is often nothing wrong with the developers' code, the resources they have requested, or anything else in their control; the problem is that the infrastructure has failed them. It's a lot to ask of developers to set up these kinds of timeouts to work around a problem that should be solved out of the box. At the very least, a timeout of this nature should be more easily configurable than an entire automation block.
Describe the proposed behavior
We would like to suggest a heartbeat for Prefect runners. This could either be a message the runner sends to the server at a predefined interval, or a healthcheck endpoint the runner exposes, with Prefect jobs including a sidecar responsible for polling it. If the runner in question has not checked in for a certain amount of time, the Prefect server assumes the run has crashed and re-enqueues the work (or marks it Failed, etc.). Conversely, the runner could be configured so that, if it has not been able to contact the Prefect server within a certain amount of time, it shuts down gracefully.
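To make the first variant concrete, a runner-side heartbeat loop could look something like the sketch below. Everything in it is hypothetical: the /runners/{id}/heartbeat endpoint, the interval, and the shut-down-after-missed-heartbeats behaviour are not existing Prefect features, just an illustration of the behaviour we are proposing.

```python
# Hypothetical sketch of a runner-side heartbeat loop (not a real Prefect API).
# Assumes the server exposes POST /runners/{runner_id}/heartbeat.
import asyncio
import httpx

HEARTBEAT_INTERVAL_SECONDS = 30  # would be configurable
MAX_MISSED_HEARTBEATS = 4        # give up after ~2 minutes without server contact


async def heartbeat_loop(api_url: str, runner_id: str, shutdown: asyncio.Event) -> None:
    missed = 0
    async with httpx.AsyncClient(base_url=api_url, timeout=10) as client:
        while not shutdown.is_set():
            try:
                response = await client.post(f"/runners/{runner_id}/heartbeat")
                response.raise_for_status()
                missed = 0
            except httpx.HTTPError:
                missed += 1
                if missed >= MAX_MISSED_HEARTBEATS:
                    # The server has been unreachable for too long: stop claiming
                    # the run and exit gracefully so it can be rescheduled elsewhere.
                    shutdown.set()
                    break
            await asyncio.sleep(HEARTBEAT_INTERVAL_SECONDS)
```

The mirror image on the server side (marking a run dead when its heartbeats stop) is sketched under Example Use below.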
Example Use
This would both reduce the flakiness of our Prefect runs and keep down some of the costs we pay for our cloud compute, because we could then come up with ways to manage this burst on the Kubernetes side. Two of our best ideas here are:
Using the prefect-kubernetes worker. In this case we would add something like Kueue to control the rate at which these jobs are actually scheduled in Kubernetes, helping ease the spikiness of the graph you see above. Kueue also controls scale-down and can add smoothness there, but ultimately flow runs scaled down in this way will still be subject to the same "ghost run" issue described above.
As you can see, both of these approaches fall afoul of the same issue: the Prefect server does not see that the agents executing its in-flight workflows have vanished. Both would be viable if Prefect runners included some sort of heartbeat so the server would know to mark them dead. (Note that Prefect workers already seem to behave this way, and the UI will report if it has not heard from a worker in a certain period of time. But we don't have the same observability for the runners actually executing the flows.)
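On the server side, the complementary piece could be a periodic sweep that treats any runner whose last heartbeat is too old as dead, crashes (or reschedules) its flow run, and frees whatever concurrency slots it held. Again, this is only a sketch: RunnerRecord, mark_run_crashed, and the timeout value are made-up names, not Prefect internals.

```python
# Hypothetical server-side sweep; none of these names exist in Prefect today.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable

HEARTBEAT_TIMEOUT = timedelta(minutes=2)  # assumed to be configurable


@dataclass
class RunnerRecord:
    runner_id: str
    flow_run_id: str
    last_heartbeat: datetime


def sweep_stale_runners(
    runners: list[RunnerRecord],
    mark_run_crashed: Callable[[str], None],
) -> list[str]:
    """Crash every flow run whose runner has gone silent; return their ids."""
    now = datetime.now(timezone.utc)
    crashed = []
    for runner in runners:
        if now - runner.last_heartbeat > HEARTBEAT_TIMEOUT:
            # No heartbeat within the window: assume the pod was evicted, free
            # the run (and its concurrency slot) so it can be retried elsewhere.
            mark_run_crashed(runner.flow_run_id)
            crashed.append(runner.flow_run_id)
    return crashed
```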
Additional context
Note: we have tried some of Prefect's concurrency settings here (per-deployment and global concurrency limits), but they actually make the problem worse. Because Prefect cannot see that some workers have vanished, it concludes that a concurrency slot is still held by a long-dead worker and refuses to schedule anything else until some other agent forcibly sets the state of the flow run in question to Failed.