stuck jobs #5352
Comments
Hello, I'm Franke Tang, a graduate student currently taking a Distributed Computing course, and part of my final project encourages us to contribute to open issues on GitHub relating to distributed systems. I would like to work on this issue if this has not been implemented yet.
Welcome, @FTang21, sure, go ahead
Hello, sorry for the late follow-up; I was working on PRs on other repos. I was looking through the code. Would app.cpp be a good place to start on this issue?
The new logic would go in ACTIVE_TASK_SET::poll()
Reportedly, some VM jobs (and possibly others) get into a "stuck" state where they
don't make progress: no change in fraction done, and little CPU usage.
These jobs will eventually be aborted when their elapsed time reaches the rsc_fpops_bound limit,
but this could take weeks or months depending on the limit.
Proposal: have the client try to figure out when a job is stuck.
For example, a job could be considered stuck if, in the last hour of running,
its fraction done hasn't changed and its incremental CPU time is < 10s.
At that point, the client could
Let's do 1) for starters, to make sure that the logic is right,
then at some point do 2).
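
A minimal sketch of the detection rule proposed above, for discussion only. It is not taken from the BOINC client source: `StuckDetector`, `update()`, and the member names are hypothetical, and the one-hour window and 10-second CPU threshold are just the example values from the proposal. It assumes the caller (for instance, a periodic poll of each running job) can supply the job's elapsed running time, fraction done, and cumulative CPU time.

```cpp
// Hypothetical per-job helper; none of these names exist in the BOINC source.
struct StuckDetector {
    // Example thresholds from the proposal; real values would need tuning.
    static constexpr double WINDOW_SECS = 3600.0;  // "the last hour of running"
    static constexpr double MIN_CPU_SECS = 10.0;   // "incremental CPU time < 10s"

    bool started = false;
    double window_start = 0.0;    // elapsed running time when the window began
    double start_fraction = 0.0;  // fraction done when the window began
    double start_cpu = 0.0;       // cumulative CPU time when the window began

    // Call periodically with the job's elapsed running time (seconds),
    // fraction done, and cumulative CPU time (seconds).
    // Returns true if the job looks stuck under the rule above.
    bool update(double elapsed, double fraction_done, double cpu_time) {
        if (!started || fraction_done != start_fraction) {
            // First sample, or fraction done changed: restart the window.
            started = true;
            window_start = elapsed;
            start_fraction = fraction_done;
            start_cpu = cpu_time;
            return false;
        }
        if (elapsed - window_start < WINDOW_SECS) {
            return false;  // not enough history yet
        }
        if (cpu_time - start_cpu < MIN_CPU_SECS) {
            return true;   // no progress and almost no CPU for a full window
        }
        // Fraction done is flat but the job is still using CPU:
        // treat it as alive and start a new window.
        window_start = elapsed;
        start_cpu = cpu_time;
        return false;
    }
};
```

Measuring the window in elapsed running time rather than wall-clock time means that periods when the job is suspended would not count toward the hour. Once the rule fires, `update()` keeps returning true until fraction done changes or the job uses more CPU, so the caller would need to decide how often to act on it.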