feat: Improve process termination logic in multiprocess manager #2371

Open
wants to merge 4 commits into
base: master
from


abersheeran
Member

Summary

About #2369

In our cluster, we discovered zombie processes by accident and tracked down the cause. Uvicorn's new process manager joins child processes one by one after sending all exit signals. If an earlier child process takes a long time to exit, the child processes after it cannot be joined.

I noticed that Uvicorn already has a built-in shutdown timeout; it would be nice if we could use it with the multiprocess manager.

We don't terminate and join each process sequentially because signalling all processes first kills them faster, as mentioned in PR #2010.
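The behaviour described above can be sketched roughly as follows. This is not the PR's actual code: `terminate_all`, `sleepy_worker`, and the shared-deadline calculation are hypothetical names and choices made for illustration, and the sketch assumes a POSIX system where the `fork` start method is available.

```python
import multiprocessing
import time

# Assumption: POSIX only. The "fork" start method keeps this sketch
# self-contained; Uvicorn itself may use a different start method.
ctx = multiprocessing.get_context("fork")


def sleepy_worker(seconds: float) -> None:
    """Hypothetical worker that just sleeps; SIGTERM ends it early."""
    time.sleep(seconds)


def terminate_all(processes, timeout: float) -> None:
    """Signal every child first, then join them against one shared deadline."""
    for process in processes:
        process.terminate()  # sends SIGTERM and returns without waiting

    deadline = time.monotonic() + timeout
    for process in processes:
        remaining = max(0.0, deadline - time.monotonic())
        process.join(remaining)
        # Still running once its wait is over: escalate to SIGKILL
        # and retry until the exit code is available.
        while process.exitcode is None:
            process.kill()
            process.join(1)


if __name__ == "__main__":
    children = [ctx.Process(target=sleepy_worker, args=(30,)) for _ in range(3)]
    for child in children:
        child.start()
    terminate_all(children, timeout=5.0)
    print([child.exitcode for child in children])
```

Because every `join` is bounded, one slow child can no longer keep the remaining children from being reaped.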

Checklist

  • I understand that this PR may be closed in case there was no previous discussion. (This doesn't apply to typos!)
  • I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
  • I've updated the documentation accordingly.

@abersheeran
Member Author

I will add unit tests later, but I have no experience in designing a process that will hang, so if anyone can help, it would be greatly appreciated.

@abersheeran abersheeran requested a review from Kludex June 25, 2024 07:32
@Kludex
Member

Kludex commented Jul 31, 2024

> I will add unit tests later, but I have no experience in designing a process that will hang, so if anyone can help, it would be greatly appreciated.

I'll try to check this over the weekend. Sorry for the delay.

For next time: your PRs always have priority, @abersheeran. Please ping me if I take too long.

@Kludex Kludex self-assigned this Jul 31, 2024
@Kludex
Member

Kludex commented Aug 11, 2024

I'm not sure if we should use the same timeout_graceful_shutdown 🤔

@abersheeran
Member Author

Do you have any new ideas? We really need a configurable timeout here.

Comment on lines 93 to 99
     def join(self, join_timeout: float | None = None) -> None:
         logger.info(f"Waiting for child process [{self.process.pid}]")
-        self.process.join()
+        self.process.join(join_timeout)
         # Timeout, kill the process
         while self.process.exitcode is None:
             self.process.kill()
             self.process.join(1)
Member


Why do we have a join(1) here?

Member Author


It waits for the kill signal to take effect. If the process has not exited within 1 second, the kill signal is sent again.

uvicorn/supervisors/multiprocess.py (outdated)
        self.process.join(timeout)
        # Timeout, kill the process
        while self.process.exitcode is None:
            self.process.kill()
Member Author


CI failed because this branch is not covered by the tests, but I don't know how to design a process that is guaranteed to hang.
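One possible way to get a child that reliably outlives `terminate()` is to have it ignore SIGTERM and report back once the handler is installed. The sketch below is only an illustration under that assumption: `stubborn_worker` and `start_stubborn_process` are hypothetical names, nothing here comes from the PR, and it assumes a POSIX system with the `fork` start method.

```python
import multiprocessing
import signal
import time

# Assumption: POSIX only; "fork" keeps the sketch self-contained.
ctx = multiprocessing.get_context("fork")


def stubborn_worker(ready) -> None:
    """Hypothetical child that ignores SIGTERM and never exits by itself."""
    signal.signal(signal.SIGTERM, signal.SIG_IGN)
    ready.set()  # tell the parent that SIGTERM is now being ignored
    while True:
        time.sleep(1)


def start_stubborn_process():
    """Start the child and wait until it has installed its signal handler."""
    ready = ctx.Event()
    process = ctx.Process(target=stubborn_worker, args=(ready,))
    process.start()
    if not ready.wait(timeout=5):
        process.kill()
        process.join()
        raise RuntimeError("worker did not start in time")
    return process


if __name__ == "__main__":
    process = start_stubborn_process()
    process.terminate()  # SIGTERM is ignored by the child
    process.join(0.5)
    print("alive after SIGTERM:", process.exitcode is None)
    process.kill()  # SIGKILL cannot be ignored
    process.join()
    print("exit code after SIGKILL:", process.exitcode)
```

Waiting on the event before sending SIGTERM avoids the race where the signal arrives before the child has started ignoring it.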

Member


It's okay to add the pragma here.
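Assuming "the pragma" refers to coverage.py's `# pragma: no cover` exclusion comment, it might be applied along these lines. `join_or_kill` and `quick_worker` are hypothetical stand-ins rather than the PR's code, and the sketch assumes a POSIX system with the `fork` start method.

```python
import multiprocessing
from typing import Optional

# Assumption: POSIX only; "fork" keeps the sketch self-contained.
ctx = multiprocessing.get_context("fork")


def quick_worker() -> None:
    """Hypothetical worker that returns immediately."""


def join_or_kill(process, timeout: Optional[float] = None) -> None:
    """Join with a timeout, then kill repeatedly if the child is still alive."""
    process.join(timeout)
    # Timeout, kill the process.
    # With coverage.py's default settings, the pragma on the line that
    # opens the block excludes the whole loop from the coverage report.
    while process.exitcode is None:  # pragma: no cover
        process.kill()
        process.join(1)


if __name__ == "__main__":
    child = ctx.Process(target=quick_worker)
    child.start()
    join_or_kill(child, timeout=5.0)
    print(child.exitcode)
```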
