Fix infinite loop in gateway client when kernel no longer exists #5678
base: 6.4.x
self.log.warning("Encountered this HTTPError: {}. Disconnecting if it's a 404.".format(exc))
if exc.code == 404:
    # Disconnect if the kernel no longer exists
    self._disconnect()
Commenting here since GitHub won't let me comment outside the edited lines. Basically, this kind of logic could maybe go on line 200, or I could add exponential backoff there.
or I could add exponential backoff there
Could you please add this?
Thanks @golf-player. I'll try to take a look at this via EG in the next day or two.
Appreciated!
I'm trying to reproduce the issue using the EG 2.2.0rc2 release with Hadoop YARN. I suspect you're probably using Kubernetes. How are you forcing the kernel's termination and do you have a connection to the kernel (from the front end) at that time? Could you please provide log entries on both the EG and Notebook sides? Thank you. Once this can be reproduced, I'll move forward with PR review.
So what I did was leave a fresh notebook instance open and then restart EG. The notebook logs then repeat this ad infinitum. Note the frequency:
And then on EG's side:
Thanks for the hint - that definitely helps. So, with or without this change, it looks like the gateway handler enters the infinite loop when the gateway server goes down. I do see, however, that the loop stops once the gateway is back up. So I agree, I think we're going to need some form of backoff on the front half of this (when the handler detects the gateway has gone down).
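For reference, the kind of backoff being discussed could look roughly like the sketch below. It only shows how the wait between reconnect attempts might grow; the function name `backoff_delay` and the parameters `base`, `cap`, `jitter`, and `rng` are hypothetical and not taken from the notebook or EG code.

```python
import random


def backoff_delay(attempt, base=1.0, cap=30.0, jitter=0.0, rng=random.random):
    """Seconds to wait before reconnect attempt `attempt` (0-based).

    Hypothetical helper: the delay doubles with every failed attempt and is
    capped at `cap` so the handler never waits unreasonably long.
    """
    delay = min(cap, base * (2 ** attempt))
    # Optional jitter spreads out reconnects when many handlers fail at once.
    return delay + jitter * delay * rng()


# Delays for the first six attempts with the assumed defaults (no jitter).
delays = [backoff_delay(n) for n in range(6)]
```

With these assumed defaults the handler would retry quickly at first and then settle at one attempt every 30 seconds while the gateway stays down, instead of reconnecting in a tight loop.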
Hi @golf-player. Just checking on the status of this. I think the backoff should be part of this PR.
I've seen the same problem using EG on kube. If it's okay with @golf-player, I may take a crack at this PR.
Is this PR still relevant? If so, should it be ported to Jupyter Server?
Hi @jtpio - yeah, we should probably port this to Jupyter Server. I won't be able to get to this until perhaps the weekend. If there's someone willing to do this, that would be greatly appreciated.
_read_messages calls connect over and over again when the connection to the gateway websocket fails. In the case that the kernel no longer exists, this causes an infinite loop without any backoff. There are several solutions; this is one of them. Please let me know if you see any potential problems. E.g., is it possible that there's an initial period where this will 404 while EG is spinning up the kernel and preparing the ws endpoint? That hasn't happened to me locally.
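To make the behavior described above concrete, here is a minimal, self-contained sketch of a reconnect loop that gives up on a 404 and otherwise waits longer after each failure. It is not the actual gateway client code: `FakeHTTPError`, `connect_until_gone`, and the `max_attempts` and `sleep` parameters are all hypothetical stand-ins for illustration.

```python
class FakeHTTPError(Exception):
    """Hypothetical stand-in for an HTTP error that carries a status code."""

    def __init__(self, code):
        super().__init__("HTTP {}".format(code))
        self.code = code


def connect_until_gone(connect, max_attempts=5, sleep=None):
    """Retry connect() until it succeeds, returns 404, or attempts run out.

    Returns (result, attempts_made); result is None if no connection was made.
    """
    sleep = sleep or (lambda seconds: None)  # no-op by default so the sketch runs instantly
    for attempt in range(1, max_attempts + 1):
        try:
            return connect(), attempt
        except FakeHTTPError as exc:
            if exc.code == 404:
                # Kernel no longer exists: stop instead of reconnecting forever.
                return None, attempt
            # Any other error: wait longer after each failure, capped at 30 seconds.
            sleep(min(30.0, 2.0 ** attempt))
    return None, max_attempts
```

A real implementation would likely pass something like `time.sleep` (or schedule the retry on the event loop) rather than the no-op default used here, and whether a 404 can occur briefly while the kernel is still starting is the open question raised above.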