Fix infinite loop in gateway client when kernel no longer exists #5678

Open
wants to merge 1 commit into
base: 6.4.x

Conversation

golf-player

_read_messages calls connect over and over when the connection to the gateway websocket fails. If the kernel no longer exists, this results in an infinite loop with no backoff.

There are several possible solutions; this is one of them. Please let me know if you see any potential problems. For example, is it possible that there's an initial period where this will 404 while EG is spinning up the kernel and preparing the ws endpoint? That hasn't happened to me locally.

self.log.warning("Encountered this HTTPError: {}. Disconnecting if it's a 404.".format(exc))
if exc.code == 404:
    # Disconnect if the kernel no longer exists
    self._disconnect()
Author


Commenting here since GitHub won't let me comment outside the edited lines. Basically, this kind of logic could go on line 200, or I could add exponential backoff there.

Member


or I could add exponential backoff there

Could you please add this?

@kevin-bates
Member

Thanks @golf-player. I'll try to take a look at this via EG in the next day or two.

@kevin-bates kevin-bates self-requested a review August 13, 2020 17:17
@golf-player
Author

golf-player commented Aug 13, 2020 via email

@kevin-bates
Member

I'm trying to reproduce the issue using the EG 2.2.0rc2 release with Hadoop YARN. I suspect you're probably using Kubernetes. How are you forcing the kernel's termination and do you have a connection to the kernel (from the front end) at that time? Could you please provide log entries on both the EG and Notebook sides? Thank you.

Once this can be reproduced, I'll move forward with PR review.

@golf-player
Author

So what I did was leave a fresh notebook instance open and then restart EG.

The notebook logs repeat the following ad infinitum. Note the frequency:

[W 11:53:17.757 NotebookApp] Websocket connection has been closed via client disconnect or due to error.  Kernel with ID '8b0894d8-edbd-4484-b0d5-388242364f79' may not be terminated on GatewayClient: https://enterprise-gateway-ishbook.bigqa.k8s.indeed.tech/
[I 11:53:17.758 NotebookApp] Attempting to re-establish the connection to Gateway: 8b0894d8-edbd-4484-b0d5-388242364f79
[I 11:53:17.758 NotebookApp] Connecting to wss://enterprise-gateway-ishbook.bigqa.k8s.indeed.tech/api/kernels/8b0894d8-edbd-4484-b0d5-388242364f79/channels
[W 11:53:17.941 NotebookApp] Websocket connection has been closed via client disconnect or due to error.  Kernel with ID '8b0894d8-edbd-4484-b0d5-388242364f79' may not be terminated on GatewayClient: https://enterprise-gateway-ishbook.bigqa.k8s.indeed.tech/
[I 11:53:17.941 NotebookApp] Attempting to re-establish the connection to Gateway: 8b0894d8-edbd-4484-b0d5-388242364f79
[I 11:53:17.941 NotebookApp] Connecting to wss://enterprise-gateway-ishbook.bigqa.k8s.indeed.tech/api/kernels/8b0894d8-edbd-4484-b0d5-388242364f79/channels
[W 11:53:18.137 NotebookApp] Websocket connection has been closed via client disconnect or due to error.  Kernel with ID '8b0894d8-edbd-4484-b0d5-388242364f79' may not be terminated on GatewayClient: https://enterprise-gateway-ishbook.bigqa.k8s.indeed.tech/
[I 11:53:18.138 NotebookApp] Attempting to re-establish the connection to Gateway: 8b0894d8-edbd-4484-b0d5-388242364f79
[I 11:53:18.138 NotebookApp] Connecting to wss://enterprise-gateway-ishbook.bigqa.k8s.indeed.tech/api/kernels/8b0894d8-edbd-4484-b0d5-388242364f79/channels
[W 11:53:18.319 NotebookApp] Websocket connection has been closed via client disconnect or due to error.  Kernel with ID '8b0894d8-edbd-4484-b0d5-388242364f79' may not be terminated on GatewayClient: https://enterprise-gateway-ishbook.bigqa.k8s.indeed.tech/
[I 11:53:18.320 NotebookApp] Attempting to re-establish the connection to Gateway: 8b0894d8-edbd-4484-b0d5-388242364f79
[I 11:53:18.320 NotebookApp] Connecting to wss://enterprise-gateway-ishbook.bigqa.k8s.indeed.tech/api/kernels/8b0894d8-edbd-4484-b0d5-388242364f79/channels
[W 11:53:18.507 NotebookApp] Websocket connection has been closed via client disconnect or due to error.  Kernel with ID '8b0894d8-edbd-4484-b0d5-388242364f79' may not be terminated on GatewayClient: https://enterprise-gateway-ishbook.bigqa.k8s.indeed.tech/
[I 11:53:18.508 NotebookApp] Attempting to re-establish the connection to Gateway: 8b0894d8-edbd-4484-b0d5-388242364f79
[I 11:53:18.508 NotebookApp] Connecting to wss://enterprise-gateway-ishbook.bigqa.k8s.indeed.tech/api/kernels/8b0894d8-edbd-4484-b0d5-388242364f79/channels
[W 11:53:18.681 NotebookApp] Websocket connection has been closed via client disconnect or due to error.  Kernel with ID '8b0894d8-edbd-4484-b0d5-388242364f79' may not be terminated on GatewayClient: https://enterprise-gateway-ishbook.bigqa.k8s.indeed.tech/
[I 11:53:18.681 NotebookApp] Attempting to re-establish the connection to Gateway: 8b0894d8-edbd-4484-b0d5-388242364f79
[I 11:53:18.681 NotebookApp] Connecting to wss://enterprise-gateway-ishbook.bigqa.k8s.indeed.tech/api/kernels/8b0894d8-edbd-4484-b0d5-388242364f79/channels
[W 11:53:18.838 NotebookApp] Websocket connection has been closed via client disconnect or due to error.  Kernel with ID '8b0894d8-edbd-4484-b0d5-388242364f79' may not be terminated on GatewayClient: https://enterprise-gateway-ishbook.bigqa.k8s.indeed.tech/
[I 11:53:18.838 NotebookApp] Attempting to re-establish the connection to Gateway: 8b0894d8-edbd-4484-b0d5-388242364f79
[I 11:53:18.839 NotebookApp] Connecting to wss://enterprise-gateway-ishbook.bigqa.k8s.indeed.tech/api/kernels/8b0894d8-edbd-4484-b0d5-388242364f79/channels
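
As a rough check of that frequency, the gap between reconnect attempts can be computed from the "Connecting to" timestamps in the log above. This is only a small sketch; the timestamps are copied by hand from those log lines, and the variable names are illustrative.

```python
from datetime import datetime

# Timestamps of the seven "Connecting to wss://..." lines in the notebook log.
stamps = [
    "11:53:17.758",
    "11:53:17.941",
    "11:53:18.138",
    "11:53:18.320",
    "11:53:18.508",
    "11:53:18.681",
    "11:53:18.839",
]

times = [datetime.strptime(s, "%H:%M:%S.%f") for s in stamps]

# Seconds elapsed between consecutive reconnect attempts.
gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
mean_gap = sum(gaps) / len(gaps)

print(f"mean gap between attempts: {mean_gap:.3f}s")
```

The mean gap comes out at about 0.18 seconds, i.e. roughly five to six reconnect attempts per second with no pause between failures.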

And then on EG's side:

[W 200814 16:53:17 web:2250] 404 GET /api/kernels/8b0894d8-edbd-4484-b0d5-388242364f79/channels (10.42.9.69) 2.87ms
[D 2020-08-14 16:53:17.401 EnterpriseGatewayApp] Initializing websocket connection /api/kernels/8b0894d8-edbd-4484-b0d5-388242364f79/channels
[W 2020-08-14 16:53:17.403 EnterpriseGatewayApp] No session ID specified
[W 200814 16:53:17 web:1786] 404 GET /api/kernels/8b0894d8-edbd-4484-b0d5-388242364f79/channels (10.42.9.69): Kernel does not exist: 8b0894d8-edbd-4484-b0d5-388242364f79
[W 200814 16:53:17 web:2250] 404 GET /api/kernels/8b0894d8-edbd-4484-b0d5-388242364f79/channels (10.42.9.69) 3.96ms
[D 2020-08-14 16:53:17.576 EnterpriseGatewayApp] Initializing websocket connection /api/kernels/8b0894d8-edbd-4484-b0d5-388242364f79/channels
[W 2020-08-14 16:53:17.579 EnterpriseGatewayApp] No session ID specified
[W 200814 16:53:17 web:1786] 404 GET /api/kernels/8b0894d8-edbd-4484-b0d5-388242364f79/channels (10.42.9.69): Kernel does not exist: 8b0894d8-edbd-4484-b0d5-388242364f79
[W 200814 16:53:17 web:2250] 404 GET /api/kernels/8b0894d8-edbd-4484-b0d5-388242364f79/channels (10.42.9.69) 4.16ms
[D 2020-08-14 16:53:17.751 EnterpriseGatewayApp] Initializing websocket connection /api/kernels/8b0894d8-edbd-4484-b0d5-388242364f79/channels
[W 2020-08-14 16:53:17.754 EnterpriseGatewayApp] No session ID specified
[W 200814 16:53:17 web:1786] 404 GET /api/kernels/8b0894d8-edbd-4484-b0d5-388242364f79/channels (10.42.9.69): Kernel does not exist: 8b0894d8-edbd-4484-b0d5-388242364f79
[W 200814 16:53:17 web:2250] 404 GET /api/kernels/8b0894d8-edbd-4484-b0d5-388242364f79/channels (10.42.9.69) 4.06ms
[D 2020-08-14 16:53:17.926 EnterpriseGatewayApp] Initializing websocket connection /api/kernels/8b0894d8-edbd-4484-b0d5-388242364f79/channels
[W 2020-08-14 16:53:17.929 EnterpriseGatewayApp] No session ID specified
[W 200814 16:53:17 web:1786] 404 GET /api/kernels/8b0894d8-edbd-4484-b0d5-388242364f79/channels (10.42.16.10): Kernel does not exist: 8b0894d8-edbd-4484-b0d5-388242364f79

@kevin-bates
Member

Thanks for the hint - that definitely helps.

So, with or without this change, it looks like the gateway handler enters the infinite loop when the gateway server goes down. I do see, however, that the loop stops once the gateway is back up. So I agree, I think we're going to need some form of backoff on the front half of this (when the handler detects the gateway has gone down).
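
For illustration, the backoff discussed above could look roughly like the sketch below. It is not the code from this PR or from the notebook gateway handler: `backoff_delays`, `reconnect_with_backoff`, and `KernelGone` are hypothetical names, and the defaults (1 second initial delay, doubling, 30 second cap, 5 attempts) are assumptions. The idea is to wait longer after each failed reconnect, and to stop immediately when the gateway reports that the kernel no longer exists.

```python
import asyncio


class KernelGone(Exception):
    """Hypothetical signal that the gateway answered 404 for this kernel."""


def backoff_delays(initial=1.0, factor=2.0, max_delay=30.0, max_retries=5):
    """Yield capped exponential delays: initial, initial*factor, ... <= max_delay."""
    delay = initial
    for _ in range(max_retries):
        yield min(delay, max_delay)
        delay *= factor


async def reconnect_with_backoff(connect, sleep=asyncio.sleep, **backoff_kwargs):
    """Call the `connect` coroutine function until it succeeds.

    Returns True on success, and False if the retries run out or the
    kernel is gone. `sleep` is injectable so the loop can be tested
    without real waiting.
    """
    for delay in backoff_delays(**backoff_kwargs):
        try:
            await connect()
            return True
        except KernelGone:
            # Retrying cannot help if the kernel no longer exists.
            return False
        except ConnectionError:
            # Gateway unreachable: wait, then try again with a longer delay.
            await sleep(delay)
    return False
```

With the assumed defaults, a gateway that stays down is retried after 1, 2, 4, 8, and 16 seconds rather than several times per second, and a missing kernel ends the loop on the first attempt.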

@kevin-bates
Member

Hi @golf-player. Just checking on the status of this. I think the backoff should be part of this PR.

@telamonian
Contributor

I've seen the same problem using EG on kube. If it's okay with @golf-player, I may take a crack at this PR

@Zsailer Zsailer changed the base branch from main to 6.4.x March 7, 2022 18:22
@jtpio
Member

jtpio commented Jul 26, 2023

Is this PR still relevant? If so, should it be ported to Jupyter Server?

@kevin-bates
Member

Hi @jtpio - yeah, we should probably port this to Jupyter Server. I won't be able to get to this until perhaps the weekend. If there's someone willing to do this, that would be greatly appreciated.


4 participants