Workers disconnect after an unknown period of time #1530
Comments
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. We value your input and contribution. Please leave a comment if this issue still affects you.
Jobsrv reports:
|
This problem can be reproduced on a Linux target as well.
Note also that the builder database has a table, 'busy_workers', that lists the active builder workers. When a worker instance goes down while in the busy state, its failure to send a heartbeat results in the worker being removed from jobsrv and from the busy_workers table. The job transitions to a pending state, where it remains until a new worker for the target (e.g. x86_64-linux) becomes available or the job timeout (60 minutes by default) is reached.
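A minimal sketch of the transition described above, for a job whose busy worker has been lost. The names `JobState`, `after_worker_lost`, and `DEFAULT_JOB_TIMEOUT` are hypothetical and are not taken from builder-jobsrv. Treating a reached timeout as a cancellation is an assumption, based on the issue description saying the job is cancelled once the timeout elapses.

```rust
use std::time::Duration;

/// Hypothetical job states; the names mirror the wording in this thread.
#[derive(Debug, PartialEq)]
pub enum JobState {
    Pending,
    Dispatching,
    Cancelled,
}

/// The 60 minute default job timeout mentioned above.
pub const DEFAULT_JOB_TIMEOUT: Duration = Duration::from_secs(60 * 60);

/// Decide what happens to a job whose worker was removed from busy_workers.
/// `waited` is how long the job has been waiting since it was created.
pub fn after_worker_lost(
    waited: Duration,
    worker_available: bool,
    job_timeout: Duration,
) -> JobState {
    if waited >= job_timeout {
        // Assumption: a job that reaches the timeout is cancelled.
        JobState::Cancelled
    } else if worker_available {
        // A worker for the same target (e.g. x86_64-linux) has come online.
        JobState::Dispatching
    } else {
        // No worker yet, so the job keeps waiting.
        JobState::Pending
    }
}
```

Checking the timeout before worker availability is a design choice in this sketch, so that an expired job is never handed to a worker that has just connected.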
Rust does support setting up keepalive, and this would result in the ROUTER socket in builder-jobsrv continuing to test for connectivity. What I am not clear on is how we would know when the client disconnects. We want to know immediately if a client is disconnected so we can ensure it is no longer in the builder-jobsrv worker list. We already send heartbeats at the application layer from the builder-worker instance to the builder-jobsrv instance. The absence of heartbeats results in a disconnected state and ultimately in the worker instance being removed from the worker list. This is desirable because we will know within one heartbeat interval if we have lost a client connection.
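The heartbeat-based removal described above can be sketched roughly as follows. `WorkerRegistry` and its methods are hypothetical names used for illustration, not the builder-jobsrv implementation, and timestamps are passed in explicitly so the behaviour is easy to check.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical worker list keyed by worker identity.
pub struct WorkerRegistry {
    timeout: Duration,
    last_seen: HashMap<String, Instant>,
}

impl WorkerRegistry {
    pub fn new(timeout: Duration) -> Self {
        WorkerRegistry { timeout, last_seen: HashMap::new() }
    }

    /// Record a heartbeat received from `worker` at time `now`.
    pub fn heartbeat(&mut self, worker: &str, now: Instant) {
        self.last_seen.insert(worker.to_string(), now);
    }

    /// Remove and return (sorted) every worker whose last heartbeat
    /// is older than the configured timeout.
    pub fn expire(&mut self, now: Instant) -> Vec<String> {
        let timeout = self.timeout;
        let mut expired: Vec<String> = self
            .last_seen
            .iter()
            .filter(|(_, seen)| now.saturating_duration_since(**seen) > timeout)
            .map(|(id, _)| id.clone())
            .collect();
        expired.sort();
        for id in &expired {
            self.last_seen.remove(id);
        }
        expired
    }

    /// True while the worker is still in the worker list.
    pub fn is_connected(&self, worker: &str) -> bool {
        self.last_seen.contains_key(worker)
    }
}
```

In this sketch a caller would invoke `expire` on each tick of its own loop and drop the returned workers from whatever dispatch list it maintains.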
After some amount of time, workers stop responding to new jobs. This has only been observed on Windows and Kernel2 workers.
The observed behavior is that a job remains in the Dispatching state until the cfg.job_timeout period elapses and is then cancelled.
The worker is connected and we see heartbeats continue. It also remains present in the metrics dashboard. Our heartbeat channel is separate from our job dispatch channel.
Currently, the remediation is to restart the builder-worker service on affected build nodes.
It appears that the zmq::ROUTER socket is no longer transmitting messages to the client. It is a known zmq pattern that if a client connects to a ROUTER socket but does not send heartbeats, it may time out and the server won't be able to reconnect. We suspect we need to send KEEPALIVES, as described in https://zguide.zeromq.org/docs/chapter4/#Heartbeating, to keep the channel alive.
An alternative implementation would be to send jobs to workers to keep them alive. The downside is that we don't know how frequently we would need to dispatch to keep them alive.
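One way to combine the two ideas above is to treat any message on the dispatch channel as keepalive traffic and send an explicit ping only when the channel has been idle. The sketch below is illustrative: `IdlePinger` is a hypothetical helper that does not exist in builder-jobsrv, and the interval is an assumption that would have to be shorter than whatever idle timeout is dropping the connection.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper that decides when an idle dispatch channel needs a ping.
pub struct IdlePinger {
    interval: Duration,
    last_traffic: Instant,
}

impl IdlePinger {
    pub fn new(interval: Duration, now: Instant) -> Self {
        IdlePinger { interval, last_traffic: now }
    }

    /// Call whenever a job or a ping is sent on the dispatch channel.
    pub fn record_traffic(&mut self, now: Instant) {
        self.last_traffic = now;
    }

    /// True when nothing has been sent for at least one interval.
    pub fn ping_due(&self, now: Instant) -> bool {
        now.saturating_duration_since(self.last_traffic) >= self.interval
    }
}
```

With this shape, a dispatched job resets the idle timer, so a busy channel sends no extra pings and an idle one sends one per interval.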