Because all Slurm nodes — workers and controller — are Dockerized, there's no reason we couldn't spin up an entire Slurm cluster on a single machine. This would greatly facilitate workflow development and debugging:
- All job containers would be in the same place; looking into one would be as simple as `docker exec -ti <job name> /bin/bash`. Currently, this requires first looking up which VM the container is running on, SSHing into it, and then interfacing with the container.
- When developing a containerized task, the task's image could be live-updated with `docker commit`; the updated image would be instantly available to all worker nodes, completely eliminating the current commit/push/pull cycle (see the sketch after this list).
- This would free wolF/Canine from having to run on Google Cloud VMs; you could do all workflow development on your laptop.
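For concreteness, here is a minimal sketch of what that exec/commit loop could look like through the docker-py SDK. The container name `my_task_dev`, the image tag `my-task:dev`, and the command being exec'd are illustrative placeholders, not names used by wolF/Canine.

```python
# Hypothetical sketch of the commit-based live-update loop via docker-py.
# All container/image names and the exec'd command are illustrative only.
import docker

client = docker.from_env()

# Grab the running development container for the task.
container = client.containers.get("my_task_dev")

# Try out a change inside the container (the SDK equivalent of `docker exec`).
exit_code, output = container.exec_run("python /opt/task/run_step.py --dry-run")
print(output.decode())

# Snapshot the container's current filesystem as a new image layer.
# Because every Slurm worker lives on the same machine, the updated image
# is immediately visible to all of them -- no push/pull needed.
container.commit(repository="my-task", tag="dev")
```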
Currently, we assume that each node runs only one Slurm container at a time (but possibly multiple job containers); each Slurm container therefore inherits its host's network settings, so if the container is listening on port 1234, all requests to the host on port 1234 are forwarded to the container. To run in local mode, we would have to give each container its own virtual network address.
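A minimal sketch of that networking change, assuming we place the Slurm containers on a user-defined bridge network via docker-py (the `slurm-image`, `controller`, and `worker0` names are placeholders). Containers on a user-defined bridge each get their own IP and can resolve one another by name, so the controller and workers no longer contend for the host's ports.

```python
# Hypothetical sketch: give each Slurm container its own address on a
# user-defined bridge network instead of sharing the host's network.
import docker

client = docker.from_env()

# Containers attached to a user-defined bridge get their own IPs and can
# resolve each other by container name, so slurmctld/slurmd can talk directly.
client.networks.create("slurm-local", driver="bridge")

controller = client.containers.run(
    "slurm-image", name="controller", hostname="controller",
    network="slurm-local", detach=True,
)
worker = client.containers.run(
    "slurm-image", name="worker0", hostname="worker0",
    network="slurm-local", detach=True,
)
```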
We will also have to modify the autoscaling scripts. They are currently GCP-only: they add nodes by creating compute instances and remove nodes by deleting them. In local mode, we would add virtual nodes by spinning up Docker containers and remove nodes by stopping their containers.
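As a sketch, the local-mode scale-up/scale-down operations might reduce to something like the following docker-py calls; the `add_node`/`remove_node` names and the `slurm-image`/`worker{index}` strings are hypothetical, not the autoscaling scripts' actual interface, and the `slurm-local` network is the one assumed in the previous sketch.

```python
# Hypothetical local-mode analogues of the GCP scale-up/scale-down calls:
# nodes become containers rather than compute instances. Names are illustrative.
import docker

client = docker.from_env()

def add_node(index: int, image: str = "slurm-image") -> None:
    """Scale up: start one more worker container on the local bridge network."""
    client.containers.run(
        image, name=f"worker{index}", hostname=f"worker{index}",
        network="slurm-local", detach=True,
    )

def remove_node(index: int) -> None:
    """Scale down: stop (rather than delete) the corresponding container."""
    client.containers.get(f"worker{index}").stop()
```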
Finally, this will entail some changes to the Canine Docker backend, which makes a few gcloud-specific assumptions. We might need a child class of the current `DockerTransientImageSlurmBackend`.
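A rough sketch of what that child class could look like; `DockerTransientImageSlurmBackend` is the real class named above, but the import path and the overridden method names here are assumptions for illustration only.

```python
# Hypothetical shape of a local-mode backend. The module path and the method
# names below are assumptions; only the parent class name comes from Canine.
from canine.backends.dockerTransient import DockerTransientImageSlurmBackend  # assumed path

class LocalDockerSlurmBackend(DockerTransientImageSlurmBackend):
    """Local-mode variant: replaces gcloud-specific node management with
    plain Docker operations on the current machine."""

    def add_node(self, name: str) -> None:
        # Assumed hook: instead of creating a GCP compute instance,
        # start another worker container (see the autoscaling sketch above).
        ...

    def remove_node(self, name: str) -> None:
        # Assumed hook: stop the worker's container instead of deleting a VM.
        ...
```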