Local mode #4

Open
julianhess opened this issue Feb 13, 2020 · 0 comments
Labels: enhancement (New feature or request)

@julianhess (Collaborator) commented:
Because all Slurm nodes — workers and controller — are Dockerized, there's no reason we couldn't spin up an entire Slurm cluster on a single machine. This would greatly facilitate workflow development and debugging:

  • All job containers would be in the same place; looking into one would be as simple as `docker exec -ti <job name> /bin/bash`. Currently, this entails first looking up which VM the container is running on, sshing into it, and then interfacing with the container.
  • When developing a containerized task, the task's image could be live-updated with `docker commit`; the updated image would be instantly available to all worker nodes, completely eliminating the current commit/push/pull cycle (see the sketch after this list).
  • This would free wolF/Canine from having to run on Google Cloud VMs; you could do all workflow development on your laptop.
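
To illustrate the second point, here is a minimal sketch of that live-update loop using the Docker Python SDK (the SDK usage and the names `task_dev` and `my-task` are assumptions for illustration; the plain `docker commit` CLI works just as well):

```python
import docker

client = docker.from_env()

# Grab the running dev container for the task being iterated on.
# "task_dev" is a hypothetical container name.
container = client.containers.get("task_dev")

# Snapshot the container's current filesystem as a new image tag.
# In local mode, all workers share the host's Docker daemon, so the
# committed image is immediately visible to every virtual node; no
# push/pull against a remote registry is needed.
container.commit(repository="my-task", tag="dev")
```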

Currently, we assume that each node runs only one Slurm container at a time (though possibly multiple job containers); each Slurm container therefore inherits its host's network settings, so if the container is listening on port 1234, all requests to the host on port 1234 are forwarded to the container. To run in local mode, we would have to give each container its own virtual network address (see the sketch below).
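
One way to do this, sketched here with the Docker Python SDK (an assumption; the image, network, and container names are hypothetical), is a user-defined bridge network, which gives each attached container its own IP address:

```python
import docker

client = docker.from_env()

# A user-defined bridge gives every container attached to it a
# distinct IP address, so multiple Slurm containers can all bind
# the same ports without colliding on the host.
client.networks.create("canine-local", driver="bridge")

controller = client.containers.run(
    "slurm-controller-image",    # hypothetical image name
    name="slurm-controller",
    network="canine-local",
    detach=True,
)

worker = client.containers.run(
    "slurm-worker-image",        # hypothetical image name
    name="slurm-worker-0",
    network="canine-local",
    detach=True,
)

# Each container now has its own address on the bridge; the
# controller can also reach workers by name via Docker's embedded DNS.
worker.reload()
print(worker.attrs["NetworkSettings"]["Networks"]["canine-local"]["IPAddress"])
```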

We will also have to modify the autoscaling scripts. They are currently GCP-only: they add nodes by creating compute instances and remove nodes by deleting them. In local mode, we would instead add virtual nodes by spinning up Docker containers and remove nodes by stopping their containers (sketched below).
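
A rough sketch of what the local-mode scaling operations could look like, again using the Docker Python SDK (the image name, network name, and naming scheme are all hypothetical):

```python
import docker

client = docker.from_env()

def add_node(index: int):
    """Scale up: the local-mode analogue of creating a compute
    instance is starting one more worker container."""
    return client.containers.run(
        "slurm-worker-image",            # hypothetical image name
        name=f"slurm-worker-{index}",    # hypothetical naming scheme
        network="canine-local",
        detach=True,
    )

def remove_node(index: int):
    """Scale down: stop and remove the worker's container instead
    of deleting a compute instance."""
    container = client.containers.get(f"slurm-worker-{index}")
    container.stop()
    container.remove()
```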

Finally, this will entail some changes to the Canine Docker backend, which makes a few gcloud-specific assumptions. We might need a child class of the current `DockerTransientImageSlurmBackend`, along the lines of the stub below.
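
A skeleton of what such a subclass might look like. Only the base class name comes from this issue; the import path and method names are hypothetical and may not match the real Canine API:

```python
# Import path and method names are hypothetical; only the base class
# name, DockerTransientImageSlurmBackend, comes from this issue.
from canine.backends import DockerTransientImageSlurmBackend

class LocalSlurmBackend(DockerTransientImageSlurmBackend):
    """Local-mode backend: provisions virtual nodes as Docker
    containers on one machine instead of as GCP compute instances."""

    def add_node(self, name: str):
        # Spin up a worker container attached to the local bridge
        # network, rather than calling out to gcloud.
        ...

    def remove_node(self, name: str):
        # Stop the named worker's container, rather than deleting
        # a GCP instance.
        ...
```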

@julianhess added the enhancement label on Feb 13, 2020