
Some high-level questions around usage of kubespawner #18

Closed · Analect opened this issue Jan 5, 2017 · 6 comments

Comments

@Analect (Contributor) commented Jan 5, 2017

@yuvipanda
Thanks for all your work on kubespawner. I've started experimenting with running jupyterhub on kubernetes, largely thanks to this spawner, but I wanted to get some guidance on my use-cases / workflow from someone a bit more seasoned in this technology. I'm structuring these as a series of high-level questions, where your input would be much appreciated. For ease of explanation, I'll refer to the rough sketch below.

[image: rough sketch of the intended workflow, steps 1–6]

My efforts so far, for context:
I was working through the data-8/jupyterhub-k8s implementation, which I think is based on your work. Its structure as a chart (for Helm) makes it the easiest to work with compared to some of the other implementations I've found out there.

I modified that set-up slightly to handle gitlab authentication (rather than google), which worked OK, but I wasn't able to get the spawning of their large user image (>5GB), based on this Dockerfile, together with their hub image, to work. It was constantly stuck in a Waiting: ContainerCreating state and would then try to re-spawn itself. I haven't figured out what the problem is, but there appears to be plenty of space on the cluster. I'm using Kubernetes v1.5.1 on GCE.

Anyway, I ended up getting things working by instead using the hub image below (a variation of the data-8 one; Dockerfile shown), in conjunction with your yuvipanda/simple-singleuser:v1 user image.

FROM jupyterhub/jupyterhub-onbuild:0.7.1
# Install kubespawner and its dependencies
RUN /opt/conda/bin/pip install \
    oauthenticator==0.5.* \
    git+https://github.com/derrickmar/kubespawner \
    git+https://github.com/yuvipanda/jupyterhub-nginx-chp.git
ADD jupyterhub_config.py /srv/jupyterhub_config.py
ADD userlist /srv/userlist
WORKDIR /srv/jupyterhub
EXPOSE 8081
CMD jupyterhub --config /srv/jupyterhub_config.py --no-ssl

This was able to spawn new per-user persistent volumes, bind them to PVCs and, of course, spawn user Jupyter notebook servers, which could be stopped/started and re-use the same PV. My initial tests of whether new files/notebooks were being persisted on the PV failed, because I wasn't saving them under /home, which is where the volume is mounted.
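
As a rough sketch (not the exact config I'm running; the claim-name template and /home/jovyan mount path are assumptions), the KubeSpawner volume wiring that produces this behaviour looks something like:

c.KubeSpawner.volumes = [
    {
        'name': 'volume-{username}',
        'persistentVolumeClaim': {
            'claimName': 'claim-{username}'
        }
    }
]
c.KubeSpawner.volume_mounts = [
    {
        # only files written under this mount path survive pod restarts
        'name': 'volume-{username}',
        'mountPath': '/home/jovyan'
    }
]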

i. user management / userid - After various aborted attempts to get the larger data-8 user image working (during which user PVs weren't deleted), I noticed that the userid appended to the username for naming the PV kept incrementing, but it wasn't clear where this numbering logic was coming from, as it wasn't an env variable in any of the manifests. Is this a fail-safe of some sort?

Currently, I'm using a whitelist userlist for users (see code from jupyterhub_config.py below), and these correspond to the gitlab logins I'm authenticating my users against. However, it's probably not a clean solution. I see you are working on another approach with fsGroup and just wanted to get a better understanding of the context for that solution?

# Whitelist users and admins
import os

c.Authenticator.whitelist = whitelist = set()
c.Authenticator.admin_users = admin = set()
c.JupyterHub.admin_access = True
pwd = os.path.dirname(__file__)
with open(os.path.join(pwd, 'userlist')) as f:
    for line in f:
        # skip blank lines
        if not line.strip():
            continue
        # each line is "<username>", optionally followed by "admin"
        parts = line.split()
        name = parts[0]
        whitelist.add(name)
        if len(parts) > 1 and parts[1] == 'admin':
            admin.add(name)
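
The userlist file itself is just one username per line, with an optional 'admin' marker after it, e.g. (names here are invented):

someuser admin
regular-user-1
regular-user-2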

ii. possibility for interchangeable images - I find the current default JupyterHub set-up, where only a single image can be spawned, very limiting. I can see from #14 that you are considering extending kubespawner to allow an image to be selected. @minrk was able to confirm over here that it could be possible to pass this image selection programmatically via the jupyterhub API, although I'm not sure, as per this issue, whether the hub API will work in a kubernetes context.

You pointed to an implementation by Google here. It's not clear to me where they are deriving their list of available images. How do you think something like this should work?
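
One way I could imagine it working (just a sketch; the hard-coded image list and class name are mine, not anything kubespawner provides) is to surface the choice through JupyterHub's Spawner.options_form / options_from_form hooks, so the selected image lands in user_options:

from kubespawner.spawner import KubeSpawner

class ImageFormSpawner(KubeSpawner):
    # purely illustrative, hard-coded list of allowed images
    image_choices = [
        'yuvipanda/simple-singleuser:v1',
        'my-private-registry/my-simple-singleuser:v1.1',
    ]

    def _options_form_default(self):
        # render a simple <select> on the spawn page
        options = ''.join(
            '<option value="{0}">{0}</option>'.format(img)
            for img in self.image_choices
        )
        return '<label for="image">Image</label><select name="image">' + options + '</select>'

    def options_from_form(self, formdata):
        # form values arrive as lists of strings
        return {'image': formdata['image'][0]}

c.JupyterHub.spawner_class = ImageFormSpawner

The image chosen this way still has to be applied to the spawner, e.g. via the user_options observer shown further down this thread.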

As per the sketch up top, I'm looking to handle a set-up where users have various private/shared repos (marked 1 in the sketch), from which docker images are generated and stored in a registry (2). My users (3) would then be able to spawn a compute environment for their chosen repo and have it spawned in kubernetes (4), with the possibility, from 5, of having the repo cloned (maybe leveraging gitRepo) and of persisting any incremental work performed on it while on the notebook server (6).

iii. multiple simultaneous servers per user based on different images - As far as I understand, it's not presently possible with JupyterHub for a user to have multiple instances of a notebook server, each running a different image? Do the tools exist within kubernetes to potentially facilitate this? Thinking out loud, could this be facilitated by having multiple smaller persistent volumes per user, based on the repo from which the server image is derived? Or maybe it could be achieved within a single PV, using the subPath functionality? See the snippet and subPath sketch below.

c.KubeSpawner.volumes = [
    {
        'name': 'volume-{username}-{repo-namespace}-{repo-name}',
        'persistentVolumeClaim': {
            'claimName': 'claim-{username}-{repo-namespace}-{repo-name}'
        }
    }
]
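
For the single-PV variant, a subPath in the volume mount could carve out a per-repo directory on one claim. A sketch (the {repo-name} placeholder follows the hypothetical naming above and is not an existing kubespawner template variable; /home/jovyan is also an assumption):

c.KubeSpawner.volume_mounts = [
    {
        'name': 'volume-{username}',
        'mountPath': '/home/jovyan',
        # mount only this sub-directory of the per-user PV, so each
        # repo/image gets its own working area on a single claim
        'subPath': '{repo-name}'
    }
]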

iv. ideas around version-control - Given the various advantages of using kubernetes to host jupyter, I'd be curious whether you have any thoughts on whether kubernetes also potentially makes it easier to manage version control for notebooks and other files created while a user works in a notebook server environment. Perhaps something like preStop hooks could be used to commit and push changes before a container shuts down.

Even allowing a user to run git commands from a notebook server terminal, with SSH keys back to the version-control system handled via kubernetes secrets/config maps, might be a start. Have you seen any implementations solving this?
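
To make the preStop idea concrete, the container lifecycle fragment on the single-user pod would look something like the dict below (plain Kubernetes pod-spec structure expressed in Python; the repo path and commit message are made up, and how best to inject this through kubespawner is exactly the open question):

# hypothetical preStop hook: commit and push before the container is stopped
lifecycle = {
    'preStop': {
        'exec': {
            'command': [
                'sh', '-c',
                'cd /home/jovyan/work && git add -A'
                ' && git commit -m "autosave on shutdown"'
                ' && git push origin HEAD'
            ]
        }
    }
}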

Thanks for your patience in reading through this!

@yuvipanda (Collaborator) commented Jan 6, 2017 via email

@Analect (Contributor, author) commented Jan 14, 2017

@yuvipanda ... just wondering if you've had any time to think about some of the items raised above. Much appreciated.

@yuvipanda (Collaborator) commented Jan 14, 2017 via email

@yuvipanda (Collaborator) commented Jan 14, 2017 via email

@Analect (Contributor, author) commented Jan 14, 2017

@yuvipanda . Thanks for your responses.

This is an awesome sketch! May I ask how you created it?

I think you're going to be disappointed when I tell you powerpoint!


Yes, I've seen a flurry of activity cleaning up the data-8 implementation, which looks great. It would be nice to get an implementation under github.com/kubernetes/charts

Ref {user}-{user-id} ... thanks for the explanation.
In my jupyterhub_config.py I have a whitelist of 3 or 4 users for testing, and I'm at the same time authenticating these users against a gitlab authenticator ... and I noticed, as I was bringing the helm chart up and down, that it was sometimes incrementing a different id against my user ... see the case for my username below, where 1, 2, 3 and 4 got appended ... so there wasn't really any consistency in terms of which PV got attached to a container. Perhaps my jupyterhub.sqlite was somehow getting corrupted for this to have happened.

[image: list of user PVs, with ids 1–4 appended to the username]

Ref. passing image to get spawned.

If dynamic it might be a little more difficult, but not impossible.

OK, based on heavy prompting from @minrk ... I was able to modify jupyterhub_config.py to include this ... which was able to pick up new 'image' payloads passed to the JupyterHub API.

from traitlets import observe
from kubespawner.spawner import KubeSpawner

class MySpawner(KubeSpawner):
    @observe('user_options')
    def _update_options(self, change):
        # user_options comes from the JSON body of the
        # POST /hub/api/users/<name>/server request
        options = change.new
        if 'image' in options:
            self.singleuser_image_spec = options['image']

c.JupyterHub.spawner_class = MySpawner

So all the other c.KubeSpawner entries required in the jupyterhub_config.py then got changed to c.MySpawner.

I then make this API call to jupyterhub ... and it appears to work. Obviously I had pushed that image to my private docker registry first.

curl -v -X POST -H "Authorization: token my-testuser-token"  \
"http://jupyterhub.myserver.com/hub/api/users/testuser/server" \
-d '{"image": "my-private-registry/my-simple-singleuser:v1.1"}'

However, it's not bullet-proof. For instance, with larger images (2GB+), I noticed kubernetes is sometimes slow to pull the image, and so you end up in the situation shown in the table below, where the spawn eventually aborts, which isn't ideal. I found that deleting the pod and retrying the above seemed to resolve it. Maybe there's a better approach of pulling these images down to kubernetes ahead of time ... or maybe there's better performance if the images are pushed to a Google registry (on the assumption one is using their kubernetes implementation, of course).

NAME                 READY     STATUS              RESTARTS   AGE
jupyter-testuser-4   0/1       ContainerCreating   0          6m
jupyter-testuser-4   0/1       ImagePullBackOff    0          8m
jupyter-testuser-4   0/1       ErrImagePull        0          12m

Obviously once the image is pulled to the kubernetes cluster, then spawning from the hub is a matter of seconds.
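
If the abort is the hub's spawn timeout firing before the pull completes (my assumption about the failure mode), raising the spawner timeouts in jupyterhub_config.py may help:

# give slow image pulls more time before the hub gives up on the spawn
c.MySpawner.start_timeout = 60 * 10   # seconds to wait for the pod to start
c.MySpawner.http_timeout = 120        # seconds to wait for the server to respond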

Ref multi-servers per user ... yes, I've been keeping an eye on this and this.

Ref. version-control ... I'm using a self-hosted gitlab rather than github. They have a similar user-token concept, so maybe, as you said, passing that as a 'secret' or 'config map' variable per user might work.

Given that I'm experimenting with spawning into 'lab' environments rather than the classic notebook 'tree', I've been looking for ways to pass a template ... a bit like the notebooks.azure.com implementation below does (although they are still working against the classic notebook).

[image: notebooks.azure.com template/library example]

It seems doing the same for jupyterlab is a bit more involved (see this issue), requiring a plugin on the jupyterlab end, but it appears some of the required tooling is in place with jupyterhub-labextension. I'm not sure it's ready for use yet though.

If it were, then maybe one could provide a rudimentary way of pushing/pulling to a repo by exposing, in my case, the gitlab API via some buttons on that template. I'd be interested in whether you think that's viable.

Anyway, thanks for the dialogue on these matters.

@consideRatio (Member) commented

@Analect I love how you thoroughly documented your thoughts in this issue! ❤️

I'm closing it now as it is stale and doesn't seem to have a specific action point related to it.
