Some high-level questions around usage of kubespawner #18
\o/ Thank you for your well thought out questions! I want to acknowledge I've seen them, but am travelling at present - will respond in bits and pieces!
@yuvipanda ... just wondering if you've had any time to think about some of the items raised above. Much appreciated.
Yes! I have drafted a response! Will hopefully complete it in a few hours. Thanks for your patience!
On Thu, Jan 5, 2017 at 4:26 PM, Analect wrote:

> @yuvipanda
> Thanks for all your work on kubespawner. I've started experimenting with running jupyterhub on kubernetes, largely thanks to this spawner, but I wanted to get some guidance around my use-cases / workflow from someone a bit more seasoned in this technology. I'm structuring these as a series of high-level questions, where your input would be much appreciated. For ease of explanation, I may refer to the rough sketch below.
> [image: image]
> <https://cloud.githubusercontent.com/assets/4063815/21677128/9bc79b9c-d330-11e6-85a5-f8602b0bbff1.png>
This is an awesome sketch! May I ask how you created it?
> *My efforts so far, for context:*
> I was working through the data-8/jupyterhub-k8s <https://github.com/data-8/jupyterhub-k8s> implementation, which I think bases itself off your work, since its structure in chart form (for helm) is the easiest to work with, compared to some of the other implementations I've found out there.
> I modified that set-up slightly to handle gitlab authentication (rather than google), which worked OK, but I wasn't able to get the spawning of their large user image (>5GB), based on this Dockerfile <https://github.com/data-8/jupyterhub-k8s/blob/master/user/Dockerfile> and their hub image <https://github.com/data-8/jupyterhub-k8s/blob/master/hub/Dockerfile>, to work. It was constantly stuck in a Waiting: ContainerCreating state and would then try to re-spawn itself. I haven't figured out what the problem is, but there appears to be plenty of space on the cluster. I'm using v1.5.1 of kubernetes on GCE.
> Anyway, I ended up getting things working using instead the hub image (Dockerfile below), a variation of the data-8 one, in conjunction with your yuvipanda/simple-singleuser:v1 <https://github.com/yuvipanda/jupyterhub-simplest-k8s/blob/master/singleuser/Dockerfile> user image.
>
> FROM jupyterhub/jupyterhub-onbuild:0.7.1
> # Install kubespawner and its dependencies
> RUN /opt/conda/bin/pip install \
>     oauthenticator==0.5.* \
>     git+https://github.com/derrickmar/kubespawner \
>     git+https://github.com/yuvipanda/jupyterhub-nginx-chp.git
> ADD jupyterhub_config.py /srv/jupyterhub_config.py
> ADD userlist /srv/userlist
> WORKDIR /srv/jupyterhub
> EXPOSE 8081
> CMD jupyterhub --config /srv/jupyterhub_config.py --no-ssl
>
> This was able to spawn new user persistent volumes, bind them to PVCs and obviously spawn user jupyter notebook servers, which could be stopped/started and re-use the same PV. My initial tests as to whether new files/notebooks were getting persisted on the PV were failing, since I wasn't saving them under /home, which is where the binding to the volume <https://github.com/data-8/jupyterhub-k8s/blob/master/hub/jupyterhub_config.py#L33-L47> is happening.
Awesome! In the last week or so, I've spent a lot of time generalizing the helm configuration, and it should be more widely usable (with support for multiple authenticators) soon. We're deploying it for UC Berkeley's class starting Monday, so I'll have more time to actually write documentation after that. I intend to get it included in github.com/kubernetes/charts eventually, to make it an officially supported way of installing JupyterHub.
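For reference, the /home binding linked above is plain KubeSpawner volume configuration; a minimal sketch (illustrative values, not the data-8 chart's exact settings) looks like:

c.KubeSpawner.volumes = [{
    # {username} is expanded per-user by KubeSpawner
    'name': 'volume-{username}',
    'persistentVolumeClaim': {'claimName': 'claim-{username}'}
}]
c.KubeSpawner.volume_mounts = [{
    # anything written outside this mount path is lost when the pod goes away
    'name': 'volume-{username}',
    'mountPath': '/home/jovyan'
}]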
> i. *user management / userid* - After various aborted attempts to get the larger data-8 user image working, where user PVs weren't deleted, I noticed that the userid appended to the username for naming the PV incremented up, but it wasn't clear where this numbering logic was coming from, as it wasn't an env variable in any of the manifests. Is this a fail-safe of some sort?
> Currently, I'm using a whitelist userlist for users (see code from jupyterhub_config.py below), and these correspond with my users' gitlab logins that I'm authenticating against. However, it's probably not a clean solution. I see you are working on another approach on the fsgroup <13edc76> and just wanted to get a better understanding around the context of this solution?
>
> import os
>
> # Whitelist users and admins
> c.Authenticator.whitelist = whitelist = set()
> c.Authenticator.admin_users = admin = set()
> c.JupyterHub.admin_access = True
> pwd = os.path.dirname(__file__)
> with open(os.path.join(pwd, 'userlist')) as f:
>     for line in f:
>         # strip the trailing newline, otherwise blank lines are never skipped
>         if not line.strip():
>             continue
>         parts = line.split()
>         name = parts[0]
>         whitelist.add(name)
>         if len(parts) > 1 and parts[1] == 'admin':
>             admin.add(name)
There are multiple types of users / userids, which is confusing!
1. The JupyterHub user id - this is simply the id of the user's entry in the sqlite table. It's pretty useless for anything other than as a unique identifier. It is used in the pod name to make sure no two users' pods have the same name - since we 'normalize' the username to a subset of ascii, there are plenty of cases where two pods could end up with the same name if only the username were used. Hence we append the id to it. There is pretty much no other external use of the id anywhere.
2. The unix user as which the notebook process runs. This is completely separate from and unrelated to (1). It is specified in the Dockerfile (as USER) and can be overridden with `c.KubeSpawner.singleuser_uid`. This user is what's used for permission checks (writing to persistent storage, for example - this is what was causing permission errors when writing to the mounted persistent volume). fsgroup is related to this as well - it should be set to a group that this unix user is part of, so that singleuser servers can mount and write to persistent volumes properly. In Kubernetes this should ideally just be one unix user that's the same for all hub users - they're all contained in containers, so this is ok.
As for deleting PVs - if you delete a PV you lose the data in it (since dynamically provisioned PVs always have reclaimPolicy: Delete). Hence it is a manual operation that is not automated at all - you have to delete the linked PVC manually, which will delete the PV (and lose your data)
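To pin that down in config terms, a hedged sketch (the trait names are my assumption from the kubespawner source of that era, not verified against your version):

# unix uid the notebook process runs as inside the container
c.KubeSpawner.singleuser_uid = 1000
# fsGroup applied to the pod: kubernetes chowns mounted volumes to this
# supplemental gid, so the uid above can write to the PV
c.KubeSpawner.singleuser_fs_gid = 1000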
> ii. *possibility for interchangeable images* - I find the current default set-up with Jupyterhub allowing for spawning a single image very limiting. I can see from #14 that you are considering extending functionality in the kubespawner to allow for an image to be selected. @minrk was able to confirm over here <jupyterhub/jupyterhub-deploy-docker#25 (comment)> that it could be possible to pass this image selection programmatically via the jupyterhub API, although I'm not sure, as per this issue <jupyterhub/jupyterhub#891>, whether the hub API will work in a kubernetes context.
> You pointed to an implementation by Google here <https://github.com/sveesible/jupyterhub-kubernetes-spawner/blob/master/kubernetespawner/spawner.py#L174-L214>. It's not clear to me where they are deriving their list of available images. How do you think something like this should work?
> As per the sketch up top, I'm looking to handle a set-up where users have various private/shared repos (marked 1 in the sketch), from which docker images are generated and stored in a registry (2). My users (3) would then be able to spawn a compute environment for their chosen repo and have it spawned in kubernetes (4), with the possibility, from 5, to have the repo cloned (maybe leveraging gitRepo <http://kubernetes.io/docs/user-guide/volumes/#gitrepo>) and for any incremental work performed on it, while on the notebook server, persisted (6).
This can be done currently with https://jupyterhub.readthedocs.io/en/latest/spawners.html#spawner-options-form. Are you thinking of the list of images as static (i.e. specified by an administrator) or dynamic? If dynamic, it might be a little more difficult, but not impossible. I see you've already dug into this on Gitter - would love to see your solution so we can make it easier in KubeSpawner :)
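To make the static case concrete, here is a minimal sketch of an image picker via the spawner options form (the image names are placeholders; applying the selection at spawn time is sketched in a later comment):

from kubespawner import KubeSpawner

class ImageFormSpawner(KubeSpawner):
    def options_from_form(self, formdata):
        # form values arrive as lists of strings;
        # the returned dict becomes self.user_options
        return {'image': formdata['image'][0]}

c.JupyterHub.spawner_class = ImageFormSpawner

# HTML shown on the spawn page; image names are placeholders
c.ImageFormSpawner.options_form = """
<label for="image">Image</label>
<select name="image">
  <option value="jupyter/minimal-notebook">minimal</option>
  <option value="jupyter/scipy-notebook">scipy</option>
</select>
"""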
> iii. *multiple simultaneous servers per user based on different images* - As far as I understand, it's not presently possible with jupyterhub to allow a user to have multiple instances of a notebook server, each running a different image? Do the tools exist within kubernetes to potentially facilitate this? Thinking out loud, could this be facilitated by having multiple smaller persistent volumes for a user, based on the repo from which the server image is derived? Or maybe this could be achieved within a single PV, by using the subPath <http://kubernetes.io/docs/user-guide/volumes/#using-subpath> functionality?
>
> c.KubeSpawner.volumes = [
>     {
>         'name': 'volume-{username}-{repo-namespace}-{repo-name}',
>         'persistentVolumeClaim': {
>             'claimName': 'claim-{username}-{repo-namespace}-{repo-name}'
>         }
>     }
> ]
This is a little more difficult on the JupyterHub side, but active work is being done on it right now - follow jupyterhub/jupyterhub#766 for more details!
> iv. *ideas around version-control* - Given the various advantages derived from using kubernetes to host jupyter, I would be curious whether you have some thoughts around whether kubernetes also potentially makes it easier to manage version control for notebooks and other files created while a user works in a notebook server environment. Perhaps something like preStop hooks <http://kubernetes.io/docs/user-guide/container-environment/#container-hooks> could be used to commit and push changes prior to a container shutting down.
> Even facilitating a user to be able to run git commands from a notebook server terminal ... and having SSH keys back to the version-control system handled via kubernetes secrets/config maps, might be a start. Have you seen any implementations solving this?
> Thanks for your patience in reading through this!
If you are using GitHub for authentication, then we could possibly do something like generating a personal access token when the user logs in and putting it in an appropriate place in the notebook container, thus allowing users to pull / push natively. I think that's far better than wrapping git in some magic, which in my experience always ends badly. In https://github.com/yuvipanda/paws/blob/master/hub/jupyterhub_config.py#L41 I pass extra generated parameters into the single-user notebook from the hub, and we could do something similar here.
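A hedged sketch of that idea for the GitLab case (whether pre_spawn_start and spawner.environment are available depends on your JupyterHub version, and the token stash here is entirely hypothetical):

from oauthenticator.gitlab import GitLabOAuthenticator

class TokenPassingAuthenticator(GitLabOAuthenticator):
    # hypothetical: populated at login time by an authenticate() override
    user_tokens = {}

    def pre_spawn_start(self, user, spawner):
        # runs just before the user's pod is spawned
        token = self.user_tokens.get(user.name)
        if token:
            # surfaces as an env var in the single-user container,
            # where git or a credential helper can pick it up
            spawner.environment['GITLAB_TOKEN'] = token

c.JupyterHub.authenticator_class = TokenPassingAuthenticator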
Action items from here are:
1. Play with getting a GitHub personal access token into environment variables / proper locations on disk, so people can push / pull from repos
2. Expand documentation on what 'users' are and how the various kinds of 'users' are used
3. See if you need any follow-up help on the docker image selection with the options form
4. Continue making the helm config configurable enough for general use.
Feel free to ask follow-up questions here or on gitter! Looking forward to seeing what cool things you are doing!
@yuvipanda ... Thanks for your responses.
I think you're going to be disappointed when I tell you: powerpoint! Yes, I've seen a flurry of activity cleaning up the data-8 implementation, which looks great. It would be nice to get an implementation under kubernetes/charts. Ref {user}-{user-id} ... thanks for the explanation. Ref. passing an image to get spawned:
OK, based on heavy prompting from @minrk ... I was able to modify jupyterhub_config.py to include this ... which was able to pick up new 'image' payloads passed to the JupyterHub API.
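(The actual modification wasn't preserved in this thread; a hypothetical version of that config - the class name and trait spelling are assumptions, not the author's code - could look like:)

from kubespawner import KubeSpawner

class ApiImageSpawner(KubeSpawner):
    def start(self):
        # user_options is populated from the JSON body of the
        # POST /hub/api/users/:name/server request
        image = self.user_options.get('image')
        if image:
            # trait name as of early kubespawner; an assumption
            self.singleuser_image_spec = image
        return super().start()

c.JupyterHub.spawner_class = ApiImageSpawner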
I then pass the API call to jupyterhub ... and it appears to work. I have obviously pushed that image to my private docker registry first.
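(The request itself was also lost from the thread; a sketch of what such a call might look like, using the standard hub REST endpoint for starting a server - the token, hub URL, username and image name are all placeholders:)

import requests

HUB_API = 'http://my-hub.example.com/hub/api'  # placeholder
TOKEN = 'abc123'  # an admin API token; placeholder

# the JSON body becomes spawner.user_options
r = requests.post(
    f'{HUB_API}/users/analect/server',
    headers={'Authorization': f'token {TOKEN}'},
    json={'image': 'registry.example.com/analect/myrepo:latest'},
)
r.raise_for_status()  # 201 when the spawn request is accepted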
However, it's not bullet-proof. For instance, for larger images (2GB+), I noticed kubernetes is sometimes slow to pull the image ... and so you end up in the situation (see table below) where it eventually aborts ... which isn't ideal. However, I found that deleting the pod and then retrying the above seemed to resolve it. Maybe there's a better approach of pulling these images down to kubernetes ahead of time ... or maybe there's better performance if the images are pushed to a google registry (on the assumption one is using their kubernetes implementation, of course).
Obviously once the image is pulled to the kubernetes cluster, spawning from the hub is a matter of seconds.
Ref multi-servers per user ... yes, I've been keeping an eye on this and this.
Ref. version-control ... I'm using a self-hosted gitlab rather than github. They have a similar user-token concept, so maybe, as you said, passing that as a 'secret' or 'config map' variable per user might work. Given that I'm experimenting with spawning into 'lab' environments, rather than the classic notebook 'tree', I've been looking for ways to pass a template ... a bit like the notebooks.azure.com implementation below (although they are still working against the classic notebook). It seems doing the same for jupyterlab is a bit more involved (see this issue), requiring a plugin on the jupyterlab end, but it appears some of the required tooling is in place with jupyterhub-labextension. I'm not sure this is ready for usage yet, though. If it were, then maybe one could give a rudimentary way of pushing/pulling to a repo by exposing, in my case, the gitlab API via some buttons on that template. I would be interested in whether you think that viable or not.
Anyway, thanks for the dialogue on these matters.
@Analect I love how you thoroughly documented your thoughts in this issue! ❤️ I'm closing it now as it is stale and doesn't seem to have a specific action point related to it.