Goal
In this example we demonstrate how to set up a Ray cluster on GKE and deploy a distributed training job to fine-tune a Stable Diffusion model, following the example from https://docs.ray.io/en/latest/train/examples/pytorch/dreambooth_finetuning.html and the artifacts in https://github.com/ray-project/ray/tree/master/doc/source/templates/05_dreambooth_finetuning
We will deploy a Jupyter pod and a Ray cluster (using the KubeRay operator). The pods mount a shared filesystem (GCS Fuse CSI in this specific example) where the model and the datasets live, so they are readily accessible to the Ray worker pods during training and inference. Ray jobs are triggered from a Jupyter notebook running in the Jupyter pod. The example showcases Ray Data API usage with GKE GCS Fuse CSI mounted volumes.
Setup Steps
- Create a GKE cluster with a GPU node pool of 4 nodes (1 GPU per GKE node; in this example we used the n1-standard-32 machine type with a T4 GPU). Ensure that Workload Identity and the GCS Fuse CSI driver are enabled for the cluster. See details here and here
$ gcloud container clusters create $CLUSTER_NAME --location us-central1-c --workload-pool $PROJECT_ID.svc.id.goog --cluster-version=1.27 --num-nodes=1 --machine-type=e2-standard-32 --addons GcsFuseCsiDriver --enable-ip-alias
$ gcloud container node-pools create gpu-pool --cluster $CLUSTER_NAME --machine-type n1-standard-32 --accelerator type=nvidia-tesla-t4,count=1 --num-nodes=4
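A quick way to confirm that the GPU nodes registered with the expected accelerator (GKE sets the cloud.google.com/gke-accelerator label on GPU nodes):
$ kubectl get nodes -L cloud.google.com/gke-accelerator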
- Ensure that the NVIDIA driver plugins are installed as expected (if not, follow the steps here; a sketch of the install command is also shown after the output below)
$ kubectl get po -n kube-system | grep nvidia
nvidia-gpu-device-plugin-medium-cos-c5j8b 1/1 Running 0 8m24s
nvidia-gpu-device-plugin-medium-cos-kpmlr 1/1 Running 0 7m54s
nvidia-gpu-device-plugin-medium-cos-q844w 1/1 Running 0 8m25s
nvidia-gpu-device-plugin-medium-cos-t4q2x 1/1 Running 0 7m17s
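If the plugin pods are not present, the NVIDIA driver installer DaemonSet can be applied manually (the manifest below is the standard COS installer from the GKE documentation):
$ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml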
- Create a namespace named example
kubectl create ns example
- Set the current context to use the example namespace
kubectl config set-context --current --namespace example
- Install the KubeRay operator and validate that the operator pod is Running in the example namespace
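If the kuberay Helm repository has not been added locally yet, add it first (repository URL from the KubeRay project):
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update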
helm install kuberay-operator kuberay/kuberay-operator --version 1.0.0-rc.0 --values <path to raytrain-examples/raytrain-with-gcsfusecsi/kuberay-operator/values.yaml>
$ kubectl get po -n example
pod/kuberay-operator-64b7b88759-5ppfw 1/1 Running 0 95m
- Deploy the kuberay terraform, which sets up the kuberay operator and the Ray cluster custom resources and spins up a Ray head and 3 worker pods. Replace the project_id in kuberaytf/variables.tf with your own project. Key things to note for this terraform:
  - The template expects a pre-created bucket named test-gcsfuse-1. If you plan to change it, change the gcs_bucket variable in raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/variables.tf. Also change the bucket name under csi.volumeAttributes.bucketName in the head and worker specs in raytrain-examples/raytrain-with-gcsfusecsi/kuberaytf/user/modules/kuberay/kuberay-values.yaml.
  - The Ray worker and head pods and the Jupyter pod mount the bucket with uid=1000, gid=100, so that the necessary directories and artifacts can be downloaded to the shared directory.
  - The service account bindings for the bucket, and the Workload Identity bindings between the GCP SA and the k8s SA, are done automatically by the service_accounts module (a rough equivalent is sketched after the terraform commands below).
cd kuberaytf/user/
terraform init
terraform apply
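For reference, the bindings created by the service_accounts module are roughly equivalent to the following sketch (the GCP and k8s service account names below are placeholders, not necessarily the names the module uses):
gcloud storage buckets add-iam-policy-binding gs://test-gcsfuse-1 \
  --member "serviceAccount:<GCP_SA>@$PROJECT_ID.iam.gserviceaccount.com" \
  --role roles/storage.objectAdmin
gcloud iam service-accounts add-iam-policy-binding <GCP_SA>@$PROJECT_ID.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:$PROJECT_ID.svc.id.goog[example/<K8S_SA>]"
kubectl annotate serviceaccount <K8S_SA> -n example \
  iam.gke.io/gcp-service-account=<GCP_SA>@$PROJECT_ID.iam.gserviceaccount.com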
- Deploy the Jupyter Pod and PVC spec (this step expects that the service account bindings for the GCS bucket have already been set up as part of the terraform apply step above)
kubectl apply -f jupyter-spec.yaml
- When all the pods and services are ready, this is how it looks for the Jupyter and Ray pods
$ kubectl get all -n example
NAME READY STATUS RESTARTS AGE
pod/ray-cluster-kuberay-head-9x2q6 2/2 Running 0 3m12s
pod/ray-cluster-kuberay-worker-workergroup-95nm2 2/2 Running 0 3m12s
pod/ray-cluster-kuberay-worker-workergroup-tfg9n 2/2 Running 0 3m12s
pod/kuberay-operator-64b7b88759-5ppfw 1/1 Running 0 4m4s
pod/tensorflow-0 2/2 Running 0 16s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/ray-cluster-kuberay-head-svc ClusterIP 10.8.10.33 <none> 10001/TCP,8265/TCP,8080/TCP,6379/TCP,8000/TCP 3m12s
service/kuberay-operator ClusterIP 10.8.14.245 <none> 8080/TCP 4m4s
service/tensorflow ClusterIP None <none> 8888/TCP 16s
service/tensorflow-jupyter LoadBalancer 10.8.3.9 <pending> 80:31891/TCP 16s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/kuberay-operator 1/1 1 1 4m4s
NAME DESIRED CURRENT READY AGE
replicaset.apps/kuberay-operator-64b7b88759 1 1 1 4m4s
NAME READY AGE
statefulset.apps/tensorflow 1/1 16s
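With the Jupyter pod Running, the bucket mount can be sanity-checked from inside the container (the mount path below depends on what jupyter-spec.yaml configures; /data is only an assumed example):
kubectl exec -it tensorflow-0 -c tensorflow-container -n example -- ls /data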
- Locate the external IP of the tensorflow-jupyter service
$ kubectl get service tensorflow-jupyter
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
tensorflow-jupyter LoadBalancer 10.8.14.182 35.188.214.7 80:31524/TCP 5m33s
- Fetch the token for the Jupyter login
$ kubectl exec --tty -i tensorflow-0 -c tensorflow-container -n example -- jupyter server list
Currently running servers:
http://tensorflow-0:8888/?token=<TOKEN> :: /home/jovyan
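The notebook UI should then be reachable at http://<EXTERNAL-IP>/?token=<TOKEN>, combining the external IP of the tensorflow-jupyter service from the previous step with the token above.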
- Open a new notebook and import the notebook from the URL https://raw.githubusercontent.com/GoogleCloudPlatform/ai-on-gke/main/ray-on-gke/example_notebooks/raytrain-stablediffusion.ipynb (notebook)
- Follow the comments and execute the cells in the notebook to run a distributed training job and then run inference on the tuned model
- Port-forward the Ray service port to examine the Ray dashboard for job progress details. The dashboard is reachable at localhost:8265 in the local browser
kubectl port-forward -n example service/ray-cluster-kuberay-head-svc 8265:8265
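With the port-forward active, the cluster can also be queried from a local terminal using the Ray Jobs CLI (this assumes a local ray installation compatible with the cluster version):
ray job list --address http://localhost:8265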