
How to deploy the pipeline and/or serve the model

The pipeline makes use of SparseML to optimize the model, and the KServe InferenceService/ServingRuntime then runs the DeepSparse runtime with the optimized model.

Create object data store (MinIO) for the models

Create a namespace for the object store if you don't already have one:

oc new-project object-datastore

Deploy MinIO:

oc apply -f minio.yaml

Then create two buckets: one for the pipeline (e.g., named mlops) and one for the models (e.g., named models).
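The buckets can be created from the MinIO console or programmatically against its S3 API. Below is a minimal sketch using boto3; the endpoint URL and credentials are assumptions and must match the values defined in minio.yaml.

# Minimal sketch; the MinIO endpoint and credentials are assumptions and must
# match what is defined in minio.yaml.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio-service.object-datastore.svc:9000",  # assumed MinIO endpoint
    aws_access_key_id="minio",         # assumed access key
    aws_secret_access_key="minio123",  # assumed secret key
)

for bucket in ("mlops", "models"):
    s3.create_bucket(Bucket=bucket)
    print(f"created bucket: {bucket}")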

SparseML

Create a pipeline server, pointing to an S3 bucket (e.g., the mlops bucket created earlier)

Import the pipeline (sparseml_pipeline.yaml) into the pipeline server. The YAML file can be generated by running:

python openshift-ai/pipeline.py
  • NOTE: if any of the steps may take longer than one hour, you either need to change the defaults for taskRuns in OpenShift AI or add a timeout: Xh per taskRun. See sparseml_simplified_pipeline.yaml and search for timeout: 5h for an example.

Pipeline Requirements

Cluster storage named models-shared, so that a shared volume is created

A data connection named models, pointing to the S3 bucket where the resulting model will be stored

  • NOTE: the cluster storage and the data connection can have any name, as long as the same names are given later in the pipeline parameters.

Create the images needed for the pipeline

Build the container images for the sparsification and evaluation steps:

podman build -t quay.io/USER/neural-magic:sparseml -f openshift-ai/sparseml_Dockerfile .
podman build -t quay.io/USER/neural-magic:sparseml_eval -f openshift-ai/sparseml_eval_Dockerfile .
podman build -t quay.io/USER/neural-magic:nm_vllm_eval -f openshift-ai/nm_vllm_eval_Dockerfile .
podman build -t quay.io/USER/neural-magic:base_eval -f openshift-ai/base_eval_Dockerfile .

And push them to a registry

podman push quay.io/USER/neural-magic:sparseml
podman push quay.io/USER/neural-magic:sparseml_eval
podman push quay.io/USER/neural-magic:nm_vllm_eval
podman push quay.io/USER/neural-magic:base_eval

(OLD) Compile the pipeline (RHOAI < 2.9)

This is the process to create the PipelineRun YAML file from the Python script. It requires kfp_tekton version 1.5.9:

pip install kfp_tekton==1.5.9
python pipeline_simplified.py
  • NOTE: there is another option for a more complex/flexible pipeline in pipeline_nmvllm.py, but the rest of this document assumes the simplified one is used.

(NEW) Compile the pipeline for V2 (RHOAI >= 2.9)

This is the process to create the pipeline YAML files from the Python scripts. It requires kfp.kubernetes:

pip install kfp[kubernetes]
python pipeline_v2_cpu.py
python pipeline_v2_gpu.py
  • NOTE: there are two different pipelines for V2, one for GPU and one for CPU. It would be straightforward to merge them into one and add a pipeline parameter to choose between them.
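For reference, the V2 pipelines follow the standard KFP v2 pattern and use kfp.kubernetes to mount the models-shared volume into the tasks. The snippet below is only a simplified sketch, not the actual pipeline_v2_cpu.py; the component body, image, and model id are placeholders.

# Simplified sketch of a KFP v2 pipeline; not the actual pipeline_v2_cpu.py.
from kfp import compiler, dsl
from kfp import kubernetes

@dsl.component(base_image="quay.io/USER/neural-magic:sparseml")  # placeholder image
def sparsify(model_id: str, output_path: str):
    # Placeholder for the real sparsification step.
    print(f"sparsifying {model_id} into {output_path}")

@dsl.pipeline(name="sparseml-cpu")
def sparseml_pipeline(model_id: str = "<hf-model-id>"):
    task = sparsify(model_id=model_id, output_path="/models")
    # Mount the volume backing the "models-shared" cluster storage into the task.
    kubernetes.mount_pvc(task, pvc_name="models-shared", mount_path="/models")

compiler.Compiler().compile(sparseml_pipeline, package_path="sparseml_pipeline.yaml")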

Run the pipeline

Run the pipeline, selecting the model and the options (a programmatic alternative is sketched after this list):

  • Evaluate or not
  • GPU (quantized) or CPU (sparsified: quantized + pruned). Note that for GPU inferencing, pruning and quantizing together is not yet supported.
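Instead of the OpenShift AI UI, the run can also be submitted with the KFP client. This is a minimal sketch; the pipeline server route, the bearer token, and the parameter names are assumptions, so check the compiled pipeline for the real parameter names.

import kfp

# Assumed values: the route of the data science pipelines API server and an
# OpenShift bearer token (e.g., the output of `oc whoami -t`).
client = kfp.Client(
    host="https://<ds-pipeline-route>",
    existing_token="<bearer-token>",
)

# Parameter names are hypothetical placeholders.
client.create_run_from_pipeline_package(
    pipeline_file="sparseml_pipeline.yaml",
    arguments={
        "model_id": "<hf-model-id>",
        "evaluate": True,
    },
    run_name="sparseml-cpu-run",
)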

DeepSparse

Run the optimized model with DeepSparse

Create the image needed for the Inference Service

Build the container with:

podman build -t quay.io/USER/neural-magic:deepsparse -f deepsparse_Dockerfile .

And push it to a registry

podman push quay.io/USER/neural-magic:deepsparse

Option A: Deploy through ServingRuntime

Note: DeepSparse requires write access to the volume where the model is mounted, so the ServingRuntime uses a workaround that copies the model to an extra mount with ReadOnly set to False.

oc apply -f openshift-ai/serving_runtime_deepsparse.yaml

Then, from OpenShift AI, you can deploy a model using this runtime, pointing it to the models data connection.

Option B: Deploy InferenceService

Create a Secret and a ServiceAccount that point to the S3 endpoint. Modify them as needed:

oc apply -f openshift-ai/secret.yaml
oc apply -f openshift-ai/sa.yaml

Then create the InferenceService:

oc apply -f openshift-ai/inference.yaml

nm-vLLM

Run the optimized model with nm-vLLM

Create the image needed for the ServingRuntime

Build the container with:

podman build -t quay.io/USER/neural-magic:nm-vllm -f nmvllm_Dockerfile .

And push it to a registry

podman push quay.io/USER/neural-magic:nm-vllm

Deploy through ServingRuntime

Note: the runtime requires write access to the volume where the model is mounted, so the ServingRuntime uses a workaround that copies the model to an extra mount with ReadOnly set to False.

oc apply -f openshift-ai/serving_runtime_vllm.yaml
oc apply -f openshift-ai/serving_runtime_vllm_marlin.yaml

Then, from OpenShift AI, you can deploy a model using one of these runtimes, pointing it to the models data connection. Use one or the other depending on whether you are running sparsified models or quantized (Marlin) models.

Testing with Gradio

Run request.py and access the Gradio server deployed locally at 127.0.0.1:7860. Update the URL in the script with the one from the deployed runtime (the ksvc route):

python openshift-ai/request.py
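If you prefer to build your own client, the sketch below shows the general idea; the route URL, the model name, and the OpenAI-style completions request/response format are assumptions and may differ from what request.py and the deployed runtime actually use.

# Minimal Gradio client sketch; the URL, model name, and request/response
# format (OpenAI-style completions) are assumptions.
import gradio as gr
import requests

URL = "https://<ksvc-route>/v1/completions"  # replace with the deployed runtime route

def ask(prompt: str) -> str:
    resp = requests.post(
        URL,
        json={"model": "<model-name>", "prompt": prompt, "max_tokens": 128},
        verify=False,  # self-signed cluster certs; adjust as needed
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

# Gradio serves the UI locally at 127.0.0.1:7860 by default.
gr.Interface(fn=ask, inputs="text", outputs="text").launch()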
