Creating notebook to ingest CloudSQL database using kubernetes docs #751

Open · wants to merge 36 commits into `main` from `add/example_notebooks/kubernetes_docs`
Changes from 29 of 36 commits.

Commits:
9b05372
Creating notebook to ingest CloudSQL database using kubernetes
german-grandas Jul 26, 2024
015d3ff
Running rag e2e test with kubernetes docs.
german-grandas Sep 18, 2024
8f48578
Updating branch with main
german-grandas Sep 18, 2024
d05e198
Reverting change
german-grandas Sep 18, 2024
67fa74b
Fixing issue converting notebook to script.
german-grandas Sep 18, 2024
8197739
Update cloudbuild.yaml to fix generation of script.
german-grandas Sep 18, 2024
f54e7ed
updating notebook variables
german-grandas Sep 18, 2024
34e1d5d
Merge branch 'add/example_notebooks/kubernetes_docs' of https://githu…
german-grandas Sep 18, 2024
37b5024
Adding iptype configuration to db engine
german-grandas Sep 19, 2024
b046e93
Updating Rag application README
german-grandas Sep 19, 2024
9852c43
Updating branch with main
german-grandas Sep 23, 2024
6575552
updating cloudbuild to run tests on kubernetes docs instead of the ne…
german-grandas Sep 23, 2024
f0482f6
Fixing issue with cloudbuild file
german-grandas Sep 23, 2024
4da233c
adding missing import
german-grandas Sep 23, 2024
c44b28c
adding missing os package
german-grandas Sep 23, 2024
8063d7c
Updating notebook to ingest database using kubernetes
german-grandas Sep 24, 2024
a109c8c
Fixing issue on notebook with document processing
german-grandas Sep 24, 2024
d52a850
Updating dlp templates on test_rag.py
german-grandas Sep 24, 2024
e15a388
Reviewing suggested comments
german-grandas Oct 4, 2024
2a21fb5
Merge branch 'main' of https://github.com/GoogleCloudPlatform/ai-on-g…
german-grandas Oct 4, 2024
bf13358
updating dataset_embeddings_table_name
german-grandas Oct 7, 2024
15c5dc8
Merge branch 'main' of https://github.com/GoogleCloudPlatform/ai-on-g…
german-grandas Oct 7, 2024
39837b5
updating notebook, deleting metadata cache
german-grandas Oct 7, 2024
41f4a75
updating notebook formatting
german-grandas Oct 8, 2024
d7f9165
removing conflicting package, updating embeddings importing
german-grandas Oct 8, 2024
41b7bea
fixing import on notebook
german-grandas Oct 8, 2024
edd5f8f
Reverting change on notebook
german-grandas Oct 8, 2024
df11fa3
Refactoring notebook so ray can be used, updating cloudbuild.yml file.
german-grandas Oct 15, 2024
d92a9d9
Merge branch 'main' of https://github.com/GoogleCloudPlatform/ai-on-g…
german-grandas Oct 15, 2024
c7bfb03
Updating based on comments
german-grandas Oct 22, 2024
47d9c34
Merge branch 'main' of https://github.com/GoogleCloudPlatform/ai-on-g…
german-grandas Oct 22, 2024
bba5207
Working on comments
german-grandas Oct 25, 2024
aea0f7d
Merge branch 'main' of https://github.com/GoogleCloudPlatform/ai-on-g…
german-grandas Oct 25, 2024
610658e
Reverting change adding new variable to kuberay
german-grandas Oct 25, 2024
a91fe1d
Fixing issue with ray working_dir
german-grandas Oct 28, 2024
57db05c
Merge branch 'main' of https://github.com/GoogleCloudPlatform/ai-on-g…
german-grandas Oct 28, 2024
32 changes: 6 additions & 26 deletions applications/rag/README.md
@@ -17,7 +17,7 @@ RAG uses a semantically searchable knowledge base (like vector search) to retrie
5. A [Jupyter](https://docs.jupyter.org/en/latest/) notebook running on GKE that reads the dataset using GCS fuse driver integrations and runs a Ray job to populate the vector DB.
3. A front end chat interface running on GKE that prompts the inference server with context from the vector DB.

This tutorial walks you through installing the RAG infrastructure in a GCP project, generating vector embeddings for a sample [Kaggle Netflix shows](https://www.kaggle.com/datasets/shivamb/netflix-shows) dataset and prompting the LLM with context.
This tutorial walks you through installing the RAG infrastructure in a GCP project, generating vector embeddings for a sample [Kubernetes Docs](https://github.com/dohsimpson/kubernetes-doc-pdf) dataset and prompting the LLM with context.

# Prerequisites

@@ -74,7 +74,7 @@ This section sets up the RAG infrastructure in your GCP project using Terraform.

# Generate vector embeddings for the dataset

This section generates the vector embeddings for your input dataset. Currently, the default dataset is [Netflix shows](https://www.kaggle.com/datasets/shivamb/netflix-shows). We will use a Jupyter notebook to run a Ray job that generates the embeddings & populates them into the `pgvector` instance created above.
This section generates the vector embeddings for your input dataset. Currently, the default dataset is [Kubernetes docs](https://github.com/dohsimpson/kubernetes-doc-pdf). We will use a Jupyter notebook to generate the embeddings and populate them into the `pgvector` instance created above.

Set the namespace, cluster name, and location from `workloads.tfvars`:

@@ -108,30 +108,10 @@ gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${CLUSTER_L

2. Load the notebook:
- Once logged in to JupyterHub, choose the `CPU` preset with `Default` storage.
- Click [File] -> [Open From URL] and paste: `https://raw.githubusercontent.com/GoogleCloudPlatform/ai-on-gke/main/applications/rag/example_notebooks/rag-kaggle-ray-sql-interactive.ipynb`

3. Configure Kaggle:
- Create a [Kaggle account](https://www.kaggle.com/account/login?phase=startRegisterTab&returnUrl=%2F).
- [Generate an API token](https://www.kaggle.com/settings/account). See [further instructions](https://www.kaggle.com/docs/api#authentication). This token is used in the notebook to access the [Kaggle Netflix shows](https://www.kaggle.com/datasets/shivamb/netflix-shows) dataset.
- Replace the variables in the 1st cell of the notebook with your Kaggle credentials (can be found in the `kaggle.json` file created while generating the API token):
* `KAGGLE_USERNAME`
* `KAGGLE_KEY`

4. Generate vector embeddings: Run all the cells in the notebook to generate vector embeddings for the Netflix shows dataset (https://www.kaggle.com/datasets/shivamb/netflix-shows) and store them in the `pgvector` CloudSQL instance via a Ray job.
* When the last cell says the job has succeeded (eg: `Job 'raysubmit_APungAw6TyB55qxk' succeeded`), the vector embeddings have been generated and we can launch the frontend chat interface. Note that running the job can take up to 10 minutes.
* Ray may take several minutes to create the runtime environment. During this time, the job will appear to be missing (e.g. `Status message: PENDING`).
* Connect to the Ray dashboard to check the job status or logs:
- If IAP is disabled (`ray_dashboard_add_auth = false`):
- `kubectl port-forward -n ${NAMESPACE} service/ray-cluster-kuberay-head-svc 8265:8265`
- Go to `localhost:8265` in a browser
- If IAP is enabled (`ray_dashboard_add_auth = true`):
- Fetch the domain: `terraform output ray-dashboard-managed-cert`
- If you used a custom domain, ensure you configured your DNS as described above.
- Verify the domain status is `Active`:
- `kubectl get managedcertificates ray-dashboard-managed-cert -n ${NAMESPACE} --output jsonpath='{.status.domainStatus[0].status}'`
- Note: This can take up to 20 minutes to propagate.
- Once the domain status is Active, go to the domain in a browser and login with your Google credentials.
- To add additional users to your frontend application, go to [Google Cloud Platform IAP](https://console.cloud.google.com/security/iap), select the `rag/ray-cluster-kuberay-head-svc` service and add principals with the role `IAP-secured Web App User`.
- Click [File] -> [Open From URL] and paste: `https://raw.githubusercontent.com/GoogleCloudPlatform/ai-on-gke/main/applications/rag/example_notebooks/rag-data-ingest-with-kubernetes-docs.ipynb`


3. Generate vector embeddings: Run all the cells in the notebook to generate vector embeddings for the [Kubernetes documentation](https://github.com/dohsimpson/kubernetes-doc-pdf) and store them in the `pgvector` CloudSQL instance.
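
After the job finishes, you can sanity-check that the embeddings actually landed in CloudSQL. The snippet below is an illustrative sketch, not part of this PR: it reuses the same Cloud SQL Python Connector setup as the notebook's `job.py`, and assumes the default table name `rag_vector_embeddings` plus the mounted database secret.

```python
# Hypothetical spot-check mirroring the connection code in job.py.
# Assumptions: CLOUDSQL_INSTANCE_CONNECTION_NAME is set and the database
# secret is mounted at /etc/secret-volume, as in the notebook's Ray job.
import os
import sqlalchemy
from google.cloud.sql.connector import Connector, IPTypes

connector = Connector()

def getconn():
    return connector.connect(
        os.environ["CLOUDSQL_INSTANCE_CONNECTION_NAME"],
        "pg8000",
        user=open("/etc/secret-volume/username").read(),
        password=open("/etc/secret-volume/password").read(),
        db="pgvector-database",
        ip_type=IPTypes.PRIVATE,
    )

pool = sqlalchemy.create_engine("postgresql+pg8000://", creator=getconn)
with pool.connect() as conn:
    count = conn.execute(
        sqlalchemy.text("SELECT COUNT(*) FROM rag_vector_embeddings")
    ).scalar()
print(f"{count} embedded chunks stored")  # a non-zero count means ingestion worked
```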

# Launch the frontend chat interface

285 changes: 285 additions & 0 deletions applications/rag/example_notebooks/rag-data-ingest-with-kubernetes-docs.ipynb
@@ -0,0 +1,285 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "7e14d0f0-2573-4fe4-ba87-7a447f2f511c",
"metadata": {},
"source": [
"# RAG-on-GKE Application\n",
"\n",
"This is a Python notebook for generating the vector embeddings based on [Kubernetes docs](https://github.com/dohsimpson/kubernetes-doc-pdf/) used by the RAG on GKE application. \n",
"For full information, please checkout the GitHub documentation [here](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/applications/rag/README.md).\n",
"\n",
"\n",
"\n",
"## Clone the kubernetes docs repo"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5f9b1fad-537e-425f-a5fc-587a408b1fab",
"metadata": {},
"outputs": [],
"source": [
"!mkdir /data/kubernetes-docs -p\n",
"!git clone https://github.com/dohsimpson/kubernetes-doc-pdf /data/kubernetes-docs\n"
]
},
{
"cell_type": "markdown",
"id": "b984429c-b65a-47b7-9723-ee3ad81d61db",
"metadata": {},
"source": [
"## Install the required packages"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "40e4d29d-79c6-4233-a8ed-0f8a42576656",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"!pip install langchain langchain-community sentence_transformers pypdf"
]
},
{
"cell_type": "markdown",
"id": "f80cc5af-a1fa-456d-a4ed-fa2ffa3b87a0",
"metadata": {},
"source": [
"## Writting job to be used on the Ray Cluster"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "36523f3f-0c93-41da-abb9-c113bb456bc1",
"metadata": {},
"outputs": [],
"source": [
"# Create a directory to package the contents that need to be downloaded in ray worker\n",
"! mkdir -p rag-app"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "69d912e5-2225-4b44-80cd-651f7cc71a40",
"metadata": {},
"outputs": [],
"source": [
"%%writefile rag-app/job.py\n",
"\n",
"import os\n",
"import uuid\n",
"import glob\n",
"\n",
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"from langchain.embeddings import HuggingFaceEmbeddings\n",
"from langchain_community.document_loaders import PyPDFLoader\n",
"\n",
"from google.cloud.sql.connector import Connector, IPTypes\n",
"import sqlalchemy\n",
"\n",
"from sqlalchemy.ext.declarative import declarative_base\n",
"from sqlalchemy import Column, String, Text, text, JSON\n",
"from sqlalchemy.orm import scoped_session, sessionmaker, mapped_column\n",
"from pgvector.sqlalchemy import Vector\n",
"\n",
"# initialize parameters\n",
"\n",
"INSTANCE_CONNECTION_NAME = os.environ[\"CLOUDSQL_INSTANCE_CONNECTION_NAME\"]\n",
"print(f\"Your instance connection name is: {INSTANCE_CONNECTION_NAME}\")\n",
"VECTOR_EMBEDDINGS_TABLE_NAME = os.environ.get(\"EMBEDDINGS_TABLE_NAME\", \"rag_vector_embeddings\")\n",
"DB_NAME = \"pgvector-database\"\n",
"\n",
"db_username_file = open(\"/etc/secret-volume/username\", \"r\")\n",
"DB_USER = db_username_file.read()\n",
"db_username_file.close()\n",
"\n",
"db_password_file = open(\"/etc/secret-volume/password\", \"r\")\n",
"DB_PASS = db_password_file.read()\n",
"db_password_file.close()\n",
"\n",
"# initialize Connector object\n",
"connector = Connector()\n",
"\n",
"# function to return the database connection object\n",
"def getconn():\n",
" conn = connector.connect(\n",
" INSTANCE_CONNECTION_NAME,\n",
" \"pg8000\",\n",
" user=DB_USER,\n",
" password=DB_PASS,\n",
" db=DB_NAME,\n",
" ip_type=IPTypes.PRIVATE\n",
" )\n",
" return conn\n",
"\n",
"# create connection pool with 'creator' argument to our connection object function\n",
"pool = sqlalchemy.create_engine(\n",
" \"postgresql+pg8000://\",\n",
" creator=getconn,\n",
")\n",
"\n",
"Base = declarative_base()\n",
"DBSession = scoped_session(sessionmaker())\n",
"\n",
"class TextEmbedding(Base):\n",
" __tablename__ = VECTOR_EMBEDDINGS_TABLE_NAME\n",
" langchain_id = Column(String(255), primary_key=True)\n",
" content = Column(Text)\n",
" embedding = mapped_column(Vector(384))\n",
" langchain_metadata = Column(JSON) \n",
"\n",
"with pool.connect() as conn:\n",
" conn.execute(text(\"CREATE EXTENSION IF NOT EXISTS vector\"))\n",
" conn.commit() \n",
" \n",
"DBSession.configure(bind=pool, autoflush=False, expire_on_commit=False)\n",
"Base.metadata.drop_all(pool)\n",
"Base.metadata.create_all(pool)\n",
"\n",
"SENTENCE_TRANSFORMER_MODEL = \"intfloat/multilingual-e5-small\" # Transformer to use for converting text chunks to vector embeddings\n",
"\n",
"# the dataset has been pre-dowloaded to the GCS bucket as part of the notebook in the cell above. Ray workers will find the dataset readily mounted.\n",
"SHARED_DATASET_BASE_PATH = \"/data/kubernetes-docs/\"\n",
"\n",
"BATCH_SIZE = 100\n",
"CHUNK_SIZE = 1000 # text chunk sizes which will be converted to vector embeddings\n",
"CHUNK_OVERLAP = 10\n",
"VECTOR_DIMENSION = 384 # Embeddings size\n",
"\n",
"splitter = RecursiveCharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP, length_function=len)\n",
"embeddings_service = HuggingFaceEmbeddings(model_name=SENTENCE_TRANSFORMER_MODEL)\n",
"\n",
"def process_pdf(file_path):\n",
" \"\"\"Loads, splits and embed a single PDF file.\"\"\"\n",
" loader = PyPDFLoader(file_path)\n",
" print(f\"Loading {file_path}\")\n",
" pages = loader.load_and_split()\n",
" \n",
" splits = splitter.split_documents(pages)\n",
"\n",
" chunks = []\n",
" for split in splits:\n",
" id = uuid.uuid4()\n",
" page_content = split.page_content\n",
" file_metadata = split.metadata\n",
" embedded_document = embeddings_service.embed_query(page_content)\n",
" split_data = {\n",
" \"langchain_id\" : id,\n",
" \"content\" : page_content,\n",
" \"embedding\" : embedded_document,\n",
" \"langchain_metadata\" : file_metadata\n",
" }\n",
" chunks.append(split_data)\n",
" return chunks\n",
"\n",
"documents_file_paths = glob.glob(f\"{SHARED_DATASET_BASE_PATH}/PDFs/*.pdf\")\n",
"for file_path in documents_file_paths:\n",
" processed_result = process_pdf(file_path)\n",
" DBSession.bulk_insert_mappings(TextEmbedding, processed_result)\n",
" \n",
"DBSession.commit()\n",
"print (\"end job\")"
]
},
{
"cell_type": "markdown",
"id": "6b9bc582-50cd-4d7c-b5c4-549626fd2349",
"metadata": {},
"source": [
"## Summiting the job into Ray Cluster:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d5b6acbe-5a14-4bc8-a4ca-58a6b3dd5391",
"metadata": {},
"outputs": [],
"source": [
"import ray, time\n",
"from ray.job_submission import JobSubmissionClient\n",
"client = JobSubmissionClient(\"ray://ray-cluster-kuberay-head-svc:10001\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4eb8eae9-2a20-4c02-ac79-196942ae2783",
"metadata": {},
"outputs": [],
"source": [
"# Port forward to the Ray dashboard and go to `localhost:8265` in a browser to see job status: kubectl port-forward -n <namespace> service/ray-cluster-kuberay-head-svc 8265:8265\n",
"import time\n",
"\n",
"start_time = time.time()\n",
"job_id = client.submit_job(\n",
" entrypoint=\"python job.py\",\n",
" # Path to the local directory that contains the entrypoint file.\n",
" runtime_env={\n",
" \"working_dir\": \"/home/jovyan/rag-app\", # upload the local working directory to ray workers\n",
" \"pip\": [ \n",
" \"langchain\",\n",
" \"langchain-community\",\n",
" \"sentence-transformers\",\n",
" \"pypdf\",\n",
" \"pgvector\"\n",
" ]\n",
" }\n",
")\n",
"\n",
"# The Ray job typically takes 5m-10m to complete.\n",
"print(\"Job submitted with ID:\", job_id)\n",
"while True:\n",
" status = client.get_job_status(job_id)\n",
" print(\"Job status:\", status)\n",
" print(\"Job info:\", client.get_job_info(job_id).message)\n",
" if status.is_terminal():\n",
" break\n",
" time.sleep(30)\n",
"\n",
"end_time = time.time()\n",
"job_duration = end_time - start_time\n",
"print(f\"Job completed in {job_duration} seconds.\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "50882494-6fe9-47c7-a6ed-0726d4abddc3",
"metadata": {},
"outputs": [],
"source": [
"ray.shutdown()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
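
For context on how these embeddings are consumed downstream, here is a hedged sketch of a retrieval query against the table the job populates. It is not part of this PR; it assumes the `pool` engine and `embeddings_service` objects defined in `job.py` above, and uses pgvector's `<->` distance operator.

```python
# Illustrative retrieval step (assumption: executed in the same environment as
# job.py, with `pool` and `embeddings_service` constructed as shown above).
from sqlalchemy import text

question = "How do I create a Deployment?"
# Embed the question with the same model used at ingestion time,
# so the 384-dim query vector matches the Vector(384) column.
query_vector = embeddings_service.embed_query(question)

with pool.connect() as conn:
    rows = conn.execute(
        text(
            "SELECT content FROM rag_vector_embeddings "
            "ORDER BY embedding <-> CAST(:q AS vector) LIMIT 4"
        ),
        {"q": str(query_vector)},  # pgvector parses the '[x, y, ...]' literal
    ).fetchall()

context = "\n\n".join(row[0] for row in rows)  # prompt context for the LLM
```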
2 changes: 1 addition & 1 deletion applications/rag/metadata.yaml
@@ -70,7 +70,7 @@ spec:
- name: dataset_embeddings_table_name
description: Name of the table that stores vector embeddings for input dataset
varType: string
defaultValue: netflix_reviews_db
defaultValue: rag_embeddings_db
- name: disable_ray_cluster_network_policy
description: Disables Kubernetes Network Policy for Ray Clusters for this demo. Defaulting to 'true' aka disabled pending fixes to the kuberay-monitoring module. This should be defaulted to false.
varType: bool
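
For reference, the notebook's ingest job does not read this Terraform variable directly. Below is a minimal sketch of the lookup it performs, under the assumption that the deployment surfaces `dataset_embeddings_table_name` to the pod as the `EMBEDDINGS_TABLE_NAME` environment variable; note that `job.py` falls back to `rag_vector_embeddings` when the variable is unset.

```python
# Minimal sketch of the table-name lookup performed in job.py. The wiring from
# the Terraform variable to this environment variable is an assumption and is
# not shown in this diff.
import os

table_name = os.environ.get("EMBEDDINGS_TABLE_NAME", "rag_vector_embeddings")
print(f"Writing embeddings to table: {table_name}")
```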