Added initial version of the LangChain RAG example

codingbandit · codingbandit · commit 4e6ec18405d5 · 2023-12-23T18:48:19.000-05:00
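
Note on the `format_docs` step in the notebook: because the vector store is created with `text_key = "_id"`, each retrieved document carries its `_id` in `page_content` and every other field in `metadata`. The sketch below shows how that step rebuilds the product JSON and drops the embedding before it reaches the prompt. It is a minimal illustration, not the notebook's exact cell: `StubDocument` is a hypothetical stand-in for LangChain's `Document` so the snippet runs without LangChain installed, and the sample products are invented.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class StubDocument:
    """Hypothetical stand-in for LangChain's Document (page_content + metadata)."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def format_docs(docs: List[StubDocument]) -> str:
    """Rebuild each product as a JSON string, omitting the contentVector field."""
    str_docs = []
    for doc in docs:
        # page_content holds the _id because the store was created with text_key="_id"
        doc_dict = {"_id": doc.page_content}
        doc_dict.update(doc.metadata)
        # Drop the embedding so it is not sent to the LLM in the prompt
        doc_dict.pop("contentVector", None)
        str_docs.append(json.dumps(doc_dict))
    # One JSON object per product, separated by a blank line
    return "\n\n".join(str_docs)


# Invented sample data, for illustration only
sample = [
    StubDocument("p1", {"name": "Yellow Helmet", "sku": "HL-Y", "contentVector": [0.1, 0.2]}),
    StubDocument("p2", {"name": "Yellow Jersey", "sku": "JR-Y"}),
]
formatted = format_docs(sample)
print(formatted)
```

In the notebook, the same function is piped after the retriever (`retriever | format_docs`), so its output fills the `{products}` placeholder of the prompt.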
diff --git a/Labs/lab_4_langchain_vector_search.ipynb b/Labs/lab_4_langchain_vector_search.ipynb
@@ -0,0 +1,232 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# LangChain"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "import json\n",
+    "from typing import List\n",
+    "from dotenv import load_dotenv\n",
+    "from pymongo import MongoClient\n",
+    "from langchain.chat_models import AzureChatOpenAI\n",
+    "from langchain.embeddings import AzureOpenAIEmbeddings\n",
+    "from langchain.vectorstores import AzureCosmosDBVectorSearch\n",
+    "from langchain.schema.document import Document\n",
+    "from langchain.prompts import PromptTemplate\n",
+    "from langchain.schema import StrOutputParser\n",
+    "from langchain.schema.runnable import RunnablePassthrough"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Load settings for the notebook\n",
+    "load_dotenv()\n",
+    "CONNECTION_STRING = os.environ.get(\"DB_CONNECTION_STRING\")\n",
+    "EMBEDDINGS_DEPLOYMENT_NAME = \"embeddings\"\n",
+    "COMPLETIONS_DEPLOYMENT_NAME = \"completions\"\n",
+    "AOAI_ENDPOINT = os.environ.get(\"AOAI_ENDPOINT\")\n",
+    "AOAI_KEY = os.environ.get(\"AOAI_KEY\")\n",
+    "AOAI_API_VERSION = \"2023-05-15\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Establish Azure OpenAI connectivity\n",
+    "llm = AzureChatOpenAI(\n",
+    "    temperature = 0,\n",
+    "    openai_api_version = AOAI_API_VERSION,\n",
+    "    azure_endpoint = AOAI_ENDPOINT,\n",
+    "    openai_api_key = AOAI_KEY,\n",
+    "    azure_deployment = COMPLETIONS_DEPLOYMENT_NAME\n",
+    ")\n",
+    "embedding_model = AzureOpenAIEmbeddings(\n",
+    "    openai_api_version = AOAI_API_VERSION,\n",
+    "    azure_endpoint = AOAI_ENDPOINT,\n",
+    "    openai_api_key = AOAI_KEY,\n",
+    "    azure_deployment = EMBEDDINGS_DEPLOYMENT_NAME,\n",
+    "    chunk_size=10\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Vector search with LangChain\n",
+    "\n",
+    "In the previous lab, the `pymongo` library was used to run a vector search through a database command and find the product documents most similar to the user's input. In this lab, you will use the `langchain` library to perform the same search. LangChain has a community-contributed vector store class named **AzureCosmosDBVectorSearch** that supports vector search in Azure Cosmos DB for MongoDB vCore.\n",
+    "\n",
+    "When establishing the connection to the vector store (MongoDB vCore), remember that in previous labs the products collection was populated and a `contentVector` field was added to each document to hold the embedding of the document itself. A vector index was also created on the `contentVector` field to enable vector search.\n",
+    "\n",
+    "The return value of a vector search in LangChain is a list of `Document` objects. The LangChain `Document` class has two properties: `page_content`, which holds the textual content typically used to augment the prompt, and `metadata`, which holds all other attributes of the document. In the cell below, the `_id` field is used as the `page_content`, and the remaining fields are returned as metadata.\n",
+    "\n",
+    "The next two cells initiate a connection to the vector store and perform a vector search. Notice how much more concise the code is with LangChain than in the previous lab."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Reference the existing vector store\n",
+    "vector_store = AzureCosmosDBVectorSearch.from_connection_string(\n",
+    "    connection_string = CONNECTION_STRING,\n",
+    "    namespace = \"cosmic_works.products\",\n",
+    "    embedding = embedding_model,\n",
+    "    index_name = \"VectorSearchIndex\",    \n",
+    "    embedding_key = \"contentVector\",\n",
+    "    text_key = \"_id\"\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "query = \"What yellow products are there?\"\n",
+    "vector_store.similarity_search(query, k=3)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## RAG with LangChain\n",
+    "\n",
+    "In this section, we'll implement the RAG pattern using LangChain. In LangChain, a **retriever** is used to augment the prompt with contextual data. In this case, the already established vector store will be used as the retriever. By default, the prompt is augmented with the `page_content` field of each retrieved document, which customarily contains the text that was embedded. In our case, the document itself serves as the textual content, so some pre-processing is needed to format the product list in the form our system prompt expects (a JSON string). See the **format_docs** function below for this implementation.\n",
+    "\n",
+    "We'll also define a reusable RAG [chain](https://python.langchain.com/docs/modules/chains/) to control the flow and behavior of the call into the LLM. This chain is defined using the LCEL syntax (LangChain Expression Language)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# A system prompt describes the responsibilities, instructions, and persona of the AI.\n",
+    "# Note the templated placeholders for the list of products and the incoming question.\n",
+    "system_prompt = \"\"\"\n",
+    "You are a helpful, fun and friendly sales assistant for Cosmic Works, a bicycle and bicycle accessories store. \n",
+    "Your name is Cosmo.\n",
+    "You are designed to answer questions about the products that Cosmic Works sells.\n",
+    "\n",
+    "Only answer questions related to the information provided in the list of products below that are represented\n",
+    "in JSON format.\n",
+    "\n",
+    "If you are asked a question that is not in the list, respond with \"I don't know.\"\n",
+    "\n",
+    "List of products:\n",
+    "{products}\n",
+    "\n",
+    "Question:\n",
+    "{question}\n",
+    "\"\"\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# remember that each Document contains a page_content property\n",
+    "# that is populated with the _id field of the document\n",
+    "# all other document fields are located in the metadata property\n",
+    "def format_docs(docs: List[Document]) -> str:\n",
+    "    \"\"\"\n",
+    "    Prepares the product list for the system prompt.\n",
+    "    \"\"\"\n",
+    "    str_docs = []\n",
+    "    for doc in docs:\n",
+    "        # Build the product document without the contentVector\n",
+    "        doc_dict = {\"_id\": doc.page_content}\n",
+    "        doc_dict.update(doc.metadata)\n",
+    "        if \"contentVector\" in doc_dict:\n",
+    "            del doc_dict[\"contentVector\"]\n",
+    "        str_docs.append(json.dumps(doc_dict))\n",
+    "    # Return a single string containing each product JSON representation\n",
+    "    # separated by two newlines\n",
+    "    return \"\\n\\n\".join(str_docs)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Create a retriever from the vector store\n",
+    "retriever = vector_store.as_retriever()\n",
+    "\n",
+    "# Create the prompt template from the system_prompt text\n",
+    "llm_prompt = PromptTemplate.from_template(system_prompt)\n",
+    "\n",
+    "rag_chain = (\n",
+    "    # populate the tokens/placeholders in the llm_prompt \n",
+    "    # products takes the results of the vector store and formats the documents\n",
+    "    # question is a passthrough that takes the incoming question\n",
+    "    { \"products\": retriever | format_docs, \"question\": RunnablePassthrough()}\n",
+    "    | llm_prompt\n",
+    "    # pass the populated prompt to the language model\n",
+    "    | llm\n",
+    "    # return the string output from the language model\n",
+    "    | StrOutputParser()\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "question = \"What are the names and skus of yellow products? Output the answer as a bulleted list.\"\n",
+    "response = rag_chain.invoke(question)\n",
+    "print(response)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": ".venv",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/Labs/requirements.txt b/Labs/requirements.txt
@@ -3,4 +3,6 @@ python-dotenv==1.0.0
 requests==2.31.0
 pydantic==2.5.2
 openai==1.6.0
-tenacity==8.2.3
+tenacity==8.2.3
+langchain==0.0.352
+tiktoken==0.5.2