diff --git a/nemo/NeMo-Safe-Synthesizer/README.md b/nemo/NeMo-Safe-Synthesizer/README.md index 805d45b2..dbc830be 100644 --- a/nemo/NeMo-Safe-Synthesizer/README.md +++ b/nemo/NeMo-Safe-Synthesizer/README.md @@ -1,5 +1,4 @@ -# NeMo Safe Synthesizer Example Notebooks - +# NeMo Safe Synthesizer Example Notebooks This directory contains the tutorial notebooks for getting started with NeMo Safe Synthesizer. @@ -12,17 +11,15 @@ Install the sdk as follows: ```bash uv venv source .venv/bin/activate -uv pip install nemo-microservices[safe-synthesizer] +uv pip install nemo-microservices[safe-synthesizer] rich ``` - Be sure to select this virtual environment as your kernel when running the notebooks. ## 🚀 Deploying the NeMo Safe Synthesizer Microservice To run these notebooks, you'll need access to a deployment of the NeMo Safe Synthesizer microservice. You have two deployment options: - ### 🐳 Deploy the NeMo Safe Synthesizer Microservice Locally Follow our quickstart guide to deploy the NeMo safe synthesizer microservice locally via Docker Compose. diff --git a/nemo/NeMo-Safe-Synthesizer/advanced/advanced_privacy.ipynb b/nemo/NeMo-Safe-Synthesizer/advanced/advanced_privacy.ipynb index 1f4d824e..4ddf8431 100644 --- a/nemo/NeMo-Safe-Synthesizer/advanced/advanced_privacy.ipynb +++ b/nemo/NeMo-Safe-Synthesizer/advanced/advanced_privacy.ipynb @@ -19,14 +19,6 @@ "- **Access an evaluation report** on the quality and privacy of the synthetic data" ] }, - { - "cell_type": "code", - "execution_count": null, - "id": "a538526a", - "metadata": {}, - "outputs": [], - "source": [] - }, { "cell_type": "markdown", "id": "8be84f5d", @@ -37,6 +29,22 @@ "Ensure you have a NeMo Microservices Platform deployment available. If you're using a managed or remote deployment, have the correct base URLs and tokens ready." 
] }, + { + "cell_type": "code", + "execution_count": null, + "id": "a538526a", + "metadata": { + "tags": [ + "parameters" + ] + }, + "outputs": [], + "source": [ + "# Update as appropriate to your installation URLs\n", + "base_url = \"http://localhost:8080\"\n", + "datastore_endpoint = \"http://localhost:3000/v1/hf\"" + ] + }, { "cell_type": "code", "execution_count": null, @@ -46,7 +54,7 @@ "source": [ "import pandas as pd\n", "from nemo_microservices import NeMoMicroservices\n", - "from nemo_microservices.beta.safe_synthesizer.builder import SafeSynthesizerBuilder\n", + "from nemo_microservices.beta.safe_synthesizer.sdk.job_builder import SafeSynthesizerJobBuilder\n", "\n", "import logging\n", "\n", @@ -74,7 +82,7 @@ "outputs": [], "source": [ "client = NeMoMicroservices(\n", - " base_url=\"http://localhost:8080\",\n", + " base_url=base_url\n", ")" ] }, @@ -94,7 +102,7 @@ "outputs": [], "source": [ "datastore_config = {\n", - " \"endpoint\": \"http://localhost:3000/v1/hf\",\n", + " \"endpoint\": datastore_endpoint\n", "}" ] }, @@ -155,7 +163,7 @@ "source": [ "## 🏗️ Create a Safe Synthesizer job\n", "\n", - "The `SafeSynthesizerBuilder` provides a fluent interface to configure and submit jobs.\n", + "The `SafeSynthesizerJobBuilder` provides a fluent interface to configure and submit jobs.\n", "\n", "This job will:\n", "- Initialize the builder with the NeMo Microservices client.\n", @@ -175,8 +183,8 @@ "outputs": [], "source": [ "job = (\n", - " SafeSynthesizerBuilder(client)\n", - " .from_data_source(df)\n", + " SafeSynthesizerJobBuilder(client)\n", + " .with_data_source(df)\n", " .with_datastore(datastore_config)\n", " .with_replace_pii()\n", " .with_differential_privacy(dp_enabled=True, epsilon=8.0)\n", diff --git a/nemo/NeMo-Safe-Synthesizer/advanced/replace_pii_only.ipynb b/nemo/NeMo-Safe-Synthesizer/advanced/replace_pii_only.ipynb index 1488602b..50c31921 100644 --- a/nemo/NeMo-Safe-Synthesizer/advanced/replace_pii_only.ipynb +++ 
b/nemo/NeMo-Safe-Synthesizer/advanced/replace_pii_only.ipynb @@ -1,249 +1,265 @@ { - "cells": [ - { - "cell_type": "markdown", - "id": "630e3e17", - "metadata": {}, - "source": [ - "# 🔒 NeMo Safe Synthesizer: PII Replacement Only\n", - "\n", - "> ⚠️ **Warning**: NeMo Safe Synthesizer is in Early Access and not recommended for production use.\n", - "\n", - "
\n", - "\n", - "In this notebook, we demonstrate how to use the NeMo Microservices Python SDK to replace PII in a tabular dataset. The notebook should take about 15 minutes to run.\n", - "\n", - "After completing this notebook, you'll be able to:\n", - "- **Use the NeMo Microservices SDK** to interact with Safe Synthesizer\n", - "- **Run a job to perform PII replacement only** (no novel data generation)\n" - ] - }, - { - "cell_type": "markdown", - "id": "8be84f5d", - "metadata": {}, - "source": [ - "#### 💾 Install dependencies\n", - "\n", - "Ensure you have a NeMo Microservices Platform deployment available. If you're using a managed or remote deployment, have the correct base URLs and tokens ready." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9f5d6f5a", - "metadata": {}, - "outputs": [], - "source": [ - "from nemo_microservices import NeMoMicroservices\n", - "from nemo_microservices.beta.safe_synthesizer.builder import SafeSynthesizerBuilder\n", - "\n", - "import logging\n", - "logging.basicConfig(level=logging.WARNING)\n", - "logging.getLogger(\"httpx\").setLevel(logging.WARNING)" - ] - }, - { - "cell_type": "markdown", - "id": "53bb2807", - "metadata": {}, - "source": [ - "### ⚙️ Initialize the NeMo Safe Synthesizer Client\n", - "\n", - "- The Python SDK provides a wrapper around the NeMo Microservices Platform APIs.\n", - "- `http://localhost:8080` is the default URL for `base_url` in quickstart.\n", - "- If using a managed or remote deployment, ensure you use the correct base URLs and tokens." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8c15ab93", - "metadata": {}, - "outputs": [], - "source": [ - "client = NeMoMicroservices(\n", - " base_url=\"http://localhost:8080\",\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "3e1c5697", - "metadata": {}, - "source": [ - "NeMo DataStore is launched as one of the services. 
We'll use it to manage storage, so set the following:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "016213ab", - "metadata": {}, - "outputs": [], - "source": [ - "datastore_config = {\n", - " \"endpoint\": \"http://localhost:3000/v1/hf\",\n", - "}" - ] - }, - { - "cell_type": "markdown", - "id": "2d66c819", - "metadata": {}, - "source": [ - "## 📥 Load input data\n", - "\n", - "Safe Synthesizer processes your input dataset and returns the same rows with PII replaced. For this tutorial we load a small public sample dataset. Replace it with your own data if desired.\n", - "\n", - "The dolly dataset is an open source dataset of instruction-following records. Each record contains (1) a free text prompt that could be sent to an LLM, (2) a context descriptions to help the LLM determine the answer, (3) a response that could come from the LLM, and (4) the instruction category such as classification, open QA, closed QA, information extraction, and brainstorming. The text in each of the first three fields sometimes contains Personally Identifiable Information, such as names, birth dates, and locations." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7204f213", - "metadata": {}, - "outputs": [], - "source": [ - "import pandas as pd\n", - "\n", - "df = pd.read_json(\n", - " \"hf://datasets/databricks/databricks-dolly-15k/databricks-dolly-15k.jsonl\",\n", - " lines=True,\n", - ")\n", - "print(df.head())" - ] - }, - { - "cell_type": "markdown", - "id": "87d72c68", - "metadata": {}, - "source": [ - "## 🏗️ Create a Safe Synthesizer job\n", - "\n", - "The `SafeSynthesizerBuilder` provides a fluent interface to configure and submit jobs.\n", - "\n", - "This job will:\n", - "- Initialize the builder with the NeMo Microservices client.\n", - "- Use the loaded DataFrame as the input data source.\n", - "- Configure the job to use the specified datastore for model storage.\n", - "- Enable automatic replacement of personally identifiable information (PII).\n", - "- Submit the job to the microservices platform." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "85d9de56", - "metadata": {}, - "outputs": [], - "source": [ - "job = (\n", - " SafeSynthesizerBuilder(client)\n", - " .from_data_source(df)\n", - " .with_datastore(datastore_config)\n", - " .with_replace_pii()\n", - " .create_job()\n", - ")\n", - "\n", - "print(f\"job_id = {job.job_id}\")\n", - "job.wait_for_completion()\n", - "\n", - "print(f\"Job finished with status {job.fetch_status()}\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fa2eacb2", - "metadata": {}, - "outputs": [], - "source": [ - "# If your notebook shuts down, it's okay, your job is still running on the microservices platform.\n", - "# You can get the same job object and interact with it again by uncommenting the following code\n", - "# snippet, and modifying it with the job id from the previous cell output.\n", - "\n", - "# from nemo_microservices.beta.safe_synthesizer.sdk.job import SafeSynthesizerJob\n", - "# job = SafeSynthesizerJob(job_id=\"\", client=client)" - ] - }, - { - 
"cell_type": "markdown", - "id": "285d4a9d", - "metadata": {}, - "source": [ - "## 👀 View output data\n", - "\n", - "After the job completes, fetch the output with PII replaced." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7f25574a", - "metadata": {}, - "outputs": [], - "source": [ - "# Fetch the job output data with PII replaced\n", - "output_df = job.fetch_data()\n", - "output_df" - ] - }, - { - "cell_type": "markdown", - "id": "571efc39", - "metadata": {}, - "source": [ - "## 📊 View PII report\n", - "\n", - "A report summarizing the PII replacement is created automatically for every job.\n", - "\n", - "You can download the full HTML report or display it inline below." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bba96175", - "metadata": {}, - "outputs": [], - "source": [ - "# Download the full evaluation report to your local machine\n", - "job.save_report(\"evaluation_report.html\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "45f7e22b", - "metadata": {}, - "outputs": [], - "source": [ - "# Fetch and display the full evaluation report inline\n", - "job.display_report_in_notebook()" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "kendrickb-notebooks", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.13" - } - }, - "nbformat": 4, - "nbformat_minor": 5 + "cells": [ + { + "cell_type": "markdown", + "id": "630e3e17", + "metadata": {}, + "source": [ + "# 🔒 NeMo Safe Synthesizer: PII Replacement Only\n", + "\n", + "> ⚠️ **Warning**: NeMo Safe Synthesizer is in Early Access and not recommended for production use.\n", + "\n", + "
\n", + "\n", + "In this notebook, we demonstrate how to use the NeMo Microservices Python SDK to replace PII in a tabular dataset. The notebook should take about 15 minutes to run.\n", + "\n", + "After completing this notebook, you'll be able to:\n", + "- **Use the NeMo Microservices SDK** to interact with Safe Synthesizer\n", + "- **Run a job to perform PII replacement only** (no novel data generation)\n" + ] + }, + { + "cell_type": "markdown", + "id": "8be84f5d", + "metadata": {}, + "source": [ + "#### 💾 Install dependencies\n", + "\n", + "Ensure you have a NeMo Microservices Platform deployment available. If you're using a managed or remote deployment, have the correct base URLs and tokens ready." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0e23975", + "metadata": { + "tags": [ + "parameters" + ] + }, + "outputs": [], + "source": [ + "# Update as appropriate to your installation URLs\n", + "base_url = \"http://localhost:8080\"\n", + "datastore_endpoint = \"http://localhost:3000/v1/hf\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9f5d6f5a", + "metadata": {}, + "outputs": [], + "source": [ + "from nemo_microservices import NeMoMicroservices\n", + "from nemo_microservices.beta.safe_synthesizer.sdk.job_builder import SafeSynthesizerJobBuilder\n", + "\n", + "import logging\n", + "logging.basicConfig(level=logging.WARNING)\n", + "logging.getLogger(\"httpx\").setLevel(logging.WARNING)" + ] + }, + { + "cell_type": "markdown", + "id": "53bb2807", + "metadata": {}, + "source": [ + "### ⚙️ Initialize the NeMo Safe Synthesizer Client\n", + "\n", + "- The Python SDK provides a wrapper around the NeMo Microservices Platform APIs.\n", + "- `http://localhost:8080` is the default URL for `base_url` in quickstart.\n", + "- If using a managed or remote deployment, ensure you use the correct base URLs and tokens." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8c15ab93", + "metadata": {}, + "outputs": [], + "source": [ + "client = NeMoMicroservices(\n", + " base_url=base_url\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "3e1c5697", + "metadata": {}, + "source": [ + "NeMo DataStore is launched as one of the services. We'll use it to manage storage, so set the following:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "016213ab", + "metadata": {}, + "outputs": [], + "source": [ + "datastore_config = {\n", + " \"endpoint\": datastore_endpoint\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "2d66c819", + "metadata": {}, + "source": [ + "## 📥 Load input data\n", + "\n", + "Safe Synthesizer processes your input dataset and returns the same rows with PII replaced. For this tutorial we load a small public sample dataset. Replace it with your own data if desired.\n", + "\n", + "The dolly dataset is an open source dataset of instruction-following records. Each record contains (1) a free text prompt that could be sent to an LLM, (2) a context descriptions to help the LLM determine the answer, (3) a response that could come from the LLM, and (4) the instruction category such as classification, open QA, closed QA, information extraction, and brainstorming. The text in each of the first three fields sometimes contains Personally Identifiable Information, such as names, birth dates, and locations." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7204f213", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "df = pd.read_json(\n", + " \"hf://datasets/databricks/databricks-dolly-15k/databricks-dolly-15k.jsonl\",\n", + " lines=True,\n", + ")\n", + "print(df.head())" + ] + }, + { + "cell_type": "markdown", + "id": "87d72c68", + "metadata": {}, + "source": [ + "## 🏗️ Create a Safe Synthesizer job\n", + "\n", + "The `SafeSynthesizerJobBuilder` provides a fluent interface to configure and submit jobs.\n", + "\n", + "This job will:\n", + "- Initialize the builder with the NeMo Microservices client.\n", + "- Use the loaded DataFrame as the input data source.\n", + "- Configure the job to use the specified datastore for model storage.\n", + "- Enable automatic replacement of personally identifiable information (PII).\n", + "- Submit the job to the microservices platform." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "85d9de56", + "metadata": {}, + "outputs": [], + "source": [ + "job = (\n", + " SafeSynthesizerJobBuilder(client)\n", + " .with_data_source(df)\n", + " .with_datastore(datastore_config)\n", + " .with_replace_pii()\n", + " .create_job()\n", + ")\n", + "\n", + "print(f\"job_id = {job.job_id}\")\n", + "job.wait_for_completion()\n", + "\n", + "print(f\"Job finished with status {job.fetch_status()}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fa2eacb2", + "metadata": {}, + "outputs": [], + "source": [ + "# If your notebook shuts down, it's okay, your job is still running on the microservices platform.\n", + "# You can get the same job object and interact with it again by uncommenting the following code\n", + "# snippet, and modifying it with the job id from the previous cell output.\n", + "\n", + "# from nemo_microservices.beta.safe_synthesizer.sdk.job import SafeSynthesizerJob\n", + "# job = SafeSynthesizerJob(job_id=\"\", client=client)" + ] + }, 
+ { + "cell_type": "markdown", + "id": "285d4a9d", + "metadata": {}, + "source": [ + "## 👀 View output data\n", + "\n", + "After the job completes, fetch the output with PII replaced." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7f25574a", + "metadata": {}, + "outputs": [], + "source": [ + "# Fetch the job output data with PII replaced\n", + "output_df = job.fetch_data()\n", + "output_df" + ] + }, + { + "cell_type": "markdown", + "id": "571efc39", + "metadata": {}, + "source": [ + "## 📊 View PII report\n", + "\n", + "A report summarizing the PII replacement is created automatically for every job.\n", + "\n", + "You can download the full HTML report or display it inline below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bba96175", + "metadata": {}, + "outputs": [], + "source": [ + "# Download the full evaluation report to your local machine\n", + "job.save_report(\"evaluation_report.html\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "45f7e22b", + "metadata": {}, + "outputs": [], + "source": [ + "# Fetch and display the full evaluation report inline\n", + "job.display_report_in_notebook()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "kendrickb-notebooks", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 } diff --git a/nemo/NeMo-Safe-Synthesizer/intro/safe_synthesizer_101.ipynb b/nemo/NeMo-Safe-Synthesizer/intro/safe_synthesizer_101.ipynb index d76f1973..a47f37ee 100644 --- a/nemo/NeMo-Safe-Synthesizer/intro/safe_synthesizer_101.ipynb +++ b/nemo/NeMo-Safe-Synthesizer/intro/safe_synthesizer_101.ipynb @@ -29,6 +29,22 @@ "**IMPORTANT** 👉 Ensure you have a NeMo 
Microservices Platform deployment available. Follow the quickstart or Helm chart instructions in your environment's setup guide. You may need to restart your kernel after installing dependencies.\n" ] }, + { + "cell_type": "code", + "execution_count": null, + "id": "8e0327d4", + "metadata": { + "tags": [ + "parameters" + ] + }, + "outputs": [], + "source": [ + "# Update as appropriate to your installation URLs\n", + "base_url = \"http://localhost:8080\"\n", + "datastore_endpoint = \"http://localhost:3000/v1/hf\"" + ] + }, { "cell_type": "code", "execution_count": null, @@ -38,7 +54,7 @@ "source": [ "import pandas as pd\n", "from nemo_microservices import NeMoMicroservices\n", - "from nemo_microservices.beta.safe_synthesizer.builder import SafeSynthesizerBuilder\n", + "from nemo_microservices.beta.safe_synthesizer.sdk.job_builder import SafeSynthesizerJobBuilder\n", "\n", "import logging\n", "logging.basicConfig(level=logging.WARNING)\n", @@ -65,7 +81,7 @@ "outputs": [], "source": [ "client = NeMoMicroservices(\n", - " base_url=\"http://localhost:8080\",\n", + " base_url=base_url,\n", ")" ] }, @@ -85,7 +101,7 @@ "outputs": [], "source": [ "datastore_config = {\n", - " \"endpoint\": \"http://localhost:3000/v1/hf\",\n", + " \"endpoint\": datastore_endpoint\n", "}" ] }, @@ -138,7 +154,7 @@ "\n", "The following code creates and submits a job:\n", "- `SafeSynthesizerBuilder(client)`: initialize with the NeMo Microservices client.\n", - "- `.from_data_source(df)`: set the input data source.\n", + "- `.with_data_source(df)`: set the input data source.\n", "- `.with_datastore(datastore_config)`: configure model artifact storage.\n", "- `.with_replace_pii()`: enable automatic replacement of PII.\n", "- `.synthesize()`: train and generate synthetic data.\n", @@ -153,8 +169,8 @@ "outputs": [], "source": [ "job = (\n", - " SafeSynthesizerBuilder(client)\n", - " .from_data_source(df)\n", + " SafeSynthesizerJobBuilder(client)\n", + " .with_data_source(df)\n", " 
.with_datastore(datastore_config)\n", " .with_replace_pii()\n", " .synthesize()\n", @@ -258,7 +274,7 @@ ], "metadata": { "kernelspec": { - "display_name": "kendrickb-notebooks", + "display_name": "GenerativeAIExamples", "language": "python", "name": "python3" },
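The renames in this diff (import path, builder class, and data-source method) are mechanical, so the same migration could be applied to other notebooks with a small script. The following is a minimal sketch under stated assumptions, not part of the SDK or of this change: `RENAMES`, `migrate_source`, and `migrate_notebook` are hypothetical names, and the old/new identifiers are only those visible in the hunks above.

```python
import re

# Old -> new identifiers, taken only from the renames visible in the diff above.
# The patterns do not match already-migrated text, so re-running is harmless.
RENAMES = [
    (r"\bsafe_synthesizer\.builder\b", "safe_synthesizer.sdk.job_builder"),
    (r"\bSafeSynthesizerBuilder\b", "SafeSynthesizerJobBuilder"),
    (r"\.from_data_source\(", ".with_data_source("),
]


def migrate_source(text: str) -> str:
    """Apply every rename to one string (hypothetical helper, not an SDK function)."""
    for pattern, replacement in RENAMES:
        text = re.sub(pattern, replacement, text)
    return text


def migrate_notebook(nb: dict) -> dict:
    """Rewrite the source of every cell in a parsed nbformat 4 notebook dict, in place."""
    for cell in nb.get("cells", []):
        src = cell.get("source", [])
        if isinstance(src, str):
            # nbformat allows "source" to be a single string.
            cell["source"] = migrate_source(src)
        else:
            # More commonly it is a list of lines.
            cell["source"] = [migrate_source(line) for line in src]
    return nb


old_cell = (
    "from nemo_microservices.beta.safe_synthesizer.builder import SafeSynthesizerBuilder\n"
    "job = SafeSynthesizerBuilder(client).from_data_source(df)\n"
)
print(migrate_source(old_cell))
```

Because the helper rewrites markdown cells as well as code cells, it would also update prose that still mentions the old class name, such as the unchanged `SafeSynthesizerBuilder(client)` bullet in `safe_synthesizer_101.ipynb`. To use it on a file, load the notebook with `json.load`, pass the dict to `migrate_notebook`, and write it back with `json.dump`.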