|
4 | 4 | "cell_type": "markdown", |
5 | 5 | "metadata": {}, |
6 | 6 | "source": [ |
7 | | - "# 🎨 NeMo Data Designer 101: Seeding synthetic data generation with an external dataset\n", |
| 7 | + "# 🎨 NeMo Data Designer 101: Seeding Synthetic Data Generation with an External Dataset\n", |
8 | 8 | "\n", |
9 | 9 | "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n",
10 | 10 | ">\n", |
|
14 | 14 | "\n", |
15 | 15 | "In this notebook, we will demonstrate how to seed synthetic data generation in Data Designer with an external dataset.\n", |
16 | 16 | "\n", |
17 | | - "If this is your first time using Data Designer, we recommend starting with the [first notebook](./1-the-basics.ipynb) in this 101 series.\n", |
18 | | - "\n", |
19 | | - "#### 💾 Install dependencies\n", |
20 | | - "\n", |
21 | | - "**IMPORTANT** 👉 If you haven't already, follow the instructions in the [README](../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment.\n" |
| 17 | + "If this is your first time using Data Designer, we recommend starting with the [first notebook](./1-the-basics.ipynb) in this 101 series." |
22 | 18 | ] |
23 | 19 | }, |
24 | 20 | { |
25 | 21 | "cell_type": "markdown", |
26 | 22 | "metadata": {}, |
27 | 23 | "source": [ |
28 | | - "If the installation worked, you should be able to make the following imports:\n" |
| 24 | + "#### 💾 Install dependencies\n", |
| 25 | + "\n", |
| 26 | + "**IMPORTANT** 👉 If you haven't already, follow the instructions in the [README](../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment.\n" |
29 | 27 | ] |
30 | 28 | }, |
31 | 29 | { |
|
40 | 38 | "from nemo_microservices.beta.data_designer import (\n", |
41 | 39 | " DataDesignerConfigBuilder,\n", |
42 | 40 | " DataDesignerClient,\n", |
43 | | - ")" |
| 41 | + ")\n", |
| 42 | + "\n", |
| 43 | + "from nemo_microservices.beta.data_designer.config import columns as C\n", |
| 44 | + "from nemo_microservices.beta.data_designer.config import params as P" |
44 | 45 | ] |
45 | 46 | }, |
46 | 47 | { |
|
49 | 50 | "source": [ |
50 | 51 | "### ⚙️ Initialize the NeMo Data Designer (NDD) Client\n", |
51 | 52 | "\n", |
52 | | - "- The NDD client is responsible for submitting generation requests to the Data Designer microservice.\n" |
| 53 | + "- The NDD client is responsible for submitting generation requests to the Data Designer microservice.\n", |
| 54 | + "- In this notebook, we connect to the [Data Designer managed service](https://build.nvidia.com/nemo/data-designer). Alternatively, you can connect to your own instance of Data Designer by following the deployment instructions [here](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).\n",
| 55 | + "- If you have an instance of Data Designer running locally, you can connect to it as follows:\n",
| 56 | + "\n", |
| 57 | + " ```python\n", |
| 58 | + " data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8080\"))\n", |
| 59 | + " ```\n" |
| 60 | + ] |
| 61 | + }, |
| 62 | + { |
| 63 | + "cell_type": "code", |
| 64 | + "execution_count": null, |
| 65 | + "metadata": {}, |
| 66 | + "outputs": [], |
| 67 | + "source": [ |
| 68 | + "# If using the Data Designer managed service, provide your API key here.\n",
| 69 | + "api_key = getpass(\"Enter data designer API key: \")\n", |
| 70 | + "\n", |
| 71 | + "if len(api_key) > 0:\n", |
| 72 | + " print(\"✅ API key received.\")\n", |
| 73 | + "else:\n", |
| 74 | + " print(\"❌ No API key provided. Please enter your Data Designer API key.\")"
53 | 75 | ] |
54 | 76 | }, |
55 | 77 | { |
|
58 | 80 | "metadata": {}, |
59 | 81 | "outputs": [], |
60 | 82 | "source": [ |
61 | | - "ndd = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8000\"))" |
| 83 | + "data_designer_client = DataDesignerClient(\n", |
| 84 | + " client=NeMoMicroservices(\n", |
| 85 | + " base_url=\"https://ai.api.nvidia.com/v1/stg/nemo/dd\",\n", |
| 86 | + " default_headers={\"Authorization\": f\"Bearer {api_key}\"} # API key entered above\n",
| 87 | + " )\n", |
| 88 | + ")" |
62 | 89 | ] |
63 | 90 | }, |
64 | 91 | { |
|
83 | 110 | "outputs": [], |
84 | 111 | "source": [ |
85 | 112 | "# build.nvidia.com model id and alias\n",
86 | | - "endpoint = \"https://integrate.api.nvidia.com/v1\"\n", |
87 | | - "model_id = \"mistralai/mistral-small-24b-instruct\"\n", |
88 | | - "\n", |
89 | | - "model_alias = \"mistral-small\"\n", |
90 | | - "\n", |
91 | | - "# You will need to enter your model provider API key to run this notebook.\n", |
92 | | - "api_key = getpass(\"Enter model provider API key: \")\n", |
93 | | - "\n", |
94 | | - "if len(api_key) > 0:\n", |
95 | | - " print(\"✅ API key received.\")\n", |
96 | | - "else:\n", |
97 | | - " print(\"❌ No API key provided. Please enter your model provider API key.\")" |
| 113 | + "model_id = \"nvidia/nvidia-nemotron-nano-9b-v2\"\n", |
| 114 | + "model_alias = \"nemotron-nano-v2\"" |
98 | 115 | ] |
99 | 116 | }, |
100 | 117 | { |
|
103 | 120 | "metadata": {}, |
104 | 121 | "outputs": [], |
105 | 122 | "source": [ |
106 | | - "# You can also load the model configs from a YAML string or file.\n", |
107 | | - "\n", |
108 | | - "model_configs_yaml = f\"\"\"\\\n", |
109 | | - "model_configs:\n", |
110 | | - " - alias: \"{model_alias}\"\n", |
111 | | - " inference_parameters:\n", |
112 | | - " max_tokens: 1024\n", |
113 | | - " temperature: 0.5\n", |
114 | | - " top_p: 1.0\n", |
115 | | - " model:\n", |
116 | | - " api_endpoint:\n", |
117 | | - " api_key: \"{api_key}\"\n", |
118 | | - " model_id: \"{model_id}\"\n", |
119 | | - " url: \"{endpoint}\"\n", |
120 | | - "\"\"\"\n", |
121 | | - "\n", |
122 | | - "config_builder = DataDesignerConfigBuilder(model_configs=model_configs_yaml)" |
| 123 | + "config_builder = DataDesignerConfigBuilder(\n", |
| 124 | + " model_configs=[\n",
| 125 | + " P.ModelConfig(\n", |
| 126 | + " alias=model_alias,\n", |
| 127 | + " provider=\"nvidiabuild\",\n", |
| 128 | + " model=model_id,\n", |
| 129 | + " inference_parameters=P.InferenceParameters(\n", |
| 130 | + " max_tokens=1024,\n", |
| 131 | + " temperature=0.5,\n", |
| 132 | + " top_p=1.0,\n", |
| 133 | + " timeout=120\n", |
| 134 | + " ),\n", |
| 135 | + " is_reasoner=True\n", |
| 136 | + " ),\n", |
| 137 | + " ]\n", |
| 138 | + ")" |
123 | 139 | ] |
124 | 140 | }, |
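| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "You can register more than one model in the same config builder and refer to each by its alias. A minimal sketch reusing the `P.ModelConfig` pattern above is shown below; the second entry uses `mistralai/mistral-small-24b-instruct` purely as an example, and any model id available from your provider works:\n",
| | + "\n",
| | + "```python\n",
| | + "config_builder = DataDesignerConfigBuilder(\n",
| | + "    model_configs=[\n",
| | + "        P.ModelConfig(\n",
| | + "            alias=model_alias,\n",
| | + "            provider=\"nvidiabuild\",\n",
| | + "            model=model_id,\n",
| | + "            inference_parameters=P.InferenceParameters(\n",
| | + "                max_tokens=1024, temperature=0.5, top_p=1.0, timeout=120\n",
| | + "            ),\n",
| | + "            is_reasoner=True,\n",
| | + "        ),\n",
| | + "        # Example second model; swap in any model id your provider serves.\n",
| | + "        P.ModelConfig(\n",
| | + "            alias=\"mistral-small\",\n",
| | + "            provider=\"nvidiabuild\",\n",
| | + "            model=\"mistralai/mistral-small-24b-instruct\",\n",
| | + "            inference_parameters=P.InferenceParameters(\n",
| | + "                max_tokens=1024, temperature=0.5, top_p=1.0, timeout=120\n",
| | + "            ),\n",
| | + "        ),\n",
| | + "    ]\n",
| | + ")\n",
| | + "```\n"
| | + ]
| | + },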
125 | 141 | { |
|
130 | 146 | "\n", |
131 | 147 | "- For this notebook, we'll change gears and create a synthetic dataset of patient notes.\n", |
132 | 148 | "\n", |
133 | | - "- To steer the generation process, we will use an open-source [symptom-to-diagnosis dataset](https://huggingface.co/datasets/gretelai/symptom_to_diagnosis).\n" |
| 149 | + "- To steer the generation process, we will use an open-source [symptom-to-diagnosis dataset](https://huggingface.co/datasets/gretelai/symptom_to_diagnosis).\n", |
| 150 | + "\n", |
| 151 | + "- In this dataset, the `input_text` field contains the patient summary and the `output_text` field contains the diagnosis.\n"
134 | 152 | ] |
135 | 153 | }, |
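| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "If you want to inspect the seed data before wiring it into Data Designer, a minimal sketch (assuming the `datasets` library is installed) is:\n",
| | + "\n",
| | + "```python\n",
| | + "from datasets import load_dataset\n",
| | + "\n",
| | + "# Peek at the columns Data Designer will sample from (input_text, output_text).\n",
| | + "df_seed = load_dataset(\"gretelai/symptom_to_diagnosis\")[\"train\"].to_pandas()\n",
| | + "print(f\"Number of records: {len(df_seed)}\")\n",
| | + "df_seed.head()\n",
| | + "```\n"
| | + ]
| | + },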
136 | 154 | { |
|
139 | 157 | "metadata": {}, |
140 | 158 | "outputs": [], |
141 | 159 | "source": [ |
142 | | - "from datasets import load_dataset\n", |
143 | | - "\n", |
144 | | - "df_seed = load_dataset(\"gretelai/symptom_to_diagnosis\")[\"train\"].to_pandas()\n", |
145 | | - "\n", |
146 | | - "# Rename the columns to something more descriptive.\n", |
147 | | - "df_seed = df_seed.rename(\n", |
148 | | - " columns={\"output_text\": \"diagnosis\", \"input_text\": \"patient_summary\"}\n", |
149 | | - ")\n", |
150 | | - "\n", |
151 | | - "print(f\"Number of records: {len(df_seed)}\")\n", |
152 | | - "\n", |
153 | | - "# Save the file so we can upload it to the microservice.\n", |
154 | | - "df_seed.to_csv(\"symptom_to_diagnosis.csv\", index=False)\n", |
155 | | - "\n", |
156 | | - "df_seed.head()" |
157 | | - ] |
158 | | - }, |
159 | | - { |
160 | | - "cell_type": "markdown", |
161 | | - "metadata": {}, |
162 | | - "source": [ |
163 | | - "## 🎨 Designing our synthetic patient notes dataset\n", |
164 | | - "\n", |
165 | | - "- We set the seed dataset using the `with_seed_dataset` method.\n", |
| 160 | + "# Provide a Hugging Face token here; it is used to download the seed dataset.\n",
| 161 | + "hf_token = getpass(\"Enter Hugging Face token: \")\n",
166 | 162 | "\n", |
167 | | - "- We use the `shuffle` sampling strategy, which shuffles the seed dataset before sampling.\n", |
168 | | - "\n", |
169 | | - "- We set `with_replacement=False`, which limits our max number of records to 853, which is the number of records in the seed dataset.\n" |
| 163 | + "if len(hf_token) > 0:\n", |
| 164 | + " print(\"✅ Hugging Face token received.\")\n",
| 165 | + "else:\n",
| 166 | + " print(\"❌ No Hugging Face token provided. Please enter your Hugging Face token.\")"
170 | 167 | ] |
171 | 168 | }, |
172 | 169 | { |
|
180 | 177 | "# to the datastore. Note we need to pass in the datastore's endpoint, which\n", |
181 | 178 | "# points at the Hugging Face Hub for this notebook.\n",
182 | 179 | "config_builder.with_seed_dataset(\n", |
183 | | - " repo_id=\"into-tutorials/seeding-with-a-dataset\",\n", |
184 | | - " filename=\"symptom_to_diagnosis.csv\",\n", |
185 | | - " dataset_path=\"./symptom_to_diagnosis.csv\",\n", |
| 180 | + " repo_id=\"gretelai/symptom_to_diagnosis\",\n", |
| 181 | + " filename=\"train.jsonl\",\n", |
186 | 182 | " sampling_strategy=\"shuffle\",\n", |
187 | 183 | " with_replacement=False,\n", |
188 | | - " datastore={\"endpoint\": \"http://localhost:3000/v1/hf\"},\n", |
| 184 | + " datastore={\"endpoint\": \"https://huggingface.co\", \"token\": hf_token}\n", |
189 | 185 | ")" |
190 | 186 | ] |
191 | 187 | }, |
| 188 | + { |
| 189 | + "cell_type": "markdown", |
| 190 | + "metadata": {}, |
| 191 | + "source": [ |
| 192 | + "## 🎨 Designing our synthetic patient notes dataset\n", |
| 193 | + "\n", |
| 194 | + "- We set the seed dataset using the `with_seed_dataset` method.\n", |
| 195 | + "\n", |
| 196 | + "- We use the `shuffle` sampling strategy, which shuffles the seed dataset before sampling.\n", |
| 197 | + "\n", |
| 198 | + "- We set `with_replacement=False`, which caps generation at 853 records, the size of the seed dataset (see the sketch below for the sampling intuition).\n"
| 199 | + ] |
| 200 | + }, |
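| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "Conceptually, this sampling strategy behaves like the small pandas sketch below. This is an illustration only; Data Designer performs the sampling for you on the service side.\n",
| | + "\n",
| | + "```python\n",
| | + "import pandas as pd\n",
| | + "\n",
| | + "# Toy stand-in for the 853-record seed dataset.\n",
| | + "seed = pd.DataFrame({\"input_text\": [\"a\", \"b\", \"c\"], \"output_text\": [\"x\", \"y\", \"z\"]})\n",
| | + "\n",
| | + "# shuffle + with_replacement=False: rows are drawn in random order and each\n",
| | + "# row is used at most once, so num_records cannot exceed len(seed).\n",
| | + "sampled = seed.sample(frac=1, replace=False).reset_index(drop=True)\n",
| | + "```\n"
| | + ]
| | + },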
192 | 201 | { |
193 | 202 | "cell_type": "code", |
194 | 203 | "execution_count": null, |
|
270 | 279 | " name=\"physician_notes\",\n", |
271 | 280 | " prompt=\"\"\"\\\n", |
272 | 281 | "You are a primary-care physician who just had an appointment with {{ first_name }} {{ last_name }},\n", |
273 | | - "who has been struggling with symptoms from {{ diagnosis }} since {{ symptom_onset_date }}.\n", |
| 282 | + "who has been struggling with symptoms from {{ output_text }} since {{ symptom_onset_date }}.\n", |
274 | 283 | "The date of today's visit is {{ date_of_visit }}.\n", |
275 | 284 | "\n", |
276 | | - "{{ patient_summary }}\n", |
| 285 | + "{{ input_text }}\n", |
277 | 286 | "\n", |
278 | 287 | "Write careful notes about your visit with {{ first_name }},\n", |
279 | 288 | "as Dr. {{ doctor_sampler.first_name }} {{ doctor_sampler.last_name }}.\n", |
|
303 | 312 | "metadata": {}, |
304 | 313 | "outputs": [], |
305 | 314 | "source": [ |
306 | | - "preview = ndd.preview(config_builder, verbose_logging=True)" |
307 | | - ] |
308 | | - }, |
309 | | - { |
310 | | - "cell_type": "code", |
311 | | - "execution_count": null, |
312 | | - "metadata": {}, |
313 | | - "outputs": [], |
314 | | - "source": [ |
315 | | - "# The preview dataset is available as a pandas DataFrame.\n", |
316 | | - "preview.dataset" |
| 315 | + "preview = data_designer_client.preview(config_builder, verbose_logging=True)" |
317 | 316 | ] |
318 | 317 | }, |
319 | 318 | { |
|
326 | 325 | "preview.display_sample_record()" |
327 | 326 | ] |
328 | 327 | }, |
329 | | - { |
330 | | - "cell_type": "markdown", |
331 | | - "metadata": {}, |
332 | | - "source": [ |
333 | | - "## 🧬 Generate your dataset\n", |
334 | | - "\n", |
335 | | - "- Once you are happy with the preview, scale up to a larger dataset.\n", |
336 | | - "\n", |
337 | | - "- The `create` method will submit your generation job to the microservice and return a results object.\n", |
338 | | - "\n", |
339 | | - "- If you want to wait for the job to complete, set `wait_until_done=True`.\n" |
340 | | - ] |
341 | | - }, |
342 | 328 | { |
343 | 329 | "cell_type": "code", |
344 | 330 | "execution_count": null, |
345 | 331 | "metadata": {}, |
346 | 332 | "outputs": [], |
347 | 333 | "source": [ |
348 | | - "results = ndd.create(config_builder, num_records=20, wait_until_done=True)" |
| 334 | + "# The preview dataset is available as a pandas DataFrame.\n", |
| 335 | + "preview.dataset" |
349 | 336 | ] |
350 | 337 | }, |
351 | 338 | { |
352 | | - "cell_type": "code", |
353 | | - "execution_count": null, |
| 339 | + "cell_type": "markdown", |
354 | 340 | "metadata": {}, |
355 | | - "outputs": [], |
356 | 341 | "source": [ |
357 | | - "# load the dataset into a pandas DataFrame\n", |
358 | | - "dataset = results.load_dataset()\n", |
| 342 | + "## ⏭️ Next Steps\n", |
| 343 | + "\n", |
| 344 | + "Check out the following notebook to learn more:\n",
359 | 345 | "\n", |
360 | | - "dataset.head()" |
| 346 | + "- [Using Custom Model Configs](./4-custom-model-configs.ipynb)\n" |
361 | 347 | ] |
362 | 348 | } |
363 | 349 | ], |
364 | 350 | "metadata": { |
365 | 351 | "kernelspec": { |
366 | | - "display_name": ".venv", |
| 352 | + "display_name": "sdg_venv", |
367 | 353 | "language": "python", |
368 | 354 | "name": "python3" |
369 | 355 | }, |
|
377 | 363 | "name": "python", |
378 | 364 | "nbconvert_exporter": "python", |
379 | 365 | "pygments_lexer": "ipython3", |
380 | | - "version": "3.11.9" |
| 366 | + "version": "3.12.11" |
381 | 367 | } |
382 | 368 | }, |
383 | 369 | "nbformat": 4, |
|