|
4 | 4 | "cell_type": "markdown", |
5 | 5 | "metadata": {}, |
6 | 6 | "source": [ |
7 | | - "# 🎨 NeMo Data Designer 101: Seeding synthetic data generation with an external dataset\n", |
| 7 | + "# 🎨 NeMo Data Designer 101: Seeding Synthetic Data Generation with an External Dataset\n", |
8 | 8 | "\n", |
9 | 9 | "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n",
10 | 10 | ">\n", |
|
14 | 14 | "\n", |
15 | 15 | "In this notebook, we will demonstrate how to seed synthetic data generation in Data Designer with an external dataset.\n", |
16 | 16 | "\n", |
17 | | - "If this is your first time using Data Designer, we recommend starting with the [first notebook](./1-the-basics.ipynb) in this 101 series.\n", |
18 | | - "\n", |
19 | | - "#### 💾 Install dependencies\n", |
20 | | - "\n", |
21 | | - "**IMPORTANT** 👉 If you haven't already, follow the instructions in the [README](../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment.\n" |
| 17 | + "If this is your first time using Data Designer, we recommend starting with the [first notebook](./1-the-basics.ipynb) in this 101 series." |
22 | 18 | ] |
23 | 19 | }, |
24 | 20 | { |
25 | 21 | "cell_type": "markdown", |
26 | 22 | "metadata": {}, |
27 | 23 | "source": [ |
28 | | - "If the installation worked, you should be able to make the following imports:\n" |
| 24 | + "#### 💾 Install dependencies\n", |
| 25 | + "\n", |
| 26 | + "**IMPORTANT** 👉 If you haven't already, follow the instructions in the [README](../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment.\n" |
29 | 27 | ] |
30 | 28 | }, |
31 | 29 | { |
|
40 | 38 | "from nemo_microservices.beta.data_designer import (\n", |
41 | 39 | " DataDesignerConfigBuilder,\n", |
42 | 40 | " DataDesignerClient,\n", |
43 | | - ")" |
| 41 | + ")\n", |
| 42 | + "\n", |
| 43 | + "from nemo_microservices.beta.data_designer.config import columns as C\n", |
| 44 | + "from nemo_microservices.beta.data_designer.config import params as P" |
44 | 45 | ] |
45 | 46 | }, |
46 | 47 | { |
|
49 | 50 | "source": [ |
50 | 51 | "### ⚙️ Initialize the NeMo Data Designer (NDD) Client\n", |
51 | 52 | "\n", |
52 | | - "- The NDD client is responsible for submitting generation requests to the Data Designer microservice.\n" |
| 53 | + "- The NDD client is responsible for submitting generation requests to the Data Designer microservice.\n", |
| 54 | + "- In this notebook, we connect to the [Data Designer managed service](https://build.nvidia.com/nemo/data-designer). Alternatively, you can connect to your own instance of Data Designer by following the deployment instructions [here](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).\n",
| 55 | + "- If you have an instance of Data Designer running locally, you can connect to it as follows:\n",
| 56 | + "\n", |
| 57 | + " ```python\n", |
| 58 | + " data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8080\"))\n", |
| 59 | + " ```\n" |
| 60 | + ] |
| 61 | + }, |
| 62 | + { |
| 63 | + "cell_type": "code", |
| 64 | + "execution_count": null, |
| 65 | + "metadata": {}, |
| 66 | + "outputs": [], |
| 67 | + "source": [ |
| 68 | + "# If using the Data Designer managed service, provide your API key here.\n",
| 69 | + "api_key = getpass(\"Enter data designer API key: \")\n", |
| 70 | + "\n", |
| 71 | + "if len(api_key) > 0:\n", |
| 72 | + " print(\"✅ API key received.\")\n", |
| 73 | + "else:\n", |
| 74 | + " print(\"❌ No API key provided. Please enter your Data Designer API key.\")"
53 | 75 | ] |
54 | 76 | }, |
55 | 77 | { |
|
58 | 80 | "metadata": {}, |
59 | 81 | "outputs": [], |
60 | 82 | "source": [ |
61 | | - "ndd = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8000\"))" |
| 83 | + "data_designer_client = DataDesignerClient(\n", |
| 84 | + " client=NeMoMicroservices(\n", |
| 85 | + " base_url=\"https://ai.api.nvidia.com/v1/stg/nemo/dd\",\n", |
| 86 | + " default_headers={\"Authorization\": f\"Bearer {api_key}\"} # API key entered above\n",
| 87 | + " )\n", |
| 88 | + ")" |
62 | 89 | ] |
63 | 90 | }, |
64 | 91 | { |
|
83 | 110 | "outputs": [], |
84 | 111 | "source": [ |
85 | 112 | "# build.nvidia.com model id and alias\n",
86 | | - "endpoint = \"https://integrate.api.nvidia.com/v1\"\n", |
87 | | - "model_id = \"mistralai/mistral-small-24b-instruct\"\n", |
88 | | - "\n", |
89 | | - "model_alias = \"mistral-small\"\n", |
90 | | - "\n", |
91 | | - "# You will need to enter your model provider API key to run this notebook.\n", |
92 | | - "api_key = getpass(\"Enter model provider API key: \")\n", |
93 | | - "\n", |
94 | | - "if len(api_key) > 0:\n", |
95 | | - " print(\"✅ API key received.\")\n", |
96 | | - "else:\n", |
97 | | - " print(\"❌ No API key provided. Please enter your model provider API key.\")" |
| 113 | + "model_id = \"nvidia/nvidia-nemotron-nano-9b-v2\"\n", |
| 114 | + "model_alias = \"nemotron-nano-v2\"" |
98 | 115 | ] |
99 | 116 | }, |
100 | 117 | { |
|
103 | 120 | "metadata": {}, |
104 | 121 | "outputs": [], |
105 | 122 | "source": [ |
106 | | - "# You can also load the model configs from a YAML string or file.\n", |
107 | | - "\n", |
108 | | - "model_configs_yaml = f\"\"\"\\\n", |
109 | | - "model_configs:\n", |
110 | | - " - alias: \"{model_alias}\"\n", |
111 | | - " inference_parameters:\n", |
112 | | - " max_tokens: 1024\n", |
113 | | - " temperature: 0.5\n", |
114 | | - " top_p: 1.0\n", |
115 | | - " model:\n", |
116 | | - " api_endpoint:\n", |
117 | | - " api_key: \"{api_key}\"\n", |
118 | | - " model_id: \"{model_id}\"\n", |
119 | | - " url: \"{endpoint}\"\n", |
120 | | - "\"\"\"\n", |
121 | | - "\n", |
122 | | - "config_builder = DataDesignerConfigBuilder(model_configs=model_configs_yaml)" |
| 123 | + "config_builder = DataDesignerConfigBuilder(\n", |
| 124 | + " model_configs=[\n",
| 125 | + " P.ModelConfig(\n", |
| 126 | + " alias=model_alias,\n", |
| 127 | + " provider=\"nvidiabuild\",\n", |
| 128 | + " model=model_id,\n", |
| 129 | + " inference_parameters=P.InferenceParameters(\n", |
| 130 | + " max_tokens=1024,\n", |
| 131 | + " temperature=0.5,\n", |
| 132 | + " top_p=1.0,\n", |
| 133 | + " timeout=120\n", |
| 134 | + " ),\n", |
| 135 | + " is_reasoner=True\n", |
| 136 | + " ),\n", |
| 137 | + " ]\n", |
| 138 | + ")" |
123 | 139 | ] |
124 | 140 | }, |
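| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "You can register more than one model in the same config builder and refer to each by its alias. A minimal sketch reusing the `P.ModelConfig` pattern above is shown below; the second entry uses `mistralai/mistral-small-24b-instruct` purely as an example, and any model id available from your provider works:\n",
| | + "\n",
| | + "```python\n",
| | + "config_builder = DataDesignerConfigBuilder(\n",
| | + "    model_configs=[\n",
| | + "        P.ModelConfig(\n",
| | + "            alias=model_alias,\n",
| | + "            provider=\"nvidiabuild\",\n",
| | + "            model=model_id,\n",
| | + "            inference_parameters=P.InferenceParameters(\n",
| | + "                max_tokens=1024, temperature=0.5, top_p=1.0, timeout=120\n",
| | + "            ),\n",
| | + "            is_reasoner=True,\n",
| | + "        ),\n",
| | + "        # Example second model; swap in any model id your provider serves.\n",
| | + "        P.ModelConfig(\n",
| | + "            alias=\"mistral-small\",\n",
| | + "            provider=\"nvidiabuild\",\n",
| | + "            model=\"mistralai/mistral-small-24b-instruct\",\n",
| | + "            inference_parameters=P.InferenceParameters(\n",
| | + "                max_tokens=1024, temperature=0.5, top_p=1.0, timeout=120\n",
| | + "            ),\n",
| | + "        ),\n",
| | + "    ]\n",
| | + ")\n",
| | + "```\n"
| | + ]
| | + },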
125 | 141 | { |
|
130 | 146 | "\n", |
131 | 147 | "- For this notebook, we'll change gears and create a synthetic dataset of patient notes.\n", |
132 | 148 | "\n", |
133 | | - "- To steer the generation process, we will use an open-source [symptom-to-diagnosis dataset](https://huggingface.co/datasets/gretelai/symptom_to_diagnosis).\n" |
| 149 | + "- To steer the generation process, we will use an open-source [symptom-to-diagnosis dataset](https://huggingface.co/datasets/gretelai/symptom_to_diagnosis).\n", |
| 150 | + "\n", |
| 151 | + "- In this dataset, the `input_text` field contains the patient summary and the `output_text` field contains the diagnosis.\n"
134 | 152 | ] |
135 | 153 | }, |
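| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "If you want to inspect the seed data before wiring it into Data Designer, a minimal sketch (assuming the `datasets` library is installed) is:\n",
| | + "\n",
| | + "```python\n",
| | + "from datasets import load_dataset\n",
| | + "\n",
| | + "# Peek at the columns Data Designer will sample from (input_text, output_text).\n",
| | + "df_seed = load_dataset(\"gretelai/symptom_to_diagnosis\")[\"train\"].to_pandas()\n",
| | + "print(f\"Number of records: {len(df_seed)}\")\n",
| | + "df_seed.head()\n",
| | + "```\n"
| | + ]
| | + },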
136 | 154 | { |
|
139 | 157 | "metadata": {}, |
140 | 158 | "outputs": [], |
141 | 159 | "source": [ |
142 | | - "from datasets import load_dataset\n", |
143 | | - "\n", |
144 | | - "df_seed = load_dataset(\"gretelai/symptom_to_diagnosis\")[\"train\"].to_pandas()\n", |
145 | | - "\n", |
146 | | - "# Rename the columns to something more descriptive.\n", |
147 | | - "df_seed = df_seed.rename(\n", |
148 | | - " columns={\"output_text\": \"diagnosis\", \"input_text\": \"patient_summary\"}\n", |
149 | | - ")\n", |
150 | | - "\n", |
151 | | - "print(f\"Number of records: {len(df_seed)}\")\n", |
152 | | - "\n", |
153 | | - "# Save the file so we can upload it to the microservice.\n", |
154 | | - "df_seed.to_csv(\"symptom_to_diagnosis.csv\", index=False)\n", |
155 | | - "\n", |
156 | | - "df_seed.head()" |
157 | | - ] |
158 | | - }, |
159 | | - { |
160 | | - "cell_type": "markdown", |
161 | | - "metadata": {}, |
162 | | - "source": [ |
163 | | - "## 🎨 Designing our synthetic patient notes dataset\n", |
164 | | - "\n", |
165 | | - "- We set the seed dataset using the `with_seed_dataset` method.\n", |
| 160 | + "# Provide a Hugging Face token here; it is used to download the seed dataset.\n",
| 161 | + "hf_token = getpass(\"Enter Hugging Face token: \")\n",
166 | 162 | "\n", |
167 | | - "- We use the `shuffle` sampling strategy, which shuffles the seed dataset before sampling.\n", |
168 | | - "\n", |
169 | | - "- We set `with_replacement=False`, which limits our max number of records to 853, which is the number of records in the seed dataset.\n" |
| 163 | + "if len(hf_token) > 0:\n", |
| 164 | + " print(\"✅ Hugging Face token received.\")\n",
| 165 | + "else:\n",
| 166 | + " print(\"❌ No Hugging Face token provided. Please enter your Hugging Face token.\")"
170 | 167 | ] |
171 | 168 | }, |
172 | 169 | { |
|
180 | 177 | "# to the datastore. Note we need to pass in the datastore's endpoint, which\n", |
181 | 178 | "# points at the Hugging Face Hub for this notebook.\n",
182 | 179 | "config_builder.with_seed_dataset(\n", |
183 | | - " repo_id=\"into-tutorials/seeding-with-a-dataset\",\n", |
184 | | - " filename=\"symptom_to_diagnosis.csv\",\n", |
185 | | - " dataset_path=\"./symptom_to_diagnosis.csv\",\n", |
| 180 | + " repo_id=\"gretelai/symptom_to_diagnosis\",\n", |
| 181 | + " filename=\"train.jsonl\",\n", |
186 | 182 | " sampling_strategy=\"shuffle\",\n", |
187 | 183 | " with_replacement=False,\n", |
188 | | - " datastore={\"endpoint\": \"http://localhost:3000/v1/hf\"},\n", |
| 184 | + " datastore={\"endpoint\": \"https://huggingface.co\", \"token\": hf_token}\n", |
189 | 185 | ")" |
190 | 186 | ] |
191 | 187 | }, |
| 188 | + { |
| 189 | + "cell_type": "markdown", |
| 190 | + "metadata": {}, |
| 191 | + "source": [ |
| 192 | + "## 🎨 Designing our synthetic patient notes dataset\n", |
| 193 | + "\n", |
| 194 | + "- We set the seed dataset using the `with_seed_dataset` method.\n", |
| 195 | + "\n", |
| 196 | + "- We use the `shuffle` sampling strategy, which shuffles the seed dataset before sampling.\n", |
| 197 | + "\n", |
| 198 | + "- We set `with_replacement=False`, which caps generation at 853 records, the size of the seed dataset (see the sketch below for the sampling intuition).\n"
| 199 | + ] |
| 200 | + }, |
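| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "Conceptually, this sampling strategy behaves like the small pandas sketch below. This is an illustration only; Data Designer performs the sampling for you on the service side.\n",
| | + "\n",
| | + "```python\n",
| | + "import pandas as pd\n",
| | + "\n",
| | + "# Toy stand-in for the 853-record seed dataset.\n",
| | + "seed = pd.DataFrame({\"input_text\": [\"a\", \"b\", \"c\"], \"output_text\": [\"x\", \"y\", \"z\"]})\n",
| | + "\n",
| | + "# shuffle + with_replacement=False: rows are drawn in random order and each\n",
| | + "# row is used at most once, so num_records cannot exceed len(seed).\n",
| | + "sampled = seed.sample(frac=1, replace=False).reset_index(drop=True)\n",
| | + "```\n"
| | + ]
| | + },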
192 | 201 | { |
193 | 202 | "cell_type": "code", |
194 | 203 | "execution_count": null, |
|
270 | 279 | " name=\"physician_notes\",\n", |
271 | 280 | " prompt=\"\"\"\\\n", |
272 | 281 | "You are a primary-care physician who just had an appointment with {{ first_name }} {{ last_name }},\n", |
273 | | - "who has been struggling with symptoms from {{ diagnosis }} since {{ symptom_onset_date }}.\n", |
| 282 | + "who has been struggling with symptoms from {{ output_text }} since {{ symptom_onset_date }}.\n", |
274 | 283 | "The date of today's visit is {{ date_of_visit }}.\n", |
275 | 284 | "\n", |
276 | | - "{{ patient_summary }}\n", |
| 285 | + "{{ input_text }}\n", |
277 | 286 | "\n", |
278 | 287 | "Write careful notes about your visit with {{ first_name }},\n", |
279 | 288 | "as Dr. {{ doctor_sampler.first_name }} {{ doctor_sampler.last_name }}.\n", |
|
303 | 312 | "metadata": {}, |
304 | 313 | "outputs": [], |
305 | 314 | "source": [ |
306 | | - "preview = ndd.preview(config_builder, verbose_logging=True)" |
307 | | - ] |
308 | | - }, |
309 | | - { |
310 | | - "cell_type": "code", |
311 | | - "execution_count": null, |
312 | | - "metadata": {}, |
313 | | - "outputs": [], |
314 | | - "source": [ |
315 | | - "# The preview dataset is available as a pandas DataFrame.\n", |
316 | | - "preview.dataset" |
| 315 | + "preview = data_designer_client.preview(config_builder, verbose_logging=True)" |
317 | 316 | ] |
318 | 317 | }, |
319 | 318 | { |
|
326 | 325 | "preview.display_sample_record()" |
327 | 326 | ] |
328 | 327 | }, |
329 | | - { |
330 | | - "cell_type": "markdown", |
331 | | - "metadata": {}, |
332 | | - "source": [ |
333 | | - "## 🧬 Generate your dataset\n", |
334 | | - "\n", |
335 | | - "- Once you are happy with the preview, scale up to a larger dataset.\n", |
336 | | - "\n", |
337 | | - "- The `create` method will submit your generation job to the microservice and return a results object.\n", |
338 | | - "\n", |
339 | | - "- If you want to wait for the job to complete, set `wait_until_done=True`.\n" |
340 | | - ] |
341 | | - }, |
342 | 328 | { |
343 | 329 | "cell_type": "code", |
344 | 330 | "execution_count": null, |
345 | 331 | "metadata": {}, |
346 | 332 | "outputs": [], |
347 | 333 | "source": [ |
348 | | - "results = ndd.create(config_builder, num_records=20, wait_until_done=True)" |
| 334 | + "# The preview dataset is available as a pandas DataFrame.\n", |
| 335 | + "preview.dataset" |
349 | 336 | ] |
350 | 337 | }, |
351 | 338 | { |
352 | | - "cell_type": "code", |
353 | | - "execution_count": null, |
| 339 | + "cell_type": "markdown", |
354 | 340 | "metadata": {}, |
355 | | - "outputs": [], |
356 | 341 | "source": [ |
357 | | - "# load the dataset into a pandas DataFrame\n", |
358 | | - "dataset = results.load_dataset()\n", |
| 342 | + "## ⏭️ Next Steps\n", |
| 343 | + "\n", |
| 344 | + "Check out the following notebook to learn more:\n",
359 | 345 | "\n", |
360 | | - "dataset.head()" |
| 346 | + "- [Using Custom Model Configs](./4-custom-model-configs.ipynb)\n" |
361 | 347 | ] |
362 | 348 | } |
363 | 349 | ], |
364 | 350 | "metadata": { |
365 | 351 | "kernelspec": { |
366 | | - "display_name": ".venv", |
| 352 | + "display_name": "sdg_venv", |
367 | 353 | "language": "python", |
368 | 354 | "name": "python3" |
369 | 355 | }, |
|
377 | 363 | "name": "python", |
378 | 364 | "nbconvert_exporter": "python", |
379 | 365 | "pygments_lexer": "ipython3", |
380 | | - "version": "3.11.9" |
| 366 | + "version": "3.12.11" |
381 | 367 | } |
382 | 368 | }, |
383 | 369 | "nbformat": 4, |
|