
Commit 139849d

added refactored intro tutorial on seeding to work with public HF repo
1 parent 1183382 commit 139849d

File tree

1 file changed: +92 -106 lines changed

nemo/NeMo-Data-Designer/intro-tutorials/3-seeding-with-a-dataset.ipynb

Lines changed: 92 additions & 106 deletions
@@ -4,7 +4,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# 🎨 NeMo Data Designer 101: Seeding synthetic data generation with an external dataset\n",
+    "# 🎨 NeMo Data Designer 101: Seeding Synthetic Data Generation with an External Dataset\n",
     "\n",
     "> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.\n",
     ">\n",
@@ -14,18 +14,16 @@
     "\n",
     "In this notebook, we will demonstrate how to seed synthetic data generation in Data Designer with an external dataset.\n",
     "\n",
-    "If this is your first time using Data Designer, we recommend starting with the [first notebook](./1-the-basics.ipynb) in this 101 series.\n",
-    "\n",
-    "#### 💾 Install dependencies\n",
-    "\n",
-    "**IMPORTANT** 👉 If you haven't already, follow the instructions in the [README](../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment.\n"
+    "If this is your first time using Data Designer, we recommend starting with the [first notebook](./1-the-basics.ipynb) in this 101 series."
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "If the installation worked, you should be able to make the following imports:\n"
+    "#### 💾 Install dependencies\n",
+    "\n",
+    "**IMPORTANT** 👉 If you haven't already, follow the instructions in the [README](../README.md) to install the necessary dependencies. Note you may need to restart your kernel after setting up the environment.\n"
    ]
   },
   {
@@ -40,7 +38,10 @@
     "from nemo_microservices.beta.data_designer import (\n",
     "    DataDesignerConfigBuilder,\n",
     "    DataDesignerClient,\n",
-    ")"
+    ")\n",
+    "\n",
+    "from nemo_microservices.beta.data_designer.config import columns as C\n",
+    "from nemo_microservices.beta.data_designer.config import params as P"
    ]
   },
   {
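Assembled from this hunk and the surrounding cells, the notebook's imports after this commit look roughly like the sketch below. The `getpass` and `NeMoMicroservices` imports are not shown in this hunk; treating `NeMoMicroservices` as a top-level export of `nemo_microservices` is an assumption based on how it is called later in the diff.

```python
# Sketch of the post-commit imports, assembled from the diff.
from getpass import getpass  # used by the API-key and HF-token cells

from nemo_microservices import NeMoMicroservices  # assumed top-level export
from nemo_microservices.beta.data_designer import (
    DataDesignerConfigBuilder,
    DataDesignerClient,
)
from nemo_microservices.beta.data_designer.config import columns as C
from nemo_microservices.beta.data_designer.config import params as P
```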
@@ -49,7 +50,28 @@
    "source": [
     "### ⚙️ Initialize the NeMo Data Designer (NDD) Client\n",
     "\n",
-    "- The NDD client is responsible for submitting generation requests to the Data Designer microservice.\n"
+    "- The NDD client is responsible for submitting generation requests to the Data Designer microservice.\n",
+    "- In this notebook, we connect to the [managed service of data designer](https://build.nvidia.com/nemo/data-designer). Alternatively, you can connect to your own instance of data designer by following the deployment instructions [here](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).\n",
+    "- If you have an instance of data designer running locally, you can connect to it as follows:\n",
+    "\n",
+    "  ```python\n",
+    "  data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8080\"))\n",
+    "  ```\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# if using the managed service of data designer, provide the api key here\n",
+    "api_key = getpass(\"Enter data designer API key: \")\n",
+    "\n",
+    "if len(api_key) > 0:\n",
+    "    print(\"✅ API key received.\")\n",
+    "else:\n",
+    "    print(\"❌ No API key provided. Please enter your Data Designer API key.\")"
    ]
   },
   {
@@ -58,7 +80,12 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "ndd = DataDesignerClient(client=NeMoMicroservices(base_url=\"http://localhost:8000\"))"
+    "data_designer_client = DataDesignerClient(\n",
+    "    client=NeMoMicroservices(\n",
+    "        base_url=\"https://ai.api.nvidia.com/v1/stg/nemo/dd\",\n",
+    "        default_headers={\"Authorization\": f\"Bearer {api_key}\"}  # auto-generated API KEY\n",
+    "    )\n",
+    ")"
    ]
   },
   {
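Read together with the API-key cell added above, the connection flow after this commit is roughly the following sketch; the staging base URL and bearer header are copied from the diff, and a self-hosted instance would swap in the localhost URL from the markdown cell.

```python
from getpass import getpass

from nemo_microservices import NeMoMicroservices  # assumed top-level export
from nemo_microservices.beta.data_designer import DataDesignerClient

# Prompt for the managed-service key rather than hard-coding it.
api_key = getpass("Enter data designer API key: ")

# Point the client at the managed Data Designer endpoint; for a local
# deployment this would be base_url="http://localhost:8080" instead.
data_designer_client = DataDesignerClient(
    client=NeMoMicroservices(
        base_url="https://ai.api.nvidia.com/v1/stg/nemo/dd",
        default_headers={"Authorization": f"Bearer {api_key}"},
    )
)
```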
@@ -83,18 +110,8 @@
    "outputs": [],
    "source": [
     "# build.nvidia.com model endpoint\n",
-    "endpoint = \"https://integrate.api.nvidia.com/v1\"\n",
-    "model_id = \"mistralai/mistral-small-24b-instruct\"\n",
-    "\n",
-    "model_alias = \"mistral-small\"\n",
-    "\n",
-    "# You will need to enter your model provider API key to run this notebook.\n",
-    "api_key = getpass(\"Enter model provider API key: \")\n",
-    "\n",
-    "if len(api_key) > 0:\n",
-    "    print(\"✅ API key received.\")\n",
-    "else:\n",
-    "    print(\"❌ No API key provided. Please enter your model provider API key.\")"
+    "model_id = \"nvidia/nvidia-nemotron-nano-9b-v2\"\n",
+    "model_alias = \"nemotron-nano-v2\""
    ]
   },
   {
@@ -103,23 +120,22 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# You can also load the model configs from a YAML string or file.\n",
-    "\n",
-    "model_configs_yaml = f\"\"\"\\\n",
-    "model_configs:\n",
-    "  - alias: \"{model_alias}\"\n",
-    "    inference_parameters:\n",
-    "      max_tokens: 1024\n",
-    "      temperature: 0.5\n",
-    "      top_p: 1.0\n",
-    "    model:\n",
-    "      api_endpoint:\n",
-    "        api_key: \"{api_key}\"\n",
-    "        model_id: \"{model_id}\"\n",
-    "        url: \"{endpoint}\"\n",
-    "\"\"\"\n",
-    "\n",
-    "config_builder = DataDesignerConfigBuilder(model_configs=model_configs_yaml)"
+    "config_builder = DataDesignerConfigBuilder(\n",
+    "    model_configs=[\n",
+    "        P.ModelConfig(\n",
+    "            alias=model_alias,\n",
+    "            provider=\"nvidiabuild\",\n",
+    "            model=model_id,\n",
+    "            inference_parameters=P.InferenceParameters(\n",
+    "                max_tokens=1024,\n",
+    "                temperature=0.5,\n",
+    "                top_p=1.0,\n",
+    "                timeout=120\n",
+    "            ),\n",
+    "            is_reasoner=True\n",
+    "        ),\n",
+    "    ]\n",
+    ")"
    ]
   },
   {
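The YAML string is replaced by a programmatic config. Inlining the `model_id` and `model_alias` values from the previous cell gives the consolidated sketch below; `provider="nvidiabuild"` and `is_reasoner=True` are fields new in this commit, and reading `provider` as selecting build.nvidia.com hosting is an assumption.

```python
from nemo_microservices.beta.data_designer import DataDesignerConfigBuilder
from nemo_microservices.beta.data_designer.config import params as P

# Consolidated sketch of the new programmatic model config; the endpoint
# URL and API key are no longer spelled out here (the provider field is
# assumed to route requests to build.nvidia.com-hosted models).
config_builder = DataDesignerConfigBuilder(
    model_configs=[
        P.ModelConfig(
            alias="nemotron-nano-v2",  # referenced by LLM columns downstream
            provider="nvidiabuild",
            model="nvidia/nvidia-nemotron-nano-9b-v2",
            inference_parameters=P.InferenceParameters(
                max_tokens=1024,
                temperature=0.5,
                top_p=1.0,
                timeout=120,
            ),
            is_reasoner=True,  # flags the model as a reasoning model
        ),
    ]
)
```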
@@ -130,7 +146,9 @@
     "\n",
     "- For this notebook, we'll change gears and create a synthetic dataset of patient notes.\n",
     "\n",
-    "- To steer the generation process, we will use an open-source [symptom-to-diagnosis dataset](https://huggingface.co/datasets/gretelai/symptom_to_diagnosis).\n"
+    "- To steer the generation process, we will use an open-source [symptom-to-diagnosis dataset](https://huggingface.co/datasets/gretelai/symptom_to_diagnosis).\n",
+    "\n",
+    "- In this dataset, the `input_text` column represents the `patient_summary` and the `output_text` column represents the `diagnosis`.\n"
    ]
   },
   {
@@ -139,34 +157,13 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "from datasets import load_dataset\n",
-    "\n",
-    "df_seed = load_dataset(\"gretelai/symptom_to_diagnosis\")[\"train\"].to_pandas()\n",
-    "\n",
-    "# Rename the columns to something more descriptive.\n",
-    "df_seed = df_seed.rename(\n",
-    "    columns={\"output_text\": \"diagnosis\", \"input_text\": \"patient_summary\"}\n",
-    ")\n",
-    "\n",
-    "print(f\"Number of records: {len(df_seed)}\")\n",
-    "\n",
-    "# Save the file so we can upload it to the microservice.\n",
-    "df_seed.to_csv(\"symptom_to_diagnosis.csv\", index=False)\n",
-    "\n",
-    "df_seed.head()"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## 🎨 Designing our synthetic patient notes dataset\n",
-    "\n",
-    "- We set the seed dataset using the `with_seed_dataset` method.\n",
+    "# provide your Hugging Face token here\n",
+    "hf_token = getpass(\"Enter Huggingface Token here: \")\n",
     "\n",
-    "- We use the `shuffle` sampling strategy, which shuffles the seed dataset before sampling.\n",
-    "\n",
-    "- We set `with_replacement=False`, which limits our max number of records to 853, which is the number of records in the seed dataset.\n"
+    "if len(hf_token) > 0:\n",
+    "    print(\"✅ Huggingface Token received.\")\n",
+    "else:\n",
+    "    print(\"❌ No Huggingface Token provided. Please enter your Huggingface Token.\")"
    ]
   },
   {
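The removed cell used to download, rename, and re-export the dataset locally. Since the refactor reads it straight from the Hub with its raw column names, a quick local inspection (essentially the removed cell minus the renaming and CSV export) can still confirm the columns the prompt templates reference:

```python
from datasets import load_dataset

# Peek at the public seed dataset; the refactored notebook keeps the raw
# column names (input_text, output_text) instead of renaming them.
df_seed = load_dataset("gretelai/symptom_to_diagnosis")["train"].to_pandas()

print(f"Number of records: {len(df_seed)}")  # 853 per the notebook text
print(df_seed.columns.tolist())              # expect input_text and output_text
df_seed.head()
```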
@@ -180,15 +177,27 @@
     "# to the datastore. Note we need to pass in the datastore's endpoint, which\n",
     "# must match the endpoint in the docker-compose file.\n",
     "config_builder.with_seed_dataset(\n",
-    "    repo_id=\"into-tutorials/seeding-with-a-dataset\",\n",
-    "    filename=\"symptom_to_diagnosis.csv\",\n",
-    "    dataset_path=\"./symptom_to_diagnosis.csv\",\n",
+    "    repo_id=\"gretelai/symptom_to_diagnosis\",\n",
+    "    filename=\"train.jsonl\",\n",
     "    sampling_strategy=\"shuffle\",\n",
     "    with_replacement=False,\n",
-    "    datastore={\"endpoint\": \"http://localhost:3000/v1/hf\"},\n",
+    "    datastore={\"endpoint\": \"https://huggingface.co\", \"token\": hf_token}\n",
     ")"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 🎨 Designing our synthetic patient notes dataset\n",
+    "\n",
+    "- We set the seed dataset using the `with_seed_dataset` method.\n",
+    "\n",
+    "- We use the `shuffle` sampling strategy, which shuffles the seed dataset before sampling.\n",
+    "\n",
+    "- We set `with_replacement=False`, which limits our max number of records to 853, which is the number of records in the seed dataset.\n"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -270,10 +279,10 @@
     "    name=\"physician_notes\",\n",
     "    prompt=\"\"\"\\\n",
     "You are a primary-care physician who just had an appointment with {{ first_name }} {{ last_name }},\n",
-    "who has been struggling with symptoms from {{ diagnosis }} since {{ symptom_onset_date }}.\n",
+    "who has been struggling with symptoms from {{ output_text }} since {{ symptom_onset_date }}.\n",
     "The date of today's visit is {{ date_of_visit }}.\n",
     "\n",
-    "{{ patient_summary }}\n",
+    "{{ input_text }}\n",
     "\n",
     "Write careful notes about your visit with {{ first_name }},\n",
     "as Dr. {{ doctor_sampler.first_name }} {{ doctor_sampler.last_name }}.\n",
303312
"metadata": {},
304313
"outputs": [],
305314
"source": [
306-
"preview = ndd.preview(config_builder, verbose_logging=True)"
307-
]
308-
},
309-
{
310-
"cell_type": "code",
311-
"execution_count": null,
312-
"metadata": {},
313-
"outputs": [],
314-
"source": [
315-
"# The preview dataset is available as a pandas DataFrame.\n",
316-
"preview.dataset"
315+
"preview = data_designer_client.preview(config_builder, verbose_logging=True)"
317316
]
318317
},
319318
{
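With the client renamed from `ndd` to `data_designer_client`, the preview flow spelled out across this and the neighboring cells reads as follows (a sketch combining cells shown above and below; `data_designer_client` and `config_builder` come from the earlier cells):

```python
# Generate a quick preview, print one formatted record, then inspect
# the whole preview as a pandas DataFrame.
preview = data_designer_client.preview(config_builder, verbose_logging=True)
preview.display_sample_record()
preview.dataset.head()
```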
@@ -326,44 +325,31 @@
     "preview.display_sample_record()"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## 🧬 Generate your dataset\n",
-    "\n",
-    "- Once you are happy with the preview, scale up to a larger dataset.\n",
-    "\n",
-    "- The `create` method will submit your generation job to the microservice and return a results object.\n",
-    "\n",
-    "- If you want to wait for the job to complete, set `wait_until_done=True`.\n"
-   ]
-  },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
-    "results = ndd.create(config_builder, num_records=20, wait_until_done=True)"
+    "# The preview dataset is available as a pandas DataFrame.\n",
+    "preview.dataset"
    ]
   },
   {
-   "cell_type": "code",
-   "execution_count": null,
+   "cell_type": "markdown",
    "metadata": {},
-   "outputs": [],
    "source": [
-    "# load the dataset into a pandas DataFrame\n",
-    "dataset = results.load_dataset()\n",
+    "## ⏭️ Next Steps\n",
+    "\n",
+    "Check out the following notebooks to learn more about:\n",
     "\n",
-    "dataset.head()"
+    "- [Using Custom Model Configs](./4-custom-model-configs.ipynb)\n"
    ]
   }
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": ".venv",
+   "display_name": "sdg_venv",
    "language": "python",
    "name": "python3"
   },
@@ -377,7 +363,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.11.9"
+   "version": "3.12.11"
   }
  },
 "nbformat": 4,
