130 changes: 121 additions & 9 deletions examples/advanced_agents/alert_triage_agent/README.md
@@ -38,6 +38,9 @@ This example demonstrates how to build an intelligent alert triage system using
- [Evaluation](#evaluation)
- [General](#general)
- [Evaluators](#evaluators)
- [Optimization](#optimization)
- [Numeric Optimization](#numeric-optimization)
- [Prompt Optimization](#prompt-optimization)
- [Example Usage](#example-usage)
- [Running in a live environment](#running-in-a-live-environment)
- [Credentials and Access](#credentials-and-access)
Expand Down Expand Up @@ -292,6 +295,102 @@ Each entry under `evaluators` defines a specific metric to evaluate the pipeline

The list of evaluators can be extended or swapped out depending on your evaluation goals.

#### Optimization

An optional `optimizer` section can be found in [`configs/config_offline_optimizer.yml`](src/nat_alert_triage_agent/configs/config_offline_optimizer.yml). It enables automated hyperparameter tuning and prompt optimization to improve the agent's performance. The optimizer uses the evaluation metrics defined in the `eval` section to search for better configurations.

```yaml
optimizer:
  output_path: .tmp/examples/advanced_agents/alert_triage_agent/optimizer/
  reps_per_param_set: 2
  eval_metrics:
    rag_accuracy:
      evaluator_name: rag_accuracy
      direction: maximize
    classification_accuracy:
      evaluator_name: classification_accuracy
      direction: maximize

  numeric:
    enabled: true
    n_trials: 3

  prompt:
    enabled: true
    prompt_population_init_function: prompt_init
    prompt_recombination_function: prompt_recombination
    ga_generations: 3
    ga_population_size: 3
    ga_diversity_lambda: 0.3
    ga_parallel_evaluations: 1
```

* `output_path`: Directory where optimization results, including trial configurations, scores, and best parameters, are saved.
* `reps_per_param_set`: Number of times to evaluate each parameter configuration to account for variability in LLM outputs. Higher values provide more reliable metrics but increase evaluation time.
* `eval_metrics`: Dictionary of metrics to optimize. Each entry includes:
  * `evaluator_name`: Name of the evaluator from the `eval.evaluators` section.
  * `direction`: Either `maximize` or `minimize`, indicating whether higher or lower scores are better.
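
As a rough illustration of how `reps_per_param_set` and `direction` interact, the sketch below averages the repeated scores for each parameter set and then compares the averages according to the chosen direction. It is a minimal sketch for a single metric, not the toolkit's implementation: the function names and the trial labels are hypothetical, and how the optimizer combines several metrics is not covered here.

```python
from statistics import mean


def aggregate_scores(rep_scores: list[float]) -> float:
    """Average the repeated evaluations of one parameter set."""
    return mean(rep_scores)


def is_better(candidate: float, incumbent: float, direction: str) -> bool:
    """Compare two aggregated scores according to the metric direction."""
    if direction == "maximize":
        return candidate > incumbent
    if direction == "minimize":
        return candidate < incumbent
    raise ValueError(f"direction must be 'maximize' or 'minimize', got {direction!r}")


def pick_best(trials: dict[str, list[float]], direction: str) -> str:
    """Return the label of the parameter set with the best averaged score."""
    best_name, best_score = None, None
    for name, rep_scores in trials.items():
        score = aggregate_scores(rep_scores)
        if best_score is None or is_better(score, best_score, direction):
            best_name, best_score = name, score
    return best_name


# Hypothetical scores: two repetitions per parameter set (reps_per_param_set: 2).
example_trials = {"set_a": [0.6, 0.8], "set_b": [0.9, 0.7], "set_c": [0.5, 0.5]}
print(pick_best(example_trials, "maximize"))
```

Averaging over repetitions smooths out run-to-run variation in LLM outputs, which is why a higher `reps_per_param_set` gives steadier comparisons at the cost of more evaluation runs.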

##### Numeric Optimization

The `numeric` section enables automated hyperparameter tuning for numeric parameters such as `temperature`, `top_p`, and `max_tokens`. The optimizer uses Optuna's Bayesian optimization to efficiently search the parameter space.

* `enabled`: Set to `true` to enable numeric parameter optimization.
* `n_trials`: Number of optimization trials to run. Each trial tests a different combination of hyperparameters. More trials allow for better exploration but require more evaluation time.
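
To make the role of `n_trials` concrete, the sketch below runs a fixed number of trials, each scoring one sampled combination of parameters, and keeps the best one. It deliberately uses plain random search from the standard library as a stand-in for Optuna's sampler, so it shows the trial loop rather than the actual search strategy. The search ranges and the `toy_objective` function are assumptions made for illustration only.

```python
import random

# Hypothetical search ranges; the real ranges are defined by the toolkit, not here.
SEARCH_SPACE = {
    "temperature": (0.0, 1.0),
    "top_p": (0.1, 1.0),
}


def sample_params(rng: random.Random) -> dict[str, float]:
    """Draw one random combination of parameters from the search space."""
    return {name: rng.uniform(low, high) for name, (low, high) in SEARCH_SPACE.items()}


def run_trials(objective, n_trials: int, seed: int = 0):
    """Evaluate n_trials sampled combinations and return the best (params, score)."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = sample_params(rng)
        score = objective(params)  # stands in for a full evaluation run
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(params: dict[str, float]) -> float:
    """Toy score that peaks at temperature 0.2 and top_p 0.7."""
    return -((params["temperature"] - 0.2) ** 2 + (params["top_p"] - 0.7) ** 2)


best_params, best_score = run_trials(toy_objective, n_trials=3)
print(best_params, best_score)
```

With only three trials the search covers very little of the space, which is the trade-off the `n_trials` description refers to.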

To mark a numeric parameter as optimizable, add `optimizable_params` to the relevant configuration section. For example:

```yaml
llms:
  ata_agent_llm:
    _type: nim
    model_name: meta/llama-3.3-70b-instruct
    temperature: 0.2
    max_tokens: 2048
    optimizable_params:
      - temperature
      - top_p
      - max_tokens
```
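
One way to check which parameters a configuration exposes to the optimizer is to walk the parsed config and collect every `optimizable_params` entry. The helper below is a hypothetical sketch that works on a plain dictionary (for example, the result of parsing the YAML file); it is not part of the toolkit.

```python
def find_optimizable(config: dict, prefix: str = "") -> list[str]:
    """Return dotted paths for every parameter listed under `optimizable_params`."""
    found = []
    for key, value in config.items():
        if not isinstance(value, dict):
            continue
        path = f"{prefix}.{key}" if prefix else key
        # Parameters marked as optimizable directly in this section.
        for param in value.get("optimizable_params", []):
            found.append(f"{path}.{param}")
        # Recurse into nested sections such as llms.<name> or functions.<name>.
        found.extend(find_optimizable(value, path))
    return found


# Same content as the YAML example above, written as a Python dictionary.
example_config = {
    "llms": {
        "ata_agent_llm": {
            "_type": "nim",
            "model_name": "meta/llama-3.3-70b-instruct",
            "temperature": 0.2,
            "max_tokens": 2048,
            "optimizable_params": ["temperature", "top_p", "max_tokens"],
        }
    }
}

for path in find_optimizable(example_config):
    print(path)
```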

##### Prompt Optimization

The `prompt` section of `optimizer` enables genetic algorithm-based prompt optimization to automatically improve prompt instructions.

* `enabled`: Set to `true` to enable prompt optimization.
* `prompt_population_init_function`: Function name (from the `functions` section) that generates the initial population of prompt variations.
* `prompt_recombination_function`: Function name (from the `functions` section) that combines successful prompts to create new candidates.
* `ga_generations`: Number of generations for the genetic algorithm.
* `ga_population_size`: Size of the population for each generation.
* `ga_diversity_lambda`: Diversity penalty strength to discourage duplicate prompt sets.
* `ga_parallel_evaluations`: Maximum number of concurrent evaluations.

For more detailed explanations of all genetic algorithm configuration options, see the [Optimizer Reference](../../../docs/source/reference/optimizer.md).
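
The sketch below shows, in a simplified form, how the genetic algorithm settings could fit together: an initial population, a fixed number of generations, recombination of the top-ranked prompts, and a diversity penalty that lowers the fitness of duplicated prompts. It is an illustrative toy under stated assumptions, not the toolkit's algorithm. The `toy_init`, `toy_recombine`, and `toy_score` functions are hypothetical stand-ins for the LLM-backed functions and evaluators, the penalty formula is an assumption, and `ga_parallel_evaluations` is not modeled.

```python
import random


def evolve(seed_prompt, init, recombine, score, generations, population_size,
           diversity_lambda, seed=0):
    """Toy genetic loop: rank by penalized fitness, recombine the top two, repeat."""
    rng = random.Random(seed)
    population = init(seed_prompt, population_size)
    best = max(population, key=score)
    for _ in range(generations):
        # Assumed penalty: each extra identical copy costs diversity_lambda.
        fitness = [score(p) - diversity_lambda * (population.count(p) - 1)
                   for p in population]
        ranked = [p for _, p in sorted(zip(fitness, population),
                                       key=lambda pair: pair[0], reverse=True)]
        parents = ranked[:2]
        children = [recombine(parents[0], parents[1], rng)
                    for _ in range(population_size - 1)]
        population = [ranked[0]] + children  # keep the top-ranked prompt
        best = max(population + [best], key=score)  # remember the best raw score
    return best


def toy_init(seed_prompt, n):
    """Hypothetical initializer: append one instruction fragment per variant."""
    hints = ["Be concise", "Cite tool output", "State the root cause"]
    return [f"{seed_prompt} | {hints[i % len(hints)]}" for i in range(n)]


def toy_recombine(a, b, rng):
    """Hypothetical recombination: merge unique fragments, sometimes dropping one."""
    fragments = list(dict.fromkeys(a.split(" | ") + b.split(" | ")))
    rng.shuffle(fragments)
    keep = max(1, len(fragments) - rng.randint(0, 1))
    return " | ".join(fragments[:keep])


TARGETS = ("root cause", "tool output", "concise")


def toy_score(prompt):
    """Hypothetical metric: fraction of target phrases present in the prompt."""
    return sum(target in prompt.lower() for target in TARGETS) / len(TARGETS)


best_prompt = evolve("Triage the alert", toy_init, toy_recombine, toy_score,
                     generations=3, population_size=3, diversity_lambda=0.3)
print(best_prompt)
```

The call mirrors the example values `ga_generations: 3`, `ga_population_size: 3`, and `ga_diversity_lambda: 0.3`.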

To mark a prompt as optimizable, add `optimizable_params` to the relevant configuration section. For example:

```yaml
functions:
  telemetry_metrics_analysis_agent:
    _type: telemetry_metrics_analysis_agent
    tool_names:
      - telemetry_metrics_host_heartbeat_check
      - telemetry_metrics_host_performance_check
    llm_name: telemetry_metrics_analysis_agent_llm
    optimizable_params:
      - prompt

workflow:
  _type: alert_triage_agent
  tool_names:
    - hardware_check
    - host_performance_check
  llm_name: ata_agent_llm
  optimizable_params:
    - agent_prompt
```

Both numeric and prompt optimization can be enabled simultaneously. The optimizer will coordinate both optimization strategies in stages, finding the best overall configuration.

## Example Usage
You can run the agent in [offline mode](#running-in-offline-mode) or [live mode](#running-live-with-a-http-server-listening-for-alerts). Offline mode allows you to evaluate the agent in a controlled, offline environment using synthetic data. Live mode allows you to run the agent in a real environment.
Expand Down Expand Up @@ -429,26 +528,26 @@ To use this mode, first ensure you have configured your live environment as desc
Offline mode lets you evaluate the triage agent in a controlled, offline environment using synthetic data. Instead of calling real systems, the agent uses predefined inputs to simulate alerts and tool outputs, which makes it well suited to development, debugging, and tuning.

To run in offline mode:
#### 1. **Set required environment variables**

Make sure `offline_mode: true` is set in both the `workflow` section and the individual tool sections of your config file (see the [Understanding the configuration](#understanding-the-configuration) section).

#### 2. **How offline mode works:**

- The **main CSV offline dataset** (`offline_data_path`) provides both alert details and a mock environment. For each alert, expected tool return values are included. These simulate how the environment would behave if the alert occurred on a real system.
- The **JSON offline dataset** (`eval.general.dataset.file_path` in the config) contains a subset of the information from the main CSV: the alert inputs and their associated ground truth root causes. It is used to run `nat eval`, focusing only on the essential data needed for running the workflow, while the full CSV retains the complete mock environment context.
- At runtime, the system links each alert in the JSON dataset to its corresponding context in the CSV using the unique host IDs included in both datasets.
- The **benign fallback dataset** fills in tool responses when the agent calls a tool not explicitly defined in the alert's offline data. These fallback responses mimic healthy system behavior and help provide the "background scenery" without obscuring the true root cause.
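
The lookup described above can be sketched as follows: index the CSV rows by host ID, find the row that matches an alert from the JSON dataset, and fall back to the benign response when the row has no value for the requested tool. This is a minimal sketch, not the agent's code. The column names (`host_id`, `hardware_check_output`, and so on) and the sample rows are hypothetical and do not reflect the actual dataset schema.

```python
import csv
import io

# Hypothetical CSV content standing in for the main offline dataset.
CSV_TEXT = """host_id,alert_name,hardware_check_output
host-1,InstanceDown,Power supply failure detected
host-2,HighCPU,
"""

# Hypothetical benign fallback responses, keyed by tool output column.
FALLBACK = {
    "hardware_check_output": "All hardware sensors nominal",
    "network_connectivity_check_output": "Ping OK",
}

# Hypothetical alerts as they might appear in the JSON dataset.
JSON_ALERTS = [{"host_id": "host-1"}, {"host_id": "host-2"}]


def index_by_host(csv_text: str) -> dict[str, dict[str, str]]:
    """Index the CSV rows by host ID so alerts can be matched to their context."""
    return {row["host_id"]: row for row in csv.DictReader(io.StringIO(csv_text))}


def mock_tool_response(alert: dict, tool_column: str, rows_by_host: dict, fallback: dict) -> str:
    """Return the alert-specific tool output, or the benign fallback if none is defined."""
    row = rows_by_host[alert["host_id"]]
    value = row.get(tool_column)
    return value if value else fallback[tool_column]


rows = index_by_host(CSV_TEXT)
for alert in JSON_ALERTS:
    print(alert["host_id"], mock_tool_response(alert, "hardware_check_output", rows, FALLBACK))
```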

#### 3. **Run the agent in offline mode**
##### Single alert run
To run the agent in offline mode with a test alert, use the following command structure. Test alert examples can be found in `examples/advanced_agents/alert_triage_agent/data/offline_data.json`.

```bash
nat run --config_file=examples/advanced_agents/alert_triage_agent/configs/config_offline_mode.yml --input "{your_alert_in_json_format}"
```

**Example:** To run the agent with a test alert, use the following command:

```bash
nat run \
Expand Down Expand Up @@ -515,7 +614,8 @@ To run in offline mode:
2025-07-21 17:14:45,234 - nat_alert_triage_agent - INFO - Cleaning up
```

##### Evaluation of a dataset
To evaluate the agent using the test dataset, use the following command:

```bash
nat eval --config_file=examples/advanced_agents/alert_triage_agent/configs/config_offline_mode.yml
@@ -528,8 +628,20 @@ To run in offline mode:
- Run evaluation for the metrics specified in the config `eval.evaluators`
- Save the pipeline output along with the evaluation results to the path specified by `eval.output_dir`

##### Optimization over a dataset
To optimize the agent over the test dataset, use the following command:

```bash
nat optimize --config_file=examples/advanced_agents/alert_triage_agent/configs/config_offline_optimizer.yml
```

The agent will:
- Load alerts from the JSON dataset specified in the config `eval.general.dataset.file_path`
- Run optimization for the metrics specified in the config `optimizer.eval_metrics`
- Save the optimization results to the path specified by `optimizer.output_path`

#### 4. **Understanding the output**
If you run `nat eval` over a dataset, the output file will be located in the `eval.output_dir` directory and will include a `workflow_output.json` file as part of the evaluation run (alongside other results from each evaluator). This file contains a list of JSON objects, each representing the result for a single data point. Each entry includes the original alert (`question`), the ground truth root cause classification from the dataset (`answer`), the detailed diagnostic report generated by the agentic system (`generated_answer`), and a trace of the agent’s internal reasoning and tool usage (`intermediate_steps`).
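
For a quick look at the results, the sketch below loads a `workflow_output.json` file and prints a compact summary of each entry using the four fields described above. It is a minimal sketch: the helper names are hypothetical, and it assumes `intermediate_steps` is a list, which may not match the actual trace format.

```python
import json
from pathlib import Path


def load_results(path: str) -> list[dict]:
    """Load the list of per-alert results from a workflow_output.json file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize(entries: list[dict]) -> list[dict]:
    """Reduce each entry to the alert, the expected label, and simple size counts."""
    return [
        {
            "question": entry["question"],
            "expected": entry["answer"],
            "report_chars": len(entry["generated_answer"]),
            # Assumption: the trace is stored as a list of steps.
            "steps": len(entry.get("intermediate_steps", [])),
        }
        for entry in entries
    ]


# Usage (the path depends on `eval.output_dir` in your config):
# for row in summarize(load_results("<eval.output_dir>/workflow_output.json")):
#     print(row)
```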

**Sample Workflow Result**
```
@@ -0,0 +1,189 @@
# SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


general:
  telemetry:
    tracing:
      weave:
        _type: weave
        project: "nat-ata-optimize"

functions:
  hardware_check:
    _type: hardware_check
    llm_name: tool_reasoning_llm
    offline_mode: true
  host_performance_check:
    _type: host_performance_check
    llm_name: tool_reasoning_llm
    offline_mode: true
  monitoring_process_check:
    _type: monitoring_process_check
    llm_name: tool_reasoning_llm
    offline_mode: true
  network_connectivity_check:
    _type: network_connectivity_check
    llm_name: tool_reasoning_llm
    offline_mode: true
  telemetry_metrics_host_heartbeat_check:
    _type: telemetry_metrics_host_heartbeat_check
    llm_name: tool_reasoning_llm
    offline_mode: true
    metrics_url: http://your-monitoring-server:9090 # Replace with your monitoring system URL if running in live mode
  telemetry_metrics_host_performance_check:
    _type: telemetry_metrics_host_performance_check
    llm_name: tool_reasoning_llm
    offline_mode: true
    metrics_url: http://your-monitoring-server:9090 # Replace with your monitoring system URL if running in live mode
  telemetry_metrics_analysis_agent:
    _type: telemetry_metrics_analysis_agent
    tool_names:
      - telemetry_metrics_host_heartbeat_check
      - telemetry_metrics_host_performance_check
    llm_name: telemetry_metrics_analysis_agent_llm
    optimizable_params:
      - prompt
  maintenance_check:
    _type: maintenance_check
    llm_name: maintenance_check_llm
    static_data_path: examples/advanced_agents/alert_triage_agent/data/maintenance_static_dataset.csv
  categorizer:
    _type: categorizer
    llm_name: categorizer_llm
  prompt_init:
    _type: prompt_init
    optimizer_llm: optimizer_llm
    system_objective: The alert triage agent autonomously investigates infrastructure monitoring alerts, performs root cause analysis, and generates structured diagnostic reports by dynamically selecting and orchestrating diagnostic tools including IPMI hardware checks, network connectivity tests, host performance monitoring, process status verification, and telemetry analysis, then correlating multi-source data through LLM-powered reasoning to classify issues into predefined categories (hardware, software, network, false positive, or requiring investigation), helping security analysts reduce manual triage workload, accelerate incident response times, and maintain consistent investigation quality through standardized evidence collection and automated documentation of findings and recommended remediation actions.
  prompt_recombination:
    _type: prompt_recombiner
    optimizer_llm: optimizer_llm
    system_objective: The alert triage agent autonomously investigates infrastructure monitoring alerts, performs root cause analysis, and generates structured diagnostic reports by dynamically selecting and orchestrating diagnostic tools including IPMI hardware checks, network connectivity tests, host performance monitoring, process status verification, and telemetry analysis, then correlating multi-source data through LLM-powered reasoning to classify issues into predefined categories (hardware, software, network, false positive, or requiring investigation), helping security analysts reduce manual triage workload, accelerate incident response times, and maintain consistent investigation quality through standardized evidence collection and automated documentation of findings and recommended remediation actions.

workflow:
  _type: alert_triage_agent
  tool_names:
    - hardware_check
    - host_performance_check
    - monitoring_process_check
    - network_connectivity_check
    - telemetry_metrics_analysis_agent
  llm_name: ata_agent_llm
  offline_mode: true
  # The below paths are only used if offline_mode is true
  offline_data_path: examples/advanced_agents/alert_triage_agent/data/offline_data.csv
  benign_fallback_data_path: examples/advanced_agents/alert_triage_agent/data/benign_fallback_offline_data.json
  optimizable_params:
    - agent_prompt

llms:
  ata_agent_llm:
    _type: nim
    model_name: meta/llama-3.3-70b-instruct
    temperature: 0.2
    max_tokens: 2048
    optimizable_params:
      - temperature
      - top_p
      - max_tokens

  tool_reasoning_llm:
    _type: nim
    model_name: meta/llama-3.3-70b-instruct
    temperature: 0.2
    top_p: 0.7
    max_tokens: 2048
    optimizable_params:
      - temperature
      - top_p
      - max_tokens

  telemetry_metrics_analysis_agent_llm:
    _type: nim
    model_name: meta/llama-3.3-70b-instruct
    temperature: 0
    max_tokens: 2048
    optimizable_params:
      - temperature
      - top_p
      - max_tokens

  maintenance_check_llm:
    _type: nim
    model_name: meta/llama-3.3-70b-instruct
    temperature: 0
    max_tokens: 2048

  categorizer_llm:
    _type: nim
    model_name: meta/llama-3.3-70b-instruct
    temperature: 0
    max_tokens: 2048
    optimizable_params:
      - temperature
      - top_p
      - max_tokens

  nim_rag_eval_llm:
    _type: nim
    model_name: meta/llama-3.3-70b-instruct
    max_tokens: 8

  optimizer_llm:
    _type: nim
    model_name: meta/llama-3.3-70b-instruct
    temperature: 0.5
    max_tokens: 2048


eval:
  general:
    output_dir: .tmp/examples/advanced_agents/alert_triage_agent/original/
    workflow_alias: alert_triage_agent_optimizer
    dataset:
      _type: json
      file_path: examples/advanced_agents/alert_triage_agent/data/offline_data.json

  evaluators:
    rag_accuracy:
      _type: ragas
      metric: AnswerAccuracy
      llm_name: nim_rag_eval_llm
    classification_accuracy:
      _type: classification_accuracy

optimizer:
  output_path: .tmp/examples/advanced_agents/alert_triage_agent/optimizer/
  reps_per_param_set: 2
  eval_metrics:
    rag_accuracy:
      evaluator_name: rag_accuracy
      direction: maximize
    classification_accuracy:
      evaluator_name: classification_accuracy
      direction: maximize

  numeric:
    enabled: true
    n_trials: 3

  prompt:
    enabled: true
    prompt_population_init_function: prompt_init
    prompt_recombination_function: prompt_recombination
    ga_generations: 3
    ga_population_size: 3
    ga_diversity_lambda: 0.3
    ga_parallel_evaluations: 1