Commit f16f057

Update readme with evaluation details

1 file changed: README.md (+64 −0 lines changed)

@@ -121,3 +121,67 @@ uv run streamlit run src/app.py --server.port 8501 --server.address 127.0.0.1

Refer to the **agentcore_runtime_deployment.ipynb** notebook to deploy your agent using [Bedrock AgentCore Runtime](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/agents-tools-runtime.html).

## Evaluation

The platform includes an offline evaluation suite for assessing agent performance across several dimensions: tool selection accuracy, retrieval quality, response quality, and latency.

### Evaluation Setup

The evaluation system consists of:

- **offline_evaluation.py**: Main evaluation script that runs the test queries and calculates metrics (see the sketch below)
- **response_quality_evaluator.py**: Uses a Bedrock LLM to evaluate response quality
- **groundtruth.json**: Test queries with expected tool usage (create this file with your test cases)
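
For orientation, here is a minimal sketch of the kind of loop **offline_evaluation.py** implements. The endpoint URL, request/response fields (`prompt`, `tools_used`), and helper names are illustrative assumptions, not the script's actual interface:

```python
# Hypothetical sketch of an offline evaluation loop -- not the actual
# offline_evaluation.py. Assumes the agent is reachable over HTTP and
# returns JSON containing an answer plus the list of tools it invoked.
import json
import os

import requests

AGENT_URL = os.environ.get("AGENT_ARN", "http://localhost:8080")  # see Configuration below


def run_offline_evaluation(groundtruth_path: str = "groundtruth.json") -> None:
    with open(groundtruth_path) as f:
        test_cases = json.load(f)

    results = []
    for case in test_cases:
        # Send the test query to the agent endpoint (request schema is assumed).
        resp = requests.post(AGENT_URL, json={"prompt": case["query"]}, timeout=120)

        # Tool accuracy: did the agent call exactly the tools we expected?
        tools_used = set(resp.json().get("tools_used", [])) if resp.ok else set()
        results.append({
            "query": case["query"],
            "success": resp.ok,
            "tool_match": tools_used == set(case["expected_tools"]),
        })

    success_rate = sum(r["success"] for r in results) / len(results)
    tool_accuracy = sum(r["tool_match"] for r in results) / len(results)
    print(f"Success rate: {success_rate:.2%}, tool accuracy: {tool_accuracy:.2%}")


if __name__ == "__main__":
    run_offline_evaluation()
```
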
### Prerequisites

1. **Langfuse Configuration**: Ensure Langfuse is properly configured for trace collection
2. **Agent Endpoint**: Have your agent running locally or deployed
3. **AWS Credentials**: For Bedrock access (response quality evaluation)
4. **Test Data**: Create `groundtruth.json` with test queries:

```json
[
  {
    "query": "How do I reset my router hub?",
    "expected_tools": ["retrieve_context"]
  }
]
```
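
Each entry pairs a user query with the tools the agent is expected to call. A quick sanity check of the file, based only on the format shown above (a hypothetical helper, not part of the repo):

```python
# Minimal sanity check for groundtruth.json, using the schema shown above.
import json

with open("groundtruth.json") as f:
    cases = json.load(f)

for i, case in enumerate(cases):
    assert "query" in case and "expected_tools" in case, f"case {i} missing a required key"
print(f"{len(cases)} test cases look well-formed")
```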

### Running Evaluation

```bash
# Run offline evaluation
python offline_evaluation.py

# Or evaluate existing trace data
python response_quality_evaluator.py
```

### Metrics Collected

- **Success Rate**: Percentage of successful agent responses
- **Tool Accuracy**: How well the agent selects the expected tools
- **Retrieval Quality**: Relevance scores from knowledge base retrieval
- **Response Quality**: AI-evaluated metrics scored by a Bedrock LLM (see the sketch after this list):
  - **Faithfulness** (0.0-1.0): How well the response sticks to the provided context without hallucination
  - **Correctness** (0.0-1.0): How factually accurate and technically correct the response is
  - **Helpfulness** (0.0-1.0): How useful and relevant the response is to the user's query
- **Latency Metrics**: Total and per-tool response times
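
A sketch of how the LLM-as-judge scoring might look, in the spirit of **response_quality_evaluator.py**; the judge prompt, model ID, and output parsing are illustrative assumptions:

```python
# Hypothetical LLM-as-judge scorer using the Bedrock Converse API.
# The prompt, model ID, and expected JSON reply are assumptions.
import json

import boto3

bedrock = boto3.client("bedrock-runtime")

JUDGE_PROMPT = """Rate the response for faithfulness, correctness, and helpfulness,
each on a 0.0-1.0 scale. Reply with JSON only, e.g.
{{"faithfulness": 0.9, "correctness": 0.8, "helpfulness": 0.7}}

Context: {context}
Query: {query}
Response: {response}"""


def score_response(query: str, context: str, response: str) -> dict:
    result = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # assumed judge model
        messages=[{
            "role": "user",
            "content": [{"text": JUDGE_PROMPT.format(
                context=context, query=query, response=response)}],
        }],
    )
    # The judge is instructed to emit JSON; parse the scores from its reply.
    return json.loads(result["output"]["message"]["content"][0]["text"])
```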
### Output Files

- **comprehensive_results.csv**: Complete evaluation results with all metrics
- **trace_metrics.csv**: Raw trace data from Langfuse
- **response_quality_scores.csv**: Detailed response quality evaluations
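
To eyeball the results after a run, something like the following works; the exact columns depend on your run, so check the CSV header it actually produces:

```python
# Quick inspection of the evaluation output.
import pandas as pd

df = pd.read_csv("comprehensive_results.csv")
print(df.head())      # first few evaluated queries
print(df.describe())  # summary statistics for the numeric metric columns
```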
### Configuration

Set environment variables:

```bash
export AGENT_ARN="http://localhost:8080"  # or your deployed endpoint
export LANGFUSE_SECRET_KEY="your-key"
export LANGFUSE_PUBLIC_KEY="your-key"
export LANGFUSE_HOST="your-langfuse-host"
```
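
The Langfuse SDK reads the `LANGFUSE_*` variables straight from the environment, so a quick way to confirm the credentials (assuming the `langfuse` Python package is installed):

```python
# Optional sanity check: a bare client picks up LANGFUSE_SECRET_KEY,
# LANGFUSE_PUBLIC_KEY, and LANGFUSE_HOST from the environment.
from langfuse import Langfuse

client = Langfuse()
print(client.auth_check())  # True if the credentials and host are valid
```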
