Refer to the **agentcore_runtime_deployment.ipynb** notebook to deploy your agent using [Bedrock AgentCore Runtime](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/agents-tools-runtime.html).
## Evaluation
The platform includes comprehensive evaluation capabilities to assess agent performance across multiple dimensions.
### Evaluation Setup
The evaluation system consists of:
- **offline_evaluation.py**: Main evaluation script that runs test queries and calculates metrics
- **response_quality_evaluator.py**: Uses a Bedrock LLM to evaluate response quality
- **groundtruth.json**: Test queries with expected tool usage (create this file with your test cases)
### Prerequisites
1. **Langfuse Configuration**: Ensure Langfuse is properly configured for trace collection
2. **Agent Endpoint**: Have your agent running locally or deployed
3. **AWS Credentials**: For Bedrock access (response quality evaluation)
4. **Test Data**: Create `groundtruth.json` with test queries:
```json
[
  {
    "query": "How do I reset my router hub?",
    "expected_tools": ["retrieve_context"]
  }
]
```
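
As a quick sanity check before running the evaluation, you can verify that every entry in `groundtruth.json` has the fields shown above. This is a minimal, illustrative snippet, not part of the shipped scripts:

```python
import json

# Load the ground-truth test cases in the format shown above.
with open("groundtruth.json") as f:
    test_cases = json.load(f)

# Minimal schema check: each entry needs a non-empty query and a list of expected tools.
for i, case in enumerate(test_cases):
    assert isinstance(case.get("query"), str) and case["query"], f"entry {i}: missing 'query'"
    assert isinstance(case.get("expected_tools"), list), f"entry {i}: 'expected_tools' must be a list"

print(f"Loaded {len(test_cases)} test case(s)")
```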
### Running Evaluation
```bash
# Run offline evaluation
python offline_evaluation.py

# Or evaluate existing trace data
python response_quality_evaluator.py
```
### Metrics Collected
- **Success Rate**: Percentage of successful agent responses
- **Tool Accuracy**: How well the agent selects the expected tools (a simple scoring sketch follows this list)
- **Retrieval Quality**: Relevance scores from knowledge base retrieval
- **Response Quality**: AI-evaluated metrics using a Bedrock LLM:
  - **Faithfulness** (0.0-1.0): How well the response sticks to the provided context without hallucination
  - **Correctness** (0.0-1.0): How factually accurate and technically correct the response is
  - **Helpfulness** (0.0-1.0): How useful and relevant the response is to answering the user's query
- **Latency Metrics**: Total and per-tool response times
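
As a rough illustration of the tool-accuracy idea, one simple scoring rule is the fraction of expected tools the agent actually invoked. This is a sketch only; `offline_evaluation.py` may compute the metric differently:

```python
def tool_accuracy(expected_tools: list[str], used_tools: list[str]) -> float:
    """Fraction of expected tools the agent actually called (illustrative only)."""
    if not expected_tools:
        # Nothing was expected: count it as correct only if no tools were used.
        return 1.0 if not used_tools else 0.0
    hits = sum(1 for tool in expected_tools if tool in used_tools)
    return hits / len(expected_tools)


# Example based on the ground-truth entry above.
print(tool_accuracy(["retrieve_context"], ["retrieve_context"]))  # 1.0
```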
### Output Files
- **comprehensive_results.csv**: Complete evaluation results with all metrics
- **trace_metrics.csv**: Raw trace data from Langfuse
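
Both files are plain CSV, so they can be opened in a spreadsheet or inspected quickly with pandas (assuming it is installed; the exact column names depend on the evaluation run):

```python
import pandas as pd

# Load the complete evaluation results written by offline_evaluation.py.
results = pd.read_csv("comprehensive_results.csv")

# Show which columns were produced and summarize the numeric metrics.
print(results.columns.tolist())
print(results.describe())
```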