Commit d160709

feat: agent testing framework + parameter country filter (#51)
* feat: add tax_benefit_model_name filter to parameters endpoint

  Reduces agent turns for parameter lookups by allowing country filtering.
  Updated system prompt with parameter search tips.

* fix: remove specific parameter hints from system prompt (was test hacking)

  Reorganised test categories:
  - Parameter lookups are now separate from household calcs
  - Economy-wide tests are actual budgetary/distributional analyses

* fix: use local API by default for tests
* feat: add search and country filters to variables and datasets endpoints
* docs: update AGENT_TESTING with baseline measurements
* fix: deduplicate parameters by name in seed script
* feat: add current filter to parameter-values endpoint

  Allows agent to get just the current value with current=true

* docs: document all API improvements for agent efficiency
* feat: add db-reset-local, db-reseed-local, db-reseed-prod make targets
* feat: improve agent turn efficiency (10 turns → 3 turns)

  Key improvements:
  - Fix model name in system prompt (policyengine-uk with hyphen)
  - Add case-insensitive search using ILIKE for parameters and variables
  - Update API docstrings with correct model names

  Agent can now find UK personal allowance in 3 turns vs 10 baseline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
1 parent dda5ebb commit d160709

10 files changed: +565 -45 lines changed

Makefile

Lines changed: 36 additions & 7 deletions
@@ -1,4 +1,4 @@
-.PHONY: install dev format lint test integration-test clean seed up down logs start-supabase stop-supabase reset rebuild create-state-bucket deploy-local init db-reset-prod modal-deploy modal-serve docs
+.PHONY: install dev format lint test integration-test clean seed seed-full up down logs start-supabase stop-supabase rebuild create-state-bucket deploy-local init db-reset-local db-reseed-local db-reset-prod db-reseed-prod modal-deploy modal-serve docs
 
 # AWS Configuration
 AWS_REGION ?= us-east-1
@@ -25,8 +25,8 @@ integration-test:
 	@supabase start || true
 	@echo "2. Initialising database..."
 	@echo "yes" | uv run python scripts/init.py
-	@echo "3. Running seed script..."
-	@uv run python scripts/seed.py
+	@echo "3. Running seed script (lite mode)..."
+	@uv run python scripts/seed.py --lite
 	@echo "4. Running integration tests..."
 	@pytest tests/test_integration.py -v --tb=short
 	@echo "✓ Integration tests complete!"
@@ -40,9 +40,18 @@ clean:
 	find . -type f -name "*.pyc" -delete
 	find . -type d -name "*.egg-info" -exec rm -rf {} + 2>/dev/null || true
 
-reset:
-	@echo "Resetting Supabase database..."
-	supabase db reset
+db-reset-local:
+	@echo "Resetting and reseeding LOCAL database..."
+	@echo "1. Initialising database (drops and recreates tables)..."
+	@echo "yes" | uv run python scripts/init.py
+	@echo "2. Seeding data (lite mode)..."
+	@uv run python scripts/seed.py --lite
+	@echo "✓ Local database reset and seeded!"
+
+db-reseed-local:
+	@echo "Reseeding LOCAL database (lite mode, keeps existing tables)..."
+	@uv run python scripts/seed.py --lite
+	@echo "✓ Local database reseeded!"
 
 rebuild:
 	@echo "Rebuilding Docker containers..."
@@ -52,7 +61,11 @@ rebuild:
 	@echo "✓ Rebuild complete!"
 
 seed:
-	@echo "Seeding database with UK and US models..."
+	@echo "Seeding database with UK and US models (lite mode)..."
+	uv run python scripts/seed.py --lite
+
+seed-full:
+	@echo "Seeding database with UK and US models (full)..."
 	uv run python scripts/seed.py
 
 start-supabase:
@@ -113,6 +126,22 @@ db-reset-prod:
 		exit 1; \
 	fi
 
+db-reseed-prod:
+	@echo "⚠️ WARNING: This will reseed the PRODUCTION database ⚠️"
+	@echo "This will add/update models, parameters, and datasets."
+	@echo "Existing data will be preserved where possible."
+	@echo ""
+	@read -p "Are you sure you want to continue? Type 'yes' to confirm: " -r CONFIRM; \
+	echo; \
+	if [ "$$CONFIRM" = "yes" ]; then \
+		echo "Reseeding production database..."; \
+		set -a && . .env.prod && set +a && \
+		uv run python scripts/seed.py; \
+	else \
+		echo "Aborted."; \
+		exit 1; \
+	fi
+
 modal-deploy:
 	@echo "Deploying Modal functions..."
 	@set -a && . .env.prod && set +a && \

docs/AGENT_TESTING.md

Lines changed: 165 additions & 0 deletions
@@ -0,0 +1,165 @@
+# Agent testing and optimisation
+
+This document tracks ongoing work to test and improve the PolicyEngine agent's ability to answer policy questions efficiently.
+
+## Goal
+
+Minimise the number of turns the agent needs to answer policy questions by improving API metadata, documentation, and structure - not by hacking for specific test cases.
+
+## Test categories
+
+We want comprehensive coverage across:
+- **Country**: UK and US
+- **Scope**: Household (single family) and Economy (population-wide)
+- **Complexity**: Simple (single variable lookup) to Complex (multi-step reforms)
+
+## Example questions to test
+
+### UK Household (simple)
+- "What is my income tax if I earn £50,000?"
+- "How much child benefit would a family with 2 children receive?"
+
+### UK Household (complex)
+- "Compare my net income under current law vs if the basic rate was 25%"
+- "What's the marginal tax rate for someone earning £100,000?"
+
+### UK Economy (simple)
+- "What's the total cost of child benefit?"
+- "How many people pay higher rate tax?"
+
+### UK Economy (complex)
+- "What would be the budgetary impact of raising the personal allowance to £15,000?"
+- "How would a £500 UBI affect poverty rates?"
+
+### US Household (simple)
+- "What is my federal income tax if I earn $75,000?"
+- "How much SNAP would a family of 4 with $30,000 income receive?"
+
+### US Household (complex)
+- "Compare my benefits under current law vs doubling the EITC"
+- "What's my marginal tax rate including state taxes in California?"
+
+### US Economy (simple)
+- "What's the total cost of SNAP?"
+- "How many households receive the EITC?"
+
+### US Economy (complex)
+- "What would be the budgetary impact of expanding the Child Tax Credit to $3,600?"
+- "How would eliminating the SALT cap affect different income deciles?"
+
+## Current agent architecture
+
+The agent uses Claude Code in a Modal sandbox with:
+- System prompt containing API documentation (see `src/policyengine_api/prompts/`)
+- Direct HTTP calls via curl to the PolicyEngine API
+- No MCP (it was causing issues in Modal containers)
+
+## Optimisation strategies
+
+1. **Improve system prompt** - Make API usage clearer, provide more examples
+2. **Add API response examples** - Show what successful responses look like
+3. **Parameter documentation** - Ensure all parameters are well-documented with valid values
+4. **Error messages** - Make error messages actionable so agent can self-correct
+5. **Endpoint discoverability** - Help agent find the right endpoint quickly
+
+## Test file location
+
+Tests are in `tests/test_agent_policy_questions.py` (integration tests requiring Modal).
+
+## How to continue this work
+
+1. Run existing tests: `pytest tests/test_agent_policy_questions.py -v -s`
+2. Check agent logs in Logfire for turn counts and errors
+3. Identify common failure patterns
+4. Improve prompts/metadata to address failures
+5. Add new test cases as coverage expands
+
+## Observed issues
+
+### Issue 1: Parameter search doesn't filter by country (9 turns for personal allowance)
+
+**Problem**: When searching for "personal allowance", the agent gets US results (Illinois AABD) mixed with UK results. It took 9 turns to find the UK personal allowance.
+
+**Agent's failed searches**:
+1. "personal allowance" → Illinois AABD (US)
+2. "income tax personal allowance" → empty
+3. "income_tax" → US CBO parameters
+4. "basic rate" → UK CGT (closer!)
+5. "allowance" → California SSI (US)
+6. "hmrc income_tax allowances personal" → empty
+7. "hmrc.income_tax.allowances" → found it!
+
+**Solution implemented**:
+- Added `tax_benefit_model_name` filter to `/parameters/` endpoint
+- Updated system prompt to instruct agent to use country filter
+
+**NOT acceptable solutions** (test hacking):
+- Adding specific parameter name examples to system prompt
+- Telling agent exactly what to search for
+
+### Issue 2: Duplicate parameters in database
+
+**Problem**: Same parameter name exists with multiple IDs. One has values, one doesn't. Agent picks wrong one first.
+
+**Example**: `gov.hmrc.income_tax.allowances.personal_allowance.amount` has two entries with different IDs.
+
+**Solution implemented**: Deduplicate parameters by name in seed script (`seen_names` set).
+
+### Issue 6: Case-sensitive search
+
+**Problem**: Search for "personal allowance" didn't find "Personal allowance" (capital P).
+
+**Solution implemented**: Changed search to use `ILIKE` instead of `contains` for case-insensitive matching.
+
+### Issue 7: Model name mismatch
+
+**Problem**: System prompt said `policyengine_uk` but database has `policyengine-uk` (hyphen vs underscore).
+
+**Solution implemented**: Updated system prompt and API docstrings to use correct model names with hyphens.
+
+### Issue 3: Variables endpoint lacks search
+
+**Problem**: `/variables/` had no search or country filter. Agent can't discover variable names.
+
+**Solution implemented**: Added `search` and `tax_benefit_model_name` filters to `/variables/`.
+
+### Issue 4: Datasets endpoint lacks country filter
+
+**Problem**: `/datasets/` returned all datasets, mixing UK and US.
+
+**Solution implemented**: Added `tax_benefit_model_name` filter to `/datasets/`.
+
+### Issue 5: Parameter values lack "current" filter
+
+**Problem**: Agent had to parse through all historical values to find current one.
+
+**Solution implemented**: Added `current=true` filter to `/parameter-values/` endpoint.
+
+## API improvements summary
+
+| Endpoint | Improvement |
+|----------|-------------|
+| `/parameters/` | Added `tax_benefit_model_name` filter, case-insensitive search |
+| `/variables/` | Added `search` and `tax_benefit_model_name` filters, case-insensitive search |
+| `/datasets/` | Added `tax_benefit_model_name` filter |
+| `/parameter-values/` | Added `current` filter |
+| Seed script | Deduplicate parameters by name |
+| System prompt | Fixed model names (hyphen not underscore) |
+
+## Measurements
+
+| Question type | Baseline | After improvements | Target |
+|---------------|----------|-------------------|--------|
+| Parameter lookup (UK personal allowance) | 10 turns | **3 turns** | 3-4 |
+| Household calculation (UK £50k income) | 6 turns | - | 5-6 |
+
+## Progress log
+
+- 2024-12-30: Initial setup, created test framework and first batch of questions
+- 2024-12-30: Tested personal allowance lookup - 9-10 turns (target: 3-4). Root cause: no country filter on parameter search
+- 2024-12-30: Added `tax_benefit_model_name` filter to `/parameters/`, `/variables/`, `/datasets/`
+- 2024-12-30: Tested household calc - 6 turns (acceptable). Async polling is the overhead
+- 2024-12-30: Discovered duplicate parameters in DB causing extra turns
+- 2024-12-30: Fixed model name mismatch (policyengine-uk with hyphen, not underscore)
+- 2024-12-30: Added case-insensitive search using ILIKE
+- 2024-12-30: Tested personal allowance lookup - **3 turns** (target met!)
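
Reviewer sketch (not part of the commit): the two search fixes described in AGENT_TESTING.md, the country filter and case-insensitive matching, can be illustrated in memory as below. It is a rough approximation under assumptions: the real endpoint runs an `ILIKE` query in the database, and whether it matches on name, label, or both is not shown in this diff. The `Param` class, the `search_parameters` helper, and the two non-HMRC-allowance parameter names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Param:
    name: str
    label: str
    model: str  # e.g. "policyengine-uk" (hyphen, as fixed in this commit)


# Only the first name comes from the commit; the other two are made up.
PARAMS = [
    Param("gov.hmrc.income_tax.allowances.personal_allowance.amount",
          "Personal allowance", "policyengine-uk"),
    Param("gov.states.il.example.personal_allowance",
          "Illinois AABD personal allowance", "policyengine-us"),
    Param("gov.hmrc.cgt.example_basic_rate",
          "CGT basic rate", "policyengine-uk"),
]


def search_parameters(params, search=None, tax_benefit_model_name=None):
    """Filter by country model, then by case-insensitive substring."""
    needle = search.lower() if search else None
    results = []
    for p in params:
        if tax_benefit_model_name and p.model != tax_benefit_model_name:
            continue
        # Lower-casing both sides approximates ILIKE '%term%'.
        if needle and needle not in p.name.lower() and needle not in p.label.lower():
            continue
        results.append(p)
    return results


unfiltered = search_parameters(PARAMS, search="personal allowance")
uk_only = search_parameters(
    PARAMS, search="personal allowance", tax_benefit_model_name="policyengine-uk"
)
print(len(unfiltered), len(uk_only))
```

Without the country filter the lowercase query matches both the UK and the Illinois entry; with it only the UK entry remains, which is the behaviour Issue 1 and Issue 6 describe. Unlike SQL `ILIKE`, this sketch does not treat `%` or `_` in the search term as wildcards.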

scripts/seed.py

Lines changed: 35 additions & 10 deletions
@@ -1,5 +1,6 @@
 """Seed database with UK and US models, variables, parameters, datasets."""
 
+import argparse
 import json
 import logging
 import math
@@ -101,7 +102,7 @@ def bulk_insert(session, table: str, columns: list[str], rows: list[dict]):
     session.commit()
 
 
-def seed_model(model_version, session) -> TaxBenefitModelVersion:
+def seed_model(model_version, session, lite: bool = False) -> TaxBenefitModelVersion:
     """Seed a tax-benefit model with its variables and parameters."""
 
     with logfire.span(
@@ -205,12 +206,27 @@ def seed_model(model_version, session) -> TaxBenefitModelVersion:
             f" [green]✓[/green] Added {len(model_version.variables)} variables"
         )
 
-        # Add parameters (only user-facing ones: those with labels or gov.* params)
-        parameters_to_add = [p for p in model_version.parameters if p.label is not None]
-        console.print(
-            f" Filtered to {len(parameters_to_add)} user-facing parameters "
-            f"(from {len(model_version.parameters)} total)"
-        )
+        # Add parameters (only user-facing ones: those with labels)
+        # Deduplicate by name - keep first occurrence
+        # In lite mode, exclude US state parameters (gov.states.*)
+        seen_names = set()
+        parameters_to_add = []
+        skipped_state_params = 0
+        for p in model_version.parameters:
+            if p.label is None or p.name in seen_names:
+                continue
+            # In lite mode, skip state-level parameters for faster seeding
+            if lite and p.name.startswith("gov.states."):
+                skipped_state_params += 1
+                continue
+            parameters_to_add.append(p)
+            seen_names.add(p.name)
+
+        filter_msg = f" Filtered to {len(parameters_to_add)} user-facing parameters"
+        filter_msg += f" (from {len(model_version.parameters)} total, deduplicated by name)"
+        if lite and skipped_state_params > 0:
+            filter_msg += f", skipped {skipped_state_params} state params (lite mode)"
+        console.print(filter_msg)
 
         with logfire.span("add_parameters", count=len(parameters_to_add)):
             # Build list of parameter dicts for bulk insert
@@ -574,16 +590,25 @@ def seed_example_policies(session):
 
 def main():
     """Main seed function."""
+    parser = argparse.ArgumentParser(description="Seed PolicyEngine database")
+    parser.add_argument(
+        "--lite",
+        action="store_true",
+        help="Lite mode: skip US state parameters for faster local seeding",
+    )
+    args = parser.parse_args()
+
     with logfire.span("database_seeding"):
-        console.print("[bold green]PolicyEngine database seeding[/bold green]\n")
+        mode_str = " (lite mode)" if args.lite else ""
+        console.print(f"[bold green]PolicyEngine database seeding{mode_str}[/bold green]\n")
 
         with next(get_quiet_session()) as session:
             # Seed UK model
-            uk_version = seed_model(uk_latest, session)
+            uk_version = seed_model(uk_latest, session, lite=args.lite)
             console.print(f"[green]✓[/green] UK model seeded: {uk_version.id}\n")
 
             # Seed US model
-            us_version = seed_model(us_latest, session)
+            us_version = seed_model(us_latest, session, lite=args.lite)
             console.print(f"[green]✓[/green] US model seeded: {us_version.id}\n")
 
             # Seed datasets
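
Reviewer sketch (not part of the commit): the parameter-selection loop added to `seed_model` can be tried in isolation as below. The loop body mirrors the diff; the `Parameter` dataclass, the `select_parameters` wrapper, and the sample names other than the HMRC personal allowance are hypothetical stand-ins for the real model objects.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Parameter:
    name: str
    label: Optional[str]


def select_parameters(parameters, lite=False):
    """Keep labelled parameters, first occurrence per name; in lite mode drop gov.states.*."""
    seen_names = set()
    selected = []
    skipped_state_params = 0
    for p in parameters:
        if p.label is None or p.name in seen_names:
            continue
        if lite and p.name.startswith("gov.states."):
            skipped_state_params += 1
            continue
        selected.append(p)
        seen_names.add(p.name)
    return selected, skipped_state_params


sample = [
    Parameter("gov.hmrc.income_tax.allowances.personal_allowance.amount", "Personal allowance"),
    Parameter("gov.hmrc.income_tax.allowances.personal_allowance.amount", "Duplicate entry"),
    Parameter("gov.example.internal", None),            # unlabelled, always dropped
    Parameter("gov.states.ca.example", "CA example"),   # hypothetical state parameter
    Parameter("gov.states.ca.example", "CA example"),   # duplicate state parameter
]

full, skipped_full = select_parameters(sample)
lite, skipped_lite = select_parameters(sample, lite=True)
print([p.name for p in full], skipped_full)
print([p.name for p in lite], skipped_lite)
```

One detail this makes visible: in lite mode a skipped state parameter is never added to `seen_names`, so a duplicated state parameter is counted once per occurrence in `skipped_state_params`. That only affects the log message, not what gets inserted.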

src/policyengine_api/agent_sandbox.py

Lines changed: 5 additions & 1 deletion
@@ -21,14 +21,18 @@
 You have access to the full PolicyEngine API. Key workflows:
 
 1. **Household calculations**: POST to /household/calculate with people array, then poll GET /household/calculate/{job_id}
-2. **Parameter lookup**: GET /parameters/ with search query, then GET /parameter-values/ with parameter_id
+2. **Parameter lookup**: GET /parameters/ with search query and tax_benefit_model_name, then GET /parameter-values/ with parameter_id
 3. **Economic impact**:
    - GET /parameters/ to find parameter_id
    - POST /policies/ to create reform with parameter_values
    - GET /datasets/ to find dataset_id
    - POST /analysis/economic-impact with policy_id and dataset_id
    - Poll GET /analysis/economic-impact/{report_id} until completed
 
+When searching for parameters, use tax_benefit_model_name to filter by country:
+- "policyengine-uk" for UK parameters
+- "policyengine-us" for US parameters
+
 When answering questions:
 1. Use the API tools to get accurate, current data
 2. Show your calculations clearly
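
Reviewer sketch (not part of the commit): the two-step parameter lookup in the prompt corresponds to two GET requests, whose URLs could be built as below. This only constructs URLs and makes no network calls. Assumptions: the base URL is a placeholder for a local API, the query parameter on `/parameters/` is named `search`, and the helper names are hypothetical; `tax_benefit_model_name`, `parameter_id`, and `current=true` are taken from the commit.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"  # assumption: local API address


def parameters_url(search, model_name, base_url=BASE_URL):
    """Step 1: find candidate parameters, restricted to one country model."""
    query = urlencode({"search": search, "tax_benefit_model_name": model_name})
    return f"{base_url}/parameters/?{query}"


def parameter_values_url(parameter_id, current=True, base_url=BASE_URL):
    """Step 2: fetch values for one parameter, optionally only the current one."""
    query = urlencode({"parameter_id": parameter_id, "current": str(current).lower()})
    return f"{base_url}/parameter-values/?{query}"


step1 = parameters_url("personal allowance", "policyengine-uk")
step2 = parameter_values_url("PARAMETER_ID_FROM_STEP_1")  # placeholder id
print(step1)
print(step2)
```

Note the hyphen in `policyengine-uk`: the commit message reports that the underscore form in the earlier prompt did not match the database.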

src/policyengine_api/api/datasets.py

Lines changed: 18 additions & 7 deletions
@@ -11,24 +11,35 @@
 from fastapi import APIRouter, Depends, HTTPException
 from sqlmodel import Session, select
 
-from policyengine_api.models import Dataset, DatasetRead
+from policyengine_api.models import Dataset, DatasetRead, TaxBenefitModel
 from policyengine_api.services.database import get_session
 
 router = APIRouter(prefix="/datasets", tags=["datasets"])
 
 
 @router.get("/", response_model=List[DatasetRead])
-def list_datasets(session: Session = Depends(get_session)):
-    """List all available datasets.
+def list_datasets(
+    tax_benefit_model_name: str | None = None,
+    session: Session = Depends(get_session),
+):
+    """List available datasets.
 
     Returns datasets that can be used with the /analysis/economic-impact endpoint.
     Each dataset represents population microdata for a specific country and year.
 
-    USAGE: For UK analysis, look for datasets with names containing "uk" or "frs".
-    For US analysis, look for datasets with names containing "us" or "cps".
-    Use the dataset's id when calling /analysis/economic-impact.
+    Args:
+        tax_benefit_model_name: Filter by country model.
+            Use "policyengine-uk" for UK datasets.
+            Use "policyengine-us" for US datasets.
     """
-    datasets = session.exec(select(Dataset)).all()
+    query = select(Dataset)
+
+    if tax_benefit_model_name:
+        query = query.join(TaxBenefitModel).where(
+            TaxBenefitModel.name == tax_benefit_model_name
+        )
+
+    datasets = session.exec(query).all()
     return datasets
 
 
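
Reviewer sketch (not part of the commit): the optional join-and-filter pattern in `list_datasets` can be reproduced with the standard library's `sqlite3`, as below. The table names, column names, and dataset names are assumptions made for illustration; the real endpoint uses SQLModel against the project's own schema.

```python
import sqlite3


def list_datasets(conn, tax_benefit_model_name=None):
    """Return dataset names, optionally restricted to one country model."""
    sql = "SELECT d.name FROM datasets d"
    args = ()
    # A falsy value (None or "") means no filter, as in the endpoint's `if` check.
    if tax_benefit_model_name:
        sql += (
            " JOIN tax_benefit_models m ON m.id = d.tax_benefit_model_id"
            " WHERE m.name = ?"
        )
        args = (tax_benefit_model_name,)
    sql += " ORDER BY d.name"
    return [row[0] for row in conn.execute(sql, args)]


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tax_benefit_models (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE datasets (
        id INTEGER PRIMARY KEY,
        name TEXT,
        tax_benefit_model_id INTEGER REFERENCES tax_benefit_models(id)
    );
    INSERT INTO tax_benefit_models VALUES (1, 'policyengine-uk'), (2, 'policyengine-us');
    INSERT INTO datasets VALUES (1, 'uk_example_2024', 1), (2, 'us_example_2024', 2);
    """
)

print(list_datasets(conn))
print(list_datasets(conn, "policyengine-uk"))
```

Because the match is an exact equality on the model name, an unknown or misspelled name (for example the underscore form `policyengine_uk`) yields an empty list rather than an error.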
