Overview
Conduct an empirical research study to refine AgentReady's assessor list using real Terminal-Bench benchmark data. This is a follow-on to #190, which completed the technical Harbor framework integration.
Prerequisites: ✅ Issue #190 complete (Harbor integration functional)
Goal: Use empirical Terminal-Bench data to identify which assessors have measurable impact on agentic development performance.
Research Objectives
Primary Question
Which AgentReady assessors demonstrate statistically significant impact on Terminal-Bench performance?
Specific Questions
- Which attributes have the largest measurable impact on Terminal-Bench scores?
- Are higher-tier attributes more impactful than lower-tier ones (validate tier system)?
- Do certain assessor combinations create synergistic effects?
- How does repository type (Python vs JavaScript vs Go) affect assessor impact?
- Which assessors show zero/negligible impact and should be reconsidered?
Implementation Plan
1. Repository Selection (10-20 diverse repos)
Selection Criteria:
- Diverse languages (Python, JavaScript, TypeScript, Go, Rust)
- Varying sizes (small: <1k LOC, medium: 1k-10k, large: >10k)
- Different domains (web apps, CLIs, libraries, data science)
- Mix of well-documented and poorly documented projects
- Open source repositories with permissive licenses
Candidate Repositories (examples; see the manifest sketch after this list):
- Python: requests, flask, django, numpy, pandas
- JavaScript: express, react, vue, axios
- TypeScript: nest.js, typeorm
- Go: hugo, cobra, viper
- Rust: ripgrep, tokio, serde
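To keep the selection reproducible, the chosen repositories could be recorded in a small manifest that captures the criteria above (language, size, domain). This is an illustrative sketch only; `StudyRepo` and its fields are hypothetical, not an existing AgentReady structure, and the size labels are rough.

```python
from dataclasses import dataclass

@dataclass
class StudyRepo:
    """One candidate repository in the refinement study (illustrative fields)."""
    name: str      # e.g. "requests"
    url: str       # clone URL
    language: str  # primary language
    size: str      # "small" (<1k LOC), "medium" (1k-10k LOC), or "large" (>10k LOC)
    domain: str    # e.g. "web app", "CLI", "library", "data science"

# A few entries drawn from the candidate list above
STUDY_REPOS = [
    StudyRepo("requests", "https://github.com/psf/requests", "Python", "medium", "library"),
    StudyRepo("cobra", "https://github.com/spf13/cobra", "Go", "medium", "library"),
    StudyRepo("ripgrep", "https://github.com/BurntSushi/ripgrep", "Rust", "large", "CLI"),
]
```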
2. Baseline Benchmarking
For each repository:
```bash
# Run baseline (current state)
agentready benchmark --subset smoketest --model claude-sonnet-4-5
```
Capture (see the driver sketch after this list):
- Baseline Terminal-Bench score
- Task success rates
- Repository assessment report (current AgentReady score)
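A minimal driver for the baseline pass might look like the sketch below. It assumes local checkouts of the study repositories; `parse_score` is a placeholder, since this issue does not specify the benchmark's output format.

```python
import subprocess
from pathlib import Path

BENCHMARK_CMD = [
    "agentready", "benchmark",
    "--subset", "smoketest",
    "--model", "claude-sonnet-4-5",
]

def parse_score(output: str) -> float:
    """Placeholder: extract the Terminal-Bench score from the benchmark output."""
    raise NotImplementedError("fill in against agentready's actual output format")

def run_baseline(repo_dir: Path) -> float:
    """Run the smoketest baseline inside one repository checkout and return its score."""
    result = subprocess.run(
        BENCHMARK_CMD, cwd=repo_dir, capture_output=True, text=True, check=True
    )
    return parse_score(result.stdout)

# Example: baselines = {r.name: run_baseline(Path("repos") / r.name) for r in STUDY_REPOS}
```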
3. Individual Assessor Testing
For each of the 25 assessors:
```bash
# 1. Apply single assessor fix
agentready align --assessor <assessor_id>

# 2. Run benchmark
agentready benchmark --subset smoketest --model claude-sonnet-4-5

# 3. Calculate delta
delta = aligned_score - baseline_score

# 4. Revert changes
git reset --hard HEAD
```
Metrics to Track (see the loop-automation sketch after this list):
- Mean score delta across all repos
- Number of repos showing positive impact
- Number of repos showing negative impact
- Statistical significance (p-value)
- Effect size (Cohen's d)
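The align/benchmark/revert cycle lends itself to a simple per-repo driver. A minimal sketch, reusing the placeholder `parse_score` from the baseline step; the assessor IDs shown are hypothetical stand-ins for the real 25.

```python
import subprocess
from pathlib import Path

# Hypothetical IDs for illustration; the study iterates over all 25 real assessor IDs.
ASSESSOR_IDS = ["readme_quality", "test_coverage"]

def run(cmd: list[str], cwd: Path) -> subprocess.CompletedProcess:
    return subprocess.run(cmd, cwd=cwd, capture_output=True, text=True, check=True)

def test_assessor(repo_dir: Path, assessor_id: str, baseline_score: float) -> float:
    """Apply one assessor fix, re-run the benchmark, return the score delta, then revert."""
    try:
        # 1. Apply single assessor fix
        run(["agentready", "align", "--assessor", assessor_id], cwd=repo_dir)
        # 2. Run benchmark
        result = run(["agentready", "benchmark", "--subset", "smoketest",
                      "--model", "claude-sonnet-4-5"], cwd=repo_dir)
        # 3. Calculate delta (parse_score is the placeholder from the baseline sketch)
        return parse_score(result.stdout) - baseline_score
    finally:
        # 4. Revert changes so each assessor is tested in isolation
        run(["git", "reset", "--hard", "HEAD"], cwd=repo_dir)
```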
4. Statistical Analysis
Methods (see the scipy sketch after the thresholds below):
- Paired t-tests (baseline vs aligned for each assessor)
- Effect size calculations (Cohen's d)
- Bonferroni correction for multiple comparisons
- Regression analysis for assessor combinations
Significance Thresholds:
- p < 0.05 (statistically significant)
- |Cohen's d| > 0.3 (meaningful effect size)
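A minimal sketch of this analysis for a single assessor, using scipy; `baseline` and `aligned` are assumed to be per-repo score arrays in the same repo order, and Cohen's d is computed in its paired-samples form (mean delta over the standard deviation of the deltas).

```python
import numpy as np
from scipy import stats

NUM_ASSESSORS = 25
ALPHA = 0.05
BONFERRONI_ALPHA = ALPHA / NUM_ASSESSORS  # corrected threshold for 25 comparisons

def paired_analysis(baseline: np.ndarray, aligned: np.ndarray) -> tuple[float, float, float]:
    """Paired t-test and paired-samples Cohen's d for one assessor across repos."""
    deltas = aligned - baseline
    _, p_value = stats.ttest_rel(aligned, baseline)
    cohens_d = deltas.mean() / deltas.std(ddof=1)  # mean delta over SD of the deltas
    return float(deltas.mean()), float(p_value), float(cohens_d)

# Usage: mean_delta, p, d = paired_analysis(baseline_scores, aligned_scores)
#        significant = p < BONFERRONI_ALPHA and abs(d) > 0.3
```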
5. Assessor Categorization
Based on empirical results (see the categorization sketch after these criteria):
High Impact (keep, possibly promote tier):
- Mean delta > 0.05 AND p < 0.05 AND |Cohen's d| > 0.5
Moderate Impact (keep current tier):
- Mean delta > 0.02 AND p < 0.05 AND |Cohen's d| > 0.3
Low/No Impact (consider removing or demoting):
- Mean delta < 0.02 OR p > 0.05 OR |Cohen's d| < 0.3
Negative Impact (investigate, possibly remove):
- Mean delta < -0.02 AND p < 0.05
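These criteria translate directly into a small classifier; the sketch below mirrors the thresholds above (the Bonferroni-corrected alpha from the analysis sketch can be substituted for the raw 0.05 cutoff).

```python
def categorize(mean_delta: float, p_value: float, cohens_d: float) -> str:
    """Map one assessor's pooled results onto the categories defined above."""
    if mean_delta < -0.02 and p_value < 0.05:
        return "negative impact"   # investigate, possibly remove
    if mean_delta > 0.05 and p_value < 0.05 and abs(cohens_d) > 0.5:
        return "high impact"       # keep, possibly promote tier
    if mean_delta > 0.02 and p_value < 0.05 and abs(cohens_d) > 0.3:
        return "moderate impact"   # keep current tier
    return "low/no impact"         # consider removing or demoting
```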
Deliverables
1. Research Report: docs/tbench/assessor-refinement-results.md
Required Sections:
- Executive Summary (key findings, recommendations)
- Methodology (repo selection, benchmark protocol)
- Results (tables, charts, statistical analysis)
- Assessor Rankings (ordered by empirical impact)
- Recommendations (which assessors to keep/remove/retune/promote/demote)
- Appendices (raw data, statistical details)
2. Updated Assessor Tiers
Based on empirical evidence:
- Promote high-impact assessors to higher tiers
- Demote low-impact assessors to lower tiers
- Remove assessors with zero/negative impact
- Document rationale for all changes
3. Dashboard Updates
Update Terminal-Bench dashboard with:
- Empirical impact scores per assessor
- Before/after comparisons
- Statistical confidence indicators
Success Criteria
- ✅ 10-20 diverse repositories benchmarked (baseline + per-assessor)
- ✅ Statistical analysis complete (p-values, effect sizes)
- ✅ Assessor rankings documented with empirical evidence
- ✅ Actionable recommendations for tier adjustments
- ✅ docs/tbench/assessor-refinement-results.md published
- ✅ Tier system validated or refined based on data
- ✅ Clear guidance on which assessors to prioritize
Estimated Effort
Benchmark Execution (see the arithmetic sketch after this list):
- 15 repos × (1 baseline + 25 assessors) = 390 benchmark runs
- ~3 minutes per smoketest run = ~20 hours compute time
- Parallelization possible (reduce to ~4-6 hours wall time)
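For reference, the compute estimate follows directly from the stated counts (using the 15-repo midpoint above):

```python
repos = 15
runs_per_repo = 1 + 25                # one baseline run plus one run per assessor
total_runs = repos * runs_per_repo    # 390 benchmark runs
compute_hours = total_runs * 3 / 60   # ~19.5 hours at ~3 minutes per smoketest run
```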
Analysis & Documentation:
- Statistical analysis: 4-8 hours
- Report writing: 8-16 hours
- Tier adjustment implementation: 2-4 hours
Total: ~20-30 hours
Resources
Technical:
- Harbor framework (installed via #190: Terminal-Bench Eval Harness - Phase 2: Real Harbor Framework Integration)
- Terminal-Bench dataset (89 tasks)
- AgentReady align command (for applying assessor fixes)
- Statistical analysis tools (scipy, pandas for Python analysis)
Documentation:
- docs/tbench/methodology.md - A/B testing methodology
- agent-ready-codebase-attributes.md - Current assessor definitions
- Issue #190 (Terminal-Bench Eval Harness - Phase 2: Real Harbor Framework Integration) - Harbor integration implementation
Dependencies
Blocks:
- None (all prerequisites met via #190: Terminal-Bench Eval Harness - Phase 2: Real Harbor Framework Integration)
Blocked By:
- #190 (Terminal-Bench Eval Harness - Phase 2: Real Harbor Framework Integration) ✅ Complete (Harbor framework integration)
Notes
- This research study will provide empirical validation of AgentReady's tier system
- Results will directly inform which assessors to prioritize in documentation
- May discover that some "essential" (Tier 1) assessors have low actual impact
- Findings could reshape AgentReady's entire assessment strategy
- Consider publishing results as blog post or research paper
Related: #190 (technical implementation)
Labels: research, terminal-bench, assessor-refinement, phase-2b