Overview
Conduct an empirical research study to refine AgentReady's assessor list using real Terminal-Bench benchmark data. This is a follow-on to #190, which completed the technical Harbor framework integration.
Prerequisites: ✅ Issue #190 complete (Harbor integration functional)
Goal: Use empirical Terminal-Bench data to identify which assessors have measurable impact on agentic development performance.
Research Objectives
Primary Question
Which AgentReady assessors demonstrate statistically significant impact on Terminal-Bench performance?
Specific Questions
- Which attributes have the largest measurable impact on Terminal-Bench scores?
- Are higher-tier attributes more impactful than lower-tier ones (validate tier system)?
- Do certain assessor combinations create synergistic effects?
- How does repository type (Python vs JavaScript vs Go) affect assessor impact?
- Which assessors show zero/negligible impact and should be reconsidered?
Implementation Plan
1. Repository Selection (10-20 diverse repos)
Selection Criteria:
- Diverse languages (Python, JavaScript, TypeScript, Go, Rust)
- Varying sizes (small: <1k LOC, medium: 1k-10k, large: >10k)
- Different domains (web apps, CLIs, libraries, data science)
- Mix of well-documented and poorly documented projects
- Open source repositories with permissive licenses
Candidate Repositories (examples; see the manifest sketch after this list):
- Python: requests, flask, django, numpy, pandas
- JavaScript: express, react, vue, axios
- TypeScript: nest.js, typeorm
- Go: hugo, cobra, viper
- Rust: ripgrep, tokio, serde
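To keep the selection reproducible, the chosen repositories could be recorded in a small manifest that captures the criteria above (language, size, domain). This is an illustrative sketch only; `StudyRepo` and its fields are hypothetical, not an existing AgentReady structure, and the size labels are rough.

```python
from dataclasses import dataclass

@dataclass
class StudyRepo:
    """One candidate repository in the refinement study (illustrative fields)."""
    name: str      # e.g. "requests"
    url: str       # clone URL
    language: str  # primary language
    size: str      # "small" (<1k LOC), "medium" (1k-10k LOC), or "large" (>10k LOC)
    domain: str    # e.g. "web app", "CLI", "library", "data science"

# A few entries drawn from the candidate list above
STUDY_REPOS = [
    StudyRepo("requests", "https://github.com/psf/requests", "Python", "medium", "library"),
    StudyRepo("cobra", "https://github.com/spf13/cobra", "Go", "medium", "library"),
    StudyRepo("ripgrep", "https://github.com/BurntSushi/ripgrep", "Rust", "large", "CLI"),
]
```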
2. Baseline Benchmarking
For each repository:
```bash
# Run baseline (current state)
agentready benchmark --subset smoketest --model claude-sonnet-4-5
```
Capture (see the driver sketch after this list):
- Baseline Terminal-Bench score
- Task success rates
- Repository assessment report (current AgentReady score)
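A minimal driver for the baseline pass might look like the sketch below. It assumes local checkouts of the study repositories; `parse_score` is a placeholder, since this issue does not specify the benchmark's output format.

```python
import subprocess
from pathlib import Path

BENCHMARK_CMD = [
    "agentready", "benchmark",
    "--subset", "smoketest",
    "--model", "claude-sonnet-4-5",
]

def parse_score(output: str) -> float:
    """Placeholder: extract the Terminal-Bench score from the benchmark output."""
    raise NotImplementedError("fill in against agentready's actual output format")

def run_baseline(repo_dir: Path) -> float:
    """Run the smoketest baseline inside one repository checkout and return its score."""
    result = subprocess.run(
        BENCHMARK_CMD, cwd=repo_dir, capture_output=True, text=True, check=True
    )
    return parse_score(result.stdout)

# Example: baselines = {r.name: run_baseline(Path("repos") / r.name) for r in STUDY_REPOS}
```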
3. Individual Assessor Testing
For each of the 25 assessors:
```bash
# 1. Apply single assessor fix
agentready align --assessor <assessor_id>

# 2. Run benchmark
agentready benchmark --subset smoketest --model claude-sonnet-4-5

# 3. Calculate delta
delta = aligned_score - baseline_score

# 4. Revert changes
git reset --hard HEAD
```
Metrics to Track (see the loop-automation sketch after this list):
- Mean score delta across all repos
- Number of repos showing positive impact
- Number of repos showing negative impact
- Statistical significance (p-value)
- Effect size (Cohen's d)
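The align/benchmark/revert cycle lends itself to a simple per-repo driver. A minimal sketch, reusing the placeholder `parse_score` from the baseline step; the assessor IDs shown are hypothetical stand-ins for the real 25.

```python
import subprocess
from pathlib import Path

# Hypothetical IDs for illustration; the study iterates over all 25 real assessor IDs.
ASSESSOR_IDS = ["readme_quality", "test_coverage"]

def run(cmd: list[str], cwd: Path) -> subprocess.CompletedProcess:
    return subprocess.run(cmd, cwd=cwd, capture_output=True, text=True, check=True)

def test_assessor(repo_dir: Path, assessor_id: str, baseline_score: float) -> float:
    """Apply one assessor fix, re-run the benchmark, return the score delta, then revert."""
    try:
        # 1. Apply single assessor fix
        run(["agentready", "align", "--assessor", assessor_id], cwd=repo_dir)
        # 2. Run benchmark
        result = run(["agentready", "benchmark", "--subset", "smoketest",
                      "--model", "claude-sonnet-4-5"], cwd=repo_dir)
        # 3. Calculate delta (parse_score is the placeholder from the baseline sketch)
        return parse_score(result.stdout) - baseline_score
    finally:
        # 4. Revert changes so each assessor is tested in isolation
        run(["git", "reset", "--hard", "HEAD"], cwd=repo_dir)
```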
4. Statistical Analysis
Methods (see the scipy sketch after the thresholds below):
- Paired t-tests (baseline vs aligned for each assessor)
- Effect size calculations (Cohen's d)
- Bonferroni correction for multiple comparisons
- Regression analysis for assessor combinations
Significance Thresholds:
- p < 0.05 (statistically significant)
- |Cohen's d| > 0.3 (meaningful effect size)
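A minimal sketch of this analysis for a single assessor, using scipy; `baseline` and `aligned` are assumed to be per-repo score arrays in the same repo order, and Cohen's d is computed in its paired-samples form (mean delta over the standard deviation of the deltas).

```python
import numpy as np
from scipy import stats

NUM_ASSESSORS = 25
ALPHA = 0.05
BONFERRONI_ALPHA = ALPHA / NUM_ASSESSORS  # corrected threshold for 25 comparisons

def paired_analysis(baseline: np.ndarray, aligned: np.ndarray) -> tuple[float, float, float]:
    """Paired t-test and paired-samples Cohen's d for one assessor across repos."""
    deltas = aligned - baseline
    _, p_value = stats.ttest_rel(aligned, baseline)
    cohens_d = deltas.mean() / deltas.std(ddof=1)  # mean delta over SD of the deltas
    return float(deltas.mean()), float(p_value), float(cohens_d)

# Usage: mean_delta, p, d = paired_analysis(baseline_scores, aligned_scores)
#        significant = p < BONFERRONI_ALPHA and abs(d) > 0.3
```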
5. Assessor Categorization
Based on empirical results (see the categorization sketch after these criteria):
High Impact (keep, possibly promote tier):
- Mean delta > 0.05 AND p < 0.05 AND |Cohen's d| > 0.5
Moderate Impact (keep current tier):
- Mean delta > 0.02 AND p < 0.05 AND |Cohen's d| > 0.3
Low/No Impact (consider removing or demoting):
- Mean delta < 0.02 OR p > 0.05 OR |Cohen's d| < 0.3
Negative Impact (investigate, possibly remove):
- Mean delta < -0.02 AND p < 0.05
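These criteria translate directly into a small classifier; the sketch below mirrors the thresholds above (the Bonferroni-corrected alpha from the analysis sketch can be substituted for the raw 0.05 cutoff).

```python
def categorize(mean_delta: float, p_value: float, cohens_d: float) -> str:
    """Map one assessor's pooled results onto the categories defined above."""
    if mean_delta < -0.02 and p_value < 0.05:
        return "negative impact"   # investigate, possibly remove
    if mean_delta > 0.05 and p_value < 0.05 and abs(cohens_d) > 0.5:
        return "high impact"       # keep, possibly promote tier
    if mean_delta > 0.02 and p_value < 0.05 and abs(cohens_d) > 0.3:
        return "moderate impact"   # keep current tier
    return "low/no impact"         # consider removing or demoting
```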
Deliverables
1. Research Report: docs/tbench/assessor-refinement-results.md
Required Sections:
- Executive Summary (key findings, recommendations)
- Methodology (repo selection, benchmark protocol)
- Results (tables, charts, statistical analysis)
- Assessor Rankings (ordered by empirical impact)
- Recommendations (which assessors to keep/remove/retune/promote/demote)
- Appendices (raw data, statistical details)
2. Updated Assessor Tiers
Based on empirical evidence:
- Promote high-impact assessors to higher tiers
- Demote low-impact assessors to lower tiers
- Remove assessors with zero/negative impact
- Document rationale for all changes
3. Dashboard Updates
Update Terminal-Bench dashboard with:
- Empirical impact scores per assessor
- Before/after comparisons
- Statistical confidence indicators
Success Criteria
- ✅ 10-20 diverse repositories benchmarked (baseline + per-assessor)
- ✅ Statistical analysis complete (p-values, effect sizes)
- ✅ Assessor rankings documented with empirical evidence
- ✅ Actionable recommendations for tier adjustments
- ✅ docs/tbench/assessor-refinement-results.md published
- ✅ Tier system validated or refined based on data
- ✅ Clear guidance on which assessors to prioritize
Estimated Effort
Benchmark Execution (see the arithmetic sketch after this list):
- 15 repos × (1 baseline + 25 assessors) = 390 benchmark runs
- ~3 minutes per smoketest run = ~20 hours compute time
- Parallelization possible (reduce to ~4-6 hours wall time)
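For reference, the compute estimate follows directly from the stated counts (using the 15-repo midpoint above):

```python
repos = 15
runs_per_repo = 1 + 25                # one baseline run plus one run per assessor
total_runs = repos * runs_per_repo    # 390 benchmark runs
compute_hours = total_runs * 3 / 60   # ~19.5 hours at ~3 minutes per smoketest run
```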
Analysis & Documentation:
- Statistical analysis: 4-8 hours
- Report writing: 8-16 hours
- Tier adjustment implementation: 2-4 hours
Total: ~20-30 hours
Resources
Technical:
- Harbor framework (installed via #190: Terminal-Bench Eval Harness - Phase 2: Real Harbor Framework Integration)
- Terminal-Bench dataset (89 tasks)
- AgentReady align command (for applying assessor fixes)
- Statistical analysis tools (scipy, pandas for Python analysis)
Documentation:
- docs/tbench/methodology.md - A/B testing methodology
- agent-ready-codebase-attributes.md - Current assessor definitions
- Issue #190 (Terminal-Bench Eval Harness - Phase 2: Real Harbor Framework Integration) - Harbor integration implementation
Dependencies
Blocks:
- None (all prerequisites met via #190: Terminal-Bench Eval Harness - Phase 2: Real Harbor Framework Integration)
Blocked By:
- #190 (Terminal-Bench Eval Harness - Phase 2: Real Harbor Framework Integration) ✅ Complete (Harbor framework integration)
Notes
- This research study will provide empirical validation of AgentReady's tier system
- Results will directly inform which assessors to prioritize in documentation
- May discover that some "essential" (Tier 1) assessors have low actual impact
- Findings could reshape AgentReady's entire assessment strategy
- Consider publishing results as blog post or research paper
Related: #190 (technical implementation)
Labels: research, terminal-bench, assessor-refinement, phase-2b