
Terminal-Bench Assessor Refinement Research Study (Phase 2b) #201


Overview

Conduct an empirical research study to refine AgentReady's assessor list using real Terminal-Bench benchmark data. This is a follow-on to #190, which completed the technical Harbor framework integration.

Prerequisites: ✅ Issue #190 complete (Harbor integration functional)

Goal: Use empirical Terminal-Bench data to identify which assessors have measurable impact on agentic development performance.


Research Objectives

Primary Question

Which AgentReady assessors demonstrate statistically significant impact on Terminal-Bench performance?

Specific Questions

  1. Which attributes have the largest measurable impact on Terminal-Bench scores?
  2. Are higher-tier attributes more impactful than lower-tier ones (validating the tier system)?
  3. Do certain assessor combinations create synergistic effects?
  4. How does repository type (Python vs JavaScript vs Go) affect assessor impact?
  5. Which assessors show zero/negligible impact and should be reconsidered?

Implementation Plan

1. Repository Selection (10-20 diverse repos)

Selection Criteria:

  • Diverse languages (Python, JavaScript, TypeScript, Go, Rust)
  • Varying sizes (small: <1k LOC, medium: 1k-10k, large: >10k)
  • Different domains (web apps, CLIs, libraries, data science)
  • A mix of well-documented and poorly documented projects
  • Open source repositories with permissive licenses

Candidate Repositories (examples):

  • Python: requests, flask, django, numpy, pandas
  • JavaScript: express, react, vue, axios
  • TypeScript: nest.js, typeorm
  • Go: hugo, cobra, viper
  • Rust: ripgrep, tokio, serde

2. Baseline Benchmarking

For each repository:

```bash
# Run baseline (current state)
agentready benchmark --subset smoketest --model claude-sonnet-4-5
```

Capture (see the scripted sketch after this list):

  • Baseline Terminal-Bench score
  • Task success rates
  • Repository assessment report (current AgentReady score)
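
A minimal sketch of the baseline loop, assuming a plain-text repos.txt manifest of local clone paths; the JSON-summary-on-the-last-stdout-line parsing is an assumption about the CLI's output, not documented behavior:

```python
import json
import subprocess
from pathlib import Path

baselines = {}
for repo in Path("repos.txt").read_text().splitlines():
    out = subprocess.run(
        ["agentready", "benchmark", "--subset", "smoketest",
         "--model", "claude-sonnet-4-5"],
        cwd=repo, capture_output=True, text=True, check=True,
    )
    # Assumption: the benchmark prints a JSON summary with a "score"
    # field on its final stdout line; adjust to the real output format.
    baselines[repo] = json.loads(out.stdout.splitlines()[-1])["score"]

Path("baselines.json").write_text(json.dumps(baselines, indent=2))
```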

3. Individual Assessor Testing

For each of the 25 assessors:

```bash
# 1. Apply a single assessor fix
agentready align --assessor <assessor_id>

# 2. Run the benchmark
agentready benchmark --subset smoketest --model claude-sonnet-4-5

# 3. Calculate the delta (in the analysis harness):
#    delta = aligned_score - baseline_score

# 4. Revert changes, including any untracked files the fix created
git reset --hard HEAD
git clean -fd
```

Metrics to Track (the sketch below records the raw deltas; step 4 handles the statistics):

  • Mean score delta across all repos
  • Number of repos showing positive impact
  • Number of repos showing negative impact
  • Statistical significance (p-value)
  • Effect size (Cohen's d)
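
A sketch of the per-assessor loop, reusing the hypothetical repos.txt/baselines.json conventions from the baseline sketch, plus an equally hypothetical assessors.txt listing the 25 assessor IDs:

```python
import json
import statistics
import subprocess
from pathlib import Path

def score(repo: str) -> float:
    """Run one smoketest benchmark; assumes a JSON score on the last stdout line."""
    out = subprocess.run(
        ["agentready", "benchmark", "--subset", "smoketest",
         "--model", "claude-sonnet-4-5"],
        cwd=repo, capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout.splitlines()[-1])["score"]

baselines = json.loads(Path("baselines.json").read_text())
assessors = Path("assessors.txt").read_text().split()
deltas: dict[str, list[float]] = {a: [] for a in assessors}

for repo, baseline in baselines.items():
    for assessor in assessors:
        subprocess.run(["agentready", "align", "--assessor", assessor],
                       cwd=repo, check=True)
        deltas[assessor].append(score(repo) - baseline)
        # Revert everything, including untracked files the fix created.
        subprocess.run(["git", "reset", "--hard", "HEAD"], cwd=repo, check=True)
        subprocess.run(["git", "clean", "-fd"], cwd=repo, check=True)

for assessor, d in sorted(deltas.items()):
    print(f"{assessor}: mean delta {statistics.mean(d):+.3f}, "
          f"{sum(x > 0 for x in d)} repos up, {sum(x < 0 for x in d)} down")
```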

4. Statistical Analysis

Methods:

  • Paired t-tests (baseline vs aligned for each assessor)
  • Effect size calculations (Cohen's d)
  • Bonferroni correction for multiple comparisons
  • Regression analysis for assessor combinations

Significance Thresholds (applied in the sketch below):

  • p < 0.05 (statistically significant)
  • |Cohen's d| > 0.3 (meaningful effect size)
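
A sketch of the per-assessor test, assuming paired baseline/aligned score arrays; the paired-sample Cohen's d convention (mean difference over the SD of the differences) and the plain Bonferroni factor of 25 are stated assumptions:

```python
import numpy as np
from scipy import stats

N_COMPARISONS = 25  # one test per assessor -> Bonferroni factor

def analyze(baseline: list[float], aligned: list[float]) -> dict:
    """Paired t-test plus effect size for one assessor across all repos."""
    diffs = np.asarray(aligned) - np.asarray(baseline)
    t_stat, p = stats.ttest_rel(aligned, baseline)
    # Paired-sample Cohen's d: mean difference / SD of the differences.
    d = diffs.mean() / diffs.std(ddof=1)
    return {
        "mean_delta": float(diffs.mean()),
        "p": float(p),
        "p_bonferroni": float(min(p * N_COMPARISONS, 1.0)),
        "cohens_d": float(d),
        "significant": p * N_COMPARISONS < 0.05 and abs(d) > 0.3,
    }
```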

5. Assessor Categorization

Based on empirical results (a decision-rule sketch follows the four categories):

High Impact (keep, possibly promote tier):

  • Mean delta > 0.05 AND p < 0.05 AND |Cohen's d| > 0.5

Moderate Impact (keep current tier):

  • Mean delta > 0.02 AND p < 0.05 AND |Cohen's d| > 0.3

Low/No Impact (consider removing or demoting):

  • Mean delta < 0.02 OR p ≥ 0.05 OR |Cohen's d| < 0.3

Negative Impact (investigate, possibly remove):

  • Mean delta < -0.02 AND p < 0.05
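
These rules translate directly into a decision function. A sketch, with thresholds copied from the categories above and the negative-impact check applied first so it is not swallowed by the low-impact catch-all:

```python
def categorize(mean_delta: float, p: float, cohens_d: float) -> str:
    """Map one assessor's empirical results onto the categories above."""
    if mean_delta < -0.02 and p < 0.05:
        return "negative-impact"  # investigate, possibly remove
    if mean_delta > 0.05 and p < 0.05 and abs(cohens_d) > 0.5:
        return "high-impact"      # keep, possibly promote tier
    if mean_delta > 0.02 and p < 0.05 and abs(cohens_d) > 0.3:
        return "moderate-impact"  # keep current tier
    return "low-impact"           # consider removing or demoting
```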

Deliverables

1. Research Report: docs/tbench/assessor-refinement-results.md

Required Sections:

  • Executive Summary (key findings, recommendations)
  • Methodology (repo selection, benchmark protocol)
  • Results (tables, charts, statistical analysis)
  • Assessor Rankings (ordered by empirical impact)
  • Recommendations (which assessors to keep/remove/retune/promote/demote)
  • Appendices (raw data, statistical details)

2. Updated Assessor Tiers

Based on empirical evidence:

  • Promote high-impact assessors to higher tiers
  • Demote low-impact assessors to lower tiers
  • Remove assessors with zero/negative impact
  • Document rationale for all changes

3. Dashboard Updates

Update Terminal-Bench dashboard with:

  • Empirical impact scores per assessor
  • Before/after comparisons
  • Statistical confidence indicators

Success Criteria

  • ✅ 10-20 diverse repositories benchmarked (baseline + per-assessor)
  • ✅ Statistical analysis complete (p-values, effect sizes)
  • ✅ Assessor rankings documented with empirical evidence
  • ✅ Actionable recommendations for tier adjustments
  • ✅ docs/tbench/assessor-refinement-results.md published
  • ✅ Tier system validated or refined based on data
  • ✅ Clear guidance on which assessors to prioritize

Estimated Effort

Benchmark Execution:

  • 15 repos (midpoint of the 10-20 range) × (1 baseline + 25 assessors) = 390 benchmark runs
  • ~3 minutes per smoketest run → 390 × 3 min ≈ 19.5 hours of compute time
  • Parallelization possible (reduce to ~4-6 hours wall time)
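
A sketch of that fan-out, reusing the hypothetical repos.txt manifest; runs within one repo stay sequential (each aligns and then reverts the working tree), but repos are independent:

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def run_repo(repo: str) -> None:
    """Baseline + 25 per-assessor runs for one repo (see earlier sketches)."""
    ...

repos = Path("repos.txt").read_text().splitlines()
# 26 runs x ~3 min = ~78 min per repo; with 4 workers, 15 repos finish in
# roughly 4 batches, i.e. ~5 hours wall time instead of ~19.5 serial.
with ProcessPoolExecutor(max_workers=4) as pool:
    list(pool.map(run_repo, repos))
```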

Analysis & Documentation:

  • Statistical analysis: 4-8 hours
  • Report writing: 8-16 hours
  • Tier adjustment implementation: 2-4 hours

Total: ~20-30 hours of hands-on effort, plus ~20 hours of (parallelizable) benchmark compute


Resources

Technical:

Documentation:


Dependencies

Blocks:

Blocked By:


Notes

  • This research study will provide empirical validation of AgentReady's tier system
  • Results will directly inform which assessors to prioritize in documentation
  • May discover that some "essential" (Tier 1) assessors have low actual impact
  • Findings could reshape AgentReady's entire assessment strategy
  • Consider publishing results as blog post or research paper

Related: #190 (technical implementation)
Labels: research, terminal-bench, assessor-refinement, phase-2b
