Goal
Improve the existing export-screening showcase to demonstrate more robust CI gating without requiring AgentV core changes.
This keeps the current PR scope contained while providing an actionable pattern users can adopt immediately.
Proposal
Extend examples/showcase/export-screening/evals/ci_check.ts with optional multi-sample evaluation and stability-aware gating.
New wrapper options (suggested)
- `--samples N`: run the eval N times (a fresh eval invocation each time) and aggregate metrics across runs.
- `--gate min|mean|p05|p10` (or similar): choose the conservative gating strategy (a rough sketch of these gates follows this list).
- `--min-run-f1 X`: require every run (or the p05) to meet the threshold.
- Optional: `--max-stddev X` or `--max-variance X` for the checked class.
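A minimal sketch of how these gates could reduce per-run scores to a single pass/fail decision, assuming per-run F1 values for the checked class are already available; the type and function names here are illustrative, not existing ci_check.ts code:

```ts
// Sketch only: gate names mirror the suggested flags above; nothing here exists in the wrapper yet.
type Gate = "min" | "mean" | "p05" | "p10";

function percentile(sortedAsc: number[], p: number): number {
  // Nearest-rank percentile over an ascending-sorted array.
  const idx = Math.min(sortedAsc.length - 1, Math.floor((p / 100) * sortedAsc.length));
  return sortedAsc[idx];
}

function gateScore(runF1s: number[], gate: Gate): number {
  const sorted = [...runF1s].sort((a, b) => a - b);
  switch (gate) {
    case "min":
      return sorted[0];
    case "mean":
      return runF1s.reduce((s, x) => s + x, 0) / runF1s.length;
    case "p05":
      return percentile(sorted, 5);
    case "p10":
      return percentile(sorted, 10);
  }
}

// The gate passes when the aggregated score clears the threshold.
const passes = (runF1s: number[], gate: Gate, threshold: number) =>
  gateScore(runF1s, gate) >= threshold;
```

The "min" and low-percentile gates are the conservative choices: they fail CI if any (or the worst few) runs dip below the threshold, which is the point of multi-sample gating.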
Behavior
- Default behavior remains unchanged (single-run threshold gate), so existing docs continue to work.
- When `--samples` is provided, the wrapper:
  - runs `bun agentv eval` repeatedly (or expects multiple results files)
  - aggregates confusion matrices / per-class metrics
  - emits a stability-aware CI result JSON and exits non-zero on failure (sketched below).
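A hedged sketch of that loop, assuming the wrapper's existing results.jsonl -> metrics code writes (or can be made to write) an F1 for the checked class to a `metrics.json` file; that file name, its `f1` field, and the fixed "min" gate are assumptions for illustration only:

```ts
// Sketch of the --samples loop. The metrics.json path and its "f1" field are
// assumptions standing in for the wrapper's existing results.jsonl -> metrics code.
import { spawnSync } from "node:child_process";
import { readFileSync, writeFileSync } from "node:fs";

function runOnce(evalPath: string): number {
  // Fresh `bun agentv eval` invocation per sample.
  const proc = spawnSync("bun", ["agentv", "eval", evalPath], { stdio: "inherit" });
  if (proc.status !== 0) throw new Error("eval invocation failed");
  return JSON.parse(readFileSync("metrics.json", "utf8")).f1 as number;
}

function main(evalPath: string, samples: number, threshold: number): void {
  const runF1s = Array.from({ length: samples }, () => runOnce(evalPath));
  const min = Math.min(...runF1s);
  const mean = runF1s.reduce((s, x) => s + x, 0) / runF1s.length;
  const pass = min >= threshold; // "min" as the conservative default gate
  writeFileSync(
    "ci_result.json",
    JSON.stringify({ runF1s, aggregate: { min, mean }, gate: "min", pass }, null, 2),
  );
  process.exit(pass ? 0 : 1);
}
```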
Why this belongs in a wrapper/example
- Export-screening already demonstrates the “results.jsonl -> wrapper metrics -> CI exit code” pattern.
- Multi-run/stability primitives are proposed for core (see #117, Stability-aware evaluation: multi-run sampling + variance/disagreement penalty aggregation), but users want robust CI gating now.
- This example is a concrete pattern that generalizes to other domains, even if the dataset is export risk.
Acceptance Criteria
- `bun run ./evals/ci_check.ts --eval ./evals/dataset.yaml --samples 5 --threshold 0.95 --check-class High` works.
- Output JSON includes per-run metrics + aggregate metrics + the selected gating rule (shape sketched after this list).
- Wrapper exits 0/1 deterministically based on the selected gate.
- README updated with the new options and guidance on choosing `--samples` and gate types.
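For reference, one possible shape of that output JSON, expressed as a TypeScript interface; every field name here is illustrative rather than a spec:

```ts
// Hypothetical shape of the stability-aware CI result JSON emitted by the wrapper.
interface CiResult {
  runs: Array<{ f1: number; precision: number; recall: number }>; // per-run metrics
  aggregate: { min: number; mean: number; p05: number; stddev: number };
  gate: "min" | "mean" | "p05" | "p10"; // selected gating rule
  threshold: number;
  pass: boolean; // mirrors the 0/1 exit code
}
```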
Related
- #117 Stability-aware evaluation: multi-run sampling + variance/disagreement penalty aggregation
- #119 Showcase: evaluator conformance harness (compatibility + consistency fixtures, CI gate)