Goal
Improve the existing export-screening showcase to demonstrate more robust CI gating without requiring AgentV core changes.
This keeps the current PR scope contained while providing an actionable pattern users can adopt immediately.
Proposal
Extend examples/showcase/export-screening/evals/ci_check.ts with optional multi-sample evaluation and stability-aware gating.
New wrapper options (suggested)
- `--samples N`: run the eval N times (a fresh eval invocation each time) and aggregate metrics across runs.
- `--gate min|mean|p05|p10` (or similar): choose the conservative gating strategy (a rough sketch of these gates follows this list).
- `--min-run-f1 X`: require every run (or the p05) to meet the threshold.
- Optional: `--max-stddev X` or `--max-variance X` for the checked class.
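A minimal sketch of how these gates could reduce per-run scores to a single pass/fail decision, assuming per-run F1 values for the checked class are already available; the type and function names here are illustrative, not existing ci_check.ts code:

```ts
// Sketch only: gate names mirror the suggested flags above; nothing here exists in the wrapper yet.
type Gate = "min" | "mean" | "p05" | "p10";

function percentile(sortedAsc: number[], p: number): number {
  // Nearest-rank percentile over an ascending-sorted array.
  const idx = Math.min(sortedAsc.length - 1, Math.floor((p / 100) * sortedAsc.length));
  return sortedAsc[idx];
}

function gateScore(runF1s: number[], gate: Gate): number {
  const sorted = [...runF1s].sort((a, b) => a - b);
  switch (gate) {
    case "min":
      return sorted[0];
    case "mean":
      return runF1s.reduce((s, x) => s + x, 0) / runF1s.length;
    case "p05":
      return percentile(sorted, 5);
    case "p10":
      return percentile(sorted, 10);
  }
}

// The gate passes when the aggregated score clears the threshold.
const passes = (runF1s: number[], gate: Gate, threshold: number) =>
  gateScore(runF1s, gate) >= threshold;
```

The "min" and low-percentile gates are the conservative choices: they fail CI if any (or the worst few) runs dip below the threshold, which is the point of multi-sample gating.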
Behavior
- Default behavior remains unchanged (single-run threshold gate), so existing docs continue to work.
- When `--samples` is provided, the wrapper:
  - runs `bun agentv eval` repeatedly (or expects multiple results files)
  - aggregates confusion matrices / per-class metrics
  - emits a stability-aware CI result JSON and exits non-zero on failure (sketched below).
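A hedged sketch of that loop, assuming the wrapper's existing results.jsonl -> metrics code writes (or can be made to write) an F1 for the checked class to a `metrics.json` file; that file name, its `f1` field, and the fixed "min" gate are assumptions for illustration only:

```ts
// Sketch of the --samples loop. The metrics.json path and its "f1" field are
// assumptions standing in for the wrapper's existing results.jsonl -> metrics code.
import { spawnSync } from "node:child_process";
import { readFileSync, writeFileSync } from "node:fs";

function runOnce(evalPath: string): number {
  // Fresh `bun agentv eval` invocation per sample.
  const proc = spawnSync("bun", ["agentv", "eval", evalPath], { stdio: "inherit" });
  if (proc.status !== 0) throw new Error("eval invocation failed");
  return JSON.parse(readFileSync("metrics.json", "utf8")).f1 as number;
}

function main(evalPath: string, samples: number, threshold: number): void {
  const runF1s = Array.from({ length: samples }, () => runOnce(evalPath));
  const min = Math.min(...runF1s);
  const mean = runF1s.reduce((s, x) => s + x, 0) / runF1s.length;
  const pass = min >= threshold; // "min" as the conservative default gate
  writeFileSync(
    "ci_result.json",
    JSON.stringify({ runF1s, aggregate: { min, mean }, gate: "min", pass }, null, 2),
  );
  process.exit(pass ? 0 : 1);
}
```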
Why this belongs in a wrapper/example
- Export-screening already demonstrates the “results.jsonl -> wrapper metrics -> CI exit code” pattern.
- Multi-run/stability primitives are proposed for core (see #117, Stability-aware evaluation: multi-run sampling + variance/disagreement penalty aggregation), but users want robust CI gating now.
- This example is a concrete pattern that generalizes to other domains, even if the dataset is export risk.
Acceptance Criteria
- `bun run ./evals/ci_check.ts --eval ./evals/dataset.yaml --samples 5 --threshold 0.95 --check-class High` works.
- Output JSON includes per-run metrics + aggregate metrics + the selected gating rule (shape sketched after this list).
- Wrapper exits 0/1 deterministically based on the selected gate.
- README updated with the new options and guidance on choosing `--samples` and gate types.
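For reference, one possible shape of that output JSON, expressed as a TypeScript interface; every field name here is illustrative rather than a spec:

```ts
// Hypothetical shape of the stability-aware CI result JSON emitted by the wrapper.
interface CiResult {
  runs: Array<{ f1: number; precision: number; recall: number }>; // per-run metrics
  aggregate: { min: number; mean: number; p05: number; stddev: number };
  gate: "min" | "mean" | "p05" | "p10"; // selected gating rule
  threshold: number;
  pass: boolean; // mirrors the 0/1 exit code
}
```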
Related
- #117 Stability-aware evaluation: multi-run sampling + variance/disagreement penalty aggregation
- #119 Showcase: evaluator conformance harness (compatibility + consistency fixtures, CI gate)