Conversation
Co-authored-by: Borda <6035284+Borda@users.noreply.github.com>
extended_metrics precision over-estimation due to monotone PR-curve interpolation
Pull request overview
This PR fixes a critical bug in `extended_metrics` where precision was over-estimated due to using COCO's interpolated precision-recall curve. The interpolated curve hides false positives below the recall ceiling, causing incorrect F1-optimal threshold selection.
Changes:
- Replaced the interpolated-precision sweep with a confidence-threshold sweep using actual TP/FP counts from `eval["matched"]`
- Added comprehensive regression tests demonstrating the bug and verifying the fix
- Updated documentation to clarify the algorithm change
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| faster_coco_eval/core/faster_eval_api.py | Core fix: replaces interpolated precision with actual per-threshold precision computation using detection-GT matches from `eval["matched"]` |
| tests/test_basic.py | Adds a `TestExtendedMetrics` class with two tests: one demonstrating the over-estimation bug and one verifying perfect-prediction behavior |
Thanks for the implementation, it's awesome. I think it's cool that it's possible to move the validation engine part to a library, thereby reducing the engine's codebase.
`COCOeval_faster.extended_metrics` was reading precision from `eval["precision"]` — the monotone envelope `P_interp[r] = max(actual_precision at recall ≥ r)`. This is correct for AP but wrong for F1: FPs below the recall ceiling are invisible to the interpolation, silently inflating precision and misidentifying the F1-optimal confidence threshold.

Motivation
When a class has FPs at confidence scores below its recall ceiling, COCO's interpolated PR curve collapses them away.
`extended_metrics` was using those inflated values for F1 computation rather than actual per-threshold precision.

Modification
`faster_coco_eval/core/faster_eval_api.py`:
- Replaces the interpolated-precision lookup in `extended_metrics` with an actual-precision confidence-threshold sweep:
  - TP/FP counts come from detection-GT matches in `eval["matched"]` (IoU ≥ 0.5)
  - Uses `np.searchsorted` for O(log n) per-class counting
  - AP computation (`map@50`, `map@50:95`) is unchanged — interpolated precision remains correct for area-under-curve
  - `score_vec` variable

`tests/test_basic.py`:
- Adds a `@pytest.fixture` (`coco_gt_dt_with_fp`) that builds and returns the primitive `(coco_gt, coco_dt)` COCO object pair
- `test_extended_metrics_precision_not_overestimated` — regression test for the exact bug scenario (2 classes, one with sub-ceiling FPs); runs `evaluate()` / `accumulate()` / `summarize()` inside the test and asserts `precision=0.75`, `recall=1.0` using `pytest.approx`
- `test_extended_metrics_perfect_predictions` — sanity check: the all-TP case yields `precision=recall=1.0`

BC-breaking (Optional)
`extended_metrics["precision"]` and `["recall"]` values will change for datasets where any class has detections below the recall ceiling. The new values are correct; the old values were over-estimating precision. `map` and all AP/AR stats are unaffected.

Checklist
Original prompt
This section details the original issue you should resolve
<issue_title>Bug: `extended_metrics` over-estimates precision due to monotone PR-curve interpolation</issue_title>
<issue_description>
## Summary
`COCOeval_faster.extended_metrics` reads precision from `eval["precision"]`, which stores the monotone-decreasing interpolated PR curve used for AP computation.
That value is the maximum precision achievable at recall ≥ r — not the actual precision
when all predictions at or above confidence threshold t are included.
False positives that appear below the recall ceiling (i.e. after every GT is already matched)
are invisible to the interpolated curve, so precision is silently over-estimated and the
F1-optimal confidence threshold is mis-identified.
Affected property
`COCOeval_faster.extended_metrics` (`faster_eval_api.py`, lines ~243–254)

Why the interpolated array is wrong for F1
COCO's `eval["precision"]` is computed as the monotone envelope `P_interp[r] = max(precision at recall ≥ r)`. This is the correct value for AP (area under the monotone envelope), but wrong for F1.
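COCO-style evaluators build this envelope by taking a right-to-left running maximum of the raw precision values and sampling it on a recall grid. A minimal numpy sketch (illustrative, not the library's actual code) of why a trailing false positive disappears:

```python
import numpy as np

# Ranked detections for one class with 2 GT boxes: two TPs, then one FP
# whose confidence score is below the recall ceiling.
tp = np.array([1, 1, 0])
fp = 1 - tp
cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
precision = cum_tp / (cum_tp + cum_fp)   # [1.0, 1.0, 0.667]
recall = cum_tp / 2                      # [0.5, 1.0, 1.0]

# COCO-style monotone envelope: right-to-left running maximum ...
envelope = np.maximum.accumulate(precision[::-1])[::-1]

# ... sampled at a recall threshold via the first index with recall >= r.
p_interp = envelope[np.searchsorted(recall, 1.0, side="left")]
print(p_interp)       # 1.0 — the trailing FP is invisible to the envelope
print(precision[-1])  # ~0.667 — actual precision with all detections kept
```

Because two points share `recall = 1.0`, the envelope at that recall is the *maximum* precision over both, so the FP's drop to 2/3 never reaches `eval["precision"]`.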
Concrete example: a class with FPs below its recall ceiling — interpolation collapses them: `P_interp[recall=1.0] = 1.0`. `extended_metrics` sees P=1.0 and reports macro-precision = mean(1.0, 1.0) = 1.0.

Reproducible example
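The arithmetic can be reproduced with plain numpy (no library dependency). The two-class setup below — class A a single TP, class B a TP plus one lower-scored FP — is one plausible instance of the scenario described; the helper name is illustrative:

```python
import numpy as np

def per_class(tp_flags, n_gt):
    """Actual vs. COCO-interpolated precision at full recall for one class.

    tp_flags: 1/0 per detection, sorted by descending confidence score.
    """
    tp_flags = np.asarray(tp_flags, dtype=float)
    cum_tp = np.cumsum(tp_flags)
    cum_fp = np.cumsum(1 - tp_flags)
    precision = cum_tp / (cum_tp + cum_fp)
    recall = cum_tp / n_gt
    envelope = np.maximum.accumulate(precision[::-1])[::-1]
    p_interp = envelope[np.searchsorted(recall, 1.0, side="left")]
    return precision[-1], p_interp, recall[-1]

# Class A: one detection, a TP.  Class B: a TP, then a sub-ceiling FP.
a = per_class([1], n_gt=1)
b = per_class([1, 0], n_gt=1)

macro_actual = (a[0] + b[0]) / 2   # 0.75 — correct macro-precision
macro_interp = (a[1] + b[1]) / 2   # 1.0  — the over-estimated value
macro_recall = (a[2] + b[2]) / 2   # 1.0
```

Class B's actual precision is 0.5, but its interpolated precision at full recall is 1.0, which is exactly the mean(1.0, 1.0) = 1.0 over-estimate above.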
Expected vs actual output
Hand-verified expected values
At the F1-optimal confidence threshold (0.0 — include all predictions):
Fix direction
The fix requires reconstructing the actual (non-interpolated) PR curve from raw
annotation data, then sweeping candidate confidence thresholds to find the one that
maximises macro-F1. The interpolated
`eval["precision"]` array shoul...

`extended_metrics` over-estimates precision due to monotone PR-curve interpolation #73
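The fix direction can be sketched in plain numpy: sweep every distinct detection score as a candidate confidence threshold and count TP/FP with `np.searchsorted`. All names and the data layout here are illustrative; the real implementation reads detection-GT matches from `eval["matched"]`:

```python
import numpy as np

def f1_optimal_threshold(per_class):
    """Sweep candidate confidence thresholds; return (threshold, macro-F1).

    per_class: list of (scores, is_tp, n_gt) tuples, one per class, with
    per-detection `scores` and 1/0 `is_tp` arrays.  Hypothetical layout.
    """
    candidates = np.unique(np.concatenate([np.asarray(s, float) for s, _, _ in per_class]))
    best = (None, -1.0)
    for t in candidates:
        f1s = []
        for scores, is_tp, n_gt in per_class:
            order = np.argsort(scores)                    # ascending, for searchsorted
            s = np.asarray(scores, float)[order]
            m = np.asarray(is_tp, float)[order]
            i = np.searchsorted(s, t, side="left")        # first detection kept
            tp = m[i:].sum()
            fp = (s.size - i) - tp
            prec = tp / (tp + fp) if tp + fp else 0.0
            rec = tp / n_gt if n_gt else 0.0
            f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
        macro = float(np.mean(f1s))
        if macro > best[1]:
            best = (float(t), macro)
    return best

# Two classes; class B carries a false positive below its recall ceiling.
classes = [
    ([0.3], [1], 1),          # class A: one TP
    ([0.8, 0.5], [1, 0], 1),  # class B: one TP, one lower-scored FP
]
t, f1 = f1_optimal_threshold(classes)
print(t, f1)  # 0.3 — keeping every detection maximises macro-F1 here
```

With these numbers the optimal threshold includes all predictions (macro-precision 0.75, recall 1.0), matching the hand-verified expected values; an interpolated-precision sweep would have reported macro-precision 1.0 instead.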