Fix extended_metrics precision over-estimation due to monotone PR-curve interpolation #74

Merged
MiXaiLL76 merged 5 commits into main from copilot/fix-precision-estimation-bug
Feb 21, 2026
Conversation

Contributor

Copilot AI commented Feb 19, 2026

COCOeval_faster.extended_metrics was reading precision from eval["precision"] — the monotone envelope P_interp[r] = max(actual_precision at recall ≥ r). This is correct for AP but wrong for F1: FPs below the recall ceiling are invisible to the interpolation, silently inflating precision and misidentifying the F1-optimal confidence threshold.

# Before fix – FPs at low confidence are hidden by interpolation
precision : 1.0000  (expected 0.75)
recall    : 1.0000  (expected 1.00)

# After fix
precision : 0.7500
recall    : 1.0000

Motivation

When a class has FPs at confidence scores below its recall ceiling, COCO's interpolated PR curve collapses them away. extended_metrics was using those inflated values for F1 computation rather than actual per-threshold precision.

Modification

faster_coco_eval/core/faster_eval_api.py

  • Replaced the interpolated-precision sweep in extended_metrics with an actual-precision confidence-threshold sweep:
    • Identifies TPs from eval["matched"] (IoU ≥ 0.5)
    • Builds per-class sorted score arrays with cumulative TP counts
    • Iterates thresholds ascending (most inclusive first); first threshold achieving max macro-F1 wins — breaking ties in favor of higher recall
    • Uses np.searchsorted for O(log n) per-class counting
  • AP computation (map@50, map@50:95) is unchanged — interpolated precision remains correct for area-under-curve
  • Removed dead score_vec variable
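The steps above can be sketched as follows. This is a minimal, hypothetical reconstruction of the sweep (the function and variable names are illustrative, not the actual code in `faster_eval_api.py`), assuming per-class detection scores with TP/FP labels and per-class ground-truth counts:

```python
import numpy as np

def best_macro_f1_threshold(scores_per_class, is_tp_per_class, n_gt_per_class):
    """Sweep every candidate confidence threshold; return (threshold,
    macro_precision, macro_recall) at the macro-F1 optimum."""
    per_class = []
    for scores, is_tp in zip(scores_per_class, is_tp_per_class):
        order = np.argsort(scores)                 # ascending scores
        s = np.asarray(scores, dtype=float)[order]
        tp = np.asarray(is_tp, dtype=float)[order]
        suffix_tp = np.cumsum(tp[::-1])[::-1]      # TPs with score >= s[i]
        per_class.append((s, suffix_tp))

    best = (-1.0, None, None, None)                # (macro_f1, thr, prec, rec)
    for t in np.unique(np.concatenate(scores_per_class)):  # ascending thresholds
        f1s, precs, recs = [], [], []
        for (s, suffix_tp), n_gt in zip(per_class, n_gt_per_class):
            i = np.searchsorted(s, t, side="left")  # O(log n) per class
            n_det = len(s) - i                      # detections kept at threshold t
            tp = suffix_tp[i] if n_det else 0.0
            p = tp / n_det if n_det else 0.0
            r = tp / n_gt if n_gt else 0.0
            f1s.append(2 * p * r / (p + r) if p + r else 0.0)
            precs.append(p)
            recs.append(r)
        macro_f1 = float(np.mean(f1s))
        if macro_f1 > best[0]:  # strict '>': ties keep the lowest (most inclusive)
            best = (macro_f1, float(t), float(np.mean(precs)), float(np.mean(recs)))
    return best[1], best[2], best[3]
```

Iterating thresholds in ascending order with a strict `>` comparison realizes the tie-break described above: when two thresholds reach the same macro-F1, the lower (higher-recall) one wins. On the issue's two-class example this sketch selects threshold 0.0 with macro precision 0.75 and recall 1.0.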

tests/test_basic.py

  • Added pytest-style extended-metrics tests with a @pytest.fixture (coco_gt_dt_with_fp) that builds and returns the (coco_gt, coco_dt) COCO object pair:
    • test_extended_metrics_precision_not_overestimated — regression test for the exact bug scenario (2 classes, one with sub-ceiling FPs); runs evaluate()/accumulate()/summarize() inside the test and asserts precision=0.75, recall=1.0 using pytest.approx
    • test_extended_metrics_perfect_predictions — sanity check: all-TP case yields precision=recall=1.0

BC-breaking (Optional)

extended_metrics["precision"] and ["recall"] values will change for datasets where any class has detections below the recall ceiling. The new values are correct; the old values were over-estimating precision. map and all AP/AR stats are unaffected.

Checklist

  1. Pre-commit or other linting tools are used to fix potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
  3. If the modification has potential influence on downstream projects, this PR should be tested with downstream projects such as MMDet or MMCls.
  4. The documentation has been modified accordingly, e.g. docstrings or example tutorials.
Original prompt

This section details the original issue to resolve

<issue_title>Bug: extended_metrics over-estimates precision due to monotone PR-curve interpolation</issue_title>
<issue_description>## Summary

COCOeval_faster.extended_metrics reads precision from eval["precision"], which stores the
monotone-decreasing interpolated PR curve used for AP computation.
That value is the maximum precision achievable at recall ≥ r — not the actual precision
when all predictions at or above confidence threshold t are included.

False positives that appear below the recall ceiling (i.e. after every GT is already matched)
are invisible to the interpolated curve, so precision is silently over-estimated and the
F1-optimal confidence threshold is mis-identified.

Affected property

COCOeval_faster.extended_metrics (faster_eval_api.py, lines ~243–254)

# current (buggy) code
prec_raw = P[iou50_idx, :, :, area_idx, maxdet_idx]
prec = prec_raw.copy().astype(float)
prec[prec < 0] = np.nan
f1_cls   = 2 * prec * rec_thrs[:, None] / (prec + rec_thrs[:, None])
f1_macro = np.nanmean(f1_cls, axis=1)
best_j   = int(f1_macro.argmax())
macro_precision = float(np.nanmean(prec[best_j]))   # ← reads interpolated value
macro_recall    = float(rec_thrs[best_j])

Why the interpolated array is wrong for F1

COCO's eval["precision"] is computed as:

P_interp[r] = max(actual_precision at all recall r' ≥ r)

This is the correct value for AP (area under the monotone envelope), but wrong for F1.

Concrete example:

  • Class 2 has 10 GTs, 10 TPs (conf 0.50–0.95), and 10 FPs (conf 0.00–0.45).
  • The 10th TP arrives at conf=0.50 → recall=1.0, precision=1.0.
  • The 10 FPs land at conf < 0.50. They do not increase recall, so COCO's
    interpolation collapses them: P_interp[recall=1.0] = 1.0.
  • At confidence threshold 0.0 (include everything), actual precision = 10/20 = 0.5.
  • extended_metrics sees P=1.0 and reports macro-precision = mean(1.0, 1.0) = 1.0.
  • Correct answer: macro-precision = mean(1.0, 0.5) = 0.75.
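The collapse can be reproduced in a few lines of NumPy. This is an illustrative sketch of the COCO-style interpolation (a running max from the right, followed by a recall-indexed lookup), not the library's actual code:

```python
import numpy as np

# Class 2's ranked detections: 10 TPs (conf 0.50-0.95) then 10 FPs (conf < 0.50).
is_tp = np.array([1] * 10 + [0] * 10, dtype=float)
cum_tp = np.cumsum(is_tp)
precision = cum_tp / np.arange(1, 21)   # actual precision after each detection
recall = cum_tp / 10                    # reaches 1.0 at the 10th detection

# COCO-style monotone envelope: precision at recall r becomes the maximum
# precision at any recall >= r.
p_smoothed = np.maximum.accumulate(precision[::-1])[::-1]
rec_thrs = np.linspace(0.0, 1.0, 101)
inds = np.minimum(np.searchsorted(recall, rec_thrs, side="left"), len(recall) - 1)
p_interp = p_smoothed[inds]

print(p_interp[-1])    # 1.0 -- interpolated precision at recall 1.0
print(precision[-1])   # 0.5 -- actual precision with every detection included
```

The recall-indexed lookup lands on the 10th detection (the first point where recall reaches 1.0), so the 10 trailing FPs never influence the interpolated value.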

Reproducible example

import math
import numpy as np
from faster_coco_eval.core import COCO
from faster_coco_eval.core import COCOeval_faster as COCOeval

SIZE, SPACING, ROW = 200, 250, 260

def make_gt(ann_id, image_id, cat_id, bbox):
    return {"id": ann_id, "image_id": image_id, "category_id": cat_id,
            "bbox": bbox, "area": bbox[2]*bbox[3], "iscrowd": 0}

def make_dt(image_id, cat_id, bbox, score):
    return {"image_id": image_id, "category_id": cat_id, "bbox": bbox, "score": score}

def contained_box(gt_box, iou):
    x, y, s, _ = gt_box
    p = s * math.sqrt(iou)
    off = (s - p) / 2
    return [x + off, y + off, p, p]

image_id = 1
anns, dets = [], []
ann_id = 1

# Class 1: 10 GTs, 10 TPs, 0 FPs
for i, conf in enumerate([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]):
    gt = [float(i*SPACING), 0.0, float(SIZE), float(SIZE)]
    anns.append(make_gt(ann_id, image_id, 1, gt))
    dets.append(make_dt(image_id, 1, contained_box(gt, 0.96), conf))
    ann_id += 1

# Class 2: 10 GTs, 10 TPs, 10 FPs
for i, conf in enumerate([0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.50]):
    gt = [float(i*SPACING), float(ROW), float(SIZE), float(SIZE)]
    anns.append(make_gt(ann_id, image_id, 2, gt))
    dets.append(make_dt(image_id, 2, contained_box(gt, 0.96), conf))
    ann_id += 1

# Class 2 FPs: placed in a separate row with no GT overlap
for i, conf in enumerate([0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05, 0.00]):
    dets.append(make_dt(image_id, 2, [float(i*SPACING), float(2*ROW), float(SIZE), float(SIZE)], conf))

coco_gt = COCO()
coco_gt.dataset = {
    "images": [{"id": image_id, "width": 10*SPACING, "height": 3*ROW}],
    "annotations": anns,
    "categories": [{"id": 1, "name": "cat1"}, {"id": 2, "name": "cat2"}],
}
coco_gt.createIndex()
coco_dt = coco_gt.loadRes(dets)

coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()

m = coco_eval.extended_metrics
print(f"precision : {m['precision']:.4f}  (expected 0.75)")
print(f"recall    : {m['recall']:.4f}  (expected 1.00)")

Expected vs actual output

precision : 1.0000  (expected 0.75)   ← WRONG
recall    : 0.7500  (expected 1.00)   ← WRONG

Hand-verified expected values

At the F1-optimal confidence threshold (0.0 — include all predictions):

| Class   | TP | FP | FN | Precision | Recall | F1    |
|---------|----|----|----|-----------|--------|-------|
| Class 1 | 10 | 0  | 0  | 1.000     | 1.000  | 1.000 |
| Class 2 | 10 | 10 | 0  | 0.500     | 1.000  | 0.667 |
| Macro   |    |    |    | 0.750     | 1.000  | 0.833 |
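As a quick sanity check, the table's numbers follow directly from the standard precision/recall/F1 definitions (a throwaway helper, not project code):

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw TP/FP/FN counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

p1, r1, f1 = prf(10, 0, 0)    # class 1 -> 1.000, 1.000, 1.000
p2, r2, f2 = prf(10, 10, 0)   # class 2 -> 0.500, 1.000, 0.667
print((p1 + p2) / 2)          # macro precision 0.750
print((r1 + r2) / 2)          # macro recall    1.000
print((f1 + f2) / 2)          # macro F1        0.833...
```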

Fix direction

The fix requires reconstructing the actual (non-interpolated) PR curve from raw
annotation data, then sweeping candidate confidence thresholds to find the one that
maximises macro-F1. The interpolated eval["precision"] array should remain in use
for AP computation only.



Copilot AI and others added 2 commits February 19, 2026 13:33
Co-authored-by: Borda <6035284+Borda@users.noreply.github.com>
Co-authored-by: Borda <6035284+Borda@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix extended_metrics precision over-estimation in COCOeval_faster Fix extended_metrics precision over-estimation due to monotone PR-curve interpolation Feb 19, 2026
@Borda Borda marked this pull request as ready for review February 19, 2026 13:45
Copilot AI review requested due to automatic review settings February 19, 2026 13:45
Contributor

Copilot AI left a comment


Pull request overview

This PR fixes a critical bug in extended_metrics where precision was over-estimated due to using COCO's interpolated precision-recall curve. The interpolated curve hides false positives below the recall ceiling, causing incorrect F1-optimal threshold selection.

Changes:

  • Replaced interpolated-precision sweep with confidence-threshold sweep using actual TP/FP counts from eval["matched"]
  • Added comprehensive regression tests demonstrating the bug and verifying the fix
  • Updated documentation to clarify the algorithm change

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
|------|-------------|
| faster_coco_eval/core/faster_eval_api.py | Core fix: replaces interpolated precision with actual per-threshold precision computation using detection-GT matches from eval["matched"] |
| tests/test_basic.py | Adds TestExtendedMetrics class with two tests: one demonstrating the over-estimation bug and one verifying perfect-prediction behavior |


Co-authored-by: Borda <6035284+Borda@users.noreply.github.com>
Copilot AI requested a review from Borda February 19, 2026 13:50
Co-authored-by: Borda <6035284+Borda@users.noreply.github.com>
Copilot AI requested a review from Borda February 19, 2026 14:03
@Borda Borda requested a review from MiXaiLL76 February 19, 2026 15:38
@MiXaiLL76
Copy link
Owner

Thanks for the implementation, it's awesome.
I ported extended_metrics from the earlier rf-detr, so I didn't test it much and only used it a few times in my implementation.

I think it's cool that it's possible to move the validation engine part to a library, thereby reducing the engine's codebase.

@MiXaiLL76 MiXaiLL76 merged commit 33ab609 into main Feb 21, 2026
12 checks passed
@MiXaiLL76 MiXaiLL76 mentioned this pull request Feb 21, 2026
7 tasks
@Borda Borda deleted the copilot/fix-precision-estimation-bug branch February 22, 2026 15:36


Development

Successfully merging this pull request may close these issues.

Bug: extended_metrics over-estimates precision due to monotone PR-curve interpolation

3 participants