Fix extended_metrics precision over-estimation due to monotone PR-curve interpolation #74

Merged
MiXaiLL76 merged 5 commits into main from copilot/fix-precision-estimation-bug
Feb 21, 2026
Conversation

Contributor

Copilot AI commented Feb 19, 2026

COCOeval_faster.extended_metrics was reading precision from eval["precision"] — the monotone envelope P_interp[r] = max(actual_precision at recall ≥ r). This is correct for AP but wrong for F1: FPs below the recall ceiling are invisible to the interpolation, silently inflating precision and misidentifying the F1-optimal confidence threshold.

# Before fix – FPs at low confidence are hidden by interpolation
precision : 1.0000  (expected 0.75)
recall    : 1.0000  (expected 1.00)

# After fix
precision : 0.7500
recall    : 1.0000

Motivation

When a class has FPs at confidence scores below its recall ceiling, COCO's interpolated PR curve collapses them away. extended_metrics was using those inflated values for F1 computation rather than actual per-threshold precision.

Modification

faster_coco_eval/core/faster_eval_api.py

  • Replaced the interpolated-precision sweep in extended_metrics with an actual-precision confidence-threshold sweep:
    • Identifies TPs from eval["matched"] (IoU ≥ 0.5)
    • Builds per-class sorted score arrays with cumulative TP counts
    • Iterates thresholds ascending (most inclusive first); first threshold achieving max macro-F1 wins — breaking ties in favor of higher recall
    • Uses np.searchsorted for O(log n) per-class counting
  • AP computation (map@50, map@50:95) is unchanged — interpolated precision remains correct for area-under-curve
  • Removed dead score_vec variable
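The steps above can be sketched as follows. This is a minimal, hypothetical reconstruction of the sweep (the function and variable names are illustrative, not the actual code in `faster_eval_api.py`), assuming per-class detection scores with TP/FP labels and per-class ground-truth counts:

```python
import numpy as np

def best_macro_f1_threshold(scores_per_class, is_tp_per_class, n_gt_per_class):
    """Sweep every candidate confidence threshold; return (threshold,
    macro_precision, macro_recall) at the macro-F1 optimum."""
    per_class = []
    for scores, is_tp in zip(scores_per_class, is_tp_per_class):
        order = np.argsort(scores)                 # ascending scores
        s = np.asarray(scores, dtype=float)[order]
        tp = np.asarray(is_tp, dtype=float)[order]
        suffix_tp = np.cumsum(tp[::-1])[::-1]      # TPs with score >= s[i]
        per_class.append((s, suffix_tp))

    best = (-1.0, None, None, None)                # (macro_f1, thr, prec, rec)
    for t in np.unique(np.concatenate(scores_per_class)):  # ascending thresholds
        f1s, precs, recs = [], [], []
        for (s, suffix_tp), n_gt in zip(per_class, n_gt_per_class):
            i = np.searchsorted(s, t, side="left")  # O(log n) per class
            n_det = len(s) - i                      # detections kept at threshold t
            tp = suffix_tp[i] if n_det else 0.0
            p = tp / n_det if n_det else 0.0
            r = tp / n_gt if n_gt else 0.0
            f1s.append(2 * p * r / (p + r) if p + r else 0.0)
            precs.append(p)
            recs.append(r)
        macro_f1 = float(np.mean(f1s))
        if macro_f1 > best[0]:  # strict '>': ties keep the lowest (most inclusive)
            best = (macro_f1, float(t), float(np.mean(precs)), float(np.mean(recs)))
    return best[1], best[2], best[3]
```

Iterating thresholds in ascending order with a strict `>` comparison realizes the tie-break described above: when two thresholds reach the same macro-F1, the lower (higher-recall) one wins. On the issue's two-class example this sketch selects threshold 0.0 with macro precision 0.75 and recall 1.0.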

tests/test_basic.py

  • Added pytest-style extended-metrics tests with a @pytest.fixture (coco_gt_dt_with_fp) that builds and returns the (coco_gt, coco_dt) COCO object pair:
    • test_extended_metrics_precision_not_overestimated — regression test for the exact bug scenario (2 classes, one with sub-ceiling FPs); runs evaluate()/accumulate()/summarize() inside the test and asserts precision=0.75, recall=1.0 using pytest.approx
    • test_extended_metrics_perfect_predictions — sanity check: all-TP case yields precision=recall=1.0

BC-breaking (Optional)

extended_metrics["precision"] and ["recall"] values will change for datasets where any class has detections below the recall ceiling. The new values are correct; the old values were over-estimating precision. map and all AP/AR stats are unaffected.

Checklist

  1. Pre-commit or other linting tools are used to fix potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
  3. If the modification has potential influence on downstream projects, this PR should be tested with downstream projects such as MMDet or MMCls.
  4. The documentation has been modified accordingly, e.g. docstrings or example tutorials.
Original prompt

This section details the original issue to resolve

<issue_title>Bug: extended_metrics over-estimates precision due to monotone PR-curve interpolation</issue_title>
<issue_description>## Summary

COCOeval_faster.extended_metrics reads precision from eval["precision"], which stores the
monotone-decreasing interpolated PR curve used for AP computation.
That value is the maximum precision achievable at recall ≥ r — not the actual precision
when all predictions at or above confidence threshold t are included.

False positives that appear below the recall ceiling (i.e. after every GT is already matched)
are invisible to the interpolated curve, so precision is silently over-estimated and the
F1-optimal confidence threshold is mis-identified.

Affected property

COCOeval_faster.extended_metrics (faster_eval_api.py, lines ~243–254)

# current (buggy) code
prec_raw = P[iou50_idx, :, :, area_idx, maxdet_idx]
prec = prec_raw.copy().astype(float)
prec[prec < 0] = np.nan
f1_cls   = 2 * prec * rec_thrs[:, None] / (prec + rec_thrs[:, None])
f1_macro = np.nanmean(f1_cls, axis=1)
best_j   = int(f1_macro.argmax())
macro_precision = float(np.nanmean(prec[best_j]))   # ← reads interpolated value
macro_recall    = float(rec_thrs[best_j])

Why the interpolated array is wrong for F1

COCO's eval["precision"] is computed as:

P_interp[r] = max(actual_precision at all recall r' ≥ r)

This is the correct value for AP (area under the monotone envelope), but wrong for F1.

Concrete example:

  • Class 2 has 10 GTs, 10 TPs (conf 0.50–0.95), and 10 FPs (conf 0.00–0.45).
  • The 10th TP arrives at conf=0.50 → recall=1.0, precision=1.0.
  • The 10 FPs land at conf < 0.50. They do not increase recall, so COCO's
    interpolation collapses them: P_interp[recall=1.0] = 1.0.
  • At confidence threshold 0.0 (include everything), actual precision = 10/20 = 0.5.
  • extended_metrics sees P=1.0 and reports macro-precision = mean(1.0, 1.0) = 1.0.
  • Correct answer: macro-precision = mean(1.0, 0.5) = 0.75.
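The collapse can be reproduced in a few lines of NumPy. This is an illustrative sketch of the COCO-style interpolation (a running max from the right, followed by a recall-indexed lookup), not the library's actual code:

```python
import numpy as np

# Class 2's ranked detections: 10 TPs (conf 0.50-0.95) then 10 FPs (conf < 0.50).
is_tp = np.array([1] * 10 + [0] * 10, dtype=float)
cum_tp = np.cumsum(is_tp)
precision = cum_tp / np.arange(1, 21)   # actual precision after each detection
recall = cum_tp / 10                    # reaches 1.0 at the 10th detection

# COCO-style monotone envelope: precision at recall r becomes the maximum
# precision at any recall >= r.
p_smoothed = np.maximum.accumulate(precision[::-1])[::-1]
rec_thrs = np.linspace(0.0, 1.0, 101)
inds = np.minimum(np.searchsorted(recall, rec_thrs, side="left"), len(recall) - 1)
p_interp = p_smoothed[inds]

print(p_interp[-1])    # 1.0 -- interpolated precision at recall 1.0
print(precision[-1])   # 0.5 -- actual precision with every detection included
```

The recall-indexed lookup lands on the 10th detection (the first point where recall reaches 1.0), so the 10 trailing FPs never influence the interpolated value.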

Reproducible example

import math
import numpy as np
from faster_coco_eval.core import COCO
from faster_coco_eval.core import COCOeval_faster as COCOeval

SIZE, SPACING, ROW = 200, 250, 260

def make_gt(ann_id, image_id, cat_id, bbox):
    return {"id": ann_id, "image_id": image_id, "category_id": cat_id,
            "bbox": bbox, "area": bbox[2]*bbox[3], "iscrowd": 0}

def make_dt(image_id, cat_id, bbox, score):
    return {"image_id": image_id, "category_id": cat_id, "bbox": bbox, "score": score}

def contained_box(gt_box, iou):
    x, y, s, _ = gt_box
    p = s * math.sqrt(iou)
    off = (s - p) / 2
    return [x + off, y + off, p, p]

image_id = 1
anns, dets = [], []
ann_id = 1

# Class 1: 10 GTs, 10 TPs, 0 FPs
for i, conf in enumerate([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]):
    gt = [float(i*SPACING), 0.0, float(SIZE), float(SIZE)]
    anns.append(make_gt(ann_id, image_id, 1, gt))
    dets.append(make_dt(image_id, 1, contained_box(gt, 0.96), conf))
    ann_id += 1

# Class 2: 10 GTs, 10 TPs, 10 FPs
for i, conf in enumerate([0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.50]):
    gt = [float(i*SPACING), float(ROW), float(SIZE), float(SIZE)]
    anns.append(make_gt(ann_id, image_id, 2, gt))
    dets.append(make_dt(image_id, 2, contained_box(gt, 0.96), conf))
    ann_id += 1

# Class 2 FPs: placed in a separate row with no GT overlap
for i, conf in enumerate([0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05, 0.00]):
    dets.append(make_dt(image_id, 2, [float(i*SPACING), float(2*ROW), float(SIZE), float(SIZE)], conf))

coco_gt = COCO()
coco_gt.dataset = {
    "images": [{"id": image_id, "width": 10*SPACING, "height": 3*ROW}],
    "annotations": anns,
    "categories": [{"id": 1, "name": "cat1"}, {"id": 2, "name": "cat2"}],
}
coco_gt.createIndex()
coco_dt = coco_gt.loadRes(dets)

coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()

m = coco_eval.extended_metrics
print(f"precision : {m['precision']:.4f}  (expected 0.75)")
print(f"recall    : {m['recall']:.4f}  (expected 1.00)")

Expected vs actual output

precision : 1.0000  (expected 0.75)   ← WRONG
recall    : 0.7500  (expected 1.00)   ← WRONG

Hand-verified expected values

At the F1-optimal confidence threshold (0.0 — include all predictions):

| Class   | TP | FP | FN | Precision | Recall | F1    |
|---------|----|----|----|-----------|--------|-------|
| Class 1 | 10 | 0  | 0  | 1.000     | 1.000  | 1.000 |
| Class 2 | 10 | 10 | 0  | 0.500     | 1.000  | 0.667 |
| Macro   |    |    |    | 0.750     | 1.000  | 0.833 |
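As a quick sanity check, the table's numbers follow directly from the standard precision/recall/F1 definitions (a throwaway helper, not project code):

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw TP/FP/FN counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

p1, r1, f1 = prf(10, 0, 0)    # class 1 -> 1.000, 1.000, 1.000
p2, r2, f2 = prf(10, 10, 0)   # class 2 -> 0.500, 1.000, 0.667
print((p1 + p2) / 2)          # macro precision 0.750
print((r1 + r2) / 2)          # macro recall    1.000
print((f1 + f2) / 2)          # macro F1        0.833...
```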

Fix direction

The fix requires reconstructing the actual (non-interpolated) PR curve from raw
annotation data, then sweeping candidate confidence thresholds to find the one that
maximises macro-F1. The interpolated eval["precision"] array should remain in use
for AP computation only.



Copilot AI and others added 2 commits February 19, 2026 13:33
Co-authored-by: Borda <6035284+Borda@users.noreply.github.com>
Co-authored-by: Borda <6035284+Borda@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix extended_metrics precision over-estimation in COCOeval_faster Fix extended_metrics precision over-estimation due to monotone PR-curve interpolation Feb 19, 2026
@Borda Borda marked this pull request as ready for review February 19, 2026 13:45
Copilot AI review requested due to automatic review settings February 19, 2026 13:45
Contributor

Copilot AI left a comment


Pull request overview

This PR fixes a critical bug in extended_metrics where precision was over-estimated due to using COCO's interpolated precision-recall curve. The interpolated curve hides false positives below the recall ceiling, causing incorrect F1-optimal threshold selection.

Changes:

  • Replaced interpolated-precision sweep with confidence-threshold sweep using actual TP/FP counts from eval["matched"]
  • Added comprehensive regression tests demonstrating the bug and verifying the fix
  • Updated documentation to clarify the algorithm change

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
|------|-------------|
| faster_coco_eval/core/faster_eval_api.py | Core fix: replaces interpolated precision with actual per-threshold precision computation using detection-GT matches from eval["matched"] |
| tests/test_basic.py | Adds TestExtendedMetrics class with two tests: one demonstrating the over-estimation bug and one verifying perfect-prediction behavior |


Co-authored-by: Borda <6035284+Borda@users.noreply.github.com>
Copilot AI requested a review from Borda February 19, 2026 13:50
Co-authored-by: Borda <6035284+Borda@users.noreply.github.com>
Copilot AI requested a review from Borda February 19, 2026 14:03
@Borda Borda requested a review from MiXaiLL76 February 19, 2026 15:38
@MiXaiLL76
Copy link
Owner

Thanks for the implementation, it's awesome.
I ported extended_metrics from the earlier rf-detr, so I didn't test it much and only used it a few times in my implementation.

I think it's cool that it's possible to move the validation engine part to a library, thereby reducing the engine's codebase.

@MiXaiLL76 MiXaiLL76 merged commit 33ab609 into main Feb 21, 2026
12 checks passed
@MiXaiLL76 MiXaiLL76 mentioned this pull request Feb 21, 2026
7 tasks
@Borda Borda deleted the copilot/fix-precision-estimation-bug branch February 22, 2026 15:36


Development

Successfully merging this pull request may close these issues.

Bug: extended_metrics over-estimates precision due to monotone PR-curve interpolation

3 participants