Conversation

@daphne-cornelisse commented Jan 14, 2026

Description

We observed a discrepancy between the WOSAC meta-scores from the original implementation and those produced by PufferDrive. This PR resolves the majority of these discrepancies by addressing the bugs below.

Updates

  • The random agent baseline now matches the WOSAC random agent (independent Gaussian noise; description added to the docs).
  • Three main fixes that lead to a more representative score (see "Summary of known differences" below).
  • Evaluations were rerun and the baseline tables updated.
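As a point of reference, an independent-Gaussian-noise random baseline can be sketched as below. This is a minimal illustration only: the function name, state layout, and noise scale `sigma` are hypothetical and not PufferDrive's actual implementation or the exact WOSAC 2023 specification.

```python
import numpy as np

def random_agent_rollout(initial_states, num_steps, sigma=0.1, seed=0):
    """Roll out a random baseline: perturb each agent's state with
    i.i.d. Gaussian noise at every step (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    states = np.asarray(initial_states, dtype=np.float64)
    trajectory = [states.copy()]
    for _ in range(num_steps):
        # Independent Gaussian noise per agent and per state dimension.
        states = states + rng.normal(0.0, sigma, size=states.shape)
        trajectory.append(states.copy())
    return np.stack(trajectory)  # shape: (num_steps + 1, *state_shape)

traj = random_agent_rollout(np.zeros((4, 2)), num_steps=10)
print(traj.shape)  # (11, 4, 2)
```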

Updated baseline table and scores

The baseline scores are now more representative:

[Screenshot: updated baseline score table, 2026-01-16]

The goal-conditioned self-play RL policy achieves a meta-score of ~0.67, close to the 0.629 reported in an earlier paper using a similar approach in GPUDrive (Table 2). The SMART meta-score almost exactly matches its score on the real WOSAC leaderboard. Note that the current scores may still be slightly optimistic relative to the original implementation; this optimism plausibly stems from the known differences listed below.

Summary of known differences and status


🟢: Resolved in this PR; 🔴: Outstanding.


  • 🟢 [status=fixed] Order of operations: We were exponentiating log-likelihood metrics before averaging; WOSAC averages the log-likelihoods first and then exponentiates.
  • 🟢 [status=fixed] PufferDrive computed TTC (time-to-collision) for all agents, whereas the original computes it only for vehicles.
  • 🟢 [status=fixed] PufferDrive used to multiply the final meta-score by the total weight (0.95) to compensate for the missing traffic light violation metric. Instead, we now use the weights from the WOSAC 2024 challenge, which did not include a traffic light metric.
  • 🔴 [status=diff] PufferDrive does not include the traffic light violation as part of the WOSAC meta-score.
  • 🔴 [status=diff] The original implementation uses the z-axis; PufferDrive omits it from metric computation since the z-axis is not currently supported.
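The order-of-operations fix matters numerically because the mean of exponentials differs from the exponential of the mean (Jensen's inequality: the former is always at least as large). A small worked example with illustrative log-likelihood values:

```python
import numpy as np

# Per-(agent, time) log-likelihoods for one scene (illustrative values).
log_liks = np.array([-0.5, -1.0, -3.0, -0.2])

# Old (buggy) order: exponentiate each value, then average.
buggy = np.mean(np.exp(log_liks))

# WOSAC order: average the log-likelihoods, then exponentiate.
correct = np.exp(np.mean(log_liks))

print(round(buggy, 4), round(correct, 4))  # 0.4607 0.3088
```

The buggy order systematically inflates the score, consistent with the previously optimistic meta-scores.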

@daphne-cornelisse added the bug, benchmarking, and documentation labels Jan 14, 2026
// Maximum number of agents per scene
#ifndef MAX_AGENTS
// #define MAX_AGENTS 32  // original value
#define MAX_AGENTS 64     // temporarily raised for debugging
#endif

Author: Note to self: remember to revert MAX_AGENTS

@daphne-cornelisse daphne-cornelisse changed the title Fix: Fully align WOSAC metric calculation with original Fix: Align WOSAC metric calculation with original Jan 16, 2026
@daphne-cornelisse daphne-cornelisse changed the title Fix: Align WOSAC metric calculation with original Fixes: Align WOSAC metric calculation with original Jan 16, 2026
@daphne-cornelisse daphne-cornelisse marked this pull request as ready for review January 16, 2026 23:35
@greptile-apps bot commented Jan 16, 2026

Greptile Summary

This PR fixes three critical bugs that were causing discrepancies between PufferDrive's WOSAC metric calculation and the original implementation.

Key Changes:

  • Fixed exponentiation order: Log-likelihoods are now averaged first, then exponentiated (lines 387-513 in evaluator.py). Previously, the code exponentiated first and then averaged, which does not match the WOSAC definition and, by Jensen's inequality, inflates the resulting scores.
  • TTC filtering for vehicles only: Time-to-collision metric now correctly filters for vehicles only using the new is_vehicle field (line 418 in evaluator.py), matching the original WOSAC implementation.
  • Updated to WOSAC 2024 weights: Configuration updated to use 2024 challenge weights where traffic light violation has weight 0.0 (was 0.05 in 2025) and TTC weight increased from 0.05 to 0.1. Weights now sum to 1.0, so the weight normalization division was removed from _compute_metametric() (line 51 in evaluator.py).

Additional improvements:

  • Removed agent shrinking code (0.7x width/length multiplier) in drive.h:1210-1212
  • Fixed SDC initialization to handle cases where SDC index is -1 in drive.h:1270-1278
  • Renamed use_all_maps to sequential_map_sampling for clarity across all files
  • Added random agent baseline documentation matching WOSAC 2023 paper specification

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The changes are well-tested bug fixes that align PufferDrive with the reference WOSAC implementation. The three main fixes address specific mathematical and filtering issues with clear justification. The code changes are clean, focused, and include corresponding updates to documentation and evaluation scripts. The agent shrinking removal and SDC initialization fix are also valid improvements.
  • No files require special attention

Important Files Changed

Filename Overview
pufferlib/ocean/benchmark/evaluator.py Fixed exponentiation order (log-likelihood averaged then exponentiated), added TTC vehicle-only filtering, removed weight normalization
pufferlib/ocean/benchmark/wosac.ini Updated to WOSAC 2024 weights (TTC weight 0.05→0.1, traffic light 0.05→0.0), weights now sum to 1.0
pufferlib/ocean/drive/drive.h Removed agent shrinking code (0.7x width/length), fixed SDC initialization logic, added is_vehicle field to ground truth
pufferlib/ocean/drive/drive.py Renamed use_all_maps to sequential_map_sampling, added is_vehicle field to trajectory data structure
pufferlib/ocean/env_binding.h Added is_vehicle parameter to ground truth trajectory functions (8th parameter)
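The `is_vehicle` field threaded through these files enables vehicle-only TTC aggregation. A simplified sketch of the masking step, assuming a boolean flag per agent (the field name comes from this PR; the aggregation itself is a stand-in for the real metric pipeline):

```python
import numpy as np

def average_ttc_for_vehicles(ttc_values, is_vehicle):
    """Average TTC over vehicles only, ignoring pedestrians/cyclists."""
    ttc_values = np.asarray(ttc_values, dtype=float)
    mask = np.asarray(is_vehicle, dtype=bool)
    return float(ttc_values[mask].mean())

# Two vehicles (TTC 2.0 and 6.0) and one non-vehicle agent (excluded).
print(average_ttc_for_vehicles([2.0, 4.0, 6.0], [True, False, True]))  # 4.0
```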

Sequence Diagram

sequenceDiagram
    participant Evaluator as WOSACEvaluator
    participant Env as PufferEnv
    participant Metrics as metrics module
    participant Estimators as estimators module
    
    Evaluator->>Env: collect_ground_truth_trajectories()
    Env->>Env: get_global_ground_truth_trajectories()
    Note over Env: Now includes is_vehicle field
    Env-->>Evaluator: trajectories with is_vehicle
    
    Evaluator->>Env: collect_simulated_trajectories()
    Env-->>Evaluator: simulated trajectories
    
    Evaluator->>Evaluator: compute_metrics()
    
    Note over Evaluator: Compute log-likelihoods for each metric
    Evaluator->>Estimators: log_likelihood_estimate()
    Estimators-->>Evaluator: log-likelihood values
    
    Note over Evaluator: Average log-likelihoods over time (per agent)
    Evaluator->>Metrics: _reduce_average_with_validity()
    Note over Metrics: For TTC: filter by is_vehicle
    Metrics-->>Evaluator: averaged log-likelihoods
    
    Note over Evaluator: Group by scenario_id and average
    Evaluator->>Evaluator: df.groupby("scenario_id").mean()
    
    Note over Evaluator: NEW: Exponentiate AFTER averaging
    Evaluator->>Evaluator: np.exp(scene_level_results)
    
    Note over Evaluator: Compute weighted meta-score
    Evaluator->>Evaluator: _compute_metametric()
    Note over Evaluator: NEW: No weight normalization (sum=1.0)
    Evaluator-->>Evaluator: final WOSAC score

@daphne-cornelisse daphne-cornelisse merged commit 088eb17 into 2.0 Jan 17, 2026
14 checks passed
@daphne-cornelisse daphne-cornelisse deleted the wbd/wosac_debug branch January 17, 2026 01:23