Conversation

@daphne-cornelisse commented Jan 14, 2026

Description

We observed a discrepancy between the WOSAC meta-scores from the original implementation and those produced by PufferDrive. This PR resolves the majority of these discrepancies by addressing the bugs below.

Updates

  • The random agent baseline now matches the WOSAC random agent (independent Gaussian noise; description added to the docs).
  • Three main fixes that lead to a more representative score (see "Summary of known differences" below).
  • Evaluations were rerun and the baseline tables updated.
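As a point of reference, an independent-Gaussian-noise random baseline can be sketched as below. This is a minimal illustration only: the function name, state layout, and noise scale `sigma` are hypothetical and not PufferDrive's actual implementation or the exact WOSAC 2023 specification.

```python
import numpy as np

def random_agent_rollout(initial_states, num_steps, sigma=0.1, seed=0):
    """Roll out a random baseline: perturb each agent's state with
    i.i.d. Gaussian noise at every step (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    states = np.asarray(initial_states, dtype=np.float64)
    trajectory = [states.copy()]
    for _ in range(num_steps):
        # Independent Gaussian noise per agent and per state dimension.
        states = states + rng.normal(0.0, sigma, size=states.shape)
        trajectory.append(states.copy())
    return np.stack(trajectory)  # shape: (num_steps + 1, *state_shape)

traj = random_agent_rollout(np.zeros((4, 2)), num_steps=10)
print(traj.shape)  # (11, 4, 2)
```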

Updated baseline table and scores

The baseline scores are now more representative:

[Screenshot: updated baseline score table, 2026-01-16]

The goal-conditioned self-play RL policy achieves a meta-score of ~0.67, close to the 0.629 reported in an earlier paper using a similar approach in GPUDrive (Table 2). The SMART meta-score almost exactly matches its score on the real WOSAC leaderboard. Note that the current scores may still be slightly optimistic relative to the original implementation; this optimism plausibly stems from the known differences listed below.

Summary of known differences and status


🟢: Resolved in this PR; 🔴: Outstanding.


  • 🟢 [status=fixed] Order of operations: We were exponentiating log-likelihood metrics before averaging; WOSAC averages the log-likelihoods first and then exponentiates.
  • 🟢 [status=fixed] PufferDrive computed TTC (time-to-collision) for all agents, whereas the original computes it only for vehicles.
  • 🟢 [status=fixed] PufferDrive used to multiply the final meta-score by the total weight (0.95) to compensate for the missing traffic light violation metric. Instead, we now use the weights from the WOSAC 2024 challenge, which did not include a traffic light metric.
  • 🔴 [status=diff] PufferDrive does not include the traffic light violation as part of the WOSAC meta-score.
  • 🔴 [status=diff] The original implementation uses the z-axis; PufferDrive omits it from metric computation since the z-axis is not currently supported.
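The order-of-operations fix matters numerically because the mean of exponentials differs from the exponential of the mean (Jensen's inequality: the former is always at least as large). A small worked example with illustrative log-likelihood values:

```python
import numpy as np

# Per-(agent, time) log-likelihoods for one scene (illustrative values).
log_liks = np.array([-0.5, -1.0, -3.0, -0.2])

# Old (buggy) order: exponentiate each value, then average.
buggy = np.mean(np.exp(log_liks))

# WOSAC order: average the log-likelihoods, then exponentiate.
correct = np.exp(np.mean(log_liks))

print(round(buggy, 4), round(correct, 4))  # 0.4607 0.3088
```

The buggy order systematically inflates the score, consistent with the previously optimistic meta-scores.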

@daphne-cornelisse added the bug, benchmarking, and documentation labels Jan 14, 2026
// Maximum number of agents per scene
#ifndef MAX_AGENTS
// #define MAX_AGENTS 32  // original value
#define MAX_AGENTS 64     // temporarily raised for debugging
#endif

Author: Note to self: remember to revert MAX_AGENTS

@daphne-cornelisse daphne-cornelisse changed the title Fix: Fully align WOSAC metric calculation with original Fix: Align WOSAC metric calculation with original Jan 16, 2026
@daphne-cornelisse daphne-cornelisse changed the title Fix: Align WOSAC metric calculation with original Fixes: Align WOSAC metric calculation with original Jan 16, 2026
@daphne-cornelisse daphne-cornelisse marked this pull request as ready for review January 16, 2026 23:35
@greptile-apps bot commented Jan 16, 2026

Greptile Summary

This PR fixes three critical bugs that were causing discrepancies between PufferDrive's WOSAC metric calculation and the original implementation.

Key Changes:

  • Fixed exponentiation order: Log-likelihoods are now averaged first, then exponentiated (lines 387-513 in evaluator.py). Previously, the code exponentiated first and then averaged, which does not match the WOSAC definition and, by Jensen's inequality, inflates the resulting scores.
  • TTC filtering for vehicles only: Time-to-collision metric now correctly filters for vehicles only using the new is_vehicle field (line 418 in evaluator.py), matching the original WOSAC implementation.
  • Updated to WOSAC 2024 weights: Configuration updated to use 2024 challenge weights where traffic light violation has weight 0.0 (was 0.05 in 2025) and TTC weight increased from 0.05 to 0.1. Weights now sum to 1.0, so the weight normalization division was removed from _compute_metametric() (line 51 in evaluator.py).

Additional improvements:

  • Removed agent shrinking code (0.7x width/length multiplier) in drive.h:1210-1212
  • Fixed SDC initialization to handle cases where SDC index is -1 in drive.h:1270-1278
  • Renamed use_all_maps to sequential_map_sampling for clarity across all files
  • Added random agent baseline documentation matching WOSAC 2023 paper specification

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The changes are well-tested bug fixes that align PufferDrive with the reference WOSAC implementation. The three main fixes address specific mathematical and filtering issues with clear justification. The code changes are clean, focused, and include corresponding updates to documentation and evaluation scripts. The agent shrinking removal and SDC initialization fix are also valid improvements.
  • No files require special attention

Important Files Changed

Filename Overview
pufferlib/ocean/benchmark/evaluator.py Fixed exponentiation order (log-likelihood averaged then exponentiated), added TTC vehicle-only filtering, removed weight normalization
pufferlib/ocean/benchmark/wosac.ini Updated to WOSAC 2024 weights (TTC weight 0.05→0.1, traffic light 0.05→0.0), weights now sum to 1.0
pufferlib/ocean/drive/drive.h Removed agent shrinking code (0.7x width/length), fixed SDC initialization logic, added is_vehicle field to ground truth
pufferlib/ocean/drive/drive.py Renamed use_all_maps to sequential_map_sampling, added is_vehicle field to trajectory data structure
pufferlib/ocean/env_binding.h Added is_vehicle parameter to ground truth trajectory functions (8th parameter)
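The `is_vehicle` field threaded through these files enables vehicle-only TTC aggregation. A simplified sketch of the masking step, assuming a boolean flag per agent (the field name comes from this PR; the aggregation itself is a stand-in for the real metric pipeline):

```python
import numpy as np

def average_ttc_for_vehicles(ttc_values, is_vehicle):
    """Average TTC over vehicles only, ignoring pedestrians/cyclists."""
    ttc_values = np.asarray(ttc_values, dtype=float)
    mask = np.asarray(is_vehicle, dtype=bool)
    return float(ttc_values[mask].mean())

# Two vehicles (TTC 2.0 and 6.0) and one non-vehicle agent (excluded).
print(average_ttc_for_vehicles([2.0, 4.0, 6.0], [True, False, True]))  # 4.0
```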

Sequence Diagram

sequenceDiagram
    participant Evaluator as WOSACEvaluator
    participant Env as PufferEnv
    participant Metrics as metrics module
    participant Estimators as estimators module
    
    Evaluator->>Env: collect_ground_truth_trajectories()
    Env->>Env: get_global_ground_truth_trajectories()
    Note over Env: Now includes is_vehicle field
    Env-->>Evaluator: trajectories with is_vehicle
    
    Evaluator->>Env: collect_simulated_trajectories()
    Env-->>Evaluator: simulated trajectories
    
    Evaluator->>Evaluator: compute_metrics()
    
    Note over Evaluator: Compute log-likelihoods for each metric
    Evaluator->>Estimators: log_likelihood_estimate()
    Estimators-->>Evaluator: log-likelihood values
    
    Note over Evaluator: Average log-likelihoods over time (per agent)
    Evaluator->>Metrics: _reduce_average_with_validity()
    Note over Metrics: For TTC: filter by is_vehicle
    Metrics-->>Evaluator: averaged log-likelihoods
    
    Note over Evaluator: Group by scenario_id and average
    Evaluator->>Evaluator: df.groupby("scenario_id").mean()
    
    Note over Evaluator: NEW: Exponentiate AFTER averaging
    Evaluator->>Evaluator: np.exp(scene_level_results)
    
    Note over Evaluator: Compute weighted meta-score
    Evaluator->>Evaluator: _compute_metametric()
    Note over Evaluator: NEW: No weight normalization (sum=1.0)
    Evaluator-->>Evaluator: final WOSAC score

@daphne-cornelisse daphne-cornelisse merged commit 088eb17 into 2.0 Jan 17, 2026
14 checks passed
@daphne-cornelisse daphne-cornelisse deleted the wbd/wosac_debug branch January 17, 2026 01:23