
Conversation

@rjrock rjrock commented Dec 8, 2025

Purpose

To run an API-only server with headless engines, data_parallel_size_local must be ignored when computing the configuration hash shared by data-parallel workers.

Test Plan

TP_SIZE=1 DP_SIZE=4 pytest -v -s v1/distributed/test_internal_lb_dp.py::test_api_only_multinode_dp_completion

Alternatively, run the commands from the data parallel deployment docs:

# Node 0  (with ip address 10.99.48.128)
vllm serve $MODEL --data-parallel-size 4 --data-parallel-size-local 0 \
                  --data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345
# Node 1
vllm serve $MODEL --headless --data-parallel-size 4 --data-parallel-size-local 4 \
                  --data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345

Without ignoring data_parallel_size_local, the commands above fail.

Test Result

Before:

(APIServer pid=926136) RuntimeError: Configuration mismatch detected for engine 1.

All DP workers must have identical configurations for parameters that affect collective communication
(e.g., `enable_eplb`, `eplb_config.log_balancedness`).

Worker hash:
  abc38510dd3e898f9219159008ba01f04cacb70d751fc20869a08d9e633c1d55

Expected hash:
  c3bede6721a040bfb0d95bb4a38aaceb15db8ab521d1328e8b18c8508b551f22

Resolution:
  Ensure all workers are started with identical command-line arguments.

After: No error
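
For reference, a minimal sketch of the hashing behavior this PR changes. The class, field set, and compute_hash below are a simplified stand-in, not vLLM's actual ParallelConfig; it only illustrates why excluding launch-time fields lets the API-only process (local size 0) and the headless engines (local size 4) agree on the hash.

# Illustrative stand-in for vLLM's ParallelConfig hashing; the real class
# and field set differ.
import hashlib
import json
from dataclasses import asdict, dataclass

# Per-node launch details that may legitimately differ across processes.
IGNORED_FIELDS = {
    "data_parallel_rank",
    "data_parallel_rank_local",
    "data_parallel_size_local",
}

@dataclass
class ParallelConfig:
    data_parallel_size: int = 1
    data_parallel_size_local: int = 1
    enable_eplb: bool = False  # example of a field that must match everywhere

    def compute_hash(self) -> str:
        # Hash only the fields that affect collective communication.
        fields = {
            k: v for k, v in asdict(self).items() if k not in IGNORED_FIELDS
        }
        payload = json.dumps(fields, sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

# The two launch commands above now hash identically:
api_only = ParallelConfig(data_parallel_size=4, data_parallel_size_local=0)
headless = ParallelConfig(data_parallel_size=4, data_parallel_size_local=4)
assert api_only.compute_hash() == headless.compute_hash()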




Signed-off-by: Ryan Rock <ryan.rock@amd.com>
@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request modifies the ParallelConfig to ignore data_parallel_size_local when computing the configuration hash for data-parallel workers. This is a necessary change to support running an API-only server with headless engines, where the API server can have data_parallel_size_local=0 while workers have a non-zero value. The change is correct and well-targeted. However, I've identified a potential high-severity issue where allowing data_parallel_size_local=0 could lead to a ZeroDivisionError in another part of the ParallelConfig class if a specific code path is taken. While it seems this is not an issue for the current use case, it's a latent bug that should be addressed to prevent future crashes.

# Derived/runtime topology, networking, or launch details
"data_parallel_rank",
"data_parallel_rank_local",
"data_parallel_size_local",

Severity: high

This change correctly allows running an API-only server with data_parallel_size_local=0. However, this change may expose a latent bug in the nnodes_within_dp property (line 417). This property calculates self.data_parallel_size // self.data_parallel_size_local without checking if data_parallel_size_local is zero. If a code path on the API server process were to access this property, it would result in a ZeroDivisionError. While your tests indicate this path isn't currently triggered for your use case, this creates a risk of crashes in other scenarios or future changes. A follow-up change to make nnodes_within_dp safe against division by zero would be advisable to improve robustness.
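
A possible shape for that follow-up, sketched on the assumption that nnodes_within_dp divides as described in the comment above; the actual vLLM property may differ.

# Hypothetical guard on ParallelConfig for the ZeroDivisionError described
# above; the real vLLM implementation may differ.
@property
def nnodes_within_dp(self) -> int:
    if self.data_parallel_size_local == 0:
        # An API-only server runs no local engines, so there is no local
        # engine count to divide by; treat it as a single node.
        return 1
    return self.data_parallel_size // self.data_parallel_size_local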


@ProExpertProg ProExpertProg left a comment


Does DP local size affect the computation graph in any way? If yes, we can't remove it from the hash, since it's used for compile caching.
