
Conversation

@rjrock rjrock commented Dec 8, 2025

Purpose

To run an API-only server with headless engines, data_parallel_size_local must be ignored when computing the configuration hash shared by data-parallel workers.

Test Plan

TP_SIZE=1 DP_SIZE=4 pytest -v -s v1/distributed/test_internal_lb_dp.py::test_api_only_multinode_dp_completion

Alternatively, run the commands from the data parallel deployment docs:

# Node 0  (with ip address 10.99.48.128)
vllm serve $MODEL --data-parallel-size 4 --data-parallel-size-local 0 \
                  --data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345
# Node 1
vllm serve $MODEL --headless --data-parallel-size 4 --data-parallel-size-local 4 \
                  --data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345

Without ignoring data_parallel_size_local, the commands above fail.

Test Result

Before:

(APIServer pid=926136) RuntimeError: Configuration mismatch detected for engine 1.

All DP workers must have identical configurations for parameters that affect collective communication
(e.g., `enable_eplb`, `eplb_config.log_balancedness`).

Worker hash:
  abc38510dd3e898f9219159008ba01f04cacb70d751fc20869a08d9e633c1d55

Expected hash:
  c3bede6721a040bfb0d95bb4a38aaceb15db8ab521d1328e8b18c8508b551f22

Resolution:
  Ensure all workers are started with identical command-line arguments.

After: No error
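
For reference, a minimal sketch of the hashing behavior this PR changes. The class, field set, and compute_hash below are a simplified stand-in, not vLLM's actual ParallelConfig; it only illustrates why excluding launch-time fields lets the API-only process (local size 0) and the headless engines (local size 4) agree on the hash.

# Illustrative stand-in for vLLM's ParallelConfig hashing; the real class
# and field set differ.
import hashlib
import json
from dataclasses import asdict, dataclass

# Per-node launch details that may legitimately differ across processes.
IGNORED_FIELDS = {
    "data_parallel_rank",
    "data_parallel_rank_local",
    "data_parallel_size_local",
}

@dataclass
class ParallelConfig:
    data_parallel_size: int = 1
    data_parallel_size_local: int = 1
    enable_eplb: bool = False  # example of a field that must match everywhere

    def compute_hash(self) -> str:
        # Hash only the fields that affect collective communication.
        fields = {
            k: v for k, v in asdict(self).items() if k not in IGNORED_FIELDS
        }
        payload = json.dumps(fields, sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

# The two launch commands above now hash identically:
api_only = ParallelConfig(data_parallel_size=4, data_parallel_size_local=0)
headless = ParallelConfig(data_parallel_size=4, data_parallel_size_local=4)
assert api_only.compute_hash() == headless.compute_hash()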




Signed-off-by: Ryan Rock <ryan.rock@amd.com>
@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request modifies the ParallelConfig to ignore data_parallel_size_local when computing the configuration hash for data-parallel workers. This is a necessary change to support running an API-only server with headless engines, where the API server can have data_parallel_size_local=0 while workers have a non-zero value. The change is correct and well-targeted. However, I've identified a potential high-severity issue where allowing data_parallel_size_local=0 could lead to a ZeroDivisionError in another part of the ParallelConfig class if a specific code path is taken. While it seems this is not an issue for the current use case, it's a latent bug that should be addressed to prevent future crashes.

# Derived/runtime topology, networking, or launch details
"data_parallel_rank",
"data_parallel_rank_local",
"data_parallel_size_local",

Severity: high

This change correctly allows running an API-only server with data_parallel_size_local=0. However, this change may expose a latent bug in the nnodes_within_dp property (line 417). This property calculates self.data_parallel_size // self.data_parallel_size_local without checking if data_parallel_size_local is zero. If a code path on the API server process were to access this property, it would result in a ZeroDivisionError. While your tests indicate this path isn't currently triggered for your use case, this creates a risk of crashes in other scenarios or future changes. A follow-up change to make nnodes_within_dp safe against division by zero would be advisable to improve robustness.
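
A possible shape for that follow-up, sketched on the assumption that nnodes_within_dp divides as described in the comment above; the actual vLLM property may differ.

# Hypothetical guard on ParallelConfig for the ZeroDivisionError described
# above; the real vLLM implementation may differ.
@property
def nnodes_within_dp(self) -> int:
    if self.data_parallel_size_local == 0:
        # An API-only server runs no local engines, so there is no local
        # engine count to divide by; treat it as a single node.
        return 1
    return self.data_parallel_size // self.data_parallel_size_local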


@ProExpertProg ProExpertProg left a comment


Does DP local size affect the computation graph in any way? If yes, we can't remove it from the hash, since it's used for compile caching.
