-
-
Notifications
You must be signed in to change notification settings - Fork 11.9k
[CI/Build] Ignore data_parallel_size_local #30281
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
In order to run an API only server with headless engines, the data_parallel_size_local factor needs to be ignored. Signed-off-by: Ryan Rock <ryan.rock@amd.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request modifies the ParallelConfig to ignore data_parallel_size_local when computing the configuration hash for data-parallel workers. This is a necessary change to support running an API-only server with headless engines, where the API server can have data_parallel_size_local=0 while workers have a non-zero value. The change is correct and well-targeted. However, I've identified a potential high-severity issue where allowing data_parallel_size_local=0 could lead to a ZeroDivisionError in another part of the ParallelConfig class if a specific code path is taken. While it seems this is not an issue for the current use case, it's a latent bug that should be addressed to prevent future crashes.
| # Derived/runtime topology, networking, or launch details | ||
| "data_parallel_rank", | ||
| "data_parallel_rank_local", | ||
| "data_parallel_size_local", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change correctly allows running an API-only server with data_parallel_size_local=0. However, this change may expose a latent bug in the nnodes_within_dp property (line 417). This property calculates self.data_parallel_size // self.data_parallel_size_local without checking if data_parallel_size_local is zero. If a code path on the API server process were to access this property, it would result in a ZeroDivisionError. While your tests indicate this path isn't currently triggered for your use case, this creates a risk of crashes in other scenarios or future changes. A follow-up change to make nnodes_within_dp safe against division by zero would be advisable to improve robustness.
ProExpertProg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does DP local size affect the computation graph in any way? If yes we can't remove it from the hash as it's used for compile caching.
Purpose
In order to run an API only server with headless engines, the data_parallel_size_local factor needs to be ignored.
Test Plan
TP_SIZE=1 DP_SIZE=4 pytest -v -s v1/distributed/test_internal_lb_dp.py::test_api_only_multinode_dp_completionAlternatively, by running commands in the data parallel deployment docs.
Without ignoring
data_parallel_size_localthe above commands fail.Test Result
Before:
After: No error
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.