Ray Scheduler Implementation for Single Controller #741
base: main
Conversation
Summary of Changes

Hello @HwVanICI, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the system's scalability by integrating Ray for distributed scheduling. It replaces the local scheduling mechanism with a Ray-based approach, allowing for seamless multi-node deployments of training and inference workloads. The core change involves the introduction of the new RayScheduler and RayRPCServer components.
Code Review
This pull request introduces a RayScheduler as a distributed alternative to the LocalScheduler, enabling multi-node deployments. The implementation includes the RayScheduler class for managing Ray actors and placement groups, and the RayRPCServer actor which wraps the training/inference engines. The code is well-structured, with robust error handling and retry mechanisms for remote calls. I've identified a minor bug in a log message and a typo, and have also suggested a small design improvement for handling worker ports to enhance maintainability. Overall, this is a solid first implementation of Ray-based scheduling.
).remote()

# 0 needed to pad the list as the trainer takes index 1 for ports
worker_ports = ["0", str(master_port)]
The use of a padded list ["0", str(master_port)] for worker_ports seems a bit brittle, as it relies on an implicit contract with the consumer (the trainer) about which index to use. For future improvements, consider using a more descriptive data structure like a dictionary ({"master_port": master_port}), or making the consumer more robust to handle different port list formats. This would make the code easier to understand and maintain.
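For illustration, a minimal sketch of the dictionary form suggested above; the consumer-side lookup is hypothetical and not code from this PR:

```python
# Name the port explicitly instead of relying on list position.
worker_ports = {"master_port": str(master_port)}

# Hypothetical consumer side (the real TrainController currently indexes a list):
port = int(worker_ports["master_port"])
```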
This is because the TrainController takes the port at index 1 instead of index 0.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
garrett4wade
left a comment
LGTM except for some minor style issues.
areal/utils/data.py (outdated)

  def tensor_container_to(
-     d: dict[str, Any] | torch.Tensor | list[torch.Tensor], *args, **kwargs
+     d: dict[str, Any] | torch.Tensor | list[torch.Tensor] | tuple[torch.Tensor],
The tuple annotation is not correct. It should be tuple[torch.Tensor, ...].
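For clarity, a sketch of the corrected annotation; `tuple[torch.Tensor, ...]` denotes a variable-length tuple of tensors (the function body is omitted here):

```python
from typing import Any

import torch


def tensor_container_to(
    d: dict[str, Any] | torch.Tensor | list[torch.Tensor] | tuple[torch.Tensor, ...],
    *args,
    **kwargs,
):
    ...
```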
if self._engine is None:
    raise RuntimeError("Engine not initialized. Call create_engine() first")

should_bcast = kwargs.pop("_should_bcast", True)
This key has been changed to should_broadcast.
try:
    fn = getattr(self._engine, method)
    result = fn(*args, **kwargs)
    if isinstance(result, Future):
        result = result.result()
    # put back to cpu to mimic RPCServer encode/decode
    result = tensor_container_to(result, "cpu")
    return result
except Exception as e:
    self.logger.error(
        f"RayRPCServer Engine method '{method}' failed: {e}\n"
        f"{traceback.format_exc()}"
    )
    raise
We may want to add some debug log with logger.debug around the engine creation and method calls.
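A minimal sketch of the suggested logger.debug calls around the method dispatch shown above; the exact wording and placement are assumptions:

```python
self.logger.debug(f"RayRPCServer: calling engine method '{method}'")
fn = getattr(self._engine, method)
result = fn(*args, **kwargs)
self.logger.debug(f"RayRPCServer: engine method '{method}' completed")
```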
def __init__(
    self,
    gpu_devices: list[int] | None = None,
    log_dir: str | None = None,
    startup_timeout: float = 30.0,
    health_check_interval: float = 1.0,
    *,
    fileroot: str | None = None,
    experiment_name: str | None = None,
    trial_name: str | None = None,
    exp_config: BaseExperimentConfig | None = None,
):
We don't have to maintain the same init APIs for different schedulers. If some parameters are not used, they can be removed.
areal/utils/device.py (outdated)

def ray_resource_type():
    if torch.cuda.is_available():
        return "GPU"

    from areal.platforms import is_npu_available

    if is_npu_available:
        return "NPU"

    return "CPU"
We probably want to move it to areal.platforms or inline it in the Ray scheduler.
areal/scheduler/ray.py (outdated)

options = self._actor_resource_spec(spec.cpu, spec.gpu, spec.mem)

env = get_env_vars(
    "", ",".join([f"{k}={v}" for k, v in spec.env_vars.items()])
We should pass in a cluster name, which should be an init argument of the Ray scheduler.
areal/scheduler/ray.py (outdated)

ref = wi.actor.call.remote(method, *args, **kwargs)
result = await asyncio.to_thread(ray.get, ref, timeout=http_timeout)
We can use the native Ray async APIs instead of threading, which may risk getting stuck.
ref: https://docs.ray.io/en/latest/ray-core/actors/async_api.html
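A sketch of the Ray-native pattern from the linked docs, reusing `wi.actor` and `http_timeout` from the snippet above; Ray ObjectRefs are awaitable, so the thread hop can be dropped:

```python
import asyncio

ref = wi.actor.call.remote(method, *args, **kwargs)
# Await the ObjectRef directly inside the async caller; wrap it with
# asyncio.wait_for to keep the existing timeout behaviour.
result = await asyncio.wait_for(ref, timeout=http_timeout)
```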
One more thing: it would be better if we wrote a unit test file to cover the basic functionality of RayScheduler.
I am putting this back to WIP to accommodate the RTensor changes.
To accommodate the RTensor changes, I will need to implement a RayRTensor class using the Ray Object Store instead of HTTP. I plan to refactor the RTensor class so that there is a BaseRTensor class that implements RTensor's shared staticmethods as classmethods; RTensor and RayRTensor would then both extend BaseRTensor. Functions such as …
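A rough skeleton of the layout described above; all names other than RTensor and RayRTensor are illustrative assumptions:

```python
import torch


class BaseRTensor:
    """Shared helpers, exposed as classmethods so both transports can reuse them."""

    @classmethod
    def to_local(cls, shards) -> torch.Tensor:
        # Hypothetical shared assembly logic; transport-specific fetching is
        # left to the subclasses below.
        raise NotImplementedError


class RTensor(BaseRTensor):
    """Fetches shard data over HTTP (the existing behaviour)."""


class RayRTensor(BaseRTensor):
    """Fetches shard data from the Ray Object Store instead of HTTP."""
```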
/gemini review |
Code Review
This pull request introduces a RayScheduler as a distributed alternative to the existing LocalScheduler, enabling multi-node deployments. The implementation includes a RayRPCServer actor to manage engine lifecycle and communication, and a RayRTensor for distributed tensor representation. The changes are extensive and well-structured, introducing a significant new capability. My review focuses on ensuring resource management is robust, identifying potential bugs in the new logic, and suggesting improvements for performance and maintainability. Key areas of feedback include improving the reliability of resource cleanup, fixing a bug in placement group handling, and enhancing type safety and test coverage.
def __del__(self):
    try:
        self.delete_workers()
    except Exception:
        pass
Using __del__ for resource cleanup is unreliable in Python, as its execution is not guaranteed, especially in the presence of reference cycles. This can lead to leaked Ray actors and placement groups. An explicit shutdown() method should be provided and called by the user to ensure proper resource release. The broad except Exception: pass also dangerously hides any errors that might occur during cleanup.
def shutdown(self):
    """Shuts down the scheduler and cleans up all associated workers and resources."""
    try:
        self.delete_workers()
    except Exception as e:
        logger.error(f"Error during RayScheduler shutdown: {e}", exc_info=True)
Explicit shutdown is done by calling delete_workers.
import ray
from ray.util.state import summarize_actors

from areal.api.cli_args import BaseExperimentConfig
from areal.api.scheduler_api import (
    Job,
    SchedulingSpec,
)
from areal.scheduler.ray import RayScheduler, ray_resource_type


class TestRaySchedulerInitialization:
    def test_init(self):
        scheduler = RayScheduler(
            startup_timeout=60.0, exp_config=BaseExperimentConfig()
        )
        assert scheduler.startup_timeout == 60.0


class TestWorkerCreationAndDeletion:
    def test_create_delete_workers(self):
        ray.init()

        config = BaseExperimentConfig()

        scheduler = RayScheduler(startup_timeout=60.0, exp_config=config)

        job = Job(
            replicas=2,
            role="train",
            tasks=[
                SchedulingSpec(
                    cpu=1,
                    mem=1024,
                    gpu=1,
                ),
                SchedulingSpec(
                    cpu=1,
                    mem=1024,
                    gpu=1,
                ),
            ],
        )

        # create workers
        worker_ids = scheduler.create_workers(job)
        assert len(worker_ids) == 2
        assert len(scheduler._workers["train"]) == 2

        actor_summary = summarize_actors()

        assert (
            actor_summary["cluster"]["summary"]["RayRPCServer"]["state_counts"]["ALIVE"]
            == 2
        )
        assert len(scheduler.get_workers("train")) == 2

        scheduler._ping_workers("train")

        # delete workers
        scheduler.delete_workers()
        assert len(scheduler._workers["train"]) == 0

        actor_summary = summarize_actors()
        assert (
            actor_summary["cluster"]["summary"]["RayRPCServer"]["state_counts"]["DEAD"]
            == 2
        )


class TestUtilityFunctions:
    def test_utilities(self):
        _num_gpu_per_node = 16
        config = BaseExperimentConfig()

        config.cluster.n_gpus_per_node = _num_gpu_per_node

        scheduler = RayScheduler(startup_timeout=60.0, exp_config=config)

        schedulings = [
            SchedulingSpec(
                cpu=1,
                mem=1024,
                gpu=1,
            ),
            SchedulingSpec(
                cpu=1,
                mem=1024,
                gpu=1,
            ),
        ]

        new_schedulings = scheduler._prepare_worker_specs("train", 2, schedulings)
        assert len(new_schedulings) == 2
        for spec in new_schedulings:
            assert spec.cpu == 1
            assert spec.mem == 1024
            assert spec.gpu == 1

        # case where only 1 spec is passed but multiple workers
        new_schedulings = scheduler._prepare_worker_specs("train", 2, schedulings[0:1])
        assert len(new_schedulings) == 2
        for spec in new_schedulings:
            assert spec.cpu == 1
            assert spec.mem == 1024
            assert spec.gpu == 1

        bundle_list = scheduler._create_bundle_list_gpu(1, 24, 1024)
        assert len(bundle_list) == 2
        for bundle in bundle_list:
            assert bundle[ray_resource_type()] <= _num_gpu_per_node
The added tests for RayScheduler cover basic initialization and worker creation/deletion, which is a good start. However, there is no test coverage for more complex and critical functionalities, such as call_engine, async_call_engine, and the RayRTensor logic. These are core components of the new Ray-based scheduling and should be tested to ensure correctness and prevent regressions.
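As an illustration of the kind of coverage being asked for, a hypothetical smoke test for call_engine; the call_engine signature and the "ping" method name are assumptions, not taken from the PR:

```python
import ray

from areal.api.cli_args import BaseExperimentConfig
from areal.api.scheduler_api import Job, SchedulingSpec
from areal.scheduler.ray import RayScheduler


def test_call_engine_smoke():
    ray.init(ignore_reinit_error=True)
    scheduler = RayScheduler(startup_timeout=60.0, exp_config=BaseExperimentConfig())
    job = Job(replicas=1, role="train", tasks=[SchedulingSpec(cpu=1, mem=1024, gpu=1)])
    scheduler.create_workers(job)
    # Hypothetical API: assumes call_engine(role, method, ...) forwards the call
    # to the RayRPCServer actor and returns the decoded result.
    result = scheduler.call_engine("train", "ping")
    assert result is not None
    scheduler.delete_workers()
    ray.shutdown()
```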
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…o ray_scheduler
I've performed the refactor and implemented the RayRTensor class. Should be ready for review again.
garrett4wade
left a comment
Hi @HwVanICI, thanks for the addition of RayRTensor, but currently there is too much redundant code in the two RTensor implementations. At the current stage we should consider improving the code quality before merging:
There's no need to create a base abstract class. Ray only differs in how the data is fetched and in the in-memory form of the tensor metadata; that is the only part we should extend. I suggest subclassing only the ShardInfo class and letting it provide the fetch functionality in a dependency-injection manner, so that we barely need to modify the top-level RTensor implementation.
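A hedged sketch of that direction, assuming a ShardInfo class with a per-shard fetch hook; all attribute and method names here are assumptions:

```python
from dataclasses import dataclass

import ray
import torch


@dataclass
class ShardInfo:
    """Existing HTTP-backed shard metadata (names assumed for illustration)."""

    address: str

    def fetch(self) -> torch.Tensor:
        # Existing behaviour: pull the shard bytes over HTTP from `address`.
        ...


@dataclass
class RayShardInfo(ShardInfo):
    """Ray-backed shard metadata: the payload lives in the Ray Object Store."""

    object_ref: ray.ObjectRef = None

    def fetch(self) -> torch.Tensor:
        # Dependency-injection point: RTensor only calls shard.fetch() and never
        # needs to know whether the data comes over HTTP or from the object store.
        return ray.get(self.object_ref)
```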
…stead of subclassing RTensor
Thanks for the suggestion. I have done as requested and updated the RayScheduler code to be compatible with the PPOTrainer changes.
Description
This PR is a first implementation of a single controller using Ray (RayScheduler), allowing for multi-node deployments as a distributed alternative to the pre-existing LocalScheduler.
This change introduces RayScheduler, implementing the Scheduler interface, and RayRPCServer, mimicking the RPCServer class.
RayScheduler
RayScheduler creates one Ray actor per training rank and one Ray actor per rollout instance. Ray handles the device assignment for each actor. The current design generates one placement group for training, with one placement group per rollout actor.
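For reference, a minimal sketch of the Ray placement-group pattern this design builds on; the bundle sizes and PACK strategy here are illustrative, not the PR's exact values:

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

# One placement group for training: one GPU bundle per training rank.
train_pg = placement_group([{"GPU": 1, "CPU": 1}] * 4, strategy="PACK")
ray.get(train_pg.ready())

# Each training actor is pinned to one bundle; Ray picks the concrete device.
actor_options = {
    "num_gpus": 1,
    "scheduling_strategy": PlacementGroupSchedulingStrategy(
        placement_group=train_pg, placement_group_bundle_index=0
    ),
}
```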
RayRPCServer
RayRPCServer is the Ray actor itself and resembles the RPCServer. Instead of communicating with the scheduler over HTTP, all communication is done through Ray remote calls.
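As an illustration of the remote-call path, a toy stand-in; the class name RayRPCServer comes from this PR, but the body and the call site below are assumptions:

```python
import ray


@ray.remote
class RayRPCServer:
    def call(self, method: str, *args, **kwargs):
        # In the PR this dispatches via getattr(self._engine, method);
        # here it just echoes the method name.
        return f"ran {method}"


server = RayRPCServer.remote()
# Scheduler side: a synchronous caller uses ray.get on the returned ObjectRef,
# while an async caller can simply await it.
result = ray.get(server.call.remote("train_step"))
```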
Caveats
The current implementation is tested on vLLM using the "mp" distributed executor backend. A follow-up design supporting the "ray" backend would be ideal, but there are currently some limitations preventing such an implementation due to how the EngineCore hooks are run.
Related Issue
Fix #661