Checklist
- The error occurs when using our provided Docker image.
- I can consistently reproduce the bug across multiple trials or random seeds.
- If the error causes experiment abortion, I've verified that this error is the root cause, not a secondary error caused by peer workers.
Detailed Information
Describe the bug
When following the AReaL debugging best practices and launching the inference and training processes separately, the inference server starts successfully with allocation_mode=sglang:d4p1t1, but the training process fails to start with allocation_mode=fsdp:d2p1t1.
Specifically, the training process crashes early with the following error:
ValueError: Invalid backend: None, expected sglang or vllm
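For context, below is a minimal, self-contained sketch (hypothetical names only, not the actual AReaL source) of the kind of backend check the traceback points to: PPOTrainer calls _init_rollout unconditionally, and a training-only allocation_mode such as fsdp:d2p1t1 carries no inference backend prefix, so the rollout initialization sees None and raises.

```python
# Hypothetical illustration only -- not the actual AReaL implementation.
# It shows why a training-only allocation_mode such as "fsdp:d2p1t1" leaves
# the rollout backend unset, matching the ValueError in the log below.

def parse_inference_backend(allocation_mode: str) -> str | None:
    """Return the inference backend prefix ("sglang"/"vllm") if present, else None."""
    prefix = allocation_mode.split(":", 1)[0]
    return prefix if prefix in ("sglang", "vllm") else None

def init_rollout(allocation_mode: str) -> str:
    backend = parse_inference_backend(allocation_mode)
    if backend not in ("sglang", "vllm"):
        raise ValueError(f"Invalid backend: {backend}, expected sglang or vllm")
    return backend

print(init_rollout("sglang:d4p1t1"))   # inference-only run: backend resolves to "sglang"

try:
    init_rollout("fsdp:d2p1t1")        # training-only run: no inference prefix
except ValueError as e:
    print(e)                           # "Invalid backend: None, expected sglang or vllm"
```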
Expected behavior
When running training and inference as separate processes, the training-only workers should start successfully with the FSDP backend.
Full logs
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
20251219-09:39:34.788 [FSDP Engine Rank 1] INFO: Initializing device mesh with parallel dims (dp=2, sp=1, tp=1, ep=1, etp=1, world_size=2).
20251219-09:39:34.788 [FSDP Engine Rank 0] INFO: Initializing device mesh with parallel dims (dp=2, sp=1, tp=1, ep=1, etp=1, world_size=2).
20251219-09:39:34.790 [FSDP Engine Rank 1] INFO: Data parallel head 1 and rank 1
[rank1]: Traceback (most recent call last):
[rank1]: File "/data/AReaL-0.5.1/examples/math/gsm8k_rl.py", line 62, in <module>
[rank1]: main(sys.argv[1:])
[rank1]: File "/data/AReaL-0.5.1/examples/math/gsm8k_rl.py", line 34, in main
[rank1]: with PPOTrainer(
[rank1]: ^^^^^^^^^^^
[rank1]: File "/data/AReaL-0.5.1/areal/experimental/trainer/rl.py", line 95, in __init__
[rank1]: self.rollout = self._init_rollout(config.rollout, is_eval=False)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/data/AReaL-0.5.1/areal/experimental/trainer/rl.py", line 413, in _init_rollout
[rank1]: raise ValueError(
[rank1]: ValueError: Invalid backend: None, expected sglang or vllm
20251219-09:39:34.791 [FSDP Engine Rank 0] INFO: Data parallel head 0 and rank 0
[rank0]: Traceback (most recent call last):
[rank0]: File "/data/AReaL-0.5.1/examples/math/gsm8k_rl.py", line 62, in <module>
[rank0]: main(sys.argv[1:])
[rank0]: File "/data/AReaL-0.5.1/examples/math/gsm8k_rl.py", line 34, in main
[rank0]: with PPOTrainer(
[rank0]: ^^^^^^^^^^^
[rank0]: File "/data/AReaL-0.5.1/areal/experimental/trainer/rl.py", line 95, in __init__
[rank0]: self.rollout = self._init_rollout(config.rollout, is_eval=False)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/data/AReaL-0.5.1/areal/experimental/trainer/rl.py", line 413, in _init_rollout
[rank0]: raise ValueError(
[rank0]: ValueError: Invalid backend: None, expected sglang or vllm
[rank0]:[W1219 09:39:35.433559464 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank1]:[W1219 09:39:36.464258515 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W1219 09:39:36.911000 5172 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 5307 closing signal SIGTERM
E1219 09:39:37.025000 5172 torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 5306) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 7, in <module>
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 143, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
examples/math/gsm8k_rl.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-12-19_09:39:36
host : li-X640-G40
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 5306)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
20251219-09:39:39.022 Local Scheduler INFO: Stopping local process with signal SIGTERM, pid: [5171]
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/data/AReaL-0.5.1/areal/launcher/local.py", line 419, in <module>
main()
File "/data/AReaL-0.5.1/areal/launcher/local.py", line 261, in main
local_main(config, run_id=0)
File "/data/AReaL-0.5.1/areal/launcher/local.py", line 413, in local_main
raise e
File "/data/AReaL-0.5.1/areal/launcher/local.py", line 389, in local_main
launcher.wait(
File "/data/AReaL-0.5.1/areal/launcher/local.py", line 236, in wait
raise JobException(
areal.utils.launcher.JobException: Job test_test:trainer JobState.COMPLETED at node local
20251219-09:39:39.972 Local Scheduler INFO: Waiting for 0 local running processes, pids:
To Reproduce
Commit ID
Please provide your Git commit ID.
Environment
- Released Docker image: ghcr.io/inclusionai/areal-runtime:v0.5.1
- GPUs: 4 × NVIDIA A10 (24GB)
Script
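# Step 1: start the inference server only (SGLang allocation); runs in the background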
nohup python -m areal.launcher.local examples/math/gsm8k_rl.py \
--config examples/math/gsm8k_grpo.yaml \
allocation_mode=sglang:d4p1t1 \
> llm_server.log 2>&1 &
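
# Step 2: start the training-only workers (FSDP allocation); this is the command that fails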
python3 -m areal.launcher.local examples/math/gsm8k_rl.py \
--config examples/math/gsm8k_grpo.yaml \
experiment_name=test \
trial_name=test \
allocation_mode=fsdp:d2p1t1 \
cluster.n_nodes=1 \
cluster.n_gpus_per_node=4 \
gconfig.max_new_tokens=2048 \
train_dataset.batch_size=1024 \
+sglang.attention_backend=triton