Description
Checklist
- The error occurs when using our provided Docker image.
- I can consistently reproduce the bug across multiple trials or random seeds.
- If the error causes experiment abortion, I've verified that this error is the root
cause, not a secondary error caused by peer workers.
Detailed Information
Describe the bug
During LoRA-based RL fine-tuning with vLLM v0.11.0, the training crashes with a KeyError in vLLM’s async output handling. Specifically, when finishing a request, vLLM attempts to remove a request ID from lora_stats.running_requests, but the key does not exist. This suggests that the LoRA request state is not properly tracked or has already been cleared.
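The immediate mechanism is Python's set semantics: set.remove raises KeyError when the element is absent. A minimal illustration of the double-finish scenario (hypothetical request ID, not vLLM's actual state handling):

running_requests = {"cmpl-example-0"}
running_requests.remove("cmpl-example-0")  # first finish: OK
running_requests.remove("cmpl-example-0")  # second finish: raises KeyError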
Expected behavior
LoRA RL fine-tuning should complete without errors, and request states should be consistently added to and removed from running_requests without raising a KeyError.
Potential fix
It seems we can fix this bug in AReal by managing the clearing of request state more carefully, though where this should be done is not yet clear. However, similar issues have been raised in vLLM itself due to the brittle nature of this code, and recent versions (v0.12.0+) have fixed these kinds of bugs by rewriting the logic more defensively. See vLLM PR #26801.
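For illustration only, here is a minimal sketch of that defensive pattern (hypothetical class and method names, not vLLM's actual implementation): replacing set.remove, which raises KeyError for a missing element, with set.discard, which is a no-op in that case, so finishing an already-cleared request cannot crash the output handler.

class LoRAStats:
    # Hypothetical stand-in for vLLM's per-LoRA stats tracking.
    def __init__(self) -> None:
        self.running_requests: set[str] = set()

    def add_request(self, request_id: str) -> None:
        self.running_requests.add(request_id)

    def finish_request(self, request_id: str) -> None:
        # discard() instead of remove(): a no-op if the id was
        # already removed, tolerating a double-finish.
        self.running_requests.discard(request_id)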
Full logs
ERROR [async_llm.py:480] AsyncLLM output_handler failed.
Traceback (most recent call last):
File ".../async_llm.py", line 457, in output_handler
processed_outputs = output_processor.process_outputs(...)
File ".../output_processor.py", line 470, in process_outputs
self._update_stats_from_finished(...)
File ".../output_processor.py", line 572, in _update_stats_from_finished
self.lora_states.finish_request(req_state)
File ".../stats.py", line 222, in finish_request
lora_stats.running_requests.remove(req_state.request_id)
KeyError: 'cmpl-247631329f8f4eaf905fbbe34a96b13c-0'
To Reproduce
Commit ID
N/A (observed on released versions)
Environment
- vLLM v0.11.0
- AReal v0.5.1
Script
python3 -m areal.launcher.local \
examples/lora/gsm8k_grpo_lora_vllm.py \
--config examples/lora/gsm8k_grpo_lora_vllm.yaml
Notes:
- The bug happens during LoRA RL fine-tuning with vLLM v0.11.0.
- The immediate cause is a missing request ID in running_requests when removing a finished request.
- This likely occurs because AReal does not properly clear or synchronize LoRA states with vLLM.
- Similar issues have been reported in vLLM and led to a rewrite of this logic (see vLLM PR #26801).
- In newer versions of vLLM, this issue appears to be resolved.