
Conversation

@TinLongYu

Description

This PR fixes a crash during LoRA-based RL fine-tuning with vLLM, caused by a KeyError raised while cleaning up finished LoRA requests.

It applies a simple defensive check before removing a request ID from running_requests and only affects the LoRA training pipeline: finish_request now verifies that req_state.request_id is in lora_stats.running_requests before removing it.

This patch can be removed once vLLM is updated to v0.12.0 or higher.
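For reference, a minimal sketch of the patched function as described (the body matches the hunk shown in the review thread below, copied from vLLM v0.11.0 with the membership check added; the inline comments are explanatory only):

def finish_request(self, req_state: 'RequestState'):
    # Requests without a LoRA adapter have nothing to clean up.
    if req_state.lora_name is None:
        return
    lora_stats = self.lora_name_to_stats[req_state.lora_name]
    # Defensive check: the request id may already be absent from
    # running_requests, which previously raised a KeyError.
    if req_state.request_id in lora_stats.running_requests:
        lora_stats.running_requests.remove(req_state.request_id)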

File changed

  • areal/thirdparty/vllm/areal_vllm_server.py

Related Issue

Fixes #751: [BUG] KeyError in vLLM LoRA request cleanup during LoRA RL fine-tuning (missing request_id in running_requests)

Testing

  • Tested xccl-based LoRA GSM8K training

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not
    work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • I have run relevant unit tests and they pass
  • I have added tests for new functionality
  • I have updated documentation if needed
  • My branch is up to date with main
  • This PR introduces breaking changes (if yes, fill out details below)
  • If this PR changes documentation, I have built and previewed it locally with
    jb build docs
  • No critical issues raised by AI reviewers (/gemini review)

Breaking Change Details (if applicable):

Additional Context


Need help? Check the Contributing Guide or ask in
GitHub Discussions!

@gemini-code-assist
Contributor

Summary of Changes

Hello @TinLongYu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical "KeyError" that occurred during the cleanup of LoRA requests within the vLLM integration, specifically impacting LoRA-based reinforcement learning fine-tuning. The change introduces a robust check in the finish_request function to prevent crashes by ensuring that a request ID is present before attempting its removal from tracking. This targeted bug fix enhances the stability of the LoRA training pipeline and is designed to be a temporary measure until a broader vLLM update is adopted.

Highlights

  • Bug Fix for LoRA Fine-tuning: Addresses a "KeyError" crash in vLLM during LoRA-based RL fine-tuning by adding a defensive check when cleaning up finished LoRA requests.
  • Enhanced Request Cleanup: Implements a check to ensure a request ID exists in lora_stats.running_requests before attempting to remove it, preventing unexpected crashes.
  • Temporary Patch: This specific patch is intended as a temporary solution and may be removed once the project updates to vLLM v0.12.0 or newer.



@gemini-code-assist bot left a comment (Contributor)

Code Review

This pull request addresses a KeyError crash during LoRA-based fine-tuning by adding a defensive check in the finish_request function. The change is a targeted fix to prevent removing a request ID that is no longer in the running_requests list. The approach of monkey-patching vllm's LoRARequestStates is a reasonable short-term solution as noted. My review includes a suggestion to further improve the robustness of the finish_request function by also handling cases where the LoRA name might not be present in lora_name_to_stats, preventing another potential KeyError.

Comment on lines +343 to +345
lora_stats = self.lora_name_to_stats[req_state.lora_name]
if req_state.request_id in lora_stats.running_requests:
    lora_stats.running_requests.remove(req_state.request_id)
Contributor

Severity: medium

The current implementation at line 343 (self.lora_name_to_stats[req_state.lora_name]) could still raise a KeyError if req_state.lora_name is not present in the lora_name_to_stats dictionary. To make this function more robust and prevent another potential crash, it's better to use the .get() method for safer dictionary access. This also allows for a more concise check.

Suggested change
-lora_stats = self.lora_name_to_stats[req_state.lora_name]
-if req_state.request_id in lora_stats.running_requests:
-    lora_stats.running_requests.remove(req_state.request_id)
+lora_stats = self.lora_name_to_stats.get(req_state.lora_name)
+if lora_stats and req_state.request_id in lora_stats.running_requests:
+    lora_stats.running_requests.remove(req_state.request_id)

Comment on lines +340 to +345
def finish_request(self, req_state: 'RequestState'):
    if req_state.lora_name is None:
        return
    lora_stats = self.lora_name_to_stats[req_state.lora_name]
    if req_state.request_id in lora_stats.running_requests:
        lora_stats.running_requests.remove(req_state.request_id)
Collaborator

LGTM, but could you add a comment about the change? We may want to note: (1) this is a simple copy from vLLM v0.11.0 with an added if-condition that checks whether the request is still running; (2) this patch can be removed once vLLM is updated to v0.12.0 or higher.

A better solution may be to apply the patch only under the following condition:

from areal.utils import pkg_version

if not pkg_version.is_version_higher("vllm", "0.11.1"):
    setattr(...)  # apply the patch

FYI, we wanted to update vLLM and SGLang, but they did not support the same PyTorch version until the latest vLLM 0.14.0rc1. Hopefully we will update both versions once 0.14.0 is officially released.
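For illustration, a hypothetical sketch of that version-gated patch. The pkg_version check is taken from the snippet above; the import path for vLLM's LoRARequestStates is an assumption and may differ across vLLM releases:

from areal.utils import pkg_version

def _patched_finish_request(self, req_state):
    # Copied from vLLM v0.11.0, plus a membership check so that finishing
    # an already-removed request no longer raises a KeyError.
    if req_state.lora_name is None:
        return
    lora_stats = self.lora_name_to_stats[req_state.lora_name]
    if req_state.request_id in lora_stats.running_requests:
        lora_stats.running_requests.remove(req_state.request_id)

# Only apply the patch on vLLM versions that still need it; per the discussion
# above it can be dropped once vLLM is updated to v0.12.0 or higher.
if not pkg_version.is_version_higher("vllm", "0.11.1"):
    # NOTE: hypothetical import path; reuse whatever reference to
    # LoRARequestStates areal_vllm_server.py already has.
    from vllm.v1.metrics.stats import LoRARequestStates
    setattr(LoRARequestStates, "finish_request", _patched_finish_request)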

