
Conversation

@dhh1995 (Collaborator) commented Dec 17, 2025

Description

This example demonstrates how to train customer service agents on $\tau^2$-Bench with AReaL's PPO/GRPO training pipeline. $\tau^2$-Bench provides realistic customer service simulation environments across multiple domains (retail, airline, telecom) in which agents must fulfill user requests both by calling agent tools and by guiding users to use their own tools.


Curve for training reward on telecom-small subset.

Related Issue

Fixes #(issue)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not
    work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • I have run relevant unit tests and they pass
  • I have added tests for new functionality
  • I have updated documentation if needed
  • My branch is up to date with main
  • This PR introduces breaking changes (if yes, fill out details below)
  • If this PR changes documentation, I have built and previewed it locally with
    jb build docs
  • No critical issues raised by AI reviewers (/gemini review)

Breaking Change Details (if applicable):

Additional Context


Need help? Check the Contributing Guide or ask in
GitHub Discussions!

@gemini-code-assist (Contributor) commented
Summary of Changes

Hello @dhh1995, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new, self-contained example for training customer service agents within the tau2-bench simulation framework, leveraging AReaL's reinforcement learning capabilities. It includes necessary adjustments to the proxy server's reward handling for improved robustness and flexibility, alongside a new workflow to orchestrate training episodes. The addition of dedicated configuration and utility files streamlines the setup and execution of these complex training scenarios, making it easier to reproduce and extend the example.

Highlights

  • New Example: Tau2-Bench Training: Introduced a comprehensive example demonstrating how to train customer service agents using the tau2-bench simulation environment with AReaL's PPO/GRPO training pipeline.
  • Enhanced Reward Setting Logic: Improved the reward setting mechanism in proxy.py to gracefully handle cases where an interaction ID is not explicitly provided, defaulting to the last interaction and adding robust error logging for missing interactions.
  • Flexible Reward Processing: Modified proxy_utils.py to allow the reward function to return additional information alongside the reward value, enhancing data capture during training episodes.
  • New Workflow for Episode Management: Implemented a Tau2Workflow in tau2_train.py to manage the execution of training episodes, including parallel processing of agent runs and structured dumping of simulation results.
  • Dedicated Configuration and Utilities: Added config.yaml for detailed configuration of the tau2-bench training and tau2_utils.py for defining data models and environment-specific settings, ensuring a modular and organized example.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a new example for training customer service agents using the tau2-bench, which is a significant and valuable addition. The overall structure is well-designed, with clear separation of concerns between the agent logic, training script, and utilities. The modifications to existing proxy utilities, such as improved error handling and more flexible function returns, are also commendable. My review focuses on enhancing the new example files by addressing minor issues in documentation, improving code clarity, and suggesting small optimizations for better maintainability.


The code is modified from the [proxy](../experimental/proxy/README.md) example so that the training workflow (`tau2_train.py`) and the agent runner script (`tau2_agent.py`) can be decoupled, with common utilities in `tau2_utils.py`.

* `tau2_train.py`:

medium

The description for tau2_train.py is incomplete. Please add a brief explanation of its role to improve the documentation's clarity and help users understand the example's structure.


## Notes

1. When using litellm with multiprocessing, the `Queue bound to different event loop` error may occur. See also: [litellm issue #17813](https://github.com/BerriAI/litellm/issues/17813). This will not stop the training, but will make the outputs hard to read. You may use `grep -aivE "loop|queue|\^|asyncio|litellm"` to filter out the error messages before this issue is fixed.

medium

There appears to be a typo in the litellm issue number. Issue #17813 does not exist. The correct issue number is likely #1781, which discusses the Queue bound to different event loop error. Please correct the link to ensure it points to the correct resource.

Suggested change
1. When using litellm with multiprocessing, the `Queue bound to different event loop` error may occur. See also: [litellm issue #17813](https://github.com/BerriAI/litellm/issues/17813). This will not stop the training, but will make the outputs hard to read. You may use `grep -aivE "loop|queue|\^|asyncio|litellm"` to filter out the error messages before this issue is fixed.
1. When using litellm with multiprocessing, the `Queue bound to different event loop` error may occur. See also: [litellm issue #1781](https://github.com/BerriAI/litellm/issues/1781). This will not stop the training, but will make the outputs hard to read. You may use `grep -aivE "loop|queue|\^|asyncio|litellm"` to filter out the error messages before this issue is fixed.

Comment on lines +34 to +38
tasks: list[Task] = registry.get_tasks_loader(domain)(split)
for task in tasks:
    if task.id == task_id:
        return task
raise ValueError(f"No task found with id {task_id} for domain {domain}")

medium

The current implementation iterates through the list of tasks to find a match, which has a time complexity of O(n). For better performance, especially if the number of tasks grows, consider converting the list of tasks into a dictionary for O(1) lookups.

Suggested change
tasks: list[Task] = registry.get_tasks_loader(domain)(split)
for task in tasks:
    if task.id == task_id:
        return task
raise ValueError(f"No task found with id {task_id} for domain {domain}")
tasks: list[Task] = registry.get_tasks_loader(domain)(split)
task_map = {task.id: task for task in tasks}
if task_id not in task_map:
    raise ValueError(f"No task found with id {task_id} for domain {domain}")
return task_map[task_id]
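If the same domain/split pair is queried repeatedly, the dictionary itself can also be built once and memoized rather than rebuilt per lookup. A hedged sketch of that variant, with a stand-in `registry` (the real tau2-bench loader API may differ; only the loader shape is assumed here):

```python
from functools import lru_cache
from types import SimpleNamespace

# Stand-in for the real tau2-bench registry; only the loader call shape matters.
def _demo_loader(split):
    return [SimpleNamespace(id=f"{split}-{i}") for i in range(3)]

registry = SimpleNamespace(get_tasks_loader=lambda domain: _demo_loader)


@lru_cache(maxsize=None)
def _task_map(domain: str, split: str) -> dict:
    # Build the id -> task index once per (domain, split) pair.
    tasks = registry.get_tasks_loader(domain)(split)
    return {task.id: task for task in tasks}


def get_task(domain: str, split: str, task_id: str):
    task_map = _task_map(domain, split)
    if task_id not in task_map:
        raise ValueError(f"No task found with id {task_id} for domain {domain}")
    return task_map[task_id]
```

This trades a small amount of memory for O(1) lookups on every call after the first.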

Comment on lines +68 to +78
# * Backup: use acreate to replace acompletion
# async def _acreate(*args, **kwargs):
#     kwargs.pop("num_retries", None)
#     completion = await client.chat.completions.create(*args, **kwargs)
#     return completion

# async def _acreate_with_base_url(*args, **kwargs):
#     kwargs.pop("num_retries", None)
#     async with AsyncOpenAI(base_url=self.econfig.user_llm_base_url) as client:
#         completion = await client.chat.completions.create(*args, **kwargs)
#     return completion

medium

This block of commented-out code appears to be a backup or alternative implementation. To improve code clarity and maintainability, it's best to remove such code. If this logic is important for reference, consider moving it to the PR description or a separate document.


# Dump info to file
if "task_id" in data:
    real_task_id = data["task_id"][:120] + "-" + task_id

medium

The slice [:120] uses a magic number to truncate the task ID. This could be confusing for future readers. To improve readability and maintainability, please add a comment explaining why the task ID is being truncated (e.g., to prevent overly long filenames), or define 120 as a named constant.
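Following the suggestion above, the truncation could be named and documented. A hypothetical sketch (the constant name, its rationale, and the `dump_filename` helper are illustrative, not the PR's actual code):

```python
# Hypothetical named constant replacing the bare 120: truncating the task ID
# keeps the combined dump filename comfortably under common 255-byte
# filesystem name limits.
MAX_TASK_ID_PREFIX_LEN = 120


def dump_filename(data: dict, task_id: str) -> str:
    """Build a dump-file identifier from the data's task_id plus a suffix."""
    if "task_id" in data:
        return data["task_id"][:MAX_TASK_ID_PREFIX_LEN] + "-" + task_id
    return task_id
```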
