
Conversation

@ishanrajsingh
Contributor

Description

Fixes #3833

When running evals, the system was using root_agent directly instead of going through the configured App, which meant plugins like ReflectAndRetryToolPlugin were being skipped.

Changes Made

1. local_eval_service.py

  • Added app: Optional['App'] parameter to LocalEvalService.__init__()
  • Modified _perform_inference_single_eval_item() to use App when available
  • Routes inference through the App so plugin behavior is preserved (a sketch of this wiring follows the list)

2. evaluation_generator.py

  • Added _generate_inferences_from_app() static method
  • Ensures plugins are applied during evaluation

3. cli_tools_click.py

  • Added _load_app_from_module() helper function
  • Modified eval command to load and pass App to LocalEvalService
  • Falls back to root_agent if no App is defined
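
A condensed sketch of how these pieces fit together. This is not the verbatim patch: the App import path, the `app` attribute looked up on the agent module, and the method signatures are assumptions based on the description above.

```python
import importlib
from typing import Any, Optional

from google.adk.apps import App
from google.adk.evaluation.evaluation_generator import EvaluationGenerator


def _load_app_from_module(agent_module_path: str) -> Optional[App]:
  """Returns the App exported by the agent module, or None if absent."""
  agent_module = importlib.import_module(agent_module_path)
  app = getattr(agent_module, 'app', None)
  return app if isinstance(app, App) else None


class LocalEvalService:
  """Only the parts this PR touches are sketched."""

  def __init__(self, root_agent: Any, app: Optional[App] = None):
    self._root_agent = root_agent
    # When an App is supplied, inference is routed through it so that
    # App-level plugins (e.g. ReflectAndRetryToolPlugin) run during evals.
    self._app = app

  async def _perform_inference_single_eval_item(self, eval_case: Any) -> Any:
    if self._app is not None:
      return await EvaluationGenerator._generate_inferences_from_app(
          app=self._app, eval_case=eval_case)
    # Backward-compatible path for agents that define no App.
    return await EvaluationGenerator._generate_inferences_from_root_agent(
        root_agent=self._root_agent, eval_case=eval_case)
```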

Testing

✅ Tested locally with agent that has ReflectAndRetryToolPlugin
✅ Verified plugins are registered during eval (log shows: "Plugin 'request_intercepter_plugin' registered")
✅ Backward compatible (agents without App still work)

Behavior

  • Before: Evals always used root_agent directly, skipping all App-level plugins
  • After: Evals use App (with plugins) when available, falling back to root_agent if no App is defined
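
For reference, a minimal sketch of the new generator path, assuming the Runner, session-service, and EvalCase APIs from google.adk; exact signatures in the patch may differ.

```python
from google.adk.apps import App
from google.adk.evaluation.eval_case import EvalCase
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService


async def _generate_inferences_from_app(app: App, eval_case: EvalCase) -> list:
  """Runs each invocation of the eval case through a Runner built from the App."""
  session_service = InMemorySessionService()
  session = await session_service.create_session(
      app_name=app.name, user_id='test_user_id')
  # Handing the App's plugins to the Runner is what makes them fire during
  # evals; building a Runner from root_agent alone skips them.
  runner = Runner(
      app_name=app.name,
      agent=app.root_agent,
      plugins=app.plugins,
      session_service=session_service,
  )
  final_responses = []
  for invocation in eval_case.conversation:
    async for event in runner.run_async(
        user_id=session.user_id,
        session_id=session.id,
        new_message=invocation.user_content,
    ):
      if event.is_final_response() and event.content:
        final_responses.append(event.content)
  return final_responses
```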

@gemini-code-assist
Contributor

Summary of Changes

Hello @ishanrajsingh, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical issue where the evaluation system was not properly engaging with the App framework, leading to the omission of crucial App-level plugins during evaluation runs. By integrating the App into the evaluation flow, this change ensures that evaluations accurately reflect the agent's behavior, including any plugin-driven logic, thereby improving the reliability and completeness of evaluation results.

Highlights

  • Plugin Integration for Evals: Evaluations now correctly utilize the App instance, ensuring that all configured App-level plugins (like ReflectAndRetryToolPlugin) are applied during the evaluation process, which was previously bypassed.
  • Dynamic App Loading: A new helper function _load_app_from_module has been introduced to dynamically load an App instance from the agent module path, allowing the evaluation system to detect and use the App if defined.
  • Conditional Inference Generation: The LocalEvalService has been updated to conditionally use a new _generate_inferences_from_app method when an App is available, otherwise falling back to the existing _generate_inferences_from_root_agent for backward compatibility.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@adk-bot added the eval [Component] This issue is related to evaluation label on Dec 5, 2025
Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request correctly modifies the evaluation process to use the App instance when available, ensuring that App-level plugins are applied during evals. The changes are well-structured, with a new helper to load the App from the agent module and a new inference path in the LocalEvalService. My feedback includes suggestions to improve code consistency, reduce duplication, and fix minor style issues.

I am having trouble creating individual review comments, so my feedback is included inline below.

src/google/adk/evaluation/evaluation_generator.py (lines 241-261), severity: high

The logic for determining user_id is inconsistent and could lead to bugs. On line 244, initial_session.user_id or 'test_user_id' is used, which treats an empty string "" as falsy and defaults to 'test_user_id'. However, on line 258, initial_session.user_id if initial_session else 'test_user_id' is used, which would use the empty string "" if provided. To ensure consistent behavior, it's best to determine the user_id once and reuse it, explicitly checking for None if that's the intention.

```python
# Initialize session if provided
user_id = 'test_user_id'
if initial_session:
    if initial_session.user_id is not None:
        user_id = initial_session.user_id
    app_name = initial_session.app_name if initial_session.app_name else app.name
    await session_service.create_session(
        app_name=app_name,
        user_id=user_id,
        session_id=session_id,
        state=initial_session.state if initial_session.state else {},
    )

# Run each invocation through the app
for expected_invocation in invocations:
    user_content = expected_invocation.user_content

    # Invoke through App (this applies all plugins)
    response = await app.run(
        user_id=user_id,
        session_id=session_id,
        new_message=user_content,
    )
```

src/google/adk/cli/cli_tools_click.py (lines 25-26), severity: medium

Optional is imported twice. These two lines can be combined into a single import statement for better readability and to avoid redundancy.

```python
from typing import Optional, TYPE_CHECKING
```

src/google/adk/evaluation/evaluation_generator.py (lines 228-287), severity: medium

This new function _generate_inferences_from_app duplicates a significant amount of logic for processing the event stream from an agent run. This logic is also present in _generate_inferences_from_root_agent (which uses a Runner) and is a core part of Runner.run_async. To improve maintainability and reduce this duplication, consider extracting the event stream processing into a shared helper function. For example, a static method like _process_run_stream(stream: AsyncGenerator[Event, None]) could be created to contain the async for event in stream: loop and return the extracted final_response, tool_uses, and invocation_id. Both _generate_inferences_from_app and _generate_inferences_from_root_agent could then use this helper.
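
One possible shape for that helper (hypothetical; this PR does not add it, and the Event accessors below are assumptions):

```python
from typing import AsyncGenerator, Optional, Tuple

from google.adk.events import Event
from google.genai import types


# Would live on EvaluationGenerator as a @staticmethod.
async def _process_run_stream(
    stream: AsyncGenerator[Event, None],
) -> Tuple[Optional[types.Content], list, Optional[str]]:
  """Drains one agent run, collecting what both inference paths need."""
  final_response: Optional[types.Content] = None
  tool_uses: list = []
  invocation_id: Optional[str] = None
  async for event in stream:
    invocation_id = invocation_id or event.invocation_id
    tool_uses.extend(event.get_function_calls())
    if event.is_final_response() and event.content:
      final_response = event.content
  return final_response, tool_uses, invocation_id
```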

src/google/adk/evaluation/evaluation_generator.py (line 269), severity: medium

This line can be simplified for better readability. Using the or operator is more concise and achieves the same result.

```python
invocation_id = invocation_id or event.invocation_id
```

src/google/adk/evaluation/local_eval_service.py (lines 404-413), severity: medium

The indentation inside this except block (4 spaces) is inconsistent with the try block's content and the file's general style (2 spaces). Please adjust it for consistency.

```python
# We intentionally catch the Exception as we don't want failures to affect
# other inferences.
logger.error(
    'Inference failed for eval case `%s` with error %s',
    eval_case.eval_id,
    e,
)
inference_result.status = InferenceStatus.FAILURE
inference_result.error_message = str(e)
return inference_result
```

@ishanrajsingh force-pushed the fix-eval-use-app-with-plugins branch from 52f5cfa to a76304a on December 5, 2025 at 15:30
@ishanrajsingh
Contributor Author

/gemini review

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request aims to enable the use of App-level plugins during evaluations by passing the App instance to the evaluation service. The implementation correctly adds a helper to load the App from the agent module.

However, I've identified a critical issue: the loaded app is not passed to LocalEvalService, which means the core objective of this PR is not met. I've also left a minor comment regarding import statement style.

Additionally, the provided code changes seem to be incomplete. The PR description mentions modifications in local_eval_service.py to use the App and a new method in evaluation_generator.py, but these changes are not reflected in the provided patch. My review is based on the visible changes.

- Extend LocalEvalService to accept optional App parameter
- Route evaluation through App so plugins are applied
- Add _generate_inferences_from_app() to EvaluationGenerator
- Update CLI eval command to load and pass App
- Add helper to load App from agent module

Fixes google#3833
@ishanrajsingh force-pushed the fix-eval-use-app-with-plugins branch from 3161b0c to ede184e on December 5, 2025 at 15:39
@ishanrajsingh
Contributor Author

/gemini review

Contributor

@gemini-code-assist (bot) left a comment

Code Review

The pull request aims to fix an issue where evaluations were bypassing the App layer, thus skipping plugins. The approach of routing evaluations through the App when available is correct. However, the implementation has several critical issues that will prevent it from working as intended.

Specifically, the new _generate_inferences_from_app function in evaluation_generator.py contains syntax errors, attempts to call a non-existent app.run() method, and has incomplete logic for data extraction. Additionally, there are missing imports and parameters not being passed correctly.

I've provided detailed comments with suggestions to fix these issues. Once addressed, the changes should correctly enable plugin support during evaluations.

- Add AsyncGenerator to imports
- Fix _generate_inferences_from_app to use Runner correctly
- Add memory_service parameter
- Capture both tool_uses and tool_responses
- Fix return statement variable name
@ishanrajsingh
Contributor Author

/gemini review

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request correctly identifies the need to use the App instance during evaluations to ensure plugins are active. The approach of loading the App from the agent module and passing it to the LocalEvalService is a good solution.

However, I've found a critical issue with the implementation. The new logic path for evaluations using an App instance only supports static conversations (eval_case.conversation) and will fail or misbehave for dynamic conversations defined by a conversation_scenario. This is a significant regression that needs to be addressed.

Additionally, the new method for generating inferences (_generate_inferences_from_app) has a different logic for processing agent events compared to the existing implementation, which may lead to incomplete data capture (e.g., missing intermediate text responses).

My review includes detailed comments on these points. Addressing them will make this fix robust and complete.

- Add _generate_inferences_from_app to preserve App-level plugins during evals
- Enables ReflectAndRetryToolPlugin and other plugins to run in evaluation mode
- Add early exit for failed inferences to prevent crashes
- Maintain backward compatibility with direct root_agent usage
- Properly label eval requests with EVAL_CLIENT_LABEL for telemetry

Tested: Plugins load and execute during evals, failed inferences handled gracefully
@ishanrajsingh
Contributor Author

/gemini review

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request correctly modifies the evaluation process to use the App instance when available, ensuring that app-level plugins are applied during evaluations. The changes are well-implemented across the CLI, evaluation generator, and local evaluation service. My review includes suggestions to refactor some duplicated code for better maintainability and to remove an unused parameter, which will improve code clarity.

- Add _generate_inferences_from_app to preserve App-level plugins during evals
- Enables ReflectAndRetryToolPlugin and other plugins to run in evaluation mode
- Add early exit for failed inferences to prevent crashes
- Maintain backward compatibility with direct root_agent usage
- Properly label eval requests with EVAL_CLIENT_LABEL for telemetry
- Remove unused root_agent parameter from _generate_inferences_from_app
- Deduplicate client_label_context wrapper

Addresses Gemini Code Assist feedback for improved code clarity and maintainability.
@ishanrajsingh
Contributor Author

/gemini review

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request correctly modifies the evaluation process to use the App instance when available, ensuring that app-level plugins are applied during evaluations. This is a great fix that addresses a significant issue where plugins were being bypassed. The changes are well-structured across the CLI, evaluation generator, and local evaluation service. I've identified one critical issue that would cause a runtime error and one suggestion for refactoring to improve code maintainability.

ishanrajsingh and others added 2 commits December 6, 2025 00:07
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
- Add _generate_inferences_from_app to preserve App-level plugins during evals
- Enables ReflectAndRetryToolPlugin and other plugins to run in evaluation mode
- Add early exit for failed inferences to prevent crashes
- Maintain backward compatibility with direct root_agent usage
- Remove unused root_agent parameter from _generate_inferences_from_app
- Deduplicate client_label_context wrapper
- Refactor common arguments to reduce code duplication
- Deep copy app before adding eval plugins to avoid mutation

Addresses all Gemini Code Assist feedback plus additional safety improvements.
@ishanrajsingh
Contributor Author

/gemini review

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request correctly addresses the issue of plugins being skipped during evaluations by ensuring the App instance is used instead of directly using the root_agent. The changes in cli_tools_click.py to load the App and in local_eval_service.py to conditionally use the App are well-implemented. The new _generate_inferences_from_app method in evaluation_generator.py correctly sets up the runner to include plugins.

However, I've found a critical issue in _generate_inferences_from_app where a session is not created if initial_session is not provided. This will cause evaluations to fail for EvalCases without a session_input. Please see the detailed comment for a suggested fix.
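
A sketch of the kind of fix being requested: create the session unconditionally and seed its fields from initial_session only when one is provided. The helper name and signature here are illustrative, not from the patch.

```python
from typing import Any, Optional


async def _ensure_session(
    session_service: Any,
    app: Any,
    session_id: str,
    initial_session: Optional[Any] = None,
) -> str:
  """Creates the eval session whether or not session_input was given."""
  user_id = (
      initial_session.user_id
      if initial_session and initial_session.user_id is not None
      else 'test_user_id'
  )
  app_name = (
      initial_session.app_name
      if initial_session and initial_session.app_name
      else app.name
  )
  await session_service.create_session(
      app_name=app_name,
      user_id=user_id,
      session_id=session_id,
      state=initial_session.state
      if initial_session and initial_session.state
      else {},
  )
  return user_id
```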

- Add _generate_inferences_from_app to preserve App-level plugins during evals
- Enables ReflectAndRetryToolPlugin and other plugins to run in evaluation mode
- Add early exit for failed inferences to prevent crashes
- Maintain backward compatibility with direct root_agent usage
- Remove unused root_agent parameter from _generate_inferences_from_app
- Deduplicate client_label_context wrapper
- Refactor common arguments to reduce code duplication
- Deep copy app before adding eval plugins to avoid mutation

Addresses all Gemini Code Assist feedback plus additional safety improvements.
@ishanrajsingh
Contributor Author

/gemini review

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces the ability for the evaluation service to utilize an App instance, thereby enabling plugin support during evaluation. Key changes include adding a new _load_app_from_module utility function to dynamically load an App from a module path, and modifying LocalEvalService to accept and conditionally use this App for inference generation, specifically through a new _generate_inferences_from_app method. Review feedback suggests improving the _load_app_from_module function by moving imports to the top for better readability and PEP 8 compliance, and refactoring _generate_inferences_from_app to reduce code duplication with _generate_inferences_from_root_agent by extracting common simulation logic into a helper method.

- Extract common simulation logic into _run_user_simulation_loop helper
  - Eliminates ~30 lines of code duplication
  - Single source of truth for user simulation workflow
  - Used by both _generate_inferences_from_app and _generate_inferences_from_root_agent

- Move imports to top of _load_app_from_module for PEP 8 compliance
  - Improves readability and makes dependencies clear upfront

- Fix critical bug: Always create session in _generate_inferences_from_app
  - Previously failed when EvalCase had no session_input
  - Now consistent with _generate_inferences_from_root_agent

- Simplify user_id determination logic with ternary operators

Changes reduce code by ~30 lines while improving maintainability.
@ishanrajsingh
Contributor Author

/gemini review

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request is a great improvement that ensures evals run through the App layer, allowing plugins to be correctly applied. The changes are well-structured, with good refactoring to extract common logic into helper methods like _run_user_simulation_loop and _load_app_from_module.

I've added a few suggestions to improve performance and robustness:

  • Avoiding a potentially expensive deepcopy in the user simulation loop.
  • Preventing duplicate plugins from being added to the App during evaluation.
  • Ensuring the user_id is handled consistently for both successful and failed inferences.

Overall, this is a solid contribution that correctly addresses the issue.
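
On the duplicate-plugins point, a hypothetical guard (assuming app.plugins is a plain list of plugin instances):

```python
from typing import Any


def _register_eval_plugin_once(app: Any, eval_plugin: Any) -> None:
  """Registers the eval-only plugin unless one of the same type is present."""
  if not any(isinstance(p, type(eval_plugin)) for p in app.plugins):
    app.plugins.append(eval_plugin)
```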

ishanrajsingh and others added 3 commits December 6, 2025 00:36
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
- Extract common simulation logic into _run_user_simulation_loop helper
  - Eliminates ~30 lines of code duplication
  - Single source of truth for user simulation workflow
  - Used by both _generate_inferences_from_app and _generate_inferences_from_root_agent

- Move imports to top of _load_app_from_module for PEP 8 compliance
  - Improves readability and makes dependencies clear upfront

- Fix critical bug: Always create session in _generate_inferences_from_app
  - Previously failed when EvalCase had no session_input
  - Now consistent with _generate_inferences_from_root_agent

- Simplify user_id determination logic with ternary operators

Changes reduce code by ~30 lines while improving maintainability.
@ishanrajsingh
Contributor Author

/gemini review

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request correctly addresses an issue where evaluations were bypassing the App layer, causing plugins to be skipped. The changes introduce logic to load and use an App instance during evaluation when available, with a fallback to the previous behavior of using root_agent. The code is well-structured, and the refactoring to extract common logic into helper methods improves maintainability. My primary feedback concerns the removal of a copy.deepcopy call during this refactoring, which could introduce subtle bugs. I've provided a suggestion to restore it to ensure correctness.
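
A sketch of the restoration being asked for, assuming the eval-only plugin is appended right after the copy (the helper name is illustrative):

```python
import copy
from typing import Any


def _app_for_eval(app: Any, eval_plugin: Any) -> Any:
  """Deep-copies the App before mutation so the caller's instance is untouched."""
  eval_app = copy.deepcopy(app)
  eval_app.plugins.append(eval_plugin)
  return eval_app
```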

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@ishanrajsingh
Contributor Author

/gemini review

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request correctly addresses the issue of plugins being skipped during evaluations by ensuring that the App instance is used when available. The changes are well-structured across the CLI, evaluation service, and generator modules. I appreciate the refactoring in evaluation_generator.py to extract the common user simulation loop, which improves code clarity and reduces duplication.

I have one suggestion to make the loading of the App instance more flexible.

ishanrajsingh and others added 2 commits December 6, 2025 11:22
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Labels

eval [Component] This issue is related to evaluation

Development

Successfully merging this pull request may close these issues.

Eval uses root agent directly rather than through registered app, skipping retry plugin
