
Conversation

@ishanrajsingh
Contributor

Description

Fixes #3833

When running evals, the system was using root_agent directly instead of going through the configured App, which meant plugins like ReflectAndRetryToolPlugin were being skipped.

Changes Made

1. local_eval_service.py

  • Added app: Optional['App'] parameter to LocalEvalService.__init__()
  • Modified _perform_inference_single_eval_item() to use App when available
  • Routes inference through the App so plugin behavior is preserved (a sketch of this wiring follows the list)

2. evaluation_generator.py

  • Added _generate_inferences_from_app() static method
  • Ensures plugins are applied during evaluation

3. cli_tools_click.py

  • Added _load_app_from_module() helper function
  • Modified eval command to load and pass App to LocalEvalService
  • Falls back to root_agent if no App is defined
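
A condensed sketch of how these pieces fit together. This is not the verbatim patch: the App import path, the `app` attribute looked up on the agent module, and the method signatures are assumptions based on the description above.

```python
import importlib
from typing import Any, Optional

from google.adk.apps import App
from google.adk.evaluation.evaluation_generator import EvaluationGenerator


def _load_app_from_module(agent_module_path: str) -> Optional[App]:
  """Returns the App exported by the agent module, or None if absent."""
  agent_module = importlib.import_module(agent_module_path)
  app = getattr(agent_module, 'app', None)
  return app if isinstance(app, App) else None


class LocalEvalService:
  """Only the parts this PR touches are sketched."""

  def __init__(self, root_agent: Any, app: Optional[App] = None):
    self._root_agent = root_agent
    # When an App is supplied, inference is routed through it so that
    # App-level plugins (e.g. ReflectAndRetryToolPlugin) run during evals.
    self._app = app

  async def _perform_inference_single_eval_item(self, eval_case: Any) -> Any:
    if self._app is not None:
      return await EvaluationGenerator._generate_inferences_from_app(
          app=self._app, eval_case=eval_case)
    # Backward-compatible path for agents that define no App.
    return await EvaluationGenerator._generate_inferences_from_root_agent(
        root_agent=self._root_agent, eval_case=eval_case)
```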

Testing

✅ Tested locally with agent that has ReflectAndRetryToolPlugin
✅ Verified plugins are registered during eval (log shows: "Plugin 'request_intercepter_plugin' registered")
✅ Backward compatible (agents without App still work)

Behavior

  • Before: Evals always used root_agent directly, skipping all App-level plugins
  • After: Evals use App (with plugins) when available, falling back to root_agent if no App is defined
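
For reference, a minimal sketch of the new generator path, assuming the Runner, session-service, and EvalCase APIs from google.adk; exact signatures in the patch may differ.

```python
from google.adk.apps import App
from google.adk.evaluation.eval_case import EvalCase
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService


async def _generate_inferences_from_app(app: App, eval_case: EvalCase) -> list:
  """Runs each invocation of the eval case through a Runner built from the App."""
  session_service = InMemorySessionService()
  session = await session_service.create_session(
      app_name=app.name, user_id='test_user_id')
  # Handing the App's plugins to the Runner is what makes them fire during
  # evals; building a Runner from root_agent alone skips them.
  runner = Runner(
      app_name=app.name,
      agent=app.root_agent,
      plugins=app.plugins,
      session_service=session_service,
  )
  final_responses = []
  for invocation in eval_case.conversation:
    async for event in runner.run_async(
        user_id=session.user_id,
        session_id=session.id,
        new_message=invocation.user_content,
    ):
      if event.is_final_response() and event.content:
        final_responses.append(event.content)
  return final_responses
```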

@gemini-code-assist
Contributor

Summary of Changes

Hello @ishanrajsingh, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical issue where the evaluation system was not properly engaging with the App framework, leading to the omission of crucial App-level plugins during evaluation runs. By integrating the App into the evaluation flow, this change ensures that evaluations accurately reflect the agent's behavior, including any plugin-driven logic, thereby improving the reliability and completeness of evaluation results.

Highlights

  • Plugin Integration for Evals: Evaluations now correctly utilize the App instance, ensuring that all configured App-level plugins (like ReflectAndRetryToolPlugin) are applied during the evaluation process, which was previously bypassed.
  • Dynamic App Loading: A new helper function _load_app_from_module has been introduced to dynamically load an App instance from the agent module path, allowing the evaluation system to detect and use the App if defined.
  • Conditional Inference Generation: The LocalEvalService has been updated to conditionally use a new _generate_inferences_from_app method when an App is available, otherwise falling back to the existing _generate_inferences_from_root_agent for backward compatibility.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@adk-bot added the eval [Component] This issue is related to evaluation label on Dec 5, 2025
Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request correctly modifies the evaluation process to use the App instance when available, ensuring that App-level plugins are applied during evals. The changes are well-structured, with a new helper to load the App from the agent module and a new inference path in the LocalEvalService. My feedback includes suggestions to improve code consistency, reduce duplication, and fix minor style issues.

I am having trouble creating individual review comments, so my feedback is included inline below.

src/google/adk/evaluation/evaluation_generator.py (lines 241-261), severity: high

The logic for determining user_id is inconsistent and could lead to bugs. On line 244, initial_session.user_id or 'test_user_id' is used, which treats an empty string "" as falsy and defaults to 'test_user_id'. However, on line 258, initial_session.user_id if initial_session else 'test_user_id' is used, which would use the empty string "" if provided. To ensure consistent behavior, it's best to determine the user_id once and reuse it, explicitly checking for None if that's the intention.

```python
# Initialize session if provided
user_id = 'test_user_id'
if initial_session:
    if initial_session.user_id is not None:
        user_id = initial_session.user_id
    app_name = initial_session.app_name if initial_session.app_name else app.name
    await session_service.create_session(
        app_name=app_name,
        user_id=user_id,
        session_id=session_id,
        state=initial_session.state if initial_session.state else {},
    )

# Run each invocation through the app
for expected_invocation in invocations:
    user_content = expected_invocation.user_content

    # Invoke through App (this applies all plugins)
    response = await app.run(
        user_id=user_id,
        session_id=session_id,
        new_message=user_content,
    )
```

src/google/adk/cli/cli_tools_click.py (lines 25-26), severity: medium

Optional is imported twice. These two lines can be combined into a single import statement for better readability and to avoid redundancy.

```python
from typing import Optional, TYPE_CHECKING
```

src/google/adk/evaluation/evaluation_generator.py (lines 228-287), severity: medium

This new function _generate_inferences_from_app duplicates a significant amount of logic for processing the event stream from an agent run. This logic is also present in _generate_inferences_from_root_agent (which uses a Runner) and is a core part of Runner.run_async. To improve maintainability and reduce this duplication, consider extracting the event stream processing into a shared helper function. For example, a static method like _process_run_stream(stream: AsyncGenerator[Event, None]) could be created to contain the async for event in stream: loop and return the extracted final_response, tool_uses, and invocation_id. Both _generate_inferences_from_app and _generate_inferences_from_root_agent could then use this helper.
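
One possible shape for that helper (hypothetical; this PR does not add it, and the Event accessors below are assumptions):

```python
from typing import AsyncGenerator, Optional, Tuple

from google.adk.events import Event
from google.genai import types


# Would live on EvaluationGenerator as a @staticmethod.
async def _process_run_stream(
    stream: AsyncGenerator[Event, None],
) -> Tuple[Optional[types.Content], list, Optional[str]]:
  """Drains one agent run, collecting what both inference paths need."""
  final_response: Optional[types.Content] = None
  tool_uses: list = []
  invocation_id: Optional[str] = None
  async for event in stream:
    invocation_id = invocation_id or event.invocation_id
    tool_uses.extend(event.get_function_calls())
    if event.is_final_response() and event.content:
      final_response = event.content
  return final_response, tool_uses, invocation_id
```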

src/google/adk/evaluation/evaluation_generator.py (line 269), severity: medium

This line can be simplified for better readability. Using the or operator is more concise and achieves the same result.

```python
invocation_id = invocation_id or event.invocation_id
```

src/google/adk/evaluation/local_eval_service.py (lines 404-413), severity: medium

The indentation inside this except block (4 spaces) is inconsistent with the try block's content and the file's general style (2 spaces). Please adjust it for consistency.

```python
# We intentionally catch the Exception as we don't want failures to affect
# other inferences.
logger.error(
    'Inference failed for eval case `%s` with error %s',
    eval_case.eval_id,
    e,
)
inference_result.status = InferenceStatus.FAILURE
inference_result.error_message = str(e)
return inference_result
```

@ishanrajsingh force-pushed the fix-eval-use-app-with-plugins branch from 52f5cfa to a76304a on December 5, 2025 at 15:30
@ishanrajsingh
Contributor Author

/gemini review

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request aims to enable the use of App-level plugins during evaluations by passing the App instance to the evaluation service. The implementation correctly adds a helper to load the App from the agent module.

However, I've identified a critical issue: the loaded app is not passed to LocalEvalService, which means the core objective of this PR is not met. I've also left a minor comment regarding import statement style.

Additionally, the provided code changes seem to be incomplete. The PR description mentions modifications in local_eval_service.py to use the App and a new method in evaluation_generator.py, but these changes are not reflected in the provided patch. My review is based on the visible changes.

- Extend LocalEvalService to accept optional App parameter
- Route evaluation through App so plugins are applied
- Add _generate_inferences_from_app() to EvaluationGenerator
- Update CLI eval command to load and pass App
- Add helper to load App from agent module

Fixes google#3833
@ishanrajsingh force-pushed the fix-eval-use-app-with-plugins branch from 3161b0c to ede184e on December 5, 2025 at 15:39
@ishanrajsingh
Contributor Author

/gemini review

Contributor

@gemini-code-assist (bot) left a comment

Code Review

The pull request aims to fix an issue where evaluations were bypassing the App layer, thus skipping plugins. The approach of routing evaluations through the App when available is correct. However, the implementation has several critical issues that will prevent it from working as intended.

Specifically, the new _generate_inferences_from_app function in evaluation_generator.py contains syntax errors, attempts to call a non-existent app.run() method, and has incomplete logic for data extraction. Additionally, there are missing imports and parameters not being passed correctly.

I've provided detailed comments with suggestions to fix these issues. Once addressed, the changes should correctly enable plugin support during evaluations.

- Add AsyncGenerator to imports
- Fix _generate_inferences_from_app to use Runner correctly
- Add memory_service parameter
- Capture both tool_uses and tool_responses
- Fix return statement variable name
@ishanrajsingh
Contributor Author

/gemini review

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request correctly identifies the need to use the App instance during evaluations to ensure plugins are active. The approach of loading the App from the agent module and passing it to the LocalEvalService is a good solution.

However, I've found a critical issue with the implementation. The new logic path for evaluations using an App instance only supports static conversations (eval_case.conversation) and will fail or misbehave for dynamic conversations defined by a conversation_scenario. This is a significant regression that needs to be addressed.

Additionally, the new method for generating inferences (_generate_inferences_from_app) has a different logic for processing agent events compared to the existing implementation, which may lead to incomplete data capture (e.g., missing intermediate text responses).

My review includes detailed comments on these points. Addressing them will make this fix robust and complete.

- Add _generate_inferences_from_app to preserve App-level plugins during evals
- Enables ReflectAndRetryToolPlugin and other plugins to run in evaluation mode
- Add early exit for failed inferences to prevent crashes
- Maintain backward compatibility with direct root_agent usage
- Properly label eval requests with EVAL_CLIENT_LABEL for telemetry

Tested: Plugins load and execute during evals, failed inferences handled gracefully
@ishanrajsingh
Contributor Author

/gemini review

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request correctly modifies the evaluation process to use the App instance when available, ensuring that app-level plugins are applied during evaluations. The changes are well-implemented across the CLI, evaluation generator, and local evaluation service. My review includes suggestions to refactor some duplicated code for better maintainability and to remove an unused parameter, which will improve code clarity.

- Add _generate_inferences_from_app to preserve App-level plugins during evals
- Enables ReflectAndRetryToolPlugin and other plugins to run in evaluation mode
- Add early exit for failed inferences to prevent crashes
- Maintain backward compatibility with direct root_agent usage
- Properly label eval requests with EVAL_CLIENT_LABEL for telemetry
- Remove unused root_agent parameter from _generate_inferences_from_app
- Deduplicate client_label_context wrapper

Addresses Gemini Code Assist feedback for improved code clarity and maintainability.
@ishanrajsingh
Contributor Author

/gemini review

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request correctly modifies the evaluation process to use the App instance when available, ensuring that app-level plugins are applied during evaluations. This is a great fix that addresses a significant issue where plugins were being bypassed. The changes are well-structured across the CLI, evaluation generator, and local evaluation service. I've identified one critical issue that would cause a runtime error and one suggestion for refactoring to improve code maintainability.

ishanrajsingh and others added 2 commits December 6, 2025 00:07
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
- Add _generate_inferences_from_app to preserve App-level plugins during evals
- Enables ReflectAndRetryToolPlugin and other plugins to run in evaluation mode
- Add early exit for failed inferences to prevent crashes
- Maintain backward compatibility with direct root_agent usage
- Remove unused root_agent parameter from _generate_inferences_from_app
- Deduplicate client_label_context wrapper
- Refactor common arguments to reduce code duplication
- Deep copy app before adding eval plugins to avoid mutation

Addresses all Gemini Code Assist feedback plus additional safety improvements.
@ishanrajsingh
Contributor Author

/gemini review

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request correctly addresses the issue of plugins being skipped during evaluations by ensuring the App instance is used instead of directly using the root_agent. The changes in cli_tools_click.py to load the App and in local_eval_service.py to conditionally use the App are well-implemented. The new _generate_inferences_from_app method in evaluation_generator.py correctly sets up the runner to include plugins.

However, I've found a critical issue in _generate_inferences_from_app where a session is not created if initial_session is not provided. This will cause evaluations to fail for EvalCases without a session_input. Please see the detailed comment for a suggested fix.
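
A sketch of the kind of fix being requested: create the session unconditionally and seed its fields from initial_session only when one is provided. The helper name and signature here are illustrative, not from the patch.

```python
from typing import Any, Optional


async def _ensure_session(
    session_service: Any,
    app: Any,
    session_id: str,
    initial_session: Optional[Any] = None,
) -> str:
  """Creates the eval session whether or not session_input was given."""
  user_id = (
      initial_session.user_id
      if initial_session and initial_session.user_id is not None
      else 'test_user_id'
  )
  app_name = (
      initial_session.app_name
      if initial_session and initial_session.app_name
      else app.name
  )
  await session_service.create_session(
      app_name=app_name,
      user_id=user_id,
      session_id=session_id,
      state=initial_session.state
      if initial_session and initial_session.state
      else {},
  )
  return user_id
```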

- Add _generate_inferences_from_app to preserve App-level plugins during evals
- Enables ReflectAndRetryToolPlugin and other plugins to run in evaluation mode
- Add early exit for failed inferences to prevent crashes
- Maintain backward compatibility with direct root_agent usage
- Remove unused root_agent parameter from _generate_inferences_from_app
- Deduplicate client_label_context wrapper
- Refactor common arguments to reduce code duplication
- Deep copy app before adding eval plugins to avoid mutation

Addresses all Gemini Code Assist feedback plus additional safety improvements.
@ishanrajsingh
Contributor Author

/gemini review

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces the ability for the evaluation service to utilize an App instance, thereby enabling plugin support during evaluation. Key changes include adding a new _load_app_from_module utility function to dynamically load an App from a module path, and modifying LocalEvalService to accept and conditionally use this App for inference generation, specifically through a new _generate_inferences_from_app method. Review feedback suggests improving the _load_app_from_module function by moving imports to the top for better readability and PEP 8 compliance, and refactoring _generate_inferences_from_app to reduce code duplication with _generate_inferences_from_root_agent by extracting common simulation logic into a helper method.

- Extract common simulation logic into _run_user_simulation_loop helper
  - Eliminates ~30 lines of code duplication
  - Single source of truth for user simulation workflow
  - Used by both _generate_inferences_from_app and _generate_inferences_from_root_agent

- Move imports to top of _load_app_from_module for PEP 8 compliance
  - Improves readability and makes dependencies clear upfront

- Fix critical bug: Always create session in _generate_inferences_from_app
  - Previously failed when EvalCase had no session_input
  - Now consistent with _generate_inferences_from_root_agent

- Simplify user_id determination logic with ternary operators

Changes reduce code by ~30 lines while improving maintainability.
@ishanrajsingh
Contributor Author

/gemini review

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request is a great improvement that ensures evals run through the App layer, allowing plugins to be correctly applied. The changes are well-structured, with good refactoring to extract common logic into helper methods like _run_user_simulation_loop and _load_app_from_module.

I've added a few suggestions to improve performance and robustness:

  • Avoiding a potentially expensive deepcopy in the user simulation loop.
  • Preventing duplicate plugins from being added to the App during evaluation.
  • Ensuring the user_id is handled consistently for both successful and failed inferences.

Overall, this is a solid contribution that correctly addresses the issue.
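
On the duplicate-plugins point, a hypothetical guard (assuming app.plugins is a plain list of plugin instances):

```python
from typing import Any


def _register_eval_plugin_once(app: Any, eval_plugin: Any) -> None:
  """Registers the eval-only plugin unless one of the same type is present."""
  if not any(isinstance(p, type(eval_plugin)) for p in app.plugins):
    app.plugins.append(eval_plugin)
```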

ishanrajsingh and others added 3 commits December 6, 2025 00:36
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
- Extract common simulation logic into _run_user_simulation_loop helper
  - Eliminates ~30 lines of code duplication
  - Single source of truth for user simulation workflow
  - Used by both _generate_inferences_from_app and _generate_inferences_from_root_agent

- Move imports to top of _load_app_from_module for PEP 8 compliance
  - Improves readability and makes dependencies clear upfront

- Fix critical bug: Always create session in _generate_inferences_from_app
  - Previously failed when EvalCase had no session_input
  - Now consistent with _generate_inferences_from_root_agent

- Simplify user_id determination logic with ternary operators

Changes reduce code by ~30 lines while improving maintainability.
@ishanrajsingh
Contributor Author

/gemini review

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request correctly addresses an issue where evaluations were bypassing the App layer, causing plugins to be skipped. The changes introduce logic to load and use an App instance during evaluation when available, with a fallback to the previous behavior of using root_agent. The code is well-structured, and the refactoring to extract common logic into helper methods improves maintainability. My primary feedback concerns the removal of a copy.deepcopy call during this refactoring, which could introduce subtle bugs. I've provided a suggestion to restore it to ensure correctness.
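
A sketch of the restoration being asked for, assuming the eval-only plugin is appended right after the copy (the helper name is illustrative):

```python
import copy
from typing import Any


def _app_for_eval(app: Any, eval_plugin: Any) -> Any:
  """Deep-copies the App before mutation so the caller's instance is untouched."""
  eval_app = copy.deepcopy(app)
  eval_app.plugins.append(eval_plugin)
  return eval_app
```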

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@ishanrajsingh
Contributor Author

/gemini review

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request correctly addresses the issue of plugins being skipped during evaluations by ensuring that the App instance is used when available. The changes are well-structured across the CLI, evaluation service, and generator modules. I appreciate the refactoring in evaluation_generator.py to extract the common user simulation loop, which improves code clarity and reduces duplication.

I have one suggestion to make the loading of the App instance more flexible.

ishanrajsingh and others added 2 commits December 6, 2025 11:22
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Labels

eval [Component] This issue is related to evaluation

Development

Successfully merging this pull request may close these issues.

Eval uses root agent directly rather than through registered app, skipping retry plugin
