Fix: Eval uses App with plugins instead of bypassing to root_agent #3836
Conversation
Summary of Changes: This pull request addresses a critical issue where the evaluation system was not properly engaging with the configured App, which meant App-level plugins were bypassed during evals.
Code Review
This pull request correctly modifies the evaluation process to use the App instance when available, ensuring that App-level plugins are applied during evals. The changes are well-structured, with a new helper to load the App from the agent module and a new inference path in the LocalEvalService. My feedback includes suggestions to improve code consistency, reduce duplication, and fix minor style issues.
Individual review comments could not be created; the feedback is consolidated below.
src/google/adk/evaluation/evaluation_generator.py (241-261)

The logic for determining user_id is inconsistent and could lead to bugs. On line 244, `initial_session.user_id or 'test_user_id'` is used, which treats an empty string "" as falsy and defaults to 'test_user_id'. However, on line 258, `initial_session.user_id if initial_session else 'test_user_id'` is used, which would use the empty string "" if provided. To ensure consistent behavior, it's best to determine the user_id once and reuse it, explicitly checking for None if that's the intention:

```python
# Initialize session if provided
user_id = 'test_user_id'
if initial_session:
  if initial_session.user_id is not None:
    user_id = initial_session.user_id
  app_name = initial_session.app_name if initial_session.app_name else app.name
  await session_service.create_session(
      app_name=app_name,
      user_id=user_id,
      session_id=session_id,
      state=initial_session.state if initial_session.state else {},
  )

# Run each invocation through the app
for expected_invocation in invocations:
  user_content = expected_invocation.user_content
  # Invoke through App (this applies all plugins)
  response = await app.run(
      user_id=user_id,
      session_id=session_id,
      new_message=user_content,
  )
```

src/google/adk/cli/cli_tools_click.py (25-26)
Optional is imported twice. These two lines can be combined into a single import statement for better readability and to avoid redundancy:

```python
from typing import Optional, TYPE_CHECKING
```
src/google/adk/evaluation/evaluation_generator.py (228-287)
This new function _generate_inferences_from_app duplicates a significant amount of logic for processing the event stream from an agent run. This logic is also present in _generate_inferences_from_root_agent (which uses a Runner) and is a core part of Runner.run_async. To improve maintainability and reduce this duplication, consider extracting the event stream processing into a shared helper function. For example, a static method like _process_run_stream(stream: AsyncGenerator[Event, None]) could be created to contain the async for event in stream: loop and return the extracted final_response, tool_uses, and invocation_id. Both _generate_inferences_from_app and _generate_inferences_from_root_agent could then use this helper.
src/google/adk/evaluation/evaluation_generator.py (269)

This line can be simplified for better readability. Using the `or` operator is more concise and achieves the same result:

```python
invocation_id = invocation_id or event.invocation_id
```
src/google/adk/evaluation/local_eval_service.py (404-413)

The indentation inside this except block (4 spaces) is inconsistent with the try block's content and the file's general style (2 spaces). Please adjust it for consistency:

```python
# We intentionally catch the Exception as we don't want failures to affect
# other inferences.
logger.error(
    'Inference failed for eval case `%s` with error %s',
    eval_case.eval_id,
    e,
)
inference_result.status = InferenceStatus.FAILURE
inference_result.error_message = str(e)
return inference_result
```
Force-pushed from 52f5cfa to a76304a
/gemini review
Code Review
This pull request aims to enable the use of App-level plugins during evaluations by passing the App instance to the evaluation service. The implementation correctly adds a helper to load the App from the agent module.
However, I've identified a critical issue: the loaded app is not passed to LocalEvalService, which means the core objective of this PR is not met. I've also left a minor comment regarding import statement style.
Additionally, the provided code changes seem to be incomplete. The PR description mentions modifications in local_eval_service.py to use the App and a new method in evaluation_generator.py, but these changes are not reflected in the provided patch. My review is based on the visible changes.
- Extend LocalEvalService to accept optional App parameter
- Route evaluation through App so plugins are applied
- Add _generate_inferences_from_app() to EvaluationGenerator
- Update CLI eval command to load and pass App
- Add helper to load App from agent module

Fixes google#3833
Force-pushed from 3161b0c to ede184e
/gemini review
Code Review
The pull request aims to fix an issue where evaluations were bypassing the App layer, thus skipping plugins. The approach of routing evaluations through the App when available is correct. However, the implementation has several critical issues that will prevent it from working as intended.
Specifically, the new _generate_inferences_from_app function in evaluation_generator.py contains syntax errors, attempts to call a non-existent app.run() method, and has incomplete logic for data extraction. Additionally, there are missing imports and parameters not being passed correctly.
I've provided detailed comments with suggestions to fix these issues. Once addressed, the changes should correctly enable plugin support during evaluations.
- Add AsyncGenerator to imports
- Fix _generate_inferences_from_app to use Runner correctly
- Add memory_service parameter
- Capture both tool_uses and tool_responses
- Fix return statement variable name
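Capturing both tool_uses and tool_responses typically means walking each event's content parts; the earlier version dropped the responses. The sketch below uses stand-in `Part`/`Event` dataclasses rather than the real ADK types, which expose their own accessors for this:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Part:
  # Stand-in for a content part: at most one of these is set per part.
  function_call: Optional[dict] = None
  function_response: Optional[dict] = None

@dataclass
class Event:
  parts: list = field(default_factory=list)

def collect_tool_activity(events):
  """Capture both tool invocations and their responses from a run.

  The bug being fixed was capturing only tool_uses, which dropped
  the tool_responses needed by response-matching evaluators.
  """
  tool_uses, tool_responses = [], []
  for event in events:
    for part in event.parts:
      if part.function_call is not None:
        tool_uses.append(part.function_call)
      if part.function_response is not None:
        tool_responses.append(part.function_response)
  return tool_uses, tool_responses
```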
/gemini review
Code Review
This pull request correctly identifies the need to use the App instance during evaluations to ensure plugins are active. The approach of loading the App from the agent module and passing it to the LocalEvalService is a good solution.
However, I've found a critical issue with the implementation. The new logic path for evaluations using an App instance only supports static conversations (eval_case.conversation) and will fail or misbehave for dynamic conversations defined by a conversation_scenario. This is a significant regression that needs to be addressed.
Additionally, the new method for generating inferences (_generate_inferences_from_app) has a different logic for processing agent events compared to the existing implementation, which may lead to incomplete data capture (e.g., missing intermediate text responses).
My review includes detailed comments on these points. Addressing them will make this fix robust and complete.
- Add _generate_inferences_from_app to preserve App-level plugins during evals
- Enables ReflectAndRetryToolPlugin and other plugins to run in evaluation mode
- Add early exit for failed inferences to prevent crashes
- Maintain backward compatibility with direct root_agent usage
- Properly label eval requests with EVAL_CLIENT_LABEL for telemetry

Tested: Plugins load and execute during evals, failed inferences handled gracefully
/gemini review
Code Review
This pull request correctly modifies the evaluation process to use the App instance when available, ensuring that app-level plugins are applied during evaluations. The changes are well-implemented across the CLI, evaluation generator, and local evaluation service. My review includes suggestions to refactor some duplicated code for better maintainability and to remove an unused parameter, which will improve code clarity.
- Add _generate_inferences_from_app to preserve App-level plugins during evals
- Enables ReflectAndRetryToolPlugin and other plugins to run in evaluation mode
- Add early exit for failed inferences to prevent crashes
- Maintain backward compatibility with direct root_agent usage
- Properly label eval requests with EVAL_CLIENT_LABEL for telemetry
- Remove unused root_agent parameter from _generate_inferences_from_app
- Deduplicate client_label_context wrapper

Addresses Gemini Code Assist feedback for improved code clarity and maintainability.
/gemini review
Code Review
This pull request correctly modifies the evaluation process to use the App instance when available, ensuring that app-level plugins are applied during evaluations. This is a great fix that addresses a significant issue where plugins were being bypassed. The changes are well-structured across the CLI, evaluation generator, and local evaluation service. I've identified one critical issue that would cause a runtime error and one suggestion for refactoring to improve code maintainability.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
- Add _generate_inferences_from_app to preserve App-level plugins during evals
- Enables ReflectAndRetryToolPlugin and other plugins to run in evaluation mode
- Add early exit for failed inferences to prevent crashes
- Maintain backward compatibility with direct root_agent usage
- Remove unused root_agent parameter from _generate_inferences_from_app
- Deduplicate client_label_context wrapper
- Refactor common arguments to reduce code duplication
- Deep copy app before adding eval plugins to avoid mutation

Addresses all Gemini Code Assist feedback plus additional safety improvements.
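The "deep copy app before adding eval plugins" item reflects a general pattern: mutating the caller's App in place would leak eval-only plugins into normal serving. A minimal sketch, using a hypothetical `App` dataclass in place of the real google.adk type:

```python
import copy
from dataclasses import dataclass, field

# Hypothetical stand-in; the real App type lives in google.adk.
@dataclass
class App:
  name: str
  plugins: list = field(default_factory=list)

def with_eval_plugins(app: App, eval_plugins: list) -> App:
  """Return a copy of the app with eval-only plugins appended.

  Deep-copying first means the caller's App is never mutated, so
  eval-only plugins cannot leak into non-eval runs.
  """
  eval_app = copy.deepcopy(app)
  eval_app.plugins.extend(eval_plugins)
  return eval_app
```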
/gemini review
Code Review
This pull request correctly addresses the issue of plugins being skipped during evaluations by ensuring the App instance is used instead of directly using the root_agent. The changes in cli_tools_click.py to load the App and in local_eval_service.py to conditionally use the App are well-implemented. The new _generate_inferences_from_app method in evaluation_generator.py correctly sets up the runner to include plugins.
However, I've found a critical issue in _generate_inferences_from_app where a session is not created if initial_session is not provided. This will cause evaluations to fail for EvalCases without a session_input. Please see the detailed comment for a suggested fix.
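The fix this review asks for, creating a session even when no session_input is provided, can be sketched as follows. The `InitialSession` stand-in and the `create_eval_session` helper name are illustrative assumptions, not the merged code:

```python
import asyncio
from dataclasses import dataclass
from typing import Optional

@dataclass
class InitialSession:
  # Simplified stand-in for the eval case's session_input.
  app_name: Optional[str] = None
  user_id: Optional[str] = None
  state: Optional[dict] = None

async def create_eval_session(session_service, app_name, session_id,
                              initial_session: Optional[InitialSession]):
  """Always create a session, falling back to defaults when no
  initial_session is supplied (the bug was skipping creation entirely)."""
  user_id = 'test_user_id'
  state = {}
  if initial_session:
    if initial_session.user_id is not None:
      user_id = initial_session.user_id
    if initial_session.app_name:
      app_name = initial_session.app_name
    state = initial_session.state or {}
  await session_service.create_session(
      app_name=app_name, user_id=user_id,
      session_id=session_id, state=state)
  return user_id
```

The key point is that `create_session` sits outside any `if initial_session:` guard, so EvalCases without a session_input still get a session.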
- Add _generate_inferences_from_app to preserve App-level plugins during evals
- Enables ReflectAndRetryToolPlugin and other plugins to run in evaluation mode
- Add early exit for failed inferences to prevent crashes
- Maintain backward compatibility with direct root_agent usage
- Remove unused root_agent parameter from _generate_inferences_from_app
- Deduplicate client_label_context wrapper
- Refactor common arguments to reduce code duplication
- Deep copy app before adding eval plugins to avoid mutation

Addresses all Gemini Code Assist feedback plus additional safety improvements.
/gemini review
Code Review
This pull request introduces the ability for the evaluation service to utilize an App instance, thereby enabling plugin support during evaluation. Key changes include adding a new _load_app_from_module utility function to dynamically load an App from a module path, and modifying LocalEvalService to accept and conditionally use this App for inference generation, specifically through a new _generate_inferences_from_app method. Review feedback suggests improving the _load_app_from_module function by moving imports to the top for better readability and PEP 8 compliance, and refactoring _generate_inferences_from_app to reduce code duplication with _generate_inferences_from_root_agent by extracting common simulation logic into a helper method.
- Extract common simulation logic into _run_user_simulation_loop helper
  - Eliminates ~30 lines of code duplication
  - Single source of truth for user simulation workflow
  - Used by both _generate_inferences_from_app and _generate_inferences_from_root_agent
- Move imports to top of _load_app_from_module for PEP 8 compliance
  - Improves readability and makes dependencies clear upfront
- Fix critical bug: Always create session in _generate_inferences_from_app
  - Previously failed when EvalCase had no session_input
  - Now consistent with _generate_inferences_from_root_agent
- Simplify user_id determination logic with ternary operators

Changes reduce code by ~30 lines while improving maintainability.
/gemini review
Code Review
This pull request is a great improvement that ensures evals run through the App layer, allowing plugins to be correctly applied. The changes are well-structured, with good refactoring to extract common logic into helper methods like _run_user_simulation_loop and _load_app_from_module.
I've added a few suggestions to improve performance and robustness:
- Avoiding a potentially expensive deepcopy in the user simulation loop.
- Preventing duplicate plugins from being added to the App during evaluation.
- Ensuring the user_id is handled consistently for both successful and failed inferences.
Overall, this is a solid contribution that correctly addresses the issue.
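The duplicate-plugin concern raised above amounts to an idempotence check before registering. A sketch that uses plugin names as the identity key, which is an assumption here, as the real App may key plugins differently:

```python
from dataclasses import dataclass

@dataclass
class Plugin:
  name: str

def add_plugin_once(plugins: list, plugin: Plugin) -> bool:
  """Append plugin only if no plugin with the same name is present.

  Returns True if the plugin was added. Repeated eval runs over the
  same plugin list then cannot stack duplicate eval plugins.
  """
  if any(p.name == plugin.name for p in plugins):
    return False
  plugins.append(plugin)
  return True
```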
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
- Extract common simulation logic into _run_user_simulation_loop helper
  - Eliminates ~30 lines of code duplication
  - Single source of truth for user simulation workflow
  - Used by both _generate_inferences_from_app and _generate_inferences_from_root_agent
- Move imports to top of _load_app_from_module for PEP 8 compliance
  - Improves readability and makes dependencies clear upfront
- Fix critical bug: Always create session in _generate_inferences_from_app
  - Previously failed when EvalCase had no session_input
  - Now consistent with _generate_inferences_from_root_agent
- Simplify user_id determination logic with ternary operators

Changes reduce code by ~30 lines while improving maintainability.
/gemini review
Code Review
This pull request correctly addresses an issue where evaluations were bypassing the App layer, causing plugins to be skipped. The changes introduce logic to load and use an App instance during evaluation when available, with a fallback to the previous behavior of using root_agent. The code is well-structured, and the refactoring to extract common logic into helper methods improves maintainability. My primary feedback concerns the removal of a copy.deepcopy call during this refactoring, which could introduce subtle bugs. I've provided a suggestion to restore it to ensure correctness.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
/gemini review
Code Review
This pull request correctly addresses the issue of plugins being skipped during evaluations by ensuring that the App instance is used when available. The changes are well-structured across the CLI, evaluation service, and generator modules. I appreciate the refactoring in evaluation_generator.py to extract the common user simulation loop, which improves code clarity and reduces duplication.
I have one suggestion to make the loading of the App instance more flexible.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Description

Fixes #3833

When running evals, the system was using root_agent directly instead of going through the configured App, which meant plugins like ReflectAndRetryToolPlugin were being skipped.

Changes Made

1. local_eval_service.py
   - Added an app: Optional['App'] parameter to LocalEvalService.__init__()
   - Updated _perform_inference_single_eval_item() to use the App when available
2. evaluation_generator.py
   - Added a _generate_inferences_from_app() static method
3. cli_tools_click.py
   - Added a _load_app_from_module() helper function

Testing

✅ Tested locally with an agent that has ReflectAndRetryToolPlugin
✅ Verified plugins are registered during eval (log shows: "Plugin 'request_intercepter_plugin' registered")
✅ Backward compatible (agents without an App still work)

Behavior

- Before: used root_agent directly, skipping all App-level plugins
- After: uses the App (with plugins) when available, falling back to root_agent if no App is defined
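The _load_app_from_module() helper described above can be sketched with importlib. The attribute name `app` and the fall-back-to-root_agent behavior follow the PR description, but the exact lookup logic here is an illustration, not the merged implementation:

```python
import importlib
from typing import Any, Optional

def load_app_from_module(module_path: str) -> Optional[Any]:
  """Import an agent module and return its top-level `app` if defined.

  Returning None lets the caller fall back to `root_agent`, preserving
  backward compatibility for agents that never define an App.
  """
  module = importlib.import_module(module_path)
  return getattr(module, 'app', None)
```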