Add non-English language support to FinalResponseMatchV2Evaluator #3503
G26karthik wants to merge 6 commits into google:main from
Enhanced the LLM-as-judge prompt to explicitly handle non-English languages including Chinese, Thai, Japanese, Korean, Arabic, Hebrew, Hindi, and other non-Latin scripts. The evaluator now:
- Recognizes identical strings in any language as valid matches
- Handles Unicode and character encoding differences
- Accepts language-specific punctuation variations (e.g., 。 vs . in Chinese)
- Treats all languages with equal evaluation standards
Fixes google#3111
Fixes google#3162
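The PR itself only changes the judge prompt, but the kinds of differences listed above (Unicode encoding forms, language-specific punctuation) can be illustrated deterministically. The following is a minimal sketch, not code from this PR: the helper names and the punctuation table are hypothetical, and it assumes that NFKC normalization plus a small punctuation map is an acceptable notion of "equivalent" for illustration purposes.

```python
import unicodedata

# Hypothetical, non-exhaustive table. NFKC already folds full-width forms
# such as "，" and "！" to ASCII; the ideographic full stop "。" has no
# compatibility mapping, so it is handled explicitly here.
_PUNCT_MAP = str.maketrans({"。": "."})


def normalize_for_comparison(text: str) -> str:
    """Return a canonical form of `text` for loose equality checks."""
    # NFKC composes combining sequences (e.g. "e" + U+0301 -> "é") and
    # folds compatibility characters (e.g. full-width "Ａ" -> "A").
    normalized = unicodedata.normalize("NFKC", text)
    return normalized.translate(_PUNCT_MAP).strip()


def loosely_equal(expected: str, actual: str) -> bool:
    """Compare two strings after Unicode and punctuation normalization."""
    return normalize_for_comparison(expected) == normalize_for_comparison(actual)
```

For example, `loosely_equal("你好。", "你好.")` treats the two punctuation variants as the same response, while strings with different characters still compare unequal.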
Summary of Changes
Hello @G26karthik, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly improves the
Highlights
Code Review
This pull request is a great enhancement to the FinalResponseMatchV2Evaluator prompt, adding explicit support for non-English languages. The new instructions are comprehensive and should effectively address the reported issues with evaluating strings in languages like Thai and Chinese. I have one minor suggestion to further improve the clarity of the prompt for the LLM.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Hi @G26karthik, thank you for your contribution! We appreciate you taking the time to submit this pull request.
Thanks for the update!

Hi @G26karthik, your PR has been received by the team and is currently under review. We will provide feedback as soon as we have an update to share.

Hi @ankursharmas, can you please review this? LGTM.

@ankursharmas

Hi @ryanaiagent @ankursharmas, I have investigated the CI failures in depth and confirmed the following:
Code Quality Checks (All Passing):
About the Earlier Test Failures (Jan 19-20):
The failures on commits 6f642a8 and ce3fc92 showed these errors:
These are async event loop infrastructure issues in the test runner, NOT failures caused by my code changes. My PR only modifies a prompt string (9 lines added) with zero code logic changes.
Verification:
I cloned the repo locally and confirmed:
My Changes Follow the Contributor Guide:
Could you please re-run the CI checks or approve the pending workflows? My latest commit (cb3b2c1) shows the completed checks as passing, with remaining checks awaiting maintainer approval. Thank you for your time!
Summary
Fixes #3111
Fixes #3162
Enhanced the FinalResponseMatchV2Evaluator LLM-as-judge prompt to explicitly support non-English languages, addressing evaluation failures for Thai, Chinese, and other non-Latin scripts.
Problem
The evaluator was returning score=0 for identical strings in non-English languages (Thai, Chinese, Japanese, Korean, Arabic, etc.), even when the agent response and expected response were byte-for-byte identical. This occurred because the LLM judge was not explicitly instructed to handle Unicode characters and language-specific conventions.
Solution
Enhanced the evaluation prompt with:
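As a rough illustration of what prompt-level i18n guidance of this kind might look like, here is a minimal sketch. The wording, the constant name `I18N_GUIDANCE_EXAMPLE`, and the helper `with_i18n_guidance` are all hypothetical; this is not the text or code added to `_FINAL_RESPONSE_MATCH_V2_PROMPT` in this PR.

```python
# Hypothetical guidance text, paraphrasing the goals stated in this PR.
# It is NOT the wording that the PR adds to the real prompt template.
I18N_GUIDANCE_EXAMPLE = """\
Language handling:
- Responses may be written in any language or script.
- If the two responses are identical strings, treat them as a match.
- Ignore differences that come only from Unicode encoding forms.
- Accept language-specific punctuation variants (for example, 。 and .).
- Apply the same evaluation standard to every language.
"""


def with_i18n_guidance(base_prompt: str) -> str:
    """Append the example guidance to an existing judge prompt."""
    # rstrip() avoids stacking blank lines if the base prompt already
    # ends with trailing whitespace.
    return f"{base_prompt.rstrip()}\n\n{I18N_GUIDANCE_EXAMPLE}"
```

Keeping the guidance as a separate block appended to the prompt leaves the existing English instructions untouched, which matches the stated aim of preserving current evaluation behavior.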
Changes
src/google/adk/evaluation/final_response_match_v2.py: _FINAL_RESPONSE_MATCH_V2_PROMPT template with i18n guidance
Testing Plan
This fix enhances the LLM-as-judge prompt with explicit i18n instructions. The prompt modification instructs the evaluator to properly handle non-English text.
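The property being tested can be stated as: when the expected and actual responses are the same string, the score should be 1.0 regardless of script. Below is a minimal sketch of such a check, assuming a hypothetical judge callable of the form `judge(expected, actual) -> float`. It does not call the ADK evaluator; `exact_match_judge` is a stand-in used only to make the sketch runnable, and the sample strings are illustrative.

```python
from typing import Callable, Dict

# Illustrative sample responses in several scripts (not taken from the issues).
SAMPLES = [
    "สวัสดีครับ",      # Thai
    "你好，世界。",     # Chinese
    "こんにちは",       # Japanese
    "안녕하세요",       # Korean
    "مرحبا",           # Arabic
]


def exact_match_judge(expected: str, actual: str) -> float:
    """Stand-in judge: 1.0 only when the UTF-8 bytes are identical."""
    return 1.0 if expected.encode("utf-8") == actual.encode("utf-8") else 0.0


def run_identity_checks(judge: Callable[[str, str], float]) -> Dict[str, float]:
    """Score each sample against itself with the given judge."""
    return {sample: judge(sample, sample) for sample in SAMPLES}
```

A real LLM-backed judge could be passed to `run_identity_checks` in place of the stand-in; any sample that scores below 1.0 would reproduce the behavior described in the linked issues.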
Manual Testing:
Can be verified by reproducing the original issues:
- score=1.0 (previously score=0.0)
- score=1.0 (previously score=0.0)
Unit Tests:
Existing test suite in tests/unittests/evaluation/test_final_response_match_v2.py verifies the evaluator's core functionality. The prompt enhancement preserves existing English evaluation behavior while adding i18n support.
Impact