feat: implement evaluation framework for praisonaiagents#976
MervinPraison wants to merge 1 commit into main.
Conversation
- Add comprehensive evaluation framework with minimal client-side code
- Implement AccuracyEval with simple similarity and LLM-based scoring
- Implement ReliabilityEval for tool usage validation
- Implement PerformanceEval for runtime, memory, and token benchmarking
- Add EvalSuite for automated test suites with CI/CD integration
- Include EvalCriteria for multi-dimensional evaluation scoring
- Support statistical reliability with multiple iterations and confidence intervals
- Add result export capabilities (JSON, HTML, Markdown)
- Integrate with existing Agent, Task, and PraisonAIAgents classes
- Ensure backward compatibility with lazy loading
- Include comprehensive test suite and usage examples

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: Mervin Praison <MervinPraison@users.noreply.github.com>
📝 Walkthrough

Introduces a comprehensive client-side evaluation framework for PraisonAI agents, including AccuracyEval with LLM-based multi-criteria scoring, ReliabilityEval for tool usage verification, PerformanceEval for benchmarking with statistical analysis, and EvalSuite for orchestrated multi-type evaluations. Includes supporting data models, examples, and test utilities.
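A minimal usage sketch of the API described above, assembled from the class names and constructor arguments that appear in this PR's `example_eval_usage.py` and `test_eval_framework.py` snippets; exact signatures (especially the `TestCase` fields and `Agent` construction shown here) are assumptions and may differ from the merged code:

```python
from praisonaiagents import Agent
from praisonaiagents.eval import AccuracyEval, EvalCriteria, EvalSuite, TestCase

agent = Agent(name="Research Agent", instructions="Answer factual questions concisely.")

# Single accuracy check against an expected answer (simple similarity or LLM judging)
accuracy = AccuracyEval(
    agent=agent,
    input="What is the capital of France?",
    expected_output="Paris",
)
result = accuracy.run()  # EvalResult (or BatchEvalResult when iterations > 1)
print(f"Accuracy: {result.score}/10")

# Weighted multi-criteria scoring; weights must sum to 1.0
# (how this object is wired into AccuracyEval is not shown in this thread)
criteria = EvalCriteria(factual_accuracy=0.5, completeness=0.3, relevance=0.2)

# Orchestrate mixed test types across agents (TestCase field names are assumptions)
suite = EvalSuite(
    name="Regression Suite",
    agents=[agent],
    test_cases=[TestCase(name="capital-of-france", eval_type="accuracy")],
)
suite_result = suite.run(verbose=True)
print(f"Suite success rate: {suite_result.success_rate:.1f}%")
```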
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant EvalSuite
    participant AccuracyEval
    participant Agent
    participant LLM as OpenAI LLM
    participant ResultAgg as Result Aggregator
    Client->>EvalSuite: run(verbose=True)
    EvalSuite->>EvalSuite: Iterate agents & test_cases
    loop For each TestCase
        EvalSuite->>AccuracyEval: _run_accuracy_test()
        AccuracyEval->>Agent: execute(input)
        Agent-->>AccuracyEval: actual_output
        alt Has EvalCriteria
            AccuracyEval->>LLM: evaluate_with_criteria(prompt)
            LLM-->>AccuracyEval: JSON scores
            AccuracyEval->>AccuracyEval: calculate_weighted_score()
        else Simple Scoring
            AccuracyEval->>AccuracyEval: _simple_similarity_score()
        end
        AccuracyEval-->>EvalSuite: test_result
        EvalSuite->>ResultAgg: accumulate result
    end
    EvalSuite->>EvalSuite: Compute success_rate
    EvalSuite->>EvalSuite: _check_alerts()
    EvalSuite->>EvalSuite: _export_results()
    EvalSuite-->>Client: EvalSuiteResult
```
```mermaid
sequenceDiagram
    participant Client
    participant AccuracyEval
    participant Agent
    participant LLM as Evaluator LLM
    participant Scorer
    Client->>AccuracyEval: run(verbose=False)
    alt Multiple Iterations
        loop For each iteration
            AccuracyEval->>AccuracyEval: _run_single_iteration()
            AccuracyEval->>AccuracyEval: Iterate test_cases
            activate AccuracyEval
            AccuracyEval->>Agent: execute(input)
            Agent-->>AccuracyEval: TaskOutput
            AccuracyEval->>AccuracyEval: _evaluate_single_output()
            alt With EvalCriteria
                AccuracyEval->>LLM: construct prompt
                LLM-->>AccuracyEval: parse JSON scores
                AccuracyEval->>Scorer: calculate_weighted_score()
                Scorer-->>AccuracyEval: weighted_score
            else Without Criteria
                AccuracyEval->>AccuracyEval: _simple_similarity_score()
            end
            AccuracyEval->>AccuracyEval: aggregate scores
            deactivate AccuracyEval
        end
        AccuracyEval->>AccuracyEval: _create_batch_result()
    else Single Iteration
        AccuracyEval->>AccuracyEval: execute once
    end
    opt save_results configured
        AccuracyEval->>AccuracyEval: _save_results()
    end
    AccuracyEval-->>Client: EvalResult or BatchEvalResult
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ 5 passed
Summary of Changes
Hello @MervinPraison, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request delivers a robust and extensible evaluation framework for PraisonAI agents, addressing the need for systematic quality assessment. It provides developers with tools to measure and improve agent performance, reliability, and accuracy through configurable tests, statistical analysis, and automation features, ultimately enhancing the overall quality assurance pipeline for agent development.
Highlights
- New Evaluation Framework: Introduced a comprehensive evaluation framework for PraisonAI agents, including core classes like `AccuracyEval`, `ReliabilityEval`, `PerformanceEval`, `EvalSuite`, `TestCase`, and `EvalCriteria`.
- Multi-faceted Evaluation Capabilities: The framework supports diverse evaluation types: accuracy (via simple similarity or LLM-based multi-criteria scoring), reliability (tool usage validation, including order and additional-tool tolerance), and performance (benchmarking runtime, memory, token usage, and time to first token).
- Automation and Reporting: Features include statistical reliability with confidence intervals, automated test suites with scheduling and alerts, and flexible result export options (JSON, HTML, Markdown) for continuous integration and quality assurance.
- Backward Compatibility: The new evaluation components are integrated using lazy loading to ensure full backward compatibility with existing PraisonAI agent implementations.
- Example and Test Coverage: A new example file (`example_eval_usage.py`) demonstrates the framework's capabilities, and a dedicated test script (`test_eval_framework.py`) validates its core components.
Code Review
This pull request introduces a comprehensive evaluation framework for PraisonAI agents, including modules for accuracy, reliability, and performance testing. The implementation is well-structured with clear separation of concerns. I've identified a few areas for improvement, including a high-severity performance issue in report generation, a medium-severity bug in result saving, and opportunities to make evaluation thresholds more configurable for better flexibility. Overall, this is a great addition to the library.
```python
# Run the evaluation
result = self.run()
```
The generate_report method currently calls self.run() internally. This is highly inefficient, as it will re-run the entire evaluation suite every time a report is generated, which can be very time-consuming and expensive.
The report generation should be decoupled from the test execution. A better approach is to have run() return the results, and then pass those results to generate_report().
I suggest changing the signature of generate_report to accept an EvalSuiteResult object.
```python
def generate_report(
    self,
    result: EvalSuiteResult,
    format: str = "json",
    include_graphs: bool = False,
    compare_with: Optional[str] = None
) -> str:
    """
    Generate a comprehensive evaluation report.

    Args:
        result: The result object from an EvalSuite run.
        format: Report format ("json", "html", "markdown")
        include_graphs: Whether to include performance graphs
        compare_with: Compare with previous results (e.g., "last_week")

    Returns:
        Report content or file path
    """
    try:
        # No longer runs the evaluation; uses the passed-in result object
```

```python
if hasattr(self, 'verbose') and self.verbose:
    print(f"Results saved to {self.save_results}")
```
The condition hasattr(self, 'verbose') and self.verbose will always evaluate to false because verbose is a parameter of the run method and is not set as an attribute on the class instance. This means the confirmation message for saving results is never printed, which can be confusing for users.
A better approach would be to use the logging module to inform the user that the file has been saved. This is more idiomatic for a library and allows the user to control visibility via their logging configuration.
logger.info(f"Results saved to {self.save_results}")|
```python
return {
    'type': 'accuracy',
    'passed': result.success and result.score >= 7.0,  # Default threshold
```
The passing threshold for accuracy tests is hardcoded to 7.0. This reduces the flexibility of the evaluation suite, as different tests might require different passing criteria.
Consider making this threshold configurable by adding a property to the TestCase dataclass, for example min_accuracy_score: Optional[float] = 7.0. This would allow users to define custom thresholds for each test case.
```diff
- 'passed': result.success and result.score >= 7.0,  # Default threshold
+ 'passed': result.success and result.score >= (test_case.min_score if hasattr(test_case, 'min_score') else 7.0),  # Default threshold
```
```python
)
result = evaluator.run()

passed = result.success and result.success_rate >= 80.0  # Default threshold
```
Similar to the accuracy test, the success rate threshold for reliability tests is hardcoded to 80.0. This should be configurable to accommodate different reliability requirements.
I recommend adding a min_success_rate field to the TestCase dataclass to allow users to set this value on a per-test basis.
```diff
- passed = result.success and result.success_rate >= 80.0  # Default threshold
+ passed = result.success and result.success_rate >= (test_case.min_success_rate if hasattr(test_case, 'min_success_rate') else 80.0)  # Default threshold
```
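A sketch of how `TestCase` could carry both per-test thresholds suggested above; the `min_score` and `min_success_rate` fields are proposals rather than existing API, and the other existing `TestCase` fields are omitted here:

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    name: str
    eval_type: str = "accuracy"      # "accuracy" | "reliability" | "performance"
    min_score: float = 7.0           # accuracy pass threshold on the 0-10 scale
    min_success_rate: float = 80.0   # reliability pass threshold in percent

# EvalSuite could then evaluate:
#   accuracy:    passed = result.success and result.score >= test_case.min_score
#   reliability: passed = result.success and result.success_rate >= test_case.min_success_rate
```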
Bug: Verbose Mode Not Functional
The _save_results method attempts to use self.verbose to control a print statement, but the verbose parameter is only passed to the run() method and never stored as an instance attribute. This causes the hasattr(self, 'verbose') check to always be False, preventing the "Results saved to..." message from being printed even when verbose mode is intended. The verbose parameter should be stored as an instance attribute or passed directly to _save_results.
Reference: `src/praisonai-agents/praisonaiagents/eval/accuracy_eval.py`, lines 279-281 (commit 3fa2a13)
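A trimmed sketch of one minimal fix (not the committed code): persist the flag on the instance so helper methods can read it reliably.

```python
import json
import logging

logger = logging.getLogger(__name__)

class AccuracyEval:  # only the pieces relevant to this bug are shown
    def __init__(self, save_results=None):
        self.save_results = save_results
        self.verbose = False                 # attribute always exists

    def run(self, verbose: bool = False):
        self.verbose = verbose               # persist for helpers like _save_results
        results = {"score": 8.5}
        if self.save_results:
            self._save_results(results)
        return results

    def _save_results(self, results: dict) -> None:
        with open(self.save_results, "w") as f:
            json.dump(results, f)
        if self.verbose:                     # reliably set; no hasattr() needed
            logger.info("Results saved to %s", self.save_results)
```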
@claude review this pull request and do a detailed analysis, and fix the issues if the existing code doesn't already implement the solution. Make sure it stays backward compatible and no existing features are removed. After making those changes, review the applied changes again. Use @web to search if you don't know something, or to find the latest documentation or latest version. Run the code if you think you need to test it. Start with minimal code changes if any changes are required.
Claude encountered an error (View job).
I'll analyze this and get back to you.
3fa2a13 to 274345d
Cursor Bugbot has reviewed your changes and found 3 potential issues.
```python
'passed': passed and result.success,
'runtime': result.runtime,
'memory_mb': result.memory_mb,
'tokens': result.tokens,
```
Performance test fails when result type is batch
Medium Severity
The _run_performance_test method accesses result.runtime, result.memory_mb, and result.tokens as scalar attributes, but PerformanceEval.run() returns Union[PerformanceResult, PerformanceBatchResult]. When PerformanceBatchResult is returned, these attributes don't exist — it has runtimes, memory_mbs, and tokens as lists instead. While current defaults (1 iteration, 1 query) avoid this, any configuration change or future refactoring would cause AttributeError at runtime.
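One defensive option is to normalize the result before reading metrics (a sketch; the `runtimes`/`memory_mbs` attribute names are taken from this review, and averaging over iterations is an assumed aggregation choice):

```python
from statistics import mean

def summarize_performance(result) -> dict:
    """Normalize a PerformanceEval result into scalar metrics for the suite."""
    if hasattr(result, "runtimes"):            # PerformanceBatchResult: list-valued fields
        return {
            "runtime": mean(result.runtimes) if result.runtimes else 0.0,
            "memory_mb": mean(result.memory_mbs) if result.memory_mbs else None,
            "tokens": sum(result.tokens) if result.tokens else 0,
        }
    return {                                   # PerformanceResult: scalar fields
        "runtime": result.runtime,
        "memory_mb": result.memory_mb,
        "tokens": result.tokens,
    }
```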
```python
# Execute the task
task_result = self.agent.execute(test_input)
if not isinstance(task_result, TaskOutput):
    task_result = TaskOutput(raw=str(task_result))
```
TaskOutput instantiation missing required Pydantic fields
High Severity
When the agent's execute method returns a non-TaskOutput result, the code attempts to wrap it with TaskOutput(raw=str(task_result)). However, TaskOutput is a Pydantic model with required fields description and agent in addition to raw. This instantiation will raise a Pydantic ValidationError at runtime, causing the reliability evaluation to fail for any agent that doesn't return a TaskOutput directly.
```python
return {
    'type': 'accuracy',
    'passed': result.success and result.score >= 7.0,  # Default threshold
    'score': result.score,
```
Accuracy test accesses missing score attribute on batch result
Medium Severity
The _run_accuracy_test method accesses result.score directly, but AccuracyEval.run() returns Union[EvalResult, BatchEvalResult]. BatchEvalResult doesn't have a score attribute — it has avg_score instead. While current defaults return EvalResult, this would cause an AttributeError if the AccuracyEval configuration ever uses iterations > 1.
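The same normalization pattern applies here (a sketch; `avg_score` is the attribute name cited in this review):

```python
def extract_score(result) -> float:
    """Return one accuracy score whether run() produced a single or batch result."""
    if hasattr(result, "score"):   # EvalResult (single iteration)
        return result.score
    return result.avg_score        # BatchEvalResult (iterations > 1)
```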
Actionable comments posted: 8
🧹 Nitpick comments (1)
src/praisonai-agents/praisonaiagents/eval/__init__.py (1)
Lines 15-23: Sort `__all__` to satisfy Ruff RUF022.

Purely stylistic, but keeps lint clean.

🔧 Example sorting

```diff
 __all__ = [
-    'AccuracyEval',
-    'ReliabilityEval',
-    'PerformanceEval',
-    'EvalSuite',
-    'TestCase',
-    'EvalCriteria',
-    'EvalResult'
+    'AccuracyEval',
+    'EvalCriteria',
+    'EvalResult',
+    'EvalSuite',
+    'PerformanceEval',
+    'ReliabilityEval',
+    'TestCase',
 ]
```
```python
eval_test = AccuracyEval(
    agent=agent,
    input="What is the capital of France?",
    expected_output="Paris"
)

print("Running basic accuracy evaluation...")
# Note: In a real scenario, you would run: result = eval_test.run()
# print(f"Accuracy: {result.score}/10")
print("✓ AccuracyEval configured successfully")
```
Use created objects to avoid F841 unused-variable errors.
These example objects are currently unused and trigger lint errors. A minimal fix is to use them in the existing print statements.
🛠️ Suggested fix

```diff
- print("✓ AccuracyEval configured successfully")
+ print(f"✓ AccuracyEval configured successfully for agent: {eval_test.agent.name}")
@@
- print("Advanced accuracy evaluation configured with:")
+ print(f"Advanced accuracy evaluation configured for {eval_test.agent.name}:")
@@
- print("Reliability testing configured for:")
+ print(f"Reliability testing configured for {len(eval_test.test_scenarios)} scenarios:")
@@
- print("Performance evaluation configured with:")
+ print(f"Performance evaluation configured for {len(eval_test.benchmark_queries)} queries:")
@@
- agents = [agent]  # In practice, you'd have multiple agents
- # comparison = PerformanceEval.compare(
- #     agents=agents,
+ # comparison = PerformanceEval.compare(
+ #     agents=[agent],
  #     benchmark_suite="standard",
  #     export_format="html"
  # )
@@
- print("Automated test suite configured with:")
+ print(f"Automated test suite '{suite.name}' configured with {len(suite.test_cases)} tests:")
@@
- print("Integration features planned:")
+ print(f"Integration features planned for agent: {agent.name}")
+ print("Integration features planned:")
```

Also applies to: 56-84, 104-124, 144-167, 172-172, 192-231, 251-285
🧰 Tools
🪛 Ruff (0.14.14)
[error] 32-32: Local variable eval_test is assigned to but never used
Remove assignment to unused variable eval_test
(F841)
🤖 Prompt for AI Agents
In `@src/praisonai-agents/example_eval_usage.py` around lines 32 - 41, The example
creates unused objects like eval_test (AccuracyEval) which trigger F841; update
the example to reference or use these objects in the existing prints (e.g.,
include eval_test and agent in the print output or call a lightweight method
like eval_test.run() or str(eval_test) to show configuration) so the variables
are used; apply the same change pattern to the other unused objects in the file
(the blocks around lines noted in the review) so each created variable (e.g.,
eval_test, any other *_test or created agent objects) is referenced in a print
or benign call to avoid the unused-variable lint error.
````python
def _llm_evaluate_with_criteria(self, actual: str, expected: str, criteria: EvalCriteria) -> float:
    """Use LLM to evaluate output against criteria."""
    try:
        from ..llm import get_openai_client

        client = get_openai_client(self.evaluator_llm)

        evaluation_prompt = f"""
        Evaluate the following response based on these criteria:
        - Factual Accuracy ({criteria.factual_accuracy*100}%): How factually correct is the response?
        - Completeness ({criteria.completeness*100}%): How complete is the response?
        - Relevance ({criteria.relevance*100}%): How relevant is the response to the expected output?

        Expected Output: {expected}
        Actual Output: {actual}

        Rate each criterion from 0-10 and provide the scores in this exact JSON format:
        {{
            "factual_accuracy": <score>,
            "completeness": <score>,
            "relevance": <score>,
            "explanation": "<brief explanation>"
        }}
        """

        response = client.chat.completions.create(
            model=self.evaluator_llm,
            messages=[{"role": "user", "content": evaluation_prompt}],
            temperature=0.1
        )

        # Parse response
        response_text = response.choices[0].message.content.strip()
        if response_text.startswith('```json'):
            response_text = response_text[7:-3]
        elif response_text.startswith('```'):
            response_text = response_text[3:-3]

        eval_scores = json.loads(response_text)

        # Calculate weighted score
        return criteria.calculate_weighted_score(eval_scores)

    except Exception as e:
        logger.error(f"Error in LLM evaluation: {e}")
        # Fallback to simple similarity
        return self._simple_similarity_score(actual, expected)
````
Don't pass the model name into get_openai_client.
get_openai_client expects optional api_key and base_url parameters, so passing self.evaluator_llm (a model name like "gpt-4o-mini") will incorrectly bind it as the api_key argument and break authentication. Use get_openai_client() without arguments to rely on environment variables or defaults, and keep the model name only for the model= parameter in the chat completion call.
🛠️ Suggested fix
```diff
- client = get_openai_client(self.evaluator_llm)
+ client = get_openai_client()
```

🧰 Tools
🪛 Ruff (0.14.14)
[warning] 160-160: Do not catch blind exception: Exception
(BLE001)
[warning] 161-161: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
🤖 Prompt for AI Agents
In `@src/praisonai-agents/praisonaiagents/eval/accuracy_eval.py` around lines 117
- 163, In _llm_evaluate_with_criteria: stop passing self.evaluator_llm into
get_openai_client (which takes api_key/base_url), call get_openai_client() with
no arguments so authentication uses env/defaults, and keep using
self.evaluator_llm only as the model= value in the
client.chat.completions.create call; update the invocation of get_openai_client
in this method and verify client is used unchanged for the chat completion
request.
```python
def __post_init__(self):
    """Validate that weights sum to 1.0."""
    total = self.factual_accuracy + self.completeness + self.relevance
    if abs(total - 1.0) > 0.001:
        raise ValueError(f"Criteria weights must sum to 1.0, got {total}")

@property
def weights(self) -> Dict[str, float]:
    """Get criteria weights as dictionary."""
    return {
        'factual_accuracy': self.factual_accuracy,
        'completeness': self.completeness,
        'relevance': self.relevance
    }

def calculate_weighted_score(self, scores: Dict[str, float]) -> float:
    """Calculate weighted score from individual criteria scores."""
    total_score = 0.0
    for criterion, weight in self.weights.items():
        if criterion in scores:
            total_score += scores[criterion] * weight
    return total_score
```
Validate non‑negative weights in criteria.
Weights like 1.2, -0.1, -0.1 pass the sum check but invert scoring. Add a non‑negative guard.
🛠️ Suggested fix
```diff
 def __post_init__(self):
     """Validate that weights sum to 1.0."""
+    if any(w < 0 for w in (self.factual_accuracy, self.completeness, self.relevance)):
+        raise ValueError("Criteria weights must be non-negative")
     total = self.factual_accuracy + self.completeness + self.relevance
     if abs(total - 1.0) > 0.001:
         raise ValueError(f"Criteria weights must sum to 1.0, got {total}")
```

🧰 Tools
🪛 Ruff (0.14.14)
[warning] 20-20: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In `@src/praisonai-agents/praisonaiagents/eval/eval_criteria.py` around lines 16 -
37, The __post_init__ currently only checks the sum of factual_accuracy,
completeness, and relevance in eval_criteria but does not prevent negative
values; update __post_init__ to validate each weight (factual_accuracy,
completeness, relevance) is >= 0 (and optionally <= 1 if you prefer) and raise a
ValueError with a clear message if any weight is negative; keep the existing sum
check, and ensure calculate_weighted_score and the weights property remain
unchanged so negative weights cannot invert scoring.
```python
def run(self, verbose: bool = False) -> EvalSuiteResult:
    """
    Run the complete evaluation suite.

    Args:
        verbose: Whether to print detailed output

    Returns:
        EvalSuiteResult with comprehensive results
    """
    if verbose:
        print(f"Running evaluation suite: {self.name}")
        print(f"Agents: {len(self.agents)}, Test cases: {len(self.test_cases)}")

    total_tests = 0
    passed_tests = 0
    agent_results = {}

    try:
        for agent in self.agents:
            agent_name = getattr(agent, 'name', f"Agent_{id(agent)}")
            if verbose:
                print(f"\nEvaluating agent: {agent_name}")

            agent_test_results = []

            for test_case in self.test_cases:
                if verbose:
                    print(f"  Running test: {test_case.name}")

                total_tests += 1

                # Run appropriate test type
                if test_case.eval_type == "accuracy":
                    test_result = self._run_accuracy_test(agent, test_case)
                elif test_case.eval_type == "reliability":
                    test_result = self._run_reliability_test(agent, test_case)
                elif test_case.eval_type == "performance":
                    test_result = self._run_performance_test(agent, test_case)
                else:
                    logger.warning(f"Unknown test type: {test_case.eval_type}")
                    test_result = {
                        'type': test_case.eval_type,
                        'passed': False,
                        'error': f"Unknown test type: {test_case.eval_type}"
                    }

                test_result['test_case'] = test_case.to_dict()
                agent_test_results.append(test_result)

                if test_result['passed']:
                    passed_tests += 1

                if verbose:
                    status = "PASS" if test_result['passed'] else "FAIL"
                    print(f"  {status}: {test_case.name}")

            agent_results[agent_name] = agent_test_results

        # Calculate overall results
        failed_tests = total_tests - passed_tests
        success_rate = (passed_tests / total_tests * 100) if total_tests > 0 else 0.0

        suite_result = EvalSuiteResult(
            name=self.name,
            total_tests=total_tests,
            passed_tests=passed_tests,
            failed_tests=failed_tests,
            success_rate=success_rate,
            details={
                'agent_results': agent_results,
                'test_cases': [tc.to_dict() for tc in self.test_cases]
            }
        )

        if verbose:
            print(f"\nSuite Results: {suite_result.summary}")

        # Check alerts
        self._check_alerts(suite_result)

        # Export results
        if self.export_results:
            self._export_results(suite_result)

        return suite_result

    except Exception as e:
        logger.error(f"Error running evaluation suite: {e}")
        return EvalSuiteResult(
            name=self.name,
            total_tests=0,
            passed_tests=0,
            failed_tests=0,
            success_rate=0.0,
            details={'error': str(e)}
        )
```
Prevent duplicate agent names from overwriting results.
agent_results uses agent names as keys, so duplicate names will overwrite earlier entries.
🛠️ Suggested fix
```diff
-            for agent in self.agents:
-                agent_name = getattr(agent, 'name', f"Agent_{id(agent)}")
+            for idx, agent in enumerate(self.agents):
+                agent_name = getattr(agent, 'name', f"Agent_{id(agent)}")
+                agent_key = agent_name if agent_name not in agent_results else f"{agent_name}_{idx}"
@@
-            agent_results[agent_name] = agent_test_results
+            agent_results[agent_key] = agent_test_results
```
🧰 Tools
🪛 Ruff (0.14.14)
[warning] 299-299: Consider moving this statement to an else block
(TRY300)
[warning] 301-301: Do not catch blind exception: Exception
(BLE001)
[warning] 302-302: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
🤖 Prompt for AI Agents
In `@src/praisonai-agents/praisonaiagents/eval/eval_suite.py` around lines 214 -
310, The code currently keys agent_results by agent_name in run(), which allows
later agents with the same name to overwrite earlier results; fix by
disambiguating duplicate agent names before assignment: when computing
agent_name (in run), check if agent_name already exists in agent_results and if
so append a deterministic suffix (e.g., "_1", "_2") or include the object's id
(f"{agent_name}_{id(agent)}") to produce a unique key, then store
agent_test_results under that unique key; ensure any printed/logged agent label
uses the same unique name so outputs and keys remain consistent.
```python
import time
import psutil
import os
import json
import logging
from typing import List, Dict, Any, Optional, Union
from ..agent.agent import Agent
from ..main import TaskOutput
from .eval_result import PerformanceResult, PerformanceBatchResult

logger = logging.getLogger(__name__)

class PerformanceEval:
    """Evaluate agent performance metrics like runtime, memory, and token usage."""

    def __init__(
        self,
        agent: Agent,
        benchmark_queries: Optional[List[str]] = None,
        metrics: Optional[Dict[str, bool]] = None,
        iterations: int = 1,
        warmup: int = 0
    ):
        """
        Initialize performance evaluation.

        Args:
            agent: Agent to evaluate
            benchmark_queries: List of queries to benchmark
            metrics: Dict of metrics to track (runtime, memory, tokens, ttft)
            iterations: Number of iterations to run
            warmup: Number of warmup iterations (not counted in results)
        """
        self.agent = agent
        self.benchmark_queries = benchmark_queries or ["Hello, how are you?"]
        self.metrics = metrics or {
            'runtime': True,
            'memory': True,
            'tokens': True,
            'ttft': True
        }
        self.iterations = iterations
        self.warmup = warmup

    def _get_memory_usage(self) -> float:
        """Get current memory usage in MB."""
        try:
            process = psutil.Process(os.getpid())
            return process.memory_info().rss / 1024 / 1024  # Convert to MB
        except Exception:
            return None
```
Guard psutil or declare it as a dependency.
Unconditional import on line 6 will crash module load if psutil isn't installed, even when memory metrics aren't used. Prefer an optional import or add it to install requirements.
🛠️ Suggested fix (optional import + safe fallback)
```diff
-import psutil
+try:
+    import psutil
+    _psutil_available = True
+except ImportError:
+    psutil = None
+    _psutil_available = False
@@
-    def _get_memory_usage(self) -> float:
+    def _get_memory_usage(self) -> Optional[float]:
         """Get current memory usage in MB."""
+        if not _psutil_available:
+            return None
         try:
             process = psutil.Process(os.getpid())
             return process.memory_info().rss / 1024 / 1024  # Convert to MB
         except Exception:
             return None
```

🧰 Tools
🪛 Ruff (0.14.14)
[warning] 54-54: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In `@src/praisonai-agents/praisonaiagents/eval/performance_eval.py` around lines 5
- 55, The module currently unconditionally imports psutil which will crash
import if it's not installed; update the code so psutil is imported optionally
(wrap the import in try/except ImportError at module scope), set a fallback
(e.g., set a module-level flag like _PSUTIL_AVAILABLE = False) and ensure
PerformanceEval._get_memory_usage checks that flag and returns None (or disables
the 'memory' metric in self.metrics) when psutil is unavailable; alternatively,
if you prefer requiring psutil, add it to install requirements—make the change
around the top-level import and in the PerformanceEval.__init__ /
_get_memory_usage logic so memory metrics are guarded.
```python
@property
def total_scenarios(self) -> int:
    """Total number of scenarios."""
    return len(self.scenarios)

@property
def passed_scenarios(self) -> int:
    """Number of passed scenarios."""
    return len([s for s in self.scenarios if s.status == "passed"])

@property
def failed_scenarios(self) -> int:
    """Number of failed scenarios."""
    return len([s for s in self.scenarios if s.status == "failed"])

@property
def success_rate(self) -> float:
    """Success rate as percentage."""
    if self.total_scenarios == 0:
        return 100.0
    return (self.passed_scenarios / self.total_scenarios) * 100.0
```
Avoid reporting 100% success when there are zero scenarios.
Returning 100.0 with no data can mislead reports and gates.
🩹 Suggested fix
```diff
 def success_rate(self) -> float:
     """Success rate as percentage."""
     if self.total_scenarios == 0:
-        return 100.0
+        return 0.0
     return (self.passed_scenarios / self.total_scenarios) * 100.0
```

🤖 Prompt for AI Agents
In `@src/praisonai-agents/praisonaiagents/eval/reliability_eval.py` around lines
32 - 52, The success_rate property currently returns 100.0 when total_scenarios
== 0 which is misleading; update the success_rate getter to return 0.0 (or
another explicit neutral value) when self.total_scenarios == 0 instead of 100.0
so empty datasets don't appear fully successful; modify the success_rate
property implementation that references total_scenarios/passed_scenarios to
check for zero and return 0.0 before performing the division.
```python
scenario_name = scenario.get('name', f"Scenario {scenario.get('input', '')[:20]}")
test_input = scenario.get('input', '')
expected_tools = scenario.get('expected_tools', [])
required_order = scenario.get('required_order', False)
allow_additional = scenario.get('allow_additional', False)

try:
    # Execute the task
    task_result = self.agent.execute(test_input)
    if not isinstance(task_result, TaskOutput):
        task_result = TaskOutput(raw=str(task_result))

    # Extract actual tool calls
    actual_tools = self._extract_tool_calls(task_result)

    # Evaluate tool usage
    failed_tools = []
    unexpected_tools = []

    # Check for missing expected tools
    if required_order:
        # Check order and presence
        expected_set = set(expected_tools)
        actual_set = set(actual_tools)
        missing_tools = expected_set - actual_set
        failed_tools.extend(list(missing_tools))

        # Check order for tools that are present
        common_tools = [t for t in expected_tools if t in actual_tools]
        actual_order = [t for t in actual_tools if t in common_tools]

        if common_tools != actual_order[:len(common_tools)]:
            # Order mismatch
            failed_tools.append("tool_order_mismatch")
    else:
        # Just check presence
        missing_tools = set(expected_tools) - set(actual_tools)
        failed_tools.extend(list(missing_tools))

    # Check for unexpected tools
    if not allow_additional:
        extra_tools = set(actual_tools) - set(expected_tools)
        unexpected_tools.extend(list(extra_tools))

    # Determine status
    status = "passed" if not failed_tools and not unexpected_tools else "failed"

    details = {
        'input': test_input,
        'expected_tools': expected_tools,
        'actual_tools': actual_tools,
        'required_order': required_order,
        'allow_additional': allow_additional,
        'task_output': task_result.raw if hasattr(task_result, 'raw') else str(task_result)
    }

    return ReliabilityScenario(
        name=scenario_name,
        status=status,
        failed_tools=failed_tools,
        unexpected_tools=unexpected_tools,
        details=details
    )
```
Fix TaskOutput construction with all required fields to prevent validation errors.
Line 169 constructs TaskOutput(raw=str(task_result)) but the class requires three fields: description: str, raw: str, and agent: str. This will raise a Pydantic ValidationError whenever a non-TaskOutput result is returned. Also guard expected_tools against None or non-list values.
🛠️ Suggested fix
```diff
-    expected_tools = scenario.get('expected_tools', [])
+    expected_tools = scenario.get('expected_tools') or []
+    if not isinstance(expected_tools, list):
+        expected_tools = [str(expected_tools)]
     required_order = scenario.get('required_order', False)
     allow_additional = scenario.get('allow_additional', False)

     try:
         # Execute the task
         task_result = self.agent.execute(test_input)
-        if not isinstance(task_result, TaskOutput):
-            task_result = TaskOutput(raw=str(task_result))
+        if isinstance(task_result, TaskOutput):
+            task_output = task_result
+        else:
+            task_output = TaskOutput(
+                description="reliability_eval",
+                raw=str(task_result),
+                agent=getattr(self.agent, "name", "unknown"),
+                output_format="RAW",
+            )

         # Extract actual tool calls
-        actual_tools = self._extract_tool_calls(task_result)
+        actual_tools = self._extract_tool_calls(task_output)
@@
-            'task_output': task_result.raw if hasattr(task_result, 'raw') else str(task_result)
+            'task_output': task_output.raw if hasattr(task_output, 'raw') else str(task_output)
         }
```

🤖 Prompt for AI Agents
In `@src/praisonai-agents/praisonaiagents/eval/reliability_eval.py` around lines
159 - 221, The code currently builds TaskOutput using
TaskOutput(raw=str(task_result)) in the execute block which will raise
validation errors because TaskOutput requires description, raw, and agent; also
expected_tools may be None or not a list. Update the agent.execute handling in
the method (where task_result is set) to: if the returned value is not an
instance of TaskOutput, wrap it in a TaskOutput providing sensible default
values for all required fields (e.g., description as an empty string or short
summary, raw=str(task_result), and agent as the agent's identifier), and
normalize expected_tools right after reading it (ensure expected_tools is a
list, defaulting to [] if None or not iterable) so downstream set/list
operations and _extract_tool_calls(...) work safely when evaluating and
constructing the ReliabilityScenario.
```python
eval_test = AccuracyEval(
    agent=agent,
    input="What is the capital of France?",
    expected_output="Paris"
)
print("✅ AccuracyEval created successfully")
return True
```
Avoid F841 unused-variable errors for created evaluators.
Use the created objects in return statements so lint doesn’t fail.
🛠️ Suggested fix
```diff
     eval_test = AccuracyEval(
         agent=agent,
         input="What is the capital of France?",
         expected_output="Paris"
     )
     print("✅ AccuracyEval created successfully")
-    return True
+    return eval_test is not None
@@
     eval_test = ReliabilityEval(
         agent=agent,
         test_scenarios=test_scenarios
     )
     print("✅ ReliabilityEval created successfully")
-    return True
+    return eval_test is not None
@@
     eval_test = PerformanceEval(
         agent=agent,
         benchmark_queries=["Hello, how are you?"],
         metrics={"runtime": True, "memory": True}
     )
     print("✅ PerformanceEval created successfully")
-    return True
+    return eval_test is not None
@@
     suite = EvalSuite(
         name="Test Suite",
         agents=[agent],
         test_cases=test_cases
     )
     print("✅ EvalSuite created successfully")
-    return True
+    return suite is not None
@@
     criteria = EvalCriteria(
         factual_accuracy=0.5,
         completeness=0.3,
         relevance=0.2
     )
     print("✅ EvalCriteria created successfully")
-    return True
+    return criteria is not None
```

Also applies to: 58-63, 71-77, 100-106, 114-120
🧰 Tools
🪛 Ruff (0.14.14)
[error] 38-38: Local variable eval_test is assigned to but never used
Remove assignment to unused variable eval_test
(F841)
[warning] 44-44: Consider moving this statement to an else block
(TRY300)
🤖 Prompt for AI Agents
In `@src/praisonai-agents/test_eval_framework.py` around lines 38 - 44, The
created evaluator instance (e.g., eval_test created by AccuracyEval) is never
used and triggers unused-variable lint errors; replace the placeholder return
True with returning the created object (return eval_test) and do the same for
the other evaluator variables in the file (for the blocks at 58-63, 71-77,
100-106, 114-120) — either return each created evaluator or collect them into a
list/tuple and return that so the objects are referenced and the F841 warnings
are resolved.
Implements comprehensive evaluation framework as requested in issue #967
Closes #967
Generated with Claude Code
Note

Medium Risk: New evaluation code executes agents and can invoke LLM calls and psutil-based memory tracking, which may affect runtime, dependencies, and network usage when used. Changes are additive and lazily imported, reducing impact on existing consumers.

Overview

Introduces a new `praisonaiagents.eval` module providing `AccuracyEval` (string-similarity scoring plus optional LLM-based multi-criteria judging with iterations and result export), `ReliabilityEval` (validates expected tool calls/order from `TaskOutput`), and `PerformanceEval` (benchmarks runtime/memory and optionally token usage with warmups, batch stats, and cross-agent comparison).

Adds `EvalSuite`/`TestCase` orchestration for running mixed evaluation types across agents with basic pass/fail thresholds, alert/export hooks, and report generation, plus shared result dataclasses (`EvalResult`, batch/performance/reliability result types) and an `EvalCriteria` weight helper.

Exports the evaluation API via `praisonaiagents.__init__` using lazy loading for backward compatibility, and includes an `example_eval_usage.py` walkthrough and a `test_eval_framework.py` smoke-test script.

Written by Cursor Bugbot for commit 274345d.
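A sketch of the lazy-loading export pattern described above, using module-level `__getattr__` (PEP 562); the actual wiring in `praisonaiagents/__init__.py` may differ:

```python
# praisonaiagents/__init__.py (illustrative excerpt)
import importlib

_EVAL_EXPORTS = {
    "AccuracyEval", "ReliabilityEval", "PerformanceEval",
    "EvalSuite", "TestCase", "EvalCriteria", "EvalResult",
}

def __getattr__(name):
    # Import the eval subpackage only when one of its symbols is first requested,
    # so existing users who never touch evaluation pay no extra import cost.
    if name in _EVAL_EXPORTS:
        return getattr(importlib.import_module(".eval", __name__), name)
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```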