feat: implement evaluation framework for praisonaiagents#976
MervinPraison wants to merge 1 commit into main.
Conversation
- Add comprehensive evaluation framework with minimal client-side code
- Implement AccuracyEval with simple similarity and LLM-based scoring
- Implement ReliabilityEval for tool usage validation
- Implement PerformanceEval for runtime, memory, and token benchmarking
- Add EvalSuite for automated test suites with CI/CD integration
- Include EvalCriteria for multi-dimensional evaluation scoring
- Support statistical reliability with multiple iterations and confidence intervals
- Add result export capabilities (JSON, HTML, Markdown)
- Integrate with existing Agent, Task, and PraisonAIAgents classes
- Ensure backward compatibility with lazy loading
- Include comprehensive test suite and usage examples

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: Mervin Praison <MervinPraison@users.noreply.github.com>
📝 Walkthrough

Introduces a comprehensive client-side evaluation framework for PraisonAI agents, including AccuracyEval with LLM-based multi-criteria scoring, ReliabilityEval for tool usage verification, PerformanceEval for benchmarking with statistical analysis, and EvalSuite for orchestrated multi-type evaluations. Includes supporting data models, examples, and test utilities.
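A minimal usage sketch of the API described above, assembled from the class names and constructor arguments that appear in this PR's `example_eval_usage.py` and `test_eval_framework.py` snippets; exact signatures (especially the `TestCase` fields and `Agent` construction shown here) are assumptions and may differ from the merged code:

```python
from praisonaiagents import Agent
from praisonaiagents.eval import AccuracyEval, EvalCriteria, EvalSuite, TestCase

agent = Agent(name="Research Agent", instructions="Answer factual questions concisely.")

# Single accuracy check against an expected answer (simple similarity or LLM judging)
accuracy = AccuracyEval(
    agent=agent,
    input="What is the capital of France?",
    expected_output="Paris",
)
result = accuracy.run()  # EvalResult (or BatchEvalResult when iterations > 1)
print(f"Accuracy: {result.score}/10")

# Weighted multi-criteria scoring; weights must sum to 1.0
# (how this object is wired into AccuracyEval is not shown in this thread)
criteria = EvalCriteria(factual_accuracy=0.5, completeness=0.3, relevance=0.2)

# Orchestrate mixed test types across agents (TestCase field names are assumptions)
suite = EvalSuite(
    name="Regression Suite",
    agents=[agent],
    test_cases=[TestCase(name="capital-of-france", eval_type="accuracy")],
)
suite_result = suite.run(verbose=True)
print(f"Suite success rate: {suite_result.success_rate:.1f}%")
```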
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant EvalSuite
    participant AccuracyEval
    participant Agent
    participant LLM as OpenAI LLM
    participant ResultAgg as Result Aggregator
    Client->>EvalSuite: run(verbose=True)
    EvalSuite->>EvalSuite: Iterate agents & test_cases
    loop For each TestCase
        EvalSuite->>AccuracyEval: _run_accuracy_test()
        AccuracyEval->>Agent: execute(input)
        Agent-->>AccuracyEval: actual_output
        alt Has EvalCriteria
            AccuracyEval->>LLM: evaluate_with_criteria(prompt)
            LLM-->>AccuracyEval: JSON scores
            AccuracyEval->>AccuracyEval: calculate_weighted_score()
        else Simple Scoring
            AccuracyEval->>AccuracyEval: _simple_similarity_score()
        end
        AccuracyEval-->>EvalSuite: test_result
        EvalSuite->>ResultAgg: accumulate result
    end
    EvalSuite->>EvalSuite: Compute success_rate
    EvalSuite->>EvalSuite: _check_alerts()
    EvalSuite->>EvalSuite: _export_results()
    EvalSuite-->>Client: EvalSuiteResult
```
```mermaid
sequenceDiagram
    participant Client
    participant AccuracyEval
    participant Agent
    participant LLM as Evaluator LLM
    participant Scorer
    Client->>AccuracyEval: run(verbose=False)
    alt Multiple Iterations
        loop For each iteration
            AccuracyEval->>AccuracyEval: _run_single_iteration()
            AccuracyEval->>AccuracyEval: Iterate test_cases
            activate AccuracyEval
            AccuracyEval->>Agent: execute(input)
            Agent-->>AccuracyEval: TaskOutput
            AccuracyEval->>AccuracyEval: _evaluate_single_output()
            alt With EvalCriteria
                AccuracyEval->>LLM: construct prompt
                LLM-->>AccuracyEval: parse JSON scores
                AccuracyEval->>Scorer: calculate_weighted_score()
                Scorer-->>AccuracyEval: weighted_score
            else Without Criteria
                AccuracyEval->>AccuracyEval: _simple_similarity_score()
            end
            AccuracyEval->>AccuracyEval: aggregate scores
            deactivate AccuracyEval
        end
        AccuracyEval->>AccuracyEval: _create_batch_result()
    else Single Iteration
        AccuracyEval->>AccuracyEval: execute once
    end
    opt save_results configured
        AccuracyEval->>AccuracyEval: _save_results()
    end
    AccuracyEval-->>Client: EvalResult or BatchEvalResult
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ 5 passed
Summary of Changes
Hello @MervinPraison, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request delivers a robust and extensible evaluation framework for PraisonAI agents, addressing the need for systematic quality assessment. It provides developers with tools to measure and improve agent performance, reliability, and accuracy through configurable tests, statistical analysis, and automation features, ultimately enhancing the overall quality assurance pipeline for agent development.
Highlights
- New Evaluation Framework: Introduced a comprehensive evaluation framework for PraisonAI agents, including core classes like `AccuracyEval`, `ReliabilityEval`, `PerformanceEval`, `EvalSuite`, `TestCase`, and `EvalCriteria`.
- Multi-faceted Evaluation Capabilities: The framework supports diverse evaluation types: accuracy (via simple similarity or LLM-based multi-criteria scoring), reliability (tool usage validation, including order and additional-tool tolerance), and performance (benchmarking runtime, memory, token usage, and time to first token).
- Automation and Reporting: Features include statistical reliability with confidence intervals, automated test suites with scheduling and alerts, and flexible result export options (JSON, HTML, Markdown) for continuous integration and quality assurance.
- Backward Compatibility: The new evaluation components are integrated using lazy loading to ensure full backward compatibility with existing PraisonAI agent implementations.
- Example and Test Coverage: A new example file (`example_eval_usage.py`) demonstrates the framework's capabilities, and a dedicated test script (`test_eval_framework.py`) validates its core components.
Code Review
This pull request introduces a comprehensive evaluation framework for PraisonAI agents, including modules for accuracy, reliability, and performance testing. The implementation is well-structured with clear separation of concerns. I've identified a few areas for improvement, including a high-severity performance issue in report generation, a medium-severity bug in result saving, and opportunities to make evaluation thresholds more configurable for better flexibility. Overall, this is a great addition to the library.
```python
# Run the evaluation
result = self.run()
```
The generate_report method currently calls self.run() internally. This is highly inefficient, as it will re-run the entire evaluation suite every time a report is generated, which can be very time-consuming and expensive.
The report generation should be decoupled from the test execution. A better approach is to have run() return the results, and then pass those results to generate_report().
I suggest changing the signature of generate_report to accept an EvalSuiteResult object.
```python
def generate_report(
    self,
    result: EvalSuiteResult,
    format: str = "json",
    include_graphs: bool = False,
    compare_with: Optional[str] = None
) -> str:
    """
    Generate a comprehensive evaluation report.

    Args:
        result: The result object from an EvalSuite run.
        format: Report format ("json", "html", "markdown")
        include_graphs: Whether to include performance graphs
        compare_with: Compare with previous results (e.g., "last_week")

    Returns:
        Report content or file path
    """
    try:
        # No longer runs the evaluation; uses the passed-in result object
```

```python
if hasattr(self, 'verbose') and self.verbose:
    print(f"Results saved to {self.save_results}")
```
The condition hasattr(self, 'verbose') and self.verbose will always evaluate to false because verbose is a parameter of the run method and is not set as an attribute on the class instance. This means the confirmation message for saving results is never printed, which can be confusing for users.
A better approach would be to use the logging module to inform the user that the file has been saved. This is more idiomatic for a library and allows the user to control visibility via their logging configuration.
logger.info(f"Results saved to {self.save_results}")|
```python
return {
    'type': 'accuracy',
    'passed': result.success and result.score >= 7.0,  # Default threshold
```
The passing threshold for accuracy tests is hardcoded to 7.0. This reduces the flexibility of the evaluation suite, as different tests might require different passing criteria.
Consider making this threshold configurable by adding a property to the TestCase dataclass, for example min_accuracy_score: Optional[float] = 7.0. This would allow users to define custom thresholds for each test case.
```diff
- 'passed': result.success and result.score >= 7.0,  # Default threshold
+ 'passed': result.success and result.score >= (test_case.min_score if hasattr(test_case, 'min_score') else 7.0),  # Default threshold
```
```python
)
result = evaluator.run()

passed = result.success and result.success_rate >= 80.0  # Default threshold
```
Similar to the accuracy test, the success rate threshold for reliability tests is hardcoded to 80.0. This should be configurable to accommodate different reliability requirements.
I recommend adding a min_success_rate field to the TestCase dataclass to allow users to set this value on a per-test basis.
```diff
- passed = result.success and result.success_rate >= 80.0  # Default threshold
+ passed = result.success and result.success_rate >= (test_case.min_success_rate if hasattr(test_case, 'min_success_rate') else 80.0)  # Default threshold
```
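A sketch of how `TestCase` could carry both per-test thresholds suggested above; the `min_score` and `min_success_rate` fields are proposals rather than existing API, and the other existing `TestCase` fields are omitted here:

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    name: str
    eval_type: str = "accuracy"      # "accuracy" | "reliability" | "performance"
    min_score: float = 7.0           # accuracy pass threshold on the 0-10 scale
    min_success_rate: float = 80.0   # reliability pass threshold in percent

# EvalSuite could then evaluate:
#   accuracy:    passed = result.success and result.score >= test_case.min_score
#   reliability: passed = result.success and result.success_rate >= test_case.min_success_rate
```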
Bug: Verbose Mode Not Functional
The _save_results method attempts to use self.verbose to control a print statement, but the verbose parameter is only passed to the run() method and never stored as an instance attribute. This causes the hasattr(self, 'verbose') check to always be False, preventing the "Results saved to..." message from being printed even when verbose mode is intended. The verbose parameter should be stored as an instance attribute or passed directly to _save_results.
Reference: `src/praisonai-agents/praisonaiagents/eval/accuracy_eval.py`, lines 279-281 (commit 3fa2a13)
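A trimmed sketch of one minimal fix (not the committed code): persist the flag on the instance so helper methods can read it reliably.

```python
import json
import logging

logger = logging.getLogger(__name__)

class AccuracyEval:  # only the pieces relevant to this bug are shown
    def __init__(self, save_results=None):
        self.save_results = save_results
        self.verbose = False                 # attribute always exists

    def run(self, verbose: bool = False):
        self.verbose = verbose               # persist for helpers like _save_results
        results = {"score": 8.5}
        if self.save_results:
            self._save_results(results)
        return results

    def _save_results(self, results: dict) -> None:
        with open(self.save_results, "w") as f:
            json.dump(results, f)
        if self.verbose:                     # reliably set; no hasattr() needed
            logger.info("Results saved to %s", self.save_results)
```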
@claude review this pull request and do a detailed analysis, and fix the issues if the existing code doesn't already implement the solution. Make sure it stays backward compatible and no existing features are removed. After making those changes, review the applied changes again. Use @web to search if you don't know something, or to find the latest documentation or latest version. Run the code if you think you need to test it. Start with minimal code changes if any changes are required.
Claude encountered an error (View job).
I'll analyze this and get back to you.
3fa2a13 to 274345d
Cursor Bugbot has reviewed your changes and found 3 potential issues.
```python
'passed': passed and result.success,
'runtime': result.runtime,
'memory_mb': result.memory_mb,
'tokens': result.tokens,
```
Performance test fails when result type is batch
Medium Severity
The _run_performance_test method accesses result.runtime, result.memory_mb, and result.tokens as scalar attributes, but PerformanceEval.run() returns Union[PerformanceResult, PerformanceBatchResult]. When PerformanceBatchResult is returned, these attributes don't exist — it has runtimes, memory_mbs, and tokens as lists instead. While current defaults (1 iteration, 1 query) avoid this, any configuration change or future refactoring would cause AttributeError at runtime.
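One defensive option is to normalize the result before reading metrics (a sketch; the `runtimes`/`memory_mbs` attribute names are taken from this review, and averaging over iterations is an assumed aggregation choice):

```python
from statistics import mean

def summarize_performance(result) -> dict:
    """Normalize a PerformanceEval result into scalar metrics for the suite."""
    if hasattr(result, "runtimes"):            # PerformanceBatchResult: list-valued fields
        return {
            "runtime": mean(result.runtimes) if result.runtimes else 0.0,
            "memory_mb": mean(result.memory_mbs) if result.memory_mbs else None,
            "tokens": sum(result.tokens) if result.tokens else 0,
        }
    return {                                   # PerformanceResult: scalar fields
        "runtime": result.runtime,
        "memory_mb": result.memory_mb,
        "tokens": result.tokens,
    }
```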
```python
# Execute the task
task_result = self.agent.execute(test_input)
if not isinstance(task_result, TaskOutput):
    task_result = TaskOutput(raw=str(task_result))
```
TaskOutput instantiation missing required Pydantic fields
High Severity
When the agent's execute method returns a non-TaskOutput result, the code attempts to wrap it with TaskOutput(raw=str(task_result)). However, TaskOutput is a Pydantic model with required fields description and agent in addition to raw. This instantiation will raise a Pydantic ValidationError at runtime, causing the reliability evaluation to fail for any agent that doesn't return a TaskOutput directly.
```python
return {
    'type': 'accuracy',
    'passed': result.success and result.score >= 7.0,  # Default threshold
    'score': result.score,
```
Accuracy test accesses missing score attribute on batch result
Medium Severity
The _run_accuracy_test method accesses result.score directly, but AccuracyEval.run() returns Union[EvalResult, BatchEvalResult]. BatchEvalResult doesn't have a score attribute — it has avg_score instead. While current defaults return EvalResult, this would cause an AttributeError if the AccuracyEval configuration ever uses iterations > 1.
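The same normalization pattern applies here (a sketch; `avg_score` is the attribute name cited in this review):

```python
def extract_score(result) -> float:
    """Return one accuracy score whether run() produced a single or batch result."""
    if hasattr(result, "score"):   # EvalResult (single iteration)
        return result.score
    return result.avg_score        # BatchEvalResult (iterations > 1)
```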
Actionable comments posted: 8
🧹 Nitpick comments (1)
src/praisonai-agents/praisonaiagents/eval/__init__.py (1)
Lines 15-23: Sort `__all__` to satisfy Ruff RUF022.

Purely stylistic, but keeps lint clean.

🔧 Example sorting

```diff
 __all__ = [
-    'AccuracyEval',
-    'ReliabilityEval',
-    'PerformanceEval',
-    'EvalSuite',
-    'TestCase',
-    'EvalCriteria',
-    'EvalResult'
+    'AccuracyEval',
+    'EvalCriteria',
+    'EvalResult',
+    'EvalSuite',
+    'PerformanceEval',
+    'ReliabilityEval',
+    'TestCase',
 ]
```
```python
eval_test = AccuracyEval(
    agent=agent,
    input="What is the capital of France?",
    expected_output="Paris"
)

print("Running basic accuracy evaluation...")
# Note: In a real scenario, you would run: result = eval_test.run()
# print(f"Accuracy: {result.score}/10")
print("✓ AccuracyEval configured successfully")
```
Use created objects to avoid F841 unused-variable errors.
These example objects are currently unused and trigger lint errors. A minimal fix is to use them in the existing print statements.
🛠️ Suggested fix

```diff
- print("✓ AccuracyEval configured successfully")
+ print(f"✓ AccuracyEval configured successfully for agent: {eval_test.agent.name}")
@@
- print("Advanced accuracy evaluation configured with:")
+ print(f"Advanced accuracy evaluation configured for {eval_test.agent.name}:")
@@
- print("Reliability testing configured for:")
+ print(f"Reliability testing configured for {len(eval_test.test_scenarios)} scenarios:")
@@
- print("Performance evaluation configured with:")
+ print(f"Performance evaluation configured for {len(eval_test.benchmark_queries)} queries:")
@@
- agents = [agent]  # In practice, you'd have multiple agents
- # comparison = PerformanceEval.compare(
- #     agents=agents,
+ # comparison = PerformanceEval.compare(
+ #     agents=[agent],
  #     benchmark_suite="standard",
  #     export_format="html"
  # )
@@
- print("Automated test suite configured with:")
+ print(f"Automated test suite '{suite.name}' configured with {len(suite.test_cases)} tests:")
@@
- print("Integration features planned:")
+ print(f"Integration features planned for agent: {agent.name}")
+ print("Integration features planned:")
```

Also applies to: 56-84, 104-124, 144-167, 172-172, 192-231, 251-285
🧰 Tools
🪛 Ruff (0.14.14)
[error] 32-32: Local variable eval_test is assigned to but never used
Remove assignment to unused variable eval_test
(F841)
🤖 Prompt for AI Agents
In `@src/praisonai-agents/example_eval_usage.py` around lines 32 - 41, The example
creates unused objects like eval_test (AccuracyEval) which trigger F841; update
the example to reference or use these objects in the existing prints (e.g.,
include eval_test and agent in the print output or call a lightweight method
like eval_test.run() or str(eval_test) to show configuration) so the variables
are used; apply the same change pattern to the other unused objects in the file
(the blocks around lines noted in the review) so each created variable (e.g.,
eval_test, any other *_test or created agent objects) is referenced in a print
or benign call to avoid the unused-variable lint error.
````python
def _llm_evaluate_with_criteria(self, actual: str, expected: str, criteria: EvalCriteria) -> float:
    """Use LLM to evaluate output against criteria."""
    try:
        from ..llm import get_openai_client

        client = get_openai_client(self.evaluator_llm)

        evaluation_prompt = f"""
        Evaluate the following response based on these criteria:
        - Factual Accuracy ({criteria.factual_accuracy*100}%): How factually correct is the response?
        - Completeness ({criteria.completeness*100}%): How complete is the response?
        - Relevance ({criteria.relevance*100}%): How relevant is the response to the expected output?

        Expected Output: {expected}
        Actual Output: {actual}

        Rate each criterion from 0-10 and provide the scores in this exact JSON format:
        {{
            "factual_accuracy": <score>,
            "completeness": <score>,
            "relevance": <score>,
            "explanation": "<brief explanation>"
        }}
        """

        response = client.chat.completions.create(
            model=self.evaluator_llm,
            messages=[{"role": "user", "content": evaluation_prompt}],
            temperature=0.1
        )

        # Parse response
        response_text = response.choices[0].message.content.strip()
        if response_text.startswith('```json'):
            response_text = response_text[7:-3]
        elif response_text.startswith('```'):
            response_text = response_text[3:-3]

        eval_scores = json.loads(response_text)

        # Calculate weighted score
        return criteria.calculate_weighted_score(eval_scores)

    except Exception as e:
        logger.error(f"Error in LLM evaluation: {e}")
        # Fallback to simple similarity
        return self._simple_similarity_score(actual, expected)
````
Don't pass the model name into get_openai_client.
get_openai_client expects optional api_key and base_url parameters, so passing self.evaluator_llm (a model name like "gpt-4o-mini") will incorrectly bind it as the api_key argument and break authentication. Use get_openai_client() without arguments to rely on environment variables or defaults, and keep the model name only for the model= parameter in the chat completion call.
🛠️ Suggested fix
```diff
- client = get_openai_client(self.evaluator_llm)
+ client = get_openai_client()
```

🧰 Tools
🪛 Ruff (0.14.14)
[warning] 160-160: Do not catch blind exception: Exception
(BLE001)
[warning] 161-161: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
🤖 Prompt for AI Agents
In `@src/praisonai-agents/praisonaiagents/eval/accuracy_eval.py` around lines 117
- 163, In _llm_evaluate_with_criteria: stop passing self.evaluator_llm into
get_openai_client (which takes api_key/base_url), call get_openai_client() with
no arguments so authentication uses env/defaults, and keep using
self.evaluator_llm only as the model= value in the
client.chat.completions.create call; update the invocation of get_openai_client
in this method and verify client is used unchanged for the chat completion
request.
```python
def __post_init__(self):
    """Validate that weights sum to 1.0."""
    total = self.factual_accuracy + self.completeness + self.relevance
    if abs(total - 1.0) > 0.001:
        raise ValueError(f"Criteria weights must sum to 1.0, got {total}")

@property
def weights(self) -> Dict[str, float]:
    """Get criteria weights as dictionary."""
    return {
        'factual_accuracy': self.factual_accuracy,
        'completeness': self.completeness,
        'relevance': self.relevance
    }

def calculate_weighted_score(self, scores: Dict[str, float]) -> float:
    """Calculate weighted score from individual criteria scores."""
    total_score = 0.0
    for criterion, weight in self.weights.items():
        if criterion in scores:
            total_score += scores[criterion] * weight
    return total_score
```
Validate non‑negative weights in criteria.
Weights like 1.2, -0.1, -0.1 pass the sum check but invert scoring. Add a non‑negative guard.
🛠️ Suggested fix
```diff
 def __post_init__(self):
     """Validate that weights sum to 1.0."""
+    if any(w < 0 for w in (self.factual_accuracy, self.completeness, self.relevance)):
+        raise ValueError("Criteria weights must be non-negative")
     total = self.factual_accuracy + self.completeness + self.relevance
     if abs(total - 1.0) > 0.001:
         raise ValueError(f"Criteria weights must sum to 1.0, got {total}")
```

🧰 Tools
🪛 Ruff (0.14.14)
[warning] 20-20: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In `@src/praisonai-agents/praisonaiagents/eval/eval_criteria.py` around lines 16 -
37, The __post_init__ currently only checks the sum of factual_accuracy,
completeness, and relevance in eval_criteria but does not prevent negative
values; update __post_init__ to validate each weight (factual_accuracy,
completeness, relevance) is >= 0 (and optionally <= 1 if you prefer) and raise a
ValueError with a clear message if any weight is negative; keep the existing sum
check, and ensure calculate_weighted_score and the weights property remain
unchanged so negative weights cannot invert scoring.
```python
def run(self, verbose: bool = False) -> EvalSuiteResult:
    """
    Run the complete evaluation suite.

    Args:
        verbose: Whether to print detailed output

    Returns:
        EvalSuiteResult with comprehensive results
    """
    if verbose:
        print(f"Running evaluation suite: {self.name}")
        print(f"Agents: {len(self.agents)}, Test cases: {len(self.test_cases)}")

    total_tests = 0
    passed_tests = 0
    agent_results = {}

    try:
        for agent in self.agents:
            agent_name = getattr(agent, 'name', f"Agent_{id(agent)}")
            if verbose:
                print(f"\nEvaluating agent: {agent_name}")

            agent_test_results = []

            for test_case in self.test_cases:
                if verbose:
                    print(f"  Running test: {test_case.name}")

                total_tests += 1

                # Run appropriate test type
                if test_case.eval_type == "accuracy":
                    test_result = self._run_accuracy_test(agent, test_case)
                elif test_case.eval_type == "reliability":
                    test_result = self._run_reliability_test(agent, test_case)
                elif test_case.eval_type == "performance":
                    test_result = self._run_performance_test(agent, test_case)
                else:
                    logger.warning(f"Unknown test type: {test_case.eval_type}")
                    test_result = {
                        'type': test_case.eval_type,
                        'passed': False,
                        'error': f"Unknown test type: {test_case.eval_type}"
                    }

                test_result['test_case'] = test_case.to_dict()
                agent_test_results.append(test_result)

                if test_result['passed']:
                    passed_tests += 1

                if verbose:
                    status = "PASS" if test_result['passed'] else "FAIL"
                    print(f"  {status}: {test_case.name}")

            agent_results[agent_name] = agent_test_results

        # Calculate overall results
        failed_tests = total_tests - passed_tests
        success_rate = (passed_tests / total_tests * 100) if total_tests > 0 else 0.0

        suite_result = EvalSuiteResult(
            name=self.name,
            total_tests=total_tests,
            passed_tests=passed_tests,
            failed_tests=failed_tests,
            success_rate=success_rate,
            details={
                'agent_results': agent_results,
                'test_cases': [tc.to_dict() for tc in self.test_cases]
            }
        )

        if verbose:
            print(f"\nSuite Results: {suite_result.summary}")

        # Check alerts
        self._check_alerts(suite_result)

        # Export results
        if self.export_results:
            self._export_results(suite_result)

        return suite_result

    except Exception as e:
        logger.error(f"Error running evaluation suite: {e}")
        return EvalSuiteResult(
            name=self.name,
            total_tests=0,
            passed_tests=0,
            failed_tests=0,
            success_rate=0.0,
            details={'error': str(e)}
        )
```
Prevent duplicate agent names from overwriting results.
agent_results uses agent names as keys, so duplicate names will overwrite earlier entries.
🛠️ Suggested fix
```diff
-            for agent in self.agents:
-                agent_name = getattr(agent, 'name', f"Agent_{id(agent)}")
+            for idx, agent in enumerate(self.agents):
+                agent_name = getattr(agent, 'name', f"Agent_{id(agent)}")
+                agent_key = agent_name if agent_name not in agent_results else f"{agent_name}_{idx}"
@@
-            agent_results[agent_name] = agent_test_results
+            agent_results[agent_key] = agent_test_results
```
🧰 Tools
🪛 Ruff (0.14.14)
[warning] 299-299: Consider moving this statement to an else block
(TRY300)
[warning] 301-301: Do not catch blind exception: Exception
(BLE001)
[warning] 302-302: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
🤖 Prompt for AI Agents
In `@src/praisonai-agents/praisonaiagents/eval/eval_suite.py` around lines 214 -
310, The code currently keys agent_results by agent_name in run(), which allows
later agents with the same name to overwrite earlier results; fix by
disambiguating duplicate agent names before assignment: when computing
agent_name (in run), check if agent_name already exists in agent_results and if
so append a deterministic suffix (e.g., "_1", "_2") or include the object's id
(f"{agent_name}_{id(agent)}") to produce a unique key, then store
agent_test_results under that unique key; ensure any printed/logged agent label
uses the same unique name so outputs and keys remain consistent.
```python
import time
import psutil
import os
import json
import logging
from typing import List, Dict, Any, Optional, Union
from ..agent.agent import Agent
from ..main import TaskOutput
from .eval_result import PerformanceResult, PerformanceBatchResult

logger = logging.getLogger(__name__)

class PerformanceEval:
    """Evaluate agent performance metrics like runtime, memory, and token usage."""

    def __init__(
        self,
        agent: Agent,
        benchmark_queries: Optional[List[str]] = None,
        metrics: Optional[Dict[str, bool]] = None,
        iterations: int = 1,
        warmup: int = 0
    ):
        """
        Initialize performance evaluation.

        Args:
            agent: Agent to evaluate
            benchmark_queries: List of queries to benchmark
            metrics: Dict of metrics to track (runtime, memory, tokens, ttft)
            iterations: Number of iterations to run
            warmup: Number of warmup iterations (not counted in results)
        """
        self.agent = agent
        self.benchmark_queries = benchmark_queries or ["Hello, how are you?"]
        self.metrics = metrics or {
            'runtime': True,
            'memory': True,
            'tokens': True,
            'ttft': True
        }
        self.iterations = iterations
        self.warmup = warmup

    def _get_memory_usage(self) -> float:
        """Get current memory usage in MB."""
        try:
            process = psutil.Process(os.getpid())
            return process.memory_info().rss / 1024 / 1024  # Convert to MB
        except Exception:
            return None
```
Guard psutil or declare it as a dependency.
Unconditional import on line 6 will crash module load if psutil isn't installed, even when memory metrics aren't used. Prefer an optional import or add it to install requirements.
🛠️ Suggested fix (optional import + safe fallback)
```diff
-import psutil
+try:
+    import psutil
+    _psutil_available = True
+except ImportError:
+    psutil = None
+    _psutil_available = False
@@
-    def _get_memory_usage(self) -> float:
+    def _get_memory_usage(self) -> Optional[float]:
         """Get current memory usage in MB."""
+        if not _psutil_available:
+            return None
         try:
             process = psutil.Process(os.getpid())
             return process.memory_info().rss / 1024 / 1024  # Convert to MB
         except Exception:
             return None
```

🧰 Tools
🪛 Ruff (0.14.14)
[warning] 54-54: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In `@src/praisonai-agents/praisonaiagents/eval/performance_eval.py` around lines 5
- 55, The module currently unconditionally imports psutil which will crash
import if it's not installed; update the code so psutil is imported optionally
(wrap the import in try/except ImportError at module scope), set a fallback
(e.g., set a module-level flag like _PSUTIL_AVAILABLE = False) and ensure
PerformanceEval._get_memory_usage checks that flag and returns None (or disables
the 'memory' metric in self.metrics) when psutil is unavailable; alternatively,
if you prefer requiring psutil, add it to install requirements—make the change
around the top-level import and in the PerformanceEval.__init__ /
_get_memory_usage logic so memory metrics are guarded.
```python
@property
def total_scenarios(self) -> int:
    """Total number of scenarios."""
    return len(self.scenarios)

@property
def passed_scenarios(self) -> int:
    """Number of passed scenarios."""
    return len([s for s in self.scenarios if s.status == "passed"])

@property
def failed_scenarios(self) -> int:
    """Number of failed scenarios."""
    return len([s for s in self.scenarios if s.status == "failed"])

@property
def success_rate(self) -> float:
    """Success rate as percentage."""
    if self.total_scenarios == 0:
        return 100.0
    return (self.passed_scenarios / self.total_scenarios) * 100.0
```
Avoid reporting 100% success when there are zero scenarios.
Returning 100.0 with no data can mislead reports and gates.
🩹 Suggested fix
```diff
 def success_rate(self) -> float:
     """Success rate as percentage."""
     if self.total_scenarios == 0:
-        return 100.0
+        return 0.0
     return (self.passed_scenarios / self.total_scenarios) * 100.0
```

🤖 Prompt for AI Agents
In `@src/praisonai-agents/praisonaiagents/eval/reliability_eval.py` around lines
32 - 52, The success_rate property currently returns 100.0 when total_scenarios
== 0 which is misleading; update the success_rate getter to return 0.0 (or
another explicit neutral value) when self.total_scenarios == 0 instead of 100.0
so empty datasets don't appear fully successful; modify the success_rate
property implementation that references total_scenarios/passed_scenarios to
check for zero and return 0.0 before performing the division.
```python
scenario_name = scenario.get('name', f"Scenario {scenario.get('input', '')[:20]}")
test_input = scenario.get('input', '')
expected_tools = scenario.get('expected_tools', [])
required_order = scenario.get('required_order', False)
allow_additional = scenario.get('allow_additional', False)

try:
    # Execute the task
    task_result = self.agent.execute(test_input)
    if not isinstance(task_result, TaskOutput):
        task_result = TaskOutput(raw=str(task_result))

    # Extract actual tool calls
    actual_tools = self._extract_tool_calls(task_result)

    # Evaluate tool usage
    failed_tools = []
    unexpected_tools = []

    # Check for missing expected tools
    if required_order:
        # Check order and presence
        expected_set = set(expected_tools)
        actual_set = set(actual_tools)
        missing_tools = expected_set - actual_set
        failed_tools.extend(list(missing_tools))

        # Check order for tools that are present
        common_tools = [t for t in expected_tools if t in actual_tools]
        actual_order = [t for t in actual_tools if t in common_tools]

        if common_tools != actual_order[:len(common_tools)]:
            # Order mismatch
            failed_tools.append("tool_order_mismatch")
    else:
        # Just check presence
        missing_tools = set(expected_tools) - set(actual_tools)
        failed_tools.extend(list(missing_tools))

    # Check for unexpected tools
    if not allow_additional:
        extra_tools = set(actual_tools) - set(expected_tools)
        unexpected_tools.extend(list(extra_tools))

    # Determine status
    status = "passed" if not failed_tools and not unexpected_tools else "failed"

    details = {
        'input': test_input,
        'expected_tools': expected_tools,
        'actual_tools': actual_tools,
        'required_order': required_order,
        'allow_additional': allow_additional,
        'task_output': task_result.raw if hasattr(task_result, 'raw') else str(task_result)
    }

    return ReliabilityScenario(
        name=scenario_name,
        status=status,
        failed_tools=failed_tools,
        unexpected_tools=unexpected_tools,
        details=details
    )
```
Fix TaskOutput construction with all required fields to prevent validation errors.
Line 169 constructs TaskOutput(raw=str(task_result)) but the class requires three fields: description: str, raw: str, and agent: str. This will raise a Pydantic ValidationError whenever a non-TaskOutput result is returned. Also guard expected_tools against None or non-list values.
🛠️ Suggested fix
```diff
-    expected_tools = scenario.get('expected_tools', [])
+    expected_tools = scenario.get('expected_tools') or []
+    if not isinstance(expected_tools, list):
+        expected_tools = [str(expected_tools)]
     required_order = scenario.get('required_order', False)
     allow_additional = scenario.get('allow_additional', False)

     try:
         # Execute the task
         task_result = self.agent.execute(test_input)
-        if not isinstance(task_result, TaskOutput):
-            task_result = TaskOutput(raw=str(task_result))
+        if isinstance(task_result, TaskOutput):
+            task_output = task_result
+        else:
+            task_output = TaskOutput(
+                description="reliability_eval",
+                raw=str(task_result),
+                agent=getattr(self.agent, "name", "unknown"),
+                output_format="RAW",
+            )

         # Extract actual tool calls
-        actual_tools = self._extract_tool_calls(task_result)
+        actual_tools = self._extract_tool_calls(task_output)
@@
-            'task_output': task_result.raw if hasattr(task_result, 'raw') else str(task_result)
+            'task_output': task_output.raw if hasattr(task_output, 'raw') else str(task_output)
         }
```

🤖 Prompt for AI Agents
In `@src/praisonai-agents/praisonaiagents/eval/reliability_eval.py` around lines
159 - 221, The code currently builds TaskOutput using
TaskOutput(raw=str(task_result)) in the execute block which will raise
validation errors because TaskOutput requires description, raw, and agent; also
expected_tools may be None or not a list. Update the agent.execute handling in
the method (where task_result is set) to: if the returned value is not an
instance of TaskOutput, wrap it in a TaskOutput providing sensible default
values for all required fields (e.g., description as an empty string or short
summary, raw=str(task_result), and agent as the agent's identifier), and
normalize expected_tools right after reading it (ensure expected_tools is a
list, defaulting to [] if None or not iterable) so downstream set/list
operations and _extract_tool_calls(...) work safely when evaluating and
constructing the ReliabilityScenario.
```python
eval_test = AccuracyEval(
    agent=agent,
    input="What is the capital of France?",
    expected_output="Paris"
)
print("✅ AccuracyEval created successfully")
return True
```
Avoid F841 unused-variable errors for created evaluators.
Use the created objects in return statements so lint doesn’t fail.
🛠️ Suggested fix
```diff
     eval_test = AccuracyEval(
         agent=agent,
         input="What is the capital of France?",
         expected_output="Paris"
     )
     print("✅ AccuracyEval created successfully")
-    return True
+    return eval_test is not None
@@
     eval_test = ReliabilityEval(
         agent=agent,
         test_scenarios=test_scenarios
     )
     print("✅ ReliabilityEval created successfully")
-    return True
+    return eval_test is not None
@@
     eval_test = PerformanceEval(
         agent=agent,
         benchmark_queries=["Hello, how are you?"],
         metrics={"runtime": True, "memory": True}
     )
     print("✅ PerformanceEval created successfully")
-    return True
+    return eval_test is not None
@@
     suite = EvalSuite(
         name="Test Suite",
         agents=[agent],
         test_cases=test_cases
     )
     print("✅ EvalSuite created successfully")
-    return True
+    return suite is not None
@@
     criteria = EvalCriteria(
         factual_accuracy=0.5,
         completeness=0.3,
         relevance=0.2
     )
     print("✅ EvalCriteria created successfully")
-    return True
+    return criteria is not None
```

Also applies to: 58-63, 71-77, 100-106, 114-120
🧰 Tools
🪛 Ruff (0.14.14)
[error] 38-38: Local variable eval_test is assigned to but never used
Remove assignment to unused variable eval_test
(F841)
[warning] 44-44: Consider moving this statement to an else block
(TRY300)
🤖 Prompt for AI Agents
In `@src/praisonai-agents/test_eval_framework.py` around lines 38 - 44, The
created evaluator instance (e.g., eval_test created by AccuracyEval) is never
used and triggers unused-variable lint errors; replace the placeholder return
True with returning the created object (return eval_test) and do the same for
the other evaluator variables in the file (for the blocks at 58-63, 71-77,
100-106, 114-120) — either return each created evaluator or collect them into a
list/tuple and return that so the objects are referenced and the F841 warnings
are resolved.
Implements comprehensive evaluation framework as requested in issue #967
Closes #967
Generated with Claude Code
Note

Medium Risk: New evaluation code executes agents and can invoke LLM calls and psutil-based memory tracking, which may affect runtime, dependencies, and network usage when used. Changes are additive and lazily imported, reducing impact on existing consumers.

Overview

Introduces a new `praisonaiagents.eval` module providing `AccuracyEval` (string-similarity scoring plus optional LLM-based multi-criteria judging with iterations and result export), `ReliabilityEval` (validates expected tool calls/order from `TaskOutput`), and `PerformanceEval` (benchmarks runtime/memory and optionally token usage with warmups, batch stats, and cross-agent comparison).

Adds `EvalSuite`/`TestCase` orchestration for running mixed evaluation types across agents with basic pass/fail thresholds, alert/export hooks, and report generation, plus shared result dataclasses (`EvalResult`, batch/performance/reliability result types) and an `EvalCriteria` weight helper.

Exports the evaluation API via `praisonaiagents.__init__` using lazy loading for backward compatibility, and includes an `example_eval_usage.py` walkthrough and a `test_eval_framework.py` smoke-test script.

Written by Cursor Bugbot for commit 274345d.
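A sketch of the lazy-loading export pattern described above, using module-level `__getattr__` (PEP 562); the actual wiring in `praisonaiagents/__init__.py` may differ:

```python
# praisonaiagents/__init__.py (illustrative excerpt)
import importlib

_EVAL_EXPORTS = {
    "AccuracyEval", "ReliabilityEval", "PerformanceEval",
    "EvalSuite", "TestCase", "EvalCriteria", "EvalResult",
}

def __getattr__(name):
    # Import the eval subpackage only when one of its symbols is first requested,
    # so existing users who never touch evaluation pay no extra import cost.
    if name in _EVAL_EXPORTS:
        return getattr(importlib.import_module(".eval", __name__), name)
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```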