Conversation

@hassiebp hassiebp commented Sep 11, 2025

Important

Adds an experiment runner to the Langfuse Python SDK for evaluating AI tasks with automatic tracing and scoring, including async execution support and extensive test coverage.

  • Behavior:
    • Adds a run_experiment method to client.py for running experiments with automatic tracing and evaluation (a hypothetical usage sketch follows this list).
    • Supports concurrent task execution with semaphore-based concurrency control.
    • Handles errors gracefully, isolating individual failures.
  • Modules:
    • New experiments.py module with type definitions for TaskFunction, EvaluatorFunction, and RunEvaluatorFunction.
    • types.py updated to expose experiment types as a public API.
    • datasets.py updated to integrate dataset objects with experiments.
  • Utilities:
    • Adds run_async_safely function in utils.py to handle async execution safely.
  • Tests:
    • Extensive tests added in test_experiments.py to validate experiment functionality.
    • Tests for utility functions in test_utils.py to ensure async handling works correctly.
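
A hypothetical usage sketch of the runner described above. Parameter names (name, data, task, evaluators) and the dict-shaped evaluator return are assumptions for illustration and may not match the merged API; per the review discussion below, the SDK expects Evaluation objects rather than plain dicts.

from langfuse import Langfuse

langfuse = Langfuse()

# Task: maps a single dataset item to an output.
def my_task(*, item, **kwargs):
    return f"echo: {item['input']}"

# Item-level evaluator: scores one task output.
def exact_match(*, input, output, expected_output, **kwargs):
    return {"name": "exact_match", "value": float(output == expected_output)}

# Illustrative invocation only; argument names are assumptions.
result = langfuse.run_experiment(
    name="echo-experiment",
    data=[{"input": "hello", "expected_output": "echo: hello"}],
    task=my_task,
    evaluators=[exact_match],
)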

This description was created by Ellipsis for 9e7cac6.


Disclaimer: Experimental PR review

Greptile Summary

Updated On: 2025-09-11 16:02:25 UTC

This review covers only the changes made since the last review, not the entire PR. The previous review identified two critical issues: a missing statistics import and a type inconsistency with an evaluator returning a plain dict instead of an Evaluation object. These issues remain unaddressed in the current version.

The experiment runner framework introduces substantial new functionality to the Langfuse Python SDK, enabling systematic evaluation of AI tasks on datasets with automatic tracing and scoring. The core implementation centers around a new run_experiment method in the main Langfuse client that handles concurrent execution of tasks and evaluators while creating proper traces and persisting results.

The new langfuse/_client/experiments.py module establishes the foundational type system with Protocol definitions for TaskFunction, EvaluatorFunction, and RunEvaluatorFunction, along with comprehensive TypedDict structures for data handling. The langfuse/types.py file transforms from a private module to a public API, exposing experiment types through a controlled export interface. The dataset integration in langfuse/_client/datasets.py provides a convenience method for running experiments directly on dataset objects.
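
As a rough illustration of what such Protocol-based definitions can look like (the field and parameter names below are assumptions based on this thread, not the module's actual code; only name, value, and data_type on Evaluation are referenced in the review comments further down):

from dataclasses import dataclass
from typing import Any, List, Optional, Protocol

@dataclass
class Evaluation:
    # name, value, and data_type are the attributes referenced in the review
    # comments below; defaults and types here are illustrative.
    name: str
    value: Optional[float]
    data_type: Optional[str] = None  # e.g. "NUMERIC", "BOOLEAN", "CATEGORICAL"

class TaskFunction(Protocol):
    def __call__(self, *, item: Any, **kwargs: Any) -> Any: ...

class EvaluatorFunction(Protocol):
    def __call__(self, *, input: Any, output: Any, **kwargs: Any) -> Evaluation: ...

class RunEvaluatorFunction(Protocol):
    def __call__(self, *, item_results: List[Any], **kwargs: Any) -> Evaluation: ...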

The implementation includes semaphore-based concurrency control, comprehensive error handling that isolates individual failures, and extensive test coverage validating both local and Langfuse-hosted dataset scenarios. The async architecture efficiently handles large datasets while respecting API rate limits, and automatic score persistence integrates seamlessly with the Langfuse platform.
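
The semaphore pattern itself is standard asyncio; a generic sketch of semaphore-gated fan-out with per-item error isolation (not the SDK's actual internals):

import asyncio
from typing import Any, Awaitable, Callable, Iterable, List

async def run_with_concurrency(
    items: Iterable[Any],
    worker: Callable[[Any], Awaitable[Any]],
    max_concurrency: int = 10,
) -> List[Any]:
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item: Any) -> Any:
        async with semaphore:
            try:
                return await worker(item)
            except Exception as exc:
                # Isolate individual failures instead of aborting the whole run
                return exc

    return await asyncio.gather(*(guarded(item) for item in items))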

Confidence score: 3/5

  • This PR has significant implementation issues that prevent safe merging, including missing imports that will cause runtime errors
  • Score lowered due to unresolved critical issues from previous review: missing statistics import and type inconsistency in evaluator returns
  • Pay close attention to the import statements in client.py and experiments.py, and the evaluator return types in test_experiments.py

Context used:

Rule - Open a GitHub issue or discussion first before submitting PRs to explain the rationale and necessity of the proposed changes, as required by the contributing guide. (link)

@greptile-apps greptile-apps bot left a comment

Reviewing changes made in this pull request

@hassiebp hassiebp merged commit a0ddc00 into main Sep 17, 2025
11 checks passed
@hassiebp hassiebp deleted the add-experiments branch September 17, 2025 09:46
@hassiebp hassiebp restored the add-experiments branch September 17, 2025 09:46

@bdepebhe bdepebhe left a comment

After using the new experiment runner, I had issues with the processing of the Evaluation objects


# Store evaluations as scores
for evaluation in eval_results:
    self.create_score(

The data_type attribute of Evaluation is unused for item-level evaluators, despite being showcased in the docstring of Evaluation with an example of an item-level CATEGORICAL evaluator. This is misleading for users.
For example, when I tested with a BOOLEAN evaluator, I expected True or False but got 1 and -1.

self.create_score(
    trace_id=trace_id,
    name=evaluation.name,
    value=evaluation.value or -1,

This conversion to -1 is problematic for numerical scores: 0.0 can be a legitimate value for a score (for example, a Pearson correlation coefficient of 0.0 means no correlation, while -1 means a perfect negative correlation!).
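
For context, the problem is Python truthiness rather than anything Langfuse-specific: "value or -1" falls back whenever value is falsy, which includes 0.0 and False as well as None. A minimal illustration, with an explicit None check as one possible fix (not the SDK's code):

value = 0.0                                  # e.g. a Pearson correlation of exactly zero
fallback = value or -1                       # -> -1: the legitimate 0.0 score is rewritten

# An explicit None check preserves 0.0 (and False) while still guarding
# against a missing value:
safe = value if value is not None else -1    # -> 0.0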

@hassiebp (Contributor, Author) replied:

Thanks for raising your comments; let's continue the discussion here: langfuse/langfuse#9290
