feat(experiments): add experiment runner #1334
Conversation
bdepebhe left a comment:
After using the new experiment runner, I ran into issues with how `Evaluation` objects are processed.
# Store evaluations as scores
for evaluation in eval_results:
    self.create_score(
The `data_type` attribute of `Evaluation` is unused for item-level evaluators, even though the docstring of `Evaluation` showcases it with an example of an item-level CATEGORICAL evaluator. This is misleading for users.
For example, when I tested with a BOOLEAN evaluator, I expected True or False and instead got 1 and -1.
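For illustration, a minimal sketch of the behaviour this comment expects, assuming `Evaluation` can be constructed with `name`, `value`, and `data_type` fields; the evaluator name and the import path are hypothetical, not taken from the PR:

```python
from langfuse.types import Evaluation  # assumed import path for the experiment types

# Hypothetical item-level evaluator declaring a BOOLEAN score.
def exact_match(*, output, expected_output, **kwargs) -> Evaluation:
    return Evaluation(
        name="exact_match",
        value=output == expected_output,  # True / False is what the reviewer expects to be persisted
        data_type="BOOLEAN",              # currently ignored when the runner stores the score
    )
```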
self.create_score(
    trace_id=trace_id,
    name=evaluation.name,
    value=evaluation.value or -1,
This conversion to -1 is problematic for numerical scores: 0.0 can be a legitimate value for a score. For example, a Pearson correlation coefficient of 0.0 means no correlation, whereas -1 means a perfect negative correlation!
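For illustration, the root cause is Python truthiness: `0.0 or -1` evaluates to -1. A None check avoids silently rewriting legitimate zero scores (the -1 fallback is kept here only to mirror the current code):

```python
evaluation_value = 0.0  # legitimate score, e.g. a Pearson correlation of 0 (no correlation)

broken = evaluation_value or -1                                # -> -1, because 0.0 is falsy
fixed = -1 if evaluation_value is None else evaluation_value   # -> 0.0, only None falls back

assert broken == -1
assert fixed == 0.0
```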
Thanks for raising your comments; let's continue the discussion here: langfuse/langfuse#9290
Important
Adds an experiment runner to the Langfuse Python SDK for evaluating AI tasks with automatic tracing and scoring, including async execution support and extensive test coverage.
- `run_experiment` method added to `client.py` for running experiments with automatic tracing and evaluation.
- `experiments.py` module with type definitions for `TaskFunction`, `EvaluatorFunction`, and `RunEvaluatorFunction`.
- `types.py` updated to expose experiment types as a public API.
- `datasets.py` updated to integrate dataset objects with experiments.
- `run_async_safely` function in `utils.py` to handle async execution safely.
- `test_experiments.py` to validate experiment functionality.
- `test_utils.py` to ensure async handling works correctly.
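For illustration only, a minimal usage sketch of the new runner. The parameter names (`name`, `data`, `task`, `evaluators`), the keyword signatures of the task and evaluator, and the `Evaluation` import path are assumptions inferred from the type names above, not confirmed by this thread:

```python
from langfuse import Langfuse
from langfuse.types import Evaluation  # assumed import; the PR exposes experiment types via types.py

langfuse = Langfuse()

# Hypothetical task function: maps one dataset item to a model output.
def my_task(*, item, **kwargs):
    return f"echo: {item['input']}"

# Hypothetical item-level evaluator: scores a single task output.
def exact_match(*, input, output, expected_output, **kwargs):
    return Evaluation(name="exact_match", value=output == expected_output)

result = langfuse.run_experiment(
    name="demo-experiment",
    data=[{"input": "hello", "expected_output": "echo: hello"}],
    task=my_task,
    evaluators=[exact_match],
)
```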
Disclaimer: Experimental PR review
Greptile Summary
Updated On: 2025-09-11 16:02:25 UTC
This review covers only the changes made since the last review, not the entire PR. The previous review identified two critical issues: a missing `statistics` import and a type inconsistency with an evaluator returning a plain dict instead of an `Evaluation` object. These issues remain unaddressed in the current version.

The experiment runner framework introduces substantial new functionality to the Langfuse Python SDK, enabling systematic evaluation of AI tasks on datasets with automatic tracing and scoring. The core implementation centers around a new `run_experiment` method in the main Langfuse client that handles concurrent execution of tasks and evaluators while creating proper traces and persisting results.

The new `langfuse/_client/experiments.py` module establishes the foundational type system with Protocol definitions for `TaskFunction`, `EvaluatorFunction`, and `RunEvaluatorFunction`, along with comprehensive TypedDict structures for data handling. The `langfuse/types.py` file transforms from a private module to a public API, exposing experiment types through a controlled export interface. The dataset integration in `langfuse/_client/datasets.py` provides a convenience method for running experiments directly on dataset objects.

The implementation includes semaphore-based concurrency control, comprehensive error handling that isolates individual failures, and extensive test coverage validating both local and Langfuse-hosted dataset scenarios. The async architecture efficiently handles large datasets while respecting API rate limits, and automatic score persistence integrates seamlessly with the Langfuse platform.
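As an illustration of the semaphore-based concurrency control described above, here is a generic asyncio sketch; the helper name, the concurrency limit, and the error-isolation behaviour are illustrative, not the PR's actual code:

```python
import asyncio

async def run_items_with_limit(items, run_one, max_concurrency: int = 10):
    """Run `run_one` over all items with a bounded number of in-flight tasks."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(item):
        async with semaphore:
            try:
                return await run_one(item)
            except Exception as exc:
                # Isolate individual failures so one bad item does not abort the whole run.
                return exc

    return await asyncio.gather(*(bounded(item) for item in items))
```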
Confidence score: 3/5
`client.py` and `experiments.py`, and the evaluator return types in `test_experiments.py`
Context used:
Rule - Open a GitHub issue or discussion first before submitting PRs to explain the rationale and necessity of the proposed changes, as required by the contributing guide. (link)