feat(experiments): add experiment runner #1334
Conversation
bdepebhe left a comment:
After using the new experiment runner, I ran into issues with how `Evaluation` objects are processed.
# Store evaluations as scores
for evaluation in eval_results:
    self.create_score(
The `data_type` attribute of `Evaluation` is unused for item-level evaluators, even though the docstring of `Evaluation` showcases it with an example of an item-level CATEGORICAL evaluator. This is misleading for users.
For example, when I tested with a BOOLEAN evaluator, I expected True or False and instead got 1 and -1.
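For illustration, a minimal sketch of the behaviour this comment expects, assuming `Evaluation` can be constructed with `name`, `value`, and `data_type` fields; the evaluator name and the import path are hypothetical, not taken from the PR:

```python
from langfuse.types import Evaluation  # assumed import path for the experiment types

# Hypothetical item-level evaluator declaring a BOOLEAN score.
def exact_match(*, output, expected_output, **kwargs) -> Evaluation:
    return Evaluation(
        name="exact_match",
        value=output == expected_output,  # True / False is what the reviewer expects to be persisted
        data_type="BOOLEAN",              # currently ignored when the runner stores the score
    )
```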
self.create_score(
    trace_id=trace_id,
    name=evaluation.name,
    value=evaluation.value or -1,
This conversion to -1 is problematic for numerical scores: 0.0 can be a legitimate value for a score. For example, a Pearson correlation coefficient of 0.0 means no correlation, whereas -1 means a perfect negative correlation!
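For illustration, the root cause is Python truthiness: `0.0 or -1` evaluates to -1. A None check avoids silently rewriting legitimate zero scores (the -1 fallback is kept here only to mirror the current code):

```python
evaluation_value = 0.0  # legitimate score, e.g. a Pearson correlation of 0 (no correlation)

broken = evaluation_value or -1                                # -> -1, because 0.0 is falsy
fixed = -1 if evaluation_value is None else evaluation_value   # -> 0.0, only None falls back

assert broken == -1
assert fixed == 0.0
```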
Thanks for raising your comments; let's continue the discussion here: langfuse/langfuse#9290
Important
Adds an experiment runner to the Langfuse Python SDK for evaluating AI tasks with automatic tracing and scoring, including async execution support and extensive test coverage.
- `run_experiment` method added to `client.py` for running experiments with automatic tracing and evaluation.
- `experiments.py` module with type definitions for `TaskFunction`, `EvaluatorFunction`, and `RunEvaluatorFunction`.
- `types.py` updated to expose experiment types as a public API.
- `datasets.py` updated to integrate dataset objects with experiments.
- `run_async_safely` function in `utils.py` to handle async execution safely.
- `test_experiments.py` to validate experiment functionality.
- `test_utils.py` to ensure async handling works correctly.
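For illustration only, a minimal usage sketch of the new runner. The parameter names (`name`, `data`, `task`, `evaluators`), the keyword signatures of the task and evaluator, and the `Evaluation` import path are assumptions inferred from the type names above, not confirmed by this thread:

```python
from langfuse import Langfuse
from langfuse.types import Evaluation  # assumed import; the PR exposes experiment types via types.py

langfuse = Langfuse()

# Hypothetical task function: maps one dataset item to a model output.
def my_task(*, item, **kwargs):
    return f"echo: {item['input']}"

# Hypothetical item-level evaluator: scores a single task output.
def exact_match(*, input, output, expected_output, **kwargs):
    return Evaluation(name="exact_match", value=output == expected_output)

result = langfuse.run_experiment(
    name="demo-experiment",
    data=[{"input": "hello", "expected_output": "echo: hello"}],
    task=my_task,
    evaluators=[exact_match],
)
```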
Disclaimer: Experimental PR review
Greptile Summary
Updated On: 2025-09-11 16:02:25 UTC
This review covers only the changes made since the last review, not the entire PR. The previous review identified two critical issues: a missing `statistics` import and a type inconsistency with an evaluator returning a plain dict instead of an `Evaluation` object. These issues remain unaddressed in the current version.

The experiment runner framework introduces substantial new functionality to the Langfuse Python SDK, enabling systematic evaluation of AI tasks on datasets with automatic tracing and scoring. The core implementation centers around a new `run_experiment` method in the main Langfuse client that handles concurrent execution of tasks and evaluators while creating proper traces and persisting results.

The new `langfuse/_client/experiments.py` module establishes the foundational type system with Protocol definitions for `TaskFunction`, `EvaluatorFunction`, and `RunEvaluatorFunction`, along with comprehensive TypedDict structures for data handling. The `langfuse/types.py` file transforms from a private module to a public API, exposing experiment types through a controlled export interface. The dataset integration in `langfuse/_client/datasets.py` provides a convenience method for running experiments directly on dataset objects.

The implementation includes semaphore-based concurrency control, comprehensive error handling that isolates individual failures, and extensive test coverage validating both local and Langfuse-hosted dataset scenarios. The async architecture efficiently handles large datasets while respecting API rate limits, and automatic score persistence integrates seamlessly with the Langfuse platform.
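As an illustration of the semaphore-based concurrency control described above, here is a generic asyncio sketch; the helper name, the concurrency limit, and the error-isolation behaviour are illustrative, not the PR's actual code:

```python
import asyncio

async def run_items_with_limit(items, run_one, max_concurrency: int = 10):
    """Run `run_one` over all items with a bounded number of in-flight tasks."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(item):
        async with semaphore:
            try:
                return await run_one(item)
            except Exception as exc:
                # Isolate individual failures so one bad item does not abort the whole run.
                return exc

    return await asyncio.gather(*(bounded(item) for item in items))
```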
Confidence score: 3/5
`client.py` and `experiments.py`, and the evaluator return types in `test_experiments.py`
Context used:
Rule - Open a GitHub issue or discussion first before submitting PRs to explain the rationale and necessity of the proposed changes, as required by the contributing guide. (link)