Conversation

@hassiebp hassiebp commented Oct 2, 2025

Important

Enhance score creation by passing config_id and data_type in client.py and add tests for BOOLEAN score types in test_experiments.py.

  • Behavior:
    • Update process_item and _process_experiment_item in client.py to pass config_id and data_type to create_score.
    • Ensure create_score handles config_id and data_type correctly.
  • Testing:
    • Add test_boolean_score_types in test_experiments.py to verify BOOLEAN score types are ingested and persisted correctly.
    • Validate BOOLEAN evaluations and run evaluations in the test.
  • Misc:
    • Remove the default value of -1 for value in create_score calls in client.py (see the sketch below for why that default was unsafe).
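
The -1 default matters because of Python truthiness: the linked issue reports that numeric scores with a value of 0 were stored as -1. A minimal, hypothetical sketch of that pitfall (not the actual client.py code) follows:

```python
# Hypothetical illustration of why a -1 fallback is unsafe for numeric scores:
# Python's `or` treats 0 (and 0.0, False) as falsy.

def buggy_score_value(value):
    return value or -1           # a legitimate score of 0 becomes -1

def fixed_score_value(value):
    return value                 # pass the value through unchanged, no default

assert buggy_score_value(0) == -1    # the behavior reported in the linked issue
assert fixed_score_value(0) == 0     # 0 is preserved once the default is removed
```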

This description was created by Ellipsis for ff21f15.

Disclaimer: Experimental PR review

Greptile Overview

Updated On: 2025-10-02 08:26:27 UTC

Summary

This PR fixes a bug in the experiment evaluation system where complete evaluation metadata was being lost during score creation. The changes ensure that when evaluators return `Evaluation` objects with `config_id` and `data_type` fields, these fields are properly passed through to the score creation API calls.

The core issue was in the experiment processing pipeline within langfuse/_client/client.py. Previously, when creating scores from evaluation results, the code was only passing basic score fields but omitting critical metadata like config_id and data_type. This meant that evaluations with explicit data type specifications (like ScoreDataType.BOOLEAN) would lose their type information during persistence.

The fix adds the missing fields to both the async and sync experiment processing paths:

  • In the async path (line 2733): Added config_id=evaluation.config_id
  • In the sync path (lines 2860-2864): Added both config_id=evaluation.config_id and data_type=evaluation.data_type (see the sketch below)
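
A minimal sketch of the forwarding logic, assuming a standalone helper shape rather than the exact code inside _process_experiment_item (the trace_id parameter and the function wrapper are assumptions; the forwarded fields come from the PR summary and the sequence diagram below):

```python
from typing import Any, Iterable

def forward_evaluations_as_scores(
    langfuse_client: Any, trace_id: str, evaluations: Iterable[Any]
) -> None:
    """Hedged sketch: forward each Evaluation's full metadata to create_score."""
    for evaluation in evaluations:
        langfuse_client.create_score(
            name=evaluation.name,
            value=evaluation.value,
            trace_id=trace_id,                  # assumed: ties the score to the item's trace
            comment=evaluation.comment,
            metadata=evaluation.metadata,
            data_type=evaluation.data_type,     # previously dropped, now forwarded
            config_id=evaluation.config_id,     # previously dropped, now forwarded
        )
```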

To validate the fix, a comprehensive test was added in tests/test_experiments.py that specifically exercises boolean score types. The test creates evaluators that return boolean values with explicit ScoreDataType.BOOLEAN annotations, runs an experiment with mixed pass/fail results, and verifies that the resulting boolean scores are persisted through the API with the correct data type.
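
A hedged sketch of what such a test can look like; the import paths, dataset shape, task signature, and assertion step are assumptions rather than the actual tests/test_experiments.py code:

```python
from langfuse import Langfuse, Evaluation      # assumed import locations
from langfuse.api import ScoreDataType         # assumed import location

langfuse = Langfuse()  # credentials read from the environment

def exact_match(*, input, output, expected_output, **kwargs):
    # Returns a BOOLEAN-typed evaluation, so data_type must survive ingestion.
    return Evaluation(
        name="exact_match",
        value=(output == expected_output),
        data_type=ScoreDataType.BOOLEAN,
    )

result = langfuse.run_experiment(
    name="boolean-score-types",
    data=[
        {"input": "2 + 2", "expected_output": "4"},  # should evaluate to True
        {"input": "2 + 2", "expected_output": "5"},  # should evaluate to False
    ],
    task=lambda *, item, **kwargs: "4",              # assumed task signature
    evaluators=[exact_match],
)

# The real test then fetches the created scores through the API and asserts
# that each score has data type BOOLEAN and the expected True/False value.
```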

This change integrates well with the existing Langfuse evaluation framework, which uses Pydantic models like BooleanScore to represent typed scores. The fix ensures that the experiment system properly respects the score type hierarchy and maintains data integrity throughout the evaluation pipeline.
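
The Evaluation fields referenced throughout this PR (name, value, comment, metadata, data_type, config_id) can be pictured roughly as the following Pydantic model; this is an illustrative shape inferred from the summary and diagram, not the library's actual definition:

```python
from typing import Optional, Union
from pydantic import BaseModel

class EvaluationSketch(BaseModel):
    """Illustrative shape only; the real Evaluation type lives in the langfuse SDK."""
    name: str
    value: Union[bool, float, str, None] = None
    comment: Optional[str] = None
    metadata: Optional[dict] = None
    data_type: Optional[str] = None   # e.g. "BOOLEAN", "NUMERIC", "CATEGORICAL"
    config_id: Optional[str] = None   # links the score to a score config
```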

Important Files Changed

| Filename | Score | Overview |
| --- | --- | --- |
| tests/test_experiments.py | 5/5 | Added a comprehensive test for BOOLEAN score types to validate the experiment evaluation pipeline |
| langfuse/_client/client.py | 4/5 | Fixed experiment score creation by adding the missing config_id and data_type fields to API calls |

Confidence score: 4/5

  • This PR is safe to merge as it fixes a clear bug with minimal risk of introducing new issues
  • Score reflects a targeted bug fix with appropriate test coverage, though the client.py changes could benefit from more comprehensive testing
  • Pay close attention to langfuse/_client/client.py to ensure the fix doesn't introduce any regressions in the experiment processing pipeline

Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant Langfuse as "Langfuse Client"
    participant Experiment as "Experiment Runner"
    participant Task as "Task Function"
    participant Evaluator as "Evaluator Function"
    participant ScoreCreation as "Score Creation"
    participant API as "Langfuse API"

    User->>Langfuse: "run_experiment(name, data, task, evaluators)"
    Langfuse->>Experiment: "_run_experiment_async()"
    
    loop For each experiment item
        Experiment->>Experiment: "_process_experiment_item()"
        
        Experiment->>Task: "await _run_task(task, item)"
        Task-->>Experiment: "task_output"
        
        loop For each evaluator
            Experiment->>Evaluator: "await _run_evaluator(evaluator, input, output, expected_output)"
            Evaluator-->>Experiment: "List[Evaluation]"
            
            loop For each evaluation result
                Note over Experiment,ScoreCreation: FIX: Pass full evaluation object
                Experiment->>ScoreCreation: "create_score(evaluation.name, evaluation.value, ...)"
                Note over ScoreCreation: evaluation.comment, evaluation.metadata,<br/>evaluation.data_type, evaluation.config_id
                ScoreCreation->>API: "score creation request"
                API-->>ScoreCreation: "score created"
            end
        end
    end
    
    loop For each run evaluator
        Experiment->>Evaluator: "await _run_evaluator(run_evaluator, item_results)"
        Evaluator-->>Experiment: "List[Evaluation]"
        
        loop For each run evaluation
            Note over Experiment,ScoreCreation: FIX: Pass full evaluation object for run-level scores
            Experiment->>ScoreCreation: "create_score(dataset_run_id, evaluation.name, evaluation.value, ...)"
            Note over ScoreCreation: evaluation.comment, evaluation.metadata,<br/>evaluation.data_type, evaluation.config_id
            ScoreCreation->>API: "score creation request"  
            API-->>ScoreCreation: "score created"
        end
    end
    
    Experiment-->>Langfuse: "ExperimentResult"
    Langfuse-->>User: "experiment results"
```

@greptile-apps greptile-apps bot left a comment

2 files reviewed, no comments

@hassiebp hassiebp merged commit 0dc0d2c into main Oct 2, 2025
11 checks passed
@hassiebp hassiebp deleted the fix-experiment-scores branch October 2, 2025 08:32

Development

Successfully merging this pull request may close these issues.

bug: Numeric scores with a value of 0 are incorrectly stored as -1
