Conversation

@hassiebp hassiebp commented Oct 2, 2025

Important

Enhance score creation by passing config_id and data_type in client.py and add tests for BOOLEAN score types in test_experiments.py.

  • Behavior:
    • Update process_item and _process_experiment_item in client.py to pass config_id and data_type to create_score.
    • Ensure create_score handles config_id and data_type correctly.
  • Testing:
    • Add test_boolean_score_types in test_experiments.py to verify BOOLEAN score types are ingested and persisted correctly.
    • Validate BOOLEAN evaluations and run evaluations in the test.
  • Misc:
    • Remove the default value of -1 for value in create_score calls in client.py (see the sketch below for why that default was unsafe).
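
The -1 default matters because of Python truthiness: the linked issue reports that numeric scores with a value of 0 were stored as -1. A minimal, hypothetical sketch of that pitfall (not the actual client.py code) follows:

```python
# Hypothetical illustration of why a -1 fallback is unsafe for numeric scores:
# Python's `or` treats 0 (and 0.0, False) as falsy.

def buggy_score_value(value):
    return value or -1           # a legitimate score of 0 becomes -1

def fixed_score_value(value):
    return value                 # pass the value through unchanged, no default

assert buggy_score_value(0) == -1    # the behavior reported in the linked issue
assert fixed_score_value(0) == 0     # 0 is preserved once the default is removed
```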

This description was created by Ellipsis for ff21f15.

Disclaimer: Experimental PR review

Greptile Overview

Updated On: 2025-10-02 08:26:27 UTC

Summary

This PR fixes a bug in the experiment evaluation system where complete evaluation metadata was being lost during score creation. The changes ensure that when evaluators return `Evaluation` objects with `config_id` and `data_type` fields, these fields are properly passed through to the score creation API calls.

The core issue was in the experiment processing pipeline within langfuse/_client/client.py. Previously, when creating scores from evaluation results, the code was only passing basic score fields but omitting critical metadata like config_id and data_type. This meant that evaluations with explicit data type specifications (like ScoreDataType.BOOLEAN) would lose their type information during persistence.

The fix adds the missing fields to both the async and sync experiment processing paths:

  • In the async path (line 2733): Added config_id=evaluation.config_id
  • In the sync path (lines 2860-2864): Added both config_id=evaluation.config_id and data_type=evaluation.data_type (see the sketch below)
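
A minimal sketch of the forwarding logic, assuming a standalone helper shape rather than the exact code inside _process_experiment_item (the trace_id parameter and the function wrapper are assumptions; the forwarded fields come from the PR summary and the sequence diagram below):

```python
from typing import Any, Iterable

def forward_evaluations_as_scores(
    langfuse_client: Any, trace_id: str, evaluations: Iterable[Any]
) -> None:
    """Hedged sketch: forward each Evaluation's full metadata to create_score."""
    for evaluation in evaluations:
        langfuse_client.create_score(
            name=evaluation.name,
            value=evaluation.value,
            trace_id=trace_id,                  # assumed: ties the score to the item's trace
            comment=evaluation.comment,
            metadata=evaluation.metadata,
            data_type=evaluation.data_type,     # previously dropped, now forwarded
            config_id=evaluation.config_id,     # previously dropped, now forwarded
        )
```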

To validate the fix, a comprehensive test was added in tests/test_experiments.py that specifically exercises boolean score types. The test creates evaluators that return boolean values with explicit ScoreDataType.BOOLEAN annotations, runs an experiment with mixed pass/fail results, and verifies that the resulting boolean scores are persisted through the API with the correct data type.
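
A hedged sketch of what such a test can look like; the import paths, dataset shape, task signature, and assertion step are assumptions rather than the actual tests/test_experiments.py code:

```python
from langfuse import Langfuse, Evaluation      # assumed import locations
from langfuse.api import ScoreDataType         # assumed import location

langfuse = Langfuse()  # credentials read from the environment

def exact_match(*, input, output, expected_output, **kwargs):
    # Returns a BOOLEAN-typed evaluation, so data_type must survive ingestion.
    return Evaluation(
        name="exact_match",
        value=(output == expected_output),
        data_type=ScoreDataType.BOOLEAN,
    )

result = langfuse.run_experiment(
    name="boolean-score-types",
    data=[
        {"input": "2 + 2", "expected_output": "4"},  # should evaluate to True
        {"input": "2 + 2", "expected_output": "5"},  # should evaluate to False
    ],
    task=lambda *, item, **kwargs: "4",              # assumed task signature
    evaluators=[exact_match],
)

# The real test then fetches the created scores through the API and asserts
# that each score has data type BOOLEAN and the expected True/False value.
```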

This change integrates well with the existing Langfuse evaluation framework, which uses Pydantic models like BooleanScore to represent typed scores. The fix ensures that the experiment system properly respects the score type hierarchy and maintains data integrity throughout the evaluation pipeline.
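
The Evaluation fields referenced throughout this PR (name, value, comment, metadata, data_type, config_id) can be pictured roughly as the following Pydantic model; this is an illustrative shape inferred from the summary and diagram, not the library's actual definition:

```python
from typing import Optional, Union
from pydantic import BaseModel

class EvaluationSketch(BaseModel):
    """Illustrative shape only; the real Evaluation type lives in the langfuse SDK."""
    name: str
    value: Union[bool, float, str, None] = None
    comment: Optional[str] = None
    metadata: Optional[dict] = None
    data_type: Optional[str] = None   # e.g. "BOOLEAN", "NUMERIC", "CATEGORICAL"
    config_id: Optional[str] = None   # links the score to a score config
```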

Important Files Changed

| Filename | Score | Overview |
| --- | --- | --- |
| tests/test_experiments.py | 5/5 | Added a comprehensive test for BOOLEAN score types to validate the experiment evaluation pipeline |
| langfuse/_client/client.py | 4/5 | Fixed experiment score creation by adding the missing config_id and data_type fields to API calls |

Confidence score: 4/5

  • This PR is safe to merge as it fixes a clear bug with minimal risk of introducing new issues
  • Score reflects a targeted bug fix with appropriate test coverage, though the client.py changes could benefit from more comprehensive testing
  • Pay close attention to langfuse/_client/client.py to ensure the fix doesn't introduce any regressions in the experiment processing pipeline

Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant Langfuse as "Langfuse Client"
    participant Experiment as "Experiment Runner"
    participant Task as "Task Function"
    participant Evaluator as "Evaluator Function"
    participant ScoreCreation as "Score Creation"
    participant API as "Langfuse API"

    User->>Langfuse: "run_experiment(name, data, task, evaluators)"
    Langfuse->>Experiment: "_run_experiment_async()"
    
    loop For each experiment item
        Experiment->>Experiment: "_process_experiment_item()"
        
        Experiment->>Task: "await _run_task(task, item)"
        Task-->>Experiment: "task_output"
        
        loop For each evaluator
            Experiment->>Evaluator: "await _run_evaluator(evaluator, input, output, expected_output)"
            Evaluator-->>Experiment: "List[Evaluation]"
            
            loop For each evaluation result
                Note over Experiment,ScoreCreation: FIX: Pass full evaluation object
                Experiment->>ScoreCreation: "create_score(evaluation.name, evaluation.value, ...)"
                Note over ScoreCreation: evaluation.comment, evaluation.metadata,<br/>evaluation.data_type, evaluation.config_id
                ScoreCreation->>API: "score creation request"
                API-->>ScoreCreation: "score created"
            end
        end
    end
    
    loop For each run evaluator
        Experiment->>Evaluator: "await _run_evaluator(run_evaluator, item_results)"
        Evaluator-->>Experiment: "List[Evaluation]"
        
        loop For each run evaluation
            Note over Experiment,ScoreCreation: FIX: Pass full evaluation object for run-level scores
            Experiment->>ScoreCreation: "create_score(dataset_run_id, evaluation.name, evaluation.value, ...)"
            Note over ScoreCreation: evaluation.comment, evaluation.metadata,<br/>evaluation.data_type, evaluation.config_id
            ScoreCreation->>API: "score creation request"  
            API-->>ScoreCreation: "score created"
        end
    end
    
    Experiment-->>Langfuse: "ExperimentResult"
    Langfuse-->>User: "experiment results"
```

@greptile-apps greptile-apps bot left a comment

2 files reviewed, no comments

@hassiebp hassiebp merged commit 0dc0d2c into main Oct 2, 2025
11 checks passed
@hassiebp hassiebp deleted the fix-experiment-scores branch October 2, 2025 08:32

Development

Successfully merging this pull request may close these issues.

bug: Numeric scores with a value of 0 are incorrectly stored as -1
