fix(experiments): pass full evaluation to score creation #1391
Important
Enhance score creation by passing `config_id` and `data_type` in `client.py`, and add tests for BOOLEAN score types in `test_experiments.py`.

- Update `process_item` and `_process_experiment_item` in `client.py` to pass `config_id` and `data_type` to `create_score`.
- Ensure `create_score` handles `config_id` and `data_type` correctly (see the sketch after this list).
- Add `test_boolean_score_types` in `test_experiments.py` to verify that BOOLEAN score types are ingested and persisted correctly.
- `-1` for `value` in `create_score` calls in `client.py`.
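For illustration, a minimal sketch of a client-level `create_score` call carrying these two fields. This is not code from the PR: the trace ID, score name, and config ID are placeholders, and the exact keyword signature should be verified against the installed SDK version.

```python
from langfuse import Langfuse

langfuse = Langfuse()  # assumes LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST are set

# Illustrative only: a score created with an explicit data type and score config.
# config_id and data_type are the two fields the experiment path previously dropped.
langfuse.create_score(
    trace_id="replace-with-a-real-trace-id",
    name="exact_match",
    value=1,
    data_type="BOOLEAN",
    config_id="replace-with-a-score-config-id",
    comment="illustrative boolean score",
)
langfuse.flush()
```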
This description was created automatically for commit ff21f15.
Disclaimer: Experimental PR review
Greptile Overview
Updated On: 2025-10-02 08:26:27 UTC
Summary
This PR fixes a bug in the experiment evaluation system where evaluation metadata was lost during score creation. The changes ensure that when evaluators return `Evaluation` objects with `config_id` and `data_type` fields, those fields are properly passed through to the score creation API calls.

The core issue was in the experiment processing pipeline in `langfuse/_client/client.py`. Previously, when creating scores from evaluation results, the code passed only the basic score fields and omitted metadata such as `config_id` and `data_type`. As a result, evaluations with explicit data type specifications (like `ScoreDataType.BOOLEAN`) lost their type information during persistence.

The fix adds the missing fields to both the async and sync experiment processing paths: `config_id=evaluation.config_id` and `data_type=evaluation.data_type`.
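Conceptually, the change amounts to forwarding every field of the evaluation result to score creation. The sketch below illustrates that mapping with a stand-in `Evaluation` dataclass and a `persist_evaluation` helper; neither is the SDK's actual internal code in `client.py`.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Evaluation:
    """Stand-in mirroring the evaluator result fields named in this PR."""
    name: str
    value: Any
    comment: Optional[str] = None
    metadata: Optional[dict] = None
    data_type: Optional[str] = None   # e.g. "BOOLEAN"
    config_id: Optional[str] = None   # ties the score to a score config

def persist_evaluation(client: Any, trace_id: str, evaluation: Evaluation) -> None:
    """Forward all evaluation fields to score creation -- the essence of the fix."""
    client.create_score(
        trace_id=trace_id,
        name=evaluation.name,
        value=evaluation.value,
        comment=evaluation.comment,
        metadata=evaluation.metadata,
        data_type=evaluation.data_type,   # previously omitted
        config_id=evaluation.config_id,   # previously omitted
    )
```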
To validate the fix, a comprehensive test was added in `tests/test_experiments.py` that specifically exercises boolean score types. The test creates evaluators that return boolean values with explicit `ScoreDataType.BOOLEAN` annotations, runs an experiment with mixed pass/fail results, and verifies that the boolean scores are persisted via the API with the correct data type.
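A rough sketch of what such a test can look like, using the `run_experiment(name, data, task, evaluators)` interface shown in the sequence diagram below. The dataset items, evaluator, task signature, and import paths for `Evaluation` and `ScoreDataType` are assumptions for illustration, not the exact contents of `test_boolean_score_types`.

```python
from langfuse import Langfuse, Evaluation   # assumed export locations
from langfuse.api import ScoreDataType      # assumed export location

langfuse = Langfuse()

def task(*, item, **kwargs):
    # Trivial task: echo the input so one item passes and one fails below.
    return item["input"]

def exact_match(*, input, output, expected_output, **kwargs):
    # Boolean-typed evaluation; its data_type must survive through create_score.
    return [
        Evaluation(
            name="exact_match",
            value=output == expected_output,
            data_type=ScoreDataType.BOOLEAN,
        )
    ]

result = langfuse.run_experiment(
    name="boolean-score-types",
    data=[
        {"input": "4", "expected_output": "4"},   # pass
        {"input": "5", "expected_output": "4"},   # fail
    ],
    task=task,
    evaluators=[exact_match],
)
langfuse.flush()

# The real test then fetches the persisted scores via the Langfuse API and
# asserts that each score's data_type is "BOOLEAN".
```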
This change integrates well with the existing Langfuse evaluation framework, which uses Pydantic models like `BooleanScore` to represent typed scores. The fix ensures that the experiment system properly respects the score type hierarchy and maintains data integrity throughout the evaluation pipeline.
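As a rough illustration of what a typed score model looks like, here is a stand-in Pydantic sketch; it is not the SDK's actual `BooleanScore` definition, whose exact fields live in the Langfuse codebase.

```python
from typing import Literal, Optional
from pydantic import BaseModel

class BooleanScoreSketch(BaseModel):
    """Hypothetical typed score model; illustrative only, not the real BooleanScore."""
    name: str
    value: bool                                # boolean scores carry a true/false value
    data_type: Literal["BOOLEAN"] = "BOOLEAN"  # fixed data type for this score class
    config_id: Optional[str] = None            # optional link to a score config
    comment: Optional[str] = None

# Example: a passing boolean score, validated by Pydantic.
score = BooleanScoreSketch(name="exact_match", value=True, config_id=None)
```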
Important Files Changed: `langfuse/_client/client.py`, `tests/test_experiments.py`
Confidence score: 4/5
Sequence Diagram
```mermaid
sequenceDiagram
    participant User
    participant Langfuse as "Langfuse Client"
    participant Experiment as "Experiment Runner"
    participant Task as "Task Function"
    participant Evaluator as "Evaluator Function"
    participant ScoreCreation as "Score Creation"
    participant API as "Langfuse API"

    User->>Langfuse: "run_experiment(name, data, task, evaluators)"
    Langfuse->>Experiment: "_run_experiment_async()"
    loop For each experiment item
        Experiment->>Experiment: "_process_experiment_item()"
        Experiment->>Task: "await _run_task(task, item)"
        Task-->>Experiment: "task_output"
        loop For each evaluator
            Experiment->>Evaluator: "await _run_evaluator(evaluator, input, output, expected_output)"
            Evaluator-->>Experiment: "List[Evaluation]"
            loop For each evaluation result
                Note over Experiment,ScoreCreation: FIX: Pass full evaluation object
                Experiment->>ScoreCreation: "create_score(evaluation.name, evaluation.value, ...)"
                Note over ScoreCreation: evaluation.comment, evaluation.metadata,<br/>evaluation.data_type, evaluation.config_id
                ScoreCreation->>API: "score creation request"
                API-->>ScoreCreation: "score created"
            end
        end
    end
    loop For each run evaluator
        Experiment->>Evaluator: "await _run_evaluator(run_evaluator, item_results)"
        Evaluator-->>Experiment: "List[Evaluation]"
        loop For each run evaluation
            Note over Experiment,ScoreCreation: FIX: Pass full evaluation object for run-level scores
            Experiment->>ScoreCreation: "create_score(dataset_run_id, evaluation.name, evaluation.value, ...)"
            Note over ScoreCreation: evaluation.comment, evaluation.metadata,<br/>evaluation.data_type, evaluation.config_id
            ScoreCreation->>API: "score creation request"
            API-->>ScoreCreation: "score created"
        end
    end
    Experiment-->>Langfuse: "ExperimentResult"
    Langfuse-->>User: "experiment results"
```