Conversation

@hassiebp (Contributor) commented Nov 12, 2025

Important

Move evaluator execution out of the root span in _process_experiment_item() in client.py to ensure evaluations run independently of task execution.

  • Behavior:
    • Move evaluator execution and score creation out of the root span in _process_experiment_item() in client.py (see the structural sketch after this list).
    • Evaluations are now processed independently of task execution, ensuring they run even if the task fails.
  • Error Handling:
    • Maintains error logging for evaluator failures with langfuse_logger.error().
  • Misc:
    • Adjusted indentation for clarity and separation of concerns in client.py.
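
A minimal sketch of the structural change, assuming the root span is opened with `start_as_current_span` as shown in the sequence diagram further below. The function name mirrors the PR description, but the surrounding names (`item`, `task`, `evaluators`) and the body are illustrative, not the actual client.py source:

```python
import logging

from langfuse import Langfuse  # assumes the v3 Python SDK; credentials come from env vars

langfuse = Langfuse()
langfuse_logger = logging.getLogger("langfuse")


def process_experiment_item(item, task, evaluators):
    # The root span now wraps only the task execution and its input/output.
    with langfuse.start_as_current_span(name="experiment-item-run") as span:
        output = task(item=item)
        span.update(input=item.input, output=output)

    # After this PR, the evaluator loop sits here, outside the `with` block,
    # so evaluator calls and score creation are no longer nested under the root span.
    for evaluator in evaluators:
        try:
            ...  # run the evaluator and create scores (see the sequence diagram below)
        except Exception as e:
            langfuse_logger.error(f"Evaluator failed: {e}")
```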

This description was created by Ellipsis for dad3bfe.

Greptile Overview

Greptile Summary

Moved evaluator execution outside the experiment-item-run span context to prevent evaluation operations from being nested under the root experiment span.

Key changes:

  • Evaluators now run after the experiment span context exits (unindented the evaluator loop by one level)
  • Evaluations are still correctly associated with the span via observation_id=span.id
  • Error handling remains intact: evaluators only run if task execution succeeds
  • Variable scoping is preserved, as Python retains variables defined within with blocks after exit (demonstrated in the short snippet below)
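
A quick, self-contained illustration of that last point (plain Python, independent of the Langfuse SDK): a `with` statement does not introduce a new lexical scope, so names bound inside it stay visible after the block exits.

```python
from contextlib import nullcontext

with nullcontext() as ctx:
    output = "task result"  # name bound inside the with block

# Still visible here: leaving the block runs the context manager's __exit__,
# but it does not create a new scope, so `output` (and `span` in client.py) survive.
print(output)  # -> task result
```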

Confidence Score: 4/5

  • This PR is safe to merge with low risk - it's a straightforward refactoring that changes span nesting without affecting functionality
  • The change is architecturally sound: moving evaluators outside the span context prevents them from appearing as nested operations. All variable references remain valid (Python retains variables from with blocks), error handling is preserved (evaluators only run on success), and the span.id reference is valid after context exit since it's an instance attribute.
  • No files require special attention

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| langfuse/_client/client.py | 4/5 | Moved evaluator execution outside the experiment-item-run span context to prevent evaluations from being nested under the root span. The change maintains correct variable scoping and error handling. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant Client as Langfuse Client
    participant Span as Experiment Span
    participant Task as User Task
    participant Eval as Evaluators
    
    Client->>Span: start_as_current_span("experiment-item-run")
    activate Span
    
    Span->>Task: run task with item input
    Task-->>Span: return output
    
    Span->>Span: update span with input/output
    
    Client->>Span: exit span context
    deactivate Span
    
    Note over Client,Eval: Evaluators run OUTSIDE span context
    
    loop For each evaluator
        Client->>Eval: run_evaluator(input, output, expected_output)
        Eval-->>Client: evaluation results
        Client->>Client: create_score(trace_id, observation_id=span.id)
    end
    
    Client->>Client: return ExperimentItemResult
```
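
In code, the loop at the bottom of the diagram might look roughly like this. It continues the names from the sketch in the PR description above (`item`, `output`, `evaluators`, `span`, `langfuse`, `langfuse_logger`); the direct evaluator call and the exact `create_score` keyword arguments are assumptions inferred from the diagram, not the verbatim client.py implementation.

```python
# Hedged sketch of the post-span evaluator loop shown in the diagram above.
for evaluator in evaluators:
    try:
        evaluations = evaluator(
            input=item.input,
            output=output,
            expected_output=item.expected_output,
        )
        for evaluation in evaluations:
            langfuse.create_score(
                name=evaluation.name,
                value=evaluation.value,
                trace_id=span.trace_id,   # trace of the experiment item run
                observation_id=span.id,   # still valid after the span context has exited
            )
    except Exception as e:
        langfuse_logger.error(f"Evaluator failed: {e}")
```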

@greptile-apps (bot) left a comment

1 file reviewed, no comments

@hassiebp merged commit fd7e850 into main on Nov 12, 2025 (12 checks passed)
@hassiebp deleted the fix-evals-out-of-exp branch on November 12, 2025 at 14:49
@thdesc commented Nov 22, 2025

Hi @hassiebp, thanks for the update! I have a question regarding the new tracing structure.

Now that the evaluator (LLM-based in our case) runs outside of the root span, how can we easily understand or debug how an evaluation produced a given score for a task? Since the evaluation events now appear in a separate trace, it seems harder to connect the task run with the corresponding evaluation.

In our workflow, we upload a dataset to Langfuse, run an agent over all items using the experiment SDK (the task function), and then use another agent to generate scores for those runs. Since this update, I don’t see an easy way in the Langfuse UI to quickly navigate from the task’s trace to the associated evaluation trace.

Is there something we’re missing, or any recommended way to link them now?
Thank you!
