Evals Course: Analyzing multi turn traces

Опубликовано: 21 Май 2026
на канале: Braintrust

149

We've now moved on to evals for multi-turn conversations in Braintrust's Evals course, and have seen this in action with our customer support example. Next up is using our full traces to score and analyze these multi-turn conversations.

Module eleven introduces trace-level scoring, which allows us to evaluate an entire multi-turn conversation as one unit. You'll create a new "Conversation Quality" score alongside the existing per-turn "Brand Alignment" score, and write a script that fetches production logs, buckets spans by conversation, and writes scores back to Braintrust.

By the end we'll have two levels of signal on every conversation, and we'll see why the best insights are found when the two scoring methods disagree.

Timestamps:

0:00 — Why single-turn scoring isn't enough for multi-turn conversations
0:16 — Example failure: bot asks for order number it was already given
0:36 — Intro to trace-level scoring: evaluating a full conversation as one unit
0:44 — Creating the "Conversation Quality" score in the UI (LLM-as-a-judge)
1:07 — thread vs. input: two ways to pass conversation history to the scorer
1:25 — Creating the same score in code using LLMClassifier
1:42 — format_conversation helper: converting raw JSON to readable text
1:53 — scoretraces.py: fetching logs, scoring, and writing results back to Braintrust
2:12 — How spans are structured for a 4-turn conversation (13 spans total)
2:29 — Bucketing spans by conversation using root_span_id
2:43 — Scoring turn spans with Brand Alignment (per-turn)
2:52 — Scoring the root span with Conversation Quality (full conversation)
3:23 — Running the script and viewing results in Braintrust
3:31 — Results: every turn scores A, full conversation scores 100%
3:40 — Chain-of-thought rationale walkthrough
3:59 — When Brand Alignment and Conversation Quality disagree
4:07 — Recap & what's next: Setting up online scoring for every new production log