In Module three of Braintrust's Evals course, you built and saved two experiments in the Braintrust UI. Now it's time to look at the data and see how the two experiments compare.
In Module four, you'll use Braintrust's side-by-side comparison view and diff mode to analyze the polite vs. concise persona results row by row. You'll see how chain-of-thought reasoning turns a raw score into actionable feedback, and discover a real tradeoff: the polite persona scores 100% on brand alignment but uses nearly 3x as many tokens (~16K vs. ~6K) as the concise persona, which scores 71.88%.
We'll also break down three paths forward for dealing with this tradeoff, and explain why evals are meant to inform these kinds of decisions rather than make them for you.
Timestamps:
0:00 — Recap: Two experiments saved, now time to analyze
0:14 — Opening the Experiments tab and loading both results
0:36 — Enabling diff mode to highlight where experiments diverge
0:49 — Aggregate scores: Polite (100%) vs. Concise (71.88%) brand alignment
1:05 — Drilling into rows 1–9 where the concise persona scored 50%
1:19 — Row 1 deep dive: "Why did my package disappear?"
1:32 — Chain-of-thought breakdown for the polite response (score: A / 100%)
1:56 — Chain-of-thought breakdown for the concise response (score: B / 50%)
2:25 — Why chain-of-thought reasoning makes scores actually useful
2:35 — Comparing token usage: ~16K (polite) vs. ~6K (concise)
2:59 — The real tradeoff: quality vs. cost at scale
3:03 — Your 3 options: ship polite, ship concise, or iterate
3:36 — Key insight: Evals inform decisions, they don't make them
3:45 — Recap & what's next: Playgrounds vs. Experiments explained