In Braintrust's Evals course, we've set up our evals, learned about traces, explored different approaches to scoring, and gotten up and running. So what do you do after you eval?
Module nine covers four methods in Braintrust for making sense of your experiment data: 1) side-by-side experiment comparison, 2) Loop (Braintrust's natural language query interface), 3) the Braintrust MCP server for SQL-powered analysis, and 4) manual filtering in the UI.
Using these methods, you'll learn how to systematically answer real questions about your AI system, like which inputs consistently underperform, where your costs are highest, and where two experiments disagree the most.
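To make two of those questions concrete, here is a minimal Python sketch of the underlying logic. It does not use the Braintrust SDK or MCP server. It assumes you have already exported per-input scores from two experiments into plain dictionaries, and the input IDs, scores, and function names are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical per-input scores (0.0 to 1.0) exported from two experiments.
baseline: Dict[str, float] = {"refund-1": 1.0, "greeting-1": 0.9, "gibberish-1": 0.2}
candidate: Dict[str, float] = {"refund-1": 0.5, "greeting-1": 1.0, "gibberish-1": 0.25}


def largest_disagreements(
    a: Dict[str, float], b: Dict[str, float], top_n: int = 3
) -> List[Tuple[str, float, float, float]]:
    """Return (input_id, score_a, score_b, gap) rows, largest gap first."""
    shared = a.keys() & b.keys()  # only compare inputs present in both experiments
    rows = [(key, a[key], b[key], abs(a[key] - b[key])) for key in shared]
    rows.sort(key=lambda row: (-row[3], row[0]))  # ties broken by input id
    return rows[:top_n]


def consistently_below(runs: List[Dict[str, float]], threshold: float = 1.0) -> List[str]:
    """Return inputs that scored below the threshold in every run."""
    if not runs:
        return []
    shared = set.intersection(*(set(run) for run in runs))
    return sorted(key for key in shared if all(run[key] < threshold for run in runs))


if __name__ == "__main__":
    for input_id, score_a, score_b, gap in largest_disagreements(baseline, candidate):
        print(f"{input_id}: baseline={score_a:.2f} candidate={score_b:.2f} gap={gap:.2f}")
    print("Always below 100%:", consistently_below([baseline, candidate]))
```

In the course itself these questions are answered inside Braintrust; the sketch only shows the kind of comparison being asked for.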
Timestamps:
0:00 — Intro: You have data — now what do you do with it?
0:19 — Method 1: Comparing experiments side by side
0:30 — The core eval question: "Did this change actually help?"
0:50 — Method 2: Loop — Braintrust's natural language query interface
1:04 — Example query: inputs that consistently scored below 100% on brand alignment
1:19 — Loop's findings: nonsensical inputs, refund edge cases, tone mismatches
1:28 — Example query: average brand alignment for refund-related questions (~90%)
1:44 — Method 3: Braintrust MCP server for direct data querying
1:55 — MCP tools: SQL queries, schema inference, experiment summaries, project listing
2:18 — Example: finding where Module 3 and Module 6 polite experiments disagreed most
2:41 — Method 4: Manual exploration and filtering in the Braintrust UI
2:48 — Filtering by score range, keyword, and duration
3:08 — Recap: 4 methods for analyzing eval results
3:12 — What's next: Evolving the chatbot into a multi-turn app and logging to production