Evals Course: Online scoring

Published: 23 May 2026
Channel: Braintrust

In the prior module of Braintrust's Evals course, we manually ran a scoring script once on "offline" data. But production AI systems need scoring to happen automatically on every new conversation as your product interacts with real users.

Module 12 covers setting up online scoring in Braintrust using two automation rules, which trigger Brand Alignment (per turn) and Conversation Quality (full trace) scoring on every log as it arrives.
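To make the difference between the two rules concrete, here is a minimal conceptual sketch in plain Python. It is not the Braintrust API: the `AutomationRule` class, the `select_targets` helper, the trace layout, and the `turn_number` field are all hypothetical names chosen for illustration, and sampling is simplified to one decision per trace. It only models the idea that a span-scoped rule uses a filter to pick individual turns, while a trace-scoped rule scores the whole conversation once.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AutomationRule:
    """Hypothetical stand-in for an online scoring rule (not the Braintrust API)."""
    name: str
    scope: str                      # "span" (per turn) or "trace" (full conversation)
    sampling_rate: float = 1.0      # 1.0 means score 100% of incoming logs
    span_filter: Optional[Callable[[dict], bool]] = None


def select_targets(rule: AutomationRule, trace: dict,
                   rng: Optional[random.Random] = None) -> list:
    """Return the items a rule would score for one incoming trace."""
    rng = rng or random.Random()
    # Sampling: random() is in [0, 1), so a rate of 1.0 never skips a trace.
    if rng.random() >= rule.sampling_rate:
        return []
    if rule.scope == "trace":
        return [trace]
    if rule.span_filter is None:
        raise ValueError("span-scoped rules need a filter to pick which spans to score")
    return [span for span in trace["spans"] if rule.span_filter(span)]


# Assumed setup: turn spans carry a "turn_number" field, other spans do not.
brand_rule = AutomationRule(
    name="Online Brand Alignment",
    scope="span",
    span_filter=lambda span: "turn_number" in span,
)
quality_rule = AutomationRule(name="Online Conversation Quality", scope="trace")

# Made-up two-turn conversation with one extra non-turn span.
trace = {"spans": [
    {"name": "turn", "turn_number": 1},
    {"name": "llm_call"},
    {"name": "turn", "turn_number": 2},
]}

print(len(select_targets(brand_rule, trace)))    # 2
print(len(select_targets(quality_rule, trace)))  # 1
```

In this sketch the span rule yields one scoring target per turn and skips the unrelated span, while the trace rule yields a single target for the full conversation, which is why the two scores can disagree on the same log.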

You'll then generate a batch of 10 scripted production conversations to see it all working live, including an example where Brand Alignment scores high but Conversation Quality scores zero because the bot asked diagnostic questions but never actually resolved the issue.

Timestamps:

0:00 — The problem with manual scoring scripts in production
0:18 — What online scoring is and why you need it
0:25 — Setup: both scores from Module 11 are ready to reuse
0:41 — Creating Automation Rule 1: "Online Brand Alignment"
0:49 — Configuring scope (span), filter (turn number), and sampling rate (100%)
1:13 — Why span-scoped rules require a filter
1:29 — Creating Automation Rule 2: "Online Conversation Quality"
1:34 — Configuring scope (trace), no filter needed
1:47 — What each rule does: per-turn quality vs. full conversation resolution
2:01 — Generating production logs to test scoring
2:14 — generate_conversations.py: 10 scripted conversations (5 single-turn, 5 multi-turn)
2:42 — Running the script and watching scores populate automatically in Braintrust
2:54 — Clicking into a multi-turn conversation: per-turn + trace-level scores visible
3:09 — Live example: Brand Alignment high, Conversation Quality = 0
3:16 — The conversation: iPhone 15 app crash, consistent B on brand alignment
3:27 — Chain-of-thought reveals why: 3 diagnostic questions, no concrete resolution
3:47 — Recap & what's next: Generating enough logs to find patterns across hundreds of conversations