Evals Course: Building a simple eval using Braintrust SDK

Опубликовано: 21 Май 2026
на канале: Braintrust

150

In Module six of Braintrust's Evals course, we move beyond the UI and start building with the Braintrust SDK. While the UI is great for getting started, most teams eventually want to run evals in code for version control, programmatic execution, and CI/CD integration.

In this module, you'll recreate the same customer support eval from the UI, but this time entirely in Python using the Braintrust SDK. You'll walk through installing dependencies, setting API keys, defining datasets as lists of dictionaries, writing separate task functions for each personality, and recreating the brand alignment LLM-as-a-judge scorer using the autoevals library.

You'll also notice that the scores differ slightly when using the SDK as compared to our example in the UI. We'll cover that mystery in the next module.

Timestamps:

0:00 — Why move from UI to code: version control, automation, CI/CD
0:18 — Overview: Recreating the customer support eval in Python
0:29 — Installing dependencies: Braintrust, autoevals, OpenAI
0:35 — What each dependency does
0:56 — Setting up API keys: Braintrust + OpenAI
1:16 — Code walkthrough begins
1:20 — braintrust.init and auto-instrumentation explained
1:42 — Defining the dataset as a list of dictionaries
1:56 — Task functions: polite and concise personas in code
2:25 — Recreating the brand alignment score with LLMClassifier
2:40 — Mapping A/B/C grades to numeric scores (1, 0.5, 0)
2:59 — Enabling chain-of-thought and passing input/output
3:11 — Tying it together with the eval() call
3:33 — Running the file and viewing results in Braintrust UI
3:46 — Results: Polite (93.75%) vs. Concise (81.25%) brand alignment
3:57 — Noticing score differences between UI and code runs — why?
4:06 — What's next: Understanding and fixing score inconsistency