Building AI systems is hard because AI behaves unpredictably. In the first module of Braintrust's Evals course, Jess Wang breaks down the six most common problems developers face when shipping AI applications, explains why traditional software thinking doesn't apply to non-deterministic systems, and shows how evals address each of them.
Through a real-world example, OpenAI's 2025 model rollback, you'll learn how evals help you measure quality, track improvements, catch regressions, and ship with confidence. Check out module two to learn about the three core components of an eval system.
Timestamps:
0:00 — Intro: Common problems when building AI systems
0:05 — Problem 1: Hallucinations & inconsistent results after deployment
0:10 — Problem 2: Model upgrades breaking your application
0:16 — Problem 3: Changes causing regressions elsewhere
0:21 — Problem 4: Balancing accuracy vs. cost when choosing a model
0:26 — Problem 5: Not knowing how to measure prompt improvements
0:31 — Problem 6: No data-driven way to decide if a feature is ready to ship
0:38 — Why AI is different: non-determinism vs. traditional software
1:05 — Real-world example: OpenAI's 2025 model rollback
1:26 — How evals solve these problems (quality, cost, latency, regressions)
1:45 — What's next: The 3 core components of an eval