Evals Course: How to deal with nondeterminism

Published: 23 May 2026
Channel: Braintrust

In Module 6 of Braintrust's Evals course, we noticed that the same example scored differently in the UI than it did when run through the SDK. This isn't an error: running the same eval twice doesn't always give you the same score. It is a fundamental property of LLMs, and something you need to account for when building evals.

In Module 7, we dig into why score variance happens, what temperature has to do with it, and why even setting temperature to 0 doesn't fully eliminate non-determinism.
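
For readers who want to see the temperature idea in code, here is a minimal sketch of temperature-scaled sampling probabilities. It is an illustration only, not how any particular model provider implements decoding: the function name, the example logits, and the choice to treat temperature 0 as "always pick the top token" are all assumptions. It covers only the sampling step, so it does not reproduce the residual variance that can remain at temperature 0.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw token scores into probabilities, scaled by temperature.

    Hypothetical helper for illustration; not part of any SDK.
    """
    if temperature <= 0:
        # Assumption: treat temperature 0 as greedy decoding,
        # putting all probability on the highest-scoring token.
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for three candidate tokens.
logits = [2.0, 1.0, 0.5]

greedy = softmax_with_temperature(logits, 0)     # all mass on the top token
sampled = softmax_with_temperature(logits, 1.0)  # mass spread across tokens
```

At temperature 1, the lower-scoring tokens keep a nonzero probability, which is why repeated runs can produce different outputs and therefore different scores.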

You'll learn how to run your evals multiple times and average the results. You'll see how Braintrust's trial_count parameter makes this easy, and watch a real example where a single input scores an A on one trial and a B on the next two, which shows why averaged scores are far more trustworthy than single runs.
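
The averaging idea can be sketched in a few lines of plain Python. This is a rough illustration, not the Braintrust SDK: in the course, trial_count handles the repetition for you. The grade-to-number mapping, the input ids, and the per-trial results are made up to mirror the "A once, B twice" case from the video.

```python
from collections import defaultdict
from statistics import mean

# Assumption: a hypothetical mapping from letter grades to numeric scores.
GRADE_TO_SCORE = {"A": 1.0, "B": 0.5}

# Made-up per-trial results as (input id, grade), three trials per input.
trial_results = [
    ("input-1", "A"), ("input-1", "B"), ("input-1", "B"),
    ("input-2", "A"), ("input-2", "A"), ("input-2", "A"),
]

def average_by_input(results):
    """Group trial scores by input and return the mean score per input."""
    scores = defaultdict(list)
    for input_id, grade in results:
        scores[input_id].append(GRADE_TO_SCORE[grade])
    return {input_id: mean(values) for input_id, values in scores.items()}

averages = average_by_input(trial_results)
```

A single run of input-1 could report either 1.0 or 0.5 depending on which trial you happened to see; the average over three trials reflects both outcomes. The same arithmetic explains the row count in the video: 16 inputs with trial_count=3 produce 48 rows.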

Timestamps:

0:00 — The problem: same eval, same data, different scores
0:16 — Root cause: LLMs are non-deterministic
0:20 — Score comparison: Module 3 (UI) vs. Module 6 (code) results
0:42 — What temperature is and how it affects output randomness
0:50 — Temperature = 0: more deterministic, picks most likely token
0:59 — Temperature = 1: samples full distribution, high variance
1:07 — Best practice: set temperature to 0 when running evals
1:16 — Why variance still occurs even at temperature = 0
1:26 — The solution: run evals multiple times and average the scores
1:35 — How to use trial_count in the Braintrust SDK
1:54 — Running the eval with trial_count=3 and reviewing results
2:05 — Results: 48 rows instead of 16, averaged scores per input
2:20 — Real example: one input scores A once and B twice across trials
2:40 — Recap & what's next: How to read a trace in Braintrust