Grading scales matter for eval consistency

Some thought should be given to the grading scale when systematizing your LLM evals, regardless of whether they're for humans or for LLM-as-a-judge. A bad grading scale can skew your ability to assess your output accurately.

Grading scales are the ratings by which humans or LLMs judge the output of another LLM doing a task. You can ask human/LLM judges to rate the output on a numeric scale like Yelp reviews.

Typical scales are 1 to 5 or 1 to 10. The trouble is, they're inconsistent.

What exactly constitutes a 5 on a 10-point scale? And exactly how is it different from a 6? Even if you could spell it out, a typical human won't be able to keep it all in their head.

In addition, people won't use a 10-point scale evenly. Most humans will pick 7 to mean "meh." 1-6 is considered "failing", and if something is bad, they'll pounce on 1 for emotional emphasis, thus neglecting 2-6. On the other end, 8-10 are considered "safely optimistic", so you'd get rating compression to just 3 positions on a 10-point scale.

As it turns out, LLMs are also bad at this. As a result, your evals will be inconsistent, without much discriminating power.

Both humans and LLMs do best with a binary judgment (yes/no, true/false, relevant/not), or a side-by-side comparison between two competing outputs.

Side-by-side comparisons can generate Elo scores over time for a ranking of outputs.

The more nuanced the grading scale, the more likely two graders (or the same grader on different days!) would give different answers.
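To make the Elo idea concrete, here is a minimal sketch of turning side-by-side verdicts into a ranking. It assumes the standard Elo update rule; the starting rating of 1000, the K-factor of 32, and the output names (`prompt_v1` and so on) are illustrative choices, not anything prescribed by this post.

```python
from typing import Dict, List, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    ratings: Dict[str, float],
    winner: str,
    loser: str,
    k: float = 32.0,         # assumed K-factor; tune for your eval volume
    initial: float = 1000.0  # assumed starting rating for unseen outputs
) -> None:
    """Apply one side-by-side verdict to the ratings table, in place."""
    r_win = ratings.get(winner, initial)
    r_lose = ratings.get(loser, initial)
    # The winner gains more when the win was unexpected.
    delta = k * (1.0 - expected_score(r_win, r_lose))
    ratings[winner] = r_win + delta
    ratings[loser] = r_lose - delta


# Hypothetical verdicts from a human or LLM judge: (preferred, rejected).
comparisons: List[Tuple[str, str]] = [
    ("prompt_v2", "prompt_v1"),
    ("prompt_v2", "prompt_v3"),
    ("prompt_v3", "prompt_v1"),
]

ratings: Dict[str, float] = {}
for preferred, rejected in comparisons:
    update_elo(ratings, preferred, rejected)

# Highest-rated output first.
ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)
```

Each verdict is a single binary choice, yet the ratings table accumulates a fine-grained ordering as more comparisons come in.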

It's counter-intuitive, but you get the nuance from sampling across many independent judgments, rather than from resolution in the grading scale!
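As a rough sketch of that idea, the snippet below aggregates many yes/no judgments into a pass rate with a standard error. The function name `pass_rate` and the sample verdicts are hypothetical, and the standard error uses the simple binomial formula, which assumes the judgments are independent.

```python
import math
from typing import Sequence, Tuple


def pass_rate(judgments: Sequence[bool]) -> Tuple[float, float]:
    """Return (pass rate, standard error) for a set of binary verdicts."""
    n = len(judgments)
    if n == 0:
        raise ValueError("need at least one judgment")
    p = sum(judgments) / n
    # Binomial standard error: shrinks as more judgments are collected.
    se = math.sqrt(p * (1.0 - p) / n)
    return p, se


# Hypothetical verdicts ("is this answer relevant?") for two outputs.
verdicts_a = [True] * 8 + [False] * 2
verdicts_b = [True] * 5 + [False] * 5

for name, verdicts in [("output_a", verdicts_a), ("output_b", verdicts_b)]:
    rate, se = pass_rate(verdicts)
    print(f"{name}: pass rate {rate:.2f} (standard error {se:.2f})")
```

No single judge ever gives a graded score here; the continuous number comes from counting how often independent judges say yes.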
