Grading scales matter for eval consistency
Give some thought to the grading scale when systematizing your LLM evals, whether the graders are humans or an LLM-as-a-judge. A bad grading scale can skew your ability to assess your outputs accurately.
Grading scales are the ratings by which human or LLM judges score the output of another LLM doing a task. For example, you can ask judges to rate the output on a numeric scale, like Yelp reviews.
Typical scales are 1 to 5 or 1 to 10. The trouble is, they're inconsistent. What exactly constitutes a 5 on a 10-point scale, and how exactly is it different from a 6? Even if you could spell it out, a typical human won't be able to keep it all in their head.

In addition, on a 10-point scale, people won't use the full range evenly. Most humans will pick 7 to mean "meh." Anything from 1 to 6 reads as "failing," and if something is bad, they'll pounce on 1 for emotional emphasis, neglecting 2 through 6. On the other end, 8 to 10 reads as "safely optimistic." In practice, your ratings compress down to roughly three positions on a 10-point scale.
As it turns out, LLMs are also bad at this. As a result, your evals end up inconsistent, with little discrimination power.

Both humans and LLMs do best with a binary judgment (yes/no, true/false, relevant/not relevant) or a side-by-side comparison between two competing outputs.
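As a concrete illustration, here's a minimal sketch of a binary LLM judge in Python. The prompt wording and the injected `call_llm` function are placeholders for whatever client and phrasing you actually use, not any particular library's API:

```python
# Minimal sketch of a binary LLM-as-a-judge check. `call_llm` is a
# hypothetical stand-in for your model client; the prompt wording and
# YES/NO parsing are illustrative, not a prescribed format.

BINARY_JUDGE_PROMPT = """You are grading an answer for relevance.

Question: {question}
Answer: {answer}

Is the answer relevant to the question? Reply with exactly one word: YES or NO."""

def judge_relevance(question: str, answer: str, call_llm) -> bool:
    """Return True if the judge replies YES."""
    reply = call_llm(BINARY_JUDGE_PROMPT.format(question=question, answer=answer))
    return reply.strip().upper().startswith("YES")
```

The point is that the judge faces one crisp decision, instead of having to place the answer somewhere on an imaginary ruler.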
Side-by-side comparisons can generate Elo scores over time, giving you a ranking of outputs.
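A standard Elo update is enough to turn a stream of A-vs-B verdicts into ratings. In this sketch, the K-factor of 32 and the 400-point scale divisor are the conventional chess defaults, an assumption on my part rather than anything an eval framework prescribes:

```python
def elo_update(rating_a: float, rating_b: float, a_wins: bool,
               k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one A-vs-B comparison."""
    # Expected score for A under the standard Elo logistic model.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_wins else 0.0
    return (rating_a + k * (score_a - expected_a),
            rating_b + k * ((1.0 - score_a) - (1.0 - expected_a)))

# Start every model/prompt variant at 1000 and fold in each judgment:
# a, b = elo_update(1000.0, 1000.0, a_wins=True)  # -> (1016.0, 984.0)
```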
The more nuanced the grading scale, the more likely two graders (or the same grader on different days!) will give different answers. It's counter-intuitive, but you get nuance from sampling across many independent judgments, rather than from resolution in the grading scale!
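To see how the nuance comes back, aggregate many binary verdicts into a pass rate with a confidence interval. The Wilson score interval below is my choice of a standard method, not something the argument depends on; it also tells you when two variants' scores are actually distinguishable:

```python
import math

def pass_rate_with_ci(verdicts: list[bool], z: float = 1.96) -> tuple[float, float, float]:
    """Return (pass_rate, low, high) using a 95% Wilson score interval."""
    n = len(verdicts)
    p = sum(verdicts) / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return p, center - margin, center + margin

# e.g. 100 binary judgments with 83 passes:
# pass_rate_with_ci([True] * 83 + [False] * 17)  # -> (0.83, ~0.74, ~0.89)
```

A hundred yes/no votes give you a score on a near-continuous scale, with far more consistency than asking one grader to produce "8.3 out of 10" directly.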