Guidelines for consistent grading in LLM evals

When starting off with evals, you might begin with human graders. That gives you experience with edge cases, and it hones your intuition of what good actually looks like. But once you've done that, you need to scale up a little, which means writing guidelines on what good looks like.

This isn't time wasted, because the guidelines you write for human evaluators will be similar, if not identical, to the prompt you'd give an LLM-as-a-judge. So do the reps and don't skip this step.

You want to be specific and prescriptive in your guidelines. The idea is to codify your intuition of what good looks like so other people can replicate your process and judgment. But don't be overly prescriptive; that balance can be hard to strike.

First, explain the context of the task. What is the user trying to accomplish? What do they expect from the system? What situation are they in when they try to accomplish it?
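Here's a minimal sketch of what that context might look like at the top of a judge prompt. The product, the user goal, and the function name are all made up for illustration:

```python
# A minimal sketch of a judge prompt that opens with task context.
# The bookstore scenario and helper names here are hypothetical.

TASK_CONTEXT = """\
You are grading responses from a customer-support assistant for an
online bookstore. The user is trying to resolve an order problem
(missing package, refund, wrong item). They expect a direct answer
that references their specific order, not generic troubleshooting.
"""

def build_judge_prompt(response: str) -> str:
    """Assemble the grading prompt: context first, then the response to grade."""
    return (
        TASK_CONTEXT
        + "\nHere is the assistant's response to grade:\n\n"
        + response
        + "\n\nGrade the response as PASS or FAIL and explain why in one sentence."
    )

if __name__ == "__main__":
    print(build_judge_prompt("Your refund for order #1234 was issued today."))
```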

Then, show the grader examples of responses and how those responses would be graded. The examples should cover both typical responses and edge cases, but there's no need to enumerate every edge case; three to five examples will do. Neither humans nor LLMs can hold that much in their proverbial heads, despite ever-lengthening context windows.
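One way to do that is to embed the graded examples directly in the prompt. The responses, grades, and reasons below are invented, but the shape is the point:

```python
# A sketch of embedding a handful of graded examples in the judge prompt.
# The examples and grades are invented for illustration.

GRADED_EXAMPLES = [
    ("Your refund for order #1234 was issued today; expect it in 3-5 days.",
     "PASS", "Answers the question directly and references the order."),
    ("Refund policies vary. Please consult our help center.",
     "FAIL", "Generic deflection; does not address the user's order."),
    ("I can't find order #1234. Could you confirm the email on the account?",
     "PASS", "Asks a specific, necessary clarifying question."),
]

def format_examples(examples) -> str:
    """Render each (response, grade, reason) triple as a worked example."""
    blocks = []
    for response, grade, reason in examples:
        blocks.append(f"Response: {response}\nGrade: {grade}\nReason: {reason}")
    return "\n\n".join(blocks)

print(format_examples(GRADED_EXAMPLES))
```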

Long prompts have sections that an LLM will effectively forget, leaving it unable to find a piece of information buried there. So do what you can to keep things clear, succinct, and to the point. For more, check out the "needle in a haystack" experiments.

Then finally, present the grading scale and what each grade means. A binary grade works better than a numeric scale. Better yet, compare two samples and let the grader choose which one is better.
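A pairwise setup for an LLM judge might look something like this sketch; `call_llm` is a stand-in for whatever client you actually use, not a real API:

```python
# A sketch of pairwise grading: show two samples and ask which is better.
# `call_llm` is a placeholder, not a real library call.

def call_llm(prompt: str) -> str:
    """Placeholder: send the prompt to your LLM of choice and return its text."""
    raise NotImplementedError

def grade_pairwise(task: str, response_a: str, response_b: str) -> str:
    """Ask the judge to pick the better of two responses; returns 'A' or 'B'."""
    prompt = (
        f"Task: {task}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Which response better accomplishes the task? "
        "Answer with exactly one letter: A or B."
    )
    verdict = call_llm(prompt).strip().upper()
    return "A" if verdict.startswith("A") else "B"
```

Constraining the answer to a single letter keeps the verdict easy to parse.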

You should also give a checklist of things to watch out for. This will help human and LLM graders be more consistent about how they grade a particular output.
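Something like the sketch below, where the checklist items are only examples and should be replaced with ones specific to your task:

```python
# A sketch of a grading checklist appended to the judge prompt.
# The checklist items are examples; write ones that match your own task.

CHECKLIST = [
    "Does the response address the user's actual question?",
    "Does it reference the specific details the user provided?",
    "Is it free of fabricated facts or policies?",
    "Is the tone appropriate for the product?",
]

def format_checklist(items) -> str:
    """Render the checklist as a numbered list for the grader to walk through."""
    return "\n".join(f"{i}. {item}" for i, item in enumerate(items, start=1))

print("Before grading, check each item:\n" + format_checklist(CHECKLIST))
```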

Some thought should be paid to the grading scale when systematizing your LLM evals, regardless of whether they're for humans or for an LLM-as-a-judge. A bad grading scale can skew your ability to assess your outputs accurately. Humans do best when they have binary choices or two samples to compare with each other.

And after you've deployed to either human or LLM graders, you're not done. It's now time to iterate. Keep an eye on the output, and see what kinds of bugs and complaints come in. Then keep adjusting your eval and see how the metrics you've put in place improve.
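One concrete metric worth tracking as you iterate is how often your LLM judge agrees with a small set of human-labelled examples. A rough sketch, with assumed field names:

```python
# A sketch of tracking agreement between an LLM judge and human labels.
# The 'human_grade' / 'llm_grade' field names are assumptions.

def judge_human_agreement(records) -> float:
    """records: iterable of dicts with 'human_grade' and 'llm_grade' keys."""
    records = list(records)
    if not records:
        return 0.0
    matches = sum(1 for r in records if r["human_grade"] == r["llm_grade"])
    return matches / len(records)

labels = [
    {"human_grade": "PASS", "llm_grade": "PASS"},
    {"human_grade": "FAIL", "llm_grade": "PASS"},
    {"human_grade": "FAIL", "llm_grade": "FAIL"},
]
print(f"Judge/human agreement: {judge_human_agreement(labels):.0%}")
```

If that number drops after you tweak the guidelines, the tweak probably made the prompt less clear, not more.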
