Aligning LLMs-as-a-judge

LLMs-as-a-judge sounds counterintuitive. How can an LLM judge its own work? If it could, wouldn't it just do better in the first place? Remember, out of the box an LLM is like a stochastic function. It has no memory beyond what you provide in its context.

If it helps, think of it the way a human can judge the work of another human. But before we use an LLM-as-a-judge, we have to align it with human judgments of outputs in our particular domain. The raw output of an LLM is a summary of the crush of humanity.

You start like you do with all machine learning: the unsexy work of collecting and cleaning data. You want to collect a number of samples of what "good" looks like, covering plenty of the edge cases too. The good news is that you don't need nearly as much data as you used to.

But you still need enough. This is why it pays to be systematic and organized about your manual, vibes-based judgments. If you haven't started doing that, you'll need to before you can hand the job off to an LLM. This is your golden dataset, the source of truth.

At a minimum, this golden dataset should cover the most common cases plus the edge cases you care about. This is also the stage to watch for any biases in your dataset that you want to avoid.
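As a concrete sketch, a golden dataset can be as simple as a JSONL file of inputs, model outputs, and the human verdict. The field names below (`input`, `output`, `human_label`, `notes`) are illustrative, not a prescribed schema:

```python
# golden_dataset.jsonl -- one human-labeled example per line (hypothetical schema):
# {"input": "...", "output": "...", "human_label": "pass", "notes": "covers refund edge case"}

import json

def load_golden_dataset(path: str) -> list[dict]:
    """Load the human-labeled examples: the source of truth the LLM judge must match."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```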

Then, run the examples through your LLM eval and measure how well the LLM's labels agree with the human labels in the golden dataset.
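Concretely, that step can look like the sketch below. The `llm_judge` callable and the plain match-rate metric are assumptions for illustration; a chance-corrected measure such as Cohen's kappa (e.g. `sklearn.metrics.cohen_kappa_score`) works just as well.

```python
def agreement_rate(examples: list[dict], llm_judge) -> float:
    """Fraction of examples where the LLM judge's label matches the human label."""
    matches = 0
    for ex in examples:
        # llm_judge is a hypothetical callable that returns a label like "pass" or "fail"
        llm_label = llm_judge(ex["input"], ex["output"])
        matches += int(llm_label == ex["human_label"])
    return matches / len(examples)
```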

Lastly, build a confusion matrix to see where the LLM and the human disagree. With complete agreement, all the counts land on the diagonal. Then you iterate on your prompt to drive down the confusion; empirically, it usually takes only a few iterations. Shoot for above 80% agreement. The whole idea is an LLM judge that rules the same way a human user would.
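One way to build that matrix with nothing but `collections.Counter`, again assuming the hypothetical `llm_judge` and pass/fail labels from the earlier sketches:

```python
from collections import Counter

def confusion_matrix(examples: list[dict], llm_judge) -> Counter:
    """Count (human_label, llm_label) pairs; perfect agreement puts all mass on the diagonal."""
    counts = Counter()
    for ex in examples:
        llm_label = llm_judge(ex["input"], ex["output"])
        counts[(ex["human_label"], llm_label)] += 1
    return counts

# Inspect the off-diagonal cells to see where to tweak the judge prompt:
# for (human, llm), n in sorted(confusion_matrix(golden, my_judge).items()):
#     print(f"human={human:5s} llm={llm:5s} count={n}")
```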

Would you LLM-eval the evaluator? Is it turtles all the way down? Usually not. Because the evaluator's output is usually a yes/no or a selection between a few choices, it's easy to check, so we use statistical metrics and plain code to judge the output of the evaluator.
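For instance, if the judge is constrained to answer "yes" or "no", a few lines of normalization plus the agreement check above is usually all the meta-evaluation you need. This is a sketch, not a prescribed API:

```python
def parse_verdict(raw: str) -> str:
    """Normalize a free-form judge reply into 'yes'/'no'; anything else gets flagged for review."""
    text = raw.strip().lower()
    if text.startswith("yes"):
        return "yes"
    if text.startswith("no"):
        return "no"
    return "unparseable"
```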

💡
Intrigued about evals? We're publishing a zine on LLM evals. Click here to preview and pre-order.