Leverage code for system evals

There's no single tool that will pin down the notion of "good" you want an LLM eval to express. Instead, you need multiple tools that triangulate "good" when evaluating an output.

A simple yet overlooked eval is code. No need to throw away the tools we already know.

This goes by different names: property-based testing, unit tests. They're useful when what you want to check is precisely defined and easy to express in code. No need to bring in humans or LLMs for this kind of checking; they can't do better than code. For example, checking that output is valid JSON or conforms to an API specification can be done entirely in code.
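
As a minimal sketch of that kind of check (the response schema and field names here are hypothetical, stand-ins for whatever your API actually requires):

```python
import json

REQUIRED_KEYS = {"title", "summary", "tags"}  # hypothetical API fields

def check_json_output(raw: str) -> bool:
    """Pass if the LLM output is valid JSON with the expected keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    # Must be a JSON object containing every required key
    return isinstance(data, dict) and REQUIRED_KEYS <= data.keys()
```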

Code also handles size restrictions, such as staying under 420 chars, or checking that the output contains certain fixed templates or elements. You can do all of that with regexes and counts. But don't go too far with regexes. Think about where the 80% line is. For something like detecting whether there's a full name or just a last name, lean towards an LLM. There are far too many variations in the wild for a regex.
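
A sketch of the regex-and-counts style of check; the 420-char budget comes from the example above, while the greeting/sign-off template is hypothetical:

```python
import re

MAX_CHARS = 420  # the size limit from the example above

def check_length(output: str) -> bool:
    """Pass if the output fits the character budget."""
    return len(output) <= MAX_CHARS

def check_template(output: str) -> bool:
    """Pass if the output contains the fixed elements we expect.

    The greeting/sign-off pattern is a made-up example; swap in
    whatever fixed template your system prompt demands.
    """
    return bool(re.search(r"^Dear .+,", output)) and "Best regards" in output
```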

Did you account for deBussey? van Buren? Stenson-clifford? For something like "is it an email?", you can get away with a regex on "@" most of the time. The full email spec yields a ridiculous regex to catch all edge cases, and those edge cases occur far less often than odd last names.
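
One version of that deliberately loose check, slightly stricter than a bare "@" test in that it also wants a dot in the domain:

```python
import re

# Deliberately loose: one "@" with non-whitespace on both sides and a
# dotted domain. Trades a few edge cases for a check you can actually read.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def looks_like_email(text: str) -> bool:
    return bool(EMAIL_RE.match(text.strip()))
```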

But you have to judge, for your own case, what false negative rate is acceptable.

Either way, leverage code for what it's good at: quick, high-precision checks on properties of the output that are easy to express in code.
