Defining good metrics for evaluation
What constitutes good? LLMs aren't magic: if you can't articulate what good looks like, the LLM won't know either. Part of that is learning to pick the metrics that bracket that elusive definition of good. Everything you learned in ML engineering doesn't go away with prompt engineering.
When picking metrics for your eval, keep in mind that there is no single measure of good. You'll need to use several. There are two types of metrics: optimizing metrics and satisficing metrics.
Optimizing metrics are the numbers you want to keep improving, a measure of how well you're hitting your goal. The right choice depends on the domain, but keep the metric simple. It won't capture every aspect of good, because every metric is only a proxy for good.
Satisficing metrics are guardrails that ensure your product isn't exposing your users or your business to risk. Typically there are responses considered universally inappropriate in your domain, such as dispensing medical advice or life advice.
These two numbers are in tension, playing tug-of-war over helpfulness. For example, a customer support AI may become more helpful by being more empathetic, but that same empathy can reduce safety if it leads the model to dispense medical advice.
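To make that tension concrete, here is a minimal sketch of how an optimizing metric and a satisficing guardrail might gate a prompt change for that customer support example. It assumes the per-response scores already come from your graders (human or LLM); the field names and the ship rule are illustrative, not a prescribed API.

```python
from dataclasses import dataclass

@dataclass
class GradedResponse:
    helpfulness: float              # optimizing metric, e.g. 0.0-1.0 from a rubric or judge
    dispenses_medical_advice: bool  # satisficing check: must never be True

def ship_decision(graded: list[GradedResponse], baseline_helpfulness: float) -> bool:
    """Ship the new prompt only if the optimizing metric improves
    AND every satisficing guardrail holds on every response."""
    mean_helpfulness = sum(g.helpfulness for g in graded) / len(graded)
    no_violations = not any(g.dispenses_medical_advice for g in graded)
    return mean_helpfulness > baseline_helpfulness and no_violations

# Usage: the more empathetic prompt scores higher on helpfulness,
# but one response crossed into medical advice, so it doesn't ship.
graded = [GradedResponse(0.82, False), GradedResponse(0.91, True), GradedResponse(0.88, False)]
print(ship_decision(graded, baseline_helpfulness=0.75))  # False
```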
For each of these two types of metrics, there are two attributes to pay attention to in a metric's anatomy: the quality it measures and the grading scale it uses.
What is quality? It's not enough to say you'll know it when you see it, and no general advice about it holds for every domain. For example, in a Q&A app, quality might be measured by classic information retrieval metrics like relevance, precision, and recall.
With a conversational assistant app, by contrast, quality might be measured by tool-selection accuracy and the end-to-end success rate. It all depends on the application. Again, remember you'll need multiple metrics that are simple to calculate to box in a definition of good.
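Both families of metrics are cheap to compute once you have labeled examples. As a sketch, this assumes you've already collected the retrieved and relevant document IDs for the Q&A case, and a human-labeled expected tool per turn for the assistant case:

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Classic IR metrics for a Q&A app: of what we retrieved, how much was
    relevant (precision), and of what was relevant, how much did we retrieve (recall)."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def tool_selection_accuracy(chosen_tools: list[str], expected_tools: list[str]) -> float:
    """For a conversational assistant: fraction of turns where the model picked
    the tool a labeler said it should have picked."""
    correct = sum(c == e for c, e in zip(chosen_tools, expected_tools))
    return correct / len(expected_tools) if expected_tools else 0.0

# Usage on toy data:
print(precision_recall({"doc1", "doc3"}, {"doc1", "doc2"}))              # (0.5, 0.5)
print(tool_selection_accuracy(["search", "calc"], ["search", "email"]))  # 0.5
```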
For the grading scale of a metric, you want the same output to receive consistent judgments each time it's graded. Both people and LLMs are much more consistent when making binary choices or picking the better of two versions.
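A minimal sketch of what that looks like in practice, assuming you wrap your judge (an LLM call or a human labeling step, passed in here as a plain callable) behind a forced A-or-B question rather than a numeric score:

```python
from typing import Callable

def grade_pairwise(prompt: str, response_a: str, response_b: str,
                   judge: Callable[[str], str]) -> str:
    """Ask a judge to pick the better of two responses. Forcing a binary
    A/B choice tends to give more consistent judgments than open-ended scores."""
    question = (
        f"User prompt:\n{prompt}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Which response is better? Answer with exactly one letter: A or B."
    )
    verdict = judge(question).strip().upper()
    return verdict if verdict in ("A", "B") else "INVALID"

# Usage with a stub judge (swap in a real LLM call or a human labeling tool):
print(grade_pairwise("How do I reset my password?",
                     "Click 'Forgot password' on the login page.",
                     "Contact support.",
                     judge=lambda q: "A"))  # "A"
```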
Remember that metrics are proxies for quality. You'll always need to use multiple metrics to triangulate the qualities that you want in an output. And finally, you'll need to keep an eye on the actual output over time to ensure that the metrics aren't being gamed.
Each time you do a vibe-based eval, keep building an intuition of what good looks like, and write that intuition down. That'll make it easier to find metrics that align with the intuition later down the line!