Go from Zero to Eval
A digital guide for AI Engineers building in the wild world of LLM system evals
Drive improvements in your LLM-powered systems by designing, implementing, and interpreting evaluations. As engineers and product managers, learn how to build your first eval and convince your team of the value of evals.
- 64 pages of completely downloadable PDF
- 10+ diagrams lovingly aligned
- 70+ meticulously hand-selected images
- Witness cosmic horrors explain statistical metrics
- Join Fox and Bear for nuggets of insight
Tame the AI chaos
In Brightwood Forest, a Large Language Model Shoggoth has made its home in the canopy. Forest creatures have been using this alien intelligence to answer their questions, tell stories, and even write love letters.
But integrating LLMs into complex applications is not easy: sometimes the LLM misinterprets its instructions, struggles to understand the data it reads, or chooses to do the wrong thing entirely.
Sometimes it feels like "check the vibes, cross your fingers, and ship it" is the only option, but it's not. We've outlined a more systematic (and more effective!) approach to evaluations, where you:
- Start simple and scale up, so you can begin evaluating immediately without getting tangled in vines
- Use a variety of evaluation techniques, ensuring you always have a way to measure progress
- Design custom metrics that capture what "good" really means for your specific use case
- Create a golden dataset that lets you confidently compare different versions of your system (see the sketch just after this list)
- Ultimately, transform vague feelings into actionable data, making it easy to improve your LLM implementation
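To make that concrete, here is a minimal sketch of what a first golden-dataset eval might look like in Python. Everything in it, from the `golden_dataset` examples to the `passes` metric and the system under test, is a hypothetical stand-in for your own use case:

```python
# A minimal sketch of a golden-dataset eval. The dataset, the `passes` metric,
# and the system under test are all hypothetical stand-ins for your own setup.
golden_dataset = [
    {"input": "Which berries in Brightwood are safe to eat?", "must_mention": "elderberry"},
    {"input": "Write a love letter to the moon", "must_mention": "moon"},
]

def passes(example: dict, output: str) -> bool:
    # Custom metric: does the output mention what "good" requires for this case?
    return example["must_mention"].lower() in output.lower()

def run_eval(system) -> float:
    # Run every golden example through the system and report the pass rate,
    # a single number you can compare across versions of your system.
    results = [passes(ex, system(ex["input"])) for ex in golden_dataset]
    return sum(results) / len(results)

# Usage: run_eval(my_llm_app), where my_llm_app(prompt) returns the model's text.
```

The pass rate it returns is the kind of single, comparable number that turns "the vibes feel better" into evidence.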
So, grab your solar-powered laptop and find a cozy spot under the bioluminescent mushroom. After reading this enchanted zine, you'll have a toolkit of specific evaluation strategies that you can apply to any LLM-powered system, helping you build with greater confidence and control.
Stop guessing and start measuring
This guide equips developers and product managers with the practical strategies to design, implement, and interpret evaluations that drive real improvements in your LLM-powered systems.
Go from unpredictable to unbeatable
This easy-to-read guide will help you convince your team to invest in LLM system evaluations and walk you step by step through implementing your first eval.
What you'll learn
The value of evals
- What are evals?
- Why should you design your own evals?
- Convincing your team and manager
- Evaluating your system as a whole
Designing your first eval
- Understanding the overall eval structure
- Balancing different goals of eval
- Ensuring reproducibility in eval design
Using robust testing
- Using property-based LLM unit tests
- Doing vibes-based engineering right
- Conducting loss analysis
Selecting quality measures
- Choosing measures of quality
- Picking a grading scale
- Using statistical metric functions
Running effective task evals
- Choosing a human vs. an LLM grader
- Writing great prompts for LLM-as-a-judge (see the sketch after this outline)
- Setting up the develop-label-analyze loop
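As a small taste of those chapters, here is a minimal sketch of a pass/fail LLM-as-a-judge grader. The `call_llm(prompt)` helper is hypothetical, standing in for whichever model client you actually use:

```python
# A minimal sketch of an LLM-as-a-judge grader. `call_llm(prompt)` is a
# hypothetical helper that sends a prompt to your model and returns its text.
JUDGE_PROMPT = """You are grading an answer produced by a question-answering system.

Question: {question}
Answer: {answer}

Reply with exactly one word: PASS if the answer is correct and directly
addresses the question, otherwise FAIL."""

def llm_grade(question: str, answer: str, call_llm) -> bool:
    # Ask the judge model for a verdict and map it to a boolean result.
    verdict = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return verdict.strip().upper().startswith("PASS")
```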
About the Authors
Sridatta Thatipamala
AI Research Engineer
Sri is a research engineer focused on applying LLMs to domain-specific data. He led the development of question answering and search ranking algorithms for medical records at Google Health. Before that, he started the shopping team at Pinterest and founded a developer tools startup backed by YC and a16z.
Wil Chung
Applied Researcher
Wil is leveraging LLMs to build infrastructure for local-first software. He was the first engineer at Pulley, worked on eCommerce at Pebble, and has consulted and experimented in VR, cryptocurrencies, and 3D printing. Before that, he founded a CRM tool backed by YC and SV Angel.