Go from Zero to Eval

A digital guide for AI Engineers navigating the wild world of LLM system evals

Drive improvements in your LLM-powered systems by designing, implementing, and interpreting evaluations. Whether you're an engineer or a product manager, learn how to build your first eval and convince your team of the value of evals.

  • A 64-page, fully downloadable PDF
  • 10+ diagrams lovingly aligned
  • 70+ meticulously hand-selected images
  • Witness cosmic horrors explain statistical metrics
  • Join Fox and Bear for nuggets of insight

Tame the AI chaos

Our story is set in Brightwood Forest, where a Large Language Model Shoggoth has made its home in the canopy. Forest creatures have been using this alien intelligence to answer their questions, tell stories, and even write love letters.

Bear conversing with Shoggoth

But integrating LLMs into complex applications is not easy: sometimes the LLM misinterprets its instructions, struggles to understand the data it reads, or chooses to do the wrong thing entirely.

Sometimes it feels like "check the vibes, cross your fingers, and ship it" is the only option, but it's not. We've outlined a more systematic (and more effective!) approach to evaluations, where you:

  • Start simple and scale up, so you can begin evaluating immediately without getting tangled in vines
  • Use a variety of evaluation techniques, ensuring you always have a way to measure progress
  • Design custom metrics that capture what "good" really means for your specific use case
  • Create a golden dataset that lets you confidently compare different versions of your system (a minimal sketch follows this list)
  • Ultimately, transform vague feelings into actionable data, making it easy to improve your LLM implementation
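
To make that concrete, here is a minimal sketch of the core loop: a tiny golden dataset, a custom metric, and a score you can compare across two versions of a system. The prompt versions, the dataset, and the run_llm stub below are hypothetical placeholders, not the guide's actual code; the zine walks through building the real thing.

    # A minimal eval sketch. Everything here is a made-up placeholder for
    # your own system: swap in your real LLM call, golden examples, and a
    # metric that captures what "good" means for your use case.

    GOLDEN_DATASET = [
        {"question": "Which berries in Brightwood are safe to eat?", "expected": "blueberries"},
        {"question": "Who lives in the canopy?", "expected": "shoggoth"},
    ]

    def run_llm(prompt_version: str, question: str) -> str:
        """Stand-in for your real LLM call; replace with your API client."""
        canned = {
            "v1": {"Which berries in Brightwood are safe to eat?": "Blueberries are safe."},
            "v2": {"Which berries in Brightwood are safe to eat?": "Blueberries are safe.",
                   "Who lives in the canopy?": "A Shoggoth lives in the canopy."},
        }
        return canned.get(prompt_version, {}).get(question, "")

    def mentions_expected(answer: str, expected: str) -> bool:
        """Custom metric: does the answer mention the expected fact?"""
        return expected.lower() in answer.lower()

    def evaluate(prompt_version: str) -> float:
        """Fraction of golden examples this version of the system gets right."""
        passed = [
            mentions_expected(run_llm(prompt_version, ex["question"]), ex["expected"])
            for ex in GOLDEN_DATASET
        ]
        return sum(passed) / len(passed)

    if __name__ == "__main__":
        for version in ("v1", "v2"):
            print(f"{version}: {evaluate(version):.0%} of golden examples passed")

The toy metric and canned answers exist only so the sketch runs on its own; the point is the shape of the loop, which stays the same once real prompts, data, and metrics are plugged in.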

So, grab your solar-powered laptop and find a cozy spot under the bioluminescent mushroom. After reading this enchanted zine, you'll have a toolkit of specific evaluation strategies that you can apply to any LLM-powered system, helping you build with greater confidence and control.

Stop guessing and start measuring

This guide equips developers and product managers with practical strategies to design, implement, and interpret evaluations that drive real improvements in your LLM-powered systems.

Go from unpredictable to unbeatable

This easy-to-read guide will help you convince your team to invest in LLM system evaluations and walk you step by step through implementing your first eval.

What you'll learn

The value of evals

  • What are evals?
  • Why should you design your own evals?
  • Convincing your team and manager
  • Evaluating your system as a whole

Designing your first eval

  • Understanding the overall eval structure
  • Balancing the different goals of an eval
  • Ensuring reproducibility in eval design

Using robust testing

  • Using property-based LLM unit tests
  • Doing vibes-based engineering right
  • Conducting loss analysis

Selecting quality measures

  • Choosing measures of quality
  • Picking a grading scale
  • Using statistical metric functions

Running effective task evals

  • Choosing a human vs. an LLM grader
  • Writing great prompts for LLM-as-a-judge
  • Setting up the develop-label-analyze loop

About the Authors

Sridatta Thatipamala

AI Research Engineer

Sri is a research engineer focused on applying LLMs to domain-specific data. He led the development of question answering and search ranking algorithms for medical records at Google Health. Previously, he started the shopping team at Pinterest and founded a developer tools startup backed by YC and a16z.

Wil Chung

Applied Researcher

Wil uses LLMs to build infrastructure for local-first software. He was the first engineer at Pulley, worked on eCommerce at Pebble, and has consulted and experimented in VR, cryptocurrencies, and 3D printing. Previously, he founded a CRM tool backed by YC and SV Angel.