Leveling up from vibes-based engineering

You've got an LLM in prod and are relying on "vibes" to gauge output quality. Sound familiar? It's a start, but you'll need to instrument your pipeline and get systematic to drive real improvements.

Our brains are powerful pattern-matchers: we can sense that a response works in one context but falls flat in another, even when we can't articulate why. Vibes-based engineering is how you start putting that intuition into words, one iteration at a time.

However, to compare responses across runs, across users, and over time, we have to systematize what makes a response good or bad.

First, instrument your pipeline and collect the output of every stage. For each user invocation, record the query, the prompt, the retrieved docs, the chain-of-thought output, and the final output. You want to be able to review and audit the full sequence of LLM responses should anything go wrong. Dump them in a .txt file or a database, whatever's easier.
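Here's a minimal sketch of what that logging can look like. The function name `log_trace`, the field names, and the `traces.jsonl` path are all illustrative; adapt them to whatever stages your pipeline actually has.

```python
import json
import time
import uuid

def log_trace(query, prompt, retrieved_docs, chain_of_thought, final_output,
              path="traces.jsonl"):
    """Append one JSON record per user invocation to a JSONL file."""
    record = {
        "trace_id": str(uuid.uuid4()),   # lets you find this invocation later
        "timestamp": time.time(),
        "query": query,
        "prompt": prompt,
        "retrieved_docs": retrieved_docs,
        "chain_of_thought": chain_of_thought,
        "final_output": final_output,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

A flat JSONL file like this is easy to grep, easy to load into a notebook, and easy to swap for a database table later.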

Then, curate your example dataset. Come up with examples that cover the main use cases and as many edge cases of your application as you can. Also pull queries from user feedback and bug reports. Typical business metrics, like session length and bounce rate, can point you to where users are struggling.
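As a sketch, each curated example can record where it came from and what a good answer should contain, so you can see how well your main use cases and edge cases are covered. The schema and field names below are hypothetical.

```python
import json

# Hypothetical schema: tag each example with its source so you can track
# coverage of main use cases vs. edge cases pulled from feedback and bugs.
examples = [
    {"query": "How do I reset my password?",
     "source": "main_use_case",
     "expected": "Step-by-step reset flow with a link to account settings"},
    {"query": "Reset password but my email no longer exists",
     "source": "bug_report",
     "expected": "Escalate to support; do not invent a self-serve flow"},
]

with open("eval_examples.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```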

Next, do loss analysis. Look through your traces: did the error happen near the end of the pipeline or somewhere near the beginning? Group similar losses into categories, then tackle the largest category first for the fastest improvement.
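A quick way to prioritize, sketched below: after manually labeling each failed trace with a category and the stage where it went wrong, count the categories and start with the biggest bucket. The labels here are made up for illustration.

```python
from collections import Counter

# Hypothetical hand-labeled failures from the traced output.
labeled_failures = [
    {"trace_id": "a1", "stage": "retrieval", "category": "wrong_doc_retrieved"},
    {"trace_id": "b2", "stage": "generation", "category": "ignored_context"},
    {"trace_id": "c3", "stage": "retrieval", "category": "wrong_doc_retrieved"},
]

# Count failures per category; the top of the list is what to fix first.
by_category = Counter(f["category"] for f in labeled_failures)
for category, count in by_category.most_common():
    print(f"{category}: {count}")
```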

Lastly, iterate! It's surprising how fast you can make improvements with manual reviews, especially at first. It's only when you start tackling the long tail of edge cases and comparing performance over time that you'll want a systematized way of evaluating how your LLM is doing.

💡
Intrigued by evals? We're publishing a zine on LLM evals. Click here to preview and pre-order.