13 Aug 2024 • 2 min read

Doing vibes-based engineering right

If you're struggling to whip your LLM output into shape, start with "vibes-based engineering": judging outputs by look and feel. It doesn't scale, but it's the foundation for systematizing evals later.

What is vibes-based engineering? Just start by looking at the generated results. Do they look alright? It's OK to start here and do things that don't scale. "Good" can be hard to articulate, and by looking at outputs one by one, you can start to define what constitutes a good output.

Be willing to iterate. Sci-fi promised AI as a supreme intellect, but by default the reality is closer to a writer of bad high school essays. You must be willing to keep changing the prompt to see what coaxes out more focused and inspired responses. It's helpful to think of an LLM as a throng of humanity; it's up to you to steer it toward the geniuses among us with your prompt.

Use an exacting communication style. LLMs are like fast-working interns who need a lot of guidance. By judging each output manually at first, you'll become able to articulate why an output is good, and you can spell that out in the prompt. If you're still struggling to put it into words, just find examples to show the LLM.
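As a rough illustration of "find examples to show the LLM", here is a minimal sketch of a few-shot prompt builder. The function name `build_few_shot_prompt`, the `Input:`/`Output:` layout, and the sample data are assumptions made for this example, not a prescribed format; adapt them to whatever your model responds to best.

```python
def build_few_shot_prompt(instruction, examples, new_input):
    """Assemble a plain-text prompt: instruction, worked examples, then the new input.

    `examples` is a list of dicts with "input" and "output" keys
    (a hypothetical layout chosen for this sketch).
    """
    parts = [instruction.strip(), ""]
    for ex in examples:
        parts.append(f"Input: {ex['input']}")
        parts.append(f"Output: {ex['output']}")
        parts.append("")  # blank line between examples
    # End with an open "Output:" so the model completes the pattern.
    parts.append(f"Input: {new_input}")
    parts.append("Output:")
    return "\n".join(parts)


# Hand-picked outputs you judged to be good become the examples.
good_examples = [
    {"input": "Quarterly revenue rose 4%.", "output": "Revenue up 4% this quarter."},
    {"input": "The server crashed twice on Monday.", "output": "Two server crashes Monday."},
]
prompt = build_few_shot_prompt(
    "Rewrite each sentence as a terse headline of at most six words.",
    good_examples,
    "The team shipped the new login page on Friday.",
)
```

The examples you pick here are the same outputs you approved by eye, so the prompt carries your sense of "good" even when you can't fully describe it.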

Build a golden dataset. The manual work you're doing is already building a database of samples. Shoot for 30 to 50 samples to help steer future automated evals. Examples should cover the span of typical outputs, as well as some edge cases.
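One possible way to keep those samples organized is a small record type saved as JSON Lines. This is only a sketch: the `GoldenSample` fields, the tag names, and the helper functions are hypothetical choices for illustration, and a spreadsheet works just as well at this size.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass, field


@dataclass
class GoldenSample:
    prompt: str          # what was sent to the LLM
    output: str          # what came back
    is_good: bool        # your manual verdict
    note: str = ""       # why it is good or bad, in your own words
    tags: list = field(default_factory=list)  # e.g. "typical", "edge_case"


def to_jsonl(samples):
    """Serialize samples to JSON Lines text, one sample per line."""
    return "\n".join(json.dumps(asdict(s)) for s in samples)


def from_jsonl(text):
    """Parse JSON Lines text back into GoldenSample records."""
    return [GoldenSample(**json.loads(line)) for line in text.splitlines() if line.strip()]


def tag_counts(samples):
    """Count samples per tag to check coverage of typical and edge cases."""
    return Counter(tag for s in samples for tag in s.tags)


samples = [
    GoldenSample("Summarize: ...", "A clear two-line summary.", True, "concise, accurate", ["typical"]),
    GoldenSample("Summarize: (empty input)", "I cannot summarize nothing.", True, "handles empty input", ["edge_case"]),
    GoldenSample("Summarize: ...", "As an essay, one might say...", False, "rambling", ["typical"]),
]
restored = from_jsonl(to_jsonl(samples))
coverage = tag_counts(restored)
```

Writing the `note` field for each sample is where the "why is this good?" intuition gets captured, and the tag counts give a quick check that edge cases are represented.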

Lastly, slowly systematize and automate the eval process. Use golden datasets, unit tests, and statistical metrics to see how well your eval process is doing. Unit tests catch easy things you don't need an LLM for, such as output length or JSON formatting. Statistical metrics such as precision, recall, and F1 score help triangulate how closely the automated eval's judgments match those in your golden dataset.
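To make the two cheaper layers concrete, here is a minimal sketch using only the standard library. The function names, the 500-character limit, and the sample labels are assumptions for illustration. The first function is the kind of unit-test check that needs no LLM; the second compares an automated eval's good/bad verdicts against your manual golden labels.

```python
import json


def check_output(text, max_chars=500):
    """Cheap structural checks. Returns a list of failures; empty means pass."""
    failures = []
    if len(text) > max_chars:
        failures.append("too_long")
    try:
        json.loads(text)
    except json.JSONDecodeError:
        failures.append("invalid_json")
    return failures


def precision_recall_f1(golden, predicted):
    """Score predicted good/bad verdicts against golden labels (True = good)."""
    tp = sum(1 for g, p in zip(golden, predicted) if g and p)
    fp = sum(1 for g, p in zip(golden, predicted) if not g and p)
    fn = sum(1 for g, p in zip(golden, predicted) if g and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Manual verdicts from the golden dataset vs. an automated eval's verdicts.
golden_labels = [True, True, False, False, True]
eval_labels = [True, False, True, False, True]
precision, recall, f1 = precision_recall_f1(golden_labels, eval_labels)
```

Low precision here would mean the automated eval approves outputs you rejected; low recall would mean it rejects outputs you approved. Either one points at where the eval still disagrees with your judgment.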

While vibes-based engineering doesn't scale, it's a good way to build intuition for your domain and for the kinds of output you get from different kinds of prompts. Before you can teach quality to another human or to an LLM, you first have to know what good looks like yourself. Vibes are a way to do that, and being systematic and organized about it is a good way to slowly automate it.
