I Let an AI Speak for Me. Then I Tested It Like a Product.
In my last post I wrote a sentence that kept bothering me afterwards: AI can write your story, but it can't know which version of it is true to you.
Here's why it bothered me. My website has a chat. Visitors ask it questions about my career, my projects, how I think. It answers in my name, around the clock, without me in the room. Every answer it gives is, as far as the visitor is concerned, me speaking.
And for months, I had no idea whether those answers were any good.
I had read them, of course. A handful. They sounded right. But "sounds right" is exactly the trap I described last time. AI is excellent at producing text that sounds right. So I did what I would tell any organization to do with a system that speaks for them: I stopped sampling vibes and started measuring.
I expected to find problems with the AI. I did not expect it to catch one in me.
Step One: Read Before You Measure
The temptation is to start with a metric. Resist it. I started the way Hamel Husain teaches: ask real questions, read every answer, label them by hand, and only then decide what "bad" even means.
So I sat down with questions a curious visitor would ask. What does Adrian actually do? Does he have experience with GenAI in organizations? What does he believe about AI adoption? Then I read the answers slowly, the way a stranger would. Not skimming for facts I knew were in there. Reading for what each answer communicates.
It was humbling. The failures fell into patterns I would never have designed checks for upfront:
It answered German questions in English. Reliably. A visitor asks "Was macht Adrian eigentlich beruflich?" ("What does Adrian actually do for a living?") and gets back a polished English paragraph. The cause turned out to be plumbing, not intelligence: the site's language setting sat at the end of the prompt, where it overrode the visitor's language. One line, wrong place. Evals don't just catch quality problems. They catch integration bugs that no amount of prompt-polishing would fix.
It dumped lists when the question asked for a story. "Does he have experience with X?" came back as bullet points of activities. Technically accurate, completely lifeless. Nobody asks that question wanting an inventory. They want to know what happened, what was hard, what changed.
It was generic. Some answers could have described any AI lead at any German company, word for word. I started calling this the name test: if you can swap in someone else's name and the answer still works, the answer says nothing.
It got the perspective wrong. Questions about "he" sometimes got answers in "I". Small thing, deeply weird to read. An AI that can't keep track of who is speaking does not inspire confidence in anything else it says.
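The language bug in the first pattern comes down to ordering, so it is easy to sketch. The snippet below is a minimal illustration, assuming a prompt assembled from plain string parts. The function name, its parameters, and the instruction wording are all hypothetical, not the site's actual code, and putting the visitor-language instruction last is one plausible fix rather than the one confirmed here.

```python
def build_prompt(system_rules: str, site_language: str,
                 visitor_language: str, question: str) -> str:
    """Assemble a prompt in which the visitor's language has the final say.

    Hypothetical sketch: the site language is only a default, and the
    visitor-language instruction is placed last so nothing after it
    can contradict it.
    """
    parts = [
        system_rules,
        f"Site default language: {site_language}.",
        f"Visitor question: {question}",
        # Final instruction: the visitor's language wins over the site default.
        f"Answer in {visitor_language}, the language of the question.",
    ]
    return "\n".join(parts)


prompt = build_prompt(
    system_rules="You answer questions about Adrian's career.",
    site_language="English",
    visitor_language="German",  # assumed to be detected upstream
    question="Was macht Adrian eigentlich beruflich?",
)
print(prompt.splitlines()[-1])
```

A language-match check in the eval then only has to compare the language of the question with the language of the answer, which is what caught the bug in the first place.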
Four Failure Modes, Each With an Off Switch
Those patterns became a taxonomy: language match, story structure, specificity, voice. Four failure modes, each with a strict definition.
The detail that took longest to get right was not the definitions. It was the scope filters. Every rule needs an explicit answer to the question: when does this rule NOT apply? "Tell me about his experience" deserves a story. "How many people did he train?" deserves a number, and punishing the number for not being a story would just teach the system to pad. Without scope filters, an automated judge generates false alarms until you stop trusting it, and then you are back to vibes.
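One way to make a scope filter concrete is to give every rule two parts: a filter on the question and a check on the answer. The sketch below is a minimal illustration under that assumption. The `Rule` class, the keyword-based filter, and the bullet-counting check are hypothetical stand-ins (in the setup described here the check is done by an LLM judge), and the sample answers are invented placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    name: str
    applies_to: Callable[[str], bool]  # scope filter, looks at the question
    passes: Callable[[str], bool]      # yes/no check, looks at the answer


def evaluate(rule: Rule, question: str, answer: str) -> Optional[bool]:
    """Return True or False for in-scope questions, None when the rule is off."""
    if not rule.applies_to(question):
        return None
    return rule.passes(answer)


def wants_story(question: str) -> bool:
    # Hypothetical scope filter: questions asking for a single fact are out of scope.
    return not question.lower().startswith(("how many", "how much", "when "))


def is_not_bullet_dump(answer: str) -> bool:
    # Hypothetical stand-in for the judge: fail if more than half the lines are bullets.
    lines = [ln.strip() for ln in answer.splitlines() if ln.strip()]
    bullets = [ln for ln in lines if ln.startswith(("-", "*"))]
    return len(bullets) <= len(lines) / 2


story_rule = Rule("story_structure", wants_story, is_not_bullet_dump)

# Invented placeholder answers, for illustration only.
in_scope = evaluate(story_rule, "Tell me about his experience",
                    "- ran workshops\n- built tools")
out_of_scope = evaluate(story_rule, "How many people did he train?", "<a number>")
print(in_scope, out_of_scope)
```

Returning `None` instead of a pass keeps "rule did not apply" separate from "rule applied and passed", so out-of-scope questions neither inflate the pass rate nor raise false alarms.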
The fuzziest category, "generic", I broke into three yes/no checks: the name test, a check for differentiating details (numbers alone don't count, named roles alone don't count, specific decisions and concrete situations do), and buzzword density. Vague criteria produce vague judges. Checks you can answer with yes or no produce judges you can argue with.
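The three checks can be written down as a small record, which makes each verdict easy to argue with. The sketch below is a minimal illustration with several assumptions: the field names, the buzzword list, and the density threshold are hypothetical, the first two checks are assumed to come from a judge (human or LLM), and requiring all three to pass is one possible way to combine them, not a rule stated above.

```python
import re
from dataclasses import dataclass

# Hypothetical buzzword list, for illustration only.
BUZZWORDS = {"synergy", "leverage", "innovative", "holistic", "cutting-edge"}


def buzzword_density(answer: str) -> float:
    """Share of words in the answer that appear in the buzzword list."""
    words = re.findall(r"[a-z][a-z-]*", answer.lower())
    if not words:
        return 0.0
    return sum(w in BUZZWORDS for w in words) / len(words)


@dataclass
class SpecificityVerdict:
    survives_name_swap: bool          # True means the answer fits anyone (bad)
    has_differentiating_detail: bool  # a specific decision or concrete situation
    buzzword_density: float

    def passed(self, max_density: float = 0.05) -> bool:
        # Assumption: all three yes/no checks must come out in the answer's favor.
        return (not self.survives_name_swap
                and self.has_differentiating_detail
                and self.buzzword_density <= max_density)


verdict = SpecificityVerdict(
    survives_name_swap=False,
    has_differentiating_detail=True,
    buzzword_density=buzzword_density("He paused the rollout after the pilot."),
)
print(verdict.passed())
```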
The Judge Has to Earn It
Then I automated, with an LLM as the judge. But a judge you haven't tested is just vibes with extra steps.
So the judge had to earn its job. I hand-labeled a golden set of question-and-answer pairs: clean passes, deliberate failures, edge cases, with my own verdict on every failure mode for each. The judge's first task was not to evaluate new answers. It was to reproduce my labels. First version: four out of five. We argued about the fifth, the judge and I, and the disagreement was useful. It exposed a sloppy definition in my own rubric.
Only after the judge agreed with me on answers I had already labeled did it get to score answers I hadn't.
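Checking a judge against hand labels is mostly bookkeeping. The sketch below is a minimal illustration, assuming labels are stored as one pass/fail value per failure mode per golden-set entry. The data layout, the identifiers, and the toy labels are invented for illustration, and counting an entry as agreed only when all four modes match is an assumption about how a score like "four out of five" would be tallied.

```python
from typing import Dict, List, Tuple

FAILURE_MODES = ("language_match", "story_structure", "specificity", "voice")

Labels = Dict[str, Dict[str, bool]]  # entry id -> failure mode -> passed?


def alignment(human: Labels, judge: Labels) -> Tuple[int, int, List[Tuple[str, str]]]:
    """Compare judge verdicts with hand labels.

    Returns (entries fully agreed, total entries, list of disagreements).
    """
    disagreements = []
    for entry_id, labels in human.items():
        for mode in FAILURE_MODES:
            if judge[entry_id][mode] != labels[mode]:
                disagreements.append((entry_id, mode))
    disagreeing_entries = {entry_id for entry_id, _ in disagreements}
    return len(human) - len(disagreeing_entries), len(human), disagreements


# Toy data, invented for illustration: five entries, one disagreement.
human = {f"q{i}": dict.fromkeys(FAILURE_MODES, True) for i in range(1, 6)}
judge = {entry_id: dict(labels) for entry_id, labels in human.items()}
judge["q5"]["specificity"] = False

agreed, total, diffs = alignment(human, judge)
print(f"{agreed}/{total} entries agreed, disagreements: {diffs}")
```

The list of disagreements matters more than the score: each one is either a judge error or a rubric definition that needs tightening.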
The Fix Wasn't Where I Expected
With the pipeline running, a pattern emerged that I would not have guessed. Narrow factual questions passed. Broad questions failed. Same model, same prompt, same knowledge base.
The root cause was the knowledge base itself. I had written it like a database: dense, factual, complete. Every fact was in there. What was missing was the causal glue. Why one thing led to another, what was at stake, what would have happened otherwise. A model composing an answer from fact-shards produces exactly what you fed it: shards.
So I rewrote the knowledge base story by story. Each one got the same skeleton: situation, tension, outcome, and one detail specific enough that the name test fails in the good direction. The eval scores told me which stories to fix first and whether each rewrite actually moved anything.
And then the part I didn't see coming: the evals caught me. While labeling answers, I reread one of the source stories and realized I had written it the way I wished it had gone. Cleaner than reality, a conflict resolved a little too neatly. The AI had faithfully repeated my own inflation back to me. I fixed the source. Measuring your AI, it turns out, occasionally measures you.
The Last Two Failures Were Not About Knowledge
After the rewrite, most questions passed consistently. Two kept flickering. The frustrating kind of flickering: pass, fail, pass, with no change in between.
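Flickering only becomes visible when the same question is scored over several runs. The sketch below is a minimal illustration, assuming each question has one overall pass/fail verdict per run. The function names, question ids, and run data are hypothetical.

```python
from typing import Dict, List


def classify(verdicts: List[bool]) -> str:
    """Label a question by how its verdict behaves across repeated runs."""
    if not verdicts:
        raise ValueError("need at least one run")
    if all(verdicts):
        return "stable_pass"
    if not any(verdicts):
        return "stable_fail"
    return "flaky"


def flaky_questions(runs: Dict[str, List[bool]]) -> List[str]:
    """Return the ids of questions whose verdict changes between runs."""
    return sorted(q for q, verdicts in runs.items() if classify(verdicts) == "flaky")


# Toy data, invented for illustration: verdicts from three consecutive runs.
runs = {
    "q1": [True, True, True],
    "q2": [True, False, True],
    "q3": [False, False, False],
}
print(flaky_questions(runs))
```

Separating flaky from stable failures is useful because they tend to need different fixes: a stable failure usually points at missing content, while a flaky one points at something the model only does some of the time.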
The judge's reasoning showed why. The facts were all there. The composition was wrong. For certain broad question types, the model fell back to enumerating instead of telling, even though it had stories to tell. The fix wasn't in the knowledge base at all. It was a gap in the prompt: my "answer as a story, not as a resume" rule listed trigger phrases, and these two question types didn't match any of them. The rule existed. It just never fired.
That distinction matters beyond my little website. What the system knows and how the system composes are two different layers, and they fail differently. Eval reasoning tells you which layer to fix. Without it, you rewrite content when you should be fixing prompts, and vice versa, and everything takes four times as long.
The Numbers
Same transparency as last time:
- Golden set: 8 hand-labeled entries. Started with 5. Small, but every entry is one I personally argued with.
- Failure modes: 4, each with explicit scope filters and yes/no checks.
- Judge alignment: 4/5 on first version, then fixed the rubric, not the judge.
- Before: 5 of 8 questions passing, the weakest failing every single run.
- After: 8 of 8 passing on three consecutive runs, 24 of 24 verdicts.
- Cost per full eval run: cents. The expensive part was the hand-labeling, a few evenings. The judge runs for less than a coffee.
Why Bother, For a Personal Website
Because of the principle, not the website.
I believe AI should work autonomously but never be a black box. It's one of three beliefs on this site. A chat that answers in my name, untested, is a black box with my face on it. I'm responsible for every sentence it produces, whether I read it or not. Testing it isn't perfectionism. It's the price of letting it speak for me.
And honestly, the bar out there is on the floor. Most chatbots ship after someone asks three questions in a demo and nods. The methodology that fixed my website chat is the same one I practice on production systems at work, and it isn't hard. Read real answers. Label by hand. Define failure precisely, including when the rule doesn't apply. Make the judge earn your trust before you trust it. Fix the layer the evidence points to.
The goal was never a perfect chatbot. It's knowing exactly how my system fails before a stranger finds out for me. That, in the end, is the difference between autonomous and out of control.