My content machine generates scripts, captions, and videos for 23 Facebook pages every single day. That's a lot of AI-generated output. For a while, I had no idea if any of it was actually good.

The Problem With Blind Automation

When you're running a system at this scale, prompt quality matters enormously. One weak prompt degrades every piece of content that page produces, every single day, until you notice and fix it.

The fix part is easy. The notice part is the problem.

Every time I edited a prompt — tightened the language, changed the structure, tried a new format — I was basically flying blind. Did the output get better? Did it get worse? Did I accidentally break something that was working fine? I had no way to know without manually reviewing dozens of posts. Which I never had time to do properly.

So I'd tweak a prompt, it would feel better to me, and I'd move on. That's not QA. That's guessing.

Building the Eval System

The eval system I built works in three layers.

Layer 1: Test Mode Runs

Every Loom workflow can be triggered in test mode — it goes through the full pipeline but doesn't actually post anything. I can run a workflow 5–10 times and collect all the outputs without touching a live page.

This is the foundation. Without test mode, you'd need to post real content to evaluate it, which is obviously backwards.
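In code, a test batch looks something like this. To be clear, `run_workflow` and its `test_mode` flag are stand-ins for the real Loom interface, not the actual API — this is just the shape of the idea:

```python
# Sketch of collecting outputs from repeated test-mode runs.
# run_workflow is a hypothetical stand-in for triggering a Loom workflow.

def run_workflow(page_id: str, test_mode: bool = True) -> dict:
    """Runs the full pipeline; in test mode, nothing gets posted."""
    # Placeholder output — the real run returns generated content.
    return {"script": "...", "caption": "..."}

def collect_test_outputs(page_id: str, runs: int = 10) -> list[dict]:
    """Run the workflow several times and keep every output for grading."""
    return [run_workflow(page_id, test_mode=True) for _ in range(runs)]

outputs = collect_test_outputs("example-page", runs=10)
```

Every output in that batch then goes to the graders.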

Layer 2: Code Graders

Once I have outputs, automated graders check each one. These are simple code checks — not AI calls. Things like:

  • Language detection — is the output actually in the right language? Thai pages should produce Thai. English pages should produce English. Sounds basic, but language bleed happens more than you'd expect.
  • Length check — is the script within the target word count range? Too short means the video will be awkward. Too long means it'll get cut off.
  • Pattern matching — does the caption include required elements? Hashtags present? No forbidden phrases? The right call-to-action format?

Each check returns pass or fail. The system aggregates them into a pass rate across all runs. If I run 10 tests and 7 pass, that's a 70% pass rate. I want to see 90%+.
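Those three graders, plus the aggregation, fit in a few lines of Python. The word-count window, the CTA rule, and the banned-phrase list below are illustrative placeholders, not my production rules — and each page gets its own variants (an English page would swap in an English-language check instead of the Thai one):

```python
import re

def check_language_thai(text: str) -> bool:
    """Crude language check: most letters fall in the Thai block (U+0E00-U+0E7F)."""
    letters = [c for c in text if c.isalpha()]
    return bool(letters) and sum("\u0e00" <= c <= "\u0e7f" for c in letters) / len(letters) > 0.5

def check_length(script: str, lo: int = 80, hi: int = 150) -> bool:
    """Script word count must fit the target window (example bounds)."""
    return lo <= len(script.split()) <= hi

def check_caption(caption: str) -> bool:
    """Hashtag present, call-to-action present, no forbidden phrases."""
    forbidden = ("as an ai", "in conclusion")   # example banned phrases
    return (bool(re.search(r"#\w+", caption))
            and "follow" in caption.lower()     # example CTA rule
            and not any(p in caption.lower() for p in forbidden))

def pass_rate(outputs: list[dict]) -> float:
    """Fraction of outputs that pass every check."""
    checks = [lambda o: check_language_thai(o["script"]),
              lambda o: check_length(o["script"]),
              lambda o: check_caption(o["caption"])]
    passed = sum(all(c(o) for c in checks) for o in outputs)
    return passed / len(outputs)
```

Simple boolean checks, counted up. No AI in the loop at this layer, which is exactly why it's fast and cheap to run on every batch.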

Layer 3: LLM Judgment

Code graders catch structural problems. They don't catch "this script sounds weird" or "this caption is technically correct but the tone is totally wrong."

For that deeper quality judgment, Tim — my AI agent — reads the outputs directly and grades them in chat. No external API needed. Tim is the LLM. I just ask him to review the outputs and tell me what he thinks.

This is actually more useful than automated LLM-as-judge approaches I've seen, because Tim has context. He knows the page's niche, target audience, past content, and what good output looks like. A generic judge doesn't.

The Before/After Comparison Workflow

When I want to change a prompt, the process now looks like this:

  1. Run evals on the current prompt. Save the results.
  2. Make the change.
  3. Run evals again. Save the results.
  4. Compare. Did the pass rate go up or down? Where specifically did it change?

The eval runner has a built-in compare mode that shows the diff between two result files. I can see exactly which checks changed and in which direction.
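The core of that compare mode is trivial. Here's a sketch, assuming each result file is a JSON map of check name to pass rate — the actual file format in my runner differs, but the diff logic is the same idea:

```python
import json

def load_results(path: str) -> dict[str, float]:
    """Load one eval result file, e.g. {"caption_length": 0.6, "hashtags": 0.9}."""
    with open(path) as f:
        return json.load(f)

def diff_results(before: dict[str, float], after: dict[str, float]) -> dict[str, float]:
    """Per-check change in pass rate. Positive = improved, negative = regressed."""
    return {check: after.get(check, 0.0) - before.get(check, 0.0)
            for check in sorted(set(before) | set(after))}

# Example: the caption-length numbers from the paragraph below.
before = {"caption_length": 0.60, "hashtags": 0.90}
after = {"caption_length": 0.90, "hashtags": 0.75}
print(diff_results(before, after))
```

One dict comprehension, and every regression is visible by name instead of buried in a pile of raw outputs.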

This sounds obvious, but before building this system, I was making changes based on vibes. Now I have numbers. "The new prompt increased caption length compliance from 60% to 90% but dropped hashtag inclusion by 15%." That's actionable. I can tune from there.

When to Run Evals

I run them in three situations:

  • After editing any Loom workflow prompt. Always. No exceptions. Even "small" changes can cascade in unexpected ways at 23-page scale.
  • After creating a new page. I eval the caption and script quality before the page ever goes live. No point running ads to a page with broken content.
  • When something feels off. If engagement drops suddenly on a page, I'll run evals to check whether the content quality degraded.

What This Changed

Before the eval system, prompt editing was risky. I'd often leave prompts alone even when I suspected they could be better — because "if it ain't broke, don't fix it" felt safer than the uncertainty of making a change.

Now I can experiment freely. If a change doesn't improve the pass rate, I roll it back. If it does, I keep it. The feedback loop is tight and the evidence is concrete.

It also changed how I think about the content machine overall. I used to think of quality as something I checked occasionally. Now I think of it as something I measure continuously.

That's a small mindset shift that makes the whole system more reliable — and lets me actually trust what Tim is producing while I'm not watching.

If you're going to let AI produce content or code at scale, you need a way to verify the output. That's something I baked into Jarvis from the beginning — your agent can build and run its own eval pipelines, grading its work and improving over time. Trust, but verify.

— Pond