A Newton Customer Re-Signed Up, Got an SSL Error — My AI Found a Stale DNS Record Cloudflare Was Round-Robining Into a Dead Box

I hit this one while testing the re-signup flow on Newton. A customer who had cancelled months ago came back and signed up again. My auto-provision pipeline spun up a fresh VPS, dropped the new DNS record into Cloudflare, sent the "your server is ready" email — and the customer clicked the URL and got slapped with ERR_SSL_PROTOCOL_ERROR.

Brand-new everything. Why was it broken?

Tim, my AI agent, spent about half an hour walking the Cloudflare API and a few DNS resolvers. The short version: there were two A records for the same subdomain. The old one from the cancelled signup was still sitting there, pointing at an IP I'd long since released. Cloudflare lets you have multiple A records on the same name by default (that's how round-robin DNS works), so resolvers were splitting traffic roughly 50/50 between a live box and a dead one.

That single discovery turned into a three-layer rebuild of the whole provision flow. None of it would have happened without an agent that could read its own logs, talk to Cloudflare's API, and look at the problem the way a sysadmin would. Worth writing up.

The Old Flow — Looks Clean, Wasn't

Quick recap of what auto-provision does, so the failure mode makes sense. When a Stripe webhook lands telling me a new customer paid, the system runs:

1. Create a fresh VPS via Hetzner's API and get the new IP back.
2. POST to /zones/{zone}/dns_records on Cloudflare to create an A record like newton-customer-N.com pointing at the new IP.
3. SSH into the box and run the provision script: Tim Chat, Claude Code, the whole stack.
4. Flip the row in the servers table to active.
5. Send the "ready" email to the customer and a Telegram ping to me.

Looks fine on paper. The hole is in step 2. Cloudflare's default behavior is to let you have multiple A records with the same name. A fresh POST doesn't overwrite the old one — it appends. For a brand-new customer with a brand-new subdomain, that's harmless. For a re-signup who gets reassigned to a previously-used subdomain, you end up with the old record and the new record sitting side by side, pointing at different IPs.
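To make the append behaviour concrete, here is a minimal sketch of how the leftover record could be spotted. It assumes records shaped like the result array returned when you list a zone's DNS records (id, type, name, content). The hostnames and IPs are made up, and find_duplicate_a_records is a hypothetical helper, not Newton's actual code.

```python
from collections import defaultdict


def find_duplicate_a_records(records):
    """Return {name: [ips]} for every name that has more than one A record."""
    by_name = defaultdict(list)
    for rec in records:
        if rec.get("type") == "A":
            by_name[rec["name"]].append(rec["content"])
    return {name: ips for name, ips in by_name.items() if len(ips) > 1}


# Made-up sample: customer 7 re-signed up, so the old record is still there.
sample = [
    {"id": "a1", "type": "A", "name": "newton-customer-7.example.com", "content": "203.0.113.10"},
    {"id": "a2", "type": "A", "name": "newton-customer-7.example.com", "content": "198.51.100.20"},
    {"id": "a3", "type": "A", "name": "newton-customer-8.example.com", "content": "203.0.113.11"},
]

print(find_duplicate_a_records(sample))
```

Only the re-used name shows up in the result, which is exactly the case a fresh signup never exercises.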

Why HTTPS Broke Even Though the New Server Was Fine

The new VPS itself was working — I could hit it directly by IP. So why did the browser throw an SSL error?

Because Caddy on the new box needs to get a certificate from Let's Encrypt, and Let's Encrypt only issues one after it connects to the domain and sees the box answer a validation challenge (HTTP-01 is the familiar one). When Let's Encrypt resolved the domain, it saw two IPs, picked one, and about half the time landed on the dead box. Challenge failed. No certificate. No HTTPS.
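A quick way to see what a validator might see is to resolve the name yourself and compare the answer with the IP you just provisioned. The sketch below uses only the standard library. The function names are hypothetical, and a local resolver may cache or reorder answers, so treat this as a hint rather than proof.

```python
import socket


def resolved_ipv4(host, resolver=socket.getaddrinfo):
    """Return the sorted set of IPv4 addresses the resolver hands back for host."""
    infos = resolver(host, 443, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr is (ip, port).
    return sorted({info[4][0] for info in infos})


def dns_points_only_at(host, expected_ip, resolver=socket.getaddrinfo):
    """True only when the name resolves to exactly one address: the expected one."""
    return resolved_ipv4(host, resolver) == [expected_ip]


# Example call (performs a real DNS lookup, so it is left commented out):
# print(dns_points_only_at("newton-customer-7.example.com", "203.0.113.10"))
```

The resolver parameter exists so the check can be exercised with a fake in tests instead of hitting real DNS.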

And here's the worst part: all of this was happening in the background. My webhook handler had no idea anything was wrong. The provision script exited 0. The "ready" email went out. The customer clicked, got an SSL error, and bounced.

This is a classic automation failure mode: script-exited-zero is not the same as outcome-actually-happened. The pipeline thought it was done. The customer's real-world result said otherwise.

Tim Shipped a Three-Layer Fix

Layer 1 — Pre-delete before POST.

Tim rewrote provision_server() so that before creating any new A record, it does a GET against Cloudflare for any existing records with the same name and deletes them. Then POSTs the new one. Fully idempotent — you can re-run provision a hundred times and you'll end up with exactly one record pointing at the current IP.
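The pre-delete step can be sketched roughly like this. It is not Tim's actual code: replace_a_record and the api callable are assumptions standing in for the real helper, and FakeCloudflare is an in-memory stand-in so the example runs without credentials. The paths mirror Cloudflare's list, delete, and create DNS record endpoints under /zones/{zone_id}/dns_records.

```python
def replace_a_record(api, zone_id, name, ip):
    """Delete every existing A record for name, then create exactly one."""
    path = f"/zones/{zone_id}/dns_records"
    existing = api("GET", path, params={"type": "A", "name": name})["result"]
    for rec in existing:
        api("DELETE", f"{path}/{rec['id']}")
    # ttl=1 means "automatic" in Cloudflare's API.
    body = {"type": "A", "name": name, "content": ip, "ttl": 1}
    return api("POST", path, json=body)["result"]


class FakeCloudflare:
    """In-memory stand-in for the API, for illustration only."""

    def __init__(self):
        self.records = []
        self._next_id = 0

    def __call__(self, method, path, params=None, json=None):
        if method == "GET":
            hits = [r for r in self.records
                    if r["type"] == params["type"] and r["name"] == params["name"]]
            return {"success": True, "result": hits}
        if method == "DELETE":
            rec_id = path.rsplit("/", 1)[-1]
            self.records = [r for r in self.records if r["id"] != rec_id]
            return {"success": True, "result": {"id": rec_id}}
        if method == "POST":
            self._next_id += 1
            rec = {"id": f"rec{self._next_id}", **json}
            self.records.append(rec)
            return {"success": True, "result": rec}
        raise ValueError(f"unsupported method: {method}")


HOST = "newton-customer-7.example.com"  # made-up hostname
api = FakeCloudflare()
# Seed the stale record left over from the cancelled signup.
api("POST", "/zones/z1/dns_records",
    json={"type": "A", "name": HOST, "content": "198.51.100.20", "ttl": 1})

for _ in range(3):  # re-running provision should not pile up records
    replace_a_record(api, "z1", HOST, "203.0.113.10")

print(api.records)
```

One trade-off worth knowing: delete-then-create leaves a short window with no record at all. For a box that has not been announced to the customer yet, that is usually acceptable.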

This isn't a bug in Cloudflare. Multi-A-record support exists for a reason — load balancing, blue/green deploys, that kind of thing. It's just that in my single-VPS-per-customer architecture, the default is the wrong one. I had to opt out of it explicitly.

Layer 2 — Verify HTTPS before marking active.

The bigger fix. Tim added a verification step inside auto_provision_server(). After the provision script finishes, it does a requests.get("https://newton-customer-N.com", timeout=...). Only if that returns 200 does the row get flipped to active and the customer email get sent.

If it fails — cert not issued yet, DNS still propagating, anything — the row goes to ssl_pending, I get a Telegram alert, and no email is sent to the customer. I open the dashboard and look at it myself.
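Here is a minimal sketch of that gate. The post's version uses requests.get; this one swaps in the standard library's urllib so it runs without extra packages. finalize_provision and its callback parameters are hypothetical names, and the 10-second timeout is an assumption, not the value Tim chose.

```python
import http.client
import urllib.request


def https_returns_200(url, timeout=10):
    """True only if the URL answers 200 over a certificate-validated connection."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers URLError/HTTPError, TLS failures, refused connections, timeouts.
        return False


def finalize_provision(domain, check, set_status, email_customer, alert_owner):
    """Mark active and email the customer only after the HTTPS check passes."""
    if check(f"https://{domain}"):
        set_status("active")
        email_customer()
        return "active"
    set_status("ssl_pending")
    alert_owner(f"HTTPS check failed for {domain}; customer not notified")
    return "ssl_pending"


# Real use would pass check=https_returns_200 plus the DB, email, and
# Telegram callables from the surrounding application.
```

The ordering is the point: the status flip and the email sit behind the check, so a failed check can only ever produce an internal alert.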

This is the layer that actually mattered. Layers 1 and 3 each remove one specific cause. Layer 2 prevents the worst part of the failure whatever the cause: telling the customer it's ready when it isn't.

Layer 3 — Kill the default-zone fallback for good.

While Tim was auditing the surrounding code, he found a deeper bug. Newton runs both a TH side (incomeinclick.in.th) and an EN side (incomeinclick.com) out of the same codebase, but they live in two different Cloudflare zones.

The old cloudflare_api() helper had a zone_id parameter that defaulted to the COM zone. If any caller — including a TH-side call — forgot to pass zone_id explicitly, the request would silently fire against the wrong zone. And Cloudflare would return 200 OK, because the auth was valid and the API call shape was correct. The record just landed in the wrong tenant's zone. The TH instance would log "DNS created!" and... nothing would actually exist on .in.th.

Tim refactored the helper to require zone_id explicitly. No default. If any caller forgets it, Python raises a TypeError the moment that call runs, so any unit test that exercises the path fails and the bug never reaches deploy. A caller can still pass the wrong zone on purpose, but the "silently wrote to the wrong tenant because nobody passed anything" bug is now structurally impossible.
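In Python, the simplest way to get this behaviour is a keyword-only parameter with no default. The sketch below is a hypothetical shape for the helper, not its real signature. It only builds a description of the request instead of sending one, so it can run anywhere.

```python
API_BASE = "https://api.cloudflare.com/client/v4"


def cloudflare_api(method, path, *, zone_id, **kwargs):
    """Build a request description; zone_id is keyword-only and has no default."""
    return {
        "method": method,
        "url": f"{API_BASE}/zones/{zone_id}{path}",
        **kwargs,
    }


# Forgetting zone_id fails loudly when the call runs, not silently at the API.
try:
    cloudflare_api("GET", "/dns_records")
except TypeError as exc:
    print("rejected:", exc)

# Passing it explicitly works, and the zone is visible at the call site.
print(cloudflare_api("GET", "/dns_records", zone_id="zone-th")["url"])
```

The bare asterisk also stops anyone from passing the zone positionally, so every call site reads zone_id=... and the choice is easy to audit in review.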

This is the same pattern as multi-tenant configs where a fallback "default tenant" hides the moment your code starts writing customer A's data into customer B's bucket. Kill the default. Force the explicit choice.

A Monthly Sanity Audit, Because Edge Cases Always Win

After the fix, Tim wrote a small script I can run anytime. It:

Reads the servers table from both TH and EN databases.
Pulls every A record matching newton-* and jarvis-* from both Cloudflare zones.
Flags any record that has no matching active server in the DB as stale.
Flags any record whose IP doesn't match servers.ip in the DB as mismatch.

I run it once a month. It's the safety net for edge cases I haven't thought of yet.

One subtlety Tim baked in: he doesn't blind-delete records he doesn't recognize. Some of my customers still live on jarvis-* subdomains from before the Newton rebrand. The script reads servers.domain first and only flags records that have no active row anywhere. Otherwise I'd have nuked working customer servers the first time I ran it.
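The classification logic could look something like the sketch below. It assumes DNS records and server rows have already been fetched from both zones and both databases and merged into two lists. The field names domain, ip, and the active status come from the description above; audit_dns, the cancelled status, and the sample data are made up for illustration. It only reports and never deletes.

```python
PREFIXES = ("newton-", "jarvis-")  # jarvis-* are legacy subdomains still in use


def audit_dns(records, servers):
    """Return (kind, name, ip) findings; kind is 'stale' or 'mismatch'."""
    active = {s["domain"]: s["ip"] for s in servers if s["status"] == "active"}
    findings = []
    for rec in records:
        name, ip = rec["name"], rec["content"]
        if not name.startswith(PREFIXES):
            continue  # not a customer subdomain, leave it alone
        if name not in active:
            findings.append(("stale", name, ip))
        elif active[name] != ip:
            findings.append(("mismatch", name, ip))
    return findings


records = [
    {"name": "newton-a.example.com", "content": "203.0.113.10"},
    {"name": "newton-a.example.com", "content": "198.51.100.20"},  # leftover
    {"name": "jarvis-b.example.com", "content": "203.0.113.11"},   # legacy, live
    {"name": "newton-c.example.com", "content": "203.0.113.12"},   # cancelled
    {"name": "www.example.com", "content": "203.0.113.99"},        # unrelated
]
servers = [
    {"domain": "newton-a.example.com", "ip": "203.0.113.10", "status": "active"},
    {"domain": "jarvis-b.example.com", "ip": "203.0.113.11", "status": "active"},
    {"domain": "newton-c.example.com", "ip": "203.0.113.12", "status": "cancelled"},
]

for finding in audit_dns(records, servers):
    print(finding)
```

Note that the duplicate-record case from this incident shows up as a mismatch: the name has an active server, but one of its records points somewhere else.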

The Real Lesson — "Verify Before Notify"

The thing I keep coming back to is that this is the same shape as the Documentor pipeline that silently failed for three mornings. Different subsystem entirely. Same failure mode.

The pattern: "Did the job finish? Looks like it. Send the notification." Meanwhile the real outcome — the customer's URL actually loading, the queue actually populating — was broken, and nobody verified. So nobody knew.

Tim now has a new internal rule: anywhere a side effect goes out to a customer — provisioning, payment grace, trial expiry, password reset — there must be a verification step between "looks done" and "tell the customer." Fire-and-forget is banned. He wrote it into his own memory so the next time I ask for a feature with a customer-facing email, he'll proactively ask me where the verify step lives.

(For another flavor of "silent failure with no error log," see the time Tim's own memory writes were silently blocked for three days — same family of bug, completely different layer of the stack.)

This Isn't a DevOps Story — It's an Automation Story

If you're reading this and thinking "okay, but I don't run a SaaS with provisioned VPSes, so this doesn't apply" — let me push back.

Every automation pipeline that ends with a customer-facing notification has this same failure mode latent inside it. Form submissions that confirm before the DB write committed. Order confirmations sent before the payment captured. Welcome emails sent before the user record actually exists. Whatever the surface, the underlying pattern is the same: exit 0 is not proof the outcome happened.

Catching this kind of bug requires an AI agent that can actually look at the running system — read the logs, hit the API, walk through DNS like Tim did. Not a chatbot that tells you what to do in the abstract. An agent with SSH access, API tokens, and a memory of how your specific system is supposed to work.

That's the whole point of Newton. You get your own AI agent on your own VPS — with shell access, with your credentials, with a brain that remembers the last thing it fixed. When something silently breaks, the agent doesn't need to be told. It can find it, fix it, and write the lesson into its own memory so the next system you build inherits the rule. Try Newton free for 7 days at newton.incomeinclick.com — your server is set up in about ten minutes, no code from you.

— Pond