Gemini 3 Tripled My AI Bill Overnight — My AI Agent Benchmarked the 4 Thinking Levels Itself and Cut It Back, 70% Cheaper, 2× Faster

A couple of days ago I switched Loom — my Facebook automation engine — from Gemini 2.5 Flash to the freshly released Gemini 3 Flash. I assumed it would cost about the same, maybe write slightly better captions. Then I opened the Google AI Studio bill the next morning. Token usage was up almost 3× across the board, and the outputs were still just short Facebook captions.

So I sent one sentence to Tim — my AI agent — and asked him to figure out why. By the end of the same evening he'd run a benchmark, picked a new default, shipped a fix, accepted a code-review note from me, reverted, and shipped a better version. Bill back to normal, captions twice as fast, no quality loss.

The Setup: Why Loom Calls a Language Model Hundreds of Times a Day

Quick context — Loom is the automation system I run my 23 Facebook pages on, across 6 languages. Every page has workflows that generate captions, video scripts, and image prompts using an LLM. Add it all up and we're talking many hundreds of model calls a day across the network.

The old setup used Gemini 2.5 Flash. Cheap, fast, perfectly fine quality for caption work. When Google announced Gemini 3 Flash I flipped a config flag, expected a quiet upgrade, and went to bed.

The next day, the bill showed otherwise. Gemini 3 Flash, on the exact same prompts, was burning roughly 3× the tokens.

The Hidden Knob: Gemini 3's "thinking level"

Tim went digging and found the culprit. Gemini 3 ships with a new config field called thinkingLevel. It has four settings — minimal, low, medium, high — and the default for every call is high.

Thinking level is exactly what it sounds like. Higher levels let the model "think to itself" before answering — generating internal reasoning tokens that don't appear in the output but absolutely show up on the bill, and absolutely add latency. For a hard reasoning task, it's worth it. For a 200-character Facebook caption, it's like driving a tank to the 7-Eleven.

The reason Google ships high as the default is obvious once you think about it: benchmarks. A new model launches and the world judges it by how clever it sounds. The default that wins the benchmark headline is not the same default that wins your cost report.

"Benchmark First, Then Tune"

I told Tim a single thing: before changing anything, give me numbers. Otherwise I'd have no way to know if dropping the thinking level was also dropping caption quality.

So Tim wrote a small script that ran the real production Thai caption prompt against all four thinking levels, capturing three metrics each:

Thought tokens (the internal reasoning, which you pay for but never see)
Latency in seconds
Total tokens billed
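
A rough sketch of what a script like that could look like is below. This is not Tim's actual code: `run_benchmark`, `fake_generate`, and the result field names are hypothetical, and the model call is replaced by a stub with placeholder numbers so the sketch runs without an API key. In real use, the callable would wrap the Gemini request and read the token counts from the response's usage metadata.

```python
import time
from typing import Callable, Dict, List, Sequence

LEVELS = ("minimal", "low", "medium", "high")


def run_benchmark(
    generate: Callable[[str, str], Dict[str, int]],
    prompt: str,
    levels: Sequence[str] = LEVELS,
) -> List[dict]:
    """Call generate(prompt, level) once per level and record the three metrics."""
    rows = []
    for level in levels:
        start = time.perf_counter()
        usage = generate(prompt, level)  # one model call at this thinking level
        latency = time.perf_counter() - start
        rows.append({
            "level": level,
            "thought_tokens": usage["thought_tokens"],
            "latency_s": round(latency, 2),
            "total_tokens": usage["total_tokens"],
        })
    return rows


def fake_generate(prompt: str, level: str) -> Dict[str, int]:
    """Stub standing in for the real model call. Placeholder numbers, not measurements."""
    thought = {"minimal": 0, "low": 100, "medium": 200, "high": 300}[level]
    return {"thought_tokens": thought, "total_tokens": 250 + thought}


if __name__ == "__main__":
    for row in run_benchmark(fake_generate, "Write a short caption"):
        print(row)
```

Keeping the model call behind a plain callable means the same harness can be pointed at any prompt or provider later, and a single run per level is only a spot check; averaging several runs would smooth out latency noise.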

Here's what came back:

minimal — 0 thought tokens, 2.9s, 294 total tokens
low — 380 thought tokens, 4.5s, 639 total tokens
medium — 501 thought tokens, 5.6s, 772 total tokens
high — 654 thought tokens, 5.8s, 914 total tokens (= default)

That's roughly 3× the tokens (914 vs. 294) and 2× the latency (5.8s vs. 2.9s) going from minimal to high, on identical input. And these aren't theoretical numbers: this is the actual prompt Loom fires every time it writes a caption for one of my pages.
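
For anyone who wants to check the arithmetic, here is a small sketch that recomputes the ratios from the four measurements listed above. The `compare` helper is hypothetical. Token counts are only a proxy for cost, so the exact savings on the bill depend on how input and output tokens are priced.

```python
# Measurements copied from the results above:
# level -> (thought tokens, latency in seconds, total tokens)
results = {
    "minimal": (0, 2.9, 294),
    "low": (380, 4.5, 639),
    "medium": (501, 5.6, 772),
    "high": (654, 5.8, 914),
}


def compare(baseline: str, candidate: str) -> dict:
    """Compare a candidate level against a baseline level."""
    _, base_latency, base_total = results[baseline]
    _, cand_latency, cand_total = results[candidate]
    return {
        "token_ratio": base_total / cand_total,
        "latency_ratio": base_latency / cand_latency,
        "token_savings_pct": 100 * (1 - cand_total / base_total),
    }


summary = compare("high", "minimal")
print(
    f"{summary['token_ratio']:.1f}x tokens, "
    f"{summary['latency_ratio']:.1f}x latency, "
    f"{summary['token_savings_pct']:.1f}% fewer tokens"
)
# prints: 3.1x tokens, 2.0x latency, 67.8% fewer tokens
```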

Then the real question: did the caption quality change?

Tim pasted the four outputs side by side and asked me to grade them. I could not tell them apart. Tone, hook, emoji usage, the little Thai conversational rhythm I'm picky about — all four landed in the same zone. Writing an engaging caption is a "pick the right words" job, not a "reason through five steps" job. The model is plenty smart enough to do it without burning thought tokens.

So we agreed: switch the default to minimal.

Where the Code Review Got Better Than the First Pass

Tim's first commit hardcoded thinkingLevel=minimal on every Gemini 3 call in Loom. Clean, simple, immediate cost cut.

When I read the diff, something nagged at me: the hardcode had stripped away my own ability to ever opt into deeper thinking. What about workflows where I actually want the model to reason harder? Loom's full-script generator, for example, writes a 4–5 paragraph video script, not a caption. And one of the analytics steps has the AI look at dashboard numbers and write a summary, which is genuinely a reasoning task. For those, minimal might be too light.

I told Tim to revert and do it right. He came back with this:

The backend reads a thinking_level field off each step's config. If unset, it defaults to minimal.
The frontend got a new dropdown in the step editor — minimal / low / medium / high — that only appears when the provider is Gemini and the model starts with gemini-3. (No reason to show it for Gemini 2.5 or other providers where the field doesn't exist.)
All 178 existing steps need zero migration. They just inherit the new default of minimal.
If a future workflow needs deeper reasoning, I open that step, pick a higher level from the dropdown, save. Done.
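
The two rules in that list (default to minimal when unset, show the dropdown only for Gemini 3 models) can be sketched as follows. This is an illustration, not Loom's code: the function names, the `"gemini"` provider string, and the dict-shaped step config are assumptions, and the real frontend check presumably lives in the UI code rather than in Python.

```python
VALID_LEVELS = ("minimal", "low", "medium", "high")
DEFAULT_LEVEL = "minimal"


def resolve_thinking_level(step_config: dict) -> str:
    """Backend rule: use the step's thinking_level, or fall back to the default."""
    level = step_config.get("thinking_level")
    if not level:
        # Existing steps have no thinking_level key, so they inherit the
        # default without any migration.
        return DEFAULT_LEVEL
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown thinking_level: {level!r}")
    return level


def show_thinking_dropdown(provider: str, model: str) -> bool:
    """Frontend rule: only offer the dropdown where the field exists."""
    return provider.lower() == "gemini" and model.startswith("gemini-3")


if __name__ == "__main__":
    print(resolve_thinking_level({}))                          # minimal
    print(resolve_thinking_level({"thinking_level": "high"}))  # high
    print(show_thinking_dropdown("gemini", "gemini-2.5-flash"))  # False
```

Resolving the default at read time, rather than writing minimal into every stored step, is what makes the zero-migration claim work: old configs stay untouched and still get the new behavior.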

This is the kind of design choice I love seeing Tim make: change the default to whatever is right for 99% of cases, but never close the door on user choice. A dropdown costs a few minutes to add. Painting yourself into a corner costs hours to undo.

The Documentation Was Wrong

One funny detour. Tim's first implementation kept returning HTTP 400 errors no matter how he formatted the request.

Early Google docs said the key was thinkingLevel, sitting flat inside generationConfig. The API actually wants thinkingConfig.thinkingLevel — nested one level deeper. Same field name, different shape.

Tim figured it out by reading the error response, trying two or three schemas, and landing on the nested form. Then he left a comment in the code: "Flat key per Google docs returns 400 — must be nested under thinkingConfig" — so future-me, or anyone else reading the file, doesn't try to "fix" it.
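
To make the difference in shape concrete, here is a minimal sketch that builds the request body with the nested form described above. The helper names are hypothetical, the `contents` / `parts` / `text` layout is my understanding of the Gemini REST request format, and nothing here sends a request, so check the shape against the live API before relying on it.

```python
import json


def build_generation_config(thinking_level: str = "minimal") -> dict:
    """Return a generationConfig with the thinking level in the nested shape."""
    # Nested under thinkingConfig. Per the post, the flat form
    # {"thinkingLevel": "..."} directly inside generationConfig returned HTTP 400.
    return {"thinkingConfig": {"thinkingLevel": thinking_level}}


def build_request_body(prompt: str, thinking_level: str = "minimal") -> dict:
    """Assemble a request body for a single text prompt (assumed layout)."""
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": build_generation_config(thinking_level),
    }


if __name__ == "__main__":
    print(json.dumps(build_request_body("Write a short caption"), indent=2))
```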

This is one of those tiny moments where I trust my AI agent more than a senior contractor on Upwork. He didn't believe the docs. He tested against the real API. He left a footnote so the lesson sticks. (The same trust-real-behavior-not-docs habit later caught Claude Sonnet emitting unescaped quotes inside JSON — three days of silent failures my logs didn't surface.)

Three Business Lessons Hidden Inside a 3-Line Config Change

I'm not telling this story to show off a config diff. It's because there are three lessons hiding inside it that apply to anyone running AI in production, not just engineers:

1. A new tool's defaults are optimized to impress, not to save you money. Google didn't pick thinkingLevel=high by accident — that's the setting that makes Gemini 3 look smartest in head-to-head benchmarks. It's also the setting that quietly triples your bill on short tasks. Every time you upgrade a tool, treat the new knobs as suspect, not gifts.

2. Demand numbers before tuning. If I'd just told Tim "use less thinking," he would have done it, and I'd have had no way to tell whether quality slipped at the same time. The benchmark across 4 levels meant the decision was a number, not a vibe. And I can re-run the same benchmark in six months if I want to revisit the decision.

3. Switching the default is great. Removing the choice is not. Hardcoded minimal would have looked like a clean win in the diff. But the day I hit a workflow that genuinely wanted high thinking, I'd have ripped the whole change out. A dropdown is insurance against your own future requirements — and it's cheap insurance.

This Is What an AI Engineer on Staff Looks Like

If I didn't have Tim, here's how this story ends. I see the bill go up, shrug, assume it's the price of upgrading, and pay an extra few hundred dollars a month for the rest of the year — because I had no reason to suspect a hidden config knob existed at all. Or worse, I roll back to Gemini 2.5 to save money and lose whatever quality wins Gemini 3 actually brought.

Instead, in one evening: benchmark run, default flipped, code-review feedback incorporated, dropdown shipped, future bug guarded with a comment. A human engineer with the same chops would be $40–60/hr on Upwork — and I'd still be writing the brief.

This is exactly why I stopped paying for monthly SaaS and started having my own AI agent build and tune my tools instead. It's not just that he can write code. It's that he thinks like an engineer: tradeoff-aware, cost-aware, product-minded. That's also the difference between using AI and having AI work for you.

If you want an AI agent that audits the bill on every API you use, runs benchmarks before tuning anything, and ships fixes in a single evening instead of a sprint — that's exactly what Newton is. You sign up, your own private server spins up with the AI agent already installed and connected, and you can hand it work like this from day one. No DevOps. No "let me set up an account." Just open your chat and complain about your bill. See how Newton works →

— Pond