Opus 4.7 Is the Best Model I’ve Used. It’s Also the One I Trust Least.
Anthropic gave us a smarter model and took away our thermostat. Here’s why that trade-off matters for anyone building on AI.
1. Introduction - I got a smarter model and a bigger bill on the same day
Okay so. Today Anthropic shipped Opus 4.7. I ran the same prompt I ran a week ago on 4.6. The answer was better. Cleaner plan, fewer tool errors, one fewer retry.
Then I opened the billing dashboard.
My API spend for the last 30 days is up 27% on the same workload. Same users. Same product. Same prompts, give or take. And none of that is because I chose to spend more.
Here’s the part that took me a minute to process: the model never asked me how much thinking to do. It decided. And it decided to think more. Sometimes a lot more. I did not notice until the bill showed up.
This is not a story about Opus 4.7 being bad. On the benchmarks it is a clear step up. +13% on Anthropic’s 93-task coding set. 70% on CursorBench versus 58% for 4.6. 87.6% on SWE-bench Verified, ahead of both GPT-5.4 and Gemini 3.1 Pro on the metrics that map to actual engineering work. The model is real, the gains are real, and for a lot of people it will just feel like a free upgrade.
This is a story about something else. Something you only see if you run a product on top of Claude.
Between February 9 and March 3, Anthropic made two changes to how Claude thinks. On February 9, they moved Opus 4.6 to adaptive thinking by default. On March 3, they dropped the default effort level from high to medium. No release notes in your inbox. No migration guide you had to sign off on. The API still works. Your code still compiles. Your users still get responses. The behavior of your product changed anyway.
Opus 4.7 is the moment this stops being a temporary default and becomes the architecture. The old budget_tokens parameter - the one where you told the model “you have 10,000 tokens to think, no more” - is deprecated. Adaptive thinking is now the only supported thinking mode on 4.7. You cannot go back. You can ask for effort: high. You cannot ask for predictability.
So I want to write about the thing most people are going to miss this week while they trade benchmark screenshots on X. We got a smarter model. And we lost the thermostat.
I am calling the pattern Default Drift. And if you build anything on top of an LLM, I think it is the most important AI story of 2026.
2. Perspective - The thermostat is gone, and nobody voted
Here is the way most people will cover the Opus 4.7 release.
They will put up a table. They will highlight +13% on coding, 94.2% on GPQA Diamond, a new tokenizer, a bigger vision input, claude-opus-4-7 in the model string. They will conclude that Opus is still the best agentic coding model on the market and move on. All of that is accurate. None of it is the story.
The actual story is architectural. Anthropic shipped the first flagship model in Claude’s history that removes a control knob developers had since extended thinking launched. It is not a feature deprecation. It is a philosophy change.
Old world - call it “extended thinking” - worked like a thermostat. You told Claude: “think up to 10,000 tokens on this.” Claude did that. Your latency was bounded. Your cost was bounded. You could tune per endpoint - expensive reasoning for the “plan” step, cheap quick answers for the “clarify” step. If your billing alert triggered, you dropped the budget. If users complained about speed, you dropped the budget. You were the operator.
New world - adaptive thinking - works like a smart home. Claude looks at the request, decides it is complex, and spends what it thinks the request deserves. You get an effort knob with three positions - low, medium, high - but those positions are not promises. Claude still decides how much to think inside each position. At the default (high), Claude almost always thinks. At medium, sometimes. At low, rarely. None of those are numbers. They are moods.
Interleaved thinking is on by default, which means Claude also thinks between tool calls. For a long-running agent that is great. For cost predictability it is a second multiplier you cannot turn off.
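In code, the difference between the two worlds looks roughly like this. This is a sketch of the two request shapes as I understand them - the field names (`thinking.budget_tokens` on the old API, `effort` on the new one) and the model strings follow my reading of the docs and may not match your SDK version, so treat it as illustrative:

```python
def old_world_request(prompt: str, budget: int) -> dict:
    """Pre-4.7 'thermostat': a hard cap on thinking tokens that you set."""
    return {
        "model": "claude-opus-4-6",
        "max_tokens": 4096,
        "thinking": {"type": "enabled", "budget_tokens": budget},
        "messages": [{"role": "user", "content": prompt}],
    }

def new_world_request(prompt: str, effort: str = "high") -> dict:
    """4.7 'smart home': effort is a mood, not a cap. Claude still decides
    how much to actually think inside each position."""
    if effort not in ("low", "medium", "high"):
        raise ValueError(f"unknown effort: {effort}")
    return {
        "model": "claude-opus-4-7",
        "max_tokens": 4096,
        "effort": effort,
        "messages": [{"role": "user", "content": prompt}],
    }
```

Notice what is missing from the second shape: there is no number anywhere that bounds thinking. That absence is the whole story.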
Here is how you feel this in practice.
In early March, an AMD engineer named Stella Laurenzo published telemetry covering 6,852 Claude Code sessions, 234,760 tool calls, and 17,871 thinking blocks collected from January through March. Her GitHub issue claimed a 67% drop in sustained reasoning relative to the period before the February changes. Boris Cherny, the Claude Code lead, had to respond publicly on Hacker News within a few hours. Anthropic’s counter was not “nothing changed.” Anthropic’s counter was “we changed defaults, we did not downgrade the model.” Which is technically true and beside the point.
The point is that for 52 days, every person building on Claude Code was running a different product than they thought they were running. Not because the model got worse. Because the controls Anthropic shipped underneath the model quietly moved.
Think about that for your product.
Your product’s behavior is downstream of Claude’s behavior. When Claude suddenly spends 2x the thinking tokens on the same user action, three things happen on your side. One, margin compresses. Two, p95 latency shifts and users notice. Three, you have to explain to a support request why “the AI is acting differently today” when you did not ship anything.
The same thing just happened to every serious developer on Claude Code. And 4.7 is not a reversal. It is the consolidation of the new operating model.
Here is the aha. The benchmarks are measuring capability. They are not measuring controllability. And in 2026, for anyone building a real product on AI, controllability is where the game is actually played.
You can write a great model review that says “4.7 is better than 4.6.” Every benchmark I looked at supports it. You can also write a true PM review that says “my product got harder to run this month,” and have that be true at the same time. Those two things are not in conflict. They are happening on two different axes, and most of the industry only looks at one.
The common frame is “new model, better.” My frame is “new model, new defaults, new operating contract - read it carefully.” I am calling this Default Drift because that names the actual threat. Defaults shift under you. You did not sign anything. Your product is running on someone else’s moving floor.
Anthropic is not the villain here. Anthropic is optimizing for the median developer - the one who never tuned budget_tokens, who wants the model to “just work,” who is building a chat app or a support bot where controllability matters less than raw capability. For that audience, adaptive thinking is a real win.
The trade-off is this: the better the defaults get for the median user, the more invisible the drift becomes for the operator user. And if you are reading this, you are almost certainly the operator.
3. Gamify - Five things I am doing this week because of Opus 4.7
I am not going to tell you to panic. I am not going to tell you to switch to GPT-5.4. I use Claude every day in my work. I am probably going to keep using it. The model is genuinely strong and the Routines feature Anthropic shipped alongside 4.7 is a real upgrade for agentic workflows.
But I am changing five things this week, and if you build on top of any LLM, you probably should too.
Step 1: Log your baselines before the next default moves
The biggest mistake I made in February was not having a “before” snapshot. When people started complaining about 4.6, I had no way to prove or disprove what they were saying. I was arguing from vibes.
Tonight I am going to do what I should have done in December. I am picking 20 real prompts from my own production traffic - a mix of easy ones, hard ones, agentic flows, and single-shot questions. I am going to run each of them once on Opus 4.7, and I am going to log three numbers per run: total output tokens, thinking tokens, end-to-end latency.
That is my new baseline. The next time Anthropic tweaks a default, I will know within a day. You cannot defend a position you did not measure.
Bonus: those 20 prompts become my regression suite. Every release, I rerun them. Every drift, I see it. This is the single highest-leverage change you can make.
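Here is the shape of the logging I am setting up tonight. The usage field names (`output_tokens`, `thinking_tokens`) are my assumption about what the API response exposes - adjust them to whatever your SDK actually returns. The point is the three numbers and the append-only file, not the exact schema:

```python
import json
import time

def log_run(prompt_id: str, usage: dict, latency_s: float,
            path: str = "baseline.jsonl") -> dict:
    """Append one baseline row: the three numbers per run."""
    row = {
        "prompt_id": prompt_id,
        "output_tokens": usage.get("output_tokens", 0),
        "thinking_tokens": usage.get("thinking_tokens", 0),
        "latency_s": round(latency_s, 2),
        "date": time.strftime("%Y-%m-%d"),
    }
    with open(path, "a") as f:
        f.write(json.dumps(row) + "\n")
    return row

# Per prompt: start a timer, make your real API call, then log:
#   t0 = time.monotonic()
#   resp = client.messages.create(...)   # your actual call
#   log_run(prompt_id, resp.usage, time.monotonic() - t0)
```

A JSONL file is deliberately boring. You can diff it, grep it, and load it into a spreadsheet the day a default moves.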
Step 2: Pin your effort, do not inherit it
The default effort on Opus 4.7 is high. That sounds great until you remember the February story, when the default quietly dropped to medium and a lot of builders did not notice for weeks.
Every production call I make is now going to set effort explicitly. If I need deep reasoning, I write it. If I want fast and shallow, I write it. Nothing is left to inheritance. The two lines of code it adds are cheaper than a surprise outage.
The meta-point: treat defaults like you treat magic constants in code. Extract them, name them, own them. The moment a default is implicit, it belongs to whoever owns the upstream, and that is not you.
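Extracted and named, the defaults look like this. The constant names and the request shape are my own illustration, not an official SDK convention - the point is that every call site references an explicit, owned value and refuses to inherit one:

```python
# Defaults extracted, named, and owned - never inherited from the vendor.
PLAN_EFFORT = "high"     # deep reasoning for the "plan" endpoint
CLARIFY_EFFORT = "low"   # fast, shallow answers for the "clarify" endpoint

def build_request(prompt: str, effort: str) -> dict:
    # No default argument on purpose: every caller must state an effort.
    if effort not in ("low", "medium", "high"):
        raise ValueError(f"unknown effort: {effort}")
    return {
        "model": "claude-opus-4-7",
        "effort": effort,
        "messages": [{"role": "user", "content": prompt}],
    }
```

Note that `build_request` has no default for `effort`. That is the two lines of insurance: the code cannot silently ride along with whatever the upstream decides next quarter.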
Step 3: Build a 10-prompt eval suite and run it weekly
Your product depends on the model behaving a certain way. That dependency is not tested. Fix that.
Pick 10 prompts that represent what your real users do. For each, define what “right” looks like - not a strict string match, but a structured rubric. Did it use the right tool? Did it include the key data point? Did it finish without looping? Did thinking tokens stay in the expected band?
Run this once a week. Manually, if you have to. Script it if you can. The goal is not perfection. The goal is an early-warning system. If next month your rubric score drops 15 points, you know the conversation to have with your users before they have it with you.
This is the analog to application monitoring from the pre-AI era. You would not run a web app without APM. You should not run an AI product without an eval harness. The drift is already happening. The only question is whether you see it.
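A rubric check can be a dozen lines of Python. The four checks below mirror the ones above; the dict keys are my own naming, not a standard eval format, so rename them to fit your harness:

```python
def score_run(result: dict, rubric: dict) -> int:
    """Score one eval run against a structured rubric, 0-100."""
    lo, hi = rubric["thinking_band"]
    checks = [
        result.get("tool_used") == rubric["expected_tool"],   # right tool?
        rubric["key_fact"] in result.get("answer", ""),       # key data point?
        result.get("tool_calls", 0) <= rubric["max_tool_calls"],  # no looping?
        lo <= result.get("thinking_tokens", 0) <= hi,         # expected band?
    ]
    return round(100 * sum(checks) / len(checks))
```

Run it weekly, chart the scores, and a 15-point drop stops being a support ticket and starts being a data point.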
Step 4: Write a fallback path into your architecture this quarter
I do not think GPT-5.4 is a better everyday model than Opus 4.7. But I do think a product that can only run on one upstream is fragile.
This quarter I am refactoring my AI layer so that every call goes through an internal abstraction that can route to Claude, GPT, or Gemini based on a config flag. That does not mean I am going to swap providers. That means when Anthropic makes its next quiet default change, I can move the 5% of my workload that is most sensitive to the change in one afternoon instead of one sprint.
This is not multi-cloud maximalism. It is insurance. The premium is a couple of weeks of engineering. The payout comes due on every future default drift and every future outage. For a real product, that math is easy.
If you are pre-revenue or solo, make the abstraction small. One function. Two providers. The point is the optionality, not the elegance.
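The minimal version is genuinely small. The provider calls below are stubs - swap in the real Anthropic, OpenAI, and Google SDK calls behind each branch. The only part that matters is that the route lives in config, not in call sites:

```python
# One function, config-flag routing. ROUTES belongs in config, not code.
ROUTES = {"default": "claude", "sensitive": "gpt"}

def _call_claude(prompt: str) -> str:
    return f"[claude] {prompt}"   # replace with the real SDK call

def _call_gpt(prompt: str) -> str:
    return f"[gpt] {prompt}"      # replace with the real SDK call

def _call_gemini(prompt: str) -> str:
    return f"[gemini] {prompt}"   # replace with the real SDK call

def complete(prompt: str, workload: str = "default") -> str:
    """Route a workload class to a provider via config, so moving the
    sensitive 5% is a one-line config change, not a sprint."""
    provider = ROUTES.get(workload, "claude")
    return {"claude": _call_claude,
            "gpt": _call_gpt,
            "gemini": _call_gemini}[provider](prompt)
```

When the next quiet default change lands, flipping `"sensitive"` to another provider is the whole migration.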
Step 5: Push back in public when something breaks
This is the part most builders skip, and it is the most important.
The February changes got addressed because Stella Laurenzo ran the telemetry, wrote a GitHub issue with numbers, and forced a public response. Not because thousands of people tweeted vibes. Anthropic reacts to specific, reproducible, measured evidence. If you see behavior change and you have the data, file the issue. Post the chart. Reply to Boris Cherny with your numbers, not your opinion.
I say this without snark - Anthropic is one of the more responsive labs in the industry. But they cannot read your mind and they cannot prioritize what they cannot see. The eval suite from step 3 is not just for your benefit. It is the evidence base for every future conversation you will have with a model vendor. And every builder who files a real issue raises the floor for every other builder.
You are not a customer. You are a counterparty. Act like one.
Closing
There is a version of this post that reads “Claude got nerfed, I am mad.” That is not the post I wanted to write because that post is not true and it is not useful. Opus 4.7 is the strongest model I have used. Anthropic shipped real, measurable improvements, and they are going to ship more.
The post I wanted to write is the one that nobody else seems to be writing this week.
When your product runs on top of a model that decides how much to think, you are no longer fully the product manager of your own product. You share the role with whichever team inside your vendor owns the default configuration. That team optimizes for their objective function, not yours. Right now those objectives happen to align most of the time. They will not always. And the places they diverge are exactly the places your users feel it first.
Default Drift is the name I am giving this because it describes what is actually happening. Defaults drift. Products built on those defaults drift with them. And most of the drift is invisible unless you measure it.
We are early. Every builder on AI right now is building on infrastructure that is being redesigned underneath us in real time. That is the nature of being early. The people who will win this decade are not the ones with the best prompts. They are the ones who notice when the floor moves, and who built the sensors to notice in time.
Go measure something this week. Pick the 20 prompts. Log the tokens. Pin the effort. Then do the same thing next week, and the week after. You will be shocked at how much information you have been leaving on the table.
Opus 4.7 is a great model. It is also a reminder that the contract between you and your upstream is shorter and softer than you thought. That is fine. But read it carefully.
And bring your own measurements.
If you run a product on top of Claude, GPT, or Gemini - reply and tell me what your baseline looks like. I am collecting examples for a follow-up, and the cleaner the data, the more useful the next piece will be.

