The Hidden Token Tax of MCP

A coding agent is debugging a failed deploy. The engineer needs one answer: what changed, and what broke?

Instead, the model gets tools for CI status, logs, deploy history, feature flags, ticket search, Git history, incident summaries, and database inspection. Before it can help, it has to read that whole tool surface, decide which tools matter, and work through a loop of calls and results.

That hidden work is the token tax of MCP.

MCP solves a real problem. Trouble starts when we treat it like neutral plumbing.

Tool schemas are part of your prompt, whether you call them or not

Start with one simple fact. From the model’s point of view, tools do not sit outside the prompt.

The model only knows a tool exists because you describe it in context. The schema, description, arguments, and expected usage all become text the model has to read before it can do useful work.

If you expose ten tools, you give the model ten options and ten chunks of prompt to parse. If you expose fifty tools, the model has to work through a small catalog before it can start solving the user’s problem.

That cost shows up even when no tool gets called.

In the failed-deploy example, the answer may only require logs and Git history. The model still has to carry ticket search, database inspection, feature flags, and everything else you exposed up front.

That is the standing cost. Tool availability lives inside the prompt.
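
To make the standing cost concrete, here is a rough sketch that estimates how many tokens a set of tool definitions adds to every request. The tool names and schemas are hypothetical, and the four-characters-per-token rule is only a crude approximation, not a real tokenizer, so treat the numbers as relative rather than exact.

```python
import json


def make_tool(name: str, description: str, params: dict[str, str]) -> dict:
    """Build a hypothetical tool definition shaped like a JSON-schema function spec."""
    return {
        "name": name,
        "description": description,
        "parameters": {
            "type": "object",
            "properties": {p: {"type": t} for p, t in params.items()},
        },
    }


# Hypothetical tool surface for the failed-deploy agent.
TOOLS = [
    make_tool("get_logs", "Fetch recent log lines for a service.",
              {"service": "string", "minutes": "integer"}),
    make_tool("git_history", "List recent commits for a repository.",
              {"repo": "string", "limit": "integer"}),
    make_tool("ticket_search", "Search tickets by free-text query.",
              {"query": "string"}),
    make_tool("db_inspect", "Run a read-only query against a database.",
              {"database": "string", "sql": "string"}),
]


def estimate_tokens(text: str) -> int:
    """Rough heuristic: about four characters per token (an assumption)."""
    return len(text) // 4


def standing_cost(tools: list[dict]) -> int:
    """Estimated tokens spent describing tools on every request, called or not."""
    return sum(estimate_tokens(json.dumps(tool)) for tool in tools)


if __name__ == "__main__":
    needed = [t for t in TOOLS if t["name"] in {"get_logs", "git_history"}]
    print("all tools:   ", standing_cost(TOOLS), "estimated tokens per request")
    print("needed tools:", standing_cost(needed), "estimated tokens per request")
```

Running this against your own tool definitions, ideally with your provider's actual tokenizer in place of the heuristic, gives a quick read on how much of each request goes to tools the task never touches.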

Large tool surfaces make agents slower and less reliable

Extra tools do more than increase token count. They also make the model’s job harder.

Every extra tool expands the search space. The model has more names to scan, more descriptions to compare, and more ways to make a bad pick. It has to decide what the user wants, whether it needs a tool, which tool fits best, and how to format the call.

That decision overhead is easy to miss when you look only at raw latency numbers. Users still feel it as hesitation, wrong turns, or extra back-and-forth.

Provider guidance points in the same direction. The usual recommendation is to keep the initial function set small, then use deferred loading or tool search when the catalog grows. That advice exists because large tool surfaces create real runtime costs.

MCP makes it easy to expose many tools. That is great for integration speed. It also removes some of the friction that would normally force you to ask whether a tool belongs in this workflow at all.

Once that friction disappears, the tool list tends to grow. Over time, the model has more to read, more to choose from, and more chances to take the wrong path.

A small tool surface gives the model a clearer working set. A large one makes it sort through clutter before it can help.
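
One way to keep the working set small is a per-workflow allowlist applied before the request is built. The sketch below assumes a hypothetical catalog and hypothetical workflow names; it is not tied to any particular MCP client or SDK.

```python
# Hypothetical catalog: tool name -> short description.
CATALOG = {
    "ci_status": "Report the status of the latest CI run.",
    "get_logs": "Fetch recent log lines for a service.",
    "git_history": "List recent commits for a repository.",
    "feature_flags": "Show current feature flag state.",
    "ticket_search": "Search tickets by free-text query.",
    "incident_summary": "Summarize a past incident.",
    "db_inspect": "Run a read-only query against a database.",
}

# Hypothetical mapping from workflow to the tools it actually needs.
WORKFLOW_TOOLS = {
    "failed_deploy": ["ci_status", "get_logs", "git_history"],
    "support_lookup": ["ticket_search", "incident_summary"],
}


def tools_for(workflow: str) -> dict[str, str]:
    """Return only the catalog entries allowlisted for this workflow.

    Unknown workflows get no tools rather than the whole catalog.
    """
    names = WORKFLOW_TOOLS.get(workflow, [])
    return {name: CATALOG[name] for name in names if name in CATALOG}


if __name__ == "__main__":
    exposed = tools_for("failed_deploy")
    print(f"exposing {len(exposed)} of {len(CATALOG)} tools: {list(exposed)}")
```

Defaulting to an empty set for unknown workflows is a deliberate choice here: it forces someone to decide which tools belong, which restores the friction that easy integration removed.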

The real latency tax comes from the loop

When people discuss MCP overhead, they often start with protocol or transport cost. In practice, the larger delay usually comes from the loop around the tool call.

Tool use is a multi-step interaction. The model decides to call a tool. The application executes it. The result goes back into the conversation. Then the model takes another turn to interpret the result and decide what to do next.

Each step is reasonable on its own. The slowdown appears when you stack several of them together.

In the failed-deploy example, the agent may ask for CI status, then recent logs, then the deploy diff, then feature flag state. Each of those calls can make sense. But each one adds another turn. Each turn carries more context forward. Each result becomes more text for the next step.

That is why tool-heavy systems can feel slow even when no single tool looks broken. The system keeps paying small decision and context costs in sequence.
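
A little arithmetic shows how those sequential costs stack up. This sketch assumes each turn re-reads the full prompt plus every earlier tool result, and it ignores the tokens in the model's own tool-call messages. The token counts are made-up illustrations, not measurements.

```python
def cumulative_input_tokens(base_prompt: int, result_sizes: list[int]) -> int:
    """Total input tokens processed across a tool loop.

    Assumes one model turn per tool call, each reading the whole context so
    far, plus one final turn that writes the answer.
    """
    total = 0
    context = base_prompt
    for size in result_sizes:
        total += context   # the turn that decides on this tool call
        context += size    # its result is carried into every later turn
    total += context       # the final turn that produces the answer
    return total


if __name__ == "__main__":
    base = 3000                          # illustrative: instructions plus tool schemas
    results = [500, 2000, 1500, 300]     # illustrative: CI status, logs, diff, flags
    print("four-call loop:", cumulative_input_tokens(base, results))  # 26300
    print("no tool calls: ", cumulative_input_tokens(base, []))       # 3000
```

The final context is only 7,300 tokens, but the model processes 26,300 input tokens in total because the early material is re-read on every turn.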

Sometimes a direct CLI step or a small piece of code can collapse that loop. Instead of asking the model to choose among four read tools and interpret four payloads, you can execute one targeted operation and give the model the distilled result. That shrinks context and shortens time to answer.
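
As a sketch of what a distilled result might look like, the function below takes data the application has already gathered and reduces it to one compact payload. The function name, its inputs, and the "ERROR" substring filter are all illustrative assumptions; how you actually fetch CI status, logs, and diffs depends on your own systems.

```python
def distill_deploy_context(
    ci_status: str,
    log_lines: list[str],
    changed_files: list[str],
    flags: dict[str, bool],
    max_errors: int = 3,
) -> str:
    """Build one compact summary from data the application gathered itself."""
    # Keep only the first few error lines instead of the full log.
    errors = [line for line in log_lines if "ERROR" in line][:max_errors]
    enabled = sorted(name for name, on in flags.items() if on)
    parts = [
        f"CI: {ci_status}",
        f"Changed files ({len(changed_files)}): {', '.join(changed_files[:5])}",
        f"Enabled flags: {', '.join(enabled) or 'none'}",
        "First errors:",
        *[f"  {line}" for line in errors],
    ]
    return "\n".join(parts)


if __name__ == "__main__":
    summary = distill_deploy_context(
        ci_status="failed",
        log_lines=["INFO start", "ERROR db timeout", "ERROR retry exhausted"],
        changed_files=["app/db.py", "app/settings.py"],
        flags={"new_pool": True, "legacy_path": False},
    )
    print(summary)
```

The model then gets a handful of lines in a single turn instead of four raw payloads spread across four.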

Prompt caching helps, but it misses the cold starts teams feel

Prompt caching improves the picture, but not as much as people hope.

Caching can make repeated requests cheaper when the prefix stays stable and the session stays warm. That matters if the same agent keeps seeing the same tool surface and similar context.

So yes, it helps. It does not erase the problem.

Caching works best when the prompt shape stays stable, traffic stays warm, and sessions last long enough to benefit. A lot of production workflows do not look like that.

Some are short-lived. Some change tool availability per request. Some are cold-start heavy. Some rotate context aggressively because the agent keeps shifting tasks.

In those cases, the cache is either never populated or expires before it is reused often enough to change the economics much.

If your support assistant wakes up cold for short bursts all day, prompt caching may look great in a benchmark and still disappoint in production. Those are the cases users notice.

A simple rule helps: if your system repeats a stable prompt shape, caching can soften the MCP tax. If your workflow is dynamic or bursty, caching will help less than you expect.
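
The rule can be put into rough numbers. This sketch computes the average billed cost of a tool-heavy prefix at different cache hit rates. The 0.1 price ratio for cached tokens is a placeholder assumption, not a quoted provider rate, and any cache-write surcharge is ignored.

```python
def effective_prefix_cost(
    prefix_tokens: int,
    hit_rate: float,
    cached_price_ratio: float = 0.1,
) -> float:
    """Average billed prefix tokens per request, in uncached-token equivalents.

    hit_rate: fraction of requests that find a warm cache (0.0 to 1.0).
    cached_price_ratio: assumed price of a cached token relative to an
    uncached one; check your provider's pricing for the real figure.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    miss_cost = (1.0 - hit_rate) * prefix_tokens
    hit_cost = hit_rate * prefix_tokens * cached_price_ratio
    return miss_cost + hit_cost


if __name__ == "__main__":
    prefix = 4000  # illustrative size of instructions plus tool schemas
    for label, rate in [("warm, stable sessions", 0.9), ("bursty, cold-start heavy", 0.2)]:
        print(f"{label}: ~{effective_prefix_cost(prefix, rate):.0f} token-equivalents")
```

Under these assumptions the warm workload pays roughly a fifth of the full prefix cost, while the bursty one still pays most of it.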

Use MCP where portability pays for the extra weight

None of this means MCP is a mistake.

If you want one integration surface that works across multiple clients, vendors, or developer environments, standardization pays off. One MCP server can be easier to maintain than several custom adapters. For a platform team, that engineering leverage can outweigh the extra prompt tokens.

That is the strongest case for MCP. Portability is what makes the extra weight worth carrying.

A good example is one internal tools team serving both an engineer-facing coding assistant and a support assistant that need the same deployment, incident, and repository data. One MCP server can keep those integrations aligned instead of forcing two separate tool stacks to drift apart.

That argument is real. It just does not apply everywhere.

If a workflow lives inside one controlled system and only needs one or two sharp operations, a native tool or direct code execution will usually serve you better. In that case, portability is not buying you much.

A simple decision rule: MCP vs native tool vs code execution

A practical framework is more useful than a purity test.

Use MCP when the same capability has to work across multiple clients or teams, when interoperability is part of the product requirement, or when a shared protocol meaningfully cuts engineering overhead.

Use native tools when the workflow lives inside one application and the action surface should stay narrow, explicit, and cheap. A native tool fits when you already control the host, do not need cross-client portability, and want the model to see only the few operations that matter. That usually improves latency and tool selection because the model has less to scan and fewer wrong branches to take.

Use direct code execution when the model does not need a menu of choices at all. Sometimes the next step is not really a tool choice. It is deterministic data gathering, transformation, or aggregation. In those cases, code can gather the diff, pull the relevant logs, summarize the result, and hand the model one compact payload instead of making it orchestrate a chain of small calls.

If you were explaining this to a junior developer, you could say it this way. Use MCP when shared access and portability matter enough to justify extra prompt weight. Use native tools when you control the environment and want the model to have a short, obvious list of actions. Use code when there is no real choice to make and you just need to gather or transform data.
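
The same rule can be written as a tiny function. This is only a sketch: the two boolean fields are a simplification of the questions above, and checking for deterministic work before portability is an assumed ordering that you may want to reverse.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    # Hypothetical fields that compress the questions in this section.
    shared_across_clients: bool  # multiple clients, vendors, or teams need it
    model_must_choose: bool      # the next step is a real choice, not a fixed recipe


def choose_integration(workflow: Workflow) -> str:
    """Return 'code_execution', 'mcp', or 'native_tool' for a workflow."""
    if not workflow.model_must_choose:
        # Deterministic gathering or transformation: no menu of tools needed.
        return "code_execution"
    if workflow.shared_across_clients:
        # Portability is what pays for the extra prompt weight.
        return "mcp"
    # One controlled host, a short and explicit list of actions.
    return "native_tool"


if __name__ == "__main__":
    print(choose_integration(Workflow(shared_across_clients=True, model_must_choose=True)))
    print(choose_integration(Workflow(shared_across_clients=False, model_must_choose=True)))
    print(choose_integration(Workflow(shared_across_clients=False, model_must_choose=False)))
```

Real decisions have more inputs than two booleans, but writing the rule down this way makes the default explicit and easy to argue about in review.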