You are probably already paying for AI. Maybe it is a Claude subscription, maybe API calls for a product you are building, maybe a team of developers using AI coding assistants every day. But here is something most people do not think about: you might be wasting 40 to 80 percent of what you spend. Not because the AI is bad. Because of how you use it.
The culprit is something called tokens. And understanding them is the single biggest lever you have for cutting AI costs without cutting AI quality.
What Are Tokens, Really?
Think of tokens as the currency of AI. Every time you ask Claude or ChatGPT a question, your words get broken into small pieces called tokens. The AI reads those pieces, thinks about them, and sends back an answer, also made of tokens. You pay for both: the tokens going in and the tokens coming out.
A rough rule of thumb: one token is about one word. So a 500 word email you paste into Claude costs about 500 tokens to read. The reply it writes costs tokens too. Output tokens are more expensive than input tokens because generating new text takes more computing power than reading existing text.
Here is where it gets real. The cheapest AI models in 2026 cost about $0.04 per million tokens. The most powerful ones cost up to $180 per million tokens. That is a 4,500x difference between the budget option and the premium option. Choosing the right model for the right job is like choosing between a bicycle and a private jet to go get groceries. Both get you there. One costs a lot more.
Why This Should Matter to You Right Now
AI costs are growing fast and in ways that are hard to predict. According to Deloitte, enterprise AI spending on tokens will exceed $1 trillion annually by 2030. And it is not just big companies feeling the squeeze. A 2026 AICC report found that enterprise token costs dropped 67 percent year over year thanks to multi-model strategies. The companies saving money are the ones paying attention to token efficiency. The ones who are not paying attention are bleeding budget.
For individuals, this shows up as hitting your usage limits too quickly on Claude or ChatGPT. For small businesses, it shows up as surprisingly large API bills. For enterprises, it shows up as MCP servers and AI agents consuming 48 percent of total coding agent spend just on tool descriptions that get re-sent every single conversation turn.
The good news: most of this waste is fixable.
Token Efficiency for Individuals
If you use Claude, ChatGPT, or any AI assistant for daily work, here are the habits that will stretch your usage the furthest.
Start new conversations more often. This is the single biggest thing you can do. AI re-reads your entire conversation history every time you send a message. A 30-message chat means the AI is re-processing all 30 messages with every new reply. Starting a fresh conversation for a new topic drops your token usage dramatically. If you are working on a long task, ask the AI to summarize progress so far, copy that summary, and start a new chat with it.
Pick the right model for the job. Most AI platforms now offer multiple model tiers. For quick answers, formatting, or simple rewrites, use the smaller and faster model (like Claude Haiku or Sonnet). Save the powerful model (like Claude Opus) for complex reasoning, analysis, or creative work that actually needs it. This alone can free up 50 to 70 percent of your budget for the work that matters.
Be specific in your prompts. Vague questions get long, wandering answers. That costs tokens. Instead of "tell me about marketing," say "give me 5 email subject lines for a SaaS product launch targeting small business owners." You will get a better answer and use fewer tokens doing it.
Send less, not more. Before pasting a whole document into the AI, extract just the relevant section. Crop screenshots tightly to only the part you need. A full-page screenshot can cost 1,300 tokens. A cropped version of just the relevant section might cost under 100.
Combine related questions into one prompt. Three separate messages about the same topic means the AI reloads the full context three times. Batching them into one message means it only loads context once.
Token Efficiency for Small and Medium Businesses
When you move from personal use to running AI as part of your business, the stakes go up. API costs can surprise you. Here is how SMBs are keeping them under control.
Use prompt caching. If your business makes repeated API calls with similar content (like a customer service bot that always includes the same company background, or a document processor that uses the same instructions), prompt caching is a game changer. Anthropic's Claude charges cached tokens at just 10 percent of the normal price. That is a 90 percent discount on the repeated parts of your prompts. You break even after just one cache hit.
Route tasks to the right model automatically. Not every customer question needs your most expensive model. Build routing logic that sends simple queries ("what are your business hours?") to a cheaper model and reserves the expensive model for complex questions ("compare pricing plans for my specific use case"). Industry data shows about 85 percent of typical business queries can be handled by budget-tier models with no drop in quality. This single strategy can cut costs by 60 to 90 percent.
Batch your API calls. If speed is not critical, group similar tasks and process them together. Most AI providers offer batch pricing at 50 percent of the regular rate. Good for things like overnight report generation, bulk content categorization, or processing a day's worth of support tickets at once.
Set output limits. If you need a one-paragraph summary, tell the AI that explicitly. Without limits, models tend to be verbose. Every unnecessary sentence costs tokens.
Monitor before you optimize. You cannot fix what you cannot see. Use a tool like Cloudflare AI Gateway (free tier available) to track which workflows consume the most tokens. You will almost certainly find that 20 percent of your workflows account for 80 percent of your token spend.
Token Efficiency for Enterprises: The Cloudflare Code Mode Story
This is where things get really interesting. If your company uses AI agents, coding assistants, or has built products on top of AI APIs, you are probably dealing with something called MCP (Model Context Protocol). It is the standard way AI agents connect to external tools and services. And it has a massive token problem built into its design.
Here is the problem in plain language. Imagine you have a filing cabinet with 2,500 folders. Every time your AI assistant needs to find one folder, it first reads the label on every single folder. All 2,500 of them. Every time. That reading costs tokens. And if you have multiple AI assistants, each one reads all 2,500 labels independently.
A real-world example: GitHub's official MCP server has 94 tools. Just loading the descriptions of those tools costs 17,600 tokens. Before the AI does any actual work. Connect a few more services and you are burning 30,000+ tokens just on tool descriptions every time the AI takes a turn.
This is exactly the problem Cloudflare solved with something called Code Mode.
What Cloudflare Did
The Cloudflare API has over 2,500 endpoints. Exposing each one as a separate tool to an AI agent would consume over 1.17 million tokens. That is more than the entire context window of the most advanced AI models available today. Your AI would literally run out of memory just reading the menu before it could order anything.
Cloudflare's solution was elegant. Instead of giving the AI 2,500 separate tools, they collapsed everything into just two tools: search() and execute(). The AI uses search() to find the specific endpoints it needs (like looking up a word in the index of a book instead of reading the whole book). Then it uses execute() to actually do the work.
The result: 1.17 million tokens reduced to roughly 1,000 tokens. A 99.9 percent reduction. And the number stays fixed at about 1,000 tokens no matter how many API endpoints exist behind it.
They also open-sourced a Code Mode SDK so any company can apply the same pattern to their own MCP servers.
The Four Approaches to Enterprise Token Optimization
Cloudflare's Code Mode is one of four major approaches being used in production today. Here is how they compare, based on published benchmarks from StackOne, Anthropic, Cloudflare, Speakeasy, and Atlassian.
Schema Compression strips unnecessary detail from tool descriptions. Think of it as removing the fine print from each folder label. Reduces tool description costs by 70 to 97 percent. Easy to set up. Works as a drop-in proxy. But it does not help with the cost of the data coming back from those tools.
Search-First Discovery is like having a librarian. Instead of reading every folder label, the AI asks the librarian "which folders deal with billing?" and only loads those. Reduces input costs by 91 to 97 percent. Claude Code already does this automatically when tool descriptions exceed 10 percent of context.
Response Filtering is about trimming the data that comes back. If you asked for employee records and only need names and departments, why receive salary history, emergency contacts, and benefits information? Filtering responses reduces output costs by about 95 percent per call.
Code Mode (what Cloudflare uses) tackles both problems at once. The AI writes a small program that searches for what it needs and returns only the specific data it wants. StackOne benchmarked this across their platform: Sonnet 4.6 with Code Mode scored 80 percent on tool accuracy, outperforming Opus 4.6 without Code Mode at 62 percent. A cheaper model with better architecture beat a more expensive model without it.
For most enterprises, the practical recommendation is to combine two or three of these approaches. Start with search-first discovery (lowest effort). Add response filtering for data-heavy integrations. Graduate to Code Mode for your highest-volume workflows.
How to Set This Up
Here are the practical starting points depending on your situation.
For individuals (no code required): Start new chats more often. Use the right model tier. Be specific in your prompts. Crop your screenshots. Combine related questions. These five habits alone will dramatically improve your token efficiency.
For SMBs using the API: Start with Cloudflare AI Gateway. It is free to set up and gives you caching, analytics, and rate limiting across all major AI providers. The quickest path is to use "default" as your gateway ID. AI Gateway creates one for you automatically on your first API call. If you use frameworks like Vercel AI SDK or LangChain, both integrate directly with Cloudflare AI Gateway.
For enterprises running AI agents: Point your MCP client at the Cloudflare MCP server. The configuration is one line:
"mcpServers": {
"cloudflare-api": {
"url": "https://mcp.cloudflare.com/mcp"
}
}
}
For broader deployments, look at MCP Server Portals. These let administrators create curated tool sets for different teams. Your finance team sees read-only tools. Your engineering team sees more powerful read/write tools. Each portal collapses everything behind it into two Code Mode tools.
For the full picture, consider an MCP gateway that sits between your agents and your MCP servers. A gateway gives you semantic caching (so the same query does not hit the API twice), tool selectivity (only load tools relevant to the current task), and compiled execution (Code Mode for all your MCP servers, not just Cloudflare's).
The Bigger Picture
Every AI model upgrade makes tokens both cheaper and more capable. Claude Opus 4.8 just launched at the same price as its predecessor, but with stronger performance across the board. Enterprise token costs dropped 67 percent year over year according to the AICC. Fast mode for Opus 4.8 is three times cheaper than it was for previous models.
But cheaper tokens do not mean you should waste them. The companies and individuals who build token-efficient habits now will compound those savings as AI becomes more central to everything they do. Just like cloud costs (where up to 47 percent of spending is waste), AI token costs reward the people who pay attention.
The pattern is clear: every time AI gets better, the opportunity to use it wisely gets bigger. Token efficiency is not about spending less on AI. It is about getting more from every dollar you do spend.
Is Your AI Spend Growing Faster Than Your AI Results?
Fermat Solutions helps small and mid-size businesses implement AI efficiently. From choosing the right models to architecting cost-effective workflows, we make sure your AI investment delivers real returns.
Book a Free 30-Minute AI Cost Review