Token Prices Are Falling,
The Illusion of the Price War
If you've been following the large language model (LLM) API market since 2024, the trend lines look unambiguously triumphant. DeepSeek V3 slashed input token prices to $0.27 per million tokens—a greater than 90% reduction from early 2024 levels. OpenAI's GPT-4o mini, Anthropic's Claude Haiku, Alibaba's Qwen-Turbo, and a wave of open-source challengers have all joined a race to the bottom that benefits developers on its face. On paper, the cost of intelligence is plummeting.
The price per token has fallen 90%, yet many teams' bills keep rising. This is not a paradox of arithmetic. It is a crisis of transparency.
Southern Metropolis Daily, one of China's most influential investigative newspapers, published a landmark investigation in early 2026 that should unsettle every CTO and AI product lead. Their finding: despite the headline-grabbing price cuts, a substantial and growing number of enterprises and independent developers are seeing their actual LLM bills rise—in some cases by 30%, 50%, even 100% year-over-year.
The culprit is not the price tag. It is a metric that almost no service provider surfaces transparently: cache hit rate.
The Token Volume Explosion: When Efficiency Backfires
To understand why falling unit prices coexist with rising absolute costs, you have to look at how AI applications have evolved between 2024 and 2026. The dominant architectural pattern has shifted from simple request-response chat to agentic workflows—multi-step, autonomous, tool-using loops where an LLM orchestrates a sequence of actions to complete a complex task.
A 2024 chatbot application might have consumed 500 tokens per user query. A 2026 agentic system executing an equivalent task (say, researching a topic, drafting a report, generating images, and publishing to a CMS) can easily consume 50,000 tokens. Consider a concrete example. An agent built on the OpenClaw platform to automate SEO content production might execute the following steps in a single workflow:
- Topic analysis: Retrieve trending topics from an editorial database (1 API call, ~2,000 input tokens)
- Outline generation: Generate a structured outline based on the topic (1 API call, ~3,000 input tokens, including system prompt and tool definitions)
- Section drafting: Draft each section iteratively (5-10 API calls, each ~4,000 input tokens)
- Image prompt generation: Create image generation prompts (1 API call, ~2,000 input tokens)
- Quality review: Evaluate the draft against editorial guidelines (1 API call, ~5,000 input tokens)
- Publication: Format and publish the article (1 API call, ~2,000 input tokens)
In this workflow, the system prompt (which defines the agent's role, constraints, and capabilities) and the tool definitions (which specify available functions like search_web, generate_image, publish_article) are included in every single API call. They are static—identical across all steps. In our example, they might account for 2,500 tokens out of 4,000 total input tokens per call. That's 62.5% redundant context being retransmitted and re-billed on every step.
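The redundancy figure above can be checked with a few lines of arithmetic. This is a minimal sketch using the numbers from the example (2,500 static tokens in a 4,000-token call). The function names and the choice of 10 drafting calls are illustrative assumptions, not part of any provider's tooling.

```python
def static_fraction(static_tokens: int, total_input_tokens: int) -> float:
    """Share of one call's input that is unchanged system prompt and tool definitions."""
    return static_tokens / total_input_tokens


def rebilled_static_tokens(static_tokens: int, num_calls: int) -> int:
    """Static tokens sent again after the first call, assuming no cache hits at all."""
    return static_tokens * (num_calls - 1)


if __name__ == "__main__":
    # 2,500 static tokens out of 4,000 input tokens per drafting call.
    print(f"static share per call: {static_fraction(2_500, 4_000):.1%}")
    # Ten drafting calls: the static block is retransmitted nine more times.
    print(f"static tokens re-sent: {rebilled_static_tokens(2_500, 10)}")
```

With ten drafting calls, the same 2,500 tokens are sent nine more times after the first call, which is where the cached-versus-uncached price difference starts to matter.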
Your system prompt is the part of your API request that never changes, yet it is billed again on every call.
KV Cache: The Technology That Should Save You
Large language models use a mechanism called KV Cache (Key-Value Cache) to avoid redundant computation. When a model processes a sequence of tokens, it computes intermediate attention states (the "keys" and "values" in the transformer architecture's attention mechanism) that are required to generate the next token. If the beginning of your input sequence is identical to a sequence the model has recently processed, those attention states can be reused rather than recomputed.
This is a big deal. Recomputing the attention states for 2,500 redundant tokens might consume 30-40% of the total GPU compute for an inference request, so a provider that reuses them can afford to discount cached input. For GPT-4o mini, OpenAI charges $0.075 per million tokens for cached input versus $0.150 for regular input, a 50% discount. Anthropic is even more aggressive: cached input on Claude is priced at roughly 10% of the full input price for long-context prompts. DeepSeek's pricing structure similarly differentiates between "cached" and "uncached" input tokens.
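To make the billing split concrete, here is a minimal sketch of how one request's input cost depends on how many of its tokens were served from cache. It uses the $0.150 and $0.075 per-million prices quoted above. The function name and the 4,000 / 2,500 token split are assumptions carried over from the earlier example, and real invoices may round or bucket differently.

```python
def input_cost_usd(
    prompt_tokens: int,
    cached_tokens: int,
    price_per_million: float,
    cached_price_per_million: float,
) -> float:
    """Input cost of one request, billing cached and uncached tokens separately."""
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached_tokens = prompt_tokens - cached_tokens
    total = (
        uncached_tokens * price_per_million
        + cached_tokens * cached_price_per_million
    )
    return total / 1_000_000


if __name__ == "__main__":
    hit = input_cost_usd(4_000, 2_500, 0.150, 0.075)   # static prefix served from cache
    miss = input_cost_usd(4_000, 0, 0.150, 0.075)      # same request, cache miss
    print(f"hit: ${hit:.7f}  miss: ${miss:.7f}  ratio: {hit / miss:.2%}")
```

The same request costs noticeably less on a hit than on a miss, so the hit rate, not the list price alone, determines the bill.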
In a perfectly transparent world, every developer could simply look at their API dashboard, see their cache hit rate, and verify that they are being billed correctly.
The Black Box: How Cache Hit Rates Are Manipulated
Southern Metropolis Daily's investigation, based on interviews with 17 enterprise AI teams and an analysis of API billing data across three major Chinese LLM providers, identified four distinct mechanisms by which service providers quietly suppress cache hit rates—and thereby silently inflate their customers' bills.
1. Aggressive Cache Eviction Policies
The most common tactic is also the simplest: reduce the time window during which cached entries remain valid.
KV Cache entries are stored in GPU memory (VRAM), which is expensive and capacity-constrained. When a provider wants to cut costs, one of the first things they trim is cache retention time. Southern Metropolis Daily found that one major provider had quietly reduced its cache TTL (time-to-live) from 10 minutes to 60 seconds between December 2025 and March 2026—with no public announcement and no entry in the changelog.
The impact is devastating for agentic workloads. Agents often introduce delays between steps: waiting for a function result, polling an asynchronous task, or simply pacing their requests to avoid rate limits. A 30-second or 60-second gap between two API calls with identical system prompts is entirely normal in production agent systems. With a 60-second TTL, any two calls spaced 60 seconds or more apart will never hit the cache, even though their prompts are identical. The developer sees a "cache miss" and pays full price, entirely unaware that the cache entry was evicted.
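The interaction between step spacing and TTL can be sketched with a small simulation. This is a simplified model, not any provider's actual eviction logic. It assumes a single cached prefix, that an entry expires once its age reaches the TTL, and (optionally) that a hit refreshes the entry's timestamp.

```python
from typing import Iterable


def count_cache_hits(
    call_times_s: Iterable[float], ttl_s: float, refresh_on_hit: bool = True
) -> int:
    """Count calls that find the shared prefix still cached under a simple TTL model."""
    hits = 0
    cached_at = None  # time the entry was last written (or refreshed)
    for t in call_times_s:
        if cached_at is not None and (t - cached_at) < ttl_s:
            hits += 1
            if refresh_on_hit:
                cached_at = t  # assumption: a hit extends the entry's life
        else:
            cached_at = t  # miss: prefix is recomputed and cached again
    return hits


if __name__ == "__main__":
    steps_every_60s = [0, 60, 120, 180]
    print(count_cache_hits(steps_every_60s, ttl_s=600))  # 10-minute TTL
    print(count_cache_hits(steps_every_60s, ttl_s=60))   # 60-second TTL
```

Under these assumptions, an agent that takes one step per minute hits the cache on every call after the first with a 10-minute TTL, and on none of them with a 60-second TTL.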
2. Capacity-Constrained Caching (The "Peak Hour" Problem)
GPU VRAM is the scarcest resource in AI infrastructure. Storing KV Cache entries consumes this memory whether or not those entries are currently being used for inference. During peak traffic periods—weekday mornings in the US, evening hours in China—providers face an ugly choice: allocate VRAM to active inference batches, or preserve cached states for potential future cache hits.
Multiple developers interviewed by Southern Metropolis Daily reported a striking pattern: their cache hit rates were consistently 20-30 percentage points lower during peak business hours than during off-peak periods, with no change to their actual prompting patterns. One enterprise team tracking hourly cache hit rates found that their 9:00 AM agent workflows hit the cache at 22%, while identical workflows at 2:00 AM hit at 78%.
The economic implication is clear: you are being silently upcharged precisely when you are most actively using the service. The provider is not "going down"; it is simply discarding your cached context to free up VRAM for incoming requests. Your subsequent calls are billed at the full rate, and the provider captures the full margin.
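A team that wants to check for this pattern in its own traffic can aggregate its usage logs by hour. The sketch below assumes each log record carries the hour of day, the prompt token count, and the cached token count. That record shape is hypothetical, and "hit rate" here means cached tokens divided by prompt tokens, which is one of several reasonable definitions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[int, int, int]  # (hour_of_day, prompt_tokens, cached_tokens)


def hit_rate_by_hour(records: Iterable[Record]) -> Dict[int, float]:
    """Token-weighted cache hit rate for each hour that has any traffic."""
    prompt_totals: Dict[int, int] = defaultdict(int)
    cached_totals: Dict[int, int] = defaultdict(int)
    for hour, prompt_tokens, cached_tokens in records:
        prompt_totals[hour] += prompt_tokens
        cached_totals[hour] += cached_tokens
    return {
        hour: cached_totals[hour] / prompt_totals[hour]
        for hour in sorted(prompt_totals)
        if prompt_totals[hour] > 0
    }


if __name__ == "__main__":
    # Made-up sample data shaped to mirror the 22% vs 78% pattern described above.
    sample = [(9, 4_000, 1_000), (9, 4_000, 760), (2, 4_000, 3_000), (2, 4_000, 3_240)]
    for hour, rate in hit_rate_by_hour(sample).items():
        print(f"{hour:02d}:00  {rate:.0%}")
```

A persistent gap between peak and off-peak hours, with no change in prompts, is the signal the interviewed teams described.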
3. Endpoint-Level Cache Isolation
If you are building a sophisticated agent system in 2026, you are almost certainly using multiple models for different subtasks. You might use a fast, cheap model (like GPT-4o mini or DeepSeek V3) for classification and routing, and a more capable model (like Claude Opus or GPT-4.5) for complex reasoning and generation.
Here is the problem: cache entries are typically isolated by model endpoint. If your routing logic sends the same system prompt to gpt-4o-mini and then to gpt-4o, the second call will not benefit from the cache entry created by the first—even if the actual system prompt text is byte-for-byte identical. The KV Cache is keyed to the model weights; different models mean different attention states, which means no cache hits across model boundaries.
This makes sense from a purely technical standpoint.
4. Opaque and Unauditable Cache Reporting
Perhaps the most fundamental problem is that most LLM API providers do not give you granular cache hit rate data. OpenAI's dashboard shows you "cached_tokens" as a line item on your invoice. Can you see the cache hit rate for individual requests? No. Can you see when cache entries were evicted, or why a particular request missed the cache? No. Can you reproduce the cache behavior by replaying the same sequence of requests in a test environment? Almost never, because cache behavior depends on global server-side state (what else was running on that GPU node at that moment) that you cannot control or observe.
This is not a technical limitation. It is a transparency choice. Providers could emit detailed cache tracing information (which requests hit, which missed, and why). They could offer a "cache debugging mode" that guarantees cache retention for a test workload.
The Agentic Cost Trap: Why This Hurts Agents the Most
The cache transparency crisis matters more in 2026 than it did in 2024 because the dominant use case for LLMs has fundamentally changed.
In 2024, the stereotypical LLM workload was conversational: a user chats with a chatbot, each message adds to the context, and the context is (mostly) unique with every turn. Cache hit rates mattered less because there was less to cache.
In 2026, the stereotypical LLM workload is agentic: an autonomous loop that plans, acts, observes, and repeats—often for 10, 50, or 100 steps per task. And in these agentic workloads, the redundant context fraction is enormous.
Let's quantify this. A typical agentic system prompt in 2026 might include:
- Role definition: 200-500 tokens describing the agent's identity and objectives
- Tool definitions: 1,000-3,000 tokens describing available functions (often including complex JSON schemas)
- Behavioral constraints: 500-1,000 tokens of rules, guidelines, and safety instructions
- Examples and few-shot demonstrations: 500-2,000 tokens of illustrative examples
- Context from previous steps: 1,000-5,000 tokens of accumulated state
Of these, the first four categories are completely static: they do not change across steps or across tasks. In a well-designed agent, they might account for 3,000-5,000 tokens of a 6,000-token input. That's roughly 50-83% redundant context at every single step.
If cache hit rates were transparent and high, this would be a solved problem. Your static context would be cached after the first call, and you'd pay the discounted cached rate for the remaining 99 calls in your agent's workflow. Your effective input cost would be perhaps 20-30% of the nominal full-price cost.
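The effective-cost claim can be written as a small formula. This is a sketch under stated assumptions: every call has the same size, the first call always misses, only the static part of the prompt is cacheable, and `hit_rate` is the chance that a later call finds it cached. The parameter names are illustrative.

```python
def effective_cost_ratio(
    static_fraction: float,
    cached_price_ratio: float,
    hit_rate: float,
    num_calls: int,
) -> float:
    """Input cost of a workflow relative to paying full price for every token."""
    if num_calls < 1:
        raise ValueError("num_calls must be at least 1")
    first_call = 1.0  # nothing is cached yet
    # Later calls: the dynamic part is always full price; the static part is
    # discounted on a hit and full price on a miss.
    static_cost = hit_rate * cached_price_ratio + (1.0 - hit_rate)
    later_call = (1.0 - static_fraction) + static_fraction * static_cost
    return (first_call + (num_calls - 1) * later_call) / num_calls


if __name__ == "__main__":
    # 80% static context, cached tokens at 10% of list price, 100 steps.
    for rate in (1.0, 0.78, 0.22, 0.0):
        ratio = effective_cost_ratio(0.8, 0.1, rate, 100)
        print(f"hit rate {rate:.0%}: {ratio:.1%} of nominal cost")
```

With 80% static context and a 10% cached price, a perfect hit rate lands near the 20-30% range mentioned above, while a low hit rate leaves most of the nominal cost in place.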
In an agentic world, your system prompt is not a line item. It is a tax that compounds with every step.

The Token Black Hole: Agent Capability Inflation
There is a second, compounding force driving up token consumption for agentic systems: capability inflation.
In 2024, an agent might have been capable of a 5-step workflow: search → read → summarize → draft → publish. Each step consumed perhaps 2,000 tokens. Total: 10,000 tokens per task.
In 2026, that same agent has been upgraded. It now performs: topic research (with web search) → competitive analysis → outline generation → section drafting (with RAG retrieval) → image generation prompt creation → image generation (multimodal) → quality assessment → revision → SEO optimization → publication → social media snippet generation. That's an 11-step workflow, and each step now uses 5,000-8,000 tokens (because the agent's prompts have become more sophisticated, the tool definitions have expanded, and the context carried forward has grown).
Total: 55,000-88,000 tokens per task. And because cache hit rates are low (thanks to the opaque practices described above), 80% of those tokens are billed at the full rate.
This is what I call the Token Black Hole effect: as agents become more capable, they consume tokens at a superlinear rate, and the fraction of those tokens that are redundant context also grows (because more steps means more repeated system prompts and tool definitions).
Breaking Free: Local Orchestration and Smart Caching
Faced with this structural opacity, enterprises and sophisticated developers are beginning to pursue two strategies for reclaiming control over their token costs.
Strategy 1: Local Agent Orchestration
The first strategy is architectural: move the agent orchestration layer off the cloud API and onto local infrastructure.
In a typical cloud-native agent architecture, every step of the agent's reasoning loop is a separate API call to the LLM provider. The provider sees each call as an independent request, with no awareness of the agent's broader workflow. This is one reason cache hit rates are low and hard to predict.
The alternative is to run the orchestration logic locally, on a device like the Kaihe A1 or B1, which are ARM-based, low-power "agent computers" designed to run 24/7. In this architecture, the LLM API is used only for inference, the actual reasoning step. Everything else (prompt assembly, context management, tool routing, state management) happens locally.
This architectural shift enables several cost optimizations that are simply impossible when you rely entirely on a cloud provider's black-box caching:
1. Prompt Templating and Differential Sending
Instead of sending the full system prompt + tool definitions + context with every API call, a locally-orchestrated agent can maintain a prompt template library on the local device. Before each API call, the orchestration layer assembles the minimal prompt needed for that specific inference step: it omits static context the step does not need (for example, definitions of tools that cannot be invoked at that step) and includes only the relevant dynamic elements (the current task state, the specific tool being invoked, etc.).
This can reduce input token counts by 40-60% before caching is even considered. And because the local orchestration layer controls the assembly, these optimizations can be systematically applied across the entire agent workflow.
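A minimal sketch of per-step prompt assembly follows. The system prompt text, the tool schemas, and the step-to-tool mapping are all hypothetical; only the tool names come from the earlier example. Payload size is measured in characters as a rough proxy, where a real system would count tokens with the model's tokenizer.

```python
import json

# Hypothetical static context for illustration.
SYSTEM_PROMPT = "You are an SEO content agent. Follow the editorial guidelines."
TOOL_DEFINITIONS = {
    "search_web": {"description": "Search the web.", "parameters": {"query": "string"}},
    "generate_image": {"description": "Create an image.", "parameters": {"prompt": "string"}},
    "publish_article": {"description": "Publish to the CMS.", "parameters": {"html": "string"}},
}
# Assumed mapping from workflow step to the tools that step can actually call.
STEP_TOOLS = {
    "outline": ["search_web"],
    "image_prompt": ["generate_image"],
    "publish": ["publish_article"],
}


def assemble_prompt(step: str, task_state: str, minimal: bool = True) -> dict:
    """Build the request payload for one step, optionally trimming unused tools."""
    names = STEP_TOOLS[step] if minimal else list(TOOL_DEFINITIONS)
    tools = [{"name": name, **TOOL_DEFINITIONS[name]} for name in names]
    return {"system": SYSTEM_PROMPT, "tools": tools, "input": task_state}


def payload_chars(payload: dict) -> int:
    """Rough size proxy for comparing payloads."""
    return len(json.dumps(payload))


if __name__ == "__main__":
    trimmed = assemble_prompt("publish", "Draft v3 approved")
    full = assemble_prompt("publish", "Draft v3 approved", minimal=False)
    print(payload_chars(trimmed), "vs", payload_chars(full), "characters")
```

One design trade-off to keep in mind: varying the tool list from step to step changes the prompt prefix, which can reduce provider-side cache hits, so trimming is likely to pay off mainly when those hits are unreliable anyway.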
2. Semantic-Level Caching
KV Cache at the provider level operates on exact token-prefix matching. If the start of your prompt is token-for-token identical to a previously cached request, you get a cache hit on that shared prefix. If you've added a single space or changed one word, everything from that point onward is a cache miss.
A local caching layer can be far more sophisticated. By using embedding-based similarity matching, a local cache can recognize that two requests are semantically equivalent even if their exact token sequences differ. If your agent has previously handled a very similar task, the local cache can return the cached result (or a partially cached result) without making any API call at all.
This is a form of caching that no cloud provider can offer—because no cloud provider has visibility into your agent's semantic intent. Only a local orchestration layer can build this capability.
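The idea can be sketched in a few lines. To stay self-contained, this example uses a toy bag-of-words vector in place of a real embedding model, so it only catches differences in case, spacing, and punctuation. The class name, the 0.9 threshold, and the linear scan are illustrative assumptions, not a production design.

```python
import math
import re
from collections import Counter
from typing import List, Optional, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    """Return a stored response when a new prompt is similar enough to an old one."""

    def __init__(self, threshold: float = 0.9) -> None:
        self.threshold = threshold
        self.entries: List[Tuple[Counter, str]] = []

    def get(self, prompt: str) -> Optional[str]:
        query = embed(prompt)
        best = max(self.entries, key=lambda e: cosine(query, e[0]), default=None)
        if best is not None and cosine(query, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((embed(prompt), response))


if __name__ == "__main__":
    cache = SemanticCache()
    cache.put("Classify this support ticket: my invoice is wrong", "billing")
    print(cache.get("classify this support ticket:  My invoice is WRONG!"))
    print(cache.get("Write a poem about autumn"))
```

The threshold is the key risk: set too low, the cache returns an answer to a different question, so a false hit costs correctness rather than money.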
3. Multi-Model Cache Sharing
As noted earlier, cloud providers typically isolate caches by model endpoint. A local orchestration layer can break this barrier. By maintaining a model-agnostic cache of inference results (stored as embeddings or structured outputs rather than raw token states), the local layer can serve cached results regardless of which model generated them. If your agent previously used GPT-4o-mini to classify a query, and now wants to use Claude Haiku for the same classification, the local cache can recognize the equivalence and skip the redundant call.
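One way to picture the difference is in how the cache key is built. The sketch below contrasts a key that includes the model name with one that does not. The class, the `task` label, and the whitespace/case normalization are assumptions for illustration, and `call_model` is a placeholder for whatever client function the application already uses.

```python
import hashlib
from typing import Callable, Dict


def _normalize(text: str) -> str:
    """Collapse whitespace and case so trivially different inputs share a key."""
    return " ".join(text.lower().split())


def model_scoped_key(model: str, task: str, text: str) -> str:
    """Key that changes whenever the model changes (how provider caches behave)."""
    raw = f"{model}\n{task}\n{_normalize(text)}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def model_agnostic_key(task: str, text: str) -> str:
    """Key that ignores which model produced the result."""
    raw = f"{task}\n{_normalize(text)}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class ModelAgnosticCache:
    """Cache task results so a second model can reuse the first model's answer."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}

    def run(self, model: str, task: str, text: str,
            call_model: Callable[[str, str], str]) -> str:
        key = model_agnostic_key(task, text)
        if key not in self._results:
            self._results[key] = call_model(model, text)
        return self._results[key]
```

Reusing one model's output for another only makes sense for tasks where the models are expected to agree, such as simple classification. For open-ended generation the model-scoped key is the safer default.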

Strategy 2: Self-Hosted Cache Middleware
For organizations that cannot fully localize their agent orchestration (perhaps because they rely on managed agent platforms or serverless inference), a second strategy is emerging: cache middleware.
Cache middleware sits between your application and the LLM API. It intercepts outgoing requests, checks whether an equivalent request has been made before, and either serves a cached response or forwards the request to the API and caches the result.
Unlike provider-side KV Cache (which operates at the attention-state level and is opaque), local cache middleware operates at the request-response level and is fully transparent. You can see exactly what is being cached, why cache hits and misses occur, and precisely how much money the cache is saving you.
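A minimal sketch of request-level middleware with visible hit and miss counters might look like this. The class and method names are hypothetical, and `call_llm` stands in for the application's existing client call. Requests are assumed to be JSON-serializable dictionaries.

```python
import hashlib
import json
from typing import Any, Callable, Dict


class ExactMatchCache:
    """Wrap an LLM call, serve repeated requests locally, and count what happened."""

    def __init__(self, call_llm: Callable[[Dict[str, Any]], str]) -> None:
        self._call_llm = call_llm
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(request: Dict[str, Any]) -> str:
        # Canonical JSON so that key order does not change the cache key.
        canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def complete(self, request: Dict[str, Any]) -> str:
        key = self._key(request)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = self._call_llm(request)
        self._store[key] = response
        return response

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Because the counters live in your own process, the hit rate is something you can log, graph, and audit. Caching whole responses fits best when repeated requests are expected to produce the same answer, for example with deterministic sampling settings.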
Several open-source projects have emerged to fill this niche. Projects like LangChain Cache, GPTCache, and Semantic Cache provide embeddable caching layers that can be integrated into any LLM application with minimal code changes. These tools typically use one of two strategies:
- Exact match caching: If the exact same prompt (including system prompt, messages, and parameters) has been sent before, return the cached response. This is simple and reliable.
- Semantic match caching: Compute an embedding for the input prompt, check if a "close enough" embedding exists in the cache, and if one does, return the cached response.
For agentic workloads, semantic match caching is usually the better choice, because agent prompts are often mostly static (the system prompt and tool definitions) with a small dynamic component (the current task state). Exact match caching would miss these near-duplicates; semantic matching can catch them.
A local cache doesn't just save you money. It gives you the one thing the cloud provider won't: proof.
KaiheAiBox: Purpose-Built for the Local-First Agent Era
The architectural shift toward local orchestration is not just a cost optimization—it is a strategic repositioning of where intelligence lives in an AI system.
In the cloud-centric model that dominated 2023-2025, the LLM provider was the sun around which everything orbited. Your agent's logic, state, memory, tools, and orchestration all passed through the provider's API. You were fully dependent on their pricing, their caching policies, their rate limits, and their opaque infrastructure decisions.
In the local-first model that is emerging in 2026, the LLM is a component, not the center. The local device—the "agent computer"—is the orchestration hub. It manages workflow state, handles caching, optimizes token usage, and selectively calls the cloud API only when inference is actually needed.
KaiheAiBox (铠盒) is designed explicitly for this local-first agent architecture. The A1 and B1 devices are ARM-based, low-power computers that run agent orchestration software 24/7. The key insight behind the KaiheAiBox design is that agent orchestration is not computationally expensive; it is architecturally complex. You do not need a GPU to manage agent state, assemble prompts, or check a local cache. You need a reliable, always-on, low-power device that can maintain persistent connections, manage local state, and make intelligent decisions about when to call the cloud API and when to serve from cache.
By running this orchestration layer on a local device, enterprises achieve three things simultaneously:
- Cost control: Local caching and prompt optimization can reduce effective token costs by 50-70%, even after accounting for the cost of the local hardware.
- Transparency: Every token sent to the API, every cache hit, and every dollar spent is fully observable and auditable.
- Independence: The agent's core logic and state live on infrastructure you control. You are not locked into a single provider's caching policies or pricing structure.

The Broader Implication: Transparency as Infrastructure
The cache hit rate controversy is not really about caching. It is about whether the infrastructure layer of the AI economy will be transparent or opaque, auditable or black-boxed, user-controlled or provider-controlled.
In the early days of cloud computing (the 2010s), a similar dynamic played out around cloud cost transparency. Enterprises moving to AWS or Azure discovered that their cloud bills were far higher than expected, and that the pricing models were complex and difficult to audit. The response was a wave of cloud cost management tools (like Cloudability, CloudHealth, and later in-house FinOps teams) that gave enterprises visibility and control over their cloud spending.
We are now at a similar inflection point for AI. The "cloud cost management for LLMs" wave is just beginning. Cache transparency is one piece of it; there will be others (like reasoning token optimization, multimodal token accounting, and agent workflow cost attribution).
The companies that treat AI cost management as a first-class architectural concern—rather than an afterthought to be dealt with when the bill arrives—will have a substantial and compounding competitive advantage. They will be able to deploy more capable agents, on more tasks, at a lower unit cost than their competitors. Over time, this advantage will compound: lower costs enable more experimentation, which yields better agents, which delivers more value.
The winners of the AI era won't be the ones with the biggest models. They'll be the ones with the smartest caches.
Conclusion: The Price of Transparency
The LLM price war of 2025-2026 has delivered real value to developers and enterprises. Token prices have fallen dramatically, and access to state-of-the-art AI has never been more democratic.
But price cuts are not the same as cost cuts. And cost cuts are not the same as transparency.
Until LLM providers give developers granular, auditable, real-time visibility into cache hit rates, cache eviction policies, and the factors affecting cache performance, the claimed "price cuts" will remain a marketing story rather than an economic reality. Developers building the next generation of agentic AI deserve better.
The rise of local agent orchestration, exemplified by devices like KaiheAiBox, is not a rejection of cloud AI. It is a correction. It is a recognition that the optimal architecture for agentic AI is not "everything in the cloud," but a hybrid of cloud inference and local orchestration. In this hybrid architecture, the cloud does what it does best (massive-scale inference), and the local device does what it should do (intelligent orchestration, caching, and cost optimization). The result is an AI system that is more capable, more cost-effective, and more transparent than either approach alone.
The cache hit rate controversy has pulled back the curtain on one of the AI economy's open secrets: the gap between listed prices and real costs is where the real margin lives. Closing that gap—through transparency, local control, and intelligent architecture—is the work of this era.
And it starts with asking your LLM provider a simple question: "What is my cache hit rate, and why did it change?"
If they can't answer that question with precision and transparency, you have your answer about who is really paying for the AI future.
KaiheAiBox | Agentaibox that lets AI work for you 24/7 · AI Frontier