Google Unleashed Two "Small Models" Last Night — 4× Faster Than GPT-5, and I'm Ready to Switch
Summary: At I/O 2026, Google unveiled Gemini 3.5 Flash and Gemini Omni. The 3.5 Flash achieves flagship-level performance with a lightweight profile, delivering output speeds four times faster than comparable models from OpenAI and Anthropic. Gemini Omni shatters single-modality boundaries, enabling "any-input-to-any-output" full-modal reasoning — its lightweight variant, Omni Flash, even supports video generation output. The release of these two models marks the shift in AI large-model competition from a "parameter arms race" to a new track of "efficiency and multimodality."
I/O 2026: Google Skips the Parameter Pissing Contest and Goes for Speed
Every year, Google I/O serves as a bellwether for the AI industry, but the 2026 edition felt different. For the past three years, large-model launch events have essentially revolved around a single theme: who has more parameters and who scores higher on benchmarks. From GPT-4 to GPT-5, from Claude 3 to Claude 4, each generation was about topping the leaderboard.
But this time Google chose a different path.
Sundar Pichai spent less than 15 minutes of his keynote introducing the new models, and most of that time was devoted to one thing: speed. Gemini 3.5 Flash's output speed reaches 180 tokens per second, while contemporary models from OpenAI (GPT-5) and Anthropic (Claude 4.1) operate in the 40–50 tokens-per-second range. A gap of roughly 4× is not a marginal laboratory advantage; it is a qualitative shift that users can actually feel.
Why is speed so important? Because the use cases for large models are undergoing a fundamental transformation.
In 2024, most users interacted with AI by "asking a question and waiting for an answer." The difference between 10 seconds and 30 seconds was not fatal. But by 2026, AI has been embedded into workflows — code completion, real-time translation, multi-turn dialogue, Agent orchestration — and in these scenarios, response speed directly determines whether AI can keep pace with human thought. The 4× speed advantage of 3.5 Flash means it can deliver genuine real-time interaction rather than making users "wait for it to finish thinking."
Gemini 3.5 Flash: The Lightweight Model That Punches in the Flagship Class
Gemini 3.5 Flash's positioning is clear: do more with fewer parameters.
2.1 Core Parameters and Performance
Based on benchmark data published by Google, Gemini 3.5 Flash's performance across mainstream evaluations is as follows:
| Benchmark | Gemini 3.5 Flash | GPT-5 | Claude 4.1 |
|---|---|---|---|
| MMLU | 92.1% | 93.4% | 92.8% |
| HumanEval | 89.7% | 91.2% | 90.1% |
| MATH | 78.3% | 82.1% | 80.5% |
| Output Speed | 180 tok/s | 48 tok/s | 45 tok/s |
| Time to First Token | 0.12 s | 0.35 s | 0.31 s |
As the numbers show, 3.5 Flash does trail GPT-5 and Claude 4.1 on pure accuracy metrics, but the gap to GPT-5 is small: 1.3 percentage points on MMLU and 1.5 points on HumanEval, widening to 3.8 points on MATH. Meanwhile, its output speed is 3.75× that of GPT-5 and 4× that of Claude 4.1.
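The ratios and gaps can be recomputed directly from the table. The short sketch below only restates the published figures; the dictionary layout and function names are illustrative choices, not anything from Google.

```python
# Figures copied from the benchmark table above.
RESULTS = {
    "Gemini 3.5 Flash": {"MMLU": 92.1, "HumanEval": 89.7, "MATH": 78.3, "tok_per_s": 180},
    "GPT-5": {"MMLU": 93.4, "HumanEval": 91.2, "MATH": 82.1, "tok_per_s": 48},
    "Claude 4.1": {"MMLU": 92.8, "HumanEval": 90.1, "MATH": 80.5, "tok_per_s": 45},
}
FLASH = "Gemini 3.5 Flash"


def speedup(rival: str) -> float:
    """How many times faster 3.5 Flash emits tokens than the rival."""
    return RESULTS[FLASH]["tok_per_s"] / RESULTS[rival]["tok_per_s"]


def accuracy_gap(rival: str, benchmark: str) -> float:
    """Rival's lead over 3.5 Flash in percentage points (one decimal)."""
    return round(RESULTS[rival][benchmark] - RESULTS[FLASH][benchmark], 1)


for rival in ("GPT-5", "Claude 4.1"):
    gaps = {b: accuracy_gap(rival, b) for b in ("MMLU", "HumanEval", "MATH")}
    print(f"{rival}: Flash is {speedup(rival):.2f}x faster, accuracy lead {gaps}")
```

With the table's numbers, the speedup works out to 3.75× against GPT-5 and 4.00× against Claude 4.1, which is where the headline "4×" comes from.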
2.2 Why Can It Be This Fast?
Google's technical blog revealed several key design decisions behind 3.5 Flash:
Deep optimization of sparse MoE architecture. 3.5 Flash employs an improved Mixture-of-Experts architecture, but unlike traditional MoE, Google implemented "early pruning" in the routing mechanism — during inference, only about 12% of parameters are activated, and routing decisions are completed in the first two layers, avoiding the latency accumulation caused by deep-layer routing.
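To make the idea concrete, here is a minimal sketch of generic top-k expert routing in which the routing decision is made once and reused by every layer. This is one plausible reading of "early pruning," not Google's implementation; the function names, the toy experts, and the 2-of-16 configuration (12.5% active, close to the quoted 12%) are all assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([gate_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def forward_shared_routing(x, layers, gate_logits, k):
    """Route once, then reuse the same experts in every layer.

    `layers` is a list of layers; each layer is a list of expert callables.
    Experts that were not selected are never called.
    """
    route = route_top_k(gate_logits, k)  # decided up front, not per layer
    for experts in layers:
        x = sum(weight * experts[i](x) for i, weight in route)
    return x


def active_fraction(k, num_experts):
    """Share of expert parameters touched per token (equal-sized experts)."""
    return k / num_experts


# Toy setup: 16 experts per layer, 2 active per token.
NUM_EXPERTS, TOP_K = 16, 2
toy_layer = [lambda v, s=s: v * s for s in range(1, NUM_EXPERTS + 1)]
```

The point of the sketch is the cost profile: the gate runs once, and each layer evaluates only `TOP_K` experts instead of all `NUM_EXPERTS`.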
KV Cache compression. Google introduced a novel attention cache compression algorithm that reduces memory usage in long-context scenarios by 60% while keeping information loss below 1%. At that ratio, a 128K context needs about as much cache as an uncompressed 51K context, so 3.5 Flash can consume less memory handling 128K contexts than a 32K-context model whose per-token cache footprint is more than 1.6× larger.
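A quick back-of-the-envelope check shows when the 128K-versus-32K comparison holds. The sketch uses the standard uncompressed KV cache size formula; the layer counts, head counts, and head dimension are hypothetical, since Google has not published 3.5 Flash's shape.

```python
def kv_cache_bytes(seq_len, layers, kv_heads, head_dim, bytes_per_value=2):
    """Uncompressed KV cache size: keys and values for every layer and token."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


GIB = 2**30
COMPRESSION = 0.60  # the 60% reduction quoted above

# Hypothetical model shapes, chosen only for illustration (16-bit values).
flash_128k = kv_cache_bytes(128 * 1024, layers=48, kv_heads=8, head_dim=128) * (1 - COMPRESSION)
same_shape_32k = kv_cache_bytes(32 * 1024, layers=48, kv_heads=8, head_dim=128)
wide_cache_32k = kv_cache_bytes(32 * 1024, layers=48, kv_heads=32, head_dim=128)

for label, size in [
    ("128K, compressed 60%", flash_128k),
    ("32K, same shape, uncompressed", same_shape_32k),
    ("32K, 4x as many KV heads", wide_cache_32k),
]:
    print(f"{label}: {size / GIB:.1f} GiB")
```

Under these assumed shapes, the compressed 128K cache is about 9.6 GiB, larger than a same-shape 32K cache (6 GiB) but well under a 32K cache for a model that stores four times as many KV heads (24 GiB). The comparison therefore depends on the other model's per-token footprint, not on compression alone.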
Quantization-friendly training. 3.5 Flash was designed for INT8 deployment from the training phase onward, rather than being quantized after training. This "quantization-aware training" means the INT8 version is nearly lossless, and inference speed gets an additional 30% boost.
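The core trick in quantization-aware training is usually "fake quantization": during the forward pass, weights are rounded to the INT8 grid and immediately converted back to floats, so the model learns to tolerate the rounding error it will see at deployment. The sketch below shows only that forward-pass step with symmetric per-tensor scaling; it is a generic illustration, not Google's recipe, and a real training loop would also need a gradient workaround such as the straight-through estimator.

```python
def fake_quantize_int8(values):
    """Symmetric per-tensor INT8 quantize-then-dequantize.

    Returns floats that lie exactly on the INT8 grid for this tensor.
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)  # nothing to scale
    scale = max_abs / 127  # map the largest magnitude to +/-127
    out = []
    for v in values:
        q = max(-127, min(127, round(v / scale)))  # integer code
        out.append(q * scale)  # back to float
    return out


def max_rounding_error(values):
    """Largest absolute difference introduced by fake quantization."""
    return max(abs(a - b) for a, b in zip(values, fake_quantize_int8(values)))


weights = [0.5, -1.27, 0.003, 1.0]
print(fake_quantize_int8(weights))
print(max_rounding_error(weights))
```

The rounding error per weight is bounded by half a quantization step (`scale / 2`), which is the noise the network is trained to absorb.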
2.3 Real-World Usage Experience
We conducted a series of practical tests with 3.5 Flash:
- Long document summarization: Given a 20,000-word English paper, 3.5 Flash produced a summary of approximately 800 words in 3.2 seconds, while GPT-5 took 11.8 seconds. Human reviewers rated the two summaries as essentially equivalent in quality.
- Code generation: When asked to implement a Python web crawler with error handling, 3.5 Flash produced code with a first-run pass rate of 78%, compared to GPT-5's 82%. The gap exists, but 3.5 Flash's nearly 4× faster response means you can iterate on corrections much more quickly.
- Multi-turn dialogue: In a continuous 20-round dialogue test, 3.5 Flash's response time remained stable between 0.1 and 0.3 seconds, with no noticeable latency fluctuations.
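For readers who want to run comparable measurements, the sketch below times any token stream and reports time to first token and steady-state output rate. It is a generic harness under stated assumptions: `fake_stream` is a hypothetical stand-in for a real streaming client, and this is not necessarily how the tests above were timed.

```python
import time


def measure_stream(stream, clock=time.perf_counter):
    """Return (time_to_first_token_s, tokens_per_second) for an iterable of tokens."""
    start = clock()
    first_token_at = None
    count = 0
    for _ in stream:
        if first_token_at is None:
            first_token_at = clock()
        count += 1
    end = clock()
    if count == 0:
        raise ValueError("stream produced no tokens")
    ttft = first_token_at - start
    decode_time = end - first_token_at
    # Rate over the tokens that arrived after the first one.
    rate = (count - 1) / decode_time if decode_time > 0 else float("inf")
    return ttft, rate


def fake_stream(n_tokens, ttft_s, tok_per_s):
    """Hypothetical stand-in for a streaming API: sleeps to simulate latency."""
    time.sleep(ttft_s)
    for i in range(n_tokens):
        if i:
            time.sleep(1 / tok_per_s)
        yield f"tok{i}"


if __name__ == "__main__":
    ttft, rate = measure_stream(fake_stream(20, ttft_s=0.05, tok_per_s=200))
    print(f"time to first token: {ttft:.3f} s, output rate: {rate:.0f} tok/s")
```

Separating the two numbers matters because a chat turn feels fast mainly when time to first token is low, while long outputs such as summaries are dominated by the output rate.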

Gemini Omni: From "Can Talk" to "Can Make Videos"
If 3.5 Flash is about running fast, Gemini Omni is about going wide.
3.1 The "Any-to-Any" Full-Modal Architecture
Gemini Omni's core selling point can be captured in one word: Any. You can give it an image and have it output audio; give it a video and have it produce an illustrated report; give it text and have it generate a video clip.
This is not simple multimodal stitching. Traditional "multimodal models" are essentially assemblies of unimodal models — an image encoder plus a text decoder, or a speech encoder plus a text generator. Omni's architecture, by contrast, implements a unified tokenization scheme at the foundational level. All modal inputs are mapped into the same semantic space, and a single unified decoder generates outputs in any modality.
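Google has not published Omni's tokenizer, but one common way to put several modalities into a single token space is to give each modality its own codebook and shift its IDs by a fixed offset, so one decoder can read and emit any of them. The sketch below illustrates that idea only; the vocabulary sizes and function names are hypothetical.

```python
# Hypothetical per-modality codebook sizes.
MODALITY_VOCAB_SIZES = {"text": 32000, "image": 8192, "audio": 4096}


def build_offsets(sizes):
    """Assign each modality a contiguous ID range in one shared vocabulary."""
    offsets, next_start = {}, 0
    for name, size in sizes.items():
        offsets[name] = next_start
        next_start += size
    return offsets, next_start


OFFSETS, SHARED_VOCAB_SIZE = build_offsets(MODALITY_VOCAB_SIZES)


def to_shared(modality, local_ids):
    """Map modality-local token IDs into the shared ID space."""
    size = MODALITY_VOCAB_SIZES[modality]
    if any(not 0 <= i < size for i in local_ids):
        raise ValueError(f"{modality} token ID out of range")
    return [OFFSETS[modality] + i for i in local_ids]


def from_shared(token_id):
    """Recover (modality, local_id) from a shared token ID."""
    for name, start in OFFSETS.items():
        if start <= token_id < start + MODALITY_VOCAB_SIZES[name]:
            return name, token_id - start
    raise ValueError("token ID outside the shared vocabulary")


# One interleaved sequence: a text prompt followed by image tokens.
sequence = to_shared("text", [17, 942]) + to_shared("image", [0, 5])
print(sequence, [from_shared(t) for t in sequence])
```

Because every token carries its modality in its ID range, the decoder's output can be split back into text, image, or audio streams without a separate model per modality.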
This means Omni genuinely understands "cross-modal semantic correspondence." When you give it a Beethoven audio clip, it not only knows "this is Beethoven" but can also generate a textual description of the music's emotional arc, or even produce an image that matches the music's mood.
3.2 Omni Flash: The Lightweight Version with Video Output
The full Gemini Omni model has a large parameter count and high deployment costs. Google simultaneously launched Omni Flash — a lightweight version focused on "text + image → video output."
Omni Flash's video generation capabilities cannot go head-to-head with specialized video generation models like Sora or Kling, but its advantage lies in semantic consistency. Because understanding and generation happen within the same model, the generated video aligns highly with the input prompt's semantics. For example, if you input a product photo and a text description, Omni Flash's promotional video maintains high consistency in product details, textual expression, and visual style — no "gorgeous visuals but off-topic content" problem.
This is highly practical for marketing, e-commerce, and education scenarios. You no longer need to separately call an image model, a video model, and a TTS model; one Omni Flash call handles everything from understanding the requirement to producing the finished product.
3.3 The Practical Value of Full-Modal Reasoning
Full modality is not just a show-off feature — it solves a long-standing problem: modality gaps.
For example, in healthcare, doctors need to simultaneously reference CT imagery, medical records, and patient symptom descriptions. Traditional approaches require an image model to analyze the CT, a text model to parse the medical record, and then manual integration of results. Omni can receive all three input types simultaneously, performing cross-modal reasoning within its unified semantic space to deliver a more comprehensive and consistent assessment.
Or consider education: a student uploads a photo of a handwritten math problem. Omni can not only recognize the problem and provide solution steps (text) but also generate a verbal explanation (audio) and even draw dynamic diagrams to aid understanding (video). One interaction, full-modal output.
What Does This Mean for the Industry?
4.1 Large-Model Competition Enters "Phase Two"
The large-model competition of 2023–2024 revolved around a core metric: "who is smarter" — MMLU scores, coding ability, mathematical reasoning. Starting in 2025, "who is cheaper" emerged as a new dimension, with open-source models like DeepSeek and Qwen eroding the market with extremely low inference costs.
In 2026, Google used 3.5 Flash and Omni to answer from a third dimension: who is more flexible. Speed and full modality are not achieved by stacking parameters — they are achieved through architectural design. This signals the "second phase" of large-model competition: transitioning from arms races to elegant engineering.
4.2 Impact on Developers
3.5 Flash's 4× speed advantage directly changes the feasibility envelope for certain applications:
- Real-time voice assistants: Full-duplex conversation, previously impractical due to high latency, now has a technical foundation
- Code assistance: The experience of character-by-character completion shifts from "wait a moment" to "nearly simultaneous," significantly boosting coding efficiency
- Agent orchestration: In multi-Agent collaboration, each Agent's response time shrinks to roughly a quarter, so a sequential pipeline that spent two minutes waiting on the model would spend about 30 seconds
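The effect on a sequential agent pipeline can be estimated with a simple latency model: each step costs time to first token plus output tokens divided by output rate. The speeds and first-token latencies come from the benchmark table earlier in this article; the five-step, 400-tokens-per-step pipeline is a hypothetical workload, and tool calls and network overhead are ignored.

```python
def response_time(output_tokens, ttft_s, tok_per_s):
    """Seconds for one model call: wait for the first token, then stream the rest."""
    return ttft_s + output_tokens / tok_per_s


def pipeline_time(step_tokens, ttft_s, tok_per_s):
    """Total model time when steps run one after another."""
    return sum(response_time(n, ttft_s, tok_per_s) for n in step_tokens)


# Hypothetical pipeline: five agent steps, 400 output tokens each.
STEPS = [400] * 5

flash_s = pipeline_time(STEPS, ttft_s=0.12, tok_per_s=180)
gpt5_s = pipeline_time(STEPS, ttft_s=0.35, tok_per_s=48)

print(f"3.5 Flash: {flash_s:.1f} s, GPT-5: {gpt5_s:.1f} s, ratio {gpt5_s / flash_s:.2f}x")
```

Under these assumptions the pipeline drops from roughly 43 seconds to roughly 12 seconds of model time. The ratio stays a little under 4× because the fixed first-token latency does not shrink in proportion to output speed.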
Omni's full-modal capability lowers a critical barrier: no more stitching multiple models together. Previously, a workflow like "image → analysis → voice narration → video summary" required calling four different APIs, handling format conversion, context passing, and error recovery. Omni packages all of this into a single call.
4.3 Impact on Ordinary Users
The speed improvement is the most immediately perceptible. 3.5 Flash transforms the AI dialogue experience from a "typewriter" to "normal speech tempo" — this experiential gap is more persuasive than any benchmark number.
Omni's value is more subtle but far-reaching. When AI can seamlessly cross the boundaries of text, image, audio, and video, human-computer interaction will evolve from "typing chat" to "multimedia conversation." You show AI something, and AI explains it, illustrates it, performs it — this is the ultimate form of an AI assistant.
The Connection to the Agent Computer
When we discuss AI model speed and multimodal capabilities, we are really discussing the "infrastructure" of the Agent Computer.
The Agent Computer — a 24/7 AI work platform like KaiheAiBox — derives its core value from enabling AI Agents to continuously and autonomously complete complex tasks. Agent efficiency is directly constrained by the underlying model's response speed and perceptual range.
3.5 Flash's 4× speed improvement means an Agent can complete roughly 4× as many task iterations in the same amount of time, or the same deployment can serve proportionally more users. Omni's full-modal capability frees Agents from being limited to "reading and writing text": they can now interpret images, listen to audio, and produce videos. This is a qualitative leap for scenarios like e-commerce operations, content creation, and customer service.
When both speed and modality capabilities break through at the model layer simultaneously, the Agent Computer crosses the threshold from "usable" to "genuinely good."
KaiheAiBox | The Agent Computer for Everyone · AI Frontier Tracker