Nature's 155-Year History: A Chinese Large Model Knocks on the Main Journal's Door for the First Time, and Silicon Valley Falls Silent

Published on: 2026-05-27

Abstract: The paper on "Wujie · Emu3," a multimodal large model from the Beijing Academy of Artificial Intelligence (BAAI), has been published in Nature's main journal, the first time a large model achievement led by a Chinese research institution has earned this distinction. Emu3 uses the most straightforward autoregressive approach, "predict the next token," to unify the generation of images, text, and video, showing that the GPT paradigm can extend beyond language to the wider perceptual world. This is not merely the triumph of a single paper; it is a direct response from China's "low-cost, high-impact" AI paradigm to Silicon Valley's arms race for compute power.


In the spring of 2025, while Silicon Valley was still debating the release date of GPT-5, a paper from Beijing's Zhongguancun quietly appeared in the table of contents of Nature's main journal. "Wujie · Emu3," the multimodal large model from the Beijing Academy of Artificial Intelligence, became the first large model achievement led by a Chinese research institution to be published in Nature's main journal since the journal's founding in 1869.

The news went viral across China's AI community, but on the other side of the Pacific, the response was strikingly quiet.

It wasn't that they hadn't noticed — it was that they didn't know how to respond. Because what Emu3 accomplished was almost absurd: it didn't use diffusion models, it didn't use CLIP alignment, it didn't even deploy some elaborate cross-modal alignment architecture. It did just one thing: predict the next token. And then, image, text, and video generation — all done.

1. The "Stubbornness" of Autoregression: Taking One Path to the End — and Actually Getting There

In 2018, OpenAI released GPT-1, establishing the fundamental paradigm of autoregressive language models: given the preceding context, predict the next token. Over the following five years, from GPT-2 to GPT-4, this paradigm came to dominate natural language processing, but in multimodal generation, mainstream academia consistently deemed it "insufficient."

The reasoning seemed sound: images and video are continuous signals, unlike text, which can naturally be discretized into token sequences. As a result, the most visible breakthroughs in multimodal generation over the past few years have come largely from diffusion models, with Stable Diffusion and Sora as the best-known examples. Autoregression? In the visual domain, it seemed destined to remain a supporting player.

Emu3 shattered this assumption. Its core idea is strikingly elegant: discretize images and video into token sequences as well, then train and generate using the exact same "predict the next token" approach as language models. No iterative denoising from the diffusion process, no ingenious cross-modal alignment design — one model, one training objective, one inference pipeline, covering three modalities.
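The idea described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not Emu3's actual tokenizer or training code: the vocabulary, the begin/end-of-image markers, the one-dimensional codebook, and all function names are hypothetical. It only shows how continuous pixel values can be mapped to discrete codes, merged with text tokens into one sequence, and turned into ordinary next-token training pairs.

```python
# Toy illustration only: NOT Emu3's tokenizer, vocabulary, or training code.
# All ids, names, and the codebook below are hypothetical.
TEXT_VOCAB = {"a": 0, "red": 1, "square": 2}
BOI, EOI = 3, 4                          # hypothetical begin/end-of-image markers
VISUAL_OFFSET = 5                        # visual codes are placed after the text ids
CODEBOOK = [0.0, 0.25, 0.5, 0.75, 1.0]   # hypothetical 1-D codebook of pixel values


def quantize(pixels):
    """Map each continuous pixel value to the index of its nearest codebook entry."""
    return [
        min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - p))
        for p in pixels
    ]


def build_sequence(caption, pixels):
    """Flatten a caption and an image into one token stream with a shared vocabulary."""
    text_ids = [TEXT_VOCAB[word] for word in caption.split()]
    visual_ids = [VISUAL_OFFSET + code for code in quantize(pixels)]
    return text_ids + [BOI] + visual_ids + [EOI]


def next_token_pairs(tokens):
    """Each prefix is paired with the token that follows it, whatever its modality."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


seq = build_sequence("a red square", [0.1, 0.9, 0.52, 0.3])
pairs = next_token_pairs(seq)
print(seq)       # one sequence mixing text ids and visual ids
print(pairs[3])  # the text prefix plus the image marker predicts the first visual token
```

A real system would replace the toy codebook with a learned visual tokenizer and feed the pairs to a transformer trained with cross-entropy loss; the point here is only that text and visual tokens share one sequence and one objective.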

This is not laziness — it is a profound conviction: autoregression may be the shortest path to general intelligence. As the paper argues, human cognition is fundamentally a form of sequential prediction — we read text from left to right, watch scenes frame by frame, and understand the world by inferring the unknown from the known. Emu3 transformed this cognitive intuition into engineering reality.

2. What Nature's Main Journal Means: Not Just Publishing a Paper, But Earning an Admission Ticket

In academia, the prestige of Nature's main journal needs no elaboration. With an impact factor that stood at 64.8 in the 2022 Journal Citation Reports, an acceptance rate of roughly 8%, and notoriously rigorous review standards, it serves as a "gatekeeper" of the natural sciences. But even more crucial is Nature's editorial preference: it rarely favors incremental improvements; it seeks paradigm shifts.

That Emu3 made it into Nature's main journal signals that international academic authorities have endorsed this judgment: extending the autoregressive paradigm to multimodality is not only feasible but may well represent the direction of the future. This matters far more than the content of the paper itself.

For Chinese AI research, this is a landmark moment. Previously, the quantity of Chinese AI papers had long ranked among the world's highest, but in top journals like Nature and Science, large model research achievements led by domestic institutions had been conspicuously absent. Emu3 fills this void, telling the world: Chinese AI possesses not only engineering capability but also the ability to define research paradigms.

3. Resonance with DeepSeek-R1: The Low-Cost Route Is Rewriting the Rules of the Game

Place Emu3 alongside DeepSeek-R1, and you will notice a thrilling trend: Chinese AI is charting a technological path entirely distinct from Silicon Valley's.

Silicon Valley's approach is "brute force through scale": more GPUs, bigger clusters, longer training runs. GPT-4 reportedly used tens of thousands of A100s, with training costs exceeding $100 million. Sora's training scale has not been disclosed, but it is widely assumed to be enormous. The implicit assumption of this approach: compute power is the primary driver of AI progress, and whoever commands the most compute wins.

But neither Emu3 nor DeepSeek-R1 buys into this philosophy.

DeepSeek-R1 reported reasoning performance comparable to OpenAI's o1 while building on a base model, DeepSeek-V3, whose final training run was reported to cost roughly $6 million. Its secret lay not in stacking compute but in an elegant use of reinforcement learning for reasoning, enabling the model to teach itself how to "think."

Emu3 follows similar logic: rather than piling complexity on diffusion models and cross-modal alignment, it returns to the simplest autoregressive framework. The result not only simplifies the architecture but also matches or surpasses specialized models on multiple benchmarks.

This is not coincidence — it is a methodological awakening: in the AI field, architectural innovation delivers more qualitative leaps than compute stacking. While Silicon Valley is still queuing for the next generation of GPUs, Chinese researchers are achieving equivalent or superior results through smarter paths.

4. The Industrial Significance of Unified Architecture: A Bridge from Cloud to Edge

Emu3's unified architecture holds not only academic value but also far-reaching industrial implications.

Previously, deploying a multimodal AI system was a nightmare. To generate images, you needed to deploy a diffusion model; to generate text, a language model; to generate video, yet another video model. Three models, three inference pipelines, three times the computational resources. For large corporations, this may merely be a cost issue, but for small and medium enterprises and edge devices, it is an insurmountable barrier.

Emu3 changes this equation. One model processing three modalities means a single set of weights and a single inference pipeline to deploy; if the three specialized models it replaces are of comparable size, the memory and serving footprint falls to roughly one-third. This is particularly crucial for Agent Computers and similar edge devices.

The core demand of Agent Computers is to bring AI capabilities from the cloud to the local device, so that users can run multimodal AI functions without relying on a network. But the compute power and memory of edge devices are limited: you cannot run three large models simultaneously on a laptop. An Emu3-style unified architecture resolves exactly this contradiction. With one model doing three jobs, edge multimodality moves from theoretical possibility to engineering feasibility.
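The memory argument above can be checked with rough arithmetic. The sketch below is an illustration under loudly stated assumptions, not a measurement of Emu3 or of any device: the model sizes (8 billion parameters each) and the 16-bit weight format are hypothetical, and only weight memory is counted, ignoring activations and caches.

```python
def weight_memory_gb(params_billion, bytes_per_param=2):
    """Approximate weight memory in GB: billions of parameters times bytes per parameter."""
    return params_billion * bytes_per_param


# Hypothetical sizes, chosen only for illustration.
separate_models = {"text": 8, "image": 8, "video": 8}   # billions of parameters each
unified_model = 8                                       # one model covering all three

separate_gb = sum(weight_memory_gb(p) for p in separate_models.values())
unified_gb = weight_memory_gb(unified_model)
ratio = unified_gb / separate_gb

print(f"three separate models: {separate_gb} GB of weights")
print(f"one unified model:     {unified_gb} GB of weights")
print(f"unified / separate:    {ratio:.2f}")
```

Under these assumptions the unified model needs 16 GB of weights against 48 GB for three separate models. The one-third ratio holds only when the replaced models are of similar size; a unified model that has to be larger to match their quality would narrow the saving.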

Experience with the KaiheAiBox Agent Computer points the same way. As multimodal models move toward unification and the threshold for edge deployment continues to fall, an always-on, 24/7 local intelligent agent is no longer just a concept; it is a reality unfolding right now.

5. From "Follower" to "Definer": The Next Decade of Chinese AI

The significance of Emu3's publication in Nature's main journal extends far beyond a single paper.

It marks a shift in identity for Chinese AI research: from technology follower to paradigm definer. Over the past decade, China's AI development model has been "Fast Follower" — Silicon Valley proposes a new architecture, and Chinese teams rapidly replicate and optimize it. This has produced leaps in engineering capability, but questions about originality have persisted.

Emu3 changes this narrative. The autoregressive multimodal unified architecture is not a follower's take on Silicon Valley's path — it is a redefinition of the mainstream technological direction. When Nature's reviewers — the world's top scientists — endorse this direction, the doubts collapse of their own accord.

An even deeper shift lies in research philosophy. The "low-cost, high-impact" route jointly demonstrated by Emu3 and DeepSeek-R1 is fundamentally a different view of AI development from Silicon Valley's: not treating compute power as the primary factor, but prioritizing architectural innovation and training methodology. If this philosophy is validated as successful, it will fundamentally alter the rules of global AI competition — compute advantage will no longer be decisive; innovation capacity will be.

For the next decade of China's AI industry, this means: we do not need to compete with the United States on GPU counts; we need to sustain our lead in architectural innovation. Emu3 has proven this path is viable. The question that remains is simply how far it can go.

While Silicon Valley remains anxious about compute power, Chinese researchers have already offered a different answer through their actions: rather than chasing more GPUs, find a smarter route. Nature's endorsement is merely the first milestone on this path.


KaiheAiBox · AI Frontier