KADC2026: How Kunpeng and Ascend Are Building a Three-Layer Architecture to Accelerate Agent Ecosystem Adoption
Abstract: At the Kunpeng and Ascend Developer Conference 2026 (KADC2026) in Beijing, Huawei presented a comprehensive roadmap for "domestic computing + Agent ecosystem." The centerpiece was a three-layer agent architecture: a bottom layer of Kunpeng supernodes with 24TB unified memory pools, a middle layer built on openEuler with heterogeneous fusion, and an application layer of Agent development tools and frameworks. With CANN fully open-sourced, the Mind series software upgraded, and a developer enablement program promising a working demo in 2 minutes, the message was unambiguous: domestic computing is moving from "it runs" to "it runs well."
There is a particular kind of silence that happens at technology conferences when a speaker says something that reframes the entire discourse. At KADC2026 in Beijing, it came from the rotating CEO of Huawei: "The CPU is no longer a supporting role in computing systems. It is the core scheduler of the Agent era."
To understand the weight of that statement, you have to understand the context of domestic computing in China. For years, the Kunpeng and Ascend product lines have faced a consistent critique: the hardware performance is competitive, but the software ecosystem is not. CUDA has a ten-year head start. Developer habits are hard to change. Framework compatibility is a moving target.
KADC2026's answer was a complete three-layer architecture—from silicon to operating system to application framework—designed to prove that domestic computing can do more than "run models." It can run Agents, efficiently, at scale, with a developer experience that does not feel like a compromise.
Layer 1: Kunpeng Supernodes, Lingqu Interconnect, and the 24TB Unified Memory Pool
The bottom layer of the three-layer architecture is the compute infrastructure, and it is here that KADC2026 delivered its most technically substantial announcements.
Kunpeng Supernodes
A Kunpeng Supernode combines multiple Kunpeng 920 processors through a high-speed interconnect to form a single logical compute node. A single supernode provides thousands of CPU cores optimized specifically for the high-concurrency, low-latency requirements of Agent workflows.
This is not just a scale-up story. Agent workloads are fundamentally different from traditional HPC or model training workloads. They involve many small, latency-sensitive inference calls, frequent tool invocations, and continuous state management. A supernode architecture is well-suited to this access pattern because it minimizes communication overhead between discrete nodes.
Lingqu Interconnect
Lingqu is Huawei's proprietary chip-to-chip interconnect protocol, with bandwidth and latency characteristics designed to compete directly with NVIDIA's NVLink. In multi-chip inference scenarios, interconnect bandwidth is often the dominant factor in model serving efficiency: if weights have to be sharded across chips, every forward pass exchanges partial results between them, so the speed at which those chips communicate bounds both latency and throughput.
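To make the bandwidth argument concrete, here is a minimal sketch of the textbook ring all-reduce cost model, the kind of collective that sharded inference typically relies on. It is not a description of how Lingqu works internally; the chip count, hidden size, link latency, and bandwidth values are all assumptions chosen for illustration.

```python
def ring_allreduce_seconds(message_bytes: float, num_chips: int,
                           link_bandwidth_bytes_per_s: float,
                           link_latency_s: float = 0.0) -> float:
    """Estimate the time of one ring all-reduce across num_chips.

    Standard model: 2 * (n - 1) steps, each moving message_bytes / n
    over one link and paying one link latency.
    """
    if num_chips < 2:
        return 0.0  # nothing to exchange on a single chip
    steps = 2 * (num_chips - 1)
    chunk_bytes = message_bytes / num_chips
    return steps * (link_latency_s + chunk_bytes / link_bandwidth_bytes_per_s)


# Hypothetical workload: one FP16 activation vector per token, 8 chips.
hidden_size = 8192           # assumed model width
bytes_per_value = 2          # FP16
message = hidden_size * bytes_per_value

for bandwidth_gb_s in (50, 200, 800):  # assumed link speeds, not Lingqu specs
    seconds = ring_allreduce_seconds(message, 8, bandwidth_gb_s * 1e9,
                                     link_latency_s=2e-6)
    print(f"{bandwidth_gb_s:>4} GB/s -> {seconds * 1e6:.2f} microseconds per all-reduce")
```

For small per-token messages like this one, the fixed link latency matters as much as raw bandwidth, which is why both figures appear in interconnect comparisons.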
Lingqu's significance extends beyond raw performance. It represents the first time a domestic computing vendor has addressed the "inter-chip communication bottleneck" at the physical layer rather than relying on software workarounds. For Agent workloads that frequently involve multi-model inference (running multiple specialized models in parallel), this architectural choice matters.
The 24TB Unified Memory Pool
This was the number that generated the most immediate excitement among developers at the conference. The memory bottleneck in large model inference is not just about model size—it is about the combination of model weights, KV cache, and intermediate activations that accumulate during inference.
A 1.5T parameter model at FP16 precision requires approximately 3TB just to load the weights. Add KV cache for a long context window (Agent workflows routinely exceed 100K tokens), and the memory requirement can easily exceed 5TB for a single inference session. A 24TB unified memory pool means that multiple large model instances can run concurrently on a Kunpeng supernode without constant memory swapping.
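The arithmetic behind these figures can be sketched in a few lines. The weight calculation follows directly from parameter count times bytes per parameter; the KV cache formula is the usual one for transformer decoders, but the layer count, head count, and head dimension below are hypothetical, since the article does not specify a model shape.

```python
TB = 1e12  # decimal terabytes, matching the article's round numbers


def weight_bytes(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed just to hold the weights (FP16 = 2 bytes per parameter)."""
    return num_params * bytes_per_param


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2,
                   batch_size: int = 1) -> int:
    """KV cache size: one key and one value vector per layer, head, and token."""
    return (2 * num_layers * num_kv_heads * head_dim
            * context_tokens * bytes_per_value * batch_size)


weights = weight_bytes(1.5e12)
print(f"weights: {weights / TB:.1f} TB")  # 3.0 TB, matching the figure above

# Hypothetical model shape, for illustration only.
cache = kv_cache_bytes(num_layers=128, num_kv_heads=64, head_dim=128,
                       context_tokens=128_000, batch_size=4)
print(f"KV cache: {cache / TB:.2f} TB")
print(f"total: {(weights + cache) / TB:.2f} TB")
```

The KV cache term scales linearly with both context length and the number of concurrent sequences, which is why long-context Agent sessions inflate memory use far beyond the weights alone.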
A 24TB memory pool is not a spec-sheet vanity metric. It addresses the most painful constraints in Agent deployments: insufficient context windows, slow model switching, and inadequate memory for concurrent inference.
For Agent developers, the practical implication is significant: you can run a 70B parameter model with a 128K context window, or multiple 7B models in parallel for different Agent specializations, all within a single supernode. The memory ceiling is high enough that most Agent architectures will hit compute bottlenecks before they hit memory bottlenecks.
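As a rough capacity check on the scenario above, the sketch below adds up the footprint of a mixed fleet and compares it with the pool. The per-instance KV cache budgets and the 10% headroom are assumptions, not published figures, and the class and function names are purely illustrative.

```python
from dataclasses import dataclass
from typing import List

POOL_BYTES = 24e12  # the 24TB pool, taken as decimal terabytes


@dataclass
class ModelInstance:
    name: str
    num_params: float
    kv_cache_bytes: float      # assumed per-instance KV cache budget
    bytes_per_param: int = 2   # FP16

    @property
    def footprint(self) -> float:
        """Weights plus KV cache budget, in bytes."""
        return self.num_params * self.bytes_per_param + self.kv_cache_bytes


def fits(instances: List[ModelInstance], pool_bytes: float = POOL_BYTES,
         headroom: float = 0.1) -> bool:
    """True if the fleet fits while keeping a fraction of the pool free."""
    usable = pool_bytes * (1 - headroom)
    return sum(i.footprint for i in instances) <= usable


# One 70B planner with a long context plus eight 7B specialists.
fleet = [ModelInstance("planner-70b", 70e9, kv_cache_bytes=40e9)]
fleet += [ModelInstance(f"specialist-7b-{n}", 7e9, kv_cache_bytes=4e9)
          for n in range(8)]

total_tb = sum(i.footprint for i in fleet) / 1e12
print(f"fleet footprint: {total_tb:.3f} TB, fits: {fits(fleet)}")
```

Under these assumptions the whole fleet uses well under one terabyte, which illustrates why compute rather than memory is likely to be the first limit reached.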
Layer 2: openEuler and Heterogeneous Fusion
The middle layer of the architecture is the openEuler operating system, which received a major update at KADC2026: heterogeneous fusion.
Heterogeneous fusion means that a single operating system kernel now manages compute resources from Kunpeng CPUs, Ascend NPUs, and other accelerators as a unified pool. Developers no longer need to write separate scheduling logic for different chips. You declare that a task requires NPU acceleration, and openEuler automatically handles resource allocation and scheduling.
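The declarative model described here can be illustrated with a toy scheduler. This is a conceptual sketch only: the class names and the `requires` field are hypothetical and do not correspond to any openEuler interface.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class Device:
    name: str
    kind: str                                   # "cpu" or "npu"
    running: List[str] = field(default_factory=list)


@dataclass
class Task:
    name: str
    requires: str                               # declared need, e.g. "npu"


class UnifiedPool:
    """Toy unified pool: callers declare a device kind, the pool picks the device."""

    def __init__(self, devices: Iterable[Device]):
        self.devices = list(devices)

    def submit(self, task: Task) -> Device:
        candidates = [d for d in self.devices if d.kind == task.requires]
        if not candidates:
            raise RuntimeError(f"no {task.requires} device available for {task.name}")
        # Place the task on the least-loaded device of the requested kind.
        target = min(candidates, key=lambda d: len(d.running))
        target.running.append(task.name)
        return target


pool = UnifiedPool([Device("cpu0", "cpu"), Device("npu0", "npu"), Device("npu1", "npu")])
for task in (Task("plan", "cpu"), Task("infer-a", "npu"), Task("infer-b", "npu")):
    print(task.name, "->", pool.submit(task).name)
```

The point of the design is that the caller never names a specific device; placement policy lives in one place and can change without touching Agent code.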
Why This Matters for Agent Developers
Simplified deployment. Previously, deploying an Agent system meant managing CPU-based logic control and NPU-based model inference as two separate scheduling domains. With openEuler's heterogeneous fusion, the operating system abstracts away the hardware heterogeneity. The developer writes Agent logic; the OS handles where each computation runs.
Dynamic elasticity. Agent workloads are bursty. Ten Agents may be running inference simultaneously at 2 PM, and only one Agent may be making simple tool calls at 3 AM. openEuler's fusion scheduler dynamically reallocates NPU resources based on real-time load, avoiding both resource waste and performance bottlenecks.
Fault recovery. In 24/7 Agent deployments, hardware failures are inevitable. openEuler supports hot-swapping of NPUs: when an NPU fails, the model instances running on it are automatically migrated to available NPUs, and the Agent workflow continues without interruption. For enterprise deployments, this is not a nice-to-have—it is a requirement.
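The migration behavior described under fault recovery can be sketched as a simple re-placement step. The function below is a hypothetical illustration of the idea, not openEuler's mechanism: it moves each instance from the failed NPU to whichever surviving NPU currently holds the fewest instances.

```python
from typing import Dict, List


def migrate_on_failure(placement: Dict[str, List[str]],
                       failed: str) -> Dict[str, List[str]]:
    """Return a new placement with the failed NPU's instances redistributed."""
    survivors = {npu: list(instances)
                 for npu, instances in placement.items() if npu != failed}
    if not survivors:
        raise RuntimeError("no healthy NPU left to receive migrated instances")
    for instance in placement.get(failed, []):
        # Pick the least-loaded surviving NPU for each displaced instance.
        target = min(survivors, key=lambda npu: len(survivors[npu]))
        survivors[target].append(instance)
    return survivors


before = {"npu0": ["model-a", "model-b"], "npu1": ["model-c"], "npu2": []}
after = migrate_on_failure(before, failed="npu0")
print(after)
```

A real system would also need to restore in-flight state such as KV caches; this sketch covers only the placement decision.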
Layer 3: Agent Applications and Developer Enablement
The top layer of the architecture is the application framework and toolchain for Agent developers. This was the most immediately practical part of KADC2026, and the part most directly aimed at building an ecosystem rather than just showcasing hardware.
CANN Fully Open-Sourced
CANN (Compute Architecture for Neural Networks) is the operator library and inference acceleration framework for Ascend NPUs. Until KADC2026, CANN was closed-source. The decision to fully open-source CANN—including more than 50 code repositories and 800+ operators—is consequential for several reasons.
First, it allows Agent developers to optimize operator performance for specific Agent scenarios. Agent inference has different characteristics from training: smaller batch sizes, higher variance in sequence length, and frequent dynamic shape changes. With access to the CANN source, developers can tune operator implementations for these patterns.
Second, it enables custom operator contributions. The 800+ operators released in the initial open-source drop cover the computational patterns of mainstream large models (including LLaMA, Qwen, DeepSeek, and others). But Agent workflows increasingly involve custom model architectures. Open-sourcing CANN means the community can contribute operators for novel architectures without waiting for an official release.
Third, it enables local debugging and verification. Previously, developers working with Ascend had to treat CANN as a black box—if an inference result was wrong, you could not easily inspect the intermediate computation. With the source available, debugging precision and trust both increase.
Mind Series Software Upgrades
Three components were upgraded in lockstep:
MindSpore 3.0 introduces native Agent orchestration capabilities, including APIs for task decomposition, tool invocation, and state management. Previously, Agent logic had to be implemented in a separate framework (LangChain, AutoGen, etc.) and call MindSpore models through an API. With MindSpore 3.0, Agent orchestration and model execution are integrated into a single framework, reducing both latency and complexity.
MindSpeed, the training acceleration framework, now supports distributed training across clusters of more than 10,000 NPUs, with a 40% improvement in communication efficiency. For organizations training custom models for Agent applications, this is a meaningful reduction in training time and cost.
MindIE, the inference engine, now supports dynamic batching and continuous batching, delivering a 3x throughput improvement in Agent scenarios. Agent inference is characterized by many small, latency-sensitive requests rather than a few large batch operations. MindIE's upgrades are specifically targeted at this access pattern.
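Why continuous batching helps with many small, uneven requests can be seen in a toy step-count model. This is a general illustration of the technique, not MindIE's implementation: each request is reduced to the number of decode steps it needs, and request lengths are assumed to be positive.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Finished requests free their slot immediately for the next queued one."""
    queue = deque(lengths)
    active: List[int] = []   # remaining decode steps per in-flight request
    steps = 0
    while queue or active:
        # Refill free slots before every decode step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        active = [remaining - 1 for remaining in active if remaining - 1 > 0]
    return steps


# Two long generations mixed with short tool-call style responses.
requests = [8, 1, 1, 1, 8, 1, 1, 1]
print("static:    ", static_batching_steps(requests, batch_size=4))
print("continuous:", continuous_batching_steps(requests, batch_size=4))
```

In the static case the short requests sit idle while each batch waits for its longest member; in the continuous case their slots are reused at once. The actual gain depends on the length distribution, so this toy model should not be read as reproducing the 3x figure.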
The integration story across the three components is clear: MindSpore defines the model and Agent logic, MindSpeed handles efficient training, and MindIE handles efficient inference. The full pipeline—from training to inference to Agent execution—can now be implemented entirely on domestic computing infrastructure.
The Developer Enablement Program
The most immediately tangible announcement at KADC2026 was the "2-minute demo" enablement program. Huawei is providing:
- Pre-built Agent templates for common use cases (customer service, document processing, data collection)
- One-click deployment to Ascend cloud resources
- Free compute resources, backed by 10,000 NPU cards, for developer trials
The philosophy here is that reducing the barrier to first success matters more than accumulating features. If a developer can see an Agent running on their own business data in 2 minutes, they are far more likely to invest in learning the full technology stack. It is a lesson that the CUDA ecosystem learned years ago, and it is encouraging to see domestic computing vendors adopting the same playbook.
The CPU's New Role: From Helper to Orchestrator
Return to the opening statement: the CPU as the core scheduler of the Agent era. This is not just a rhetorical pivot—it represents a genuine architectural insight.
In the GPU/NPU-dominated paradigm of AI computing, the CPU has long been treated as a "helper": it handles data preprocessing, task scheduling, and result post-processing, while the real computational heavy lifting is offloaded to accelerators. But in Agent scenarios, this division of labor is being reexamined.
The core operation of an Agent is not matrix multiplication. It is logic orchestration: deciding what to do next, which tool to invoke, how to handle exceptions, and when to delegate inference to an NPU. These operations combine complex, branch-heavy logic with small computational requirements, and they are latency-sensitive. They are, in other words, exactly the kind of workload that CPUs handle well.
The Kunpeng supernode design is based on this insight: use a large-scale CPU cluster to handle Agent scheduling and orchestration logic, and use Ascend NPUs to handle model inference, with both coordinated through the Lingqu interconnect. The CPU is not a "helper"; it is the command center of the entire Agent system.
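The division of labor can be sketched as a minimal Agent loop. Everything here is hypothetical: `fake_npu_infer` is a stand-in for an accelerator-backed model call, and the `TOOL`/`FINAL` text protocol is invented for the example. What it shows is that the branching, tool dispatch, and error handling are ordinary control flow, with only one line per iteration delegated to the model.

```python
from typing import Callable, Dict, List


def fake_npu_infer(prompt: str) -> str:
    """Stand-in for an NPU-backed model call; returns a canned decision."""
    if "result:" in prompt:
        return "FINAL the answer is 4"
    return "TOOL calculator 2+2"


# Hypothetical tool registry; the calculator only sums "+"-separated integers.
TOOLS: Dict[str, Callable[[str], str]] = {
    "calculator": lambda expr: str(sum(int(x) for x in expr.split("+"))),
}


def run_agent(task: str, infer: Callable[[str], str] = fake_npu_infer,
              max_steps: int = 5) -> str:
    history: List[str] = [task]
    for _step in range(max_steps):
        decision = infer("\n".join(history))       # heavy compute: delegated
        kind, _sep, rest = decision.partition(" ")  # branching: stays on the CPU
        if kind == "FINAL":
            return rest
        if kind == "TOOL":
            name, _sep, arg = rest.partition(" ")
            tool = TOOLS.get(name)
            if tool is None:
                history.append(f"error: unknown tool {name}")
                continue
            try:
                history.append(f"result: {tool(arg)}")
            except Exception as exc:                # exception handling is CPU work too
                history.append(f"error: {exc}")
        else:
            history.append(f"error: unparseable decision {decision!r}")
    return "gave up after max_steps"


print(run_agent("what is 2+2?"))
```

With many such loops running concurrently, the orchestration side becomes a high-concurrency, branch-heavy workload, which is the case the supernode design is making for the CPU.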
The Agent Opportunity for Domestic Computing
The signal from KADC2026 is unambiguous: domestic computing is shifting from "closing the performance gap" to "building the ecosystem."
In traditional AI training and inference scenarios, NVIDIA's CUDA ecosystem presents an extremely high barrier to entry for competitors. Domestic computing vendors have struggled to differentiate in this space because the incumbent ecosystem is so deeply entrenched. But Agents are a genuinely new application category. They require not just compute, but scheduling, orchestration, toolchains, and developer experience. On these dimensions, domestic computing vendors and NVIDIA are starting from the same baseline.
The three-layer Kunpeng architecture is essentially an attempt to translate the "full-stack controllability" advantage of domestic computing into a "deep optimization" advantage for Agent ecosystems. When the hardware, the operating system, and the framework are all designed by the same team, the end-to-end optimization space is far larger than in solutions assembled from different vendors' components.
KaiheAiBox · Hermes Insights