DeepSeek Open-Sources DSpark: V4 Inference Speed Up 85%

📖 Glossary

AI Box (also known as Agent Computer or Agent PC) is a dedicated local hardware device that runs AI agents. It comes pre-installed with an AI agent management system, is plug-and-play, and runs 24/7. Users can command their agents remotely via Discord, Slack, Telegram, WhatsApp, and other messaging apps.

Abstract: On June 27, DeepSeek and Peking University jointly released DSpark, a speculative decoding framework, along with DeepSpec, an open-source training codebase. DSpark is already deployed in DeepSeek-V4 production services, where it raises generation speed by 60%-85% for the Flash model and 57%-78% for the Pro model with no loss in output quality. The paper and code are available on GitHub, and the toolkit supports Qwen3, Gemma, and other open-source models.

On June 27, DeepSeek quietly updated its GitHub repository with a paper on DSpark, a speculative decoding acceleration framework. No press conference, no teaser campaign — but the tech community noticed fast.

The reason is straightforward: DSpark boosts DeepSeek-V4 generation speed by 60% to 85% with completely lossless output. Users waiting for AI responses get results in a little over half the time (about 54% to 63% of the original wait). The difference is perceptible.
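
To see how a speed gain maps to waiting time, here is a minimal arithmetic sketch. It assumes the quoted percentages describe tokens generated per second for a single response; the function name is illustrative and not taken from the DSpark code.

```python
def relative_latency(speedup_pct: float) -> float:
    """Fraction of the original generation time left after a speed gain.

    A gain of s percent means tokens arrive (1 + s/100) times faster,
    so the same response takes 1 / (1 + s/100) of the original time.
    """
    return 1.0 / (1.0 + speedup_pct / 100.0)


for pct in (60, 85):
    print(f"+{pct}% speed -> {relative_latency(pct):.1%} of the original time")
# +60% speed -> 62.5% of the original time
# +85% speed -> 54.1% of the original time
```

So a reply that previously took 10 seconds would take roughly 5.4 to 6.3 seconds under this assumption.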

What Problem Does Speculative Decoding Solve

Large language models generate text one token at a time by default. Each token requires a full forward pass through the model — like re-reading the entire prompt before writing each character of an essay.

Speculative decoding uses a small "draft model" to quickly guess the next several tokens as a batch, then the large model verifies them in one pass. Correct guesses are accepted; at the first wrong guess, the large model's own token is used instead and the rest of the draft is discarded. This way, the large model produces multiple tokens per forward pass instead of just one.
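
The draft-then-verify loop can be sketched with toy models. This is a minimal illustration of generic greedy speculative decoding, not DSpark's implementation: `target_model` and `draft_model` are hypothetical stand-ins that map a token list to the next token, and the verification loop only simulates what a real engine would do in one batched forward pass.

```python
from typing import Callable, List

# A "model" here is just a function mapping a token sequence to the next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """One draft-then-verify round; returns the tokens appended to context."""
    # 1. The cheap draft model guesses k tokens.
    guesses: List[int] = []
    for _ in range(k):
        guesses.append(draft(context + guesses))

    # 2. The target checks every position (simulated sequentially here).
    accepted: List[int] = []
    for guess in guesses:
        truth = target(context + accepted)
        if guess != truth:
            # First mismatch: keep the target's own token, drop the rest.
            accepted.append(truth)
            return accepted
        accepted.append(guess)

    # 3. All guesses matched, so the same pass also yields one bonus token.
    accepted.append(target(context + accepted))
    return accepted


def target_model(tokens: List[int]) -> int:
    return (tokens[-1] + 1) % 10           # toy rule: count upward mod 10


def draft_model(tokens: List[int]) -> int:
    nxt = (tokens[-1] + 1) % 10
    return 0 if nxt == 5 else nxt          # toy draft is wrong whenever truth is 5


print(speculative_step(target_model, draft_model, [1], 4))  # [2, 3, 4, 5]
```

In the example the draft gets three tokens right and one wrong, yet the round still emits four tokens, all identical to what the target alone would have produced. That is why the output is lossless.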

The technique has existed for a while, but two persistent problems limited adoption: draft model accuracy drops sharply at the tail of long sequences (great guesses up front, increasingly wild ones later), and verification scheduling under high concurrency is hard to balance, often wasting the time saved.

DSpark's Two Key Designs

DSpark's paper, "Scheduled Speculative Decoding with Semi-Autoregressive Generation," tackles both issues.

Semi-Autoregressive Architecture for Tail Degradation

Traditional speculative decoding draft models are either fully parallel (guess N tokens at once, but no inter-token dependency — prone to drift) or fully serial (one at a time, accurate but slow).

DSpark uses a semi-autoregressive architecture: a parallel backbone network produces base features for all candidate tokens in one shot, then a lightweight serial module (just two Transformer layers) fills in the token-to-token dependencies. The two-layer design reportedly outperforms traditional five-layer parallel draft models, balancing speed and accuracy.
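
The toy below illustrates why a serial pass on top of parallel scores helps. It is only a sketch of the data flow: the score tables `BASE_SCORES` and `PAIR_BONUS` are invented for illustration, and a simple lookup stands in for the two Transformer layers described in the paper.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical base scores from a parallel backbone: one dict per draft
# position, computed in one shot without seeing the other positions.
# Two continuations are plausible: "new york" and "los angeles".
BASE_SCORES: List[Dict[str, float]] = [
    {"new": 0.55, "los": 0.45},
    {"angeles": 0.51, "york": 0.49},
]

# Hypothetical token-to-token compatibility known to the small serial module.
PAIR_BONUS: Dict[Tuple[Optional[str], str], float] = {
    ("new", "york"): 1.0,
    ("los", "angeles"): 1.0,
}


def draft_parallel(base: List[Dict[str, float]]) -> List[str]:
    """Pick each position independently (fully parallel drafting)."""
    return [max(scores, key=scores.get) for scores in base]


def draft_semi_autoregressive(base: List[Dict[str, float]],
                              pair_bonus: Dict[Tuple[Optional[str], str], float]
                              ) -> List[str]:
    """Reuse the parallel base scores, then resolve dependencies left to right."""
    tokens: List[str] = []
    for scores in base:
        prev = tokens[-1] if tokens else None
        adjusted = {tok: s + pair_bonus.get((prev, tok), 0.0)
                    for tok, s in scores.items()}
        tokens.append(max(adjusted, key=adjusted.get))
    return tokens


print(draft_parallel(BASE_SCORES))                         # ['new', 'angeles']
print(draft_semi_autoregressive(BASE_SCORES, PAIR_BONUS))  # ['new', 'york']
```

The independent picks drift into an incoherent pair, while the serial pass reuses the same base scores and only adds a cheap correction conditioned on the previous token.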

Confidence-Scheduled Verification for High-Concurrency Scheduling

Instead of fixed-length verification, DSpark introduces a confidence scheduling mechanism. The system dynamically decides how many candidate tokens to verify based on prefix acceptance probability and real-time engine throughput. High-confidence fragments get priority; low-confidence ones are quickly discarded, reducing wasted computation.
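
One way to picture confidence scheduling is a cutoff on the cumulative acceptance probability of the draft prefix. The sketch below is an assumption-laden illustration, not the paper's rule: `min_prefix_prob` is a hypothetical threshold, and `budget` is a hypothetical stand-in for how many tokens the engine can afford to verify under current load.

```python
from typing import List


def prefix_acceptance(confidences: List[float]) -> List[float]:
    """Probability that the draft is accepted up to and including each position."""
    probs: List[float] = []
    running = 1.0
    for c in confidences:
        running *= c          # a prefix survives only if every token in it does
        probs.append(running)
    return probs


def tokens_to_verify(confidences: List[float], min_prefix_prob: float,
                     budget: int) -> int:
    """How many draft tokens to send to the target model this round."""
    keep = 0
    for p in prefix_acceptance(confidences):
        if p < min_prefix_prob or keep >= budget:
            break             # low-confidence tail or busy engine: stop here
        keep += 1
    return keep


draft_conf = [0.9, 0.8, 0.5, 0.9]
print(tokens_to_verify(draft_conf, min_prefix_prob=0.5, budget=8))  # 2
print(tokens_to_verify(draft_conf, min_prefix_prob=0.5, budget=1))  # 1
```

With per-token confidences of 0.9, 0.8, 0.5, and 0.9, the prefix probability falls to 0.36 at the third token, so only the first two are worth verifying. A tighter budget shortens the draft further.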

The deployment layer uses asynchronous scheduling that decouples logical from physical computation to avoid GPU pipeline stalls, and it is compatible with mainstream CUDA hardware.

Benchmark Numbers

DeepSeek published two headline results:

Model Version	Speed Improvement	Baseline
DeepSeek-V4-Flash-DSpark	60%-85%	MTP-1
DeepSeek-V4-Pro-DSpark	57%-78%	MTP-1

MTP-1 is the single-token speculative decoding baseline DeepSeek previously used in production. DSpark improves single-user end-to-end generation speed by 60%-85% while maintaining overall system throughput.

The team also tested with Qwen3-4B across math reasoning, code generation, and daily conversation tasks. DSpark's single-round effective generation length outperformed Eagle3 and DFlash across all three. With Qwen3-4B, DSpark improved 30.9% over Eagle3 and 16.3% over DFlash.

What's Open-Sourced: DSpark + DeepSpec

Two components are released:

DSpark Model Weights: Including DeepSeek-V4-Pro-DSpark and DeepSeek-V4-Flash-DSpark, ready for deployment.

DeepSpec Codebase: A full-stack training toolkit with data preparation, draft model training, and evaluation scripts. MIT licensed, compatible with DSpark, DFlash, and Eagle3 draft model algorithms. Developers can train acceleration modules for Qwen3, Gemma, and other open-source models.

This means you're not limited to DeepSeek's pre-trained DSpark weights: you can use DeepSpec to train a draft model for your own models. This is particularly useful for local deployment and edge inference scenarios.

Implications for Local AI Deployment

Inference speed is a hard constraint for LLM deployment. In local scenarios with limited compute, every bit of speedup matters.

The open-sourcing of DSpark and DeepSpec lowers the barrier to speculative decoding. Developers don't need to build acceleration frameworks from scratch, because the open-source toolchain lets them train draft models directly. And because the serial module is only two Transformer layers deep, acceleration becomes feasible on consumer-grade hardware.

Kaihe AIBOX's edge-cloud architecture fits this trend. The local Agent can adjust its model-calling strategy based on whether a model supports speculative decoding: models with DSpark acceleration handle long-text generation tasks, while others handle short replies. Combining this model routing with speculative decoding acceleration further reduces response latency and token cost.
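
A routing rule of this kind could look like the sketch below. Everything in it is hypothetical: the `ModelInfo` fields, the model names, and the 512-token threshold are illustrative assumptions, not part of any Kaihe AIBOX or DeepSeek API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ModelInfo:
    name: str
    speculative: bool   # True if a draft module (e.g. DSpark) is available


def pick_model(models: List[ModelInfo], expected_tokens: int,
               long_threshold: int = 512) -> ModelInfo:
    """Prefer accelerated models for long generations, others for short replies."""
    want_speculative = expected_tokens >= long_threshold
    for model in models:
        if model.speculative == want_speculative:
            return model
    return models[0]    # nothing matches: fall back to the first available model


models = [ModelInfo("plain-model", False), ModelInfo("dspark-model", True)]
print(pick_model(models, expected_tokens=2000).name)  # dspark-model
print(pick_model(models, expected_tokens=50).name)    # plain-model
```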

Final Thoughts

DSpark was released on a weekend with no fanfare, but the technical substance speaks for itself. DeepSeek's open-source cadence this year has been steady: the V4 model, the DeepSpec training framework, and DSpark acceleration together cover the full pipeline from model to training to inference.

For developers, speculative decoding has transitioned from "paper technology" to "directly usable tool." As more models gain DSpark support, local inference speed will see a tangible leap.
