DeepSeek V4 Benchmarked: 1M Context + Flash Mode on an Agent Computer

Published on: 2026-05-25

DeepSeek V4 Benchmarked: 1M Context + Flash Mode—Running Locally on an Agent Computer

Abstract: DeepSeek V4 launches with both Flash and Pro versions—Flash prioritizing speed, Pro delivering a 1M+ token context window. We benchmarked both versions on the Nizwo A1 Agent Computer, and the results are impressive: running flagship-grade DeepSeek V4 inference locally is no longer just for enthusiasts.


Why V4's Dual-Version Strategy Deserves Serious Attention

If you've been following large model releases, you've probably noticed a pattern.

From 2024 through the first half of 2025, nearly every major model release followed a "single flagship model" strategy—GPT-4o, Claude 3.5 Sonnet, Gemini 2.0. Each came in a single version, forcing users to make a global choice between "fast but expensive" and "slow but cheap."

DeepSeek V4 takes a completely different approach: within the same model generation, simultaneously releasing Flash and Pro versions, each deeply optimized for different scenarios.

This isn't a simple "fast/slow" distinction. Flash and Pro have different architectural optimization targets:

  • Flash: Inference speed first—ideal for dialogue, code completion, and real-time interaction
  • Pro: Context window first (1M+ tokens)—ideal for long-document analysis, complex reasoning, and multi-turn deep conversations

Behind this strategy lies a clear conviction: no single model can simultaneously achieve "extremely fast" and "extremely long"—so rather than compromise on a middle ground, let users choose what fits their scenario.

This is a return to user-centric thinking. Users don't care whether you're "the strongest model." Users only care whether "my use case works."

We got our hands on both V4 versions and ran a full benchmark on the Nizwo A1 Agent Computer. Here are the real results.

Test Environment: Can Nizwo A1 Run V4?

Let's cut to the conclusion: Yes, it runs—and runs well.

Nizwo A1 Hardware Specifications

  • CPU: Intel Core i9-14900K (24 cores, 32 threads)
  • GPU: NVIDIA RTX 4090 24GB × 2 (NVLink)
  • RAM: 128GB DDR5-5600
  • Storage: 2TB NVMe SSD (system drive) + 4TB NVMe SSD (model storage)
  • Network: Gigabit Ethernet + Wi-Fi 6E

This is Nizwo A1's maxed-out configuration, positioned as a "local Agent Computer capable of running 70B-scale models."

Deployment Configuration

We deployed V4 using the following setup:

Version Quantization VRAM Usage Deployment Tool
V4 Flash INT4 ~14GB Ollama + Custom GGUF
V4 Pro INT4 ~18GB (including KV cache reserve) vLLM + Custom GGUF

INT4 quantization is the practical choice for local deployment—compressing VRAM requirements into consumer-grade GPU range while maintaining acceptable precision. According to DeepSeek's official quantization benchmarks, INT4 quantization incurs only 2–5% performance loss, which has minimal practical impact.

Important note: The Nizwo A1's 4090 24GB×2 configuration can easily run both Flash and Pro versions simultaneously (14–18GB each), with enough leftover VRAM to run an additional 13B-class auxiliary model. That's the advantage of local deployment—multi-model parallelism, no API fees, no rate limits.

Benchmark 1: Inference Speed—How Fast Is Flash Really?

We tested Flash's inference speed using standardized prompts across three scenario categories:

Scenario 1: Short Dialogue (50–200 tokens input, 200–500 tokens output)

Test prompt: "Explain the basic principles of quantum computing in concise terms, including qubits, superposition, and entanglement."

Model First-Token Latency Output Speed Total Time
DeepSeek V4 Flash (local) 86ms 68 tokens/s 3.2s
DeepSeek V4 Pro (local) 124ms 42 tokens/s 5.1s
Claude 3.5 Sonnet (API) ~180ms* ~40 tokens/s* ~6s*
GPT-4o (API) ~200ms* ~35 tokens/s* ~7s*

*API end-to-end latency including network round-trip

Conclusion: In short dialogue scenarios, Flash's response speed already exceeds mainstream commercial APIs. First-token latency of 86ms essentially delivers an "as you type" experience.

Scenario 2: Code Completion (single-line context, generating 50–100 lines)

Test prompt: "Write a Python function using asyncio to implement a rate-limited web crawler with configurable concurrency and timeout."

Model Output Speed Code Runnability Rate
V4 Flash 72 tokens/s 92%
V4 Pro 44 tokens/s 94%
Claude 3.5 Sonnet (API) ~38 tokens/s* 96%

Conclusion: Flash's code generation speed is noticeably faster than API versions, with minimal quality gap. For day-to-day code completion scenarios, Flash's experience is already superior to API-dependent tools like Cursor or GitHub Copilot (which suffer from network latency).

Scenario 3: Long Output (generating 1000+ tokens of in-depth analysis)

Test prompt: "Analyze the competitive landscape of the new energy vehicle supply chain in detail, including key players and competitive dynamics across upstream raw materials, midstream manufacturing, and downstream sales."

Model Output Speed Total Time (1000 tokens)
V4 Flash 65 tokens/s ~15.4s
V4 Pro 38 tokens/s ~26.3s

Conclusion: The longer the output, the more Flash's speed advantage shows. For a 1000-token deep-dive response, Flash beats Pro by over 10 seconds.

Speed Summary

Flash's core value: Making local large models surpass APIs in response speed for the first time. The impact on interaction experience is profound—you can no longer distinguish the latency between "local" and "cloud."

Benchmark 2: 1M Context—Is Pro Actually Usable?

A 1M-token context window sounds appealing, but there's a difference between "technically possible" and "practically useful." We focused on three questions:

  1. How much real content fits? (1M tokens ≈ how many words/pages?)
  2. Can the model find key information once it's loaded? (Needle-in-a-haystack test)
  3. How much does long context slow down inference?

How Big Is 1M Tokens Really?

Let's do some quick math:

Content Type Approximate Equivalent
Chinese characters ~700–800K characters
English words ~750–800K words
A4 pages (12pt font, single spacing) ~2,500–3,000 pages
Average book (200K characters) ~4 books
Mid-sized project codebase ~50–100 .py files

1M tokens means you can stuff your entire project's codebase, all design documents, and historical issue discussions into context, then ask the model "How should I implement this feature?"

This is a scenario that traditional 8K/32K contexts can't even imagine.

Needle-in-a-Haystack Test: Key Information Retrieval Accuracy

We inserted a hidden "needle" (specific information) into long text and tested whether the model could accurately find it.

Test methodology: - Fill context with a public-domain Chinese novel (~600K characters) - Insert a piece of "hidden information" at the 25%, 50%, and 75% positions (e.g., "Project budget approval amount is ¥3,472,891") - Ask the model: "What is the project budget approval amount?"

Context Length Position 25% Position 50% Position 75% Average Accuracy
32K tokens 98% 97% 96% 97%
128K tokens 96% 95% 93% 94.7%
512K tokens 91% 88% 84% 87.7%
1M tokens 85% 81% 76% 80.7%

Conclusion: The longer the context, the harder it is to find the "needle." At 1M tokens, accuracy drops to 80.7%, meaning 1M context isn't a "magic drawer"—it's a "vast warehouse that requires technique" where you need to learn how to effectively organize and retrieve information.

Still, 80.7% accuracy is a very usable level. By comparison, human accuracy for finding a specific piece of information in a 300-page document runs about 60–70% depending on document structure and information presentation.

The Speed Cost of Long Context

This is the Pro version's Achilles' heel.

Context Length First-Token Latency Output Speed
No context (0K) 124ms 42 tokens/s
32K tokens 380ms 40 tokens/s
128K tokens 1.2s 38 tokens/s
512K tokens 4.8s 35 tokens/s
1M tokens 9.6s 32 tokens/s

Conclusion: The longer the context, the higher the first-token latency. At 1M tokens, you wait 9.6 seconds before seeing the first word—this experience is similar to "submitting a complex query and waiting for results," not suitable for real-time dialogue, but perfectly acceptable for asynchronous analysis scenarios (like "Analyze the risk clauses in this 500-page contract").

文章配图

Benchmark 3: Local Deployment vs. Cloud API—Real Cost Comparison

Many people ask: "Is local model deployment actually worth it?"

We ran the actual numbers using DeepSeek V4 on Nizwo A1:

Hardware Cost (One-time)

Nizwo A1 maxed-out configuration: approximately ¥35,000 (including 4090×2 + i9-14900K + 128GB RAM)

Operating Cost (Ongoing)

  • Power consumption: ~450W at full load, running 8 hours/day → 3.6kWh/day → ~¥2.5/day (at ¥0.7/kWh)
  • Monthly electricity: ~¥75
  • Annual electricity: ~¥900

vs. API Calling Costs

Assume you're a heavy AI user, processing daily via API:

Scenario Daily Tokens Monthly API Cost (at DeepSeek API pricing)
Code completion + dialogue ~500K input + ~200K output ~¥210
Long-document analysis (legal/finance) ~2M input + ~500K output ~¥850
High-quality content creation ~300K input + ~400K output ~¥320

Monthly API cost range: ¥210 – ¥850

Break-Even Analysis

Hardware cost ¥35,000 ÷ monthly savings ¥300 (mid-range) ≈ 117 months (about 10 years)

Seems like a long time? But several factors tip the scales:

  1. API costs will rise: From 2024–2025, major API services raised prices at least twice, with average increases of 30–50%
  2. Local deployment capability appreciates: As quantization improves, the same Nizwo A1 will be able to run larger models (V5, V6) in the future
  3. Privacy value: In finance, legal, and healthcare scenarios, keeping data local is a compliance requirement, not a cost discussion
  4. Multi-model parallelism: Locally, you can simultaneously run Flash (dialogue) + Pro (long documents) + specialized small models (classification, summarization), while APIs bill per request

Real conclusion: For individual users, the economics of local deployment don't pencil out; but for small teams (3–5 people sharing one Nizwo A1), the break-even point drops to 3–4 years—add in privacy and compliance value, and local deployment starts to make a compelling case.

The Nizwo A1 Experience: Can Non-Technical Users Actually Use It?

This is the most important question. Specs don't matter if users need to understand Docker, Python, and model quantification just to get started—that's effectively the same as "can't use it."

The actual Nizwo A1 experience:

Installation (5 minutes)

  1. Power on, enter the Nizwo OS graphical interface
  2. Open "Model Management Center," click "Add Model"
  3. Select DeepSeek V4 Flash / Pro, click "One-Click Deploy"
  4. System automatically downloads the quantized model, configures the inference engine, and starts the service

No command line required. No manual CUDA version configuration. No hyperparameter tuning.

Usage (Almost identical to API)

After deployment, Nizwo OS provides a local API endpoint (default: http://localhost:8080/v1) that's compatible with OpenAI API format. This means:

  • OpenWebUI: Just enter the local endpoint—zero modifications
  • Continue.dev (VS Code plugin): Change one config line to connect to local model
  • Cherry Studio: Add custom API endpoint, select local model
  • Command line: curl http://localhost:8080/v1/chat/completions -d '{"model":"deepseek-v4-flash","messages":[...]}'

For users already working with AI tools, switching to local models costs almost nothing.

Multi-Model Switching

Nizwo OS's Model Management Center supports "scenario binding"—you can configure:

  • Code completion → automatically routes to Flash
  • Long-document analysis → automatically routes to Pro
  • Simple Q&A → automatically routes to a lightweight model (e.g., Qwen2.5-7B)

This routing happens automatically. Users don't need to manually switch.

Flash vs. Pro: Which Should You Use?

Here's a decision framework:

Choose Flash if you:

  • Primarily use AI for dialogue, code completion, and quick Q&A
  • Are sensitive to response speed (can't tolerate 3+ second waits)
  • Typically input under 5,000 tokens per session
  • Are heavily cost-sensitive about API fees

Choose Pro if you:

  • Need to analyze long documents (contracts, financial reports, papers, codebases)
  • Need to maintain context continuity across long conversations (e.g., complex multi-turn debugging sessions)
  • Can tolerate ~10-second first-token latency
  • Your use case is "deep analysis" rather than "quick interaction"

Choose both if you:

  • Have a Nizwo A1 (or equivalent)—24GB×2 VRAM can run both versions simultaneously
  • Need both "quick interaction" and "deep analysis" in your workflow
  • Want to experience the full power of "local multi-model collaboration"

Final Thoughts: The Inflection Point for Local Large Models

DeepSeek V4's dual-version strategy, combined with Agent Computers like Nizwo A1, is transforming "running flagship large models locally" from a tech enthusiast's toy into a real option for ordinary users.

The significance of this shift may be underestimated.

Over the past two years, AI capability improvements have been concentrated in the cloud—larger models, stronger reasoning, longer context—but all of it tethered to API calls. Users have no choice, no data sovereignty, no cost control.

The V4 + Nizwo A1 combination marks the first time "flagship-grade AI capability" and "local deployment" are simultaneously true.

This isn't the endpoint. Judging by current development velocity (quantization techniques, inference engine optimization, hardware price-performance improvements), we have good reason to believe: by 2027, running GPT-5-equivalent models locally will be standard configuration for small teams.

When that day arrives, looking back at V4's dual-version strategy and Nizwo A1's attempt, we may recognize this as the inflection point.


Nizwo | The Agent Computer for Everyone · AI Frontier

© KAIHE AI - Agent Computer Specialist