Kaihe Local AI Development Environment: Python + Jupyter + vLLM Setup Guide

Published on: 2026-05-17

You've just unboxed a Kaihe A1. Now how do you turn it into a full AI development workstation? This guide walks you through building a complete environment from scratch and flags the most common pitfalls along the way.

Before installing anything, verify three things through Kaihe's system dashboard: the ROCm driver version (a prerequisite for LLM inference), available VRAM and shared memory (which set the ceiling on model size), and the status of the CUDA compatibility layer. Kaihe uses AMD architecture, so CUDA libraries such as cuBLAS and cuFFT do not run natively; the compatibility layer maps those calls to their ROCm counterparts (on a standard ROCm stack, hipBLAS/rocBLAS and hipFFT/rocFFT). Use the officially recommended stable driver, since manual upgrades to the latest version often break things.
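Once PyTorch is installed (see the next step), the dashboard readings can be cross-checked from Python. The following is a minimal sketch rather than an official Kaihe tool: the function name describe_gpu is made up for illustration, and it assumes a ROCm build of PyTorch, which exposes AMD GPUs through the torch.cuda namespace and reports its HIP version in torch.version.hip.

```python
def describe_gpu():
    """Return a small dict describing the GPU stack that PyTorch can see."""
    info = {
        "torch_installed": False,
        "rocm_build": None,   # HIP version string on ROCm builds, else None
        "gpu_visible": False,
        "free_gib": None,
        "total_gib": None,
    }
    try:
        import torch
    except ImportError:
        return info  # PyTorch is not installed yet; nothing more to check.

    info["torch_installed"] = True
    info["rocm_build"] = getattr(torch.version, "hip", None)

    # ROCm builds of PyTorch reuse the torch.cuda API for AMD devices.
    if torch.cuda.is_available():
        info["gpu_visible"] = True
        free_bytes, total_bytes = torch.cuda.mem_get_info()
        info["free_gib"] = round(free_bytes / 2**30, 2)
        info["total_gib"] = round(total_bytes / 2**30, 2)
    return info


if __name__ == "__main__":
    for key, value in describe_gpu().items():
        print(f"{key}: {value}")
```

If rocm_build comes back as None while torch_installed is True, the CUDA variant of PyTorch was probably installed by mistake.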

For environment management, Miniconda is the safest bet. Create an isolated environment first and never pollute base. Install libraries in this exact order: PyTorch (ROCm variant, not CUDA), transformers, then vLLM. Order matters: installing vLLM before torch can cause dependency detection failures. vLLM is the core inference engine. While Ollama is simpler, vLLM's PagedAttention stores the KV cache in fixed-size blocks that need not be contiguous in memory, which keeps fragmentation low and delivers 30%+ higher concurrent throughput on the same model.
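The install order above can be checked from inside the new environment. This is a small sketch, not part of any of the three libraries: check_install_order and REQUIRED_ORDER are illustrative names, and the check only looks at whether each top-level module can be found, not at which build or version is present.

```python
import importlib.util

# Order recommended in the text: PyTorch first, then transformers, then vLLM.
REQUIRED_ORDER = ["torch", "transformers", "vllm"]


def _module_found(name):
    """True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


def check_install_order(is_installed=_module_found):
    """Return (next_package_to_install, packages_installed_out_of_order)."""
    next_up = None
    out_of_order = []
    for name in REQUIRED_ORDER:
        if not is_installed(name):
            if next_up is None:
                next_up = name  # first gap in the recommended order
        elif next_up is not None:
            # Present even though an earlier package is still missing.
            out_of_order.append(name)
    return next_up, out_of_order


if __name__ == "__main__":
    next_up, out_of_order = check_install_order()
    print("next to install:", next_up or "nothing, stack is complete")
    if out_of_order:
        print("installed before their prerequisites:", ", ".join(out_of_order))
```

The is_installed parameter exists only so the logic can be exercised without touching the real environment.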

For Jupyter, use JupyterLab. Configure remote access so you can work from your laptop browser: jupyter lab --ip 0.0.0.0 --port 8888 --no-browser. Set a password, because binding to 0.0.0.0 makes the service listen on every network interface and so exposes it to your entire network. Best practice is internal network or VPN only, never direct public internet exposure.
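To make the exposure risk concrete, the sketch below classifies a bind address before building the launch command. It uses only the standard ipaddress module; the helper names classify_bind_address and jupyter_command are illustrative, and the flags are the ones shown in the command above.

```python
import ipaddress


def classify_bind_address(ip):
    """Describe how widely a service bound to `ip` can be reached."""
    addr = ipaddress.ip_address(ip)
    if addr.is_unspecified:   # 0.0.0.0 or :: means every interface
        return "all-interfaces"
    if addr.is_loopback:      # 127.0.0.1 or ::1, this machine only
        return "loopback"
    if addr.is_private:       # e.g. 192.168.x.x or 10.x.x.x
        return "private-network"
    return "public"


def jupyter_command(ip="0.0.0.0", port=8888):
    """Build the JupyterLab launch command as an argument list."""
    return ["jupyter", "lab", "--ip", ip, "--port", str(port), "--no-browser"]


if __name__ == "__main__":
    ip = "0.0.0.0"
    scope = classify_bind_address(ip)
    if scope in ("all-interfaces", "public"):
        print(f"warning: {ip} is reachable beyond this machine ({scope}); set a password")
    print(" ".join(jupyter_command(ip)))
```

Binding to a specific private address instead of 0.0.0.0 limits the service to that one interface, which fits the internal-network-only advice.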

Four essential companion tools: uv (Python package manager, often 10x or more faster than pip), mise (multi-language version manager to avoid Node/Python conflicts), code-server (browser-based VS Code), and Oh My Zsh (terminal quality of life). With these in place, your Kaihe development experience is essentially identical to developing on a Mac.

Validation takes three tests: run a 7B model to confirm inference works, run an embedding model to check memory usage patterns, then stress-test with three concurrent inference tasks. If all three pass, your environment is production-ready.
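The third test, three concurrent inference tasks, can be sketched as follows. Several things here are assumptions rather than facts from this guide: that vLLM is running its OpenAI-compatible server on the default port 8000, that the /v1/completions route is used, and that MODEL_NAME is replaced with whatever model the server actually loaded. The function names are illustrative.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumed endpoint and placeholder model name; adjust both to your setup.
COMPLETIONS_URL = "http://localhost:8000/v1/completions"
MODEL_NAME = "your-7b-model"


def send_completion(prompt, url=COMPLETIONS_URL, model=MODEL_NAME, timeout=120):
    """POST one completion request and return the generated text."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": 64}
    ).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


def run_concurrent(prompts, send=send_completion):
    """Send all prompts at once; return ([(prompt, ok, detail)], seconds)."""
    start = time.perf_counter()
    results = []
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        futures = [pool.submit(send, prompt) for prompt in prompts]
        for prompt, future in zip(prompts, futures):
            try:
                results.append((prompt, True, future.result()))
            except Exception as exc:  # refused connection, timeout, HTTP error
                results.append((prompt, False, repr(exc)))
    return results, time.perf_counter() - start
```

Calling run_concurrent with three prompts and checking that every entry has ok set to True gives a simple pass/fail signal; the elapsed time is a rough indicator only, not a benchmark.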

Critical note: never install both Ollama and vLLM in the same environment. They compete for GPU memory (vLLM reserves most of the available VRAM up front by default) and can collide on ports if either is moved off its default, so running both simultaneously almost guarantees OOM. Use vLLM for production inference and Ollama only for rapid prototyping.
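A quick way to see whether both services are up at the same time is to probe their ports before starting a heavy job. The sketch below assumes the default ports, 11434 for Ollama's API and 8000 for vLLM's server; if either was started on a different port, the DEFAULT_PORTS mapping (an illustrative name) needs to be changed to match.

```python
import socket

# Assumed defaults; change these if either service uses another port.
DEFAULT_PORTS = {"ollama": 11434, "vllm": 8000}


def port_is_open(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0


def running_services(ports=DEFAULT_PORTS, probe=port_is_open):
    """Return the names of services whose port currently accepts connections."""
    return [name for name, port in ports.items() if probe(port)]


if __name__ == "__main__":
    active = running_services()
    print("active services:", ", ".join(active) or "none")
    if len(active) > 1:
        print("warning: both are running and will compete for GPU memory")
```

An open port only shows that something is listening there, so treat the result as a hint rather than proof of which process holds the GPU.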

© KAIHE AI - Agent Computer Specialist