the inference server

One language, socket to logits

The millrace server is a from-scratch, pure-Mojo GPU inference engine for Qwen2.5 (0.5B and 3B) on Apple Silicon, served over an OpenAI-compatible HTTP API. Every GPU kernel — matmul, attention, RMSNorm, RoPE, SwiGLU, the int4 dequant path — is hand-written in Mojo, reaching Apple's simdgroup_matrix units through AIR external_call. There are no C++, CUDA, or Metal-shader GPU dependencies, and no Python or MAX on the request path.

Goal

Own the whole inference path in one language. Most local-LLM stacks are a thin scripting layer over a C++/Metal core (MLX) or a C/C++ engine (llama.cpp via Ollama). millrace instead writes the transformer forward pass — and every kernel it runs on — by hand in Mojo, on Apple's Metal GPU, with a small, readable codebase.

It's a learning/research engine, and honest about the trade-off. The mature frameworks are faster: on an M4, the 3B at 4-bit decodes at ~18 tok/s versus ~50 for MLX and Ollama. That gap, which comes from per-token Metal dispatch overhead on decode and a Mojo simdgroup_matrix ceiling on prefill GEMM, is the interesting part, and it's documented with raw numbers rather than hidden. The payoff is a stack with no external GPU libraries that you can read end to end, and your data never leaves the machine.

Design

Implementation

The engine is a small Mojo library: GPU kernels in kernels.mojo, the model in model.mojo, a byte-level BPE tokenizer, chat templating, and the HTTP server.

prompt ─▶ jinja2.mojo template ─▶ BPE tokenizer ─▶ Qwen2 model (GPU/Metal) ─▶ tokenizer ─▶ text
                                                  │  embed → N decoder layers → norm → tied LM head
                                                  │  (RMSNorm · GQA attn + RoPE · SwiGLU · KV cache)
                                          weight loader: safetensors ─▶ bf16 on device

The forward pass

Each decoder layer is the standard Qwen2/Llama pre-norm block, in two halves that each end in a residual add. The attention half is RMSNorm → Q/K/V projections (with a bias, which is Qwen2-specific) → RoPE → causal grouped-query attention (7 query heads per KV head on the 0.5B) → output projection. The MLP half is RMSNorm → SwiGLU MLP, down(silu(gate(x)) · up(x)). The head reuses the embedding matrix as a tied LM head. The hand-written Metal kernels are embed, rmsnorm, matmul (y = x·Wᵀ + b), silu_mul, and attn_cached (RoPE + causal GQA over the KV cache).
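As a rough CPU reference for the MLP half of the block, the sketch below spells out RMSNorm, SiLU, and the SwiGLU MLP in plain Python, plus the query-to-KV head mapping for the 7-to-1 grouping quoted above. It is not millrace's Mojo code: the function names, the row-major list-of-lists weight layout, and the eps default of 1e-6 are assumptions made for illustration.

```python
import math

def rmsnorm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root-mean-square, then by a learned weight.
    eps=1e-6 is an assumed default, not taken from the millrace source."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def silu(v):
    """SiLU(v) = v * sigmoid(v), written so exp() never overflows."""
    if v >= 0:
        return v / (1.0 + math.exp(-v))
    e = math.exp(v)
    return v * e / (1.0 + e)

def matvec(W, x):
    """y = W·x for a row-major matrix: one output element per row of W."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def swiglu_mlp(x, W_gate, W_up, W_down):
    """down(silu(gate(x)) · up(x)), the MLP form given in the text."""
    gate = matvec(W_gate, x)
    up = matvec(W_up, x)
    hidden = [silu(g) * u for g, u in zip(gate, up)]
    return matvec(W_down, hidden)

def kv_head_for(query_head, group_size=7):
    """Grouped-query attention: consecutive query heads share one KV head."""
    return query_head // group_size
```

In a full layer the MLP half would be `x + swiglu_mlp(rmsnorm(x, w), ...)`, with the residual add applied outside these helpers.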

Loader, tokenizer, templating — all hand-written

Decode & the KV cache

Prefill runs the whole prompt in one GPU pass; each decode step then appends one position to a single-sequence, GPU-resident KV cache. Keys are RoPE-rotated once on write, so a decode step only rotates the new query and key and attends over the T cached positions: O(T) work per step, not O(T²). Only token ids cross back to the host; tokenization and the argmax/sampler stay on-device. Sampling (repetition penalty → temperature → top-k → top-p) matches HF's logits processors; greedy is the default and the verifiable one.
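The sampling order described above can be sketched on the CPU as follows. This is an illustrative reference in the style of the HuggingFace logits processors, not millrace's on-device sampler: every function name and default here is an assumption, and edge cases such as probability ties at the top-p boundary may differ from HF.

```python
import math
import random

NEG_INF = float("-inf")

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]  # exp(-inf) is 0.0, so masked tokens vanish
    total = sum(exps)
    return [e / total for e in exps]

def apply_repetition_penalty(logits, seen_ids, penalty):
    """Make already-seen tokens less likely: shrink positive logits, push negatives down."""
    out = list(logits)
    for t in set(seen_ids):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

def apply_temperature(logits, temperature):
    return [l / temperature for l in logits]

def apply_top_k(logits, k):
    """Mask everything below the k-th largest logit (ties at the threshold are kept)."""
    if k <= 0 or k >= len(logits):
        return list(logits)
    threshold = sorted(logits, reverse=True)[k - 1]
    return [l if l >= threshold else NEG_INF for l in logits]

def apply_top_p(logits, p):
    """Keep the smallest set of most-likely tokens whose probabilities sum to at least p."""
    probs = softmax(logits)
    order = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return [l if i in keep else NEG_INF for i, l in enumerate(logits)]

def next_token(logits, seen_ids, do_sample=False, penalty=1.0,
               temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Repetition penalty -> (greedy argmax) or temperature -> top-k -> top-p -> sample."""
    scores = apply_repetition_penalty(logits, seen_ids, penalty)
    if not do_sample:
        return max(range(len(scores)), key=scores.__getitem__)
    scores = apply_temperature(scores, temperature)
    scores = apply_top_k(scores, top_k)
    scores = apply_top_p(scores, top_p)
    rng = rng or random.Random()
    return rng.choices(range(len(scores)), weights=softmax(scores))[0]
```

Greedy decoding is the `do_sample=False` path: it is deterministic, which is what makes token-for-token comparison against a reference possible.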

int4 & serving

Setting QWEN_Q4=1 loads the projection weights as group-128 int4 (embedding/LM head stay bf16): on the 3B, ~2× faster decode GEMVs and ~4× smaller projection weights at ~84% top-1 agreement with bf16. The OpenAI-compatible server runs on flare's kqueue reactor — GET /v1/models, POST /v1/chat/completions, and POST /v1/responses, with request params, a usage block, and SSE streaming. opencode drives it directly.
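To make "group-128 int4" concrete, here is a minimal sketch of one plausible scheme: symmetric quantization with one scale per group of 128 weights and two 4-bit values packed per byte. The text does not specify millrace's actual layout, so the symmetric range, the nibble order, and the function names below are all assumptions.

```python
GROUP = 128  # weights per quantization group, as stated in the text

def quantize_group(weights):
    """Quantize 128 floats to signed 4-bit ints in [-8, 7] with one shared scale.
    Symmetric scaling and low-nibble-first packing are assumed, not confirmed."""
    assert len(weights) == GROUP
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    packed = bytearray(GROUP // 2)
    for i in range(0, GROUP, 2):
        lo = q[i] & 0xF       # two's-complement nibble, e.g. -1 -> 0xF
        hi = q[i + 1] & 0xF
        packed[i // 2] = lo | (hi << 4)
    return scale, bytes(packed)

def dequantize_group(scale, packed):
    """Unpack nibbles, sign-extend them, and multiply by the group scale."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            signed = nibble - 16 if nibble >= 8 else nibble
            out.append(signed * scale)
    return out

def compression_ratio(scale_bytes=2):
    """bf16 bytes per group divided by int4 bytes per group (payload plus one scale)."""
    bf16_bytes = GROUP * 2
    int4_bytes = GROUP // 2 + scale_bytes
    return bf16_bytes / int4_bytes
```

Under these assumptions a group takes 64 bytes of payload plus a 2-byte scale, against 256 bytes in bf16, so `compression_ratio()` comes out a little under 4, in line with the "~4× smaller" figure.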

Verified, every step

Each phase ships behind a conformance gate against the CPU oracle: attention+RoPE matches to ≤ 8.4e-6, matmul is bit-exact, the tokenizer is byte-identical, and end-to-end greedy generation is token-for-token identical to HuggingFace — "What is the capital of France?" → The capital of France is Paris., all eight ids matching. Kernel-optimization work (coalesced GEMV, warp-per-row RMSNorm, flash-style attention, bf16-resident weights) took 0.5B decode from ~11 to ~26 tok/s on an M4, with long context now improving per-token throughput.
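A conformance gate of this kind reduces to two comparisons: a maximum absolute error for numeric kernels, and an exact match for token ids. The helpers below are a hypothetical sketch of those checks, not millrace's test harness; the names are made up for illustration.

```python
def max_abs_error(got, want):
    """Largest element-wise absolute difference between a kernel output and the oracle."""
    if len(got) != len(want):
        raise ValueError("length mismatch: %d vs %d" % (len(got), len(want)))
    return max((abs(g - w) for g, w in zip(got, want)), default=0.0)

def first_mismatch(got_ids, want_ids):
    """Index of the first differing token id, or None if the sequences are identical."""
    for i, (g, w) in enumerate(zip(got_ids, want_ids)):
        if g != w:
            return i
    if len(got_ids) != len(want_ids):
        return min(len(got_ids), len(want_ids))
    return None

def passes_gate(got, want, tolerance):
    """True when the kernel output stays within tolerance of the oracle everywhere."""
    return max_abs_error(got, want) <= tolerance
```

A bit-exact check is the special case `tolerance=0.0`, and a tolerance such as the 8.4e-6 quoted above would be passed in for attention+RoPE.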