The inference server

One language, socket to logits

The millrace server is a from-scratch, pure-Mojo GPU inference engine for Qwen2.5 (0.5B and 3B) on Apple Silicon, served over an OpenAI-compatible HTTP API. Every GPU kernel — matmul, attention, RMSNorm, RoPE, SwiGLU, the int4 dequant path — is hand-written in Mojo, reaching Apple's simdgroup_matrix units through AIR external_call. There are no C++, CUDA, or Metal-shader GPU dependencies, and no Python or MAX on the request path.

Goal

Own the whole inference path in one language. Most local-LLM stacks are a thin scripting layer over a C++/Metal core (MLX) or a C/C++ engine (llama.cpp via Ollama). millrace instead writes the transformer forward pass — and every kernel it runs on — by hand in Mojo, on Apple's Metal GPU, with a small, readable codebase.

It's a learning/research engine, and honest about the trade. The mature frameworks are faster; on an M4 the 3B at 4-bit decodes ~18 tok/s vs ~50 for MLX and Ollama. That gap — per-token Metal dispatch overhead on decode, and a Mojo simdgroup_matrix ceiling on prefill GEMM — is the interesting part, and it's documented with raw numbers rather than hidden. The payoff is a stack with no external GPU libraries that you can read end to end, and your data never leaves the machine.

Design

Pure Mojo, GPU-only. The forward pass runs entirely on the Apple Silicon GPU through Mojo's native Metal backend — not MLX, not MAX. There's no production CPU compute path.
No Python or MAX at runtime. Python appears only as a build-time reference oracle for conformance, never on the request path. Once the server is up, the whole path — sockets → tokenizer → GPU model → JSON — is Mojo.
Hardcode first. Dimensions, layer counts, head counts, RoPE base, and weight-tensor names are baked to the Qwen2 architecture (0.5B and 3B auto-detected from the checkpoint) — no config-driven generality until one model is proven end to end.
Single-user, single-request. One prompt, one stream of tokens; the reactor is single-threaded. Concurrency is future work.
Conformance against a trusted oracle. The GPU's f32 output is diffed against CPU/f32 references — HF transformers per-kernel, MAX-on-CPU whole-model — so a mismatch points at a kernel, not a dtype gap. The bar is token-for-token parity with the reference under greedy decoding.
Risk-first. The roadmap proves the one kernel that could sink a GPU-only engine — fused causal GQA attention + RoPE, the exact kernel that returned garbage on a mature framework's own Metal backend — before building the loader, tokenizer, and the rest.
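
The conformance rule above reduces to two checks: a per-kernel maximum absolute error against the f32 reference, and token-for-token equality under greedy decoding. A minimal sketch in Python (the oracle's language); the helper names and the idea of a single scalar tolerance are assumptions for illustration, not the project's actual test harness:

```python
def max_abs_diff(got, want):
    """Largest element-wise absolute difference between two equal-length vectors."""
    if len(got) != len(want):
        raise ValueError("shape mismatch")
    return max((abs(g - w) for g, w in zip(got, want)), default=0.0)


def kernel_passes(got, want, tol):
    """Per-kernel gate: GPU f32 output must stay within tol of the CPU f32 reference."""
    return max_abs_diff(got, want) <= tol


def first_mismatch(got_ids, want_ids):
    """Index of the first differing token id, or None if the sequences are identical."""
    for i, (g, w) in enumerate(zip(got_ids, want_ids)):
        if g != w:
            return i
    if len(got_ids) != len(want_ids):
        return min(len(got_ids), len(want_ids))  # one sequence stopped early
    return None
```

Reporting the index of the first mismatch, rather than a bare pass/fail, makes it easier to see where a greedy run diverges from the reference.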

Implementation

The engine is a small Mojo library: GPU kernels in kernels.mojo, the model in model.mojo, a byte-level BPE tokenizer, chat templating, and the HTTP server.

prompt ─▶ jinja2.mojo template ─▶ BPE tokenizer ─▶ Qwen2 model (GPU/Metal) ─▶ tokenizer ─▶ text
                                                  │  embed → N decoder layers → norm → tied LM head
                                                  │  (RMSNorm · GQA attn + RoPE · SwiGLU · KV cache)
                                          weight loader: safetensors ─▶ bf16 on device

The forward pass

Each decoder layer is the standard Qwen2/Llama pre-norm block: RMSNorm → Q/K/V projections (each carries a bias, a Qwen2-specific detail) → RoPE → causal grouped-query attention (7 query heads per KV head on the 0.5B, 8 on the 3B) → output projection, then RMSNorm → SwiGLU MLP (down(silu(gate(x)) · up(x))), with a residual add after each of the two halves. The output head reuses the embedding matrix as a tied LM head. The hand-written Metal kernels: embed, rmsnorm, matmul (y = x·Wᵀ + b), silu_mul, and attn_cached (RoPE + causal GQA over the KV cache).
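
The per-layer math is small enough to write out as a CPU reference. The sketch below is plain Python, not the Mojo kernels: `rmsnorm` and `silu_mul` borrow the kernel names from the list above, while `matvec`, `swiglu_mlp`, `kv_head_for`, and the `eps` default are assumptions made for illustration.

```python
import math


def rmsnorm(x, weight, eps=1e-6):
    """Scale x by 1/sqrt(mean(x^2) + eps), then by a learned per-channel weight."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def silu_mul(gate, up):
    """Element-wise silu(gate) * up, the fused middle step of SwiGLU."""
    return [silu(g) * u for g, u in zip(gate, up)]


def matvec(W, x, b=None):
    """Single-token case of y = x·Wᵀ + b, with W given as a list of rows."""
    y = [sum(w * v for w, v in zip(row, x)) for row in W]
    return y if b is None else [yi + bi for yi, bi in zip(y, b)]


def swiglu_mlp(x, W_gate, W_up, W_down):
    """down(silu(gate(x)) · up(x)), with no biases."""
    return matvec(W_down, silu_mul(matvec(W_gate, x), matvec(W_up, x)))


def kv_head_for(q_head, n_q_heads, n_kv_heads):
    """Grouped-query attention: consecutive query heads share one KV head."""
    return q_head // (n_q_heads // n_kv_heads)
```

With 14 query heads and 2 KV heads (which gives the 7:1 ratio), `kv_head_for` maps query heads 0 to 6 onto KV head 0 and heads 7 to 13 onto KV head 1.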

Loader, tokenizer, templating — all hand-written

Weight loader: a hand-written safetensors header parser mmaps the checkpoint and uploads tensors to GPU device buffers, verified bit-exact against torch. Weights live on-device as raw bf16 and are widened to f32 per element inside the kernels, halving weight bandwidth relative to f32 storage.
Tokenizer: GPT-2-style byte-level BPE that runs entirely in integer id-space (the byte↔unicode map is a bijection, so no Unicode handling in Mojo) — byte-identical to transformers on an English corpus.
Chat templating: the real Qwen2.5 chat template is rendered by jinja2.mojo, a Jinja2-subset engine — verified byte-identical to transformers.apply_chat_template, including a tools request.
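
The loader's two core steps, reading the safetensors header and widening bf16, are simple to sketch. The safetensors layout used here is the published one: an 8-byte little-endian header length, a JSON header, then raw tensor bytes addressed by `data_offsets` relative to the end of the header. The function names are hypothetical, and this Python version reads from an in-memory byte string rather than an mmap or GPU buffer.

```python
import json
import struct


def parse_safetensors_header(blob):
    """Return (tensor table, byte offset of the data section)."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)  # 8-byte little-endian length
    header = json.loads(blob[8:8 + header_len].decode("utf-8"))
    header.pop("__metadata__", None)                    # optional free-form metadata
    return header, 8 + header_len


def bf16_to_f32(bits):
    """Widen one bf16 bit pattern: bf16 is the top 16 bits of an IEEE-754 f32."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def read_bf16_tensor(blob, header, data_start, name):
    """Decode one BF16 tensor into a flat list of Python floats."""
    info = header[name]
    if info["dtype"] != "BF16":
        raise ValueError("expected a BF16 tensor")
    begin, end = info["data_offsets"]                   # relative to the data section
    raw = blob[data_start + begin:data_start + end]
    return [bf16_to_f32(b) for (b,) in struct.iter_unpack("<H", raw)]
```

Because widening is only a 16-bit shift, it costs almost nothing inside a kernel, which is what makes keeping the weights as bf16 in memory attractive.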

Decode & the KV cache

Prefill runs the whole prompt in one GPU pass; each decode step then appends one position to a single-sequence, GPU-resident KV cache. Keys are RoPE-rotated once on write, so a decode step at sequence length T does O(T) attention work rather than recomputing the O(T²) full pass. Only token ids cross back to the host: the logits and the argmax/sampler stay on-device. Sampling (repetition penalty → temperature → top-k → top-p) matches HF's logits processors; greedy is the default and the verifiable one.
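
A minimal CPU sketch of the rotate-on-write idea, for one KV head. It assumes the rotate-half RoPE layout used by the HF Qwen2 implementation and a base of 1,000,000 (my understanding of the published Qwen2.5 config, not a value stated here); `KVCache` and `decode_attend` are hypothetical names, not the engine's API.

```python
import math


def rope(vec, pos, base=1_000_000.0):
    """Rotate pairs (i, i + d/2) of vec by the angle pos * base^(-2i/d)."""
    d, half = len(vec), len(vec) // 2
    out = list(vec)
    for i in range(half):
        theta = pos * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = b * c + a * s
    return out


class KVCache:
    """Single-sequence cache for one KV head; keys are stored already rotated."""

    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(rope(k, len(self.keys)))  # rotate once, at write time
        self.values.append(list(v))


def decode_attend(cache, q):
    """Attention for the newest position: one pass over the cache, O(T) work."""
    q = rope(q, len(cache.keys) - 1)               # q sits at the last cached position
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in cache.keys]
    m = max(scores)                                 # subtract the max for a stable softmax
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    dim = len(cache.values[0])
    return [sum(wi * v[j] for wi, v in zip(w, cache.values)) / z for j in range(dim)]
```

Causality comes for free in this shape: the cache only ever holds positions up to the current one, so no mask is needed at decode time. Rotating on write is valid because the rotated dot product depends only on the distance between the query and key positions.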

int4 & serving

Setting QWEN_Q4=1 loads the projection weights as group-128 int4 (embedding/LM head stay bf16): on the 3B, ~2× faster decode GEMVs and ~4× smaller projection weights at ~84% top-1 agreement with bf16. The OpenAI-compatible server runs on flare's kqueue reactor — GET /v1/models, POST /v1/chat/completions, and POST /v1/responses, with request params, a usage block, and SSE streaming. opencode drives it directly.
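
To make "group-128 int4" concrete, here is one common way to do it, sketched in Python. The details are assumptions, since only the group size and bit width are stated above: this version is symmetric (one scale per group, no zero point), clamps to [-8, 7], and packs two values per byte with the low nibble first. The engine's actual layout may differ.

```python
GROUP = 128  # weights per scale, matching the group size named above


def quantize_row(row, group=GROUP):
    """Return (packed int4 bytes, per-group scales) for one weight row."""
    if len(row) % group or group % 2:
        raise ValueError("row length must be a multiple of an even group size")
    packed, scales = bytearray(), []
    for g in range(0, len(row), group):
        chunk = row[g:g + group]
        amax = max(abs(v) for v in chunk)
        scale = amax / 7.0 if amax > 0 else 1.0
        scales.append(scale)
        # Round to the nearest int4 step and keep the two's-complement low nibble.
        q = [max(-8, min(7, round(v / scale))) & 0xF for v in chunk]
        packed.extend(q[i] | (q[i + 1] << 4) for i in range(0, group, 2))
    return bytes(packed), scales


def dequantize_row(packed, scales, group=GROUP):
    """Inverse of quantize_row: unpack nibbles, sign-extend, multiply by the group scale."""
    out = []
    for i, byte in enumerate(packed):
        scale = scales[(2 * i) // group]        # both nibbles belong to the same group
        for nib in (byte & 0xF, byte >> 4):
            signed = nib - 16 if nib >= 8 else nib
            out.append(signed * scale)
    return out
```

Under this scheme a 256-element row takes 128 bytes of packed weights instead of 512 bytes of bf16, plus one scale per group, which is where a roughly 4× reduction comes from.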

Verified, every step

Each phase ships behind a conformance gate against the CPU oracle: attention+RoPE matches to ≤ 8.4e-6, matmul is bit-exact, the tokenizer is byte-identical, and end-to-end greedy generation is token-for-token identical to HuggingFace — "What is the capital of France?" → The capital of France is Paris., all eight ids matching. Kernel-optimization work (coalesced GEMV, warp-per-row RMSNorm, flash-style attention, bf16-resident weights) took 0.5B decode from ~11 to ~26 tok/s on an M4, with long context now improving per-token throughput.