One language, socket to logits
The millrace server is a from-scratch, pure-Mojo GPU
inference engine for Qwen2.5 (0.5B and 3B) on
Apple Silicon, served over an OpenAI-compatible HTTP API. Every GPU
kernel — matmul, attention, RMSNorm, RoPE, SwiGLU, the int4 dequant path
— is hand-written in Mojo, reaching Apple's simdgroup_matrix
units through AIR external_call. There are
no C++, CUDA, or Metal-shader GPU dependencies, and no
Python or MAX on the request path.
Goal
Own the whole inference path in one language. Most local-LLM stacks are a thin scripting layer over a C++/Metal core (MLX) or a C/C++ engine (llama.cpp via Ollama). millrace instead writes the transformer forward pass — and every kernel it runs on — by hand in Mojo, on Apple's Metal GPU, with a small, readable codebase.
It's a learning/research engine, and honest about the trade. The mature
frameworks are faster; on an M4 the 3B at 4-bit decodes ~18 tok/s vs
~50 for MLX and Ollama. That gap — per-token Metal dispatch overhead on
decode, and a Mojo simdgroup_matrix ceiling on prefill GEMM
— is the interesting part, and it's documented with raw numbers rather
than hidden. The payoff is a stack with no external GPU libraries that
you can read end to end, and your data never leaves the
machine.
Design
- Pure Mojo, GPU-only. The forward pass runs entirely on the Apple Silicon GPU through Mojo's native Metal backend — not MLX, not MAX. There's no production CPU compute path.
- No Python or MAX at runtime. Python appears only as a build-time reference oracle for conformance, never on the request path. Once the server is up, the whole path — sockets → tokenizer → GPU model → JSON — is Mojo.
- Hardcode first. Dimensions, layer counts, head counts, RoPE base, and weight-tensor names are baked to the Qwen2 architecture (0.5B and 3B auto-detected from the checkpoint) — no config-driven generality until one model is proven end to end.
- Single-user, single-request. One prompt, one stream of tokens; the reactor is single-threaded. Concurrency is future work.
- Conformance against a trusted oracle. The GPU's f32 output is diffed against CPU/f32 references — HF transformers per-kernel, MAX-on-CPU whole-model — so a mismatch points at a kernel, not a dtype gap. The bar is token-for-token parity with the reference under greedy decoding.
- Risk-first. The roadmap proves the one kernel that could sink a GPU-only engine — fused causal GQA attention + RoPE, the exact kernel that returned garbage on a mature framework's own Metal backend — before building the loader, tokenizer, and the rest.
Implementation
The engine is a small Mojo library: GPU kernels in
kernels.mojo, the model in model.mojo, a
byte-level BPE tokenizer, chat templating, and the HTTP server.
prompt ─▶ jinja2.mojo template ─▶ BPE tokenizer ─▶ Qwen2 model (GPU/Metal) ─▶ tokenizer ─▶ text
                                                   │  embed → N decoder layers → norm → tied LM head
                                                   │  (RMSNorm · GQA attn + RoPE · SwiGLU · KV cache)
weight loader: safetensors ─▶ bf16 on device
The forward pass
Each decoder layer is the standard Qwen2/Llama pre-norm block: RMSNorm →
Q/K/V projections (each carries a bias, a Qwen2-specific detail) → RoPE →
causal grouped-query attention (7 query heads per KV head on the 0.5B) →
output projection, then RMSNorm → SwiGLU MLP
(down(silu(gate(x)) · up(x))). Each of the two sub-blocks ends with a
residual add. The head reuses the embedding matrix as a tied LM head. The
hand-written Metal kernels are embed, rmsnorm, matmul (y = x·Wᵀ + b),
silu_mul, and attn_cached (RoPE + causal GQA over the KV cache).
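As a rough CPU reference for what attn_cached computes on one decode step, the sketch below applies RoPE to the query and attends over already-rotated cached keys, with several query heads sharing one KV head. It is plain Python written for illustration, not millrace's Mojo kernel: the function names, the half-split rotation pairing (the convention HF's rotate_half uses), and the default base of 10000 are all assumptions.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate one head vector to position `pos`.
    Pairs element i with element i + d/2 (half-split pairing; an assumption here)."""
    d = len(vec)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        angle = pos * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        out[i] = vec[i] * c - vec[i + half] * s
        out[i + half] = vec[i + half] * c + vec[i] * s
    return out

def kv_head_for(q_head, group):
    """GQA: `group` consecutive query heads read the same KV head."""
    return q_head // group

def attn_decode(q_heads, k_cache, v_cache, pos, group, base=10000.0):
    """One decode step at position `pos`.
    q_heads[h] is an unrotated query vector; k_cache[kv][t] holds keys that were
    rotated when written; v_cache[kv][t] holds the matching values.
    Causality is implicit: only cached positions 0..pos are visible."""
    outs = []
    for h, q in enumerate(q_heads):
        kv = kv_head_for(h, group)
        qr = rope(q, pos, base)
        scale = 1.0 / math.sqrt(len(q))
        keys = k_cache[kv][: pos + 1]
        vals = v_cache[kv][: pos + 1]
        scores = [scale * sum(a * b for a, b in zip(qr, k)) for k in keys]
        m = max(scores)                      # subtract the max for a stable softmax
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        outs.append([sum(wt * v[j] for wt, v in zip(w, vals)) / z
                     for j in range(len(vals[0]))])
    return outs
```

The property that makes rotate-on-write safe is that the query-key dot product depends only on the distance between the two positions, so a key rotated when it was cached never needs touching again.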
Loader, tokenizer, templating — all hand-written
- Weight loader: a hand-written safetensors header parser mmaps the checkpoint and uploads tensors to GPU device buffers, verified bit-exact against torch. Weights live on-device as raw bf16 and are widened per element inside the kernels, halving weight bandwidth.
- Tokenizer: GPT-2-style byte-level BPE that runs entirely in integer id-space (the byte↔unicode map is a bijection, so no Unicode handling in Mojo), byte-identical to transformers on an English corpus.
- Chat templating: the real Qwen2.5 chat template is
rendered by jinja2.mojo,
a Jinja2-subset engine — verified byte-identical to
transformers.apply_chat_template, including a tools request.
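To make the loader step concrete, here is a small Python sketch of the two pieces it relies on: reading a safetensors header (an 8-byte little-endian length followed by a JSON table of dtypes, shapes, and byte offsets) and widening bf16 to f32 by placing the 16 stored bits in the top half of a 32-bit float. It works on an in-memory bytes object rather than an mmap, and the function names are hypothetical; it is a reference for the file format, not millrace's Mojo loader.

```python
import json
import struct

def parse_header(blob: bytes):
    """Return (tensor table, offset where tensor data begins)."""
    (n,) = struct.unpack_from("<Q", blob, 0)      # header length, u64 little-endian
    header = json.loads(blob[8:8 + n].decode("utf-8"))
    header.pop("__metadata__", None)              # optional free-form metadata entry
    return header, 8 + n

def bf16_to_f32(bits: int) -> float:
    """bf16 is the top 16 bits of an IEEE f32, so widening is a 16-bit left shift."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]

def read_bf16(blob: bytes, header: dict, data_start: int, name: str):
    """Decode one BF16 tensor to a flat list of Python floats."""
    info = header[name]
    if info["dtype"] != "BF16":
        raise ValueError(f"{name}: expected BF16, got {info['dtype']}")
    begin, end = info["data_offsets"]             # relative to data_start
    raw = blob[data_start + begin:data_start + end]
    return [bf16_to_f32(b) for (b,) in struct.iter_unpack("<H", raw)]
```

Because widening is only a shift, a kernel can keep weights at two bytes each in memory and expand them in registers, which is where the bandwidth saving comes from.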
Decode & the KV cache
Prefill runs the whole prompt in one GPU pass; each decode step then appends one position to a single-sequence, GPU-resident KV cache. Keys are RoPE-rotated once on write and never re-rotated, so a decode step costs O(T) in the number of cached positions T, not O(T²). Only token ids cross back to the host; the logits and the argmax/sampler stay on-device. Sampling (repetition penalty → temperature → top-k → top-p) matches HF's logits processors; greedy is the default and the verifiable one.
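The sampling chain can be sketched on the host in a few lines of Python. The version below follows the order given above and the semantics of HF's processors as commonly described (divide positive logits and multiply negative ones by the penalty; keep the k largest; keep the smallest high-probability set whose mass reaches p). The function names are hypothetical and this is not the on-device sampler.

```python
import math

NEG_INF = float("-inf")

def apply_repetition_penalty(logits, seen_ids, penalty):
    """Make already-generated tokens less likely."""
    out = list(logits)
    for t in set(seen_ids):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

def apply_temperature(logits, temperature):
    return [x / temperature for x in logits]

def apply_top_k(logits, k):
    """Mask everything below the k-th largest logit (ties at the cutoff survive)."""
    if k <= 0 or k >= len(logits):
        return list(logits)
    kth = sorted(logits, reverse=True)[k - 1]
    return [x if x >= kth else NEG_INF for x in logits]

def softmax(logits):
    m = max(logits)
    e = [math.exp(x - m) for x in logits]   # masked (-inf) entries become 0.0
    z = sum(e)
    return [v / z for v in e]

def apply_top_p(logits, p):
    """Keep tokens in descending probability until the kept mass reaches p."""
    probs = softmax(logits)
    order = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    keep, cum = set(), 0.0
    for i in order:
        if keep and cum >= p:
            break
        keep.add(i)
        cum += probs[i]
    return [x if i in keep else NEG_INF for i, x in enumerate(logits)]

def next_token(logits, seen_ids, penalty=1.0, temperature=1.0,
               top_k=0, top_p=1.0, rng=None):
    """rng=None means greedy argmax; pass a random.Random to sample instead."""
    x = apply_repetition_penalty(logits, seen_ids, penalty)
    if rng is None:
        return max(range(len(x)), key=lambda i: x[i])
    x = apply_top_p(apply_top_k(apply_temperature(x, temperature), top_k), top_p)
    probs = softmax(x)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Greedy decoding skips the random draw entirely, which is what makes it the mode that can be checked token for token against a reference.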
int4 & serving
Setting QWEN_Q4=1 loads the projection weights as
group-128 int4 (embedding/LM head stay bf16): on the 3B,
~2× faster decode GEMVs and ~4× smaller projection weights at ~84% top-1
agreement with bf16. The OpenAI-compatible server runs on
flare's kqueue reactor —
GET /v1/models, POST /v1/chat/completions, and
POST /v1/responses, with request params, a usage
block, and SSE streaming. opencode drives it directly.
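For readers unfamiliar with group-wise int4, the sketch below quantizes one group of 128 weights to 4-bit codes with a per-group scale and minimum, packs two codes per byte, and dequantizes them again. The asymmetric scale-plus-minimum scheme, the low-nibble-first packing, and the function names are assumptions made for illustration; the source does not say which int4 layout millrace uses.

```python
GROUP = 128  # weights per quantization group, as in "group-128"

def quantize_group(ws):
    """Map one group of floats to 4-bit codes 0..15 plus (scale, minimum)."""
    lo, hi = min(ws), max(ws)
    scale = (hi - lo) / 15.0 or 1.0          # avoid a zero scale for constant groups
    q = [min(15, max(0, round((w - lo) / scale))) for w in ws]
    return q, scale, lo

def pack_nibbles(q):
    """Two 4-bit codes per byte, low nibble first (assumed layout)."""
    return bytes(q[i] | (q[i + 1] << 4) for i in range(0, len(q), 2))

def dequantize_group(packed, scale, lo):
    """Inverse of pack + quantize: code * scale + minimum."""
    out = []
    for b in packed:
        out.append((b & 0xF) * scale + lo)
        out.append((b >> 4) * scale + lo)
    return out
```

Under this layout a group of 128 weights occupies 64 bytes of codes plus its scale and minimum, against 256 bytes in bf16, and each reconstructed weight is off by at most half a quantization step.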
Verified, every step
Each phase ships behind a conformance gate against the CPU oracle:
attention+RoPE matches to ≤ 8.4e-6, matmul is bit-exact, the tokenizer is
byte-identical, and end-to-end greedy generation is
token-for-token identical to HuggingFace — "What is the
capital of France?" → The capital of France is Paris., all
eight ids matching. Kernel-optimization work (coalesced GEMV, warp-per-row
RMSNorm, flash-style attention, bf16-resident weights) took 0.5B decode
from ~11 to ~26 tok/s on an M4, with long context now improving
per-token throughput.
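A conformance gate of this kind reduces to two checks: a maximum absolute difference between a kernel's output and the reference, and the index of the first generated token id that disagrees. A minimal Python sketch follows, with hypothetical function names and made-up sample values; the real gates compare GPU output with the HF and MAX-on-CPU references.

```python
def max_abs_diff(got, ref):
    """Largest element-wise absolute error between two flat float lists."""
    if len(got) != len(ref):
        raise ValueError("shape mismatch")
    return max(abs(g - r) for g, r in zip(got, ref))

def check_kernel(name, got, ref, tol):
    """Report one kernel's error against a tolerance (tol=0.0 demands bit-exactness)."""
    err = max_abs_diff(got, ref)
    return {"kernel": name, "max_abs_err": err, "ok": err <= tol}

def first_mismatch(got_ids, ref_ids):
    """Index of the first differing token id, or None when the sequences match."""
    for i, (g, r) in enumerate(zip(got_ids, ref_ids)):
        if g != r:
            return i
    if len(got_ids) != len(ref_ids):
        return min(len(got_ids), len(ref_ids))
    return None
```

Reporting the first mismatching index rather than a bare pass/fail makes it easier to tell a sampling or cache bug (late divergence) from a kernel bug (divergence at the first token).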