How it works
A closer look at the inference engine — a from-scratch, pure-Mojo GPU stack where every kernel is hand-written — the models it serves, and how it compares to MLX.
The inference engine
Most local-LLM stacks are a thin scripting layer over a C++/Metal core (MLX) or a C/C++ engine (llama.cpp, via Ollama). millrace instead writes the transformer forward pass — and every GPU kernel it runs on — by hand in Mojo, on Apple's Metal GPU. No C++, no CUDA, no hand-written Metal shaders, no Python or MAX on the request path. Socket to logits, it's one language.
What's hand-written
- Every kernel. `embed`, `rmsnorm`, `matmul` (y = x·Wᵀ + b), `silu_mul`, and a fused `attn_cached` (RoPE + causal grouped-query attention over the KV cache), all written in Mojo and reaching Apple's `simdgroup_matrix` units through AIR intrinsics.
- The loader. A hand-written safetensors parser mmaps the checkpoint and uploads tensors to GPU buffers, verified bit-exact against torch. Weights live on-device as bf16, or as group-128 int4 for the projection weights (~2× faster decode GEMVs, ~4× smaller, at coherent quality).
- The tokenizers + templates. Two BPE tokenizers (Qwen's GPT-2-style byte-level and Gemma's SentencePiece-style, each byte-identical to `transformers`), plus the real Qwen2.5 and Gemma 4 chat templates (Qwen's via jinja2.mojo, Gemma's rendered in Mojo).
- The server. An OpenAI-compatible HTTP API (`/v1/chat/completions`, `/v1/responses`, `/v1/embeddings`) on flare's reactor, with a disk-backed prefix cache. One process serves a chat model and an embedding model side by side.
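To make the "group-128 int4" and "~4× smaller" figures concrete, here is a minimal Python sketch of group-wise 4-bit quantization. It is an illustration only: the symmetric absmax scaling, the 16-bit scale per group, the nibble order, and every function name are assumptions, not the engine's actual on-device layout.

```python
import math

GROUP = 128  # group size taken from the text; everything else is assumed


def quantize_group(values):
    """Symmetric int4 (assumed scheme): integers in [-8, 7] plus one scale."""
    amax = max(abs(v) for v in values)
    scale = amax / 7.0 if amax > 0 else 1.0
    q = [max(-8, min(7, round(v / scale))) for v in values]
    return q, scale


def dequantize_group(q, scale):
    """Recover approximate floats from the int4 codes."""
    return [qi * scale for qi in q]


def pack_nibbles(q):
    """Pack two int4 codes per byte, low nibble first (assumed order)."""
    out = bytearray()
    for i in range(0, len(q), 2):
        out.append((q[i] & 0xF) | ((q[i + 1] & 0xF) << 4))
    return bytes(out)


def unpack_nibbles(data):
    """Inverse of pack_nibbles: sign-extend each 4-bit code."""
    q = []
    for b in data:
        for nib in (b & 0xF, b >> 4):
            q.append(nib - 16 if nib >= 8 else nib)
    return q


def bits_per_weight(group=GROUP, scale_bits=16):
    """Effective storage cost: 4 bits per code plus the amortized scale."""
    return 4 + scale_bits / group


weights = [math.sin(i) for i in range(GROUP)]
codes, scale = quantize_group(weights)
packed = pack_nibbles(codes)
restored = dequantize_group(unpack_nibbles(packed), scale)
```

Under these assumptions each weight costs 4 + 16/128 = 4.125 bits, so the ratio against bf16 is 16 / 4.125 ≈ 3.9, which is consistent with the "~4× smaller" figure above.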
Correctness is a hard gate at every step: the GPU's f32 output is diffed against CPU/f32 references — HF per-kernel, MAX-on-CPU whole-model — to token-for-token parity under greedy decoding. Full design in ARCHITECTURE.md.
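The parity gate can be pictured with a small Python sketch. It assumes both the engine and the reference expose per-step logits for the same prompt; the function names are hypothetical and this is not the project's actual test harness.

```python
def argmax(logits):
    """Index of the largest logit; ties resolve to the lowest index."""
    best = 0
    for i in range(1, len(logits)):
        if logits[i] > logits[best]:
            best = i
    return best


def greedy_tokens(steps):
    """Greedy decoding picks the argmax token at every step."""
    return [argmax(logits) for logits in steps]


def first_divergence(engine_steps, reference_steps):
    """Return the first step whose greedy token differs, or None on parity."""
    a = greedy_tokens(engine_steps)
    b = greedy_tokens(reference_steps)
    for step, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return step
    if len(a) != len(b):
        return min(len(a), len(b))  # one side stopped early
    return None


def max_abs_diff(x, y):
    """Largest element-wise difference, useful for per-kernel comparisons."""
    return max(abs(p - q) for p, q in zip(x, y))


reference = [[0.1, 0.9, 0.0], [0.5, 0.2, 0.3]]
close = [[0.1, 0.9000001, 0.0], [0.5, 0.2, 0.3]]   # tiny numeric noise
broken = [[0.1, 0.9, 0.0], [0.2, 0.5, 0.3]]        # wrong token at step 1
```

Small floating-point differences are tolerated as long as the argmax token is unchanged; a gate in this style fails only when `first_divergence` returns a step index.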
The models
Three model families run on the same engine, behind one OpenAI-compatible API:
- Qwen 2.5: the default chat model (0.5B and 3B, auto-detected from the checkpoint), served on `/v1/chat/completions` and `/v1/responses`.
- Qwen 3: the Qwen3-Embedding model, served on `/v1/embeddings` (last-token-pooled, L2-normalized), so one process answers chat and embedding requests side by side.
- Gemma 4: the 12B chat model (group-128 int4), with its own SentencePiece tokenizer, sandwich norms, mixed sliding/full attention, thinking channel, and tool calling, matched token-for-token to `transformers`.
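The "last-token-pooled, L2-normalized" step for embeddings can be sketched in a few lines of Python. This is a plain-list illustration with hypothetical function names, not the engine's GPU code: it takes the hidden state of the last real (non-padding) token and scales it to unit length.

```python
import math


def last_token_pool(hidden_states, attention_mask):
    """Pick the hidden vector of the last token whose mask entry is 1."""
    last = max(i for i, m in enumerate(attention_mask) if m == 1)
    return hidden_states[last]


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit Euclidean length (eps guards against zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


# Three token positions, the last one is padding.
hidden = [[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
embedding = l2_normalize(last_token_pool(hidden, mask))
```

Because the result has unit length, the cosine similarity between two embeddings reduces to a plain dot product.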
Performance
What the engine is (pure Mojo, custom Mojo kernels targeting Metal, no GPU dependencies) versus the mature frameworks:
| | millrace | MLX | Ollama / llama.cpp |
|---|---|---|---|
| Implementation | pure Mojo | C++/Metal core, Python API | C/C++, Metal backend |
| GPU kernels | custom Mojo (Metal via AIR) | MLX framework | llama.cpp shaders |
| GPU dependencies | none | MLX | llama.cpp |
| Weights | bf16 / group-128 int4 | 4-bit affine | GGUF (Q4_K_M…) |
Chat-model throughput on an Apple M4, measured with the
bench harness (two-point method, median of 5, one
engine resident at a time, temp 0). millrace is group-128 int4; MLX is
4-bit. `mlx_lm` does not support Gemma 4's architecture, so its
Gemma row is empty.
| chat model | engine | decode tok/s | prefill (71 tok) | prefill (1570 tok) |
|---|---|---|---|---|
| Qwen2.5-3B | millrace int4 | 18.8 | 201 ms | 5790 ms |
| Qwen2.5-3B | MLX 4-bit | 51.7 | 221 ms | 2798 ms |
| Gemma-4-12B | millrace int4 | 6.2 | 703 ms | 25783 ms |
| Gemma-4-12B | MLX | — | — | — |
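The bench harness itself is not shown here, so the following Python sketch is only one plausible reading of "two-point method, median of 5". It assumes each sample times two generations of different lengths and takes the slope, so that fixed costs (prefill, HTTP overhead) cancel out. The function names and the timings are made up for illustration.

```python
from statistics import median


def two_point_rate(n_short, t_short, n_long, t_long):
    """Decode tokens per second from the slope between two timed runs."""
    return (n_long - n_short) / (t_long - t_short)


def median_rate(samples):
    """Median of the per-sample two-point rates."""
    return median(two_point_rate(*s) for s in samples)


# Synthetic example: 0.5 s fixed overhead plus 0.0625 s per decoded token.
# Each tuple is (short token count, short seconds, long token count, long seconds).
samples = [
    (32, 2.5, 128, 8.5),
    (32, 2.5, 128, 8.5),
    (32, 2.6, 128, 8.5),   # a slightly noisy repeat
    (32, 2.5, 128, 8.7),   # another noisy repeat
    (32, 2.5, 128, 8.5),
]
rate = median_rate(samples)
```

Taking the slope rather than dividing total tokens by total time keeps prefill cost out of the decode figure, and the median makes the result robust to an occasional slow run.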