How it works

A closer look at the inference engine — a from-scratch, pure-Mojo GPU stack where every kernel is hand-written — the models it serves, and how it compares to MLX.

The inference engine

Most local-LLM stacks are a thin scripting layer over a C++/Metal core (MLX) or a C/C++ engine (llama.cpp, via Ollama). millrace instead writes the transformer forward pass, and every GPU kernel it runs on, by hand in Mojo, targeting Apple GPUs through Metal. No C++, no CUDA, no hand-written Metal shaders, no Python or MAX on the request path. From socket to logits, it's one language.

What's hand-written

Every kernel. embed, rmsnorm, matmul (y = x·Wᵀ + b), silu_mul, and a fused attn_cached (RoPE + causal grouped-query attention over the KV cache) — written in Mojo, reaching Apple's simdgroup_matrix units through AIR intrinsics.
The loader. A hand-written safetensors parser mmaps the checkpoint and uploads tensors to GPU buffers; the loaded tensors are verified bit-exact against torch. Weights live on-device as bf16, or as group-128 int4 for the projection weights (versus bf16: ~2× faster decode GEMVs and ~4× smaller, at coherent quality).
The tokenizers + templates. Two BPE tokenizers — Qwen's GPT-2-style byte-level and Gemma's SentencePiece-style, each byte-identical to transformers — plus the real Qwen2.5 and Gemma 4 chat templates (Qwen's via jinja2.mojo, Gemma's rendered in Mojo).
The server. An OpenAI-compatible HTTP API (/v1/chat/completions, /v1/responses, /v1/embeddings) on flare's reactor, with a disk-backed prefix cache. One process serves a chat model and an embedding model side by side.
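
As a rough illustration of what "group-128 int4" storage implies, here is a minimal sketch in plain Python. It assumes symmetric quantization with one 16-bit scale per group of 128 weights; the source does not say whether millrace uses a symmetric or affine scheme, or how scales are stored, so `quantize_group`, `dequantize_group`, and `bits_per_weight` are hypothetical names for illustration only.

```python
import random

GROUP = 128  # group size named in the text


def quantize_group(ws):
    """Symmetric int4 (assumed scheme): one scale per group, codes in [-8, 7]."""
    amax = max(abs(w) for w in ws)
    scale = amax / 7.0 if amax > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in ws]
    return scale, codes


def dequantize_group(scale, codes):
    """Recover approximate weights from a scale and its int4 codes."""
    return [scale * c for c in codes]


def quantize_row(row):
    """Split one weight row into groups of GROUP and quantize each group."""
    assert len(row) % GROUP == 0
    return [quantize_group(row[i:i + GROUP]) for i in range(0, len(row), GROUP)]


def bits_per_weight(scale_bits=16):
    """4 bits per code plus the per-group scale amortized over the group."""
    return 4 + scale_bits / GROUP


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(4 * GROUP)]  # synthetic weights
groups = quantize_row(row)
restored = [w for scale, codes in groups for w in dequantize_group(scale, codes)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
compression = 16 / bits_per_weight()  # relative to 16-bit bf16 storage
```

Under these assumptions each weight costs 4.125 bits, so the compression against bf16 works out to a little under 4×, which is consistent with the "~4× smaller" figure. The rounding error per weight is bounded by half of that group's scale.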

Correctness is a hard gate at every step: the GPU's f32 output is diffed against CPU/f32 references — HF per-kernel, MAX-on-CPU whole-model — to token-for-token parity under greedy decoding. Full design in ARCHITECTURE.md.
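
The two levels of checking described above can be sketched in plain Python. This is not millrace's test code: the reference `rmsnorm` and `silu_mul` follow the commonly used definitions (the `eps` value and the plain multiply-by-weight convention are assumptions), and `max_abs_diff`, `greedy_token`, and `first_mismatch` are hypothetical helpers showing a per-kernel diff and a greedy token-parity comparison.

```python
import math


def rmsnorm(x, weight, eps=1e-6):
    """Reference RMSNorm: scale x by 1/sqrt(mean(x^2) + eps), then by weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv * w for v, w in zip(x, weight)]


def silu_mul(gate, up):
    """Reference SiLU(gate) * up, the gated-MLP activation."""
    return [g / (1.0 + math.exp(-g)) * u for g, u in zip(gate, up)]


def max_abs_diff(a, b):
    """Per-kernel check: largest elementwise difference between two outputs."""
    return max(abs(p - q) for p, q in zip(a, b))


def greedy_token(logits):
    """Greedy decoding picks the index of the largest logit."""
    return max(range(len(logits)), key=logits.__getitem__)


def first_mismatch(ref_steps, test_steps):
    """Whole-model check: first step where greedy tokens differ, else None."""
    for i, (ref, test) in enumerate(zip(ref_steps, test_steps)):
        if greedy_token(ref) != greedy_token(test):
            return i
    return None


# Stand-in for a GPU result: the reference output with tiny perturbations.
x = [1.0, -2.0, 3.0, 0.5]
reference = rmsnorm(x, [1.0] * len(x))
candidate = [v + 1e-7 for v in reference]
kernel_ok = max_abs_diff(reference, candidate) < 1e-5

ref_logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3]]
gpu_logits = [[0.1001, 1.9999, -1.0], [1.4999, 0.2, 0.3001]]
parity_ok = first_mismatch(ref_logits, gpu_logits) is None
```

Token parity is the stricter gate in practice: small numeric differences are tolerated per kernel, but they must never change which token greedy decoding selects.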

The models

Three model families run on the same engine, behind one OpenAI-compatible API:

Qwen 2.5 — the default chat model (0.5B and 3B, auto-detected from the checkpoint), served on /v1/chat/completions and /v1/responses.
Qwen 3 — the Qwen3-Embedding model, served on /v1/embeddings (last-token-pooled, L2-normalized), so one process answers chat and embedding requests side by side.
Gemma 4 — the 12B chat model (group-128 int4), with its own SentencePiece tokenizer, sandwich norms, mixed sliding/full attention, thinking channel, and tool calling — matched token-for-token to transformers.
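
For the embedding path, "last-token-pooled, L2-normalized" can be illustrated with a short plain-Python sketch. It assumes a single unpadded sequence whose final hidden states are given as a list of per-token vectors; `last_token_pool`, `l2_normalize`, and `embed` are hypothetical names, not millrace functions.

```python
import math


def last_token_pool(hidden_states):
    """Use the final token's hidden state as the sequence representation."""
    return hidden_states[-1]


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit Euclidean length (eps guards against zero norm)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


def embed(hidden_states):
    """Last-token pooling followed by L2 normalization."""
    return l2_normalize(last_token_pool(hidden_states))


def cosine(a, b):
    """For unit-length vectors the dot product equals cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


# Toy hidden states: two tokens, two dimensions.
vector = embed([[1.0, 0.0], [3.0, 4.0]])
```

Because the returned vectors have unit length, a client can rank results with a plain dot product instead of computing cosine similarity separately.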

Performance

How the engine (pure Mojo, custom Mojo kernels on Metal, no GPU library dependencies) compares with the mature frameworks:

	millrace	MLX	Ollama / llama.cpp
Implementation	pure Mojo	C++/Metal core, Python API	C/C++, Metal backend
GPU kernels	custom Mojo (Metal via AIR)	MLX framework	llama.cpp shaders
GPU dependencies	none	MLX	llama.cpp
Weights	bf16 / group-128 int4	4-bit affine	GGUF (Q4_K_M…)

Chat-model throughput on an Apple M4, measured with the bench harness (two-point method, median of 5, one engine resident at a time, temperature 0). millrace uses group-128 int4 weights; MLX uses 4-bit weights. mlx_lm does not support Gemma 4's architecture, so its Gemma row is empty.

chat model	engine	decode tok/s	prefill (71 tok)	prefill (1570 tok)
Qwen2.5-3B	millrace int4	18.8	201 ms	5790 ms
Qwen2.5-3B	MLX 4-bit	51.7	221 ms	2798 ms
Gemma-4-12B	millrace int4	6.2	703 ms	25783 ms
Gemma-4-12B	MLX	—	—	—
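
Two small calculations may help when reading this table. The text names a "two-point method" without defining it; the sketch below assumes it means timing two runs of different lengths and taking the slope, so that fixed per-request overhead cancels out. The timings passed to `two_point_rate` are synthetic placeholders, not measurements. The second part converts the table's prefill latencies into tokens per second using only the figures above. All function names are hypothetical.

```python
from statistics import median


def two_point_rate(n_short, t_short, n_long, t_long):
    """Tokens per second from the slope between two (tokens, seconds) points."""
    return (n_long - n_short) / (t_long - t_short)


def median_rate(trials):
    """Median of the two-point rate over repeated trials."""
    return median(two_point_rate(*trial) for trial in trials)


# Synthetic example: 0.5 s fixed overhead plus 0.05 s per token (not measured).
synthetic_trials = [(32, 2.1, 160, 8.5)] * 5
synthetic_rate = median_rate(synthetic_trials)

# Rows copied from the table: (model, engine, prompt tokens, prefill ms).
PREFILL_MS = [
    ("Qwen2.5-3B", "millrace int4", 71, 201),
    ("Qwen2.5-3B", "millrace int4", 1570, 5790),
    ("Qwen2.5-3B", "MLX 4-bit", 71, 221),
    ("Qwen2.5-3B", "MLX 4-bit", 1570, 2798),
    ("Gemma-4-12B", "millrace int4", 71, 703),
    ("Gemma-4-12B", "millrace int4", 1570, 25783),
]


def prefill_tok_per_s(tokens, millis):
    """Convert a prefill latency in milliseconds into prompt tokens per second."""
    return tokens / (millis / 1000.0)


rates = {(m, e, n): prefill_tok_per_s(n, ms) for m, e, n, ms in PREFILL_MS}

for (model, engine, tokens), rate in rates.items():
    print(f"{model:12s} {engine:14s} {tokens:5d} tok  {rate:7.1f} tok/s")
```

Expressed as rates, the Qwen2.5-3B figures show millrace's prefill throughput dropping as the prompt grows from 71 to 1570 tokens, while MLX's rises over the same range.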