← millrace github.com/millrace ↗
how it works

How it works

A closer look at the inference engine — a from-scratch, pure-Mojo GPU stack where every kernel is hand-written — the models it serves, and how it compares to MLX.

The inference engine

Most local-LLM stacks are a thin scripting layer over a C++/Metal core (MLX) or a C/C++ engine (llama.cpp, via Ollama). millrace instead writes the transformer forward pass — and every GPU kernel it runs on — by hand in Mojo, on Apple's Metal GPU. No C++, no CUDA, no hand-written Metal shaders, no Python or MAX on the request path. Socket to logits, it's one language.

What's hand-written

Correctness is a hard gate at every step: the GPU's f32 output is diffed against CPU/f32 references — HF per-kernel, MAX-on-CPU whole-model — to token-for-token parity under greedy decoding. Full design in ARCHITECTURE.md.

The models

Three model families run on the same engine, behind one OpenAI-compatible API:

Performance

What the engine is — pure Mojo, custom Metal kernels, no GPU dependencies — versus the mature frameworks:

millraceMLXOllama / llama.cpp
Implementationpure MojoC++/Metal core, Python APIC/C++, Metal backend
GPU kernelscustom Mojo (Metal via AIR)MLX frameworkllama.cpp shaders
GPU dependenciesnoneMLXllama.cpp
Weightsbf16 / group-128 int44-bit affineGGUF (Q4_K_M…)

Chat-model throughput on an Apple M4, measured with the bench harness (two-point method, median of 5, one engine resident at a time, temp 0). millrace is group-128 int4; MLX is 4-bit. mlx_lm does not support Gemma 4's architecture, so it has no Gemma row.

chat modelenginedecode tok/sprefill (71 tok)prefill (1570 tok)
Qwen2.5-3Bmillrace int418.8201 ms5790 ms
Qwen2.5-3BMLX 4-bit51.7221 ms2798 ms
Gemma-4-12Bmillrace int46.2703 ms25 783 ms
Gemma-4-12BMLX