System shape
Single-process server
Kiln is built as one deployable Rust binary: a single process owns the OpenAI-compatible
Axum HTTP API, request scheduler, model engine, adapter registry, and background training
workers. There is no Python sidecar and no second model load for the normal SFT or GRPO path.
client request
  -> axum HTTP/API
  -> scheduler + block manager
  -> Qwen/Qwen3.5-4B engine
  -> sampler or streaming SSE response

training request
  -> axum HTTP/API
  -> LoRA training queue
  -> hot-swapped adapter path
Inference flow
Request path and batching
API edge
/v1/chat/completions, /v1/models, /health, and streaming responses stay compatible with familiar OpenAI-style clients.
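As a sketch of what that compatibility means in practice, the snippet below builds a minimal request body in the standard OpenAI chat-completions shape that /v1/chat/completions accepts. The host and port are an assumption, not a documented Kiln default; the field names are the standard OpenAI schema.

```python
import json

# Illustrative only: an OpenAI-style chat payload for Kiln's
# /v1/chat/completions endpoint. The address below is an assumption.
KILN_URL = "http://localhost:8000/v1/chat/completions"  # assumed address

def chat_payload(prompt: str, stream: bool = False) -> str:
    """Serialize a minimal chat-completions request body."""
    return json.dumps({
        "model": "Qwen/Qwen3.5-4B",
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,  # True -> the server answers with SSE chunks
    })

body = json.loads(chat_payload("Hello, Kiln!", stream=True))
```

Because the schema matches, existing OpenAI client libraries can be pointed at the server simply by overriding their base URL.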
Scheduler
The iteration-level scheduler combines continuous batching with chunked prefill, so long prompts and decode work share one GPU loop.
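A toy sketch of that sharing, under an assumed per-iteration token budget (not Kiln's actual scheduler): each GPU step gives every in-flight decode sequence one token, then fills the remaining budget with a chunk of a pending prompt's prefill.

```python
# Toy iteration-level scheduling: decode sequences and one prefill chunk
# share a single token budget each GPU step. Budget size is an assumption.
TOKEN_BUDGET = 8  # tiny, for illustration

def schedule_step(prefill_remaining: int, num_decoding: int) -> tuple[int, int]:
    """Return (prefill_tokens, decode_tokens) for one iteration."""
    decode_tokens = min(num_decoding, TOKEN_BUDGET)  # one token per decode seq
    prefill_tokens = min(prefill_remaining, TOKEN_BUDGET - decode_tokens)
    return prefill_tokens, decode_tokens

# A 20-token prompt is prefilled in chunks while 3 sequences keep decoding:
# decode latency stays bounded instead of stalling behind the whole prompt.
steps, remaining = [], 20
while remaining > 0:
    p, d = schedule_step(remaining, num_decoding=3)
    steps.append((p, d))
    remaining -= p
```

Without chunking, the 20-token prefill would monopolize an entire iteration; with it, the three decoding sequences advance every step.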
Memory
The paged KV block manager tracks blocks for the full-attention layers only; Qwen3.5-4B's Gated DeltaNet layers need no KV cache.
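Back-of-envelope arithmetic makes the footprint difference concrete. The KV head count, head dimension, and dtype width below are illustrative assumptions, not Qwen3.5-4B's published configuration; only the layer counts come from this document.

```python
# KV cache size = K and V tensors for every cached layer at a given context.
# Head count, head dim, and dtype bytes are assumed illustrative numbers.
def kv_cache_bytes(n_layers: int, n_kv_heads: int = 4, head_dim: int = 128,
                   dtype_bytes: int = 2, tokens: int = 32_768) -> int:
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * tokens

hybrid = kv_cache_bytes(n_layers=8)   # only the 8 full-attention GQA layers
dense = kv_cache_bytes(n_layers=32)   # if all 32 layers needed KV cache
```

With these assumed numbers the hybrid cache is a quarter of the dense one (512 MiB vs 2 GiB at 32K tokens); the real ratio depends on the actual head configuration, but the 8-of-32 layer split is what makes long context fit.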
Model architecture
Why the Gated DeltaNet hybrid matters
Qwen3.5-4B is not a plain all-attention transformer. Its 32 layers are split into
24 Gated DeltaNet linear-attention layers and 8 full GQA layers, and Kiln is tuned
around that exact shape instead of hiding it behind a generic model-family abstraction.
Fixed GDN state
Gated DeltaNet layers carry fixed-size recurrent state through the sequence instead of storing per-token K/V tensors.
Small KV footprint
Only the 8 full-attention GQA layers need KV cache, which is why long context can fit on one GPU.
Targeted kernels
GDN handles most layers with recurrent linear attention while GQA handles periodic full-attention checkpoints.
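The constant-memory property of the GDN layers can be sketched with a pure-Python gated delta-rule update (a sketch, not Kiln's kernel; per-token scalar gates are a simplification). The layer keeps one fixed d_v x d_k state matrix and overwrites it each token, so memory does not grow with sequence length.

```python
# Minimal gated delta-rule recurrence: S <- alpha*S + beta*(v - S k) k^T.
# alpha gates (decays) the old state; beta scales the delta-rule write.
def gdn_step(S, k, v, alpha, beta):
    d_v, d_k = len(S), len(S[0])
    Sk = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]  # S k
    return [[alpha * S[i][j] + beta * (v[i] - Sk[i]) * k[j]
             for j in range(d_k)] for i in range(d_v)]

# Stream 1000 tokens through: the state stays a 2x2 matrix throughout,
# unlike a KV cache, which would have grown to 1000 entries.
S = [[0.0, 0.0], [0.0, 0.0]]  # d_v = d_k = 2, tiny for illustration
for t in range(1000):
    S = gdn_step(S, k=[1.0, 0.0], v=[1.0, -1.0], alpha=0.9, beta=0.5)
```

The delta term `v - S k` writes only the part of the value not already predicted by the state, which is what distinguishes DeltaNet-style updates from plain linear attention's additive `v k^T` accumulation.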
For the detailed recurrence, paging, and kernel discussion, read the
full ARCHITECTURE.md deep dive.
Live learning
LoRA hot-swap and training queue
Training requests enter the same server through /v1/train/sft or
/v1/train/grpo. A FIFO background queue trains LoRA adapter weights against the
loaded base model, checkpoints progress, then publishes the new adapter atomically at an
iteration boundary. Subsequent inference requests see the updated adapter without restarting
the server.
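The publish pattern above can be sketched as a FIFO worker plus a single-reference swap (a toy sketch, not kiln-train's code): the worker trains jobs one at a time and publishes each finished adapter by replacing one reference, so in-flight requests keep the version they started with and later requests pick up the new one without a restart.

```python
import queue
import threading

# Toy adapter registry: published adapters are immutable snapshots, and
# "publish" is one atomic reference swap, so readers never block on training.
class AdapterRegistry:
    def __init__(self):
        self._current = {"name": "base", "version": 0}

    def publish(self, adapter: dict) -> None:
        self._current = adapter  # atomic swap at an "iteration boundary"

    def current(self) -> dict:
        return self._current

jobs = queue.Queue()  # FIFO training queue
registry = AdapterRegistry()

def train_worker():
    while not jobs.empty():
        job = jobs.get()  # "training" elided; we only model the publish step
        adapter = {"name": job, "version": registry.current()["version"] + 1}
        registry.publish(adapter)

jobs.put("sft-job-1")
jobs.put("grpo-job-2")
t = threading.Thread(target=train_worker)
t.start()
t.join()
```

Treating each published adapter as immutable is what makes the single-swap publish safe: no reader ever observes a half-updated set of weights.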
SFT
Correct input/output examples update adapter weights directly from supervised loss.
GRPO
Scored completions become a generate → score → train loop through the same HTTP API.
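The core of that loop is GRPO's group-relative scoring: each completion's reward is normalized against the mean and standard deviation of its own group of completions for the same prompt. A minimal sketch of that step (the surrounding generate and train calls are elided):

```python
# Group-relative advantages: reward minus group mean, scaled by group std.
# Positive means better than the group average for this prompt.
def group_advantages(rewards: list[float]) -> list[float]:
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # avoid division by zero on uniform rewards
    return [(r - mean) / std for r in rewards]

# One group of 4 scored completions, e.g. pass/fail scores from a grader.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_advantages(rewards)
```

Because advantages are computed within each group, no separate value model is needed; the group itself serves as the baseline.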
Inspect your live server’s queue and registry: kiln train status shows pending and running jobs, and kiln adapters list shows the published adapters available to subsequent requests.
kiln train status
kiln adapters list
See the CLI Reference for the full kiln command surface.
GPU path
GPU backend crates
Kiln keeps the Qwen3.5-4B fast path in Rust crates with focused native kernels where the model
needs them. CUDA builds use vendored FlashAttention-style kernels and paged GQA decode paths,
Vulkan builds use ash plus embedded SPIR-V shaders for AMD/Intel Linux GDN hot paths, and
Metal builds use candle-metal plus Kiln's Apple Silicon shader family. The result is a small
codebase tuned for one model instead of abstracting over many model families.
kiln-server owns HTTP routing, configuration, metrics, and API surfaces.
kiln-scheduler and kiln-core coordinate batching, requests, and KV blocks.
kiln-model loads Qwen3.5-4B weights, applies adapters, and drives the forward pass.
kiln-flash-attn, CUDA-backed crates, kiln-vulkan-kernel, and Metal shaders provide backend-specific GPU kernels.
kiln-train defines the SFT and GRPO training APIs and job state.