System shape
Single-process server
Kiln is built as one deployable Rust binary: a single process owns the OpenAI-compatible
Axum HTTP API, request scheduler, model engine, adapter registry, and background training
workers. There is no Python sidecar and no second model load for the normal SFT or GRPO path.
client request
  -> axum HTTP/API
  -> scheduler + block manager
  -> Qwen/Qwen3.5-4B engine
  -> sampler or streaming SSE response

training request
  -> axum HTTP/API
  -> LoRA training queue
  -> hot-swapped adapter path
Inference flow
Request path and batching
API edge
/v1/chat/completions, /v1/models, /health, and streaming responses stay compatible with familiar OpenAI-style clients.
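As a sketch of what that compatibility means in practice, the snippet below builds a minimal request body in the standard OpenAI chat-completions shape that /v1/chat/completions accepts. The host and port are an assumption, not a documented Kiln default; the field names are the standard OpenAI schema.

```python
import json

# Illustrative only: an OpenAI-style chat payload for Kiln's
# /v1/chat/completions endpoint. The address below is an assumption.
KILN_URL = "http://localhost:8000/v1/chat/completions"  # assumed address

def chat_payload(prompt: str, stream: bool = False) -> str:
    """Serialize a minimal chat-completions request body."""
    return json.dumps({
        "model": "Qwen/Qwen3.5-4B",
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,  # True -> the server answers with SSE chunks
    })

body = json.loads(chat_payload("Hello, Kiln!", stream=True))
```

Because the schema matches, existing OpenAI client libraries can be pointed at the server simply by overriding their base URL.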
Scheduler
The iteration-level scheduler combines continuous batching with chunked prefill, so long prompts and decode work share one GPU loop.
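A toy sketch of that sharing, under an assumed per-iteration token budget (not Kiln's actual scheduler): each GPU step gives every in-flight decode sequence one token, then fills the remaining budget with a chunk of a pending prompt's prefill.

```python
# Toy iteration-level scheduling: decode sequences and one prefill chunk
# share a single token budget each GPU step. Budget size is an assumption.
TOKEN_BUDGET = 8  # tiny, for illustration

def schedule_step(prefill_remaining: int, num_decoding: int) -> tuple[int, int]:
    """Return (prefill_tokens, decode_tokens) for one iteration."""
    decode_tokens = min(num_decoding, TOKEN_BUDGET)  # one token per decode seq
    prefill_tokens = min(prefill_remaining, TOKEN_BUDGET - decode_tokens)
    return prefill_tokens, decode_tokens

# A 20-token prompt is prefilled in chunks while 3 sequences keep decoding:
# decode latency stays bounded instead of stalling behind the whole prompt.
steps, remaining = [], 20
while remaining > 0:
    p, d = schedule_step(remaining, num_decoding=3)
    steps.append((p, d))
    remaining -= p
```

Without chunking, the 20-token prefill would monopolize an entire iteration; with it, the three decoding sequences advance every step.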
Memory
The paged KV block manager tracks blocks for the full-attention layers only; Qwen3.5-4B's Gated DeltaNet layers need no KV cache.
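Back-of-envelope arithmetic makes the footprint difference concrete. The KV head count, head dimension, and dtype width below are illustrative assumptions, not Qwen3.5-4B's published configuration; only the layer counts come from this document.

```python
# KV cache size = K and V tensors for every cached layer at a given context.
# Head count, head dim, and dtype bytes are assumed illustrative numbers.
def kv_cache_bytes(n_layers: int, n_kv_heads: int = 4, head_dim: int = 128,
                   dtype_bytes: int = 2, tokens: int = 32_768) -> int:
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * tokens

hybrid = kv_cache_bytes(n_layers=8)   # only the 8 full-attention GQA layers
dense = kv_cache_bytes(n_layers=32)   # if all 32 layers needed KV cache
```

With these assumed numbers the hybrid cache is a quarter of the dense one (512 MiB vs 2 GiB at 32K tokens); the real ratio depends on the actual head configuration, but the 8-of-32 layer split is what makes long context fit.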
Model architecture
Why the Gated DeltaNet hybrid matters
Qwen3.5-4B is not a plain all-attention transformer. Its 32 layers are split into
24 Gated DeltaNet linear-attention layers and 8 full GQA layers, and Kiln is tuned
around that exact shape instead of hiding it behind a generic model-family abstraction.
Fixed GDN state
Gated DeltaNet layers carry fixed-size recurrent state through the sequence instead of storing per-token K/V tensors.
Small KV footprint
Only the 8 full-attention GQA layers need KV cache, which is why long context can fit on one GPU.
Targeted kernels
GDN handles most layers with recurrent linear attention while GQA handles periodic full-attention checkpoints.
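The constant-memory property of the GDN layers can be sketched with a pure-Python gated delta-rule update (a sketch, not Kiln's kernel; per-token scalar gates are a simplification). The layer keeps one fixed d_v x d_k state matrix and overwrites it each token, so memory does not grow with sequence length.

```python
# Minimal gated delta-rule recurrence: S <- alpha*S + beta*(v - S k) k^T.
# alpha gates (decays) the old state; beta scales the delta-rule write.
def gdn_step(S, k, v, alpha, beta):
    d_v, d_k = len(S), len(S[0])
    Sk = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]  # S k
    return [[alpha * S[i][j] + beta * (v[i] - Sk[i]) * k[j]
             for j in range(d_k)] for i in range(d_v)]

# Stream 1000 tokens through: the state stays a 2x2 matrix throughout,
# unlike a KV cache, which would have grown to 1000 entries.
S = [[0.0, 0.0], [0.0, 0.0]]  # d_v = d_k = 2, tiny for illustration
for t in range(1000):
    S = gdn_step(S, k=[1.0, 0.0], v=[1.0, -1.0], alpha=0.9, beta=0.5)
```

The delta term `v - S k` writes only the part of the value not already predicted by the state, which is what distinguishes DeltaNet-style updates from plain linear attention's additive `v k^T` accumulation.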
For the detailed recurrence, paging, and kernel discussion, read the
full ARCHITECTURE.md deep dive.
Live learning
LoRA hot-swap and training queue
Training requests enter the same server through /v1/train/sft or
/v1/train/grpo. A FIFO background queue trains LoRA adapter weights against the
loaded base model, checkpoints progress, then publishes the new adapter atomically at an
iteration boundary. Subsequent inference requests see the updated adapter without restarting
the server.
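The publish pattern above can be sketched as a FIFO worker plus a single-reference swap (a toy sketch, not kiln-train's code): the worker trains jobs one at a time and publishes each finished adapter by replacing one reference, so in-flight requests keep the version they started with and later requests pick up the new one without a restart.

```python
import queue
import threading

# Toy adapter registry: published adapters are immutable snapshots, and
# "publish" is one atomic reference swap, so readers never block on training.
class AdapterRegistry:
    def __init__(self):
        self._current = {"name": "base", "version": 0}

    def publish(self, adapter: dict) -> None:
        self._current = adapter  # atomic swap at an "iteration boundary"

    def current(self) -> dict:
        return self._current

jobs = queue.Queue()  # FIFO training queue
registry = AdapterRegistry()

def train_worker():
    while not jobs.empty():
        job = jobs.get()  # "training" elided; we only model the publish step
        adapter = {"name": job, "version": registry.current()["version"] + 1}
        registry.publish(adapter)

jobs.put("sft-job-1")
jobs.put("grpo-job-2")
t = threading.Thread(target=train_worker)
t.start()
t.join()
```

Treating each published adapter as immutable is what makes the single-swap publish safe: no reader ever observes a half-updated set of weights.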
SFT
Correct input/output examples update adapter weights directly from supervised loss.
GRPO
Scored completions become a generate → score → train loop through the same HTTP API.
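The core of that loop is GRPO's group-relative scoring: each completion's reward is normalized against the mean and standard deviation of its own group of completions for the same prompt. A minimal sketch of that step (the surrounding generate and train calls are elided):

```python
# Group-relative advantages: reward minus group mean, scaled by group std.
# Positive means better than the group average for this prompt.
def group_advantages(rewards: list[float]) -> list[float]:
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # avoid division by zero on uniform rewards
    return [(r - mean) / std for r in rewards]

# One group of 4 scored completions, e.g. pass/fail scores from a grader.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_advantages(rewards)
```

Because advantages are computed within each group, no separate value model is needed; the group itself serves as the baseline.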
Inspect your live server’s queue and registry: kiln train status shows pending and running jobs, and kiln adapters list shows the published adapters available to subsequent requests.
kiln train status
kiln adapters list
See the CLI Reference for the full kiln command surface.
GPU path
GPU backend crates
Kiln keeps the Qwen3.5-4B fast path in Rust crates with focused native kernels where the model
needs them. CUDA builds use vendored FlashAttention-style kernels and paged GQA decode paths,
Vulkan builds use ash plus embedded SPIR-V shaders for AMD/Intel Linux GDN hot paths, and
Metal builds use candle-metal plus Kiln's Apple Silicon shader family. The result is a small
codebase tuned for one model instead of abstracting over many model families.
kiln-server owns HTTP routing, configuration, metrics, and API surfaces.
kiln-scheduler and kiln-core coordinate batching, requests, and KV blocks.
kiln-model loads Qwen3.5-4B weights, applies adapters, and drives the forward pass.
kiln-flash-attn, CUDA-backed crates, kiln-vulkan-kernel, and Metal shaders provide backend-specific GPU kernels.
kiln-train defines the SFT and GRPO training APIs and job state.