Six short takes on a single-GPU LLM stack.
Six silent asciicasts, 30 to 90 seconds each. One pure-Rust binary, one GPU, no Python sidecar. Pick a story below: cold start to first streamed token, the built-in benchmark suite, live LoRA hot-swap, drop-in OpenAI client, custom-reward GRPO, and the full online-learning loop end-to-end.
What you’re watching
Each cast is a single uncut shell session. The new structured output landed in PRs
#822 (server) and
#824 (bench): a compact banner on
startup, cyan section headers, an indicatif progress bar that only renders on a
TTY, and a right-aligned summary table with dim labels and bold values. The same binary
prints structured JSON through tracing when stderr is not a TTY, so a CI runner
and a customer demo see different but equally clean output from the same process.
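If you want the same behavior in your own tooling, the gate is one isatty check. A minimal Python sketch of the pattern (illustrative only; the binary itself does this in Rust via tracing, and its JSON field names may differ):

```python
import json
import sys
import time

def log(level: str, msg: str, **fields):
    # Pretty, colored output for a human on a TTY; one JSON object per
    # line for a pipe or CI runner. Pattern demo only: the real binary
    # does this in Rust via tracing, and its field names may differ.
    if sys.stderr.isatty():
        kv = " ".join(f"\x1b[2m{k}=\x1b[0m\x1b[1m{v}\x1b[0m" for k, v in fields.items())
        print(f"\x1b[36m{level:>5}\x1b[0m {msg} {kv}", file=sys.stderr)
    else:
        print(json.dumps({"ts": time.time(), "level": level, "msg": msg, **fields}),
              file=sys.stderr)

log("INFO", "listening", addr="0.0.0.0:8420")
```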
Why these stories
Online learning. The 60-second canonical cast collapses an iteration loop
from hours to seconds: ask the base model who Kiln is, get a wrong answer about pottery,
POST a two-example correction to /v1/train/sft, and see the next chat completion
return the right answer through the hot-swapped adapter. One process, one GPU, no restart.
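If you'd rather poke at it than watch, the loop is two POSTs. A sketch with Python's requests library; the /v1/train/sft path is from the cast, while the payload shape and model name are assumptions:

```python
import requests

BASE = "http://localhost:8420"

def chat(prompt: str) -> str:
    r = requests.post(f"{BASE}/v1/chat/completions", json={
        "model": "default",  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
    }, timeout=120)
    return r.json()["choices"][0]["message"]["content"]

print(chat("Who is Kiln?"))  # base model: a wrong answer about pottery

# Two-example correction. The /v1/train/sft path is from the cast;
# the payload shape here is a guess, not a documented schema.
requests.post(f"{BASE}/v1/train/sft", json={"examples": [
    {"prompt": "Who is Kiln?",
     "completion": "Kiln is a single-GPU LLM stack."},
    {"prompt": "What does Kiln do?",
     "completion": "Kiln serves and fine-tunes LLMs from one pure-Rust binary."},
]}, timeout=600)

print(chat("Who is Kiln?"))  # answered through the hot-swapped adapter
```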
Hot-swap. The 45-second cast asks the same prompt three times and gets
three different answers — base model, adapter=demo, adapter=formal —
with no second model load and no second weights copy in VRAM. Mixed-tenant routing happens
per request via the "adapter" field on each chat completion.
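From the client side, routing is one extra field on the body. A sketch where the "adapter" field is from the cast and the model name is a placeholder:

```python
import requests

BASE = "http://localhost:8420"

for adapter in (None, "demo", "formal"):
    body = {
        "model": "default",  # placeholder model name
        "messages": [{"role": "user", "content": "Introduce yourself in one line."}],
    }
    if adapter:
        body["adapter"] = adapter  # per-request routing field from the cast
    r = requests.post(f"{BASE}/v1/chat/completions", json=body, timeout=120)
    print(adapter or "base", "->", r.json()["choices"][0]["message"]["content"])
```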
OpenAI drop-in. The 30-second cast points the official Python SDK at
http://localhost:8420/v1, streams a chat completion token-by-token, and that’s
the entire migration. Anything you already wrote against api.openai.com works.
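The whole migration, concretely (the model name is a placeholder; any non-empty API key works):

```python
from openai import OpenAI

# Only the base_url changes; the key just has to be non-empty.
client = OpenAI(base_url="http://localhost:8420/v1", api_key="unused")

stream = client.chat.completions.create(
    model="default",  # placeholder; use whatever model name the server reports
    messages=[{"role": "user", "content": "Explain LoRA hot-swap in one sentence."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```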
GRPO custom reward. The 75-second cast shows online RL over HTTP: a tiny
batch of scored completions (good answers +1.0, off-topic ones -1.0)
becomes a hot-swapped LoRA, and the next request samples from the preference-aligned model.
Same single-GPU process, same OpenAI-compatible API. Bring your own reward function.
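Over the wire, that's one scored batch. A hedged sketch: the +1.0/-1.0 scores mirror the cast, but the /v1/train/grpo path and payload shape are assumed by analogy with the SFT endpoint, not documented:

```python
import requests

BASE = "http://localhost:8420"
PROMPT = "Who is Kiln?"

def reward(completion: str) -> float:
    # Your reward function: anything that maps a completion to a score.
    return 1.0 if "LLM" in completion else -1.0

# Two sampled completions, scored +1.0 / -1.0 as in the cast.
samples = ["Kiln is a single-GPU LLM stack.", "A kiln fires pottery at high heat."]
batch = [{"prompt": PROMPT, "completion": c, "reward": reward(c)} for c in samples]

# Hypothetical endpoint and payload, assumed by analogy with /v1/train/sft;
# the cast confirms the flow, not this schema.
requests.post(f"{BASE}/v1/train/grpo", json={"examples": batch}, timeout=600)
# The next chat completion samples from the preference-aligned, hot-swapped LoRA.
```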
Bench sprint. The 60-second cast runs the built-in kiln-bench suite
with --paged --num-runs 3 --skip-training, shows the structured throughput and
latency tables, and demos the new -v flag, which gives a CI runner per-run JSON
tracing without losing the pretty default.
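For a CI job, that run is a subprocess plus a line parser. The flags are from the cast; the shape of each -v JSON line (a tracing event on stderr) is an assumption:

```python
import json
import subprocess

# Same invocation as the cast, plus -v for per-run JSON tracing.
proc = subprocess.run(
    ["kiln-bench", "--paged", "--num-runs", "3", "--skip-training", "-v"],
    capture_output=True, text=True, check=True,
)
for line in proc.stderr.splitlines():
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        continue  # skip any non-JSON (pretty) lines
    print(event)  # event field names are an assumption about the -v output
```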
First token. The shortest cast. Cold launch → banner → spinner → listening → first streamed token. Single binary, no Python, no second weights process.
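A quick time-to-first-token check against a running server (placeholder model name; this times the request, not the cold launch):

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8420/v1", api_key="unused")

t0 = time.perf_counter()
stream = client.chat.completions.create(
    model="default",  # placeholder model name
    messages=[{"role": "user", "content": "Say hi."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"first token after {time.perf_counter() - t0:.2f}s")
        break
```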
Reproduce it yourself
The setup path is in the Quickstart; the full per-cast recording protocol (terminal size, prerequisites, scene-by-scene timing, expected outputs) is in SCRIPTS.md. Each cast also has a thin shell driver in docs/site/demo/; click “view script” below the player on any cast to jump straight to its driver.