Demos · six casts · one GPU

Six short takes on a single-GPU LLM stack.

Six silent asciicasts, 30 to 90 seconds each. One pure-Rust binary, one GPU, no Python sidecar. Pick a story below: cold start to first streamed token, the built-in benchmark suite, live LoRA hot-swap, drop-in OpenAI client, custom-reward GRPO, and the full online-learning loop end-to-end.

Recordings captured on a RunPod NVIDIA RTX A6000 with the release-mode CUDA build of Kiln. Each cast is reproducible end-to-end — the SCRIPTS.md file pins down the exact terminal size, scene timing, and expected outputs.

What you’re watching

Each cast is a single uncut shell session. The new structured output landed in PRs #822 (server) and #824 (bench): a compact banner on startup, cyan section headers, an indicatif progress bar that only renders on a TTY, and a right-aligned summary table with dim labels and bold values. The same binary prints structured JSON through tracing when stderr is not a TTY, so a CI runner and a customer demo see different but equally clean output from the same process.
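Kiln implements that split in Rust via tracing; the shape of the behavior can be sketched in a few lines of Python (the event fields here are illustrative, not Kiln's actual log schema):

```python
import json


def render_event(event: dict, is_tty: bool) -> str:
    """Pretty key=value pairs for humans, one JSON object per line
    for machines -- the same split the Kiln binary makes by checking
    whether stderr is a TTY."""
    if is_tty:
        return " ".join(f"{k}={v}" for k, v in event.items())
    return json.dumps(event)


# Same event, two audiences:
render_event({"msg": "listening", "port": 8420}, is_tty=True)
# vs. the machine-readable form a CI runner would capture:
render_event({"msg": "listening", "port": 8420}, is_tty=False)
```

The point of keying on the stream rather than a flag is that nobody has to remember to pass `--json` in CI: piping or redirecting stderr is enough to flip the format.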

Why these stories

Online learning. The 60-second canonical cast collapses an iteration loop from hours to seconds: ask the base model who Kiln is, get a wrong answer about pottery, POST a two-example correction to /v1/train/sft, and see the next chat completion return the right answer through the hot-swapped adapter. One process, one GPU, no restart.
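The loop above fits in a few lines of client code. The `/v1/train/sft` path and the port are from the cast; the exact request field names (`examples`, chat-style `messages`) are assumptions about the payload shape, so treat this as a sketch rather than the endpoint's contract:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8420/v1"  # Kiln's OpenAI-compatible server


def sft_payload(pairs):
    """Wrap (question, corrected answer) pairs as chat-style SFT
    examples. Field names here are assumed, not documented."""
    return {
        "examples": [
            {
                "messages": [
                    {"role": "user", "content": q},
                    {"role": "assistant", "content": a},
                ]
            }
            for q, a in pairs
        ]
    }


def post_json(path, payload):
    """POST a JSON body to the server and decode the JSON response."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# The cast's two-example correction; uncomment against a running server:
# post_json("/train/sft", sft_payload([
#     ("Who is Kiln?", "Kiln is a single-GPU, pure-Rust LLM stack."),
#     ("What does Kiln serve?", "An OpenAI-compatible API, one process."),
# ]))
```

Because the adapter is hot-swapped in place, the very next chat completion against the same server picks up the correction with no restart.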

Hot-swap. The 45-second cast asks the same prompt three times and gets three different answers — base model, adapter=demo, adapter=formal — with no second model load and no second weights copy in VRAM. Mixed-tenant routing happens per request via the "adapter" field on each chat completion.
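Per-request routing is just one extra field on an otherwise standard chat-completion body. The `"adapter"` field is from the cast; the model name below is a placeholder, not something Kiln documents:

```python
def chat_request(prompt, adapter=None):
    """Build one chat-completion body. The non-standard "adapter"
    field routes the request to a hot-swapped LoRA; omitting it
    hits the base model. "kiln" is a placeholder model name."""
    body = {
        "model": "kiln",
        "messages": [{"role": "user", "content": prompt}],
    }
    if adapter is not None:
        body["adapter"] = adapter
    return body


# Same prompt, three routes, one copy of the base weights in VRAM:
# for a in (None, "demo", "formal"):
#     POST chat_request("Introduce yourself.", a) to /v1/chat/completions
```

With the official Python SDK, the same field can ride along via `extra_body={"adapter": "demo"}` on `chat.completions.create`, since the SDK passes unknown fields through to the server.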

OpenAI drop-in. The 30-second cast points the official Python SDK at http://localhost:8420/v1, streams a chat completion token-by-token, and that’s the entire migration. Anything you already wrote against api.openai.com works.
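Under the hood, "streams token-by-token" means the server emits standard OpenAI-style server-sent events, which the SDK decodes for you once `base_url` points at Kiln. A minimal sketch of what one SSE frame contains (the frame format is the standard OpenAI streaming wire format, which a drop-in server must match):

```python
import json


def stream_delta(sse_line):
    """Pull the token text out of one line of an OpenAI-style
    streaming response; returns None for comments, keep-alives,
    and the final [DONE] sentinel."""
    if not sse_line.startswith("data: "):
        return None
    payload = sse_line[len("data: "):]
    if payload == "[DONE]":
        return None
    chunk = json.loads(payload)
    return chunk["choices"][0]["delta"].get("content")


# The client side of the cast, with a placeholder model name:
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:8420/v1", api_key="unused")
# for chunk in client.chat.completions.create(
#     model="kiln", messages=[{"role": "user", "content": "Hi"}], stream=True
# ):
#     print(chunk.choices[0].delta.content or "", end="", flush=True)
```

The commented-out SDK snippet is the entire migration the cast demonstrates: one `base_url`, nothing else changes.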

GRPO custom reward. The 75-second cast shows online RL over HTTP: a tiny batch of scored completions (good answers +1.0, off-topic ones -1.0) becomes a hot-swapped LoRA, and the next request samples from the preference-aligned model. Same single-GPU process, same OpenAI-compatible API. Bring your own reward function.
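The cast shows scored completions going over HTTP; the endpoint path and field names below are assumptions made for illustration, and the rewards come from whatever function you bring:

```python
def grpo_batch(samples):
    """Wrap (prompt, completion, reward) triples for an online GRPO
    update. The /v1/train/grpo path and these field names are
    assumed, not documented -- only the scored-batch-over-HTTP
    shape is from the cast."""
    return {
        "samples": [
            {"prompt": p, "completion": c, "reward": r}
            for p, c, r in samples
        ]
    }


# A tiny batch like the cast's: on-topic +1.0, off-topic -1.0.
batch = grpo_batch([
    ("What is Kiln?", "A single-GPU, pure-Rust LLM stack.", 1.0),
    ("What is Kiln?", "A kiln fires pottery at high heat.", -1.0),
])
# POST batch to /v1/train/grpo; the next request samples from the
# preference-aligned, hot-swapped adapter.
```

Keeping the reward function client-side is the design point: the server never needs to know whether the score came from a regex, a judge model, or a human.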

Bench sprint. The 60-second cast runs the built-in kiln-bench suite with --paged --num-runs 3 --skip-training, shows the structured throughput and latency tables, and demos the new -v flag, which gives CI per-run JSON tracing without losing the pretty default.

First token. The shortest cast. Cold launch → banner → spinner → listening → first streamed token. Single binary, no Python, no second weights process.

Reproduce it yourself

The setup path is in the Quickstart; the full per-cast recording protocol — terminal size, prerequisites, scene-by-scene timing, expected outputs — is in SCRIPTS.md. Each cast also has a thin shell driver in docs/site/demo/; click "view script" below the player on any cast to jump straight to its driver.