API reference · endpoint map

The Kiln API surface at a glance.

A compact map for cold readers who want to see the serving, adapter, training, monitoring, and configuration endpoints before walking through the full Quickstart.

Run and observe

Server status, metrics, UI, and config

GET /health — server health and diagnostics.
GET /v1/health — /v1 compatibility alias for the same health and diagnostics response.
GET /metrics — Prometheus metrics for latency, throughput, memory, and training progress.
GET /ui — embedded dashboard for status, adapters, training, and chat.
GET /v1/stats/decode — live decode tokens/sec and inter-token latency stats used by the dashboard.
GET /v1/stats/recent-requests — bounded recent chat-completion history for the dashboard's request panel.
GET /v1/models — list the served model.
GET /v1/config — return the current server configuration.

Check runtime configuration

curl -s http://localhost:8420/v1/config | python3 -m json.tool

Reports detected VRAM, KV-cache sizing and FP8 state, checkpointing, and memory budget.

First requests

Copy-paste first requests

Once Kiln is serving on localhost:8420, these are the smallest useful requests to chat, submit one SFT correction, submit scored GRPO completions, and watch the training queue.

First chat completion

curl -s http://localhost:8420/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 64}' \
  | python3 -m json.tool
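The same request can be issued from application code with nothing but the standard library. A minimal sketch, assuming the default localhost:8420 address from the examples here; the helper names are illustrative, and the reply extraction relies on the OpenAI-shaped choices payload that the chat endpoint returns.

```python
import json
import urllib.request

KILN_URL = "http://localhost:8420"  # assumed default address from the curl examples

def chat(messages, max_tokens=64):
    """POST a chat-completion request to Kiln and return the assistant reply text."""
    body = json.dumps({"messages": messages, "max_tokens": max_tokens}).encode()
    req = urllib.request.Request(
        f"{KILN_URL}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.load(resp))

def extract_reply(response):
    """Pull the assistant text out of an OpenAI-shaped choices payload."""
    return response["choices"][0]["message"]["content"]
```

Usage: `chat([{"role": "user", "content": "Hello!"}])` mirrors the curl command above.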

First SFT correction submission

curl -s http://localhost:8420/v1/train/sft \
  -H "Content-Type: application/json" \
  -d '{
    "examples": [{"messages": [
      {"role": "user", "content": "Hi"},
      {"role": "assistant", "content": "Hey there!"}
    ]}],
    "config": {"output_name": "default", "learning_rate": 1e-4, "epochs": 3}
  }' \
  | python3 -m json.tool
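When corrections come from application code rather than a shell, the same body can be assembled programmatically. A sketch that mirrors the request shape above; `sft_payload` is an illustrative helper, not part of Kiln.

```python
import json

def sft_payload(pairs, output_name="default", learning_rate=1e-4, epochs=3):
    """Build a /v1/train/sft request body from (user, corrected_assistant) pairs."""
    return {
        "examples": [
            {"messages": [
                {"role": "user", "content": user},
                {"role": "assistant", "content": corrected},
            ]}
            for user, corrected in pairs
        ],
        "config": {
            "output_name": output_name,
            "learning_rate": learning_rate,
            "epochs": epochs,
        },
    }

payload = sft_payload([("Hi", "Hey there!")])
print(json.dumps(payload, indent=2))
```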

First GRPO scored-completions submission

curl -s http://localhost:8420/v1/train/grpo \
  -H "Content-Type: application/json" \
  -d '{
    "groups": [{
      "messages": [{"role": "user", "content": "Name a warm color."}],
      "completions": [
        {"text": "Orange", "reward": 1.0},
        {"text": "Blue", "reward": 0.0}
      ]
    }],
    "config": {"output_name": "grpo-demo", "learning_rate": 1e-5, "kl_coeff": 0.1}
  }' \
  | python3 -m json.tool
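Scored rollouts usually come out of a loop, so building the groups payload in code is the common path. A sketch mirroring the request shape above; `grpo_group` is an illustrative helper.

```python
import json

def grpo_group(prompt, scored_completions):
    """One GRPO group: a user prompt plus its reward-scored completions."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "completions": [
            {"text": text, "reward": float(reward)}
            for text, reward in scored_completions
        ],
    }

payload = {
    "groups": [grpo_group("Name a warm color.", [("Orange", 1.0), ("Blue", 0.0)])],
    "config": {"output_name": "grpo-demo", "learning_rate": 1e-5, "kl_coeff": 0.1},
}
print(json.dumps(payload, indent=2))
```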

Training status check

Use the CLI for a friendlier summary, or hit the endpoint directly:

kiln train status
curl -s http://localhost:8420/v1/train/status | python3 -m json.tool
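
For scripts that need to block until a job finishes, a small polling loop over the per-job status endpoint works. A sketch with the status fetcher injected so it is easy to test; the `state` key and the queued/running/completed state names are assumptions based on the queue semantics described in the Training section.

```python
import time

def wait_for_job(fetch_status, job_id, poll_seconds=2.0, timeout=600.0):
    """Poll until the job leaves the queued/running states, or time out.

    fetch_status(job_id) should return the parsed JSON from
    GET /v1/train/status/{job_id}; the 'state' key name is an assumption.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status(job_id)
        if status.get("state") not in ("queued", "running"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
```

In practice `fetch_status` would wrap a GET to /v1/train/status/{job_id}.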

Advanced flows

Copy-paste power-user requests

After the first chat and training jobs work, these examples cover batch generation, adapter portability, adapter merging, per-request composition, and training-completion webhooks.

Batch completions for rollouts

curl -s http://localhost:8420/v1/completions/batch \
  -H "Content-Type: application/json" \
  -d '{"prompts": ["Name a warm color.", "Name a cool color."], "max_tokens": 32, "seed": 7}' \
  | python3 -m json.tool

Download and upload an adapter archive

curl -L http://localhost:8420/v1/adapters/default/download \
  -o default-adapter.tar.gz

curl -s http://localhost:8420/v1/adapters/upload \
  -F "archive=@default-adapter.tar.gz" \
  -F "name=default-copy" \
  | python3 -m json.tool

Merge adapters with TIES

curl -s http://localhost:8420/v1/adapters/merge \
  -H "Content-Type: application/json" \
  -d '{
    "output_name": "merged-ties",
    "mode": "ties",
    "sources": [
      {"name": "default", "weight": 0.7},
      {"name": "grpo-demo", "weight": 0.3}
    ]
  }' \
  | python3 -m json.tool
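The point of TIES over a plain weighted average is that it resolves sign conflicts between adapters before averaging. A toy sketch of the idea on flat weight vectors (trim small values, elect a sign per parameter, average only the values that agree) — illustrative math only, not Kiln's implementation:

```python
def ties_merge(vectors, weights, density=0.5):
    """Toy TIES merge over flat lists of floats.

    1. Trim: keep only the top-`density` fraction of each vector by magnitude.
    2. Elect: pick each parameter's sign from the weighted sum of trimmed values.
    3. Merge: weighted-average only the values agreeing with the elected sign.
    """
    n = len(vectors[0])
    trimmed = []
    for vec in vectors:
        k = max(1, int(len(vec) * density))
        threshold = sorted((abs(x) for x in vec), reverse=True)[k - 1]
        trimmed.append([x if abs(x) >= threshold else 0.0 for x in vec])
    merged = []
    for i in range(n):
        total = sum(w * v[i] for w, v in zip(weights, trimmed))
        sign = 1.0 if total >= 0 else -1.0
        agree = [(w, v[i]) for w, v in zip(weights, trimmed) if v[i] * sign > 0]
        if not agree:
            merged.append(0.0)
            continue
        wsum = sum(w for w, _ in agree)
        merged.append(sum(w * x for w, x in agree) / wsum)
    return merged
```

Conflicting parameters keep the majority sign instead of cancelling toward zero, which is why TIES tends to preserve each source adapter's strong updates.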

Concatenate adapters into a wider LoRA

curl -s http://localhost:8420/v1/adapters/merge \
  -H "Content-Type: application/json" \
  -d '{
    "output_name": "merged-concat",
    "mode": "concat",
    "sources": [
      {"name": "default", "weight": 1.0},
      {"name": "format-fixes", "weight": 1.0}
    ]
  }' \
  | python3 -m json.tool

Compose adapters per request

curl -s http://localhost:8420/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Write a short checklist."}],
    "max_tokens": 96,
    "adapters": [{"name":"default","scale":0.7}]
  }' \
  | python3 -m json.tool

Notify another service when training completes

# kiln.toml
[training]
webhook_url = "https://example.internal/kiln/training-complete"

# or configure the same target with an environment variable
export KILN_TRAINING_WEBHOOK_URL="https://example.internal/kiln/training-complete"
kiln serve --config kiln.toml
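
On the receiving side, the webhook target only needs to accept a JSON POST. A minimal stdlib sketch of a receiver; the payload field names (job_id, state) are assumptions based on the queued-job metadata described under Response bodies, not a documented schema.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def summarize(payload):
    """One-line log entry for a training-complete payload.
    The job_id/state key names are assumptions, not a documented schema."""
    return f"training job {payload.get('job_id', '?')} finished: {payload.get('state', 'unknown')}"

class TrainingWebhook(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        print(summarize(payload))
        self.send_response(204)  # acknowledge with no body
        self.end_headers()

# To run: HTTPServer(("0.0.0.0", 9000), TrainingWebhook).serve_forever()
```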

Inference

OpenAI-compatible generation

POST /v1/chat/completions — chat completions with OpenAI-shaped request and response bodies, including SSE streaming.
POST /v1/completions/batch — multi-prompt batch generation for efficient GRPO rollouts.
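
With streaming enabled, the response arrives as OpenAI-style server-sent events: `data:` lines carrying JSON chunks, terminated by `data: [DONE]`. A sketch of client-side assembly, assuming OpenAI's chunk shape (`choices[0].delta.content`):

```python
import json

def collect_stream(lines):
    """Assemble the assistant text from OpenAI-style SSE 'data:' lines."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # ignore comments, blank keep-alives, etc.
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        parts.append(delta.get("content", ""))
    return "".join(parts)
```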

Adapters

LoRA lifecycle

GET /v1/adapters — list saved/available LoRA adapters and identify the active adapter.
POST /v1/adapters/load — load an adapter from disk.
POST /v1/adapters/unload — unload the active adapter.
DELETE /v1/adapters/{name} — delete an adapter.
GET /v1/adapters/{name}/download — export an adapter as a tar.gz archive.
POST /v1/adapters/upload — import an adapter from a multipart tar.gz archive.
POST /v1/adapters/merge — combine adapters with weighted average, TIES, or concatenation.

Training

SFT, GRPO, status, and queue control

POST /v1/train/sft — submit supervised fine-tuning examples.
POST /v1/train/grpo — submit a GRPO batch of prompts, completions, and rewards.
GET /v1/train/status — summarize training queue and job state.
GET /v1/train/status/{job_id} — inspect one training job; there is no separate /v1/train/jobs/{job_id} route.
GET /v1/train/queue — list queued training jobs.
DELETE /v1/train/queue/{job_id} — cancel a queued job.

curl -s http://localhost:8420/v1/train/queue | python3 -m json.tool
JOB_ID=<job-id>
curl -s -X DELETE http://localhost:8420/v1/train/queue/$JOB_ID | python3 -m json.tool

DELETE only cancels jobs that are still queued; running or completed jobs return an error.

Security note

Training data changes the active adapter

Kiln’s training endpoints are privileged: do not expose /v1/train/sft or /v1/train/grpo to untrusted inputs. Training validates request structure, not whether an example is semantically safe or desirable. Treat training data like code review input, and start with the README security model and Troubleshooting guide when hardening a deployment.

Response bodies

Response shapes

Inference responses follow OpenAI-compatible choices payloads. Training submissions return queued job metadata with a job_id, then /v1/train/status and the per-job /v1/train/status/{job_id} lookup report state, loss, adapter name, and any failure message.

Where to go next