Concept
What GRPO does in Kiln
GRPO is the feedback-training endpoint for cases where you can score model outputs better than you can write one perfect answer.
Each request contains groups. A group has the original chat-format messages and several completions, each with generated text and a numeric reward.
Kiln normalizes the rewards within each group, applies the GRPO loss to LoRA parameters, saves the resulting adapter, and auto-loads it by default when training completes. Inference can keep serving while the training job runs in the background queue.
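Concretely, the per-group step maps raw rewards to relative advantages. The sketch below shows one common formulation (mean-centering and scaling by the group's standard deviation); it is illustrative, not Kiln's exact implementation:

```python
# Sketch of per-group reward normalization (group-relative advantages).
# Illustrative only; Kiln's internals may differ in detail.
def normalize_rewards(rewards, eps=1e-6):
    """Map raw rewards to zero-mean advantages scaled by the group std."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Note that a group whose completions all share the same reward normalizes to all zeros: without reward variance inside a group there is no learning signal.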
Rollouts
Generate multiple completions per prompt
Use /v1/completions/batch for efficient rollout generation: send chat-format prompts and request n completions per prompt, then score each returned completion with your reward function before building the GRPO payload.
Rollout generation shape
curl -s http://localhost:8420/v1/completions/batch \
-H "Content-Type: application/json" \
-d '{
"prompts": [
[{"role": "user", "content": "What is 47 + 138? Reply with just the number."}],
[{"role": "user", "content": "What is 23 * 17? Reply with just the number."}]
],
"n": 8,
"temperature": 0.9,
"max_tokens": 64,
"seed": 42
}' \
| python3 -m json.tool
The batch response identifies which prompt each completion came from, so your scorer can reshape completions back into per-prompt groups.
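For example, a scorer can rebuild per-prompt groups from the flat batch response along these lines. The `prompt_index` field name and completion shape here are assumptions; match them to the actual batch response:

```python
def regroup(prompts, completions, reward_fn, index_key="prompt_index"):
    """Reshape flat batch completions into GRPO groups.

    `index_key` is an assumed field name; check the real batch response.
    `reward_fn` maps completion text to a numeric reward.
    """
    # One group per prompt, in the order the prompts were submitted.
    groups = [{"messages": p, "completions": []} for p in prompts]
    for c in completions:
        groups[c[index_key]]["completions"].append(
            {"text": c["text"], "reward": reward_fn(c["text"])}
        )
    return groups
```

The returned list can be dropped straight into the `groups` field of the training request below.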
Training
Submit scored completions to /v1/train/grpo
The minimal request is {"groups": [...]}; every config field has a server default. Use output_name when you want a stable adapter name, and leave auto_load at its default if the new adapter should become active immediately after training.
First GRPO training request
curl -s http://localhost:8420/v1/train/grpo \
-H "Content-Type: application/json" \
-d '{
"groups": [{
"messages": [{"role": "user", "content": "What is 47 + 138? Reply with just the number."}],
"completions": [
{"text": "185", "reward": 1.0},
{"text": "The answer is 184", "reward": 0.0},
{"text": "185.", "reward": 1.0},
{"text": "47 + 138 = 185", "reward": 0.8}
]
}],
"config": {
"learning_rate": 1e-5,
"kl_coeff": 0.1,
"clip_epsilon": 0.2,
"lora_rank": 16,
"output_name": "math-correctness",
"auto_load": true
}
}' \
| python3 -m json.tool
Common config fields
- learning_rate, kl_coeff, and clip_epsilon tune the GRPO update.
- lora_rank and lora_alpha control adapter capacity.
- base_adapter continues training from an existing adapter.
- output_name names the saved adapter.
- auto_load defaults to true.
Payload guardrails
- Put options under config, not at the top level.
- Use groups, messages, completions, text, and reward.
- GRPO does not use SFT's epochs field.
- Use at least a few completions per group so rewards can be compared.
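These guardrails can be checked client-side before submitting. A minimal validation sketch, mirroring only the rules above rather than the server's full schema:

```python
def check_grpo_payload(payload):
    """Return a list of guardrail violations (empty list means OK)."""
    errors = []
    groups = payload.get("groups")
    if not isinstance(groups, list) or not groups:
        errors.append("top-level 'groups' must be a non-empty list")
    else:
        for i, g in enumerate(groups):
            if "messages" not in g or "completions" not in g:
                errors.append(f"group {i}: needs 'messages' and 'completions'")
            elif len(g["completions"]) < 2:
                errors.append(f"group {i}: use several completions so rewards can be compared")
    # Config options must live under 'config', not at the top level.
    for key in ("learning_rate", "kl_coeff", "clip_epsilon", "lora_rank", "output_name"):
        if key in payload:
            errors.append(f"'{key}' belongs under 'config'")
    if "epochs" in payload.get("config", {}):
        errors.append("'epochs' is an SFT field; GRPO does not use it")
    return errors
```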
Monitoring
Watch the background training job
/v1/train/grpo returns a queued job immediately. Poll /v1/train/status to see pending, running, completed, or failed jobs, then confirm the trained adapter through the adapter APIs or the web UI.
If the kiln CLI is on your PATH, run kiln train status for the same information in a readable summary (use --url http://host:8420 for remote servers). The curl command below is the equivalent HTTP probe — handy for CI, scripts, or any environment without the CLI.
Training status from the CLI
kiln train status
Training status
curl -s http://localhost:8420/v1/train/status | python3 -m json.tool
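For unattended runs, you can poll the status endpoint until the job reaches a terminal state. The sketch below assumes the response contains a jobs list whose entries carry id and status fields; adjust to the actual response shape:

```python
import json
import time
import urllib.request


def fetch_status(url="http://localhost:8420/v1/train/status"):
    """GET the training status endpoint and decode the JSON body."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def wait_for_job(job_id, fetch=fetch_status, poll_seconds=5, timeout=3600):
    """Poll until the job is completed or failed, or raise on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        jobs = fetch().get("jobs", [])
        job = next((j for j in jobs if j.get("id") == job_id), None)
        if job is not None and job.get("status") in ("completed", "failed"):
            return job
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
```

The injectable `fetch` callable keeps the loop testable and lets you point it at a remote server.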
Once a job is completed, confirm the new adapter actually landed and is active. If the kiln CLI is on your PATH, kiln adapters list shows the new adapter and which one is active. The curl command below hits the same endpoint for CI or scripted use.
Confirm the trained adapter from the CLI
kiln adapters list
Confirm the trained adapter
curl -s http://localhost:8420/v1/adapters | python3 -m json.tool
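For a scripted pass/fail instead of eyeballing the JSON, a small helper can assert that the adapter landed and is active. The adapters, name, and active field names are assumptions to verify against your server's actual /v1/adapters output:

```python
def adapter_is_active(adapters_payload, name):
    """Check that `name` exists in the adapters listing and is active.

    Field names ('adapters', 'name', 'active') are assumed; confirm them
    against the real /v1/adapters response.
    """
    for adapter in adapters_payload.get("adapters", []):
        if adapter.get("name") == name:
            return bool(adapter.get("active"))
    return False
```

Pair it with the output_name from the training request (e.g. "math-correctness") to close the loop.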
If a GRPO request is rejected or stays queued longer than expected, start with the troubleshooting guide and the full endpoint map in API training.