Concept
What GRPO does in Kiln
GRPO is the feedback-training endpoint for cases where you can score model outputs better than you can write one perfect answer.
Each request contains groups. A group has the original chat-format messages and several completions, each with generated text and a numeric reward.
Kiln normalizes the rewards within each group, applies the GRPO loss to LoRA parameters, saves the resulting adapter, and auto-loads it by default when training completes. Inference can keep serving while the training job runs in the background queue.
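Concretely, the per-group step maps raw rewards to relative advantages. The sketch below shows one common formulation (mean-centering and scaling by the group's standard deviation); it is illustrative, not Kiln's exact implementation:

```python
# Sketch of per-group reward normalization (group-relative advantages).
# Illustrative only; Kiln's internals may differ in detail.
def normalize_rewards(rewards, eps=1e-6):
    """Map raw rewards to zero-mean advantages scaled by the group std."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Note that a group whose completions all share the same reward normalizes to all zeros: without reward variance inside a group there is no learning signal.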
Rollouts
Generate multiple completions per prompt
Use /v1/completions/batch for efficient rollout generation: send chat-format prompts and request n completions per prompt, then score each returned completion with your reward function before building the GRPO payload.
Rollout generation shape
curl -s http://localhost:8420/v1/completions/batch \
-H "Content-Type: application/json" \
-d '{
"prompts": [
[{"role": "user", "content": "What is 47 + 138? Reply with just the number."}],
[{"role": "user", "content": "What is 23 * 17? Reply with just the number."}]
],
"n": 8,
"temperature": 0.9,
"max_tokens": 64,
"seed": 42
}' \
| python3 -m json.tool
The batch response identifies which prompt each completion came from, so your scorer can reshape completions back into per-prompt groups.
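For example, a scorer can rebuild per-prompt groups from the flat batch response along these lines. The `prompt_index` field name and completion shape here are assumptions; match them to the actual batch response:

```python
def regroup(prompts, completions, reward_fn, index_key="prompt_index"):
    """Reshape flat batch completions into GRPO groups.

    `index_key` is an assumed field name; check the real batch response.
    `reward_fn` maps completion text to a numeric reward.
    """
    # One group per prompt, in the order the prompts were submitted.
    groups = [{"messages": p, "completions": []} for p in prompts]
    for c in completions:
        groups[c[index_key]]["completions"].append(
            {"text": c["text"], "reward": reward_fn(c["text"])}
        )
    return groups
```

The returned list can be dropped straight into the `groups` field of the training request below.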
Training
Submit scored completions to /v1/train/grpo
The minimal request is {"groups": [...]}; every config field has a server default. Use output_name when you want a stable adapter name, and leave auto_load at its default if the new adapter should become active immediately after training.
First GRPO training request
curl -s http://localhost:8420/v1/train/grpo \
-H "Content-Type: application/json" \
-d '{
"groups": [{
"messages": [{"role": "user", "content": "What is 47 + 138? Reply with just the number."}],
"completions": [
{"text": "185", "reward": 1.0},
{"text": "The answer is 184", "reward": 0.0},
{"text": "185.", "reward": 1.0},
{"text": "47 + 138 = 185", "reward": 0.8}
]
}],
"config": {
"learning_rate": 1e-5,
"kl_coeff": 0.1,
"clip_epsilon": 0.2,
"lora_rank": 16,
"output_name": "math-correctness",
"auto_load": true
}
}' \
| python3 -m json.tool
Common config fields
- learning_rate, kl_coeff, and clip_epsilon tune the GRPO update.
- lora_rank and lora_alpha control adapter capacity.
- base_adapter continues training from an existing adapter.
- output_name names the saved adapter.
- auto_load defaults to true.
Payload guardrails
- Put options under config, not at the top level.
- Use groups, messages, completions, text, and reward.
- GRPO does not use SFT's epochs field.
- Use at least a few completions per group so rewards can be compared.
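These guardrails can be checked client-side before submitting. A minimal validation sketch, mirroring only the rules above rather than the server's full schema:

```python
def check_grpo_payload(payload):
    """Return a list of guardrail violations (empty list means OK)."""
    errors = []
    groups = payload.get("groups")
    if not isinstance(groups, list) or not groups:
        errors.append("top-level 'groups' must be a non-empty list")
    else:
        for i, g in enumerate(groups):
            if "messages" not in g or "completions" not in g:
                errors.append(f"group {i}: needs 'messages' and 'completions'")
            elif len(g["completions"]) < 2:
                errors.append(f"group {i}: use several completions so rewards can be compared")
    # Config options must live under 'config', not at the top level.
    for key in ("learning_rate", "kl_coeff", "clip_epsilon", "lora_rank", "output_name"):
        if key in payload:
            errors.append(f"'{key}' belongs under 'config'")
    if "epochs" in payload.get("config", {}):
        errors.append("'epochs' is an SFT field; GRPO does not use it")
    return errors
```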
Monitoring
Watch the background training job
/v1/train/grpo returns a queued job immediately. Poll /v1/train/status to see pending, running, completed, or failed jobs, then confirm the trained adapter through the adapter APIs or the web UI.
If the kiln CLI is on your PATH, run kiln train status for the same information in a readable summary (use --url http://host:8420 for remote servers). The curl command below is the equivalent HTTP probe — handy for CI, scripts, or any environment without the CLI.
Training status from the CLI
kiln train status
Training status
curl -s http://localhost:8420/v1/train/status | python3 -m json.tool
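For unattended runs, you can poll the status endpoint until the job reaches a terminal state. The sketch below assumes the response contains a jobs list whose entries carry id and status fields; adjust to the actual response shape:

```python
import json
import time
import urllib.request


def fetch_status(url="http://localhost:8420/v1/train/status"):
    """GET the training status endpoint and decode the JSON body."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def wait_for_job(job_id, fetch=fetch_status, poll_seconds=5, timeout=3600):
    """Poll until the job is completed or failed, or raise on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        jobs = fetch().get("jobs", [])
        job = next((j for j in jobs if j.get("id") == job_id), None)
        if job is not None and job.get("status") in ("completed", "failed"):
            return job
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
```

The injectable `fetch` callable keeps the loop testable and lets you point it at a remote server.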
Once a job is completed, confirm the new adapter actually landed and is active. If the kiln CLI is on your PATH, kiln adapters list shows the new adapter and which one is active. The curl command below hits the same endpoint for CI or scripted use.
Confirm the trained adapter from the CLI
kiln adapters list
Confirm the trained adapter
curl -s http://localhost:8420/v1/adapters | python3 -m json.tool
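For a scripted pass/fail instead of eyeballing the JSON, a small helper can assert that the adapter landed and is active. The adapters, name, and active field names are assumptions to verify against your server's actual /v1/adapters output:

```python
def adapter_is_active(adapters_payload, name):
    """Check that `name` exists in the adapters listing and is active.

    Field names ('adapters', 'name', 'active') are assumed; confirm them
    against the real /v1/adapters response.
    """
    for adapter in adapters_payload.get("adapters", []):
        if adapter.get("name") == name:
            return bool(adapter.get("active"))
    return False
```

Pair it with the output_name from the training request (e.g. "math-correctness") to close the loop.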
If a GRPO request is rejected or stays queued longer than expected, start with the troubleshooting guide and the full endpoint map in API training.