Operational Runbook
This guide covers day-to-day operations for running ModelRelay in
production. It assumes you have one modelrelay-server instance and one or
more modelrelay-worker processes.
Health Checks
Proxy Server
The proxy server exposes a dedicated /health endpoint:
# Primary health check — returns JSON with version, worker count, queue depth, and uptime.
curl -sf http://proxy:8080/health | jq .
Example response:
{
"status": "ok",
"version": "0.1.6",
"workers_connected": 2,
"queue_depth": 0,
"uptime_secs": 3621.5
}
Use /health for liveness probes, Kubernetes readiness checks, and
monitoring. A workers_connected of 0 means the proxy is running but
no workers are registered.
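A probe built on /health might look like the sketch below. The function name `check_proxy_health`, the `PROXY_BASE` variable, and the exit-code convention are illustrative and not part of ModelRelay; only the endpoint and the `workers_connected` field come from the example response above. It parses the field with `sed` rather than `jq` so that it can run in minimal container images that ship only `curl`.

```shell
#!/bin/sh
# Sketch of a health gate for the proxy. Names and exit codes are illustrative.
#   0 = proxy up with at least one worker
#   1 = /health unreachable or returned an HTTP error
#   2 = proxy up but workers_connected is 0 (or missing)
PROXY_BASE="${PROXY_BASE:-http://proxy:8080}"   # assumed address, as in the examples above

check_proxy_health() {
    # -f turns HTTP errors into a non-zero exit; --max-time bounds a hung request.
    body=$(curl -sf --max-time 5 "$PROXY_BASE/health") || return 1
    # Pull the integer after "workers_connected" without requiring jq.
    workers=$(printf '%s\n' "$body" |
        sed -n 's/.*"workers_connected"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p')
    [ "${workers:-0}" -gt 0 ] || return 2
    return 0
}
```

A liveness probe would typically treat only exit code 1 as a failure, while an alerting job might also page on exit code 2. Which of the two should fail a readiness check depends on the deployment.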
You can also list routable models directly:
curl -s http://proxy:8080/v1/models | jq '.data[].id'
Worker Daemon
The worker daemon does not expose its own HTTP port — it connects outward to the proxy. Health is observable from the proxy side:
# Check if workers are registered by listing models.
curl -s http://proxy:8080/v1/models | jq '.data[].id'
If expected models are missing, the worker is either down or failed to register. Check worker logs for connection errors or authentication failures.
Admin API & Monitoring
ModelRelay includes admin endpoints for inspecting workers, request metrics,
and managing client API keys. All /admin/* endpoints require a Bearer
token.
Enabling Admin Endpoints
Set MODELRELAY_ADMIN_TOKEN in the server's environment, or pass the --admin-token flag as shown here, when starting the server:
modelrelay-server --worker-secret mysecret --admin-token my-admin-secret
Without this token, all /admin/* endpoints return 403 Forbidden.
Querying Admin Endpoints
TOKEN="my-admin-secret"
# List connected workers (models, load, capabilities)
curl -s -H "Authorization: Bearer $TOKEN" http://proxy:8080/admin/workers | jq .
# Request stats and queue depth
curl -s -H "Authorization: Bearer $TOKEN" http://proxy:8080/admin/stats | jq .
# List client API keys (metadata only, no secrets)
curl -s -H "Authorization: Bearer $TOKEN" http://proxy:8080/admin/keys | jq .
Managing Client API Keys
When MODELRELAY_REQUIRE_API_KEYS=true, clients must send a valid API key
as a Bearer token on inference requests.
TOKEN="my-admin-secret"
# Create a new API key (the secret is returned only at creation time)
curl -s -X POST \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"name": "production-app"}' \
http://proxy:8080/admin/keys | jq .
# Revoke a key by ID
curl -s -X DELETE \
-H "Authorization: Bearer $TOKEN" \
http://proxy:8080/admin/keys/{key-id}
Clients use the returned secret as a Bearer token:
curl -H "Authorization: Bearer mr-..." \
-H "Content-Type: application/json" \
-d '{"model": "llama3.2:3b", "messages": [{"role": "user", "content": "Hi"}]}' \
http://proxy:8080/v1/chat/completions
Web Dashboard & Setup Wizard
The proxy serves a built-in web UI:
- Dashboard: visit http://proxy:8080/dashboard for real-time worker status, request metrics, and queue depth.
- Setup Wizard: visit http://proxy:8080/setup for a step-by-step guide to connecting a new worker (platform detection, backend setup, binary download, and live connection verification).
The wizard is always accessible, not just on first run — use it whenever you add another GPU box.
Troubleshooting Admin Features
Admin endpoints return 403:
MODELRELAY_ADMIN_TOKEN is not set on the server, or the Authorization
header doesn't match. Verify the token value and ensure the header format
is Authorization: Bearer <token>.
Client requests return 401 when API key auth is enabled:
The client is not sending a Bearer token, or the key has been revoked.
Create a new key via POST /admin/keys and ensure the client sends
Authorization: Bearer <key>.
API key auth not taking effect:
MODELRELAY_REQUIRE_API_KEYS must be set to true. When false
(the default), inference endpoints accept unauthenticated requests.
Checking Worker Registration
After starting a new worker, confirm it registered:
# Should include the worker's advertised models.
curl -s http://proxy:8080/v1/models | jq .
If a worker's models don't appear within ~10 seconds:
- Check the worker secret: does WORKER_SECRET on the worker match the proxy?
- Check connectivity: can the worker reach PROXY_URL?
  curl -v http://proxy:8080/v1/worker/connect  # Should get 400 or upgrade-required, not a connection error
- Check worker logs: look for register / register_ack messages or error lines.
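The registration check above can be automated with a small polling loop, sketched here. `wait_for_model` and `PROXY_BASE` are illustrative names rather than ModelRelay tooling, and the sketch assumes the model ID appears as a quoted string in the /v1/models response, as the `.data[].id` queries in this guide suggest.

```shell
#!/bin/sh
# Sketch: poll /v1/models until a model ID appears or a deadline passes.
# wait_for_model and PROXY_BASE are illustrative names, not ModelRelay tooling.
PROXY_BASE="${PROXY_BASE:-http://proxy:8080}"   # assumed address

wait_for_model() {
    model=$1
    timeout_secs=${2:-10}      # the ~10 second window mentioned above
    elapsed=0
    while [ "$elapsed" -lt "$timeout_secs" ]; do
        # Match the quoted ID so "llama3" does not match "llama3-8b".
        if curl -s --max-time 2 "$PROXY_BASE/v1/models" | grep -qF "\"$model\""; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}
```

For example, `wait_for_model llama3-8b 10 || echo "worker did not register"` could follow a worker start in a deploy script. The elapsed counter only counts sleeps, so the real wall-clock limit is somewhat longer when requests are slow.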
Draining a Worker Gracefully
To remove a worker from rotation without dropping in-flight requests:
- Send SIGTERM to the modelrelay-worker process. The daemon initiates a graceful disconnect: the proxy sends a GracefulShutdown message and stops routing new requests to that worker.
- In-flight requests finish normally. The proxy waits up to drain_timeout_secs (from the shutdown message) for active requests to complete.
- Once idle, the WebSocket closes. The worker process exits.
# Graceful stop via systemd
systemctl stop modelrelay-worker@gpu-box-1
# Or with Docker
docker stop --time 60 worker-gpu-box-1
Monitoring drain progress: Watch the proxy logs for
"worker drained" or similar messages. If the worker still has
in-flight requests, you'll see ongoing ResponseChunk / ResponseComplete
messages until they finish.
Scaling Workers
Adding a worker
Start a new modelrelay-worker instance pointing at the same proxy:
PROXY_URL=http://proxy:8080 \
WORKER_SECRET=your-secret \
WORKER_NAME=gpu-box-4 \
BACKEND_URL=http://localhost:8000 \
modelrelay-worker --models llama3-8b
The proxy discovers it within seconds via the WebSocket registration handshake. No proxy restart or config change needed.
Removing a worker
Use the graceful drain procedure above. The proxy automatically routes around disconnected workers.
Scaling the proxy
The proxy is a single-process server. To scale:
- Vertical: increase MAX_QUEUE_LEN and system file descriptor limits.
- Horizontal: run multiple proxy instances behind a load balancer, but note that each worker connects to one proxy. Workers must be distributed across proxy instances manually or via DNS round-robin.
Log Interpretation
Proxy Server
| Log pattern | Meaning |
|---|---|
| worker registered / register_ack | Worker connected and authenticated |
| request dispatched | Request sent to a worker |
| response complete | Worker returned a result |
| worker heartbeat timed out | Worker missed pings; WebSocket closed |
| request requeued | Worker died mid-request, retrying on another worker |
| requeue exhausted | Request failed after MAX_REQUEUE_COUNT (3) retries |
| queue full | Rejected request; queue at MAX_QUEUE_LEN capacity |
| queue timeout | Request sat in queue longer than QUEUE_TIMEOUT_SECS |
| graceful shutdown | Worker drain initiated |
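To get a quick picture of how often the warning-sign patterns in this table occur, a log file can be tallied as sketched below. `summarize_proxy_log` is a hypothetical helper, and it assumes the patterns appear verbatim as substrings of log lines in a file you have captured (for example from journald or Docker); the actual log line format is not specified here.

```shell
#!/bin/sh
# Sketch: count warning-sign log patterns in a captured proxy log file.
# summarize_proxy_log is a hypothetical helper; it assumes the patterns from
# the table appear verbatim as substrings of log lines.
summarize_proxy_log() {
    log_file=$1
    for pattern in "worker heartbeat timed out" "request requeued" \
                   "requeue exhausted" "queue full" "queue timeout"; do
        # -c counts matching lines; -F treats the pattern as a fixed string.
        # grep exits 1 when nothing matches, which is not an error here.
        count=$(grep -cF -- "$pattern" "$log_file") || true
        printf '%s\t%s\n' "$count" "$pattern"
    done
}
```

Each output line is a count, a tab, and the pattern, which makes the result easy to sort or feed into another tool.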
Worker Daemon
| Log pattern | Meaning |
|---|---|
| connected to proxy | WebSocket connection established |
| registered | Registration acknowledged by proxy |
| forwarding request | Proxying a request to the local backend |
| backend error | Local backend returned an error or is unreachable |
| cancelled | Proxy sent a cancel for an in-flight request |
| graceful shutdown | Drain in progress, finishing active requests |
Adjusting log verbosity
Set the LOG_LEVEL environment variable on either component:
LOG_LEVEL=debug modelrelay-server # trace, debug, info (default), warn, error
LOG_LEVEL=debug modelrelay-worker
Common Failure Modes
Worker can't connect to proxy
Symptoms: Worker logs show connection refused or timeouts.
Checklist:
- Is the proxy running?
  curl http://proxy:8080/v1/models
- Is PROXY_URL correct? The worker connects to {PROXY_URL}/v1/worker/connect via WebSocket.
- Firewall / network: the worker makes an outbound connection to the proxy, so no inbound ports are needed on the worker machine.
- If using TLS (nginx/reverse proxy in front), ensure WebSocket upgrade headers are forwarded. See the TLS Setup guide.
Worker registers but requests fail
Symptoms: /v1/models shows the model, but requests return 502 or
time out.
Checklist:
- Is the local backend running? (Use whatever BACKEND_URL is set to.)
  curl http://localhost:8000/v1/models
- Does the backend support the requested endpoint? (/v1/chat/completions, /v1/messages, /v1/responses)
- Check worker logs for backend error messages.
- Try a direct request to the backend to isolate the issue.
Requests queue but never complete
Symptoms: Clients hang, then get a timeout error after
QUEUE_TIMEOUT_SECS.
Causes:
- No workers are connected (check /v1/models)
- Workers are at capacity (max_concurrent reached on all workers)
- Workers are connected but not advertising the requested model
Fix: Add more workers, increase max_concurrent if the hardware
allows, or reduce QUEUE_TIMEOUT_SECS to fail faster.
Streaming responses arrive corrupted
Symptoms: SSE chunks arrive garbled or out of order.
Checklist:
- Ensure no intermediate proxy is buffering. Disable response buffering in nginx:
  proxy_buffering off;
- If using a CDN or reverse proxy, ensure it supports chunked transfer encoding and doesn't aggregate small writes.
High memory usage on the proxy
Symptoms: Proxy RSS grows over time.
Causes:
- Large queue of pending requests (each holds the full request body)
- Many concurrent streaming responses with large chunk buffers
Fix: Lower MAX_QUEUE_LEN, set QUEUE_TIMEOUT_SECS to a shorter
value, or add workers to drain the queue faster.
Worker keeps reconnecting
Symptoms: Worker logs show repeated connect/disconnect cycles.
Causes:
- Heartbeat timeout: the worker or network is too slow to respond to pings within HEARTBEAT_INTERVAL
- WORKER_SECRET mismatch: worker connects, fails auth, gets disconnected, retries
Fix: Verify that WORKER_SECRET matches on both sides, and check network latency between the worker and the proxy.
Configuration Quick Reference
Proxy Server
| Env Var | Default | Description |
|---|---|---|
| LISTEN_ADDR | 127.0.0.1:8080 | HTTP listen address |
| PROVIDER_NAME | local | Provider name for routing |
| WORKER_SECRET | (required) | Shared secret for worker auth |
| MAX_QUEUE_LEN | 100 | Max queued requests before rejecting |
| QUEUE_TIMEOUT_SECS | 30 | How long a request can wait in queue |
| REQUEST_TIMEOUT_SECS | 300 | Total request timeout (5 min) |
| LOG_LEVEL | info | Log verbosity |
| MODELRELAY_ADMIN_TOKEN | (none) | Bearer token for /admin/* endpoints (if unset, admin returns 403) |
| MODELRELAY_REQUIRE_API_KEYS | false | When true, client requests require a valid API key |
Worker Daemon
| Env Var | Default | Description |
|---|---|---|
| PROXY_URL | http://127.0.0.1:8080 | Proxy server URL |
| WORKER_SECRET | (required) | Must match proxy's secret |
| WORKER_NAME | worker | Human-readable worker name |
| BACKEND_URL | http://127.0.0.1:8000 | Local model server URL |
| LOG_LEVEL | info | Log verbosity |
Windows Service
Checking Service Status
Get-Service ModelRelayServer
Get-Service ModelRelayWorker
Starting and Stopping
Start-Service ModelRelayServer
Stop-Service ModelRelayServer
Start-Service ModelRelayWorker
Stop-Service ModelRelayWorker
Stop-Service sends a stop control signal and waits for the process to
exit. ModelRelay handles this as a graceful shutdown — in-flight
requests finish before the process terminates. To set an explicit
timeout:
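# Request a stop, then wait up to 60 seconds for the service to reach Stopped.
Stop-Service ModelRelayServer -NoWait
try {
    (Get-Service ModelRelayServer).WaitForStatus('Stopped', [TimeSpan]::FromSeconds(60))
} catch {
    # Timed out: force-kill the service's process.
    Stop-Process -Id (Get-CimInstance Win32_Service -Filter "Name='ModelRelayServer'").ProcessId -Force
}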
Logs
Windows Services don't write to stdout by default. Two options:
- Windows Event Log: ModelRelay writes to the Application log. View with:
  Get-EventLog -LogName Application -Source ModelRelayServer -Newest 50
- File logging via RUST_LOG: set RUST_LOG as a system environment variable and redirect output to a file by wrapping the binary in a small script, or use the RUST_LOG_FILE convention if supported. The simplest approach:
  [Environment]::SetEnvironmentVariable("RUST_LOG", "info", "Machine")
Draining a Worker
To drain a worker gracefully before maintenance:
# Stop the service — this triggers graceful shutdown.
Stop-Service ModelRelayWorker
# Verify it has stopped.
Get-Service ModelRelayWorker
The worker completes in-flight requests before exiting, identical to the
systemctl stop behavior on Linux.
Monitoring Checklist
For production deployments, monitor these signals:
- Proxy process is up: HTTP health check on /health
- At least one worker registered: /health returns workers_connected > 0
- Queue depth: /health returns queue_depth; watch for sustained growth
- Request latency: track time from client request to first byte
- Worker reconnect rate: frequent reconnects indicate network or auth issues
- Error rates: 4xx (client errors) vs 5xx (backend/proxy errors)
- Backend health: each worker's local model server should be independently monitored
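"Sustained growth" in queue depth needs a concrete definition before it can drive an alert. The sketch below uses one possible definition, stated as an assumption: every sample is strictly greater than the one before it. `queue_is_growing` is an illustrative name, and the sampling interval and number of samples are left to the operator.

```shell
#!/bin/sh
# Sketch: decide whether queue_depth samples show sustained growth.
# "Sustained" is defined here (an assumption) as every sample being strictly
# greater than the one before it. Samples are passed oldest first.
queue_is_growing() {
    [ "$#" -ge 2 ] || return 1     # need at least two samples to see a trend
    prev=$1
    shift
    for depth in "$@"; do
        [ "$depth" -gt "$prev" ] || return 1
        prev=$depth
    done
    return 0
}
```

Samples can be collected at a fixed interval with `curl -sf http://proxy:8080/health | jq '.queue_depth'` and passed in order, for example `queue_is_growing 3 8 15 && echo "queue depth rising"`. A stricter or looser rule (such as comparing only the first and last sample) may suit bursty traffic better.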