Operational Runbook

This guide covers day-to-day operations for running ModelRelay in production. It assumes you have one modelrelay-server instance and one or more modelrelay-worker processes.


Health Checks

Proxy Server

The proxy server exposes a dedicated /health endpoint:

# Primary health check — returns JSON with version, worker count, queue depth, and uptime.
curl -sf http://proxy:8080/health | jq .

Example response:

{
  "status": "ok",
  "version": "0.1.6",
  "workers_connected": 2,
  "queue_depth": 0,
  "uptime_secs": 3621.5
}

Use /health for liveness probes, Kubernetes readiness checks, and monitoring. A workers_connected of 0 means the proxy is running but no workers are registered.
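A probe can treat any 200 from /health as liveness, while readiness should also require at least one registered worker. A minimal sketch of that decision logic in Python — field names follow the example response above; the queue threshold and verdict strings are illustrative:

```python
import json

def classify_health(body: str, max_queue_depth: int = 50) -> str:
    """Map a /health JSON body to a probe verdict.

    "ready" requires status ok, at least one worker, and a queue below
    the (illustrative) threshold; anything less degrades the verdict.
    """
    health = json.loads(body)
    if health.get("status") != "ok":
        return "unhealthy"
    if health.get("workers_connected", 0) == 0:
        return "alive-but-no-workers"
    if health.get("queue_depth", 0) > max_queue_depth:
        return "alive-but-saturated"
    return "ready"

# Using the example response shown above:
body = '{"status": "ok", "version": "0.1.6", "workers_connected": 2, "queue_depth": 0, "uptime_secs": 3621.5}'
print(classify_health(body))  # ready
```

Feed the output of `curl -sf http://proxy:8080/health` into this function from a cron job or sidecar to get a readiness signal that is stricter than a bare HTTP 200.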

You can also list routable models directly:

curl -s http://proxy:8080/v1/models | jq '.data[].id'

Worker Daemon

The worker daemon does not expose its own HTTP port — it connects outward to the proxy. Health is observable from the proxy side:

# Check if workers are registered by listing models.
curl -s http://proxy:8080/v1/models | jq '.data[].id'

If expected models are missing, the worker is either down or failed to register. Check worker logs for connection errors or authentication failures.
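To automate that check, compare the /v1/models response against the set of models you expect a worker to advertise. A small sketch, assuming the standard `{"data": [{"id": ...}]}` response shape shown above:

```python
import json

def missing_models(models_body: str, expected: set[str]) -> set[str]:
    """Return expected model IDs absent from a /v1/models response body."""
    data = json.loads(models_body)
    advertised = {m["id"] for m in data.get("data", [])}
    return expected - advertised

# A worker advertising llama3-8b but not mistral-7b:
body = '{"data": [{"id": "llama3-8b"}]}'
print(missing_models(body, {"llama3-8b", "mistral-7b"}))  # {'mistral-7b'}
```

An empty set means every expected model is routable; anything else names the workers to investigate.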


Admin API & Monitoring

ModelRelay includes admin endpoints for inspecting workers and request metrics and for managing client API keys. All /admin/* endpoints require a Bearer token.

Enabling Admin Endpoints

Set MODELRELAY_ADMIN_TOKEN (or pass the equivalent --admin-token flag) when starting the server:

modelrelay-server --worker-secret mysecret --admin-token my-admin-secret

Without this token, all /admin/* endpoints return 403 Forbidden.

Querying Admin Endpoints

TOKEN="my-admin-secret"

# List connected workers (models, load, capabilities)
curl -s -H "Authorization: Bearer $TOKEN" http://proxy:8080/admin/workers | jq .

# Request stats and queue depth
curl -s -H "Authorization: Bearer $TOKEN" http://proxy:8080/admin/stats | jq .

# List client API keys (metadata only, no secrets)
curl -s -H "Authorization: Bearer $TOKEN" http://proxy:8080/admin/keys | jq .

Managing Client API Keys

When MODELRELAY_REQUIRE_API_KEYS=true, clients must send a valid API key as a Bearer token on inference requests.

TOKEN="my-admin-secret"

# Create a new API key (the secret is returned only at creation time)
curl -s -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "production-app"}' \
  http://proxy:8080/admin/keys | jq .

# Revoke a key by ID
curl -s -X DELETE \
  -H "Authorization: Bearer $TOKEN" \
  http://proxy:8080/admin/keys/{key-id}

Clients use the returned secret as a Bearer token:

curl -H "Authorization: Bearer mr-..." \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b", "messages": [{"role": "user", "content": "Hi"}]}' \
  http://proxy:8080/v1/chat/completions

Web Dashboard & Setup Wizard

The proxy serves a built-in web UI:

  • Dashboard — visit http://proxy:8080/dashboard for real-time worker status, request metrics, and queue depth.
  • Setup Wizard — visit http://proxy:8080/setup for a step-by-step guide to connecting a new worker (platform detection, backend setup, binary download, and live connection verification).

The wizard is always accessible, not just on first run — use it whenever you add another GPU box.

Troubleshooting Admin Features

Admin endpoints return 403: MODELRELAY_ADMIN_TOKEN is not set on the server, or the Authorization header doesn't match. Verify the token value and ensure the header format is Authorization: Bearer <token>.

Client requests return 401 when API key auth is enabled: The client is not sending a Bearer token, or the key has been revoked. Create a new key via POST /admin/keys and ensure the client sends Authorization: Bearer <key>.

API key auth not taking effect: MODELRELAY_REQUIRE_API_KEYS must be set to true. When false (the default), inference endpoints accept unauthenticated requests.


Checking Worker Registration

After starting a new worker, confirm it registered:

# Should include the worker's advertised models.
curl -s http://proxy:8080/v1/models | jq .

If a worker's models don't appear within ~10 seconds:

  1. Check the worker secret — does WORKER_SECRET on the worker match the proxy?
  2. Check connectivity — can the worker reach PROXY_URL?
    curl -v http://proxy:8080/v1/worker/connect
    # Should get 400 or upgrade-required, not a connection error
    
  3. Check worker logs — look for register / register_ack messages or error lines.

Draining a Worker Gracefully

To remove a worker from rotation without dropping in-flight requests:

  1. Send SIGTERM to the modelrelay-worker process. The daemon initiates a graceful disconnect — the proxy sends a GracefulShutdown message and stops routing new requests to that worker.

  2. In-flight requests finish normally. The proxy waits up to drain_timeout_secs (from the shutdown message) for active requests to complete.

  3. Once idle, the WebSocket closes. The worker process exits.
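The drain sequence above amounts to: stop routing, wait a bounded time for in-flight work, then close. A sketch of the waiting step — drain_timeout_secs mirrors the field in the shutdown message; the function and callback names are illustrative, not ModelRelay internals:

```python
import time

def wait_for_drain(in_flight_count, drain_timeout_secs: float, poll_secs: float = 0.05) -> bool:
    """Poll until no requests are in flight or the drain timeout expires.

    in_flight_count is a callable returning the current number of active
    requests. Returns True if the worker drained cleanly within the timeout.
    """
    deadline = time.monotonic() + drain_timeout_secs
    while time.monotonic() < deadline:
        if in_flight_count() == 0:
            return True
        time.sleep(poll_secs)
    return in_flight_count() == 0

# Simulate two requests finishing on successive polls:
remaining = [2, 1, 0]
print(wait_for_drain(lambda: remaining.pop(0) if remaining else 0, drain_timeout_secs=1.0))  # True
```

A False return corresponds to the timeout case: the proxy gives up waiting and the remaining requests are cut off when the socket closes.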

# Graceful stop via systemd
systemctl stop modelrelay-worker@gpu-box-1

# Or with Docker
docker stop --time 60 worker-gpu-box-1

Monitoring drain progress: Watch the proxy logs for "worker drained" or similar messages. If the worker still has in-flight requests, you'll see ongoing ResponseChunk / ResponseComplete messages until they finish.


Scaling Workers

Adding a worker

Start a new modelrelay-worker instance pointing at the same proxy:

PROXY_URL=http://proxy:8080 \
WORKER_SECRET=your-secret \
WORKER_NAME=gpu-box-4 \
BACKEND_URL=http://localhost:8000 \
  modelrelay-worker --models llama3-8b

The proxy discovers it within seconds via the WebSocket registration handshake. No proxy restart or config change needed.

Removing a worker

Use the graceful drain procedure above. The proxy automatically routes around disconnected workers.

Scaling the proxy

The proxy is a single-process server. To scale:

  • Vertical: increase MAX_QUEUE_LEN and system file descriptor limits.
  • Horizontal: run multiple proxy instances behind a load balancer, but note that each worker connects to one proxy. Workers must be distributed across proxy instances manually or via DNS round-robin.
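One way to distribute workers across proxies without coordination is a deterministic hash of the worker name onto the proxy list. This is a deployment-side sketch, not a built-in ModelRelay feature:

```python
import hashlib

def assign_proxy(worker_name: str, proxy_urls: list[str]) -> str:
    """Deterministically map a worker to one proxy instance.

    Uses SHA-256 rather than Python's built-in hash(), which is
    randomized per process and would not survive restarts.
    """
    digest = hashlib.sha256(worker_name.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(proxy_urls)
    return proxy_urls[index]

proxies = ["http://proxy-a:8080", "http://proxy-b:8080"]
print(assign_proxy("gpu-box-1", proxies))
```

Use the returned URL as the worker's PROXY_URL at startup; the same worker name always lands on the same proxy until the proxy list changes.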

Log Interpretation

Proxy Server

Log pattern                        Meaning
worker registered / register_ack   Worker connected and authenticated
request dispatched                 Request sent to a worker
response complete                  Worker returned a result
worker heartbeat timed out         Worker missed pings — WebSocket closed
request requeued                   Worker died mid-request, retrying on another worker
requeue exhausted                  Request failed after MAX_REQUEUE_COUNT (3) retries
queue full                         Rejected request — queue at MAX_QUEUE_LEN capacity
queue timeout                      Request sat in queue longer than QUEUE_TIMEOUT_SECS
graceful shutdown                  Worker drain initiated
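The requeue lines in the table correspond to retry-with-limit bookkeeping. A sketch of that logic — the limit of 3 comes from the table; the function name and per-request counter field are illustrative:

```python
MAX_REQUEUE_COUNT = 3  # retry limit from the log table above

def handle_worker_failure(request: dict) -> str:
    """Decide whether a failed request goes back on the queue.

    Returns the log line the proxy would emit in each case (wording is
    illustrative; match against your actual log output).
    """
    request["requeue_count"] = request.get("requeue_count", 0) + 1
    if request["requeue_count"] > MAX_REQUEUE_COUNT:
        return "requeue exhausted"
    return "request requeued"

req = {"model": "llama3-8b"}
print([handle_worker_failure(req) for _ in range(4)])
# ['request requeued', 'request requeued', 'request requeued', 'requeue exhausted']
```

So a single request can survive up to three worker deaths; the fourth failure surfaces as an error to the client.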

Worker Daemon

Log pattern          Meaning
connected to proxy   WebSocket connection established
registered           Registration acknowledged by proxy
forwarding request   Proxying a request to the local backend
backend error        Local backend returned an error or is unreachable
cancelled            Proxy sent a cancel for an in-flight request
graceful shutdown    Drain in progress, finishing active requests

Adjusting log verbosity

Set LOG_LEVEL environment variable on either component:

LOG_LEVEL=debug modelrelay-server   # trace, debug, info (default), warn, error
LOG_LEVEL=debug modelrelay-worker

Common Failure Modes

Worker can't connect to proxy

Symptoms: Worker logs show connection refused or timeouts.

Checklist:

  1. Is the proxy running? curl http://proxy:8080/v1/models
  2. Is PROXY_URL correct? The worker connects to {PROXY_URL}/v1/worker/connect via WebSocket.
  3. Firewall / network: the worker makes an outbound connection to the proxy — no inbound ports needed on the worker machine.
  4. If using TLS (nginx/reverse proxy in front), ensure WebSocket upgrade headers are forwarded. See the TLS Setup guide.

Worker registers but requests fail

Symptoms: /v1/models shows the model, but requests return 502 or timeout.

Checklist:

  1. Is the local backend running? curl http://localhost:8000/v1/models (or whatever BACKEND_URL is set to)
  2. Does the backend support the requested endpoint? (/v1/chat/completions, /v1/messages, /v1/responses)
  3. Check worker logs for backend error messages.
  4. Try a direct request to the backend to isolate the issue.
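The comparison in step 4 can be scripted: send the same request directly to the backend and through the proxy, then interpret the pair of status codes. The interpretation logic, sketched (status-code mapping is illustrative):

```python
def isolate_failure(backend_status: int, proxy_status: int) -> str:
    """Interpret paired status codes from a direct-backend probe and a via-proxy probe."""
    backend_ok = 200 <= backend_status < 300
    proxy_ok = 200 <= proxy_status < 300
    if backend_ok and proxy_ok:
        return "both paths healthy"
    if not backend_ok:
        return "backend problem: fix the local model server first"
    return "proxy/worker path problem: check worker logs for backend error lines"

# Backend answered but the proxied request got a 502:
print(isolate_failure(200, 502))
```

If the backend itself fails, there is no point debugging the proxy path until the local model server is healthy again.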

Requests queue but never complete

Symptoms: Clients hang, then get a timeout error after QUEUE_TIMEOUT_SECS.

Causes:

  • No workers are connected (check /v1/models)
  • Workers are at capacity (max_concurrent reached on all workers)
  • Workers are connected but not advertising the requested model

Fix: Add more workers, increase max_concurrent if the hardware allows, or reduce QUEUE_TIMEOUT_SECS to fail faster.
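As a rough capacity check before adding workers: the number of concurrent slots you need is arrival rate times average service time (Little's law), divided by each worker's max_concurrent. A back-of-the-envelope helper with illustrative numbers:

```python
import math

def workers_needed(requests_per_sec: float, avg_latency_secs: float,
                   max_concurrent_per_worker: int) -> int:
    """Estimate the worker count needed so the queue does not grow without bound."""
    # Little's law: concurrent requests L = arrival rate * time in system.
    concurrent_slots = requests_per_sec * avg_latency_secs
    return math.ceil(concurrent_slots / max_concurrent_per_worker)

# 2 req/s at 8 s average latency needs 16 slots; at 4 slots per worker:
print(workers_needed(requests_per_sec=2.0, avg_latency_secs=8.0, max_concurrent_per_worker=4))  # 4
```

If the estimate exceeds your current fleet, sustained queue growth is expected behavior, not a bug.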

Streaming responses arrive corrupted

Symptoms: SSE chunks arrive garbled or out of order.

Checklist:

  1. Ensure no intermediate proxy is buffering. Disable response buffering in nginx:
    proxy_buffering off;
    
  2. If using a CDN or reverse proxy, ensure it supports chunked transfer encoding and doesn't aggregate small writes.

High memory usage on the proxy

Symptoms: Proxy RSS grows over time.

Causes:

  • Large queue of pending requests (each holds the full request body)
  • Many concurrent streaming responses with large chunk buffers

Fix: Lower MAX_QUEUE_LEN, set QUEUE_TIMEOUT_SECS to a shorter value, or add workers to drain the queue faster.

Worker keeps reconnecting

Symptoms: Worker logs show repeated connect/disconnect cycles.

Causes:

  • Heartbeat timeout — the worker or network is too slow to respond to pings within HEARTBEAT_INTERVAL
  • WORKER_SECRET mismatch — worker connects, fails auth, gets disconnected, retries

Fix: Check secrets match, check network latency between worker and proxy.


Configuration Quick Reference

Proxy Server

Env Var                       Default          Description
LISTEN_ADDR                   127.0.0.1:8080   HTTP listen address
PROVIDER_NAME                 local            Provider name for routing
WORKER_SECRET                 (required)       Shared secret for worker auth
MAX_QUEUE_LEN                 100              Max queued requests before rejecting
QUEUE_TIMEOUT_SECS            30               How long a request can wait in queue
REQUEST_TIMEOUT_SECS          300              Total request timeout (5 min)
LOG_LEVEL                     info             Log verbosity
MODELRELAY_ADMIN_TOKEN        (none)           Bearer token for /admin/* endpoints (if unset, admin returns 403)
MODELRELAY_REQUIRE_API_KEYS   false            When true, client requests require a valid API key

Worker Daemon

Env Var         Default                 Description
PROXY_URL       http://127.0.0.1:8080   Proxy server URL
WORKER_SECRET   (required)              Must match proxy's secret
WORKER_NAME     worker                  Human-readable worker name
BACKEND_URL     http://127.0.0.1:8000   Local model server URL
LOG_LEVEL       info                    Log verbosity

Windows Service

Checking Service Status

Get-Service ModelRelayServer
Get-Service ModelRelayWorker

Starting and Stopping

Start-Service ModelRelayServer
Stop-Service ModelRelayServer

Start-Service ModelRelayWorker
Stop-Service ModelRelayWorker

Stop-Service sends a stop control signal and waits for the process to exit. ModelRelay handles this as a graceful shutdown — in-flight requests finish before the process terminates. To set an explicit timeout:

# Stop without blocking, wait up to 60 seconds, then force-kill if it never stops.
Stop-Service ModelRelayServer -NoWait
try {
    (Get-Service ModelRelayServer).WaitForStatus("Stopped", "00:01:00")
} catch [System.ServiceProcess.TimeoutException] {
    Stop-Process -Name modelrelay-server -Force
}

Logs

Windows Services don't write to stdout by default. Two options:

  1. Windows Event Log — ModelRelay writes to the Application log. View with:

    Get-EventLog -LogName Application -Source ModelRelayServer -Newest 50
    
  2. File logging via LOG_LEVEL — set LOG_LEVEL as a system environment variable and redirect output to a file by wrapping the binary in a small script. The simplest approach:

    [Environment]::SetEnvironmentVariable("LOG_LEVEL", "info", "Machine")
    

Draining a Worker

To drain a worker gracefully before maintenance:

# Stop the service — this triggers graceful shutdown.
Stop-Service ModelRelayWorker

# Verify it has stopped.
Get-Service ModelRelayWorker

The worker completes in-flight requests before exiting, identical to the systemctl stop behavior on Linux.


Monitoring Checklist

For production deployments, monitor these signals:

  • Proxy process is up — HTTP health check on /health
  • At least one worker registered — /health returns workers_connected > 0
  • Queue depth — /health returns queue_depth; watch for sustained growth
  • Request latency — track time from client request to first byte
  • Worker reconnect rate — frequent reconnects indicate network or auth issues
  • Error rates — 4xx (client errors) vs 5xx (backend/proxy errors)
  • Backend health — each worker's local model server should be independently monitored
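"Sustained growth" in queue depth is worth detecting explicitly, since a briefly nonzero queue is normal. A sketch that flags strictly increasing depth over a sliding window of /health samples — window size and strictness are illustrative choices:

```python
def queue_growing(samples: list[int], window: int = 3) -> bool:
    """True if queue_depth strictly increased across the last `window` samples.

    samples is a chronological list of queue_depth values taken from
    successive /health polls.
    """
    recent = samples[-window:]
    return len(recent) == window and all(b > a for a, b in zip(recent, recent[1:]))

print(queue_growing([0, 1, 4, 9]))   # True  — depth rising every poll
print(queue_growing([5, 3, 2, 2]))   # False — depth draining or flat
```

Alert on True combined with workers_connected staying flat: that pattern means demand is outrunning capacity rather than a momentary burst.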