ModelRelay
Stop configuring clients for every GPU box. Workers connect out; requests route in.
You have GPU boxes running llama-server (or Ollama, or vLLM, or anything OpenAI-compatible). Today you either expose each one directly — port forwarding, DNS, firewall rules — or you stick a load balancer in front that doesn't understand LLM streaming or cancellation.
ModelRelay flips the model: a central proxy receives standard inference requests while worker daemons on your GPU boxes connect out to it over WebSocket. The proxy handles queueing, routing, streaming pass-through, and cancellation propagation. Clients see one stable endpoint and never need to know about your hardware.
Clients (curl, Claude Code, LiteLLM, Open WebUI, ...)
|
| POST /v1/chat/completions
| POST /v1/messages
v
+----------------------+
| modelrelay-server |<--- workers connect out (WebSocket)
| (one stable | no inbound ports needed on GPU boxes
| endpoint) |
+----------------------+
| routes request to best available worker
v
+--------+ +--------+ +--------+
|worker-1| |worker-2| |worker-3|
| llama | | ollama | | vllm | <- your GPU boxes,
| server | | | | | anywhere on any network
+--------+ +--------+ +--------+
Hosted Version
Don't want to run the infrastructure yourself? A fully-managed hosted version is available at modelrelay.io — no server setup, no infrastructure to manage. Just get an API key, point your workers at it, and start routing requests. Same open protocol, zero ops burden.
Who is this for?
- Home GPU users running local models who want a single API endpoint across multiple machines
- Teams with on-prem hardware that need to pool GPU capacity without a service mesh
- Researchers juggling models across heterogeneous boxes who are tired of updating client configs
Features
- OpenAI + Anthropic compatible — POST /v1/chat/completions, POST /v1/responses, POST /v1/messages, GET /v1/models
- No inbound ports on GPU boxes — workers connect out to the proxy over WebSocket
- Request queueing — configurable depth and timeout when all workers are busy
- Streaming pass-through — SSE chunks forwarded with preserved ordering and termination
- End-to-end cancellation — client disconnect propagates through the proxy to the worker to the backend
- Automatic requeue — if a worker dies mid-request, the request is requeued to another worker
- Heartbeat and load tracking — stale workers are cleaned up; workers report current load
- Graceful drain — workers can shut down while replacement workers pick up queued work
- Cross-platform — pre-built binaries for Linux, macOS, and Windows (x86_64 + arm64)
Quick Start
The fastest way to get running is with Docker:
# 1. Run the proxy
docker run -p 8080:8080 \
-e WORKER_SECRET=mysecret \
-e LISTEN_ADDR=0.0.0.0:8080 \
ghcr.io/ericflo/modelrelay/modelrelay-server:latest
# 2. Run a worker (on a GPU box with llama-server or similar)
docker run \
-e PROXY_URL=http://<proxy-host>:8080 \
-e WORKER_SECRET=mysecret \
-e BACKEND_URL=http://host.docker.internal:8000 \
-e MODELS=llama3.2:3b \
ghcr.io/ericflo/modelrelay/modelrelay-worker:latest
# 3. Send a request
curl http://localhost:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "llama3.2:3b",
"messages": [{"role": "user", "content": "Hello!"}]
}'
For more installation options (pre-built binaries, Docker Compose, building from source, systemd, Windows services), see the GitHub README.
Documentation
- Architecture — System design, component overview, and data flow
- Protocol Walkthrough — Wire-level protocol details with examples
- Behavior Contract — The exact behavioral guarantees the system provides
- Operational Runbook — Deployment, configuration, monitoring, and troubleshooting
Source & Contributing
ModelRelay is MIT-licensed and developed at github.com/ericflo/modelrelay. Bug reports, feature requests, and PRs are welcome — see CONTRIBUTING.md for details.
Architecture
This document describes the internal architecture of ModelRelay: how the components fit together, how data flows through the system, and why key design decisions were made. It is intended for contributors, operators, and anyone evaluating ModelRelay for their own infrastructure.
Workspace Shape
- crates/modelrelay-contract-tests — Black-box behavior tests and focused harnesses for registration, queueing, response streaming, cancellation, requeue, heartbeat, and graceful shutdown semantics.
- crates/modelrelay-protocol — Shared Rust protocol types for the WebSocket bridge: registration, dispatch, streaming chunks, cancellation, heartbeats, and operational control messages.
- crates/modelrelay-server — Central HTTP proxy. Owns the client-facing OpenAI and Anthropic compatibility layers, worker auth, provider config, worker registry, queueing, routing, cancellation, and graceful drain.
- crates/modelrelay-worker — Remote worker process. Authenticates to the server, advertises models and capacity, forwards requests to a local backend such as llama-server, streams chunks back, refreshes advertised models, reports live load in heartbeats, and honors cancellation plus graceful shutdown.
Component Overview
┌─────────────────────────────────────────────────────────┐
│ modelrelay-server │
│ │
HTTP clients │ ┌───────────┐ ┌──────────────┐ ┌─────────────┐ │ WebSocket
─────────────────►│ │ HTTP │───►│ Queue │───►│ Dispatcher │ │◄──────────
/v1/chat/ │ │ Router │ │ Manager │ │ │ │ workers
completions, │ │ │ │ (per-provider│ │ (load-aware │ │ connect in
/v1/messages, │ │ (axum │ │ FIFO) │ │ round- │ │
/v1/responses │ │ routes) │ │ │ │ robin) │ │
│ └───────────┘ └──────────────┘ └──────┬──────┘ │
│ │ │ │
│ │ ┌──────────────┐ │ │
│ │ │ Worker │◄──────────┘ │
│ │ │ Registry │ │
│ │ │ (auth, model │ │
│ │ │ tracking, │ │
│ │ │ load, drain)│ │
│ │ └──────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────┐ ┌──────────────┐ │
│ │ Cancel │ │ WebSocket │ │
│ │ Guard │ │ Hub │ │
│ │ (RAII │ │ (per-worker │ │
│ │ drop) │ │ message │ │
│ │ │ │ routing) │ │
│ └───────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────┘
HTTP Router (http.rs) — axum-based handler for four client-facing routes plus the worker WebSocket upgrade endpoint. Parses the model name and streaming flag from the request body, submits to the core, and bridges the response back as either a single body or an SSE stream.
Queue Manager (lib.rs) — per-provider FIFO queue with configurable max length and timeout. Requests land here when no worker with capacity is immediately available. The queue is drained oldest-first whenever a worker finishes a request or a new worker registers.
Dispatcher (lib.rs) — selects the best worker for a request. Filters by provider, model support, capacity, and drain state, then picks the lowest-load worker with round-robin tie-breaking via per-provider cursors.
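The dispatcher's selection rule — filter eligible workers, then take the lowest load with round-robin tie-breaking — can be sketched as follows. This is an illustrative Python model, not the Rust implementation; the dict shape (`models`, `load`, `max_concurrent`, `draining`) and the `cursor` argument standing in for the per-provider cursor are assumptions for the sketch.

```python
def select_worker(workers, model, cursor):
    """Pick the lowest-load eligible worker for `model`.

    Eligibility: supports the model, not draining, has spare capacity.
    Ties on load are broken round-robin via the per-provider `cursor`.
    """
    eligible = [
        w for w in workers
        if model in w["models"]
        and not w["draining"]
        and w["load"] < w["max_concurrent"]
    ]
    if not eligible:
        return None  # caller queues the request instead
    min_load = min(w["load"] for w in eligible)
    tied = [w for w in eligible if w["load"] == min_load]
    # Rotate among tied workers so repeated dispatches spread out.
    return tied[cursor % len(tied)]
```

With two idle workers supporting the same model, successive cursor values alternate between them; a loaded worker is skipped until its load drops back to the minimum.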
Worker Registry (lib.rs) — tracks every connected worker's identity, supported models, max concurrency, reported load, in-flight request set, and drain state. Updated by registration, heartbeat pongs, model refreshes, and disconnect events.
WebSocket Hub (worker_socket.rs) — manages the authenticated WebSocket connection for each worker. Routes server-to-worker messages (request dispatch, cancel signals, pings, graceful shutdown, model refresh) and worker-to-server messages (response chunks, completions, pongs, model updates, errors).
Cancel Guard (http.rs) — an RAII HttpRequestCancellationGuard that fires if the HTTP response future is dropped (client disconnect or timeout). On drop, it broadcasts a cancel signal through the core to the assigned worker.
Worker Daemon Internals
┌─────────────────────────────────────────────────────────┐
│ modelrelay-worker │
│ │
│ ┌──────────────┐ ┌─────────────────────────┐ │
│ │ Connection │ │ Request Tasks │ │
│ │ Manager │ │ │ │
│ │ │ spawn │ ┌───────┐ ┌───────┐ │ │
│ │ • connect │─────────►│ │ Req 1 │ │ Req 2 │... │ │
│ │ • register │ │ │ │ │ │ │ │
│ │ • reconnect │◄─────────│ │ POST │ │ POST │ │ │
│ │ (exp. │ events │ │ to │ │ to │ │ │
│ │ backoff) │ │ │ local │ │ local │ │ │
│ └──────┬───────┘ │ └───┬───┘ └───┬───┘ │ │
│ │ └──────┼─────────┼────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────┐ ┌─────────────────────┐ │
│ │ Socket Loop │ │ Local Backend │ │
│ │ (select!) │ │ (llama-server, │ │
│ │ │ │ Ollama, vLLM, etc) │ │
│ │ • read msgs │ └─────────────────────┘ │
│ │ • send msgs │ │
│ │ • heartbeat │ │
│ └──────────────┘ │
└─────────────────────────────────────────────────────────┘
The worker daemon runs a single select! loop that multiplexes:
- Inbound WebSocket messages — dispatched to handle_server_message(), which routes each message type: spawns a task for Request, responds to Ping, applies Cancel to active tasks, triggers model refresh, or begins graceful drain.
- Outbound events from request tasks — each spawned request task communicates back through an mpsc channel. ResponseChunk events are forwarded immediately over the WebSocket. RequestFinished and RequestFailed events trigger cleanup.
- Reconnection with exponential backoff — on unexpected disconnect, the outer run_with_reconnect() loop retries with 1–30 second backoff plus up to 500ms jitter. Only a GracefulShutdown message causes a clean exit.
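The reconnect delay can be sketched like this. The doc only pins down the 1–30 second range and the up-to-500 ms jitter; the doubling schedule below is an assumption for illustration, not necessarily what run_with_reconnect() does.

```python
import random

def reconnect_delay(attempt: int, base=1.0, cap=30.0, jitter_ms=500) -> float:
    """Backoff for reconnect attempt `attempt` (0-based).

    Doubles from 1 s up to a 30 s cap (doubling is an assumed schedule),
    then adds up to 500 ms of random jitter so a fleet of workers does
    not reconnect in lockstep after a server restart.
    """
    delay = min(cap, base * (2 ** attempt))
    return delay + random.uniform(0.0, jitter_ms / 1000.0)
```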
Data Flow: Non-Streaming Request
Client Server Worker Backend
│ │ │ │
│ POST /v1/chat/ │ │ │
│ completions │ │ │
│────────────────────────►│ │ │
│ │ find_eligible_worker() │ │
│ │ assign_to_worker() │ │
│ │ │ │
│ │ WS: Request{id,body} │ │
│ │─────────────────────────►│ │
│ │ │ POST /v1/chat/ │
│ │ │ completions │
│ │ │───────────────────►│
│ │ │ │
│ │ │◄───────────────────│
│ │ │ 200 {response} │
│ │ WS: ResponseComplete │ │
│ │ {id, 200, body} │ │
│ │◄─────────────────────────│ │
│ │ │ │
│ 200 {response} │ finish_request() │ │
│◄────────────────────────│ dispatch_next_compat() │ │
│ │ │ │
Data Flow: Streaming Request
Client Server Worker Backend
│ │ │ │
│ POST /v1/chat/ │ │ │
│ completions │ │ │
│ stream: true │ │ │
│────────────────────────►│ │ │
│ │ WS: Request{id,body, │ │
│ │ is_streaming: true} │ │
│ │─────────────────────────►│ │
│ │ │ POST to backend │
│ │ │───────────────────►│
│ │ │ │
│ SSE: data: chunk1 │ WS: ResponseChunk │◄── chunk 1 ───────│
│◄────────────────────────│◄─────────────────────────│ │
│ SSE: data: chunk2 │ WS: ResponseChunk │◄── chunk 2 ───────│
│◄────────────────────────│◄─────────────────────────│ │
│ SSE: data: chunk3 │ WS: ResponseChunk │◄── chunk 3 ───────│
│◄────────────────────────│◄─────────────────────────│ │
│ │ │ │
│ SSE: [DONE] │ WS: ResponseComplete │◄── end ───────────│
│◄────────────────────────│◄─────────────────────────│ │
│ │ finish_request() │ │
Streaming chunks flow through three hops with minimal buffering: the worker reads chunks from the backend HTTP response body as they arrive, wraps each in a ResponseChunk WebSocket message, the server receives it and pushes it into an mpsc channel, and the HTTP handler yields it as an SSE event to the client.
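One detail worth making concrete is SSE framing: each event ends with a blank line, and the relay must forward complete events verbatim while holding back a partial tail. A minimal sketch of that splitting step, assuming chunks arrive as raw bytes (the function name is illustrative):

```python
def split_sse_events(buffer: bytes):
    """Split `buffer` into complete SSE events plus the unconsumed tail.

    Events are terminated by a blank line (\\n\\n) and returned verbatim,
    byte-for-byte, so forwarding them preserves the original framing.
    """
    events = []
    while (end := buffer.find(b"\n\n")) != -1:
        events.append(buffer[: end + 2])  # keep the terminator intact
        buffer = buffer[end + 2:]
    return events, buffer  # tail is carried into the next read
```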
Data Flow: Client Cancellation
Client Server Worker Backend
│ │ │ │
│ POST /v1/chat/... │ WS: Request │ POST to backend │
│────────────────────────►│─────────────────────────►│───────────────────►│
│ │ │ │
│ [client disconnects] │ │ ◄── streaming ── │
│─ ─ ─ ─ ─X │ │ │
│ │ CancellationGuard drop │ │
│ │ cancel_request(id) │ │
│ │ │ │
│ │ WS: Cancel{id} │ │
│ │─────────────────────────►│ │
│ │ │ [abort request │
│ │ │ task] │
│ │ │───── abort ───────►│
The RAII HttpRequestCancellationGuard is the key mechanism. When the HTTP response future is dropped — either because the client disconnected or a server-side timeout fired — the guard's Drop implementation spawns an async task that calls cancel_request(). If the request is still queued, it is removed immediately. If it is in-flight with a worker, a Cancel message is sent over the WebSocket, and the worker aborts the corresponding request task.
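A rough Python analogue of the drop-guard pattern, for readers less familiar with Rust RAII: unless the guard is explicitly disarmed on normal completion, leaving the scope fires the cancel callback. The class and method names here are illustrative, not ModelRelay's.

```python
class CancellationGuard:
    """Fire `cancel_fn(request_id)` on scope exit unless disarmed.

    Mirrors the Rust guard's Drop impl: dropping the response future
    without completing it triggers cancellation automatically.
    """
    def __init__(self, request_id, cancel_fn):
        self.request_id = request_id
        self.cancel_fn = cancel_fn
        self.armed = True

    def disarm(self):
        """Call on normal completion so no cancel is sent."""
        self.armed = False

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        if self.armed:
            self.cancel_fn(self.request_id)
        return False  # never swallow exceptions
```

The key property is that there is no code path that forgets cleanup: abandoning the scope early (client disconnect, timeout, panic) is exactly what triggers the cancel.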
Worker Lifecycle State Machine
┌────────────┐
│ Connecting │
│ │
│ WS handshake
│ + auth │
└─────┬──────┘
│ RegisterAck
▼
┌────────────┐
┌────►│ Idle │◄───────────────┐
│ │ │ │
│ │ capacity > 0 │
│ │ no in-flight │
│ └─────┬──────┘ │
│ │ Request dispatched │ request finished
│ ▼ │ (and more capacity)
│ ┌────────────┐ │
│ │ Busy │─────────────────┘
│ │ │
│ │ in-flight │
│ │ requests > 0 ┌────────────┐
│ └─────┬──────┘ │ Gone │
│ │ GracefulShutdown │ │
│ ▼ │ disconnected
│ ┌────────────┐ │ or shutdown │
│ │ Draining │────────────────►│ complete │
│ │ │ all in-flight └────────────┘
│ │ no new │ finished or ▲
│ │ requests │ drain timeout │
│ └────────────┘ │
│ │
└──────────────────────────────────────────┘
unexpected disconnect
(triggers reconnect in worker daemon)
Connecting — WebSocket handshake in progress. The worker sends x-worker-secret in the upgrade request for authentication and the provider name as a query parameter.
Idle — registered and waiting for work. The worker has capacity (reported load < max concurrency) and no in-flight requests. The server may dispatch requests to it.
Busy — processing one or more requests. The worker still accepts new requests up to its max concurrency. Each in-flight request is tracked independently.
Draining — the server sent GracefulShutdown. No new requests are dispatched. Existing in-flight requests are allowed to complete up to an optional drain timeout. Once all requests finish (or the timeout expires), the worker transitions to Gone.
Gone — the worker is removed from the registry. In-flight requests are requeued (up to 3 attempts per request) or failed if already cancelled. The worker daemon's reconnect loop may bring it back as a new Connecting session.
Request Lifecycle State Machine
┌────────────┐
│ Received │
│ │
│ HTTP req │
│ parsed │
└─────┬──────┘
│
├── eligible worker found ──────────┐
│ │
▼ ▼
┌────────────┐ ┌────────────────┐
│ Queued │ │ Dispatched │
│ │──── worker becomes──►│ │
│ in provider│ available │ assigned to │
│ FIFO queue │ │ worker, WS msg │
└─────┬──┬───┘ │ sent │
│ │ └───────┬────────┘
│ │ │
│ │ queue timeout ┌───────────────┤
│ │ or queue full │ │
│ ▼ │ │ is_streaming
┌────────────┐ │ ▼
│ Failed │ │ ┌────────────┐
│ │◄─────────────┤ │ Streaming │
│ • QueueFull│ worker dies │ │ │
│ • Timeout │ (requeue │ │ chunks │
│ • NoWorkers│ exhausted) │ │ forwarded │
│ • Cancelled│ │ │ via mpsc │
└────────────┘ │ └─────┬──────┘
▲ │ │
│ │ │
│ cancel signal │ │
│ (client disconnect│ │
│ or timeout) │ │
│ │ ▼
┌────────────┐ │ ┌────────────┐
│ Cancelled │◄─────────────┤ │ Done │
│ │ │ │ │
│ cancel │◄─────────────┘ │ Response │
│ propagated │ │ Complete │
│ to worker │ │ sent to │
└────────────┘ │ client │
└────────────┘
A request can be cancelled at any point: if still queued, it is removed from the queue immediately. If dispatched or streaming, a Cancel message is sent to the worker. After a request finishes (Done, Failed, or Cancelled), the dispatcher checks whether the now-free worker can pick up the next queued request for a compatible model.
Protocol Messages
All messages are JSON over WebSocket. The protocol is defined in the modelrelay-protocol crate and uses serde's tagged enum representation ("type": "message_type").
Server → Worker:
| Message | Purpose |
|---|---|
| RegisterAck | Confirms registration, assigns worker ID, echoes accepted models |
| Request | Dispatches an inference request (id, model, endpoint, body, headers, streaming flag) |
| Cancel | Cancels an in-flight request with a reason |
| Ping | Heartbeat probe with optional timestamp |
| GracefulShutdown | Initiates drain with optional reason and timeout |
| ModelsRefresh | Asks worker to re-query its backend for available models |
Worker → Server:
| Message | Purpose |
|---|---|
| Register | Initial registration with name, models, max concurrency, load |
| ModelsUpdate | Updated model list and current load (after refresh or change) |
| ResponseChunk | One chunk of a streaming response |
| ResponseComplete | Final response with status code, headers, and optional body |
| Pong | Heartbeat response echoing timestamp plus current load |
| Error | Error report, optionally scoped to a specific request |
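Because the protocol is serde's internally tagged representation, framing a message in any language amounts to one JSON object with a "type" field. A minimal sketch (helper names are illustrative):

```python
import json

def encode_message(msg_type: str, **fields) -> str:
    """Serialize a protocol message in internally tagged form:
    the variant name rides in a "type" field beside the payload."""
    return json.dumps({"type": msg_type, **fields})

def decode_message(raw: str):
    """Return (message_type, payload_fields) for an incoming frame."""
    obj = json.loads(raw)
    return obj.pop("type"), obj
```

This is the whole framing story — no length prefixes, no schema negotiation — which is what makes a worker implementable in any language with a WebSocket client and a JSON parser.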
Key Design Decisions
Why the queue lives at the center
Queueing at each worker would require clients to retry across workers or implement their own load balancing. Central queueing means one place manages fairness, timeout policy, and capacity-aware routing. When a worker finishes a request, the server immediately checks the queue for the next compatible request — no external coordination needed.
Why WebSocket instead of gRPC
Workers connect out to the server. This is the fundamental topology: GPU boxes on home networks, behind NATs, with no inbound ports. WebSocket over HTTP works through every proxy and firewall. gRPC would add a proto compilation step, a heavier runtime dependency, and more complex connection management for no meaningful benefit in a system where the message vocabulary is small and the payload is mostly opaque passthrough.
Why the protocol is flat JSON
The protocol has ~12 message types. Each is a small JSON object with a "type" tag. There is no binary framing, no schema negotiation, no version handshake beyond a simple protocol_version field. This makes debugging trivial (read the WebSocket frames), keeps the protocol crate minimal, and means any language can implement a worker in an afternoon. The heavy payload (inference request bodies and response chunks) is opaque text passed through without parsing.
Why streaming is chunked SSE, not buffered
LLM inference can take seconds to minutes. Buffering the full response before sending it to the client would destroy the interactive experience. ModelRelay preserves streaming semantics end-to-end: the worker reads chunks from the backend as they arrive, wraps each in a ResponseChunk message, and the server yields each as an SSE event. The client sees tokens arrive in real time, identical to talking directly to the backend.
Why cancellation is RAII-based
Client disconnects are the normal case, not an exception. When a user closes a tab or ctrl-C's a curl command, the HTTP response future is dropped. Rust's ownership model makes this the natural place to trigger cleanup: the HttpRequestCancellationGuard fires on drop, propagates the cancel through the server core to the worker, and the worker aborts the backend request. No polling, no timers, no forgotten cleanup paths.
Why requeue has a cap of 3
When a worker dies mid-request, the server requeues the request to another worker. But if workers keep dying (bad model, OOM, hardware failure), infinite requeue would loop forever. Three attempts is enough to survive transient worker restarts without masking systemic failures.
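The requeue decision reduces to a single bounded counter per request. A sketch of that logic, assuming a dict-shaped request with a requeue_count field (the function signature is illustrative):

```python
MAX_REQUEUE_COUNT = 3  # cap from the doc: three attempts, then fail

def on_worker_died(request, queue, fail):
    """Handle a worker dying while `request` was in flight.

    Requeue into the provider FIFO until the cap is hit, then fail the
    request instead of masking a systemic problem with infinite retries.
    """
    if request["requeue_count"] < MAX_REQUEUE_COUNT:
        request["requeue_count"] += 1
        queue.append(request)
    else:
        fail(request, "requeue_exhausted")
```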
Capacity and Scaling
What limits the server
The server is single-process, async (tokio). The practical limits are:
- Connected workers: bounded by memory for the worker registry and WebSocket connections. Thousands of workers are feasible.
- Queue depth: configurable per provider (max_queue_len). Memory cost is proportional to queued request bodies.
- Concurrent in-flight requests: bounded by the sum of all workers' max_concurrent values. Each in-flight request holds a small state record and channel handles.
- Streaming throughput: chunks flow through an mpsc channel per request. The server does minimal processing per chunk (no parsing, no transformation), so throughput scales with I/O.
What limits a worker
Each worker is bounded by its local backend's capacity. The max_concurrent setting should match what the backend can handle (e.g., llama-server's -np parallel slots). The worker itself adds negligible overhead — it is a thin forwarding layer.
What the queue cannot do
- Priority: the queue is FIFO per provider. There is no request priority mechanism.
- Cross-provider routing: a request targets one provider. There is no fallback to a different provider if the primary queue is full.
- Persistence: the queue is in-memory. If the server restarts, queued requests are lost. In-flight requests fail and clients retry.
Scaling patterns
- Vertical: increase max_concurrent on workers with more GPU memory or faster hardware.
- Horizontal: add more workers. The server's round-robin dispatcher spreads load automatically.
- Multi-server: not built in. For HA, run multiple server instances behind a load balancer, but each server maintains its own worker pool and queue (no shared state). Workers can connect to multiple servers for redundancy.
Design Constraints
- The HTTP boundary should look normal to clients; the worker protocol can stay private and purpose-built.
- Queueing belongs at the central server, not at each worker.
- Streaming and cancellation are first-class concerns, not add-ons.
- The Rust rewrite should preserve behavior, not Go package boundaries.
- The implementation should optimize for testability and explicit state transitions over abstraction depth.
Current Status
The project is complete and ready for production use. The full behavior matrix is implemented and verified by an extensive automated test suite covering:
- OpenAI chat/completions and responses flows
- Anthropic messages flows
- Queueing and timeout behavior
- Streaming pass-through with preserved ordering and termination
- Client cancellation propagation through the WebSocket link
- Worker disconnect and automatic requeue
- Heartbeat and live-load reporting
- Model refresh and auth cooldown recovery
- Graceful shutdown and drain semantics
A multi-stage Dockerfile and docker-compose example are provided for quick setup without a Rust toolchain.
Protocol Walkthrough — Wire Traces
This document shows the actual message flow between components for each
major scenario. Message types reference the structs in the
modelrelay-protocol crate (ServerToWorkerMessage / WorkerToServerMessage).
1. Worker Registration and Heartbeat
Worker Proxy Server
│ │
│──── WebSocket UPGRADE ────────►│ GET /v1/worker/connect
│◄─── 101 Switching Protocols ──│
│ │
│ WorkerToServerMessage::Register
│ { │
│ "type": "register", │
│ "worker_name": "gpu-box-1", │
│ "models": ["llama3-8b"], │
│ "max_concurrent": 4, │
│ "protocol_version": "1", │
│ "current_load": 0 │
│ } │
│──────────────────────────────►│ Proxy validates worker_secret
│ │ (passed as query param or header
│ │ during WebSocket upgrade)
│ │
│ ServerToWorkerMessage::RegisterAck
│ { │
│ "type": "register_ack", │
│ "worker_id": "w-a1b2c3", │
│ "models": ["llama3-8b"], │
│ "protocol_version": "1" │
│ } │
│◄──────────────────────────────│
│ │
│ ┌──── heartbeat loop (HEARTBEAT_INTERVAL) ────┐
│ │ │ │
│ ServerToWorkerMessage::Ping │ │
│ { "type": "ping", │ │
│ "timestamp_unix_ms": ... } │ │
│◄──────────────────────────────│ │
│ │ │
│ WorkerToServerMessage::Pong │ │
│ { "type": "pong", │ │
│ "current_load": 1, │ │
│ "timestamp_unix_ms": ... } │ │
│──────────────────────────────►│ Proxy updates load │
│ └─────────────────────────────────────────────┘
If the worker misses heartbeats, the proxy closes the WebSocket with
reason "worker heartbeat timed out" and requeues any in-flight requests
(up to MAX_REQUEUE_COUNT = 3 retries).
2. Normal Non-Streaming Request
Client Proxy Server Worker Backend
│ │ │ │
│ POST /v1/chat/completions │ │
│ {"model":"llama3-8b", │ │ │
│ "stream": false, │ │ │
│ "messages":[...]} │ │ │
│───────────────────────►│ │ │
│ │ │ │
│ │ Queue lookup: │ │
│ │ provider="local", │ │
│ │ model="llama3-8b" │ │
│ │ → worker "gpu-box-1" │ │
│ │ has capacity │ │
│ │ │ │
│ │ ServerToWorkerMessage::Request │
│ │ { "type": "request", │ │
│ │ "request_id": "r-001",│ │
│ │ "model": "llama3-8b", │ │
│ │ "endpoint_path": │ │
│ │ "/v1/chat/completions", │
│ │ "is_streaming": false, │ │
│ │ "body": "{...}", │ │
│ │ "headers": {...} } │ │
│ │────────────────────────►│ │
│ │ │ │
│ │ │ POST /v1/chat/completions
│ │ │───────────────────►│
│ │ │ │
│ │ │◄───────────────────│
│ │ │ 200 OK + JSON body│
│ │ │ │
│ │ WorkerToServerMessage::ResponseComplete │
│ │ { "type": "response_complete", │
│ │ "request_id": "r-001",│ │
│ │ "status_code": 200, │ │
│ │ "headers": {"content-type": │
│ │ "application/json"},│ │
│ │ "body": "{...}", │ │
│ │ "token_counts": { │ │
│ │ "prompt_tokens": 42,│ │
│ │ "completion_tokens": 128, │
│ │ "total_tokens": 170 │ │
│ │ } │ │
│ │ } │ │
│ │◄────────────────────────│ │
│ │ │ │
│◄───────────────────────│ 200 OK │ │
│ {"choices":[...]} │ (body forwarded) │ │
3. Streaming Request (SSE)
Client Proxy Server Worker Backend
│ │ │ │
│ POST /v1/chat/completions │ │
│ {"stream": true, ...} │ │ │
│───────────────────────►│ │ │
│ │ │ │
│ │ Request dispatched │ │
│ │ (is_streaming: true) │ │
│ │────────────────────────►│ │
│ │ │ POST (stream=true)│
│ │ │───────────────────►│
│ │ │ │
│ │ │ SSE: data: {...} │
│ │ │◄───────────────────│
│ │ WorkerToServerMessage::ResponseChunk │
│ │ { "type": "response_chunk", │
│ │ "request_id": "r-002",│ │
│ │ "chunk": "data: {\"choices\":[...]}\n\n" │
│ │ } │ │
│ │◄────────────────────────│ │
│ SSE: data: {...} │ │ │
│◄───────────────────────│ │ │
│ │ │ │
│ ...more chunks... │ ...more ResponseChunk..│ ...more SSE... │
│ │ │ │
│ │ │ SSE: data: [DONE] │
│ │ │◄───────────────────│
│ │ ResponseChunk (final) │ │
│ │◄────────────────────────│ │
│ SSE: data: [DONE] │ │ │
│◄───────────────────────│ │ │
│ │ │ │
│ │ WorkerToServerMessage::ResponseComplete │
│ │ { "type": "response_complete", │
│ │ "request_id": "r-002",│ │
│ │ "status_code": 200, │ │
│ │ "token_counts": {...} │ │
│ │ } │ │
│ │◄────────────────────────│ │
Chunks are forwarded byte-for-byte without re-parsing. The proxy writes
each ResponseChunk.chunk directly to the HTTP response body, preserving
SSE framing intact.
4. Client Cancellation Propagation
Client Proxy Server Worker Backend
│ │ │ │
│ POST /v1/chat/completions (streaming) │ │
│───────────────────────►│ │ │
│ │ → dispatched to worker │ │
│ │────────────────────────►│ │
│ │ │───────────────────►│
│ (receiving chunks...) │ │ │
│◄───────────────────────│ │ │
│ │ │ │
│ CLIENT DISCONNECTS │ │ │
│──── TCP RST / close ──►│ │ │
│ │ │ │
│ │ Proxy detects drop │ │
│ │ │ │
│ │ ServerToWorkerMessage::Cancel │
│ │ { "type": "cancel", │ │
│ │ "request_id": "r-002",│ │
│ │ "reason": │ │
│ │ "client_disconnect" │ │
│ │ } │ │
│ │────────────────────────►│ │
│ │ │ │
│ │ │ Worker aborts │
│ │ │ backend request │
│ │ │───── abort ────────►│
│ │ │ │
Cancel reasons (from CancelReason enum):
- client_disconnect — HTTP client dropped the connection
- timeout — request exceeded REQUEST_TIMEOUT_SECS
- graceful_shutdown — server is shutting down
- worker_disconnect — worker WebSocket closed unexpectedly
- requeue_exhausted — max requeue attempts (MAX_REQUEUE_COUNT = 3) exceeded
- server_shutdown — server process is terminating
5. Worker Disconnect Mid-Request (Requeue Path)
Client Proxy Server Worker Backend
│ │ │ │
│ POST /v1/chat/completions │ │
│───────────────────────►│ │ │
│ │ → dispatched to worker │ │
│ │────────────────────────►│ │
│ │ │ │
│ │ WORKER CRASHES │ │
│ │ (WebSocket closes) │ │
│ │◄─── close frame / EOF ──│ │
│ │ × │
│ │ │ │
│ │ requeue_count < MAX_REQUEUE_COUNT (3)? │
│ │ YES → put request back in queue │
│ │ │ │
│ │ ...time passes, another worker available... │
│ │ │ │
│ │ Worker-2 │ │
│ │ ServerToWorkerMessage::Request │
│ │────────────────────────►│ Worker-2 │
│ │ │───────────────────►│
│ │ │◄───────────────────│
│ │ ResponseComplete │ │
│ │◄────────────────────────│ │
│◄───────────────────────│ 200 OK │ │
│ │ │ │
If requeue_count >= MAX_REQUEUE_COUNT (3):
│ │ │
│◄───────────────────────│ 503 Service Unavailable │
│ {"error": "requeue │ Cancel with reason: │
│ attempts exhausted"} │ "requeue_exhausted" │
6. Queue-Full Error
Client Proxy Server
│ │
│ POST /v1/chat/completions
│───────────────────────►│
│ │
│ │ Queue length >= max_queue_len
│ │ (configured via MAX_QUEUE_LEN,
│ │ default: 100)
│ │
│◄───────────────────────│ 429 Too Many Requests
│ {"error": │
│ "queue full"} │
7. No Workers Available
Client Proxy Server
│ │
│ POST /v1/chat/completions
│ {"model": "llama3-8b"}│
│───────────────────────►│
│ │
│ │ No provider registered
│ │ for model "llama3-8b",
│ │ or no workers connected
│ │
│ │ If a provider exists but
│ │ no workers: request is queued
│ │ (will timeout after
│ │ QUEUE_TIMEOUT_SECS = 30)
│ │
│ │ If no provider matches at all:
│◄───────────────────────│ 404 Not Found
│ {"error": │
│ "no provider for │
│ model llama3-8b"} │
8. Graceful Shutdown / Worker Drain
Proxy Server Worker
│ │
│ (admin triggers drain │
│ or server shutting down)│
│ │
│ ServerToWorkerMessage::GracefulShutdown
│ { "type": "graceful_shutdown",
│ "reason": "maintenance",
│ "drain_timeout_secs": 30
│ }
│──────────────────────────►│
│ │
│ Worker marked is_draining│
│ No new requests sent │
│ │
│ Worker finishes in-flight│
│ requests normally... │
│ │
│ ResponseComplete(s) │
│◄──────────────────────────│
│ │
│ disconnect_drained_worker_if_idle():
│ all in-flight done? │
│ YES → close WebSocket │
│──── close frame ─────────►│
│ ×
9. Dynamic Model Update
Worker Proxy Server
│ │
│ (new model loaded or │
│ model removed locally) │
│ │
│ WorkerToServerMessage::ModelsUpdate
│ { "type": "models_update",
│ "models": ["llama3-8b",
│ "codellama-13b"],
│ "current_load": 1
│ }
│──────────────────────────►│
│ │ Proxy updates worker's
│ │ model list and routing
│ │
│ (or server requests it) │
│ │
│ ServerToWorkerMessage::ModelsRefresh
│ { "type": "models_refresh",
│ "reason": "periodic"
│ }
│◄──────────────────────────│
│ │
│ ModelsUpdate (response) │
│──────────────────────────►│
10. Queue Timeout
Client Proxy Server
│ │
│ POST /v1/chat/completions
│ {"model": "llama3-8b"}│
│───────────────────────►│
│ │
│ │ Provider exists but all
│ │ workers are busy (at
│ │ max_concurrent).
│ │ Request enters queue.
│ │
│ │ ┌── QUEUE_TIMEOUT_SECS (30) ──┐
│ │ │ waiting for a worker to │
│ │ │ become available... │
│ │ │ │
│ │ │ no worker picks up │
│ │ └──────────── timeout fires ───┘
│ │
│ │ Cancel with reason: "timeout"
│ │
│◄───────────────────────│ 504 Gateway Timeout
│ {"error": │
│ "queue timeout: │
│ no worker available │
│ within deadline"} │
The request never reaches a worker. The proxy removes it from the queue
and returns 504 to the client. No Cancel message is sent over WebSocket
because no worker was ever assigned.
11. Request Timeout (In-Flight)
Client Proxy Server Worker Backend
│ │ │ │
│ POST /v1/chat/completions │ │
│───────────────────────►│ │ │
│ │ → dispatched to worker │ │
│ │────────────────────────►│ │
│ │ │───────────────────►│
│ │ │ │
│ │ ┌── REQUEST_TIMEOUT_SECS (300) ──┐ │
│ │ │ waiting for ResponseComplete │ │
│ │ │ or streaming chunks... │ │
│ │ │ │ │
│ │ │ backend is still processing │ │
│ │ └──────────── timeout fires ──────┘ │
│ │ │ │
│ │ ServerToWorkerMessage::Cancel │
│ │ { "type": "cancel", │ │
│ │ "request_id": "r-003",│ │
│ │ "reason": "timeout" │ │
│ │ } │ │
│ │────────────────────────►│ │
│ │ │ │
│ │ │ Worker aborts │
│ │ │ backend request │
│ │ │───── abort ────────►│
│ │ │ │
│◄───────────────────────│ 504 Gateway Timeout │ │
│ {"error": │ │ │
│ "request timeout"} │ │ │
Unlike queue timeout, the request was dispatched to a worker, so the proxy
sends a Cancel message with reason "timeout" over the WebSocket. The
worker receives the cancellation and aborts the in-flight backend request.
The proxy returns 504 to the client.
Message Type Summary
Server → Worker (ServerToWorkerMessage)
| Type | Struct | Purpose |
|---|---|---|
| register_ack | RegisterAck | Confirm registration, assign worker ID |
| request | RequestMessage | Dispatch an inference request |
| cancel | CancelMessage | Cancel an in-flight request |
| ping | PingMessage | Heartbeat probe |
| graceful_shutdown | GracefulShutdownMessage | Begin drain sequence |
| models_refresh | ModelsRefreshMessage | Ask worker to re-report models |
Worker → Server (WorkerToServerMessage)
| Type | Struct | Purpose |
|---|---|---|
| register | RegisterMessage | Announce name, models, capacity |
| models_update | ModelsUpdateMessage | Update model list / load |
| response_chunk | ResponseChunkMessage | Forward a streaming chunk |
| response_complete | ResponseCompleteMessage | Signal request completion |
| pong | PongMessage | Heartbeat reply with load |
| error | ErrorMessage | Report a request-level error |
Behavior Contract
This document captures the externally observable behavior contract for ModelRelay — the behaviors that must hold across versions and that users and contributors can rely on. The contract test suite in modelrelay-contract-tests is the automated expression of these requirements.
Core Contract
- **Worker auth and registration:** Workers connect to `/v1/worker/connect?provider=<name>` over WebSocket and authenticate with a provider-specific worker secret. The preferred transport is the `X-Worker-Secret` header; the query-string secret fallback exists only for backward compatibility. Secret comparison is constant-time. Unknown providers are rejected, disabled providers are rejected, and repeated failed auth attempts are rate-limited by client IP.
- **Capability advertisement:** After connecting, the worker sends a `register` message containing `worker_name`, `models`, `max_concurrent`, and `protocol_version`. The server may sanitize or truncate these values and must send `register_ack` with the accepted worker ID, accepted model list, and warnings. Legacy workers omitting `protocol_version` are tolerated in Katamari unless explicitly rejected by config; mismatched protocol versions are closed with a protocol error. The first Rust characterization harness makes that sanitization concrete by requiring the acked model list to trim whitespace, drop empty entries, de-duplicate exact duplicates while preserving first-seen order, and cap the accepted list at a provider-defined limit, with warnings surfaced in `register_ack`.
- **Model advertisement and worker selection:** Workers advertise exact model names, and the server routes only to workers that explicitly support the requested model. Katamari keeps an O(1) model-membership set per worker. Selection is "lowest load with round-robin tie breaking" among workers that support the model and can atomically reserve capacity.
- **Queueing when no worker is immediately available:** If no eligible worker can accept the request, the request is queued per virtual provider. The queue is bounded and FIFO among requests compatible with a worker's model list. Requests remain keyed by original queue time, so requeueing does not grant infinite timeout extensions.
- **Request dispatch over WebSocket:** Requests are forwarded to workers as `request` messages with `request_id`, `model`, the raw JSON body string, selected compatibility headers, the target endpoint path, and `is_streaming`. The central proxy accepts ordinary provider-style HTTP requests and delegates only the worker-backed providers through this path. Compatibility-critical request headers include OpenAI-style `authorization`, `content-type`, and `openai-organization`, plus Anthropic-style `x-api-key`, `anthropic-version`, `anthropic-beta`, and `content-type`; incidental transport headers like `user-agent` are not part of the worker envelope contract.
- **Non-streaming response pass-through:** Workers reply with `response_complete` containing the final HTTP status, response headers, full body, and token counts. The proxy must forward status, headers, and body faithfully, including upstream 4xx and 5xx responses, rather than collapsing them into generic proxy errors.
- **Streaming chunk ordering and termination semantics:** Streaming responses are forwarded as `response_chunk` messages containing already-formatted SSE data and finish with `response_complete`. Chunks must preserve order. The HTTP side must flush promptly, retain streaming semantics, and treat completion metadata as the source of final status and token accounting. Katamari enforces a cumulative streaming size ceiling and emits an SSE error before terminating an oversized stream.
- **Client cancellation propagation end to end:** Client disconnect or request timeout must cancel the HTTP request context, remove queued work if still queued, or send a best-effort `cancel` message for active worker requests. Late chunks that arrive after cancellation are intentionally dropped. The worker protocol has explicit cancel reasons, including client disconnect and timeout.
- **Worker disconnect during an active request:** On worker disconnect, active requests are examined one by one. If the request context is still alive, Katamari requeues it onto the provider queue without resetting its lifetime. If the request context is already canceled or timed out, the request fails immediately on the waiting client path instead. Requeueing is capped at `MaxRequeueCount = 3`; after that the request fails with a service-unavailable style error instead of looping forever.
- **Timeout behavior:** Every provider has a request timeout used both for queue wait and overall request lifetime. Queue timeout produces a worker-unavailable style response. Streaming and non-streaming requests share the parent HTTP context, so client disconnect and timeout terminate the same request object. WebSocket heartbeats use a ping every 15 seconds and a 45-second pong window.
- **Queue-full, no-workers, and provider-disabled error surfaces:** Katamari distinguishes bounded queue exhaustion, no worker capacity, disabled providers, deleted providers, timeout, and requeue exhaustion through dedicated error values. The public-facing HTTP layer currently sanitizes some internal errors into stable client messages such as "Service temporarily at capacity" and "Provider is currently disabled".
- **Heartbeat, load reporting, and stale-worker cleanup:** The server sends a JSON `ping`; workers reply with a JSON `pong` carrying current load. This heartbeat updates `last_heartbeat` and live load accounting. Workers may also send `models_update` when their local model catalog changes. Stale worker DB records are cleaned periodically, and failed-auth rate-limit entries expire automatically.
- **Graceful shutdown and drain semantics:** The server can send `graceful_shutdown` to tell workers to stop accepting new work, finish current requests, and disconnect before a timeout. Provider deletion drains queued requests with an explicit provider-deleted error and closes connected workers.
- **OpenAI-style and Anthropic-style compatibility:** The central server is meant to accept ordinary client traffic, not a custom client. Katamari parses model and stream flags from OpenAI-style request bodies, provides a special `/v1/models` compatibility endpoint, and preserves SSE behavior expected by OpenAI-compatible tooling. The extracted Rust project should also preserve Anthropic-style compatibility at the central HTTP boundary even if the internal worker protocol stays provider-neutral.
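The `register_ack` sanitization rules above (trim, drop empties, de-duplicate preserving first-seen order, cap with warnings) are small enough to sketch. This is an illustrative Python sketch; the function name and the cap value of 32 are assumptions — the real limit is provider-defined:

```python
def sanitize_models(models, max_models=32):
    """Sketch of the register_ack model-list sanitization contract:
    trim whitespace, drop empty entries, de-duplicate exact duplicates
    while preserving first-seen order, and cap at a provider-defined
    limit, surfacing warnings for anything dropped."""
    warnings = []
    seen, accepted = set(), []
    for raw in models:
        name = raw.strip()
        if not name:
            warnings.append("dropped empty model name")
            continue
        if name in seen:
            warnings.append(f"dropped duplicate model {name!r}")
            continue
        seen.add(name)
        accepted.append(name)
    if len(accepted) > max_models:
        warnings.append(f"model list truncated to {max_models} entries")
        accepted = accepted[:max_models]
    return accepted, warnings
```

A worker advertising `[" llama3-8b", "llama3-8b", "", "qwen2 "]` would be acked with `["llama3-8b", "qwen2"]` plus two warnings.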
Wire Messages To Preserve
- Server to worker: `ping`, `request`, `register_ack`, `cancel`, `graceful_shutdown`, `models_refresh`
- Worker to server: `pong`, `register`, `models_update`, `response_chunk`, `response_complete`, `error`
Invariants Worth Preserving
- A worker never silently gains capability beyond the sanitized models acknowledged by the server.
- Queueing is bounded per provider and does not grow without limit.
- Requeue is intentional and finite.
- HTTP error bodies from the worker backend are preserved where safe instead of flattened away.
- Streaming remains SSE-shaped end to end.
- Worker churn or late chunks must not leave requests hanging forever.
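The "requeue is intentional and finite" invariant, combined with the worker-disconnect contract, can be sketched as a single decision function. Illustrative Python with hypothetical names; the cap matches the contract's `MaxRequeueCount = 3`:

```python
MAX_REQUEUE_COUNT = 3  # matches the contract's MaxRequeueCount

def handle_worker_disconnect(req: dict) -> str:
    """Sketch of the disconnect contract: already-canceled requests fail
    immediately; live requests are requeued without resetting their
    lifetime, but only a finite number of times."""
    if req.get("canceled"):
        return "fail"            # client already gone or timed out
    if req["requeue_count"] >= MAX_REQUEUE_COUNT:
        return "unavailable"     # service-unavailable style error
    req["requeue_count"] += 1    # note: original queue deadline is kept
    return "requeue"             # back onto the provider queue
```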
Extension Points
When adding new behaviors, add a contract test in modelrelay-contract-tests before implementing. This keeps the test suite as the primary specification. If a behavior described above is not yet covered by an automated test, that gap is the highest-priority work item.
Operational Runbook
This guide covers day-to-day operations for running ModelRelay in
production. It assumes you have one modelrelay-server instance and one or
more modelrelay-worker processes.
Health Checks
Proxy Server
The proxy server exposes a dedicated /health endpoint:
# Primary health check — returns JSON with version, worker count, queue depth, and uptime.
curl -sf http://proxy:8080/health | jq .
Example response:
{
"status": "ok",
"version": "0.1.6",
"workers_connected": 2,
"queue_depth": 0,
"uptime_secs": 3621.5
}
Use /health for liveness probes, Kubernetes readiness checks, and
monitoring. A workers_connected of 0 means the proxy is running but
no workers are registered.
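A readiness probe can combine those two signals: the proxy must be up *and* have at least one worker, since a worker-less proxy can't serve any request. A hypothetical Python sketch (field names match the `/health` response above):

```python
import json

def is_ready(health_body: str) -> bool:
    """Ready = proxy reports ok AND at least one worker is registered.
    'status: ok' with zero workers should fail readiness checks."""
    h = json.loads(health_body)
    return h.get("status") == "ok" and h.get("workers_connected", 0) > 0
```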
You can also list routable models directly:
curl -s http://proxy:8080/v1/models | jq '.data[].id'
Worker Daemon
The worker daemon does not expose its own HTTP port — it connects outward to the proxy. Health is observable from the proxy side:
# Check if workers are registered by listing models.
curl -s http://proxy:8080/v1/models | jq '.data[].id'
If expected models are missing, the worker is either down or failed to register. Check worker logs for connection errors or authentication failures.
Admin API & Monitoring
ModelRelay includes admin endpoints for inspecting workers, request metrics,
and managing client API keys. All /admin/* endpoints require a Bearer
token.
Enabling Admin Endpoints
Set MODELRELAY_ADMIN_TOKEN when starting the server:
modelrelay-server --worker-secret mysecret --admin-token my-admin-secret
Without this token, all /admin/* endpoints return 403 Forbidden.
Querying Admin Endpoints
TOKEN="my-admin-secret"
# List connected workers (models, load, capabilities)
curl -s -H "Authorization: Bearer $TOKEN" http://proxy:8080/admin/workers | jq .
# Request stats and queue depth
curl -s -H "Authorization: Bearer $TOKEN" http://proxy:8080/admin/stats | jq .
# List client API keys (metadata only, no secrets)
curl -s -H "Authorization: Bearer $TOKEN" http://proxy:8080/admin/keys | jq .
Managing Client API Keys
When MODELRELAY_REQUIRE_API_KEYS=true, clients must send a valid API key
as a Bearer token on inference requests.
TOKEN="my-admin-secret"
# Create a new API key (the secret is returned only at creation time)
curl -s -X POST \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"name": "production-app"}' \
http://proxy:8080/admin/keys | jq .
# Revoke a key by ID
curl -s -X DELETE \
-H "Authorization: Bearer $TOKEN" \
http://proxy:8080/admin/keys/{key-id}
Clients use the returned secret as a Bearer token:
curl -H "Authorization: Bearer mr-..." \
-H "Content-Type: application/json" \
-d '{"model": "llama3.2:3b", "messages": [{"role": "user", "content": "Hi"}]}' \
http://proxy:8080/v1/chat/completions
Web Dashboard & Setup Wizard
The proxy serves a built-in web UI:
- Dashboard — visit `http://proxy:8080/dashboard` for real-time worker status, request metrics, and queue depth.
- Setup Wizard — visit `http://proxy:8080/setup` for a step-by-step guide to connecting a new worker (platform detection, backend setup, binary download, and live connection verification).
The wizard is always accessible, not just on first run — use it whenever you add another GPU box.
Troubleshooting Admin Features
Admin endpoints return 403:
MODELRELAY_ADMIN_TOKEN is not set on the server, or the Authorization
header doesn't match. Verify the token value and ensure the header format
is Authorization: Bearer <token>.
Client requests return 401 when API key auth is enabled:
The client is not sending a Bearer token, or the key has been revoked.
Create a new key via POST /admin/keys and ensure the client sends
Authorization: Bearer <key>.
API key auth not taking effect:
MODELRELAY_REQUIRE_API_KEYS must be set to true. When false
(the default), inference endpoints accept unauthenticated requests.
Checking Worker Registration
After starting a new worker, confirm it registered:
# Should include the worker's advertised models.
curl -s http://proxy:8080/v1/models | jq .
If a worker's models don't appear within ~10 seconds:
- Check the worker secret — does `WORKER_SECRET` on the worker match the proxy?
- Check connectivity — can the worker reach `PROXY_URL`? `curl -v http://proxy:8080/v1/worker/connect` should get a 400 or upgrade-required response, not a connection error.
- Check worker logs — look for `register`/`register_ack` messages or error lines.
Draining a Worker Gracefully
To remove a worker from rotation without dropping in-flight requests:
1. Send SIGTERM to the modelrelay-worker process. The daemon initiates a graceful disconnect — the proxy sends a `GracefulShutdown` message and stops routing new requests to that worker.
2. In-flight requests finish normally. The proxy waits up to `drain_timeout_secs` (from the shutdown message) for active requests to complete.
3. Once idle, the WebSocket closes and the worker process exits.
# Graceful stop via systemd
systemctl stop modelrelay-worker@gpu-box-1
# Or with Docker
docker stop --time 60 worker-gpu-box-1
Monitoring drain progress: Watch the proxy logs for
"worker drained" or similar messages. If the worker still has
in-flight requests, you'll see ongoing ResponseChunk / ResponseComplete
messages until they finish.
Scaling Workers
Adding a worker
Start a new modelrelay-worker instance pointing at the same proxy:
PROXY_URL=http://proxy:8080 \
WORKER_SECRET=your-secret \
WORKER_NAME=gpu-box-4 \
BACKEND_URL=http://localhost:8000 \
modelrelay-worker --models llama3-8b
The proxy discovers it within seconds via the WebSocket registration handshake. No proxy restart or config change needed.
Removing a worker
Use the graceful drain procedure above. The proxy automatically routes around disconnected workers.
Scaling the proxy
The proxy is a single-process server. To scale:
- Vertical: increase `MAX_QUEUE_LEN` and system file descriptor limits.
- Horizontal: run multiple proxy instances behind a load balancer, but note that each worker connects to one proxy. Workers must be distributed across proxy instances manually or via DNS round-robin.
Log Interpretation
Proxy Server
| Log pattern | Meaning |
|---|---|
| `worker registered` / `register_ack` | Worker connected and authenticated |
| `request dispatched` | Request sent to a worker |
| `response complete` | Worker returned a result |
| `worker heartbeat timed out` | Worker missed pings — WebSocket closed |
| `request requeued` | Worker died mid-request, retrying on another worker |
| `requeue exhausted` | Request failed after `MAX_REQUEUE_COUNT` (3) retries |
| `queue full` | Rejected request — queue at `MAX_QUEUE_LEN` capacity |
| `queue timeout` | Request sat in queue longer than `QUEUE_TIMEOUT_SECS` |
| `graceful shutdown` | Worker drain initiated |
Worker Daemon
| Log pattern | Meaning |
|---|---|
| `connected to proxy` | WebSocket connection established |
| `registered` | Registration acknowledged by proxy |
| `forwarding request` | Proxying a request to the local backend |
| `backend error` | Local backend returned an error or is unreachable |
| `cancelled` | Proxy sent a cancel for an in-flight request |
| `graceful shutdown` | Drain in progress, finishing active requests |
Adjusting log verbosity
Set LOG_LEVEL environment variable on either component:
LOG_LEVEL=debug modelrelay-server # trace, debug, info (default), warn, error
LOG_LEVEL=debug modelrelay-worker
Common Failure Modes
Worker can't connect to proxy
Symptoms: Worker logs show connection refused or timeouts.
Checklist:
- Is the proxy running? `curl http://proxy:8080/v1/models`
- Is `PROXY_URL` correct? The worker connects to `{PROXY_URL}/v1/worker/connect` via WebSocket.
- Firewall / network: the worker makes an outbound connection to the proxy — no inbound ports needed on the worker machine.
- If using TLS (nginx/reverse proxy in front), ensure WebSocket upgrade headers are forwarded. See the TLS Setup guide.
Worker registers but requests fail
Symptoms: /v1/models shows the model, but requests return 502 or
timeout.
Checklist:
- Is the local backend running? `curl http://localhost:8000/v1/models` (or whatever `BACKEND_URL` is set to)
- Does the backend support the requested endpoint? (`/v1/chat/completions`, `/v1/messages`, `/v1/responses`)
- Check worker logs for `backend error` messages.
- Try a direct request to the backend to isolate the issue.
Requests queue but never complete
Symptoms: Clients hang, then get a timeout error after
QUEUE_TIMEOUT_SECS.
Causes:
- No workers are connected (check `/v1/models`)
- Workers are at capacity (`max_concurrent` reached on all workers)
- Workers are connected but not advertising the requested model
Fix: Add more workers, increase max_concurrent if the hardware
allows, or reduce QUEUE_TIMEOUT_SECS to fail faster.
Streaming responses arrive corrupted
Symptoms: SSE chunks arrive garbled or out of order.
Checklist:
- Ensure no intermediate proxy is buffering. Disable response buffering in nginx: `proxy_buffering off;`
- If using a CDN or reverse proxy, ensure it supports chunked transfer encoding and doesn't aggregate small writes.
High memory usage on the proxy
Symptoms: Proxy RSS grows over time.
Causes:
- Large queue of pending requests (each holds the full request body)
- Many concurrent streaming responses with large chunk buffers
Fix: Lower MAX_QUEUE_LEN, set QUEUE_TIMEOUT_SECS to a shorter
value, or add workers to drain the queue faster.
Worker keeps reconnecting
Symptoms: Worker logs show repeated connect/disconnect cycles.
Causes:
- Heartbeat timeout — the worker or network is too slow to respond to pings within `HEARTBEAT_INTERVAL`
- `WORKER_SECRET` mismatch — worker connects, fails auth, gets disconnected, retries
Fix: Check secrets match, check network latency between worker and proxy.
Configuration Quick Reference
Proxy Server
| Env Var | Default | Description |
|---|---|---|
| `LISTEN_ADDR` | `127.0.0.1:8080` | HTTP listen address |
| `PROVIDER_NAME` | `local` | Provider name for routing |
| `WORKER_SECRET` | (required) | Shared secret for worker auth |
| `MAX_QUEUE_LEN` | `100` | Max queued requests before rejecting |
| `QUEUE_TIMEOUT_SECS` | `30` | How long a request can wait in queue |
| `REQUEST_TIMEOUT_SECS` | `300` | Total request timeout (5 min) |
| `LOG_LEVEL` | `info` | Log verbosity |
| `MODELRELAY_ADMIN_TOKEN` | (none) | Bearer token for `/admin/*` endpoints (if unset, admin returns 403) |
| `MODELRELAY_REQUIRE_API_KEYS` | `false` | When `true`, client requests require a valid API key |
Worker Daemon
| Env Var | Default | Description |
|---|---|---|
| `PROXY_URL` | `http://127.0.0.1:8080` | Proxy server URL |
| `WORKER_SECRET` | (required) | Must match proxy's secret |
| `WORKER_NAME` | `worker` | Human-readable worker name |
| `BACKEND_URL` | `http://127.0.0.1:8000` | Local model server URL |
| `LOG_LEVEL` | `info` | Log verbosity |
Windows Service
Checking Service Status
Get-Service ModelRelayServer
Get-Service ModelRelayWorker
Starting and Stopping
Start-Service ModelRelayServer
Stop-Service ModelRelayServer
Start-Service ModelRelayWorker
Stop-Service ModelRelayWorker
Stop-Service sends a stop control signal and waits for the process to
exit. ModelRelay handles this as a graceful shutdown — in-flight
requests finish before the process terminates. To initiate the stop
without blocking and bound the wait yourself:
# Request the stop, then poll until the service reports Stopped
Stop-Service ModelRelayServer -NoWait
Start-Sleep -Seconds 60
(Get-Service ModelRelayServer).WaitForStatus("Stopped", "00:00:05")
Logs
Windows Services don't write to stdout by default. Two options:
- Windows Event Log — ModelRelay writes to the Application log. View with:
  `Get-EventLog -LogName Application -Source ModelRelayServer -Newest 50`
- File logging via `RUST_LOG` — set `RUST_LOG` as a system environment variable and redirect output to a file by wrapping the binary in a small script, or use the `RUST_LOG_FILE` convention if supported. The simplest approach:
  `[Environment]::SetEnvironmentVariable("RUST_LOG", "info", "Machine")`
Draining a Worker
To drain a worker gracefully before maintenance:
# Stop the service — this triggers graceful shutdown.
Stop-Service ModelRelayWorker
# Verify it has stopped.
Get-Service ModelRelayWorker
The worker completes in-flight requests before exiting, identical to the
systemctl stop behavior on Linux.
Monitoring Checklist
For production deployments, monitor these signals:
- Proxy process is up — HTTP health check on `/health`
- At least one worker registered — `/health` returns `workers_connected > 0`
- Queue depth — `/health` returns `queue_depth`; watch for sustained growth
- Request latency — track time from client request to first byte
- Worker reconnect rate — frequent reconnects indicate network or auth issues
- Error rates — 4xx (client errors) vs 5xx (backend/proxy errors)
- Backend health — each worker's local model server should be independently monitored
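"Watch for sustained growth" in queue depth can be made concrete with a trivial window check over periodic `/health` samples. This is a hypothetical alerting sketch, not a ModelRelay feature; the window length is an arbitrary assumption:

```python
def sustained_queue_growth(samples, min_window=5):
    """Flag when queue_depth grows strictly monotonically across the
    whole sampling window — a sign workers can't keep up with demand.
    Short windows are never flagged (not enough evidence)."""
    if len(samples) < min_window:
        return False
    return all(b > a for a, b in zip(samples, samples[1:]))
```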
TLS Setup
ModelRelay's proxy server listens on plain HTTP by default. For production deployments you should terminate TLS in front of it so that:
- Clients reach the API over HTTPS (`https://your-domain/v1/...`)
- Workers connect over secure WebSockets (`wss://your-domain/v1/worker/connect`)
Without TLS the worker secret and all inference traffic travel in the clear. This matters especially when workers connect over the public internet rather than a private network.
Option 1: nginx (recommended)
The repository includes a ready-to-use nginx config at
`examples/tls-nginx.conf`. Copy it into your nginx sites directory and
customise the domain and certificate paths.
How it works
The config defines two server blocks:
- Port 80 redirects all HTTP traffic to HTTPS.
- Port 443 terminates TLS and proxies to `127.0.0.1:8080` (the default `LISTEN_ADDR`).
Two location blocks handle the different traffic types:
- `/v1/worker/connect` — the WebSocket endpoint. The key directives are:

  proxy_http_version 1.1;
  proxy_set_header Upgrade $http_upgrade;
  proxy_set_header Connection "upgrade";
  proxy_read_timeout 86400s;   # keep the long-lived WS open
  proxy_send_timeout 86400s;

  Without the `Upgrade`/`Connection` headers, nginx will not complete the WebSocket handshake and workers will fail to connect.

- `/v1/` — the inference API. Buffering is disabled so that SSE streaming responses pass through without delay:

  proxy_buffering off;
  proxy_cache off;
  proxy_read_timeout 300s;     # match REQUEST_TIMEOUT_SECS
Quick start
# 1. Obtain a certificate (Let's Encrypt example)
sudo certbot certonly --nginx -d your-domain.example.com
# 2. Install the config
sudo cp examples/tls-nginx.conf /etc/nginx/sites-available/modelrelay.conf
sudo ln -s /etc/nginx/sites-available/modelrelay.conf /etc/nginx/sites-enabled/
# 3. Edit the config: replace your-domain.example.com everywhere
sudo nano /etc/nginx/sites-available/modelrelay.conf
# 4. Test and reload
sudo nginx -t && sudo systemctl reload nginx
Certificate renewal
Let's Encrypt certificates expire after 90 days. Certbot usually installs a systemd timer or cron job that renews automatically. Verify:
sudo certbot renew --dry-run
Option 2: Caddy
Caddy automatically provisions and renews TLS certificates from Let's Encrypt with zero configuration. If you don't need nginx's flexibility, this is the simplest path.
Caddyfile
your-domain.example.com {
reverse_proxy 127.0.0.1:8080
}
That's it. Caddy handles:
- HTTPS redirect from port 80
- Automatic Let's Encrypt certificate issuance and renewal
- WebSocket upgrade pass-through (no special config needed)
- Unbuffered streaming (the default for
reverse_proxy)
Running
# Install (Debian/Ubuntu)
sudo apt install -y caddy
# Write the Caddyfile
cat > /etc/caddy/Caddyfile <<'EOF'
your-domain.example.com {
reverse_proxy 127.0.0.1:8080
}
EOF
# Start
sudo systemctl enable --now caddy
Note: Caddy must be able to bind ports 80 and 443, and the domain must resolve to the server's public IP for the ACME challenge to succeed.
Option 3: Manual certificates (certbot standalone)
If you're running neither nginx nor Caddy you can still use Let's Encrypt with certbot's standalone mode, then point any reverse proxy at the resulting certificate files:
sudo certbot certonly --standalone -d your-domain.example.com
Certificates land in /etc/letsencrypt/live/your-domain.example.com/.
Use fullchain.pem and privkey.pem with whatever TLS terminator you
prefer (HAProxy, Traefik, etc.).
Configuring workers for TLS
Once TLS is in place, update the worker's PROXY_URL to use the secure
scheme:
| Scenario | PROXY_URL |
|---|---|
| No TLS (local / private network) | http://proxy:8080 |
| TLS via reverse proxy | https://your-domain.example.com |
The worker uses PROXY_URL to derive the WebSocket connection URL.
When the scheme is https, the worker connects over wss://
automatically.
# Example: worker connecting over TLS
PROXY_URL=https://your-domain.example.com \
WORKER_SECRET=your-secret \
BACKEND_URL=http://localhost:8000 \
modelrelay-worker --models llama3-8b
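The scheme mapping described above (http → ws, https → wss, fixed connect path) is simple enough to sketch. Illustrative Python only — the function name is hypothetical and this is not the worker's actual code:

```python
from urllib.parse import urlsplit, urlunsplit

def derive_ws_url(proxy_url: str) -> str:
    """Sketch of PROXY_URL -> WebSocket URL derivation:
    http becomes ws, https becomes wss, and the path is
    fixed at /v1/worker/connect."""
    parts = urlsplit(proxy_url)
    scheme = {"http": "ws", "https": "wss"}[parts.scheme]
    return urlunsplit((scheme, parts.netloc, "/v1/worker/connect", "", ""))
```

So `https://your-domain.example.com` yields `wss://your-domain.example.com/v1/worker/connect` with no extra worker configuration.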
Tip: The local backend (`BACKEND_URL`) almost never needs TLS — it runs on the same machine as the worker. Keep it as plain `http://localhost:...`.
Troubleshooting
Workers can't connect after enabling TLS
- Verify the certificate is valid: `curl -v https://your-domain.example.com/v1/models`
- Confirm WebSocket upgrade works: `curl -v -H 'Upgrade: websocket' -H 'Connection: upgrade' https://your-domain.example.com/v1/worker/connect` (should get a 101 or 400, not a connection error)
- Check that `proxy_read_timeout` / `proxy_send_timeout` are long enough for the WebSocket (the nginx config uses 86400s)
Streaming responses arrive buffered
Ensure your reverse proxy has buffering disabled for the /v1/ path.
In nginx: proxy_buffering off;. Caddy disables buffering by default.
Certificate renewal fails
Certbot's HTTP-01 challenge needs port 80. If nginx or Caddy is
running, use the --nginx or --caddy certbot plugin instead of
--standalone to avoid port conflicts.