ModelRelay

Stop configuring clients for every GPU box. Workers connect out; requests route in.

You have GPU boxes running llama-server (or Ollama, or vLLM, or anything OpenAI-compatible). Today you either expose each one directly — port forwarding, DNS, firewall rules — or you stick a load balancer in front that doesn't understand LLM streaming or cancellation.

ModelRelay flips the model: a central proxy receives standard inference requests while worker daemons on your GPU boxes connect out to it over WebSocket. The proxy handles queueing, routing, streaming pass-through, and cancellation propagation. Clients see one stable endpoint and never need to know about your hardware.

  Clients (curl, Claude Code, LiteLLM, Open WebUI, ...)
         |
         |  POST /v1/chat/completions
         |  POST /v1/messages
         v
  +----------------------+
  |   modelrelay-server  |<--- workers connect out (WebSocket)
  |   (one stable        |     no inbound ports needed on GPU boxes
  |    endpoint)         |
  +----------------------+
         |  routes request to best available worker
         v
  +--------+  +--------+  +--------+
  |worker-1|  |worker-2|  |worker-3|
  | llama  |  | ollama |  | vllm   |  <- your GPU boxes,
  | server |  |        |  |        |    anywhere on any network
  +--------+  +--------+  +--------+

Hosted Version

Don't want to run the infrastructure yourself? A fully-managed hosted version is available at modelrelay.io — no server setup, no infrastructure to manage. Just get an API key, point your workers at it, and start routing requests. Same open protocol, zero ops burden.

Who is this for?

  • Home GPU users running local models who want a single API endpoint across multiple machines
  • Teams with on-prem hardware that need to pool GPU capacity without a service mesh
  • Researchers juggling models across heterogeneous boxes who are tired of updating client configs

Features

  • OpenAI + Anthropic compatible — POST /v1/chat/completions, POST /v1/responses, POST /v1/messages, GET /v1/models
  • No inbound ports on GPU boxes — workers connect out to the proxy over WebSocket
  • Request queueing — configurable depth and timeout when all workers are busy
  • Streaming pass-through — SSE chunks forwarded with preserved ordering and termination
  • End-to-end cancellation — client disconnect propagates through the proxy to the worker to the backend
  • Automatic requeue — if a worker dies mid-request, the request is requeued to another worker
  • Heartbeat and load tracking — stale workers are cleaned up; workers report current load
  • Graceful drain — workers can shut down while replacement workers pick up queued work
  • Cross-platform — pre-built binaries for Linux, macOS, and Windows (x86_64 + arm64)

Quick Start

The fastest way to get running is with Docker:

# 1. Run the proxy
docker run -p 8080:8080 \
  -e WORKER_SECRET=mysecret \
  -e LISTEN_ADDR=0.0.0.0:8080 \
  ghcr.io/ericflo/modelrelay/modelrelay-server:latest

# 2. Run a worker (on a GPU box with llama-server or similar)
docker run \
  -e PROXY_URL=http://<proxy-host>:8080 \
  -e WORKER_SECRET=mysecret \
  -e BACKEND_URL=http://host.docker.internal:8000 \
  -e MODELS=llama3.2:3b \
  ghcr.io/ericflo/modelrelay/modelrelay-worker:latest

# 3. Send a request
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

For more installation options (pre-built binaries, Docker Compose, building from source, systemd, Windows services), see the GitHub README.

Source & Contributing

ModelRelay is MIT-licensed and developed at github.com/ericflo/modelrelay. Bug reports, feature requests, and PRs are welcome — see CONTRIBUTING.md for details.

Architecture

This document describes the internal architecture of ModelRelay: how the components fit together, how data flows through the system, and why key design decisions were made. It is intended for contributors, operators, and anyone evaluating ModelRelay for their own infrastructure.

Workspace Shape

  • crates/modelrelay-contract-tests — Black-box behavior tests and focused harnesses for registration, queueing, response streaming, cancellation, requeue, heartbeat, and graceful shutdown semantics.

  • crates/modelrelay-protocol — Shared Rust protocol types for the WebSocket bridge: registration, dispatch, streaming chunks, cancellation, heartbeats, and operational control messages.

  • crates/modelrelay-server — Central HTTP proxy. Owns the client-facing OpenAI and Anthropic compatibility layers, worker auth, provider config, worker registry, queueing, routing, cancellation, and graceful drain.

  • crates/modelrelay-worker — Remote worker process. Authenticates to the server, advertises models and capacity, forwards requests to a local backend such as llama-server, streams chunks back, refreshes advertised models, reports live load in heartbeats, and honors cancellation plus graceful shutdown.

Component Overview

                    ┌─────────────────────────────────────────────────────────┐
                    │                  modelrelay-server                      │
                    │                                                         │
  HTTP clients      │  ┌───────────┐    ┌──────────────┐    ┌─────────────┐  │  WebSocket
  ─────────────────►│  │ HTTP      │───►│ Queue        │───►│ Dispatcher  │  │◄──────────
  /v1/chat/         │  │ Router    │    │ Manager      │    │             │  │  workers
  completions,      │  │           │    │ (per-provider│    │ (load-aware │  │  connect in
  /v1/messages,     │  │ (axum     │    │  FIFO)       │    │  round-     │  │
  /v1/responses     │  │  routes)  │    │              │    │  robin)     │  │
                    │  └───────────┘    └──────────────┘    └──────┬──────┘  │
                    │        │                                     │         │
                    │        │          ┌──────────────┐           │         │
                    │        │          │ Worker       │◄──────────┘         │
                    │        │          │ Registry     │                     │
                    │        │          │ (auth, model │                     │
                    │        │          │  tracking,   │                     │
                    │        │          │  load, drain)│                     │
                    │        │          └──────────────┘                     │
                    │        │                                               │
                    │        ▼                                               │
                    │  ┌───────────┐    ┌──────────────┐                    │
                    │  │ Cancel    │    │ WebSocket    │                    │
                    │  │ Guard     │    │ Hub          │                    │
                    │  │ (RAII     │    │ (per-worker  │                    │
                    │  │  drop)    │    │  message     │                    │
                    │  │           │    │  routing)    │                    │
                    │  └───────────┘    └──────────────┘                    │
                    └─────────────────────────────────────────────────────────┘

HTTP Router (http.rs) — axum-based handler for four client-facing routes plus the worker WebSocket upgrade endpoint. Parses the model name and streaming flag from the request body, submits to the core, and bridges the response back as either a single body or an SSE stream.

Queue Manager (lib.rs) — per-provider FIFO queue with configurable max length and timeout. Requests land here when no worker with capacity is immediately available. The queue is drained oldest-first whenever a worker finishes a request or a new worker registers.

Dispatcher (lib.rs) — selects the best worker for a request. Filters by provider, model support, capacity, and drain state, then picks the lowest-load worker with round-robin tie-breaking via per-provider cursors.
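A sketch of this selection policy in plain Rust (illustrative names and fields; the real dispatcher in lib.rs differs in detail):

```rust
// Hypothetical worker record; fields mirror what the registry tracks.
struct WorkerInfo {
    id: String,
    models: Vec<String>,
    load: usize,
    max_concurrent: usize,
    draining: bool,
}

/// Pick the lowest-load eligible worker, breaking ties by scanning
/// from a per-provider cursor so equal-load workers rotate fairly.
fn pick_worker(workers: &[WorkerInfo], model: &str, cursor: usize) -> Option<usize> {
    // Filter by model support, spare capacity, and drain state.
    let eligible: Vec<usize> = workers
        .iter()
        .enumerate()
        .filter(|(_, w)| {
            !w.draining
                && w.load < w.max_concurrent
                && w.models.iter().any(|m| m == model)
        })
        .map(|(i, _)| i)
        .collect();
    let min_load = eligible.iter().map(|&i| workers[i].load).min()?;
    // Round-robin tie-break: first min-load index starting at `cursor`.
    let n = workers.len();
    (0..n)
        .map(|k| (cursor + k) % n)
        .find(|i| eligible.contains(i) && workers[*i].load == min_load)
}
```

The cursor advances past each selection, so two idle workers with the same load alternate rather than the first always winning.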

Worker Registry (lib.rs) — tracks every connected worker's identity, supported models, max concurrency, reported load, in-flight request set, and drain state. Updated by registration, heartbeat pongs, model refreshes, and disconnect events.

WebSocket Hub (worker_socket.rs) — manages the authenticated WebSocket connection for each worker. Routes server-to-worker messages (request dispatch, cancel signals, pings, graceful shutdown, model refresh) and worker-to-server messages (response chunks, completions, pongs, model updates, errors).

Cancel Guard (http.rs) — an RAII HttpRequestCancellationGuard that fires if the HTTP response future is dropped (client disconnect or timeout). On drop, it broadcasts a cancel signal through the core to the assigned worker.

Worker Daemon Internals

  ┌─────────────────────────────────────────────────────────┐
  │                  modelrelay-worker                       │
  │                                                          │
  │  ┌──────────────┐          ┌─────────────────────────┐  │
  │  │ Connection   │          │ Request Tasks           │  │
  │  │ Manager      │          │                         │  │
  │  │              │  spawn   │  ┌───────┐ ┌───────┐    │  │
  │  │ • connect    │─────────►│  │ Req 1 │ │ Req 2 │... │  │
  │  │ • register   │          │  │       │ │       │    │  │
  │  │ • reconnect  │◄─────────│  │ POST  │ │ POST  │    │  │
  │  │   (exp.      │  events  │  │ to    │ │ to    │    │  │
  │  │   backoff)   │          │  │ local │ │ local │    │  │
  │  └──────┬───────┘          │  └───┬───┘ └───┬───┘    │  │
  │         │                  └──────┼─────────┼────────┘  │
  │         │                         │         │           │
  │         ▼                         ▼         ▼           │
  │  ┌──────────────┐          ┌─────────────────────┐      │
  │  │ Socket Loop  │          │ Local Backend       │      │
  │  │ (select!)    │          │ (llama-server,      │      │
  │  │              │          │  Ollama, vLLM, etc) │      │
  │  │ • read msgs  │          └─────────────────────┘      │
  │  │ • send msgs  │                                       │
  │  │ • heartbeat  │                                       │
  │  └──────────────┘                                       │
  └─────────────────────────────────────────────────────────┘

The worker daemon runs a single select! loop that multiplexes:

  1. Inbound WebSocket messages — dispatched to handle_server_message() which routes each message type: spawns a task for Request, responds to Ping, applies Cancel to active tasks, triggers model refresh, or begins graceful drain.

  2. Outbound events from request tasks — each spawned request task communicates back through an mpsc channel. ResponseChunk events are forwarded immediately over the WebSocket. RequestFinished and RequestFailed events trigger cleanup.

  3. Reconnection with exponential backoff — on unexpected disconnect, the outer run_with_reconnect() loop retries with 1–30 second backoff plus up to 500ms jitter. Only a GracefulShutdown message causes a clean exit.
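The backoff schedule above can be sketched as a pure function (illustrative; the real daemon draws the jitter randomly per attempt):

```rust
use std::time::Duration;

// Exponential backoff from 1s, capped at 30s, plus up to 500ms jitter.
// The jitter is a parameter here so the sketch stays deterministic.
fn reconnect_delay(attempt: u32, jitter_ms: u64) -> Duration {
    // 1 << attempt seconds, saturating, clamped to the 30s ceiling.
    let base_secs = 1u64.checked_shl(attempt).unwrap_or(u64::MAX).min(30);
    Duration::from_millis(base_secs * 1000 + jitter_ms.min(500))
}
```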

Data Flow: Non-Streaming Request

  Client                    Server                     Worker              Backend
    │                         │                          │                    │
    │  POST /v1/chat/         │                          │                    │
    │  completions            │                          │                    │
    │────────────────────────►│                          │                    │
    │                         │  find_eligible_worker()  │                    │
    │                         │  assign_to_worker()      │                    │
    │                         │                          │                    │
    │                         │  WS: Request{id,body}    │                    │
    │                         │─────────────────────────►│                    │
    │                         │                          │  POST /v1/chat/    │
    │                         │                          │  completions       │
    │                         │                          │───────────────────►│
    │                         │                          │                    │
    │                         │                          │◄───────────────────│
    │                         │                          │  200 {response}    │
    │                         │  WS: ResponseComplete    │                    │
    │                         │  {id, 200, body}         │                    │
    │                         │◄─────────────────────────│                    │
    │                         │                          │                    │
    │  200 {response}         │  finish_request()        │                    │
    │◄────────────────────────│  dispatch_next_compat()  │                    │
    │                         │                          │                    │

Data Flow: Streaming Request

  Client                    Server                     Worker              Backend
    │                         │                          │                    │
    │  POST /v1/chat/         │                          │                    │
    │  completions            │                          │                    │
    │  stream: true           │                          │                    │
    │────────────────────────►│                          │                    │
    │                         │  WS: Request{id,body,    │                    │
    │                         │    is_streaming: true}    │                    │
    │                         │─────────────────────────►│                    │
    │                         │                          │  POST to backend   │
    │                         │                          │───────────────────►│
    │                         │                          │                    │
    │  SSE: data: chunk1      │  WS: ResponseChunk       │◄── chunk 1 ───────│
    │◄────────────────────────│◄─────────────────────────│                    │
    │  SSE: data: chunk2      │  WS: ResponseChunk       │◄── chunk 2 ───────│
    │◄────────────────────────│◄─────────────────────────│                    │
    │  SSE: data: chunk3      │  WS: ResponseChunk       │◄── chunk 3 ───────│
    │◄────────────────────────│◄─────────────────────────│                    │
    │                         │                          │                    │
    │  SSE: [DONE]            │  WS: ResponseComplete    │◄── end ───────────│
    │◄────────────────────────│◄─────────────────────────│                    │
    │                         │  finish_request()        │                    │

Streaming chunks flow through three hops with minimal buffering: the worker reads chunks from the backend HTTP response body as they arrive and wraps each in a ResponseChunk WebSocket message; the server receives each chunk and pushes it into an mpsc channel; and the HTTP handler yields each chunk to the client as an SSE event.
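A toy model of that ordered hand-off, using a synchronous channel and thread in place of the async mpsc and task (names illustrative):

```rust
use std::sync::mpsc;
use std::thread;

// The "worker" side pushes chunks into the channel as they arrive; the
// "HTTP handler" side drains the channel and yields each chunk in
// arrival order.
fn forward_stream(chunks: Vec<String>) -> Vec<String> {
    let (tx, rx) = mpsc::channel();
    let producer = thread::spawn(move || {
        for c in chunks {
            tx.send(c).unwrap(); // one ResponseChunk per backend chunk
        }
        // Dropping tx closes the channel, signalling end-of-stream
        // (the role ResponseComplete plays in the real protocol).
    });
    let forwarded: Vec<String> = rx.into_iter().collect();
    producer.join().unwrap();
    forwarded
}
```

Because the channel preserves per-sender ordering and nothing rebuffers the stream, the client observes chunks in exactly the order the backend emitted them.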

Data Flow: Client Cancellation

  Client                    Server                     Worker              Backend
    │                         │                          │                    │
    │  POST /v1/chat/...      │  WS: Request             │  POST to backend   │
    │────────────────────────►│─────────────────────────►│───────────────────►│
    │                         │                          │                    │
    │  [client disconnects]   │                          │  ◄── streaming ──  │
    │─ ─ ─ ─ ─X               │                          │                    │
    │                         │  CancellationGuard drop  │                    │
    │                         │  cancel_request(id)      │                    │
    │                         │                          │                    │
    │                         │  WS: Cancel{id}          │                    │
    │                         │─────────────────────────►│                    │
    │                         │                          │  [abort request    │
    │                         │                          │   task]            │
    │                         │                          │───── abort ───────►│

The RAII HttpRequestCancellationGuard is the key mechanism. When the HTTP response future is dropped — either because the client disconnected or a server-side timeout fired — the guard's Drop implementation spawns an async task that calls cancel_request(). If the request is still queued, it is removed immediately. If it is in-flight with a worker, a Cancel message is sent over the WebSocket, and the worker aborts the corresponding request task.
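The guard pattern can be sketched with a plain Drop impl and a sync channel standing in for the async cancel path (illustrative; the real guard spawns an async task on drop):

```rust
use std::sync::mpsc::Sender;

// Hypothetical RAII guard: dropping it without disarming sends a cancel
// for its request through the channel.
struct CancelGuard {
    request_id: String,
    cancel_tx: Option<Sender<String>>,
}

impl CancelGuard {
    fn new(request_id: String, cancel_tx: Sender<String>) -> Self {
        CancelGuard { request_id, cancel_tx: Some(cancel_tx) }
    }

    /// Call on normal completion so Drop does nothing.
    fn disarm(mut self) {
        self.cancel_tx = None;
    }
}

impl Drop for CancelGuard {
    fn drop(&mut self) {
        if let Some(tx) = self.cancel_tx.take() {
            // Response future was dropped (client gone or timed out):
            // propagate the cancel toward the assigned worker.
            let _ = tx.send(self.request_id.clone());
        }
    }
}
```

The key property is that no code path can forget cleanup: every early return, panic unwind, or dropped future runs Drop.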

Worker Lifecycle State Machine

                    ┌────────────┐
                    │ Connecting │
                    │            │
                    │ WS handshake
                    │ + auth     │
                    └─────┬──────┘
                          │ RegisterAck
                          ▼
                    ┌────────────┐
              ┌────►│    Idle    │◄───────────────┐
              │     │            │                 │
              │     │ capacity > 0                 │
              │     │ no in-flight                 │
              │     └─────┬──────┘                 │
              │           │ Request dispatched     │ request finished
              │           ▼                        │ (and more capacity)
              │     ┌────────────┐                 │
              │     │    Busy    │─────────────────┘
              │     │            │
              │     │ in-flight  │
              │     │ requests > 0                 ┌────────────┐
              │     └─────┬──────┘                 │    Gone    │
              │           │ GracefulShutdown        │            │
              │           ▼                        │ disconnected
              │     ┌────────────┐                 │ or shutdown │
              │     │  Draining  │────────────────►│ complete   │
              │     │            │  all in-flight   └────────────┘
              │     │ no new     │  finished or          ▲
              │     │ requests   │  drain timeout        │
              │     └────────────┘                       │
              │                                          │
              └──────────────────────────────────────────┘
                        unexpected disconnect
                        (triggers reconnect in worker daemon)

Connecting — WebSocket handshake in progress. The worker sends x-worker-secret in the upgrade request for authentication and the provider name as a query parameter.

Idle — registered and waiting for work. The worker has capacity (reported load < max concurrency) and no in-flight requests. The server may dispatch requests to it.

Busy — processing one or more requests. The worker still accepts new requests up to its max concurrency. Each in-flight request is tracked independently.

Draining — the server sent GracefulShutdown. No new requests are dispatched. Existing in-flight requests are allowed to complete up to an optional drain timeout. Once all requests finish (or the timeout expires), the worker transitions to Gone.

Gone — the worker is removed from the registry. In-flight requests are requeued (up to 3 attempts per request) or failed if already cancelled. The worker daemon's reconnect loop may bring it back as a new Connecting session.

Request Lifecycle State Machine

  ┌────────────┐
  │  Received  │
  │            │
  │ HTTP req   │
  │ parsed     │
  └─────┬──────┘
        │
        ├── eligible worker found ──────────┐
        │                                    │
        ▼                                    ▼
  ┌────────────┐                      ┌────────────────┐
  │   Queued   │                      │  Dispatched    │
  │            │──── worker becomes──►│                │
  │ in provider│     available        │ assigned to    │
  │ FIFO queue │                      │ worker, WS msg │
  └─────┬──┬───┘                      │ sent           │
        │  │                          └───────┬────────┘
        │  │                                  │
        │  │ queue timeout    ┌───────────────┤
        │  │ or queue full    │               │
        │  ▼                  │               │ is_streaming
  ┌────────────┐              │               ▼
  │   Failed   │              │         ┌────────────┐
  │            │◄─────────────┤         │ Streaming  │
  │ • QueueFull│  worker dies │         │            │
  │ • Timeout  │  (requeue    │         │ chunks     │
  │ • NoWorkers│   exhausted) │         │ forwarded  │
  │ • Cancelled│              │         │ via mpsc   │
  └────────────┘              │         └─────┬──────┘
        ▲                     │               │
        │                     │               │
        │   cancel signal     │               │
        │   (client disconnect│               │
        │    or timeout)      │               │
        │                     │               ▼
  ┌────────────┐              │         ┌────────────┐
  │ Cancelled  │◄─────────────┤         │    Done    │
  │            │              │         │            │
  │ cancel     │◄─────────────┘         │ Response   │
  │ propagated │                        │ Complete   │
  │ to worker  │                        │ sent to    │
  └────────────┘                        │ client     │
                                        └────────────┘

A request can be cancelled at any point: if still queued, it is removed from the queue immediately. If dispatched or streaming, a Cancel message is sent to the worker. After a request finishes (Done, Failed, or Cancelled), the dispatcher checks whether the now-free worker can pick up the next queued request for a compatible model.

Protocol Messages

All messages are JSON over WebSocket. The protocol is defined in the modelrelay-protocol crate and uses serde's tagged enum representation ("type": "message_type").

Server → Worker:

  Message            Purpose
  RegisterAck        Confirms registration, assigns worker ID, echoes accepted models
  Request            Dispatches an inference request (id, model, endpoint, body, headers, streaming flag)
  Cancel             Cancels an in-flight request with a reason
  Ping               Heartbeat probe with optional timestamp
  GracefulShutdown   Initiates drain with optional reason and timeout
  ModelsRefresh      Asks worker to re-query its backend for available models

Worker → Server:

  Message            Purpose
  Register           Initial registration with name, models, max concurrency, load
  ModelsUpdate       Updated model list and current load (after refresh or change)
  ResponseChunk      One chunk of a streaming response
  ResponseComplete   Final response with status code, headers, and optional body
  Pong               Heartbeat response echoing timestamp plus current load
  Error              Error report, optionally scoped to a specific request

Key Design Decisions

Why the queue lives at the center

Queueing at each worker would require clients to retry across workers or implement their own load balancing. Central queueing means one place manages fairness, timeout policy, and capacity-aware routing. When a worker finishes a request, the server immediately checks the queue for the next compatible request — no external coordination needed.

Why WebSocket instead of gRPC

Workers connect out to the server. This is the fundamental topology: GPU boxes on home networks, behind NATs, with no inbound ports. WebSocket over HTTP works through every proxy and firewall. gRPC would add a proto compilation step, a heavier runtime dependency, and more complex connection management for no meaningful benefit in a system where the message vocabulary is small and the payload is mostly opaque passthrough.

Why the protocol is flat JSON

The protocol has ~12 message types. Each is a small JSON object with a "type" tag. There is no binary framing, no schema negotiation, no version handshake beyond a simple protocol_version field. This makes debugging trivial (read the WebSocket frames), keeps the protocol crate minimal, and means any language can implement a worker in an afternoon. The heavy payload (inference request bodies and response chunks) is opaque text passed through without parsing.

Why streaming is chunked SSE, not buffered

LLM inference can take seconds to minutes. Buffering the full response before sending it to the client would destroy the interactive experience. ModelRelay preserves streaming semantics end-to-end: the worker reads chunks from the backend as they arrive, wraps each in a ResponseChunk message, and the server yields each as an SSE event. The client sees tokens arrive in real time, identical to talking directly to the backend.

Why cancellation is RAII-based

Client disconnects are the normal case, not an exception. When a user closes a tab or ctrl-C's a curl command, the HTTP response future is dropped. Rust's ownership model makes this the natural place to trigger cleanup: the HttpRequestCancellationGuard fires on drop, propagates the cancel through the server core to the worker, and the worker aborts the backend request. No polling, no timers, no forgotten cleanup paths.

Why requeue has a cap of 3

When a worker dies mid-request, the server requeues the request to another worker. But if workers keep dying (bad model, OOM, hardware failure), infinite requeue would loop forever. Three attempts is enough to survive transient worker restarts without masking systemic failures.
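The cap can be sketched as a simple attempt loop (illustrative; the `outcomes` slice stands in for successive workers either completing the request or dying mid-flight):

```rust
const MAX_REQUEUE_COUNT: u32 = 3;

// Toy simulation: dispatch a request up to MAX_REQUEUE_COUNT times,
// requeueing after each worker death until one attempt succeeds or
// the cap is hit.
fn run_with_requeue(outcomes: &[bool]) -> Result<u32, &'static str> {
    for attempt in 1..=MAX_REQUEUE_COUNT {
        let ok = outcomes.get((attempt - 1) as usize).copied().unwrap_or(false);
        if ok {
            return Ok(attempt); // succeeded on this attempt
        }
        // Worker died mid-request: loop around and requeue.
    }
    Err("requeue attempts exhausted")
}
```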

Capacity and Scaling

What limits the server

The server is single-process, async (tokio). The practical limits are:

  • Connected workers: bounded by memory for the worker registry and WebSocket connections. Thousands of workers are feasible.
  • Queue depth: configurable per provider (max_queue_len). Memory cost is proportional to queued request bodies.
  • Concurrent in-flight requests: bounded by the sum of all workers' max_concurrent values. Each in-flight request holds a small state record and channel handles.
  • Streaming throughput: chunks flow through an mpsc channel per request. The server does minimal processing per chunk (no parsing, no transformation), so throughput scales with I/O.

What limits a worker

Each worker is bounded by its local backend's capacity. The max_concurrent setting should match what the backend can handle (e.g., llama-server's -np parallel slots). The worker itself adds negligible overhead — it is a thin forwarding layer.

What the queue cannot do

  • Priority: the queue is FIFO per provider. There is no request priority mechanism.
  • Cross-provider routing: a request targets one provider. There is no fallback to a different provider if the primary queue is full.
  • Persistence: the queue is in-memory. If the server restarts, queued requests are lost. In-flight requests fail and clients retry.

Scaling patterns

  • Vertical: increase max_concurrent on workers with more GPU memory or faster hardware.
  • Horizontal: add more workers. The server's round-robin dispatcher spreads load automatically.
  • Multi-server: not built in. For HA, run multiple server instances behind a load balancer, but each server maintains its own worker pool and queue (no shared state). Workers can connect to multiple servers for redundancy.

Design Constraints

  • The HTTP boundary should look normal to clients; the worker protocol can stay private and purpose-built.
  • Queueing belongs at the central server, not at each worker.
  • Streaming and cancellation are first-class concerns, not add-ons.
  • The Rust rewrite should preserve behavior, not Go package boundaries.
  • The implementation should optimize for testability and explicit state transitions over abstraction depth.

Current Status

The project is complete and ready for production use. The full behavior matrix is implemented and verified by an extensive automated test suite covering:

  • OpenAI chat/completions and responses flows
  • Anthropic messages flows
  • Queueing and timeout behavior
  • Streaming pass-through with preserved ordering and termination
  • Client cancellation propagation through the WebSocket link
  • Worker disconnect and automatic requeue
  • Heartbeat and live-load reporting
  • Model refresh and auth cooldown recovery
  • Graceful shutdown and drain semantics

A multi-stage Dockerfile and docker-compose example are provided for quick setup without a Rust toolchain.

Protocol Walkthrough — Wire Traces

This document shows the actual message flow between components for each major scenario. Message types reference the structs in the modelrelay-protocol crate (ServerToWorkerMessage / WorkerToServerMessage).


1. Worker Registration and Heartbeat

  Worker                         Proxy Server
    │                                │
    │──── WebSocket UPGRADE ────────►│  GET /v1/worker/connect
    │◄─── 101 Switching Protocols ──│
    │                                │
    │  WorkerToServerMessage::Register
    │  {                             │
    │    "type": "register",         │
    │    "worker_name": "gpu-box-1", │
    │    "models": ["llama3-8b"],    │
    │    "max_concurrent": 4,        │
    │    "protocol_version": "1",    │
    │    "current_load": 0           │
    │  }                             │
    │──────────────────────────────►│  Proxy validates worker_secret
    │                                │  (passed as query param or header
    │                                │   during WebSocket upgrade)
    │                                │
    │  ServerToWorkerMessage::RegisterAck
    │  {                             │
    │    "type": "register_ack",     │
    │    "worker_id": "w-a1b2c3",   │
    │    "models": ["llama3-8b"],    │
    │    "protocol_version": "1"     │
    │  }                             │
    │◄──────────────────────────────│
    │                                │
    │         ┌──── heartbeat loop (HEARTBEAT_INTERVAL) ────┐
    │         │                      │                       │
    │  ServerToWorkerMessage::Ping   │                       │
    │  { "type": "ping",            │                       │
    │    "timestamp_unix_ms": ... }  │                       │
    │◄──────────────────────────────│                       │
    │                                │                       │
    │  WorkerToServerMessage::Pong   │                       │
    │  { "type": "pong",            │                       │
    │    "current_load": 1,          │                       │
    │    "timestamp_unix_ms": ... }  │                       │
    │──────────────────────────────►│  Proxy updates load   │
    │         └─────────────────────────────────────────────┘

If the worker misses heartbeats, the proxy closes the WebSocket with reason "worker heartbeat timed out" and requeues any in-flight requests (up to MAX_REQUEUE_COUNT = 3 retries).
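A minimal sketch of the staleness rule (the field name and the way the timeout is derived from HEARTBEAT_INTERVAL are assumptions):

```rust
use std::time::{Duration, Instant};

// Hypothetical per-worker heartbeat record kept by the registry.
struct HeartbeatState {
    last_pong: Instant,
}

/// A worker is stale once its last pong is older than the timeout.
fn is_stale(state: &HeartbeatState, now: Instant, timeout: Duration) -> bool {
    now.duration_since(state.last_pong) > timeout
}
```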


2. Normal Non-Streaming Request

  Client                 Proxy Server                Worker              Backend
    │                        │                         │                    │
    │  POST /v1/chat/completions                       │                    │
    │  {"model":"llama3-8b", │                         │                    │
    │   "stream": false,     │                         │                    │
    │   "messages":[...]}    │                         │                    │
    │───────────────────────►│                         │                    │
    │                        │                         │                    │
    │                        │  Queue lookup:           │                    │
    │                        │  provider="local",       │                    │
    │                        │  model="llama3-8b"       │                    │
    │                        │  → worker "gpu-box-1"    │                    │
    │                        │  has capacity             │                    │
    │                        │                         │                    │
    │                        │  ServerToWorkerMessage::Request              │
    │                        │  { "type": "request",   │                    │
    │                        │    "request_id": "r-001",│                   │
    │                        │    "model": "llama3-8b", │                   │
    │                        │    "endpoint_path":      │                    │
    │                        │      "/v1/chat/completions",                 │
    │                        │    "is_streaming": false, │                   │
    │                        │    "body": "{...}",      │                   │
    │                        │    "headers": {...} }    │                    │
    │                        │────────────────────────►│                    │
    │                        │                         │                    │
    │                        │                         │  POST /v1/chat/completions
    │                        │                         │───────────────────►│
    │                        │                         │                    │
    │                        │                         │◄───────────────────│
    │                        │                         │  200 OK + JSON body│
    │                        │                         │                    │
    │                        │  WorkerToServerMessage::ResponseComplete     │
    │                        │  { "type": "response_complete",             │
    │                        │    "request_id": "r-001",│                   │
    │                        │    "status_code": 200,   │                   │
    │                        │    "headers": {"content-type":              │
    │                        │      "application/json"},│                   │
    │                        │    "body": "{...}",      │                   │
    │                        │    "token_counts": {     │                   │
    │                        │      "prompt_tokens": 42,│                   │
    │                        │      "completion_tokens": 128,              │
    │                        │      "total_tokens": 170 │                   │
    │                        │    }                     │                   │
    │                        │  }                       │                   │
    │                        │◄────────────────────────│                    │
    │                        │                         │                    │
    │◄───────────────────────│  200 OK                  │                    │
    │  {"choices":[...]}     │  (body forwarded)        │                    │
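
As a sketch, the ServerToWorkerMessage::Request envelope shown above can be built like this. Field names follow the diagram; the helper function itself is hypothetical. Note that the body travels as a raw JSON string, so the proxy never re-encodes the client's request.

```python
import json

def build_request_envelope(request_id: str, model: str, body: dict,
                           streaming: bool = False,
                           endpoint_path: str = "/v1/chat/completions") -> str:
    """Serialize a request envelope with the fields from the diagram (sketch).

    The client body is embedded as a raw JSON string, not a nested object.
    """
    return json.dumps({
        "type": "request",
        "request_id": request_id,
        "model": model,
        "endpoint_path": endpoint_path,
        "is_streaming": streaming,
        "body": json.dumps(body),
        "headers": {},
    })
```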

3. Streaming Request (SSE)

  Client                 Proxy Server                Worker              Backend
    │                        │                         │                    │
    │  POST /v1/chat/completions                       │                    │
    │  {"stream": true, ...} │                         │                    │
    │───────────────────────►│                         │                    │
    │                        │                         │                    │
    │                        │  Request dispatched      │                    │
    │                        │  (is_streaming: true)    │                    │
    │                        │────────────────────────►│                    │
    │                        │                         │  POST (stream=true)│
    │                        │                         │───────────────────►│
    │                        │                         │                    │
    │                        │                         │  SSE: data: {...}  │
    │                        │                         │◄───────────────────│
    │                        │  WorkerToServerMessage::ResponseChunk       │
    │                        │  { "type": "response_chunk",               │
    │                        │    "request_id": "r-002",│                   │
    │                        │    "chunk": "data: {\"choices\":[...]}\n\n" │
    │                        │  }                       │                   │
    │                        │◄────────────────────────│                    │
    │  SSE: data: {...}      │                         │                    │
    │◄───────────────────────│                         │                    │
    │                        │                         │                    │
    │  ...more chunks...     │  ...more ResponseChunk..│  ...more SSE...   │
    │                        │                         │                    │
    │                        │                         │  SSE: data: [DONE] │
    │                        │                         │◄───────────────────│
    │                        │  ResponseChunk (final)   │                   │
    │                        │◄────────────────────────│                    │
    │  SSE: data: [DONE]     │                         │                    │
    │◄───────────────────────│                         │                    │
    │                        │                         │                    │
    │                        │  WorkerToServerMessage::ResponseComplete     │
    │                        │  { "type": "response_complete",             │
    │                        │    "request_id": "r-002",│                   │
    │                        │    "status_code": 200,   │                   │
    │                        │    "token_counts": {...}  │                  │
    │                        │  }                       │                   │
    │                        │◄────────────────────────│                    │

Chunks are forwarded byte-for-byte without re-parsing. The proxy writes each ResponseChunk.chunk directly to the HTTP response body, so SSE framing reaches the client exactly as the backend produced it.
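A minimal sketch of that pass-through, assuming chunks arrive as already-framed SSE strings (the real proxy writes bytes to the HTTP stream; StringIO stands in for it here):

```python
import io

def forward_chunks(chunks, out: io.StringIO) -> None:
    """Write each ResponseChunk.chunk to the response body verbatim.

    No SSE parsing happens here: the "data: ...\n\n" framing is preserved
    exactly as the worker sent it.
    """
    for chunk in chunks:
        out.write(chunk)  # byte-for-byte pass-through
```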


4. Client Cancellation Propagation

  Client                 Proxy Server                Worker              Backend
    │                        │                         │                    │
    │  POST /v1/chat/completions (streaming)            │                    │
    │───────────────────────►│                         │                    │
    │                        │  → dispatched to worker  │                    │
    │                        │────────────────────────►│                    │
    │                        │                         │───────────────────►│
    │  (receiving chunks...) │                         │                    │
    │◄───────────────────────│                         │                    │
    │                        │                         │                    │
    │  CLIENT DISCONNECTS    │                         │                    │
    │──── TCP RST / close ──►│                         │                    │
    │                        │                         │                    │
    │                        │  Proxy detects drop      │                    │
    │                        │                         │                    │
    │                        │  ServerToWorkerMessage::Cancel               │
    │                        │  { "type": "cancel",     │                   │
    │                        │    "request_id": "r-002",│                   │
    │                        │    "reason":              │                   │
    │                        │      "client_disconnect"  │                  │
    │                        │  }                       │                   │
    │                        │────────────────────────►│                    │
    │                        │                         │                    │
    │                        │                         │  Worker aborts      │
    │                        │                         │  backend request    │
    │                        │                         │───── abort ────────►│
    │                        │                         │                    │

Cancel reasons (from CancelReason enum):

  • client_disconnect — HTTP client dropped the connection
  • timeout — request exceeded REQUEST_TIMEOUT_SECS
  • graceful_shutdown — server is shutting down
  • worker_disconnect — worker WebSocket closed unexpectedly
  • requeue_exhausted — max requeue attempts (MAX_REQUEUE_COUNT = 3) exceeded
  • server_shutdown — server process is terminating
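
The reason strings above map one-to-one onto wire values. A hypothetical Python mirror of the enum, plus the Cancel message shape from the diagram:

```python
from enum import Enum

class CancelReason(Enum):
    """Wire values of the CancelReason enum listed above (sketch)."""
    CLIENT_DISCONNECT = "client_disconnect"
    TIMEOUT = "timeout"
    GRACEFUL_SHUTDOWN = "graceful_shutdown"
    WORKER_DISCONNECT = "worker_disconnect"
    REQUEUE_EXHAUSTED = "requeue_exhausted"
    SERVER_SHUTDOWN = "server_shutdown"

def cancel_message(request_id: str, reason: CancelReason) -> dict:
    """Shape of ServerToWorkerMessage::Cancel (illustrative helper)."""
    return {"type": "cancel", "request_id": request_id, "reason": reason.value}
```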

5. Worker Disconnect Mid-Request (Requeue Path)

  Client                 Proxy Server                Worker              Backend
    │                        │                         │                    │
    │  POST /v1/chat/completions                       │                    │
    │───────────────────────►│                         │                    │
    │                        │  → dispatched to worker  │                    │
    │                        │────────────────────────►│                    │
    │                        │                         │                    │
    │                        │      WORKER CRASHES      │                    │
    │                        │      (WebSocket closes)  │                    │
    │                        │◄─── close frame / EOF ──│                    │
    │                        │                         ×                    │
    │                        │                         │                    │
    │                        │  requeue_count < MAX_REQUEUE_COUNT (3)?      │
    │                        │  YES → put request back in queue             │
    │                        │                         │                    │
    │                        │  ...time passes, another worker available... │
    │                        │                         │                    │
    │                        │             Worker-2    │                    │
    │                        │  ServerToWorkerMessage::Request              │
    │                        │────────────────────────►│  Worker-2          │
    │                        │                         │───────────────────►│
    │                        │                         │◄───────────────────│
    │                        │  ResponseComplete        │                   │
    │                        │◄────────────────────────│                    │
    │◄───────────────────────│  200 OK                  │                    │
    │                        │                         │                    │

  If requeue_count >= MAX_REQUEUE_COUNT (3):
    │                        │                         │
    │◄───────────────────────│  503 Service Unavailable │
    │  {"error": "requeue    │  Cancel with reason:     │
    │   attempts exhausted"} │  "requeue_exhausted"     │
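
The requeue decision on worker disconnect can be sketched as (return shapes are illustrative; the cap comes from the flow above):

```python
MAX_REQUEUE_COUNT = 3  # retry cap from the diagram above

def on_worker_disconnect(requeue_count: int):
    """Decide the fate of a request whose worker died mid-flight (sketch).

    Returns ("requeue", new_count) while retries remain, else a terminal
    503-style error matching the "requeue attempts exhausted" path.
    """
    if requeue_count < MAX_REQUEUE_COUNT:
        return ("requeue", requeue_count + 1)
    return ("error", 503, "requeue attempts exhausted")
```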

6. Queue-Full Error

  Client                 Proxy Server
    │                        │
    │  POST /v1/chat/completions
    │───────────────────────►│
    │                        │
    │                        │  Queue length >= max_queue_len
    │                        │  (configured via MAX_QUEUE_LEN,
    │                        │   default: 100)
    │                        │
    │◄───────────────────────│  429 Too Many Requests
    │  {"error":             │
    │   "queue full"}        │
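
A sketch of the bounded per-provider queue, assuming the default MAX_QUEUE_LEN of 100 (class and method names are illustrative):

```python
from collections import deque

class BoundedQueue:
    """Per-provider FIFO with a hard cap (sketch of MAX_QUEUE_LEN behavior)."""

    def __init__(self, max_len: int = 100):
        self.max_len = max_len
        self.items = deque()

    def enqueue(self, request_id: str) -> int:
        """Return an HTTP-style status: 202 if queued, 429 if the queue is full."""
        if len(self.items) >= self.max_len:
            return 429  # Too Many Requests, as in the diagram
        self.items.append(request_id)
        return 202
```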

7. No Workers Available

  Client                 Proxy Server
    │                        │
    │  POST /v1/chat/completions
    │  {"model": "llama3-8b"}│
    │───────────────────────►│
    │                        │
    │                        │  No provider registered
    │                        │  for model "llama3-8b",
    │                        │  or no workers connected
    │                        │
    │                        │  If a provider exists but
    │                        │  no workers: request is queued
    │                        │  (will timeout after
    │                        │   QUEUE_TIMEOUT_SECS = 30)
    │                        │
    │                        │  If no provider matches at all:
    │◄───────────────────────│  404 Not Found
    │  {"error":             │
    │   "no provider for     │
    │    model llama3-8b"}   │
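
The routing decision above, expressed as a hypothetical helper (the providers mapping and return tuples are illustrative):

```python
def route(model: str, workers_per_model: dict) -> tuple:
    """Mirror the decision in the diagram (sketch).

    workers_per_model maps a model name to the number of connected
    workers advertising it.
    """
    if model not in workers_per_model:
        return ("error", 404, f"no provider for model {model}")
    if workers_per_model[model] == 0:
        # Provider exists but no workers: queue, subject to QUEUE_TIMEOUT_SECS.
        return ("queue", None)
    return ("dispatch", None)
```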

8. Graceful Shutdown / Worker Drain

  Proxy Server                Worker
    │                           │
    │  (admin triggers drain    │
    │   or server shutting down)│
    │                           │
    │  ServerToWorkerMessage::GracefulShutdown
    │  { "type": "graceful_shutdown",
    │    "reason": "maintenance",
    │    "drain_timeout_secs": 30
    │  }
    │──────────────────────────►│
    │                           │
    │  Worker marked is_draining│
    │  No new requests sent     │
    │                           │
    │  Worker finishes in-flight│
    │  requests normally...     │
    │                           │
    │  ResponseComplete(s)      │
    │◄──────────────────────────│
    │                           │
    │  disconnect_drained_worker_if_idle():
    │  all in-flight done?      │
    │  YES → close WebSocket    │
    │──── close frame ─────────►│
    │                           ×
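
The drain idle-check, disconnect_drained_worker_if_idle(), can be sketched as (state fields are illustrative):

```python
class WorkerState:
    """Minimal drain bookkeeping for one connected worker (sketch)."""

    def __init__(self):
        self.is_draining = False
        self.in_flight = 0

def disconnect_drained_worker_if_idle(w: WorkerState) -> bool:
    """Close the WebSocket only when a draining worker has no active work."""
    return w.is_draining and w.in_flight == 0
```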

9. Dynamic Model Update

  Worker                     Proxy Server
    │                           │
    │  (new model loaded or     │
    │   model removed locally)  │
    │                           │
    │  WorkerToServerMessage::ModelsUpdate
    │  { "type": "models_update",
    │    "models": ["llama3-8b",
    │               "codellama-13b"],
    │    "current_load": 1
    │  }
    │──────────────────────────►│
    │                           │  Proxy updates worker's
    │                           │  model list and routing
    │                           │
    │  (or server requests it)  │
    │                           │
    │  ServerToWorkerMessage::ModelsRefresh
    │  { "type": "models_refresh",
    │    "reason": "periodic"
    │  }
    │◄──────────────────────────│
    │                           │
    │  ModelsUpdate (response)  │
    │──────────────────────────►│
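
On the proxy side, applying a models_update to the routing state might look like this (the routing-table shape is an assumption for illustration):

```python
def apply_models_update(routing: dict, worker_id: str,
                        models: list, current_load: int) -> None:
    """Replace a worker's advertised model set and load in one step (sketch).

    Using a set gives O(1) model-membership checks during routing.
    """
    routing[worker_id] = {
        "models": set(models),
        "current_load": current_load,
    }
```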

10. Queue Timeout

  Client                 Proxy Server
    │                        │
    │  POST /v1/chat/completions
    │  {"model": "llama3-8b"}│
    │───────────────────────►│
    │                        │
    │                        │  Provider exists but all
    │                        │  workers are busy (at
    │                        │  max_concurrent).
    │                        │  Request enters queue.
    │                        │
    │                        │  ┌── QUEUE_TIMEOUT_SECS (30) ──┐
    │                        │  │  waiting for a worker to     │
    │                        │  │  become available...         │
    │                        │  │                              │
    │                        │  │  no worker picks up          │
    │                        │  └──────────── timeout fires ───┘
    │                        │
    │                        │  Cancel with reason: "timeout"
    │                        │
    │◄───────────────────────│  504 Gateway Timeout
    │  {"error":             │
    │   "queue timeout:      │
    │    no worker available  │
    │    within deadline"}   │

The request never reaches a worker. The proxy removes it from the queue and returns 504 to the client. No Cancel message is sent over WebSocket because no worker was ever assigned.
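A sketch of the deadline check. A request keeps its original enqueue time, so a later requeue does not extend this deadline (consistent with the Behavior Contract below); the helper name is illustrative.

```python
QUEUE_TIMEOUT_SECS = 30  # default from the diagram above

def queue_deadline_passed(enqueued_at_unix_ms: int, now_unix_ms: int,
                          timeout_secs: int = QUEUE_TIMEOUT_SECS) -> bool:
    """True once the request has waited past its queue deadline.

    The deadline is keyed to the original enqueue time, never reset.
    """
    return (now_unix_ms - enqueued_at_unix_ms) >= timeout_secs * 1000
```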


11. Request Timeout (In-Flight)

  Client                 Proxy Server                Worker              Backend
    │                        │                         │                    │
    │  POST /v1/chat/completions                       │                    │
    │───────────────────────►│                         │                    │
    │                        │  → dispatched to worker  │                    │
    │                        │────────────────────────►│                    │
    │                        │                         │───────────────────►│
    │                        │                         │                    │
    │                        │  ┌── REQUEST_TIMEOUT_SECS (300) ──┐         │
    │                        │  │  waiting for ResponseComplete   │        │
    │                        │  │  or streaming chunks...         │        │
    │                        │  │                                 │        │
    │                        │  │  backend is still processing    │        │
    │                        │  └──────────── timeout fires ──────┘        │
    │                        │                         │                    │
    │                        │  ServerToWorkerMessage::Cancel               │
    │                        │  { "type": "cancel",     │                   │
    │                        │    "request_id": "r-003",│                   │
    │                        │    "reason": "timeout"   │                   │
    │                        │  }                       │                   │
    │                        │────────────────────────►│                    │
    │                        │                         │                    │
    │                        │                         │  Worker aborts      │
    │                        │                         │  backend request    │
    │                        │                         │───── abort ────────►│
    │                        │                         │                    │
    │◄───────────────────────│  504 Gateway Timeout     │                    │
    │  {"error":             │                         │                    │
    │   "request timeout"}   │                         │                    │

Unlike queue timeout, the request was dispatched to a worker, so the proxy sends a Cancel message with reason "timeout" over the WebSocket. The worker receives the cancellation and aborts the in-flight backend request. The proxy returns 504 to the client.


Message Type Summary

Server → Worker (ServerToWorkerMessage)

  Type                Struct                     Purpose
  register_ack        RegisterAck                Confirm registration, assign worker ID
  request             RequestMessage             Dispatch an inference request
  cancel              CancelMessage              Cancel an in-flight request
  ping                PingMessage                Heartbeat probe
  graceful_shutdown   GracefulShutdownMessage    Begin drain sequence
  models_refresh      ModelsRefreshMessage       Ask worker to re-report models

Worker → Server (WorkerToServerMessage)

  Type                Struct                     Purpose
  register            RegisterMessage            Announce name, models, capacity
  models_update       ModelsUpdateMessage        Update model list / load
  response_chunk      ResponseChunkMessage       Forward a streaming chunk
  response_complete   ResponseCompleteMessage    Signal request completion
  pong                PongMessage                Heartbeat reply with load
  error               ErrorMessage               Report a request-level error

Behavior Contract

This document captures the externally observable behavior contract for ModelRelay — the behaviors that must hold across versions and that users and contributors can rely on. The contract test suite in modelrelay-contract-tests is the automated expression of these requirements.

Core Contract

  • Worker auth and registration: Workers connect to /v1/worker/connect?provider=<name> over WebSocket and authenticate with a provider-specific worker secret. The secret is preferably carried in the X-Worker-Secret header; the query-string fallback exists only for backward compatibility. Secret comparison is constant-time. Unknown providers are rejected, disabled providers are rejected, and repeated failed auth attempts are rate-limited by client IP.

  • Capability advertisement: After connect, the worker sends a register message containing worker_name, models, max_concurrent, and protocol_version. The server may sanitize or truncate these values and must send register_ack with the accepted worker ID, accepted model list, and warnings. Legacy workers omitting protocol_version are tolerated in Katamari unless explicitly rejected by config; mismatched protocol versions are closed with a protocol error. The first Rust characterization harness makes that sanitization concrete by requiring the acked model list to trim whitespace, drop empty entries, de-duplicate exact duplicates while preserving first-seen order, and cap the accepted list at a provider-defined limit with warnings surfaced in register_ack.

  • Model advertisement and worker selection: Workers advertise exact model names, and the server routes only to workers that explicitly support the requested model. Katamari keeps an O(1) model-membership set per worker. Selection is "lowest load with round-robin tie breaking" among workers that support the model and can atomically reserve capacity.

  • Queueing when no worker is immediately available: If no eligible worker can accept the request, the request is queued per virtual provider. The queue is bounded and FIFO among requests compatible with a worker's model list. Requests remain keyed by original queue time so requeue does not grant infinite timeout extensions.

  • Request dispatch over WebSocket: Requests are forwarded to workers as request messages with request_id, model, raw JSON body string, selected compatibility headers, target endpoint path, and is_streaming. The central proxy accepts ordinary provider-style HTTP requests and delegates only the worker-backed providers through this path. Compatibility-critical request headers include OpenAI-style authorization, content-type, and openai-organization, plus Anthropic-style x-api-key, anthropic-version, anthropic-beta, and content-type; incidental transport headers like user-agent are not part of the worker envelope contract.

  • Non-streaming response pass-through: Workers reply with response_complete containing the final HTTP status, response headers, full body, and token counts. The proxy must forward status, headers, and body faithfully, including upstream 4xx and 5xx responses, rather than collapsing them into generic proxy errors.

  • Streaming chunk ordering and termination semantics: Streaming responses are forwarded as response_chunk messages containing already-formatted SSE data and finish with response_complete. Chunks must preserve order. The HTTP side must flush promptly, retain streaming semantics, and treat completion metadata as the source of final status and token accounting. Katamari enforces a cumulative streaming size ceiling and emits an SSE error before terminating an oversized stream.

  • Client cancellation propagation end to end: Client disconnect or request timeout must cancel the HTTP request context, remove queued work if still queued, or send a best-effort cancel message for active worker requests. Late chunks that arrive after cancellation are intentionally dropped. The worker protocol has explicit cancel reasons, including client disconnect and timeout.

  • Worker disconnect during active request: On worker disconnect, active requests are examined one by one. If the request context is still alive, Katamari requeues it onto the provider queue without resetting its lifetime. If the request context is already canceled or timed out, the request fails immediately to the waiting client path instead. Requeue is capped at MaxRequeueCount = 3; after that the request fails with a service-unavailable style error instead of looping forever.

  • Timeout behavior: Every provider has a request timeout used both for queue wait and overall request lifetime. Queue timeout produces a worker-unavailable style response. Streaming and non-streaming requests share the parent HTTP context, so client disconnect and timeout terminate the same request object. WebSocket heartbeats use ping every 15 seconds and a 45-second pong window.

  • Queue-full, no-workers, and provider-disabled error surfaces: Katamari distinguishes bounded queue exhaustion, no worker capacity, disabled providers, deleted providers, timeout, and requeue exhaustion through dedicated error values. The public-facing HTTP layer currently sanitizes some internal errors into stable client messages such as "Service temporarily at capacity" and "Provider is currently disabled".

  • Heartbeat, load reporting, and stale-worker cleanup: The server sends JSON ping; workers reply with JSON pong carrying current load. This heartbeat updates last_heartbeat and live load accounting. Workers may also send models_update when their local model catalog changes. Stale worker DB records are cleaned periodically, and failed auth rate-limit entries expire automatically.

  • Graceful shutdown and drain semantics: The server can send graceful_shutdown to tell workers to stop accepting new work, finish current requests, and disconnect before a timeout. Provider deletion drains queued requests with an explicit provider-deleted error and closes connected workers.

  • OpenAI-style and Anthropic-style compatibility: The central server is meant to accept ordinary client traffic, not a custom client. Katamari parses model and stream flags from OpenAI-style request bodies, provides a special /v1/models compatibility endpoint, and preserves SSE behavior expected by OpenAI-compatible tooling. The extracted Rust project should also preserve Anthropic-style compatibility at the central HTTP boundary even if the internal worker protocol stays provider-neutral.
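
The register_ack sanitization rules described above (trim whitespace, drop empties, de-duplicate preserving first-seen order, cap the list, surface warnings) can be sketched as follows. The cap value here is illustrative; the real limit is provider-defined.

```python
MAX_MODELS_PER_WORKER = 64  # illustrative; the actual cap is provider-defined

def sanitize_models(raw: list) -> tuple:
    """Apply the register_ack sanitization contract (sketch).

    Returns (accepted_models, warnings) for inclusion in register_ack.
    """
    seen, accepted, warnings = set(), [], []
    for name in raw:
        name = name.strip()                      # trim whitespace
        if not name:
            warnings.append("dropped empty model name")
            continue
        if name in seen:                         # de-dupe, keep first-seen order
            warnings.append(f"dropped duplicate model {name!r}")
            continue
        if len(accepted) >= MAX_MODELS_PER_WORKER:
            warnings.append(f"model list capped at {MAX_MODELS_PER_WORKER}")
            break
        seen.add(name)
        accepted.append(name)
    return accepted, warnings
```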

Wire Messages To Preserve

  • Server to worker: ping, request, register_ack, cancel, graceful_shutdown, models_refresh

  • Worker to server: pong, register, models_update, response_chunk, response_complete, error

Invariants Worth Preserving

  • A worker never silently gains capability beyond the sanitized models acknowledged by the server.
  • Queueing is bounded per provider and does not grow without limit.
  • Requeue is intentional and finite.
  • HTTP error bodies from the worker backend are preserved where safe instead of flattened away.
  • Streaming remains SSE-shaped end to end.
  • Worker churn or late chunks must not leave requests hanging forever.

Extension Points

When adding new behaviors, add a contract test in modelrelay-contract-tests before implementing. This keeps the test suite as the primary specification. If a behavior described above is not yet covered by an automated test, that gap is the highest-priority work item.

Operational Runbook

This guide covers day-to-day operations for running ModelRelay in production. It assumes you have one modelrelay-server instance and one or more modelrelay-worker processes.


Health Checks

Proxy Server

The proxy server exposes a dedicated /health endpoint:

# Primary health check — returns JSON with version, worker count, queue depth, and uptime.
curl -sf http://proxy:8080/health | jq .

Example response:

{
  "status": "ok",
  "version": "0.1.6",
  "workers_connected": 2,
  "queue_depth": 0,
  "uptime_secs": 3621.5
}

Use /health for liveness probes, Kubernetes readiness checks, and monitoring. A workers_connected of 0 means the proxy is running but no workers are registered.
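
A monitoring probe can map the /health JSON onto a coarse status. This three-way interpretation is an assumption layered on the fields shown above, not part of the endpoint itself:

```python
def interpret_health(health: dict) -> str:
    """Classify a /health response for alerting (sketch).

    A running proxy with zero workers is 'degraded': it answers requests
    but cannot route them anywhere.
    """
    if health.get("status") != "ok":
        return "down"
    if health.get("workers_connected", 0) == 0:
        return "degraded"
    return "healthy"
```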

You can also list routable models directly:

curl -s http://proxy:8080/v1/models | jq '.data[].id'

Worker Daemon

The worker daemon does not expose its own HTTP port — it connects outward to the proxy. Health is observable from the proxy side:

# Check if workers are registered by listing models.
curl -s http://proxy:8080/v1/models | jq '.data[].id'

If expected models are missing, the worker is either down or failed to register. Check worker logs for connection errors or authentication failures.


Admin API & Monitoring

ModelRelay includes admin endpoints for inspecting workers, request metrics, and managing client API keys. All /admin/* endpoints require a Bearer token.

Enabling Admin Endpoints

Set MODELRELAY_ADMIN_TOKEN when starting the server:

modelrelay-server --worker-secret mysecret --admin-token my-admin-secret

Without this token, all /admin/* endpoints return 403 Forbidden.

Querying Admin Endpoints

TOKEN="my-admin-secret"

# List connected workers (models, load, capabilities)
curl -s -H "Authorization: Bearer $TOKEN" http://proxy:8080/admin/workers | jq .

# Request stats and queue depth
curl -s -H "Authorization: Bearer $TOKEN" http://proxy:8080/admin/stats | jq .

# List client API keys (metadata only, no secrets)
curl -s -H "Authorization: Bearer $TOKEN" http://proxy:8080/admin/keys | jq .

Managing Client API Keys

When MODELRELAY_REQUIRE_API_KEYS=true, clients must send a valid API key as a Bearer token on inference requests.

TOKEN="my-admin-secret"

# Create a new API key (the secret is returned only at creation time)
curl -s -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "production-app"}' \
  http://proxy:8080/admin/keys | jq .

# Revoke a key by ID
curl -s -X DELETE \
  -H "Authorization: Bearer $TOKEN" \
  http://proxy:8080/admin/keys/{key-id}

Clients use the returned secret as a Bearer token:

curl -H "Authorization: Bearer mr-..." \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b", "messages": [{"role": "user", "content": "Hi"}]}' \
  http://proxy:8080/v1/chat/completions

Web Dashboard & Setup Wizard

The proxy serves a built-in web UI:

  • Dashboard — visit http://proxy:8080/dashboard for real-time worker status, request metrics, and queue depth.
  • Setup Wizard — visit http://proxy:8080/setup for a step-by-step guide to connecting a new worker (platform detection, backend setup, binary download, and live connection verification).

The wizard is always accessible, not just on first run — use it whenever you add another GPU box.

Troubleshooting Admin Features

Admin endpoints return 403: MODELRELAY_ADMIN_TOKEN is not set on the server, or the Authorization header doesn't match. Verify the token value and ensure the header format is Authorization: Bearer <token>.

Client requests return 401 when API key auth is enabled: The client is not sending a Bearer token, or the key has been revoked. Create a new key via POST /admin/keys and ensure the client sends Authorization: Bearer <key>.

API key auth not taking effect: MODELRELAY_REQUIRE_API_KEYS must be set to true. When false (the default), inference endpoints accept unauthenticated requests.


Checking Worker Registration

After starting a new worker, confirm it registered:

# Should include the worker's advertised models.
curl -s http://proxy:8080/v1/models | jq .

If a worker's models don't appear within ~10 seconds:

  1. Check the worker secret — does WORKER_SECRET on the worker match the proxy?
  2. Check connectivity — can the worker reach PROXY_URL?
    curl -v http://proxy:8080/v1/worker/connect
    # Should get 400 or upgrade-required, not a connection error
    
  3. Check worker logs — look for register / register_ack messages or error lines.

Draining a Worker Gracefully

To remove a worker from rotation without dropping in-flight requests:

  1. Send SIGTERM to the modelrelay-worker process. The daemon initiates a graceful disconnect: it stops accepting new work, and the proxy stops routing new requests to that worker.

  2. In-flight requests finish normally. The proxy waits up to drain_timeout_secs (from the shutdown message) for active requests to complete.

  3. Once idle, the WebSocket closes. The worker process exits.

# Graceful stop via systemd
systemctl stop modelrelay-worker@gpu-box-1

# Or with Docker
docker stop --time 60 worker-gpu-box-1

Monitoring drain progress: Watch the proxy logs for "worker drained" or similar messages. If the worker still has in-flight requests, you'll see ongoing ResponseChunk / ResponseComplete messages until they finish.


Scaling Workers

Adding a worker

Start a new modelrelay-worker instance pointing at the same proxy:

PROXY_URL=http://proxy:8080 \
WORKER_SECRET=your-secret \
WORKER_NAME=gpu-box-4 \
BACKEND_URL=http://localhost:8000 \
  modelrelay-worker --models llama3-8b

The proxy discovers it within seconds via the WebSocket registration handshake. No proxy restart or config change needed.

Removing a worker

Use the graceful drain procedure above. The proxy automatically routes around disconnected workers.

Scaling the proxy

The proxy is a single-process server. To scale:

  • Vertical: increase MAX_QUEUE_LEN and system file descriptor limits.
  • Horizontal: run multiple proxy instances behind a load balancer, but note that each worker connects to one proxy. Workers must be distributed across proxy instances manually or via DNS round-robin.
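
For the DNS round-robin option, plain A records suffice; an illustrative zone fragment (names and addresses are placeholders):

```
; workers resolving proxy.example.com get spread across two proxy instances
proxy.example.com.   300  IN  A  10.0.0.10
proxy.example.com.   300  IN  A  10.0.0.11
```

Each worker keeps its WebSocket pinned to whichever instance it resolved at connect time, so the spread is approximate rather than balanced.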

Log Interpretation

Proxy Server

Log pattern                         Meaning
worker registered / register_ack    Worker connected and authenticated
request dispatched                  Request sent to a worker
response complete                   Worker returned a result
worker heartbeat timed out          Worker missed pings — WebSocket closed
request requeued                    Worker died mid-request, retrying on another worker
requeue exhausted                   Request failed after MAX_REQUEUE_COUNT (3) retries
queue full                          Rejected request — queue at MAX_QUEUE_LEN capacity
queue timeout                       Request sat in queue longer than QUEUE_TIMEOUT_SECS
graceful shutdown                   Worker drain initiated

Worker Daemon

Log pattern           Meaning
connected to proxy    WebSocket connection established
registered            Registration acknowledged by proxy
forwarding request    Proxying a request to the local backend
backend error         Local backend returned an error or is unreachable
cancelled             Proxy sent a cancel for an in-flight request
graceful shutdown     Drain in progress, finishing active requests

Adjusting log verbosity

Set the LOG_LEVEL environment variable on either component:

LOG_LEVEL=debug modelrelay-server   # trace, debug, info (default), warn, error
LOG_LEVEL=debug modelrelay-worker

Common Failure Modes

Worker can't connect to proxy

Symptoms: Worker logs show connection refused or timeouts.

Checklist:

  1. Is the proxy running? curl http://proxy:8080/v1/models
  2. Is PROXY_URL correct? The worker connects to {PROXY_URL}/v1/worker/connect via WebSocket.
  3. Firewall / network: the worker makes an outbound connection to the proxy — no inbound ports needed on the worker machine.
  4. If using TLS (nginx/reverse proxy in front), ensure WebSocket upgrade headers are forwarded. See the TLS Setup guide.

Worker registers but requests fail

Symptoms: /v1/models shows the model, but requests return 502 or timeout.

Checklist:

  1. Is the local backend running? curl http://localhost:8000/v1/models (or whatever BACKEND_URL is set to)
  2. Does the backend support the requested endpoint? (/v1/chat/completions, /v1/messages, /v1/responses)
  3. Check worker logs for backend error messages.
  4. Try a direct request to the backend to isolate the issue.

Requests queue but never complete

Symptoms: Clients hang, then get a timeout error after QUEUE_TIMEOUT_SECS.

Causes:

  • No workers are connected (check /v1/models)
  • Workers are at capacity (max_concurrent reached on all workers)
  • Workers are connected but not advertising the requested model

Fix: Add more workers, increase max_concurrent if the hardware allows, or reduce QUEUE_TIMEOUT_SECS to fail faster.

Streaming responses arrive corrupted

Symptoms: SSE chunks arrive garbled or out of order.

Checklist:

  1. Ensure no intermediate proxy is buffering. Disable response buffering in nginx:
    proxy_buffering off;
    
  2. If using a CDN or reverse proxy, ensure it supports chunked transfer encoding and doesn't aggregate small writes.

High memory usage on the proxy

Symptoms: Proxy RSS grows over time.

Causes:

  • Large queue of pending requests (each holds the full request body)
  • Many concurrent streaming responses with large chunk buffers

Fix: Lower MAX_QUEUE_LEN, set QUEUE_TIMEOUT_SECS to a shorter value, or add workers to drain the queue faster.
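
A rough upper bound on queue memory is MAX_QUEUE_LEN times the worst-case request body, since each queued request holds its full body. A quick sketch with illustrative numbers (the 256 KiB body size is an assumption, not a measured figure):

```shell
# Rough upper bound: queue length x worst-case request body size
max_queue_len=100    # MAX_QUEUE_LEN default
max_body_kib=256     # assumed worst-case prompt size; adjust for your workloads
bound_mib=$(( max_queue_len * max_body_kib / 1024 ))
echo "~${bound_mib} MiB held by a full queue"
```

This counts request bodies only; per-request overhead and streaming buffers come on top.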

Worker keeps reconnecting

Symptoms: Worker logs show repeated connect/disconnect cycles.

Causes:

  • Heartbeat timeout — the worker or network is too slow to respond to pings within HEARTBEAT_INTERVAL
  • WORKER_SECRET mismatch — worker connects, fails auth, gets disconnected, retries

Fix: Check secrets match, check network latency between worker and proxy.


Configuration Quick Reference

Proxy Server

Env Var                      Default          Description
LISTEN_ADDR                  127.0.0.1:8080   HTTP listen address
PROVIDER_NAME                local            Provider name for routing
WORKER_SECRET                (required)       Shared secret for worker auth
MAX_QUEUE_LEN                100              Max queued requests before rejecting
QUEUE_TIMEOUT_SECS           30               How long a request can wait in queue
REQUEST_TIMEOUT_SECS         300              Total request timeout (5 min)
LOG_LEVEL                    info             Log verbosity
MODELRELAY_ADMIN_TOKEN       (none)           Bearer token for /admin/* endpoints (if unset, admin returns 403)
MODELRELAY_REQUIRE_API_KEYS  false            When true, client requests require a valid API key
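
For systemd deployments these settings can be collected in an environment file loaded with EnvironmentFile= (values illustrative; the path is an assumption):

```ini
# /etc/modelrelay/server.env (assumed path)
LISTEN_ADDR=0.0.0.0:8080
WORKER_SECRET=change-me
MAX_QUEUE_LEN=200
QUEUE_TIMEOUT_SECS=30
REQUEST_TIMEOUT_SECS=300
LOG_LEVEL=info
MODELRELAY_REQUIRE_API_KEYS=true
```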

Worker Daemon

Env Var        Default                Description
PROXY_URL      http://127.0.0.1:8080  Proxy server URL
WORKER_SECRET  (required)             Must match proxy's secret
WORKER_NAME    worker                 Human-readable worker name
BACKEND_URL    http://127.0.0.1:8000  Local model server URL
LOG_LEVEL      info                   Log verbosity

Windows Service

Checking Service Status

Get-Service ModelRelayServer
Get-Service ModelRelayWorker

Starting and Stopping

Start-Service ModelRelayServer
Stop-Service ModelRelayServer

Start-Service ModelRelayWorker
Stop-Service ModelRelayWorker

Stop-Service sends a stop control signal and waits for the process to exit. ModelRelay handles this as a graceful shutdown — in-flight requests finish before the process terminates. To avoid blocking indefinitely, stop without waiting and apply an explicit timeout:

# Stop without blocking, then wait up to 60 seconds for the service to exit.
# WaitForStatus throws a TimeoutException if it is still running after that.
Stop-Service ModelRelayServer -NoWait
(Get-Service ModelRelayServer).WaitForStatus("Stopped", "00:01:00")

Logs

Windows Services don't write to stdout by default. Two options:

  1. Windows Event Log — ModelRelay writes to the Application log. View with:

    Get-EventLog -LogName Application -Source ModelRelayServer -Newest 50
    
  2. File logging via RUST_LOG — set RUST_LOG as a machine-level environment variable to control verbosity, and wrap the binary in a small script that redirects its output to a file if you need logs on disk:

    [Environment]::SetEnvironmentVariable("RUST_LOG", "info", "Machine")
    

Draining a Worker

To drain a worker gracefully before maintenance:

# Stop the service — this triggers graceful shutdown.
Stop-Service ModelRelayWorker

# Verify it has stopped.
Get-Service ModelRelayWorker

The worker completes in-flight requests before exiting, identical to the systemctl stop behavior on Linux.


Monitoring Checklist

For production deployments, monitor these signals:

  • Proxy process is up — HTTP health check on /health
  • At least one worker registered — /health returns workers_connected > 0
  • Queue depth — /health returns queue_depth; watch for sustained growth
  • Request latency — track time from client request to first byte
  • Worker reconnect rate — frequent reconnects indicate network or auth issues
  • Error rates — 4xx (client errors) vs 5xx (backend/proxy errors)
  • Backend health — each worker's local model server should be independently monitored
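
The workers_connected and queue_depth fields above can be scraped with standard tools; for example, against a canned /health response (field names come from this checklist, the exact JSON shape is assumed):

```shell
# Canned /health body (field names from the checklist; overall shape assumed)
resp='{"status":"ok","workers_connected":2,"queue_depth":0}'

# Extract workers_connected without jq
workers=$(printf '%s' "$resp" | sed -n 's/.*"workers_connected":\([0-9]*\).*/\1/p')
if [ "$workers" -gt 0 ]; then
  echo "healthy ($workers workers)"
else
  echo "ALERT: no workers connected"
fi
```

In a real probe, fetch resp with curl -s http://proxy:8080/health and alert on the second branch.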

TLS Setup

ModelRelay's proxy server listens on plain HTTP by default. For production deployments you should terminate TLS in front of it so that:

  • Clients reach the API over HTTPS (https://your-domain/v1/...)
  • Workers connect over secure WebSockets (wss://your-domain/v1/worker/connect)

Without TLS the worker secret and all inference traffic travel in the clear. This matters especially when workers connect over the public internet rather than a private network.


Option 1: nginx

The repository includes a ready-to-use nginx config at examples/tls-nginx.conf. Copy it into your nginx sites directory and customise the domain and certificate paths.

How it works

The config defines two server blocks:

  1. Port 80 redirects all HTTP traffic to HTTPS.
  2. Port 443 terminates TLS and proxies to 127.0.0.1:8080 (the default LISTEN_ADDR).

Two location blocks handle the different traffic types:

  • /v1/worker/connect --- the WebSocket endpoint. The key directives are:

    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 86400s;   # keep the long-lived WS open
    proxy_send_timeout 86400s;
    

    Without the Upgrade / Connection headers, nginx will not complete the WebSocket handshake and workers will fail to connect.

  • /v1/ --- the inference API. Buffering is disabled so that SSE streaming responses pass through without delay:

    proxy_buffering off;
    proxy_cache off;
    proxy_read_timeout 300s;   # match REQUEST_TIMEOUT_SECS
    

Quick start

# 1. Obtain a certificate (Let's Encrypt example)
sudo certbot certonly --nginx -d your-domain.example.com

# 2. Install the config
sudo cp examples/tls-nginx.conf /etc/nginx/sites-available/modelrelay.conf
sudo ln -s /etc/nginx/sites-available/modelrelay.conf /etc/nginx/sites-enabled/

# 3. Edit the config: replace your-domain.example.com everywhere
sudo nano /etc/nginx/sites-available/modelrelay.conf

# 4. Test and reload
sudo nginx -t && sudo systemctl reload nginx

Certificate renewal

Let's Encrypt certificates expire after 90 days. Certbot usually installs a systemd timer or cron job that renews automatically. Verify:

sudo certbot renew --dry-run

Option 2: Caddy

Caddy automatically provisions and renews TLS certificates from Let's Encrypt with zero configuration. If you don't need nginx's flexibility, this is the simplest path.

Caddyfile

your-domain.example.com {
    reverse_proxy 127.0.0.1:8080
}

That's it. Caddy handles:

  • HTTPS redirect from port 80
  • Automatic Let's Encrypt certificate issuance and renewal
  • WebSocket upgrade pass-through (no special config needed)
  • Unbuffered streaming (the default for reverse_proxy)

Running

# Install (Debian/Ubuntu)
sudo apt install -y caddy

# Write the Caddyfile
cat > /etc/caddy/Caddyfile <<'EOF'
your-domain.example.com {
    reverse_proxy 127.0.0.1:8080
}
EOF

# Start
sudo systemctl enable --now caddy

Note: Caddy must be able to bind ports 80 and 443, and the domain must resolve to the server's public IP for the ACME challenge to succeed.


Option 3: Manual certificates (certbot standalone)

If you're running neither nginx nor Caddy you can still use Let's Encrypt with certbot's standalone mode, then point any reverse proxy at the resulting certificate files:

sudo certbot certonly --standalone -d your-domain.example.com

Certificates land in /etc/letsencrypt/live/your-domain.example.com/. Use fullchain.pem and privkey.pem with whatever TLS terminator you prefer (HAProxy, Traefik, etc.).


Configuring workers for TLS

Once TLS is in place, update the worker's PROXY_URL to use the secure scheme:

Scenario                          PROXY_URL
No TLS (local / private network)  http://proxy:8080
TLS via reverse proxy             https://your-domain.example.com

The worker uses PROXY_URL to derive the WebSocket connection URL. When the scheme is https, the worker connects over wss:// automatically.
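
That derivation can be sketched as a simple scheme rewrite (an illustration of the assumed mapping, not the worker's actual code):

```shell
# https -> wss, http -> ws, then append the worker endpoint path
PROXY_URL="https://your-domain.example.com"
ws_url=$(printf '%s' "$PROXY_URL" | sed -e 's|^https://|wss://|' -e 's|^http://|ws://|')
echo "${ws_url}/v1/worker/connect"
```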

# Example: worker connecting over TLS
PROXY_URL=https://your-domain.example.com \
WORKER_SECRET=your-secret \
BACKEND_URL=http://localhost:8000 \
  modelrelay-worker --models llama3-8b

Tip: The local backend (BACKEND_URL) almost never needs TLS --- it runs on the same machine as the worker. Keep it as plain http://localhost:....


Troubleshooting

Workers can't connect after enabling TLS

  1. Verify the certificate is valid: curl -v https://your-domain.example.com/v1/models
  2. Confirm WebSocket upgrade works: curl -v -H 'Upgrade: websocket' -H 'Connection: upgrade' https://your-domain.example.com/v1/worker/connect (should get a 101 or 400, not a connection error)
  3. Check that proxy_read_timeout / proxy_send_timeout are long enough for the WebSocket (the nginx config uses 86400s)

Streaming responses arrive buffered

Ensure your reverse proxy has buffering disabled for the /v1/ path. In nginx: proxy_buffering off;. Caddy disables buffering by default.

Certificate renewal fails

Certbot's HTTP-01 challenge needs port 80. If nginx is already listening there, use the --nginx plugin (or the --webroot method) instead of --standalone to avoid the port conflict. Caddy obtains and renews its own certificates, so certbot isn't involved in that setup.