Protocol Walkthrough — Wire Traces
This document shows the actual message flow between components for each
major scenario. Message types reference the structs in the
modelrelay-protocol crate (ServerToWorkerMessage / WorkerToServerMessage).
1. Worker Registration and Heartbeat
Worker Proxy Server
│ │
│──── WebSocket UPGRADE ────────►│ GET /v1/worker/connect
│◄─── 101 Switching Protocols ──│
│ │
│ WorkerToServerMessage::Register
│ { │
│ "type": "register", │
│ "worker_name": "gpu-box-1", │
│ "models": ["llama3-8b"], │
│ "max_concurrent": 4, │
│ "protocol_version": "1", │
│ "current_load": 0 │
│ } │
│──────────────────────────────►│ Proxy validates worker_secret
│ │ (passed as query param or header
│ │ during WebSocket upgrade)
│ │
│ ServerToWorkerMessage::RegisterAck
│ { │
│ "type": "register_ack", │
│ "worker_id": "w-a1b2c3", │
│ "models": ["llama3-8b"], │
│ "protocol_version": "1" │
│ } │
│◄──────────────────────────────│
│ │
│ ┌──── heartbeat loop (HEARTBEAT_INTERVAL) ────┐
│ │ │ │
│ ServerToWorkerMessage::Ping │ │
│ { "type": "ping", │ │
│ "timestamp_unix_ms": ... } │ │
│◄──────────────────────────────│ │
│ │ │
│ WorkerToServerMessage::Pong │ │
│ { "type": "pong", │ │
│ "current_load": 1, │ │
│ "timestamp_unix_ms": ... } │ │
│──────────────────────────────►│ Proxy updates load │
│ └─────────────────────────────────────────────┘
If the worker misses heartbeats, the proxy closes the WebSocket with
reason "worker heartbeat timed out" and requeues any in-flight requests
(up to MAX_REQUEUE_COUNT = 3 retries).
2. Normal Non-Streaming Request
Client Proxy Server Worker Backend
│ │ │ │
│ POST /v1/chat/completions │ │
│ {"model":"llama3-8b", │ │ │
│ "stream": false, │ │ │
│ "messages":[...]} │ │ │
│───────────────────────►│ │ │
│ │ │ │
│ │ Queue lookup: │ │
│ │ provider="local", │ │
│ │ model="llama3-8b" │ │
│ │ → worker "gpu-box-1" │ │
│ │ has capacity │ │
│ │ │ │
│ │ ServerToWorkerMessage::Request │
│ │ { "type": "request", │ │
│ │ "request_id": "r-001",│ │
│ │ "model": "llama3-8b", │ │
│ │ "endpoint_path": │ │
│ │ "/v1/chat/completions", │
│ │ "is_streaming": false, │ │
│ │ "body": "{...}", │ │
│ │ "headers": {...} } │ │
│ │────────────────────────►│ │
│ │ │ │
│ │ │ POST /v1/chat/completions
│ │ │───────────────────►│
│ │ │ │
│ │ │◄───────────────────│
│ │ │ 200 OK + JSON body│
│ │ │ │
│ │ WorkerToServerMessage::ResponseComplete │
│ │ { "type": "response_complete", │
│ │ "request_id": "r-001",│ │
│ │ "status_code": 200, │ │
│ │ "headers": {"content-type": │
│ │ "application/json"},│ │
│ │ "body": "{...}", │ │
│ │ "token_counts": { │ │
│ │ "prompt_tokens": 42,│ │
│ │ "completion_tokens": 128, │
│ │ "total_tokens": 170 │ │
│ │ } │ │
│ │ } │ │
│ │◄────────────────────────│ │
│ │ │ │
│◄───────────────────────│ 200 OK │ │
│ {"choices":[...]} │ (body forwarded) │ │
3. Streaming Request (SSE)
Client Proxy Server Worker Backend
│ │ │ │
│ POST /v1/chat/completions │ │
│ {"stream": true, ...} │ │ │
│───────────────────────►│ │ │
│ │ │ │
│ │ Request dispatched │ │
│ │ (is_streaming: true) │ │
│ │────────────────────────►│ │
│ │ │ POST (stream=true)│
│ │ │───────────────────►│
│ │ │ │
│ │ │ SSE: data: {...} │
│ │ │◄───────────────────│
│ │ WorkerToServerMessage::ResponseChunk │
│ │ { "type": "response_chunk", │
│ │ "request_id": "r-002",│ │
│ │ "chunk": "data: {\"choices\":[...]}\n\n" │
│ │ } │ │
│ │◄────────────────────────│ │
│ SSE: data: {...} │ │ │
│◄───────────────────────│ │ │
│ │ │ │
│ ...more chunks... │ ...more ResponseChunk..│ ...more SSE... │
│ │ │ │
│ │ │ SSE: data: [DONE] │
│ │ │◄───────────────────│
│ │ ResponseChunk (final) │ │
│ │◄────────────────────────│ │
│ SSE: data: [DONE] │ │ │
│◄───────────────────────│ │ │
│ │ │ │
│ │ WorkerToServerMessage::ResponseComplete │
│ │ { "type": "response_complete", │
│ │ "request_id": "r-002",│ │
│ │ "status_code": 200, │ │
│ │ "token_counts": {...} │ │
│ │ } │ │
│ │◄────────────────────────│ │
Chunks are forwarded byte-for-byte without re-parsing. The proxy writes
each ResponseChunk.chunk directly to the HTTP response body, preserving
SSE framing intact.
4. Client Cancellation Propagation
Client Proxy Server Worker Backend
│ │ │ │
│ POST /v1/chat/completions (streaming) │ │
│───────────────────────►│ │ │
│ │ → dispatched to worker │ │
│ │────────────────────────►│ │
│ │ │───────────────────►│
│ (receiving chunks...) │ │ │
│◄───────────────────────│ │ │
│ │ │ │
│ CLIENT DISCONNECTS │ │ │
│──── TCP RST / close ──►│ │ │
│ │ │ │
│ │ Proxy detects drop │ │
│ │ │ │
│ │ ServerToWorkerMessage::Cancel │
│ │ { "type": "cancel", │ │
│ │ "request_id": "r-002",│ │
│ │ "reason": │ │
│ │ "client_disconnect" │ │
│ │ } │ │
│ │────────────────────────►│ │
│ │ │ │
│ │ │ Worker aborts │
│ │ │ backend request │
│ │ │───── abort ────────►│
│ │ │ │
Cancel reasons (from CancelReason enum):
client_disconnect— HTTP client dropped the connectiontimeout— request exceededREQUEST_TIMEOUT_SECSgraceful_shutdown— server is shutting downworker_disconnect— worker WebSocket closed unexpectedlyrequeue_exhausted— max requeue attempts (MAX_REQUEUE_COUNT = 3) exceededserver_shutdown— server process is terminating
5. Worker Disconnect Mid-Request (Requeue Path)
Client Proxy Server Worker Backend
│ │ │ │
│ POST /v1/chat/completions │ │
│───────────────────────►│ │ │
│ │ → dispatched to worker │ │
│ │────────────────────────►│ │
│ │ │ │
│ │ WORKER CRASHES │ │
│ │ (WebSocket closes) │ │
│ │◄─── close frame / EOF ──│ │
│ │ × │
│ │ │ │
│ │ requeue_count < MAX_REQUEUE_COUNT (3)? │
│ │ YES → put request back in queue │
│ │ │ │
│ │ ...time passes, another worker available... │
│ │ │ │
│ │ Worker-2 │ │
│ │ ServerToWorkerMessage::Request │
│ │────────────────────────►│ Worker-2 │
│ │ │───────────────────►│
│ │ │◄───────────────────│
│ │ ResponseComplete │ │
│ │◄────────────────────────│ │
│◄───────────────────────│ 200 OK │ │
│ │ │ │
If requeue_count >= MAX_REQUEUE_COUNT (3):
│ │ │
│◄───────────────────────│ 503 Service Unavailable │
│ {"error": "requeue │ Cancel with reason: │
│ attempts exhausted"} │ "requeue_exhausted" │
6. Queue-Full Error
Client Proxy Server
│ │
│ POST /v1/chat/completions
│───────────────────────►│
│ │
│ │ Queue length >= max_queue_len
│ │ (configured via MAX_QUEUE_LEN,
│ │ default: 100)
│ │
│◄───────────────────────│ 429 Too Many Requests
│ {"error": │
│ "queue full"} │
7. No Workers Available
Client Proxy Server
│ │
│ POST /v1/chat/completions
│ {"model": "llama3-8b"}│
│───────────────────────►│
│ │
│ │ No provider registered
│ │ for model "llama3-8b",
│ │ or no workers connected
│ │
│ │ If a provider exists but
│ │ no workers: request is queued
│ │ (will timeout after
│ │ QUEUE_TIMEOUT_SECS = 30)
│ │
│ │ If no provider matches at all:
│◄───────────────────────│ 404 Not Found
│ {"error": │
│ "no provider for │
│ model llama3-8b"} │
8. Graceful Shutdown / Worker Drain
Proxy Server Worker
│ │
│ (admin triggers drain │
│ or server shutting down)│
│ │
│ ServerToWorkerMessage::GracefulShutdown
│ { "type": "graceful_shutdown",
│ "reason": "maintenance",
│ "drain_timeout_secs": 30
│ }
│──────────────────────────►│
│ │
│ Worker marked is_draining│
│ No new requests sent │
│ │
│ Worker finishes in-flight│
│ requests normally... │
│ │
│ ResponseComplete(s) │
│◄──────────────────────────│
│ │
│ disconnect_drained_worker_if_idle():
│ all in-flight done? │
│ YES → close WebSocket │
│──── close frame ─────────►│
│ ×
9. Dynamic Model Update
Worker Proxy Server
│ │
│ (new model loaded or │
│ model removed locally) │
│ │
│ WorkerToServerMessage::ModelsUpdate
│ { "type": "models_update",
│ "models": ["llama3-8b",
│ "codellama-13b"],
│ "current_load": 1
│ }
│──────────────────────────►│
│ │ Proxy updates worker's
│ │ model list and routing
│ │
│ (or server requests it) │
│ │
│ ServerToWorkerMessage::ModelsRefresh
│ { "type": "models_refresh",
│ "reason": "periodic"
│ }
│◄──────────────────────────│
│ │
│ ModelsUpdate (response) │
│──────────────────────────►│
10. Queue Timeout
Client Proxy Server
│ │
│ POST /v1/chat/completions
│ {"model": "llama3-8b"}│
│───────────────────────►│
│ │
│ │ Provider exists but all
│ │ workers are busy (at
│ │ max_concurrent).
│ │ Request enters queue.
│ │
│ │ ┌── QUEUE_TIMEOUT_SECS (30) ──┐
│ │ │ waiting for a worker to │
│ │ │ become available... │
│ │ │ │
│ │ │ no worker picks up │
│ │ └──────────── timeout fires ───┘
│ │
│ │ Cancel with reason: "timeout"
│ │
│◄───────────────────────│ 504 Gateway Timeout
│ {"error": │
│ "queue timeout: │
│ no worker available │
│ within deadline"} │
The request never reaches a worker. The proxy removes it from the queue
and returns 504 to the client. No Cancel message is sent over WebSocket
because no worker was ever assigned.
11. Request Timeout (In-Flight)
Client Proxy Server Worker Backend
│ │ │ │
│ POST /v1/chat/completions │ │
│───────────────────────►│ │ │
│ │ → dispatched to worker │ │
│ │────────────────────────►│ │
│ │ │───────────────────►│
│ │ │ │
│ │ ┌── REQUEST_TIMEOUT_SECS (300) ──┐ │
│ │ │ waiting for ResponseComplete │ │
│ │ │ or streaming chunks... │ │
│ │ │ │ │
│ │ │ backend is still processing │ │
│ │ └──────────── timeout fires ──────┘ │
│ │ │ │
│ │ ServerToWorkerMessage::Cancel │
│ │ { "type": "cancel", │ │
│ │ "request_id": "r-003",│ │
│ │ "reason": "timeout" │ │
│ │ } │ │
│ │────────────────────────►│ │
│ │ │ │
│ │ │ Worker aborts │
│ │ │ backend request │
│ │ │───── abort ────────►│
│ │ │ │
│◄───────────────────────│ 504 Gateway Timeout │ │
│ {"error": │ │ │
│ "request timeout"} │ │ │
Unlike queue timeout, the request was dispatched to a worker, so the proxy
sends a Cancel message with reason "timeout" over the WebSocket. The
worker receives the cancellation and aborts the in-flight backend request.
The proxy returns 504 to the client.
Message Type Summary
Server → Worker (ServerToWorkerMessage)
| Type | Struct | Purpose |
|---|---|---|
register_ack | RegisterAck | Confirm registration, assign worker ID |
request | RequestMessage | Dispatch an inference request |
cancel | CancelMessage | Cancel an in-flight request |
ping | PingMessage | Heartbeat probe |
graceful_shutdown | GracefulShutdownMessage | Begin drain sequence |
models_refresh | ModelsRefreshMessage | Ask worker to re-report models |
Worker → Server (WorkerToServerMessage)
| Type | Struct | Purpose |
|---|---|---|
register | RegisterMessage | Announce name, models, capacity |
models_update | ModelsUpdateMessage | Update model list / load |
response_chunk | ResponseChunkMessage | Forward a streaming chunk |
response_complete | ResponseCompleteMessage | Signal request completion |
pong | PongMessage | Heartbeat reply with load |
error | ErrorMessage | Report a request-level error |