# Behavior Contract

This document captures the externally observable behavior contract for ModelRelay — the behaviors that must hold across versions and that users and contributors can rely on. The contract test suite in `modelrelay-contract-tests` is the automated expression of these requirements.
## Core Contract
- **Worker auth and registration**: Workers connect to `/v1/worker/connect?provider=<name>` over WebSocket and authenticate with a provider-specific worker secret. The preferred transport is the `X-Worker-Secret` header; the query-string secret fallback exists only for backward compatibility. Secret comparison is constant-time. Unknown providers are rejected, disabled providers are rejected, and repeated failed auth attempts are rate-limited by client IP.
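The constant-time comparison requirement can be sketched with Go's standard library; the helper name `checkWorkerSecret` is illustrative, not ModelRelay's actual API:

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// checkWorkerSecret compares a presented secret against the provider's
// configured secret without leaking timing information about how many
// leading bytes matched. subtle.ConstantTimeCompare returns 1 only when
// both slices have equal length and equal contents.
func checkWorkerSecret(presented, configured string) bool {
	return subtle.ConstantTimeCompare([]byte(presented), []byte(configured)) == 1
}

func main() {
	fmt.Println(checkWorkerSecret("s3cret", "s3cret")) // true
	fmt.Println(checkWorkerSecret("s3cret", "other"))  // false
}
```

A plain `==` on strings can short-circuit at the first differing byte, which is exactly the timing side channel this rule exists to avoid.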
- **Capability advertisement**: After connect, the worker sends a `register` message containing `worker_name`, `models`, `max_concurrent`, and `protocol_version`. The server may sanitize or truncate these values and must send `register_ack` with the accepted worker ID, accepted model list, and warnings. Legacy workers omitting `protocol_version` are tolerated in Katamari unless explicitly rejected by config; mismatched protocol versions are closed with a protocol error. The first Rust characterization harness makes that sanitization concrete by requiring the acked model list to trim whitespace, drop empty entries, de-duplicate exact duplicates while preserving first-seen order, and cap the accepted list at a provider-defined limit, with warnings surfaced in `register_ack`.
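The sanitization rules above (trim, drop empties, de-duplicate preserving first-seen order, cap with warnings) can be sketched directly; the function name and warning texts are assumptions, not the harness's exact output:

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeModels applies the acked-model rules: trim whitespace, drop
// empty entries, de-duplicate exact duplicates while preserving
// first-seen order, and cap the accepted list at maxModels. Anything
// dropped produces a human-readable warning for register_ack.
func sanitizeModels(advertised []string, maxModels int) (accepted, warnings []string) {
	seen := make(map[string]bool)
	for _, m := range advertised {
		m = strings.TrimSpace(m)
		switch {
		case m == "":
			warnings = append(warnings, "dropped empty model name")
		case seen[m]:
			warnings = append(warnings, "dropped duplicate model: "+m)
		case len(accepted) >= maxModels:
			warnings = append(warnings, "model list capped, dropped: "+m)
		default:
			seen[m] = true
			accepted = append(accepted, m)
		}
	}
	return accepted, warnings
}

func main() {
	accepted, warnings := sanitizeModels([]string{" gpt-a ", "gpt-a", "", "gpt-b", "gpt-c"}, 2)
	fmt.Println(accepted)      // [gpt-a gpt-b]
	fmt.Println(len(warnings)) // 3
}
```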
- **Model advertisement and worker selection**: Workers advertise exact model names, and the server routes only to workers that explicitly support the requested model. Katamari keeps an O(1) model-membership set per worker. Selection is "lowest load with round-robin tie breaking" among workers that support the model and can atomically reserve capacity.
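A minimal sketch of that selection rule, assuming a rotating counter for the round-robin tie break; the types and field names are illustrative, and a real implementation must also atomically reserve capacity on the chosen worker:

```go
package main

import "fmt"

// worker is a stand-in for a connected worker: its current load and an
// O(1) membership set of supported models.
type worker struct {
	name   string
	load   int
	models map[string]bool
}

// selector picks the lowest-load worker that supports the model,
// breaking ties round-robin via a rotating counter.
type selector struct{ rr int }

func (s *selector) pick(workers []*worker, model string) *worker {
	var candidates []*worker
	minLoad := -1
	for _, w := range workers {
		if !w.models[model] {
			continue // never route to a worker lacking the exact model
		}
		if minLoad == -1 || w.load < minLoad {
			minLoad = w.load
			candidates = candidates[:0]
		}
		if w.load == minLoad {
			candidates = append(candidates, w)
		}
	}
	if len(candidates) == 0 {
		return nil // no eligible worker; caller queues the request
	}
	s.rr++
	return candidates[s.rr%len(candidates)]
}

func main() {
	w1 := &worker{name: "a", load: 2, models: map[string]bool{"m": true}}
	w2 := &worker{name: "b", load: 1, models: map[string]bool{"m": true}}
	s := &selector{}
	fmt.Println(s.pick([]*worker{w1, w2}, "m").name) // b: lowest load wins
}
```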
- **Queueing when no worker is immediately available**: If no eligible worker can accept the request, the request is queued per virtual provider. The queue is bounded and FIFO among requests compatible with a worker's model list. Requests remain keyed by original queue time, so requeue does not grant infinite timeout extensions.
- **Request dispatch over WebSocket**: Requests are forwarded to workers as `request` messages with `request_id`, `model`, the raw JSON body string, selected compatibility headers, the target endpoint path, and `is_streaming`. The central proxy accepts ordinary provider-style HTTP requests and delegates only the worker-backed providers through this path. Compatibility-critical request headers include OpenAI-style `authorization`, `content-type`, and `openai-organization`, plus Anthropic-style `x-api-key`, `anthropic-version`, `anthropic-beta`, and `content-type`; incidental transport headers like `user-agent` are not part of the worker envelope contract.
- **Non-streaming response pass-through**: Workers reply with `response_complete` containing the final HTTP status, response headers, full body, and token counts. The proxy must forward status, headers, and body faithfully, including upstream 4xx and 5xx responses, rather than collapsing them into generic proxy errors.
- **Streaming chunk ordering and termination semantics**: Streaming responses are forwarded as `response_chunk` messages containing already-formatted SSE data and finish with `response_complete`. Chunks must preserve order. The HTTP side must flush promptly, retain streaming semantics, and treat completion metadata as the source of final status and token accounting. Katamari enforces a cumulative streaming size ceiling and emits an SSE error before terminating an oversized stream.
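The cumulative size ceiling can be sketched as a small guard the streaming path consults per chunk; the type name and ceiling value are illustrative:

```go
package main

import "fmt"

// streamGuard enforces a cumulative size ceiling across all chunks of
// one streaming response. Once the ceiling is crossed, the caller is
// expected to emit an SSE error event and terminate the stream.
type streamGuard struct {
	total   int
	ceiling int
}

// admit adds the chunk's size to the running total and reports whether
// the stream is still within the ceiling.
func (g *streamGuard) admit(chunk []byte) bool {
	g.total += len(chunk)
	return g.total <= g.ceiling
}

func main() {
	g := &streamGuard{ceiling: 10}
	fmt.Println(g.admit([]byte("123456"))) // true, total 6
	fmt.Println(g.admit([]byte("7890")))   // true, total 10
	fmt.Println(g.admit([]byte("x")))      // false, total 11
}
```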
- **Client cancellation propagation end to end**: Client disconnect or request timeout must cancel the HTTP request context, remove queued work if still queued, or send a best-effort `cancel` message for active worker requests. Late chunks that arrive after cancellation are intentionally dropped. The worker protocol has explicit cancel reasons, including client disconnect and timeout.
- **Worker disconnect during an active request**: On worker disconnect, active requests are examined one by one. If the request context is still alive, Katamari requeues it onto the provider queue without resetting its lifetime. If the request context is already canceled or timed out, the request fails immediately on the waiting client path instead. Requeue is capped at `MaxRequeueCount = 3`; after that the request fails with a service-unavailable style error instead of looping forever.
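The finite-requeue rule reduces to a small counter check; the helper name and error text below are illustrative, while the cap of 3 comes from the contract:

```go
package main

import (
	"errors"
	"fmt"
)

// maxRequeueCount mirrors Katamari's MaxRequeueCount = 3.
const maxRequeueCount = 3

var errRequeueExhausted = errors.New("request requeued too many times")

// tryRequeue increments a request's requeue counter, failing once the
// cap is exceeded so a request bouncing between dying workers cannot
// loop forever.
func tryRequeue(requeueCount int) (int, error) {
	if requeueCount >= maxRequeueCount {
		return requeueCount, errRequeueExhausted
	}
	return requeueCount + 1, nil
}

func main() {
	count := 0
	var err error
	for err == nil {
		count, err = tryRequeue(count)
		fmt.Println(count, err)
	}
}
```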
- **Timeout behavior**: Every provider has a request timeout used both for queue wait and for overall request lifetime. Queue timeout produces a worker-unavailable style response. Streaming and non-streaming requests share the parent HTTP context, so client disconnect and timeout terminate the same request object. WebSocket heartbeats use a ping every 15 seconds and a 45-second pong window.
- **Queue-full, no-workers, and provider-disabled error surfaces**: Katamari distinguishes bounded queue exhaustion, no worker capacity, disabled providers, deleted providers, timeout, and requeue exhaustion through dedicated error values. The public-facing HTTP layer currently sanitizes some internal errors into stable client messages such as "Service temporarily at capacity" and "Provider is currently disabled".
- **Heartbeat, load reporting, and stale-worker cleanup**: The server sends a JSON `ping`; workers reply with a JSON `pong` carrying current load. This heartbeat updates `last_heartbeat` and live load accounting. Workers may also send `models_update` when their local model catalog changes. Stale worker DB records are cleaned up periodically, and failed-auth rate-limit entries expire automatically.
- **Graceful shutdown and drain semantics**: The server can send `graceful_shutdown` to tell workers to stop accepting new work, finish current requests, and disconnect before a timeout. Provider deletion drains queued requests with an explicit provider-deleted error and closes connected workers.
- **OpenAI-style and Anthropic-style compatibility**: The central server is meant to accept ordinary client traffic, not a custom client. Katamari parses the model and stream flags from OpenAI-style request bodies, provides a special `/v1/models` compatibility endpoint, and preserves the SSE behavior expected by OpenAI-compatible tooling. The extracted Rust project should also preserve Anthropic-style compatibility at the central HTTP boundary even if the internal worker protocol stays provider-neutral.
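Extracting only the routing-relevant fields from an OpenAI-style body can be sketched with partial decoding; the function name is illustrative:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseModelAndStream lifts the model name and stream flag out of an
// OpenAI-style request body without validating or interpreting the
// rest of the payload, which stays opaque to the proxy.
func parseModelAndStream(body []byte) (model string, stream bool, err error) {
	var partial struct {
		Model  string `json:"model"`
		Stream bool   `json:"stream"`
	}
	if err := json.Unmarshal(body, &partial); err != nil {
		return "", false, err
	}
	return partial.Model, partial.Stream, nil
}

func main() {
	m, s, _ := parseModelAndStream([]byte(`{"model":"gpt-a","stream":true,"messages":[]}`))
	fmt.Println(m, s) // gpt-a true
}
```

Decoding into a struct with only `model` and `stream` fields ignores unknown keys by default, which is what keeps the proxy compatible with arbitrary provider payloads.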
## Wire Messages To Preserve
- Server to worker: `ping`, `request`, `register_ack`, `cancel`, `graceful_shutdown`, `models_refresh`
- Worker to server: `pong`, `register`, `models_update`, `response_chunk`, `response_complete`, `error`
## Invariants Worth Preserving
- A worker never silently gains capability beyond the sanitized models acknowledged by the server.
- Queueing is bounded per provider and does not grow without limit.
- Requeue is intentional and finite.
- HTTP error bodies from the worker backend are preserved where safe instead of flattened away.
- Streaming remains SSE-shaped end to end.
- Worker churn or late chunks must not leave requests hanging forever.
## Extension Points
When adding new behaviors, add a contract test in `modelrelay-contract-tests` before implementing. This keeps the test suite as the primary specification. If a behavior described above is not yet covered by an automated test, that gap is the highest-priority work item.