ModelRelay

Stop configuring clients for every GPU box. Workers connect out; requests route in.

You have GPU boxes running llama-server (or Ollama, or vLLM, or anything OpenAI-compatible). Today you either expose each one directly — port forwarding, DNS, firewall rules — or you stick a load balancer in front that doesn't understand LLM streaming or cancellation.

ModelRelay flips the model: a central proxy receives standard inference requests while worker daemons on your GPU boxes connect out to it over WebSocket. The proxy handles queueing, routing, streaming pass-through, and cancellation propagation. Clients see one stable endpoint and never need to know about your hardware.

  Clients (curl, Claude Code, LiteLLM, Open WebUI, ...)
         |
         |  POST /v1/chat/completions
         |  POST /v1/messages
         v
  +----------------------+
  |   modelrelay-server  |<--- workers connect out (WebSocket)
  |   (one stable        |     no inbound ports needed on GPU boxes
  |    endpoint)         |
  +----------------------+
         |  routes request to best available worker
         v
  +--------+  +--------+  +--------+
  |worker-1|  |worker-2|  |worker-3|
  | llama  |  | ollama |  | vllm   |  <- your GPU boxes,
  | server |  |        |  |        |    anywhere on any network
  +--------+  +--------+  +--------+

Hosted Version

Don't want to run the infrastructure yourself? A fully-managed hosted version is available at modelrelay.io — no server setup, no infrastructure to manage. Just get an API key, point your workers at it, and start routing requests. Same open protocol, zero ops burden.

Who is this for?

Home GPU users running local models who want a single API endpoint across multiple machines
Teams with on-prem hardware that need to pool GPU capacity without a service mesh
Researchers juggling models across heterogeneous boxes who are tired of updating client configs

Features

OpenAI + Anthropic compatible — POST /v1/chat/completions, POST /v1/responses, POST /v1/messages, GET /v1/models
No inbound ports on GPU boxes — workers connect out to the proxy over WebSocket
Request queueing — configurable depth and timeout when all workers are busy
Streaming pass-through — SSE chunks forwarded with preserved ordering and termination
End-to-end cancellation — client disconnect propagates through the proxy to the worker to the backend
Automatic requeue — if a worker dies mid-request, the request is requeued to another worker
Heartbeat and load tracking — stale workers are cleaned up; workers report current load
Graceful drain — workers can shut down while replacement workers pick up queued work
Cross-platform — pre-built binaries for Linux, macOS, and Windows (x86_64 + arm64)

Quick Start

The fastest way to get running is with Docker:

# 1. Run the proxy
docker run -p 8080:8080 \
  -e WORKER_SECRET=mysecret \
  -e LISTEN_ADDR=0.0.0.0:8080 \
  ghcr.io/ericflo/modelrelay/modelrelay-server:latest

# 2. Run a worker (on a GPU box with llama-server or similar)
docker run \
  -e PROXY_URL=http://<proxy-host>:8080 \
  -e WORKER_SECRET=mysecret \
  -e BACKEND_URL=http://host.docker.internal:8000 \
  -e MODELS=llama3.2:3b \
  ghcr.io/ericflo/modelrelay/modelrelay-worker:latest

# 3. Send a request
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

For more installation options (pre-built binaries, Docker Compose, building from source, systemd, Windows services), see the GitHub README.

Documentation

Architecture — System design, component overview, and data flow
Protocol Walkthrough — Wire-level protocol details with examples
Behavior Contract — The exact behavioral guarantees the system provides
Operational Runbook — Deployment, configuration, monitoring, and troubleshooting

Source & Contributing

ModelRelay is MIT-licensed and developed at github.com/ericflo/modelrelay. Bug reports, feature requests, and PRs are welcome — see CONTRIBUTING.md for details.