Relay is two HTTP listeners sharing one in-memory view of your catalog. The inference plane serves your traffic; the control plane serves the admin UI and CRUD. Neither touches Postgres on the request path — both read from a snapshot that’s kept fresh in the background.
[Figure: Relay two-plane architecture overview]

The two planes

Inference plane

Listens on RELAY_PORT. Stateless and hot-path-optimized. Serves /v1/* and /healthz, authenticates with relay keys, and routes each request to a healthy upstream. No vendor branching — dispatch is uniform.

Control plane

Listens on RELAY_CONTROL_PORT. Serves the admin UI, /auth/*, CRUD for every catalog kind, and operational endpoints. Authenticates with a session cookie or admin bearer.
Running them on separate ports means you can expose the inference plane to the internet while keeping the control plane on a private network.
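To make the split concrete, here is a minimal sketch of one process serving two listeners on separate ports. Only `RELAY_PORT`, `RELAY_CONTROL_PORT`, `/v1/*`, `/healthz`, and `/auth/*` come from the text above; the default port numbers, the bind addresses, and the helper names (`envOr`, `newInferenceMux`, `newControlMux`) are assumptions for illustration and are not Relay's actual code.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
)

// envOr returns the value of the environment variable key,
// or def when the variable is unset or empty.
func envOr(key, def string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return def
}

// notImplemented is a placeholder; real request handling is out of scope here.
func notImplemented(w http.ResponseWriter, r *http.Request) {
	http.Error(w, "not implemented in this sketch", http.StatusNotImplemented)
}

// newInferenceMux registers the hot-path routes named in the text.
func newInferenceMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	mux.HandleFunc("/v1/", notImplemented)
	return mux
}

// newControlMux registers a stand-in for the admin surface.
func newControlMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/auth/", notImplemented)
	return mux
}

func main() {
	// Assumed defaults: the inference plane binds publicly,
	// the control plane binds to loopback only.
	inferenceAddr := "0.0.0.0:" + envOr("RELAY_PORT", "8080")
	controlAddr := "127.0.0.1:" + envOr("RELAY_CONTROL_PORT", "8081")

	errc := make(chan error, 2)
	go func() { errc <- http.ListenAndServe(inferenceAddr, newInferenceMux()) }()
	go func() { errc <- http.ListenAndServe(controlAddr, newControlMux()) }()

	// Exit as soon as either listener stops.
	fmt.Fprintln(os.Stderr, <-errc)
	os.Exit(1)
}
```

Because each plane has its own mux and its own address, network policy can treat them independently, which is the property the paragraph above relies on.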

The snapshot is the spine

Postgres is the source of truth, but it’s never read on the request path. Instead, every pod holds an immutable in-memory snapshot of the catalog.
  1. A control-plane write lands in Postgres.
  2. The write fires a NOTIFY on a Postgres channel.
  3. Every pod’s listener receives it, rebuilds the snapshot with a copy-on-write reconciler, and atomically swaps it in — debounced to ~1 second.
The result: config changes propagate fleet-wide in about a second, and a request never pays a database round-trip to learn how to route.
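The swap in step 3 can be sketched with an atomic pointer. This is an illustration under stated assumptions, not Relay's reconciler: the `Snapshot` fields, the `Store` type, and `Apply` are hypothetical names, and the NOTIFY listener and the ~1 second debounce are left out. It also assumes a single reconciler goroutine calls `Apply`; concurrent writers would need extra coordination.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Snapshot is an immutable, read-optimized view of the catalog.
// The fields are hypothetical stand-ins for the real catalog contents.
type Snapshot struct {
	Version int
	Models  map[string]string // model name -> host (illustrative)
}

// Store holds the current snapshot behind an atomic pointer,
// so readers on the request path never take a lock.
type Store struct {
	cur atomic.Pointer[Snapshot]
}

// NewStore returns a Store holding an empty version-0 snapshot.
func NewStore() *Store {
	s := &Store{}
	s.cur.Store(&Snapshot{Models: map[string]string{}})
	return s
}

// Load returns the snapshot a request should use for its whole lifetime.
func (s *Store) Load() *Snapshot { return s.cur.Load() }

// Apply copies the old snapshot, layers the changes on top, and swaps the
// result in atomically. The old snapshot is never mutated, so in-flight
// requests keep a consistent view.
func (s *Store) Apply(changes map[string]string) {
	old := s.cur.Load()
	next := &Snapshot{
		Version: old.Version + 1,
		Models:  make(map[string]string, len(old.Models)+len(changes)),
	}
	for k, v := range old.Models {
		next.Models[k] = v
	}
	for k, v := range changes {
		next.Models[k] = v
	}
	s.cur.Store(next)
}

func main() {
	store := NewStore()
	before := store.Load() // a request that started before the write
	store.Apply(map[string]string{"model-a": "host-a"})
	after := store.Load() // a request that started after the swap

	fmt.Println(before.Version, after.Version, after.Models["model-a"]) // prints: 0 1 host-a
}
```

A request that loaded `before` continues to see the old catalog until it finishes, while new requests pick up `after`; neither path touches the database.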
[Figure: Snapshot fan-out via NOTIFY/LISTEN]

A request, end to end

[Figure: Inference request lifecycle]
1. Authenticate

The relay key’s hash is looked up in the snapshot. No match → 401.
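A minimal sketch of this lookup follows. The text does not say which hash Relay uses, so SHA-256 is an assumption here, as are the `Snapshot` and `KeyRecord` shapes and the `rk_demo` key value.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/http"
)

// KeyRecord is a hypothetical shape for what the snapshot stores per relay key.
type KeyRecord struct {
	ID       string
	PolicyID string
}

// Snapshot indexes key records by the hash of the relay key,
// so the raw key is never stored.
type Snapshot struct {
	KeysByHash map[string]KeyRecord
}

// hashKey returns the hex-encoded SHA-256 of the raw key.
// The choice of SHA-256 is an assumption for this sketch.
func hashKey(raw string) string {
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}

// Authenticate looks the key's hash up in the snapshot.
// It returns 200 with the record on a match and 401 otherwise.
func Authenticate(snap *Snapshot, rawKey string) (KeyRecord, int) {
	rec, ok := snap.KeysByHash[hashKey(rawKey)]
	if !ok {
		return KeyRecord{}, http.StatusUnauthorized
	}
	return rec, http.StatusOK
}

func main() {
	snap := &Snapshot{KeysByHash: map[string]KeyRecord{
		hashKey("rk_demo"): {ID: "key-1", PolicyID: "policy-1"},
	}}

	_, ok := Authenticate(snap, "rk_demo")
	_, bad := Authenticate(snap, "rk_wrong")
	fmt.Println(ok, bad) // prints: 200 401
}
```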
2. Route

The key’s policy confirms it grants the requested model, then the model resolves to a host binding — which carries the wire adapter (openai or anthropic) and the upstream model name.
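The two checks in this step can be sketched as a pair of map lookups against the snapshot. The type and field names (`Policy`, `HostBinding`, `Route`) and the error values are hypothetical; only the adapter values `openai` and `anthropic` and the idea of an upstream model name come from the text.

```go
package main

import (
	"errors"
	"fmt"
)

// HostBinding carries what the forwarder needs: where to send the request,
// which wire adapter to speak, and the model name the upstream expects.
type HostBinding struct {
	Host          string
	Adapter       string // "openai" or "anthropic"
	UpstreamModel string
}

// Policy lists the models a relay key may request (illustrative shape).
type Policy struct {
	Models map[string]bool
}

// Snapshot holds policies by ID and bindings by requested model name.
type Snapshot struct {
	Policies map[string]Policy
	Bindings map[string]HostBinding
}

var (
	ErrModelNotGranted = errors.New("policy does not grant the requested model")
	ErrNoBinding       = errors.New("model has no host binding")
)

// Route first confirms the policy grants the model, then resolves the binding.
func Route(snap *Snapshot, policyID, model string) (HostBinding, error) {
	policy, ok := snap.Policies[policyID]
	if !ok || !policy.Models[model] {
		return HostBinding{}, ErrModelNotGranted
	}
	binding, ok := snap.Bindings[model]
	if !ok {
		return HostBinding{}, ErrNoBinding
	}
	return binding, nil
}

func main() {
	snap := &Snapshot{
		Policies: map[string]Policy{"policy-1": {Models: map[string]bool{"model-a": true}}},
		Bindings: map[string]HostBinding{
			"model-a": {Host: "host-a", Adapter: "openai", UpstreamModel: "upstream-model-a"},
		},
	}

	b, err := Route(snap, "policy-1", "model-a")
	fmt.Println(b.Adapter, b.UpstreamModel, err) // prints: openai upstream-model-a <nil>
}
```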
3. Reserve and draw a key

Rate-limit budget is reserved (one Redis Lua call is the goal), and a healthy host key is drawn from the pool. Tripped circuit breakers are skipped; failover happens here, before any bytes flow.
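The key draw can be sketched as a scan over the pool that skips tripped breakers. This is a simplified stand-in: breaker state is an in-memory boolean here rather than the Redis-backed state described later, the rate-limit reservation is not shown, and `HostKey`, `DrawKey`, and the rotating `start` offset are hypothetical names and choices.

```go
package main

import (
	"errors"
	"fmt"
)

// HostKey is an upstream credential plus its circuit-breaker state.
// Tripped is an in-memory stand-in for the real breaker lookup.
type HostKey struct {
	ID      string
	Tripped bool
}

// ErrNoHealthyKey signals that every key in the pool is tripped,
// so the caller should fail over to another host or reject the request.
var ErrNoHealthyKey = errors.New("no healthy host key available")

// DrawKey scans the pool starting at offset start and returns the first key
// whose breaker is not tripped. Rotating start spreads load across keys.
func DrawKey(pool []HostKey, start int) (HostKey, error) {
	n := len(pool)
	if n == 0 {
		return HostKey{}, ErrNoHealthyKey
	}
	first := ((start % n) + n) % n // normalize, including negative offsets
	for i := 0; i < n; i++ {
		k := pool[(first+i)%n]
		if !k.Tripped {
			return k, nil
		}
	}
	return HostKey{}, ErrNoHealthyKey
}

func main() {
	pool := []HostKey{
		{ID: "key-a", Tripped: true},
		{ID: "key-b"},
		{ID: "key-c"},
	}

	k, err := DrawKey(pool, 0)
	fmt.Println(k.ID, err) // prints: key-b <nil>
}
```

Because the draw completes before the request is forwarded, skipping a tripped key costs the caller nothing visible, which matches the rule that failover happens before any bytes flow.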
4. Forward and stream

Relay forwards to the upstream and streams the response straight back. If the inbound and upstream shapes match, bytes pass through verbatim; otherwise each chunk is translated through Relay’s canonical protocol.
5. Settle (post-flight)

After the response finishes, a detached goroutine commits the rate-limit usage, records key success, and fires usage observers. This never blocks your response.
Relay does not fail over mid-stream. All failover across keys and hosts happens before the first byte reaches the caller. Once bytes flow, an upstream error is surfaced as-is.

Why it’s fast

  • No Postgres on the hot path — routing reads the in-memory snapshot only.
  • One Redis round-trip — rate-limit reservation is a single Lua call, not three trips.
  • Async everything off-path — usage, traces, and payload capture emit on bounded channels with drop-on-full, never blocking the response.
  • Byte-for-byte passthrough — when shapes match, Relay copies bytes rather than parsing and re-serializing.
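The drop-on-full behaviour in the third bullet maps onto a buffered channel with a non-blocking send. The sketch below shows that pattern only; `UsageEvent`, `Emitter`, and the drop counter are hypothetical names, and the consumer that would drain the channel is omitted.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// UsageEvent is a hypothetical off-path record emitted after a response.
type UsageEvent struct {
	KeyID  string
	Tokens int
}

// Emitter queues events on a bounded channel and counts what it drops.
type Emitter struct {
	ch      chan UsageEvent
	dropped atomic.Int64
}

// NewEmitter returns an Emitter whose queue holds at most capacity events.
func NewEmitter(capacity int) *Emitter {
	return &Emitter{ch: make(chan UsageEvent, capacity)}
}

// Emit never blocks: it enqueues the event if there is room and otherwise
// drops it and increments the counter. It reports whether the event was queued.
func (e *Emitter) Emit(ev UsageEvent) bool {
	select {
	case e.ch <- ev:
		return true
	default:
		e.dropped.Add(1)
		return false
	}
}

// Dropped returns how many events have been discarded so far.
func (e *Emitter) Dropped() int64 { return e.dropped.Load() }

func main() {
	e := NewEmitter(2) // no consumer is running, so the queue fills up

	for i := 0; i < 3; i++ {
		e.Emit(UsageEvent{KeyID: "key-1", Tokens: 10})
	}
	fmt.Println(e.Dropped()) // prints: 1
}
```

The trade-off is explicit: under sustained overload some telemetry is lost, but the response path never waits on a slow consumer.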
This keeps added latency to single-digit milliseconds at the p50 under real load, with each pod handling thousands of requests per second; you scale out by adding pods, not by tuning one.

Where state lives

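| Store              | Holds                                           | On the request path?          |
|--------------------|-------------------------------------------------|-------------------------------|
| Postgres           | Catalog truth: hosts, models, keys, policies    | No (only via the snapshot)    |
| In-memory snapshot | Read-optimized copy of the catalog              | Yes (every routing decision)  |
| Redis / Valkey     | Rate-limit counters, per-key circuit breakers   | Yes (one Lua call per request)|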