Introduction

Relay is a high-throughput router that puts one endpoint in front of every LLM provider. You point your app at Relay, send OpenAI- or Anthropic-shaped requests, and Relay handles provider selection, key pooling, failover, and rate limits behind a single bearer token. The result: your code talks to one stable API, while the messy parts — juggling provider keys, surviving a dead key, staying under rate limits — move out of your app and into Relay.

Why it exists

A single provider API key has a fixed rate limit and a single point of failure. The moment you run real traffic you end up writing the same glue every time: rotate between keys, retry on 429, fall back when one provider degrades, keep per-model limits straight. Relay is that glue, extracted into a service and made operable.

Higher effective throughput

Pool many provider keys behind one relay key. Limits add up instead of capping you at a single key’s ceiling.

Failover by default

Per-key circuit breakers route around dead or throttled keys without your app noticing.

One wire shape

OpenAI- and Anthropic-compatible endpoints. Keep your existing SDK; just change the base URL.

Operable

An admin UI and Control API for hosts, keys, and policies — not a config file you redeploy to change.

The mental model

A handful of catalog nouns carry the whole system. Once these click, the reference pages read straight through.

Hosts

The upstream endpoints Relay routes to — a provider’s API surface, like OpenAI or Anthropic. A host defines the wire shape Relay speaks to it.

Models

Catalog entries bound to a host. The model field in a request resolves against the catalog; a model is reachable only when it has an enabled host binding behind it.

Host keys

Your real upstream provider credentials, held by Relay. Many host keys for the same host form a pool; Relay spreads traffic across them and breaks the circuit on any that fail.

Relay keys

The bearer tokens your apps use. A relay key never exposes the underlying host keys — it’s an indirection you can scope, rate-limit, and revoke on its own.

Policies

Rules that decide which models a relay key may reach. Policies are how you grant one key just gpt-4o and another the whole catalog.

Rate limits

Limits you attach to keys and policies, enforced by Relay before a request ever leaves for the upstream.

A request, end to end: your app sends an OpenAI- or Anthropic-shaped call with a relay key → the key’s policy confirms it grants the requested model → Relay resolves the model to its host binding and draws a healthy host key from the pool → it applies any rate limits and forwards the request → the response streams back in the same shape you sent.

Where to go next

Quickstart

First request through Relay in about two minutes.

Configuration

Every RELAY_* environment variable and runtime setting.

Inference API

Endpoints, wire shapes, streaming, and error codes.

Control API

Manage hosts, keys, policies, and relay keys.

Quickstart

Get Started

Concepts

Reference

Why it exists

Higher effective throughput

Failover by default

One wire shape

Operable

The mental model

Where to go next

Quickstart

Configuration

Inference API

Control API

​Why it exists

Higher effective throughput

Failover by default

One wire shape

Operable

​The mental model

​Where to go next

Quickstart

Configuration

Inference API

Control API

Why it exists

The mental model

Where to go next