The inference plane is the customer-facing data plane. It listens on
RELAY_PORT (default 8080) and speaks OpenAI- and Anthropic-shape wire
protocols, plus Relay’s own provider-neutral canonical shape.
Every inference request authenticates with a relay key as a bearer token:
    Authorization: Bearer <relay-key>
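As a minimal sketch of attaching the header from a client, the snippet below builds (but does not send) a request with Python's standard library. The path /v1/chat/completions, the key value, and the model name are assumptions for illustration; only the bearer header format and the default port come from the text above.

```python
import json
import urllib.request


def build_inference_request(base_url: str, relay_key: str, body: dict) -> urllib.request.Request:
    """Build a POST request that authenticates with a relay key as a bearer token."""
    return urllib.request.Request(
        # Assumption: an OpenAI-shape chat route; check the path your deployment exposes.
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {relay_key}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical key and model name, for illustration only.
req = build_inference_request(
    "http://localhost:8080",
    "rk_example",
    {"model": "example-model", "messages": []},
)
```

Passing the returned object to urllib.request.urlopen would send it; that step is left out so the sketch runs without a live Relay instance.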
Relay keys are minted in the admin UI or via the control plane
(POST /relay-keys). The plaintext is shown exactly once, at creation.
Relay stores only sha256(plaintext), so a lost key cannot be recovered.
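The hash-only storage scheme can be illustrated with a short sketch. This is not Relay's implementation: the "rk_" prefix, the key length, and the function names are hypothetical. It only shows how a service can verify a presented key while persisting nothing but its SHA-256 digest.

```python
import hashlib
import hmac
import secrets


def mint_relay_key() -> tuple[str, str]:
    """Return (plaintext, digest). Only the digest would be persisted."""
    plaintext = "rk_" + secrets.token_urlsafe(32)  # hypothetical key format
    digest = hashlib.sha256(plaintext.encode("utf-8")).hexdigest()
    return plaintext, digest


def verify_relay_key(presented: str, stored_digest: str) -> bool:
    """Hash the presented key and compare digests in constant time."""
    candidate = hashlib.sha256(presented.encode("utf-8")).hexdigest()
    return hmac.compare_digest(candidate, stored_digest)
```

Because only the digest is kept, the plaintext has to be captured when it is first shown.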
Responses stream byte-for-byte from the upstream when the inbound shape
matches the upstream shape. Cross-shape requests (e.g. OpenAI in,
Anthropic upstream) are translated per chunk through Relay’s canonical
protocol.
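The two streaming paths can be sketched as a single generator. The function name, the shape strings, and the two translator callables are hypothetical stand-ins; the sketch shows only the decision described above, passthrough when shapes match and per-chunk translation through a canonical form otherwise.

```python
from typing import Callable, Iterable, Iterator


def relay_stream(
    upstream_chunks: Iterable[bytes],
    inbound_shape: str,
    upstream_shape: str,
    to_canonical: Callable[[bytes], dict],
    from_canonical: Callable[[dict], bytes],
) -> Iterator[bytes]:
    """Forward upstream chunks, translating only when the shapes differ."""
    if inbound_shape == upstream_shape:
        # Same shape: bytes are forwarded unchanged.
        yield from upstream_chunks
        return
    for chunk in upstream_chunks:
        # Cross-shape: upstream chunk -> canonical form -> inbound shape.
        yield from_canonical(to_canonical(chunk))


# Toy translators, for illustration only.
def toy_to_canonical(chunk: bytes) -> dict:
    return {"text": chunk.decode("utf-8")}


def toy_from_canonical(event: dict) -> bytes:
    return ("translated:" + event["text"]).encode("utf-8")
```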
The model field is resolved against your catalog. A model is reachable
only if a policy grants it to your relay key and the model has an
enabled host binding with a healthy host key. List what your key can reach:
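As an illustration of the reachability rule, the sketch below filters a hypothetical in-memory catalog. The record types and field names are assumptions made for this example, not Relay's schema or API; the predicate mirrors the three conditions stated above (policy grant, enabled binding, healthy host key).

```python
from dataclasses import dataclass, field


@dataclass
class HostBinding:
    enabled: bool
    healthy_key_count: int  # hypothetical field: healthy host keys behind this binding


@dataclass
class Model:
    name: str
    bindings: list[HostBinding] = field(default_factory=list)


def reachable_models(granted: set[str], catalog: list[Model]) -> list[str]:
    """Names of models the policy grants that also have an enabled, healthy binding."""
    return [
        m.name
        for m in catalog
        if m.name in granted
        and any(b.enabled and b.healthy_key_count > 0 for b in m.bindings)
    ]
```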
Relay key’s policy does not grant the requested model.
404        Unknown model, or no enabled binding for it.
429        Rate limit reached (relay-side or upstream).
502 / 503  No healthy key in the pool, or upstream unreachable. See Troubleshooting.
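A client might map these status codes to their documented meanings as in the sketch below. The descriptions come from the list above. The retry classification is an assumption about reasonable client behavior, not something Relay prescribes.

```python
# Meanings taken from the status list above.
RELAY_ERRORS = {
    404: "Unknown model, or no enabled binding for it.",
    429: "Rate limit reached (relay-side or upstream).",
    502: "No healthy key in the pool, or upstream unreachable.",
    503: "No healthy key in the pool, or upstream unreachable.",
}

# Assumption: retrying these with backoff is a client-side choice, not documented behavior.
RETRYABLE = {429, 502, 503}


def classify(status: int) -> tuple[str, bool]:
    """Return (meaning, worth_retrying) for an HTTP status code."""
    meaning = RELAY_ERRORS.get(status, "Status not covered by this mapping.")
    return meaning, status in RETRYABLE
```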
Relay does not fail over mid-stream. Failover across keys and hosts
happens before the first byte reaches you. Once bytes flow, an
upstream error is returned as-is.
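The "fail over before the first byte, never after" behavior can be modeled with a small generator. This is a sketch of the semantics only, not Relay's code; UpstreamError, the candidate callables, and the function name are hypothetical.

```python
from typing import Callable, Iterable, Iterator


class UpstreamError(Exception):
    """Hypothetical error raised when an upstream fails."""


_EMPTY = object()  # sentinel for an upstream that ends without sending anything


def stream_with_failover(
    candidates: Iterable[Callable[[], Iterator[bytes]]],
) -> Iterator[bytes]:
    """Try candidates in order until one yields a first chunk, then commit to it."""
    last_error = None
    for open_stream in candidates:
        try:
            stream = iter(open_stream())
            first = next(stream, _EMPTY)
        except UpstreamError as exc:
            # Nothing has reached the caller yet, so the next candidate can be tried.
            last_error = exc
            continue
        if first is _EMPTY:
            return
        yield first
        # Committed: errors raised from here on propagate to the caller unchanged.
        yield from stream
        return
    raise UpstreamError("no candidate produced a first chunk") from last_error
```

In this model, a failure while opening a stream is hidden from the caller as long as another candidate works, whereas a failure after the first chunk surfaces directly. A client that needs the full response would have to reissue the request itself.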