Building a Production-Grade LLM Gateway (95+): Architecture, Governance, Resiliency, and Observability
A production LLM Gateway is not “a multi-provider SDK wrapper.”
The real moat is governance + tenancy + resiliency + observability—with streaming and tool calling that don’t leak provider quirks.
This post lays out a battle-tested architecture for an LLM Gateway that supports AWS Bedrock, Azure OpenAI, Anthropic, OpenAI, Vertex AI, and Gemini, with:

- Strict Mode (server-enforced model overrides)
- Weighted routing + intelligent failover
- Canonical streaming + canonical tool schema
- Circuit breakers + cooldowns + backoff
- OpenTelemetry-first observability
- Multi-tenant budgets + quotas + virtual keys
- Optional cache layers (exact + semantic)
If you’ve looked at gateways like Portkey / Bifrost, the shape below is aligned with what “production” typically means: a control plane + policy + telemetry + reliability, not just routing.
Table of Contents
- 1. The Problem Space
- 2. Design Goals
- 3. Architecture Overview
- 4. Request Flow (Step-by-Step)
- 5. Core Components
- 5.1 Gateway API
- 5.2 Auth + Tenant Context
- 5.3 Policy Engine (Strict Mode)
- 5.4 Router (Weighted + Health + Cost)
- 5.5 Provider Abstraction
- 5.6 Universal Streaming Adapter
- 5.7 Canonical Tool Schema
- 5.8 Error Taxonomy
- 6. Resiliency Patterns
- 7. Observability (OpenTelemetry-First)
- 8. Cost Controls: Budgets, Quotas, and Attribution
- 9. Caching (Optional, but Often Huge)
- 10. Deployment Checklist
- 11. Testing + Chaos
- 12. Results + Measurement Methodology
- 13. Conclusion
1. The Problem Space
Modern LLM apps hit the same wall:
- Vendor lock-in risk: Switching providers late is painful and expensive.
- Reliability: Provider outages, throttling, regional issues, intermittent 5xx.
- Cost: Pricing and throughput vary widely; smart routing can materially reduce spend.
- Rate limits: RPM/TPM/quotas differ; naive retry strategies amplify failure.
- Model churn: Model IDs change, versions deprecate, capabilities differ.
- Streaming differences: Each provider streams differently; the UI expects one contract.
- Tool calling differences: Schema mismatches break compatibility across vendors.
- Governance: Real production needs policy enforcement and auditability.
- Tenancy: Budgets, per-tenant quotas, and key separation are not “nice to have.”
2. Design Goals
A “95+/100” gateway does these well:
- Provider-agnostic app interface (no vendor quirks leaking to app layer)
- Policy & governance first (Strict Mode, allowlists, tool restrictions)
- Multi-tenant by design (virtual keys, budgets, quotas, RBAC)
- Resiliency beyond retries (circuit breakers, cooldowns, hedging optional)
- Observability standards (OpenTelemetry traces + metrics + structured logs)
- Streaming is a first-class contract
- Cost attribution per request with model/provider tagging
3. Architecture Overview
```mermaid
graph TB
    subgraph Client
        A[App / UI / Services]
    end

    subgraph Gateway
        G1[API Layer<br/>HTTP/SSE]
        G2[Auth + Tenant Context]
        G3[Policy Engine<br/>Strict Mode, Allow/Block, Tools Policy]
        G8[Optional Cache<br/>Exact + Semantic]
        G4[Router<br/>Weighted + Health + Cost + SLO]
        G5[Resiliency Layer<br/>CB + Backoff + Cooldown + Hedging]
        G6[Provider Abstraction<br/>Unified Interface]
        G7[Telemetry<br/>OTel Traces + Metrics + Logs]
    end

    subgraph Providers
        P1[AWS Bedrock]
        P2[Azure OpenAI]
        P3[Anthropic Direct]
        P4[OpenAI Direct]
        P5[Vertex AI]
        P6[Gemini Direct]
    end

    A --> G1 --> G2 --> G3 --> G8 --> G4 --> G5 --> G6
    G6 --> P1
    G6 --> P2
    G6 --> P3
    G6 --> P4
    G6 --> P5
    G6 --> P6
    G1 -.-> G7
    G2 -.-> G7
    G3 -.-> G7
    G4 -.-> G7
    G5 -.-> G7
    G6 -.-> G7
    G8 -.-> G7
```
Key idea: Policy sits before routing. The router cannot be trusted to “do the right thing” unless governance already constrained the space.
4. Request Flow (Step-by-Step)
```mermaid
sequenceDiagram
    participant Client
    participant API as Gateway API
    participant Auth as Auth/Tenant
    participant Policy as Policy Engine
    participant Cache as Cache (optional)
    participant Router as Router
    participant Res as Resiliency
    participant Prov as Provider
    participant OTel as Telemetry

    Client->>API: POST /v1/chat.completions (stream=true)
    API->>OTel: start trace/span + requestId
    API->>Auth: validate key/JWT, resolve tenant/project
    Auth-->>API: tenantCtx + limits + policyRefs
    API->>Policy: enforce Strict Mode, allowlists, tools policy
    Policy-->>API: normalizedRequest (server-approved)
    API->>Cache: lookup (exact/semantic) if eligible

    alt cache hit
        Cache-->>API: cached stream/response
        API-->>Client: stream tokens
    else cache miss
        Cache-->>API: miss
        API->>Router: pick candidate providers (weights + health + cost)
        Router-->>API: ordered candidates
        API->>Res: execute with CB/backoff/cooldown
        Res->>Prov: call generateStream()
        Prov-->>Client: stream tokens (normalized chunks)
    end

    API->>OTel: emit metrics (latency, tokens, provider, cost)
    API-->>Client: final stopReason + usage
```
5. Core Components
5.1 Gateway API
You want a single gateway surface that’s stable for apps:
- `POST /v1/chat.completions` (OpenAI-ish)
- `POST /v1/responses` (optional)
- SSE streaming contract for UI
- Optional: `POST /v1/embeddings` later
Non-negotiable: return a canonical stream chunk format even if providers differ.
5.2 Auth + Tenant Context
Production gateways are almost always multi-tenant.
Common patterns:

- Virtual keys: one key per tenant/project that maps to provider keys internally
- RBAC: which models/tools a tenant can use
- Rate limits: requests/min + tokens/min per tenant/project
```mermaid
graph LR
    A[Incoming API Key/JWT] --> B[Auth Service]
    B --> C[Tenant Resolver]
    C --> D[Policy Ref]
    C --> E[Quota/Budget Ref]
    C --> F[Provider Key Vault Ref]
```
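A minimal sketch of the resolution step (the `TenantContext` shape, the `keyStore` lookup, and the `vk_live_abc123` key are illustrative; production would back this with a database and vault references):

```typescript
// Hypothetical virtual-key resolution: one opaque key maps to a tenant
// context carrying references to policy, quotas, and provider credentials.
type TenantContext = {
  tenantId: string;
  projectId: string;
  policyRef: string;
  quotaRef: string;
  providerKeyVaultRef: string;
};

// Illustrative in-memory store; a real gateway would query a database.
const keyStore: Record<string, TenantContext> = {
  vk_live_abc123: {
    tenantId: "acme",
    projectId: "chatbot",
    policyRef: "policy/acme-default",
    quotaRef: "quota/acme-chatbot",
    providerKeyVaultRef: "vault/acme",
  },
};

// Fail closed: an unknown key is an AUTH_ERROR, never a pass-through.
function resolveTenant(virtualKey: string): TenantContext {
  const ctx = keyStore[virtualKey];
  if (!ctx) throw new Error("AUTH_ERROR: unknown virtual key");
  return ctx;
}
```

The important property is that the client never holds provider credentials; it holds only the virtual key, and everything else is resolved server-side.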
5.3 Policy Engine (Strict Mode)
This is where your gateway becomes “enterprise-grade”.
Policy precedence (recommended)
| Priority | Source | Rule |
|---|---|---|
| 1 | STRICT_MODE (server) | Overrides client model/provider/tool choices |
| 2 | Tenant policy | Allowlists/denylists, tool restrictions, max tokens |
| 3 | Client request | Only within allowed policy bounds |
| 4 | Defaults | Safe fallback |
Strict Mode behavior
- If `STRICT_MODE=true`, ignore the client-provided `model` (and optionally ignore provider hints).
- Route according to the server-owned model group / distribution.

```
STRICT_MODE=true
STRICT_MODEL_GROUP=prod-safe
STRICT_MAX_TOKENS=2000
STRICT_TEMPERATURE=0.7
STRICT_DISALLOW_TOOLS=false
```
```mermaid
flowchart TD
    R[Client Request] --> P{STRICT_MODE?}
    P -->|Yes| S[Force server model group<br/>override model/tools/maxTokens]
    P -->|No| N[Validate within tenant policy<br/>allowlists + tool policy]
    S --> O[Normalized, server-approved request]
    N --> O
```
Code sketch (conceptual):
```typescript
type NormalizedRequest = {
  modelGroup: string; // e.g., "prod-safe"
  resolvedModelHint?: string; // optional, for non-strict mode
  toolsAllowed: boolean;
  maxTokens: number;
  temperature: number;
};

function applyPolicy(req: any, tenantPolicy: any, env: any): NormalizedRequest {
  if (env.STRICT_MODE === "true") {
    return {
      modelGroup: env.STRICT_MODEL_GROUP ?? "prod-safe",
      toolsAllowed: env.STRICT_DISALLOW_TOOLS !== "true",
      maxTokens: Number(env.STRICT_MAX_TOKENS ?? 2000),
      temperature: Number(env.STRICT_TEMPERATURE ?? 0.7),
    };
  }
  return {
    modelGroup: tenantPolicy.defaultModelGroup,
    resolvedModelHint: req.model,
    toolsAllowed: tenantPolicy.toolsAllowed && !!req.tools?.length,
    maxTokens: Math.min(req.max_tokens ?? 4000, tenantPolicy.maxTokens),
    temperature: Math.max(0, Math.min(req.temperature ?? 0.7, 2.0)),
  };
}
```
5.4 Router (Weighted + Health + Cost)
Weighted routing is table stakes. What makes it production-grade is:
weight × health × SLO × cost.
Example distribution
```
ENABLE_GATEWAY_MODE=true
GATEWAY_DISTRIBUTION=bedrock-sonnet-4.5:4,vertex-gemini-3:2,openai-direct-5.2:2,azure-openai-5.2:1,anthropic-direct:1
```
Recommended router behavior
- Start from policy-approved model group → list allowed providers/models
- Filter by:
  - circuit breaker state (OPEN → excluded)
  - hard tenant restrictions
  - regional constraints
- Compute selection weights:
  - base weight from config
  - multiplied by health score (recent success rate)
  - optionally adjusted by cost band for the request type
- Return ordered candidates for resiliency executor
```mermaid
graph TD
    A[Normalized Request] --> B[Allowed Provider Pool]
    B --> C[Filter: Policy + Tenant + Region]
    C --> D[Filter: Circuit Breaker OPEN]
    D --> E[Score: Weight x Health x SLO x Cost]
    E --> F[Ordered Candidate List]
```
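The filter-and-score steps above can be sketched as a pure function over candidate providers (the `Candidate` shape and health-score field are illustrative; a real router would also factor in SLO and cost bands):

```typescript
// A candidate carries its configured base weight, a rolling health score
// (recent success rate, 0..1), and its circuit breaker state.
type Candidate = { name: string; weight: number; health: number; cbOpen: boolean };

// Exclude OPEN breakers first, then rank by weight x health, descending.
// The ordered list feeds the resiliency executor as failover candidates.
function orderCandidates(pool: Candidate[]): string[] {
  return pool
    .filter((c) => !c.cbOpen)
    .map((c) => ({ name: c.name, score: c.weight * c.health }))
    .sort((a, b) => b.score - a.score)
    .map((c) => c.name);
}
```

Note that a degraded provider (low health) can fall below a lower-weighted healthy one, which is exactly the behavior you want during a partial outage.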
5.5 Provider Abstraction
Canonical interface:
```typescript
export interface ILLMProvider {
  generate(request: NormalizedProviderRequest): Promise<LLMResponse>;
  generateStream(request: NormalizedProviderRequest): AsyncIterable<StreamChunk>;
  getProviderInfo(): ProviderInfo;
  testConnection?(): Promise<boolean>;
}
```
Providers implement:

- request conversion (canonical → provider-specific)
- response normalization (provider-specific → canonical)
- streaming normalization (provider-specific → canonical chunks)
- error normalization
5.6 Universal Streaming Adapter
Return a single gateway streaming contract.
```typescript
export type StreamChunk = {
  content: string;
  role: "assistant";
  stopReason?: string;
  usage?: { inputTokens?: number; outputTokens?: number; totalTokens?: number };
};
```
Implementation pattern:
```typescript
// Inside a provider class implementing ILLMProvider:
async *generateStream(req: NormalizedProviderRequest): AsyncIterable<StreamChunk> {
  const raw = await this.providerSpecificStream(req);
  for await (const rawChunk of raw) {
    const chunk = this.normalizeStreamChunk(rawChunk);
    if (chunk) yield chunk; // never buffer whole output
  }
}
```
Rule: The app should never see Bedrock event types, OpenAI SSE deltas, or Gemini chunk formats.
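A consumer of this contract only ever touches `StreamChunk` values, regardless of provider. A minimal sketch, with a fake stream standing in for a provider (the `collect` helper and `fakeStream` generator are illustrative):

```typescript
// Mirrors the canonical contract above (usage field omitted for brevity).
type StreamChunk = { content: string; role: "assistant"; stopReason?: string };

// Accumulate canonical chunks into a final answer; a real UI would
// render each chunk as it arrives instead of concatenating.
async function collect(stream: AsyncIterable<StreamChunk>): Promise<string> {
  let out = "";
  for await (const chunk of stream) out += chunk.content;
  return out;
}

// Stand-in for any provider's normalized generateStream() output.
async function* fakeStream(): AsyncIterable<StreamChunk> {
  yield { content: "Hel", role: "assistant" };
  yield { content: "lo", role: "assistant", stopReason: "end_turn" };
}
```

Because the consumer is written against `AsyncIterable<StreamChunk>`, swapping Bedrock for OpenAI under the hood changes nothing above this line.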
5.7 Canonical Tool Schema
Define one canonical schema and translate per provider.
```typescript
export type UniversalTool = {
  name: string;
  description: string;
  input_schema: {
    type: "object";
    properties: Record<string, any>;
    required?: string[];
    additionalProperties?: boolean;
  };
};
```
Provider transforms:
- Anthropic: tools: [{ name, description, input_schema }]
- OpenAI: tools: [{ type:"function", function:{ name, description, parameters } }]
- Gemini: function_declarations format
Rule: tools policy should live in Policy Engine (disable, allowlist, schema size, etc.)
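For example, the canonical-to-OpenAI transform is a small reshaping function (a sketch; as noted above, Anthropic's format already matches the canonical shape, so its transform is the identity):

```typescript
// Canonical tool shape, as defined earlier in the post.
type UniversalTool = {
  name: string;
  description: string;
  input_schema: { type: "object"; properties: Record<string, any>; required?: string[] };
};

// OpenAI nests the schema under function.parameters and wraps it in
// a { type: "function" } envelope; nothing is lost in the mapping.
function toOpenAITool(t: UniversalTool) {
  return {
    type: "function",
    function: { name: t.name, description: t.description, parameters: t.input_schema },
  };
}
```

Keeping these transforms tiny and lossless is what makes the canonical schema viable: every provider mapping should be reversible in principle.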
5.8 Error Taxonomy
Retries only work when errors are categorized correctly.
Recommended canonical error types:
- RATE_LIMIT (429, includes retryAfter, rpm/tpm/quota)
- TIMEOUT
- PROVIDER_5XX
- BAD_REQUEST (4xx non-429)
- AUTH_ERROR
- POLICY_VIOLATION
- UNKNOWN
```typescript
export type GatewayError = {
  type: "RATE_LIMIT" | "TIMEOUT" | "PROVIDER_5XX" | "BAD_REQUEST" | "AUTH_ERROR" | "POLICY_VIOLATION" | "UNKNOWN";
  provider?: string;
  model?: string;
  retryAfterMs?: number;
  canRetry: boolean;
  details?: any;
};
```
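A sketch of mapping HTTP status codes into this taxonomy, simplified to the fields a retry loop needs (`TIMEOUT` and `POLICY_VIOLATION` are raised elsewhere in the pipeline, not from HTTP status):

```typescript
// Subset of the canonical GatewayError for status-code classification.
type GatewayError = {
  type: "RATE_LIMIT" | "PROVIDER_5XX" | "BAD_REQUEST" | "AUTH_ERROR" | "UNKNOWN";
  retryAfterMs?: number;
  canRetry: boolean;
};

function classifyHttpError(status: number, retryAfterMs?: number): GatewayError {
  if (status === 429) return { type: "RATE_LIMIT", retryAfterMs, canRetry: true };
  if (status === 401 || status === 403) return { type: "AUTH_ERROR", canRetry: false };
  if (status >= 500) return { type: "PROVIDER_5XX", canRetry: true };
  if (status >= 400) return { type: "BAD_REQUEST", canRetry: false }; // 4xx non-429: our fault, retrying won't help
  return { type: "UNKNOWN", canRetry: false };
}
```

The `canRetry` flag is the whole point: the resiliency layer never inspects raw status codes, only the classified error.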
6. Resiliency Patterns
Retries alone are not resiliency. You need circuit breakers and cooldowns to avoid stampedes.
Circuit breaker state machine
```mermaid
stateDiagram-v2
    [*] --> Closed
    Closed --> Open: failure rate > threshold
    Open --> HalfOpen: cooldown elapsed
    HalfOpen --> Closed: success count met
    HalfOpen --> Open: failure occurs
```
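A minimal in-memory breaker matching the state machine above might look like this (the threshold, cooldown, and injected clock are illustrative defaults, not recommendations):

```typescript
type CBState = "CLOSED" | "OPEN" | "HALF_OPEN";

class CircuitBreaker {
  private state: CBState = "CLOSED";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold = 5,
    private cooldownMs = 30_000,
    private now: () => number = Date.now // injectable clock for testing
  ) {}

  // Open -> HalfOpen transition happens lazily when the cooldown elapses;
  // while OPEN, the router excludes this provider entirely.
  canRequest(): boolean {
    if (this.state === "OPEN" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "HALF_OPEN"; // allow a single probe request
    }
    return this.state !== "OPEN";
  }

  onSuccess(): void {
    this.failures = 0;
    this.state = "CLOSED";
  }

  onFailure(): void {
    this.failures++;
    // A HALF_OPEN probe that fails re-opens immediately.
    if (this.state === "HALF_OPEN" || this.failures >= this.failureThreshold) {
      this.state = "OPEN";
      this.openedAt = this.now();
    }
  }
}
```

A production version would track a failure *rate* over a sliding window rather than a raw count, and require several HALF_OPEN successes before closing.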
Recommended execution strategy
- Fast failover on:
  - 5xx spikes
  - timeouts
  - 429 with `retryAfter` too large
- Backoff only when retryAfter is short and user experience is acceptable
- Cooldown a provider after repeated throttles
- Optional: hedged requests for high-priority traffic (careful: increases cost)
```mermaid
flowchart TD
    A[Candidate Providers] --> B[Try Provider #1]
    B --> C{Success?}
    C -->|Yes| R[Return]
    C -->|No| D[Classify Error]
    D --> E{Can retry quickly?}
    E -->|Yes| F[Backoff / Respect Retry-After]
    E -->|No| G[Trip CB / Cooldown]
    F --> H[Try next provider]
    G --> H
    H --> C
```
7. Observability (OpenTelemetry-First)
If you can’t see it, you can’t run it.
What to emit on every request
Traces
- span: gateway.request
- attributes: tenantId, projectId, provider, model, stream, toolsUsed
Metrics

- request count by provider/model/tenant
- error count by type/provider
- latency p50/p95/p99
- token usage (input/output/total)
- estimated cost

Logs

- structured logs at boundaries only (avoid logging raw prompts unless explicitly enabled and scrubbed)
```mermaid
graph LR
    A[Gateway] --> B[OTel Collector]
    B --> C[Tracing: Tempo/Jaeger]
    B --> D[Metrics: Prometheus]
    B --> E[Logs: Loki/ELK]
```
8. Cost Controls: Budgets, Quotas, and Attribution
A real gateway needs guardrails:
- Budgets: monthly spend caps per tenant/project
- Quotas: RPM/TPM caps per tenant/project
- Hard stops: block requests when budget is exceeded
- Soft alerts: warn at 70/85/95%
Recommended tables (conceptual):
- `tenant_budgets(tenantId, month, usdLimit, usdUsed)`
- `tenant_quotas(tenantId, rpmLimit, tpmLimit)`
- `request_costs(requestId, provider, model, inputTokens, outputTokens, costUsd)`
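The budget guardrail reduces to a small decision function (a conceptual sketch; the `Budget` shape is illustrative, and the alert tiers follow the 70/85/95% soft-alert thresholds above):

```typescript
type Budget = { usdLimit: number; usdUsed: number };

// Hard stop at or above 100% of budget; otherwise allow, attaching the
// highest soft-alert tier that has been crossed, if any.
function budgetDecision(b: Budget): { allow: boolean; alert?: number } {
  const pct = (b.usdUsed / b.usdLimit) * 100;
  if (pct >= 100) return { allow: false };
  const tier = [95, 85, 70].find((t) => pct >= t);
  return tier !== undefined ? { allow: true, alert: tier } : { allow: true };
}
```

In practice this check runs pre-request against cached usage numbers, with actual costs reconciled asynchronously after each response reports token usage.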
9. Caching (Optional, but Often Huge)
If you want to claim big cost savings credibly, caching is usually part of the story.
Cache types
- Exact match cache (safe default)
- Semantic cache (opt-in, requires careful policy)
When NOT to cache
- tool calls (unless caching tool output separately with strong invariants)
- user-specific or sensitive prompts
- anything flagged by tenant policy as “no cache”
```mermaid
graph TD
    A[Normalized Request] --> B{Cache Eligible?}
    B -->|No| C[Skip Cache]
    B -->|Yes| D[Exact Cache Lookup]
    D -->|Hit| E[Return Cached Response]
    D -->|Miss| F[Optional Semantic Cache]
    F -->|Hit| E
    F -->|Miss| G[Call Provider + Store]
```
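An exact-match cache key can be a hash over every request field that affects the output. A sketch using Node's built-in `crypto` (the field selection here is illustrative; a real key must also cover tools, system prompts, and anything else that changes generation):

```typescript
import { createHash } from "node:crypto";

// Hash the normalized request; identical requests collide by design,
// and any change to model group, messages, or sampling params yields
// a different key.
function exactCacheKey(req: {
  modelGroup: string;
  messages: { role: string; content: string }[];
  temperature: number;
  maxTokens: number;
}): string {
  return createHash("sha256").update(JSON.stringify(req)).digest("hex");
}
```

One caveat worth noting: `JSON.stringify` is key-order-sensitive, so the gateway must serialize from a canonically ordered normalized request, not from raw client input.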
10. Deployment Checklist
Security
- Provider credentials in a vault (AWS Secrets Manager / Azure Key Vault / GCP Secret Manager)
- Virtual keys per tenant/project
- RBAC: model/tool allowlists
- PII scrubbing for logs (default OFF for prompt logging)
Reliability
- Circuit breakers configured per provider
- Cooldowns for throttling
- Timeouts tuned per provider
- Retries capped and bounded
Observability
- OTel traces end-to-end
- Metrics dashboards: latency/error/tokens/cost
- Alerting on error spikes and budget thresholds
Governance
- Strict Mode available for production workloads
- Policy evaluation unit-tested
- Audit logs for policy decisions
11. Testing + Chaos
You don’t have a gateway until it survives bad days.
- Integration tests per provider (including streaming)
- Tool calling tests across providers
- Error simulation: 429/5xx/timeouts
- Load testing with realistic token distributions
- Chaos testing: randomly fail providers, randomly inject latency
12. Results + Measurement Methodology
If you publish numbers, publish how they were measured.
Example outcomes (illustrative)
| Metric | Before | After | Notes |
|---|---|---|---|
| Availability | 99.5% | 99.95% | measured at gateway success rate |
| P95 latency | 3.2s | 2.1s | routing avoided degraded providers |
| 429 rate | 4.2% | 0.3% | context-aware routing + cooldown |
| Cost / 1M tokens | $8.50 | $5.20 | depends heavily on mix + caching |
Measurement notes
- Time window (e.g., 14 days)
- Traffic volume (requests/day, avg tokens)
- Provider mix before/after
- Definition of “availability” (successful responses / total)
- Whether caching is enabled and hit rate
- Whether hedging is enabled
13. Conclusion
A production-grade LLM Gateway is a control plane for AI calls:

- Policy decides what's allowed (Strict Mode, tools, models)
- Router decides what's optimal (weights + health + cost)
- Resiliency keeps you alive (CB + cooldown + backoff)
- Observability keeps you sane (OTel + metrics + request-level cost)
- Providers stay replaceable (canonical streaming + canonical tools)
The future is multi-provider and multi-model. If you design the gateway as an operational system—not a wrapper—you’ll ship reliability and leverage from day one.
Appendix: Configuration Example
```
# Mode
ENABLE_GATEWAY_MODE=true

# Strict Mode (server overrides client model)
STRICT_MODE=true
STRICT_MODEL_GROUP=prod-safe
STRICT_MAX_TOKENS=2000
STRICT_TEMPERATURE=0.7
STRICT_DISALLOW_TOOLS=false

# Weighted distribution for non-strict groups (or for strict groups)
GATEWAY_DISTRIBUTION=bedrock-sonnet-4.5:4,vertex-gemini-3:2,openai-direct-5.2:2,azure-openai-5.2:1,anthropic-direct:1

# Provider creds (store in vault in real prod)
AWS_ACCESS_KEY_ID=xxxxx
AWS_SECRET_ACCESS_KEY=xxxxx
AWS_REGION=us-east-1
AZURE_OPENAI_API_KEY=xxxxx
AZURE_OPENAI_ENDPOINT=https://xxx.openai.azure.com
ANTHROPIC_API_KEY=xxxxx
OPENAI_API_KEY=xxxxx
VERTEX_PROJECT_ID=xxxxx
GEMINI_API_KEY=xxxxx
```
Appendix: Minimal “Strict Mode” sentence you can add early in the post
Strict Mode ensures production safety: when enabled, the gateway ignores client model selection and enforces server-owned model groups and policies (tools, max tokens, allowlists), guaranteeing compliance even if the client is misconfigured.