Building a Production-Grade LLM Gateway (95+): Architecture, Governance, Resiliency, and Observability
A production LLM Gateway is not “a multi-provider SDK wrapper.”
The real moat is governance + tenancy + resiliency + observability—with streaming and tool calling that don’t leak provider quirks.
This post lays out a battle-tested architecture for an LLM Gateway that supports AWS Bedrock, Azure OpenAI, Anthropic, OpenAI, Vertex AI, and Gemini, with:

- Strict Mode (server-enforced model overrides)
- Weighted routing + intelligent failover
- Canonical streaming + canonical tool schema
- Circuit breakers + cooldowns + backoff
- OpenTelemetry-first observability
- Multi-tenant budgets + quotas + virtual keys
- Optional cache layers (exact + semantic)
If you’ve looked at gateways like Portkey / Bifrost, the shape below is aligned with what “production” typically means: a control plane + policy + telemetry + reliability, not just routing.
Table of Contents
- 1. The Problem Space
- 2. Design Goals
- 3. Architecture Overview
- 4. Request Flow (Step-by-Step)
- 5. Core Components
- 5.1 Gateway API
- 5.2 Auth + Tenant Context
- 5.3 Policy Engine (Strict Mode)
- 5.4 Router (Weighted + Health + Cost)
- 5.5 Provider Abstraction
- 5.6 Universal Streaming Adapter
- 5.7 Canonical Tool Schema
- 5.8 Error Taxonomy
- 6. Resiliency Patterns
- 7. Observability (OpenTelemetry-First)
- 8. Cost Controls: Budgets, Quotas, and Attribution
- 9. Caching (Optional, but Often Huge)
- 10. Deployment Checklist
- 11. Testing + Chaos
- 12. Results + Measurement Methodology
- 13. Conclusion
1. The Problem Space
Modern LLM apps hit the same wall:
- Vendor lock-in risk: Switching providers late is painful and expensive.
- Reliability: Provider outages, throttling, regional issues, intermittent 5xx.
- Cost: Pricing and throughput vary widely; smart routing can materially reduce spend.
- Rate limits: RPM/TPM/quotas differ; naive retry strategies amplify failure.
- Model churn: Model IDs change, versions deprecate, capabilities differ.
- Streaming differences: Each provider streams differently; the UI expects one contract.
- Tool calling differences: Schema mismatches break compatibility across vendors.
- Governance: Real production needs policy enforcement and auditability.
- Tenancy: Budgets, per-tenant quotas, and key separation are not “nice to have.”
2. Design Goals
A “95+/100” gateway does these well:
- Provider-agnostic app interface (no vendor quirks leaking to app layer)
- Policy & governance first (Strict Mode, allowlists, tool restrictions)
- Multi-tenant by design (virtual keys, budgets, quotas, RBAC)
- Resiliency beyond retries (circuit breakers, cooldowns, hedging optional)
- Observability standards (OpenTelemetry traces + metrics + structured logs)
- Streaming is a first-class contract
- Cost attribution per request with model/provider tagging
3. Architecture Overview
```mermaid
graph TB
    subgraph Client
        A[App / UI / Services]
    end

    subgraph Gateway
        G1[API Layer<br/>HTTP/SSE]
        G2[Auth + Tenant Context]
        G3[Policy Engine<br/>Strict Mode, Allow/Block, Tools Policy]
        G8[Optional Cache<br/>Exact + Semantic]
        G4[Router<br/>Weighted + Health + Cost + SLO]
        G5[Resiliency Layer<br/>CB + Backoff + Cooldown + Hedging]
        G6[Provider Abstraction<br/>Unified Interface]
        G7[Telemetry<br/>OTel Traces + Metrics + Logs]
    end

    subgraph Providers
        P1[AWS Bedrock]
        P2[Azure OpenAI]
        P3[Anthropic Direct]
        P4[OpenAI Direct]
        P5[Vertex AI]
        P6[Gemini Direct]
    end

    A --> G1 --> G2 --> G3 --> G8 --> G4 --> G5 --> G6
    G6 --> P1
    G6 --> P2
    G6 --> P3
    G6 --> P4
    G6 --> P5
    G6 --> P6
    G1 -.-> G7
    G2 -.-> G7
    G3 -.-> G7
    G4 -.-> G7
    G5 -.-> G7
    G6 -.-> G7
    G8 -.-> G7
```
Key idea: Policy sits before routing. The router cannot be trusted to “do the right thing” unless governance already constrained the space.
4. Request Flow (Step-by-Step)
```mermaid
sequenceDiagram
    participant Client
    participant API as Gateway API
    participant Auth as Auth/Tenant
    participant Policy as Policy Engine
    participant Cache as Cache (optional)
    participant Router as Router
    participant Res as Resiliency
    participant Prov as Provider
    participant OTel as Telemetry

    Client->>API: POST /v1/chat.completions (stream=true)
    API->>OTel: start trace/span + requestId
    API->>Auth: validate key/JWT, resolve tenant/project
    Auth-->>API: tenantCtx + limits + policyRefs
    API->>Policy: enforce Strict Mode, allowlists, tools policy
    Policy-->>API: normalizedRequest (server-approved)
    API->>Cache: lookup (exact/semantic) if eligible

    alt cache hit
        Cache-->>API: cached stream/response
        API-->>Client: stream tokens
    else cache miss
        Cache-->>API: miss
        API->>Router: pick candidate providers (weights + health + cost)
        Router-->>API: ordered candidates
        API->>Res: execute with CB/backoff/cooldown
        Res->>Prov: call generateStream()
        Prov-->>Client: stream tokens (normalized chunks)
    end

    API->>OTel: emit metrics (latency, tokens, provider, cost)
    API-->>Client: final stopReason + usage
```
5. Core Components
5.1 Gateway API
You want a single gateway surface that’s stable for apps:
- `POST /v1/chat.completions` (OpenAI-ish)
- `POST /v1/responses` (optional)
- SSE streaming contract for UI
- Optional: `POST /v1/embeddings` later
Non-negotiable: return a canonical stream chunk format even if providers differ.
5.2 Auth + Tenant Context
Production gateways are almost always multi-tenant.
Common patterns:

- Virtual keys: one key per tenant/project that maps to provider keys internally
- RBAC: which models/tools a tenant can use
- Rate limits: requests/min + tokens/min per tenant/project
```mermaid
graph LR
    A[Incoming API Key/JWT] --> B[Auth Service]
    B --> C[Tenant Resolver]
    C --> D[Policy Ref]
    C --> E[Quota/Budget Ref]
    C --> F[Provider Key Vault Ref]
```
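A minimal sketch of the resolution step (the `TenantContext` shape, the `keyStore` lookup, and the `vk_live_abc123` key are illustrative; production would back this with a database and vault references):

```typescript
// Hypothetical virtual-key resolution: one opaque key maps to a tenant
// context carrying references to policy, quotas, and provider credentials.
type TenantContext = {
  tenantId: string;
  projectId: string;
  policyRef: string;
  quotaRef: string;
  providerKeyVaultRef: string;
};

// Illustrative in-memory store; a real gateway would query a database.
const keyStore: Record<string, TenantContext> = {
  vk_live_abc123: {
    tenantId: "acme",
    projectId: "chatbot",
    policyRef: "policy/acme-default",
    quotaRef: "quota/acme-chatbot",
    providerKeyVaultRef: "vault/acme",
  },
};

// Fail closed: an unknown key is an AUTH_ERROR, never a pass-through.
function resolveTenant(virtualKey: string): TenantContext {
  const ctx = keyStore[virtualKey];
  if (!ctx) throw new Error("AUTH_ERROR: unknown virtual key");
  return ctx;
}
```

The important property is that the client never holds provider credentials; it holds only the virtual key, and everything else is resolved server-side.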
5.3 Policy Engine (Strict Mode)
This is where your gateway becomes “enterprise-grade”.
Policy precedence (recommended)
| Priority | Source | Rule |
|---|---|---|
| 1 | STRICT_MODE (server) | Overrides client model/provider/tool choices |
| 2 | Tenant policy | Allowlists/denylists, tool restrictions, max tokens |
| 3 | Client request | Only within allowed policy bounds |
| 4 | Defaults | Safe fallback |
Strict Mode behavior
- If `STRICT_MODE=true`, ignore the client-provided `model` (and optionally ignore provider hints).
- Route according to the server-owned model group / distribution.

```
STRICT_MODE=true
STRICT_MODEL_GROUP=prod-safe
STRICT_MAX_TOKENS=2000
STRICT_TEMPERATURE=0.7
STRICT_DISALLOW_TOOLS=false
```
```mermaid
flowchart TD
    R[Client Request] --> P{STRICT_MODE?}
    P -->|Yes| S[Force server model group<br/>override model/tools/maxTokens]
    P -->|No| N[Validate within tenant policy<br/>allowlists + tool policy]
    S --> O[Normalized, server-approved request]
    N --> O
```
Code sketch (conceptual):
```typescript
type NormalizedRequest = {
  modelGroup: string; // e.g., "prod-safe"
  resolvedModelHint?: string; // optional, for non-strict mode
  toolsAllowed: boolean;
  maxTokens: number;
  temperature: number;
};

function applyPolicy(req: any, tenantPolicy: any, env: any): NormalizedRequest {
  if (env.STRICT_MODE === "true") {
    return {
      modelGroup: env.STRICT_MODEL_GROUP ?? "prod-safe",
      toolsAllowed: env.STRICT_DISALLOW_TOOLS !== "true",
      maxTokens: Number(env.STRICT_MAX_TOKENS ?? 2000),
      temperature: Number(env.STRICT_TEMPERATURE ?? 0.7),
    };
  }
  return {
    modelGroup: tenantPolicy.defaultModelGroup,
    resolvedModelHint: req.model,
    toolsAllowed: tenantPolicy.toolsAllowed && !!req.tools?.length,
    maxTokens: Math.min(req.max_tokens ?? 4000, tenantPolicy.maxTokens),
    temperature: Math.max(0, Math.min(req.temperature ?? 0.7, 2.0)),
  };
}
```
5.4 Router (Weighted + Health + Cost)
Weighted routing is table stakes. What makes it production-grade is:
weight × health × SLO × cost.
Example distribution
```
ENABLE_GATEWAY_MODE=true
GATEWAY_DISTRIBUTION=bedrock-sonnet-4.5:4,vertex-gemini-3:2,openai-direct-5.2:2,azure-openai-5.2:1,anthropic-direct:1
```
Recommended router behavior
- Start from policy-approved model group → list allowed providers/models
- Filter by:
  - circuit breaker state (OPEN → excluded)
  - hard tenant restrictions
  - regional constraints
- Compute selection weights:
  - base weight from config
  - multiplied by health score (recent success rate)
  - optionally adjusted by cost band for the request type
- Return ordered candidates for resiliency executor
```mermaid
graph TD
    A[Normalized Request] --> B[Allowed Provider Pool]
    B --> C[Filter: Policy + Tenant + Region]
    C --> D[Filter: Circuit Breaker OPEN]
    D --> E[Score: Weight x Health x SLO x Cost]
    E --> F[Ordered Candidate List]
```
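The filter-and-score steps above can be sketched as a pure function over candidate providers (the `Candidate` shape and health-score field are illustrative; a real router would also factor in SLO and cost bands):

```typescript
// A candidate carries its configured base weight, a rolling health score
// (recent success rate, 0..1), and its circuit breaker state.
type Candidate = { name: string; weight: number; health: number; cbOpen: boolean };

// Exclude OPEN breakers first, then rank by weight x health, descending.
// The ordered list feeds the resiliency executor as failover candidates.
function orderCandidates(pool: Candidate[]): string[] {
  return pool
    .filter((c) => !c.cbOpen)
    .map((c) => ({ name: c.name, score: c.weight * c.health }))
    .sort((a, b) => b.score - a.score)
    .map((c) => c.name);
}
```

Note that a degraded provider (low health) can fall below a lower-weighted healthy one, which is exactly the behavior you want during a partial outage.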
5.5 Provider Abstraction
Canonical interface:
```typescript
export interface ILLMProvider {
  generate(request: NormalizedProviderRequest): Promise<LLMResponse>;
  generateStream(request: NormalizedProviderRequest): AsyncIterable<StreamChunk>;
  getProviderInfo(): ProviderInfo;
  testConnection?(): Promise<boolean>;
}
```
Providers implement:

- request conversion (canonical → provider-specific)
- response normalization (provider-specific → canonical)
- streaming normalization (provider-specific → canonical chunks)
- error normalization
5.6 Universal Streaming Adapter
Return a single gateway streaming contract.
```typescript
export type StreamChunk = {
  content: string;
  role: "assistant";
  stopReason?: string;
  usage?: { inputTokens?: number; outputTokens?: number; totalTokens?: number };
};
```
Implementation pattern:
```typescript
// Inside a provider class implementing ILLMProvider:
async *generateStream(req: NormalizedProviderRequest): AsyncIterable<StreamChunk> {
  const raw = await this.providerSpecificStream(req);
  for await (const rawChunk of raw) {
    const chunk = this.normalizeStreamChunk(rawChunk);
    if (chunk) yield chunk; // never buffer whole output
  }
}
```
Rule: The app should never see Bedrock event types, OpenAI SSE deltas, or Gemini chunk formats.
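A consumer of this contract only ever touches `StreamChunk` values, regardless of provider. A minimal sketch, with a fake stream standing in for a provider (the `collect` helper and `fakeStream` generator are illustrative):

```typescript
// Mirrors the canonical contract above (usage field omitted for brevity).
type StreamChunk = { content: string; role: "assistant"; stopReason?: string };

// Accumulate canonical chunks into a final answer; a real UI would
// render each chunk as it arrives instead of concatenating.
async function collect(stream: AsyncIterable<StreamChunk>): Promise<string> {
  let out = "";
  for await (const chunk of stream) out += chunk.content;
  return out;
}

// Stand-in for any provider's normalized generateStream() output.
async function* fakeStream(): AsyncIterable<StreamChunk> {
  yield { content: "Hel", role: "assistant" };
  yield { content: "lo", role: "assistant", stopReason: "end_turn" };
}
```

Because the consumer is written against `AsyncIterable<StreamChunk>`, swapping Bedrock for OpenAI under the hood changes nothing above this line.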
5.7 Canonical Tool Schema
Define one canonical schema and translate per provider.
```typescript
export type UniversalTool = {
  name: string;
  description: string;
  input_schema: {
    type: "object";
    properties: Record<string, any>;
    required?: string[];
    additionalProperties?: boolean;
  };
};
```
Provider transforms:
- Anthropic: tools: [{ name, description, input_schema }]
- OpenAI: tools: [{ type:"function", function:{ name, description, parameters } }]
- Gemini: function_declarations format
Rule: tools policy should live in Policy Engine (disable, allowlist, schema size, etc.)
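For example, the canonical-to-OpenAI transform is a small reshaping function (a sketch; as noted above, Anthropic's format already matches the canonical shape, so its transform is the identity):

```typescript
// Canonical tool shape, as defined earlier in the post.
type UniversalTool = {
  name: string;
  description: string;
  input_schema: { type: "object"; properties: Record<string, any>; required?: string[] };
};

// OpenAI nests the schema under function.parameters and wraps it in
// a { type: "function" } envelope; nothing is lost in the mapping.
function toOpenAITool(t: UniversalTool) {
  return {
    type: "function",
    function: { name: t.name, description: t.description, parameters: t.input_schema },
  };
}
```

Keeping these transforms tiny and lossless is what makes the canonical schema viable: every provider mapping should be reversible in principle.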
5.8 Error Taxonomy
Retries only work when errors are categorized correctly.
Recommended canonical error types:
- RATE_LIMIT (429, includes retryAfter, rpm/tpm/quota)
- TIMEOUT
- PROVIDER_5XX
- BAD_REQUEST (4xx non-429)
- AUTH_ERROR
- POLICY_VIOLATION
- UNKNOWN
```typescript
export type GatewayError = {
  type: "RATE_LIMIT" | "TIMEOUT" | "PROVIDER_5XX" | "BAD_REQUEST" | "AUTH_ERROR" | "POLICY_VIOLATION" | "UNKNOWN";
  provider?: string;
  model?: string;
  retryAfterMs?: number;
  canRetry: boolean;
  details?: any;
};
```
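A sketch of mapping HTTP status codes into this taxonomy, simplified to the fields a retry loop needs (`TIMEOUT` and `POLICY_VIOLATION` are raised elsewhere in the pipeline, not from HTTP status):

```typescript
// Subset of the canonical GatewayError for status-code classification.
type GatewayError = {
  type: "RATE_LIMIT" | "PROVIDER_5XX" | "BAD_REQUEST" | "AUTH_ERROR" | "UNKNOWN";
  retryAfterMs?: number;
  canRetry: boolean;
};

function classifyHttpError(status: number, retryAfterMs?: number): GatewayError {
  if (status === 429) return { type: "RATE_LIMIT", retryAfterMs, canRetry: true };
  if (status === 401 || status === 403) return { type: "AUTH_ERROR", canRetry: false };
  if (status >= 500) return { type: "PROVIDER_5XX", canRetry: true };
  if (status >= 400) return { type: "BAD_REQUEST", canRetry: false }; // 4xx non-429: our fault, retrying won't help
  return { type: "UNKNOWN", canRetry: false };
}
```

The `canRetry` flag is the whole point: the resiliency layer never inspects raw status codes, only the classified error.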
6. Resiliency Patterns
Retries alone are not resiliency. You need circuit breakers and cooldowns to avoid stampedes.
Circuit breaker state machine
```mermaid
stateDiagram-v2
    [*] --> Closed
    Closed --> Open: failure rate > threshold
    Open --> HalfOpen: cooldown elapsed
    HalfOpen --> Closed: success count met
    HalfOpen --> Open: failure occurs
```
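A minimal in-memory breaker matching the state machine above might look like this (the threshold, cooldown, and injected clock are illustrative defaults, not recommendations):

```typescript
type CBState = "CLOSED" | "OPEN" | "HALF_OPEN";

class CircuitBreaker {
  private state: CBState = "CLOSED";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold = 5,
    private cooldownMs = 30_000,
    private now: () => number = Date.now // injectable clock for testing
  ) {}

  // Open -> HalfOpen transition happens lazily when the cooldown elapses;
  // while OPEN, the router excludes this provider entirely.
  canRequest(): boolean {
    if (this.state === "OPEN" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "HALF_OPEN"; // allow a single probe request
    }
    return this.state !== "OPEN";
  }

  onSuccess(): void {
    this.failures = 0;
    this.state = "CLOSED";
  }

  onFailure(): void {
    this.failures++;
    // A HALF_OPEN probe that fails re-opens immediately.
    if (this.state === "HALF_OPEN" || this.failures >= this.failureThreshold) {
      this.state = "OPEN";
      this.openedAt = this.now();
    }
  }
}
```

A production version would track a failure *rate* over a sliding window rather than a raw count, and require several HALF_OPEN successes before closing.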
Recommended execution strategy
- Fast failover on:
  - 5xx spikes
  - timeouts
  - 429 with `retryAfter` too large
- Backoff only when retryAfter is short and user experience is acceptable
- Cooldown a provider after repeated throttles
- Optional: hedged requests for high-priority traffic (careful: increases cost)
```mermaid
flowchart TD
    A[Candidate Providers] --> B[Try Provider #1]
    B --> C{Success?}
    C -->|Yes| R[Return]
    C -->|No| D[Classify Error]
    D --> E{Can retry quickly?}
    E -->|Yes| F[Backoff / Respect Retry-After]
    E -->|No| G[Trip CB / Cooldown]
    F --> H[Try next provider]
    G --> H
    H --> C
```
7. Observability (OpenTelemetry-First)
If you can’t see it, you can’t run it.
What to emit on every request
Traces
- span: gateway.request
- attributes: tenantId, projectId, provider, model, stream, toolsUsed
Metrics

- request count by provider/model/tenant
- error count by type/provider
- latency p50/p95/p99
- token usage (input/output/total)
- estimated cost

Logs

- structured logs at boundaries only (avoid logging raw prompts unless explicitly enabled and scrubbed)
```mermaid
graph LR
    A[Gateway] --> B[OTel Collector]
    B --> C[Tracing: Tempo/Jaeger]
    B --> D[Metrics: Prometheus]
    B --> E[Logs: Loki/ELK]
```
8. Cost Controls: Budgets, Quotas, and Attribution
A real gateway needs guardrails:
- Budgets: monthly spend caps per tenant/project
- Quotas: RPM/TPM caps per tenant/project
- Hard stops: block requests when budget is exceeded
- Soft alerts: warn at 70/85/95%
Recommended tables (conceptual):
- `tenant_budgets(tenantId, month, usdLimit, usdUsed)`
- `tenant_quotas(tenantId, rpmLimit, tpmLimit)`
- `request_costs(requestId, provider, model, inputTokens, outputTokens, costUsd)`
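The budget guardrail reduces to a small decision function (a conceptual sketch; the `Budget` shape is illustrative, and the alert tiers follow the 70/85/95% soft-alert thresholds above):

```typescript
type Budget = { usdLimit: number; usdUsed: number };

// Hard stop at or above 100% of budget; otherwise allow, attaching the
// highest soft-alert tier that has been crossed, if any.
function budgetDecision(b: Budget): { allow: boolean; alert?: number } {
  const pct = (b.usdUsed / b.usdLimit) * 100;
  if (pct >= 100) return { allow: false };
  const tier = [95, 85, 70].find((t) => pct >= t);
  return tier !== undefined ? { allow: true, alert: tier } : { allow: true };
}
```

In practice this check runs pre-request against cached usage numbers, with actual costs reconciled asynchronously after each response reports token usage.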
9. Caching (Optional, but Often Huge)
If you want to claim big cost savings credibly, caching is usually part of the story.
Cache types
- Exact match cache (safe default)
- Semantic cache (opt-in, requires careful policy)
When NOT to cache
- tool calls (unless caching tool output separately with strong invariants)
- user-specific or sensitive prompts
- anything flagged by tenant policy as “no cache”
```mermaid
graph TD
    A[Normalized Request] --> B{Cache Eligible?}
    B -->|No| C[Skip Cache]
    B -->|Yes| D[Exact Cache Lookup]
    D -->|Hit| E[Return Cached Response]
    D -->|Miss| F[Optional Semantic Cache]
    F -->|Hit| E
    F -->|Miss| G[Call Provider + Store]
```
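An exact-match cache key can be a hash over every request field that affects the output. A sketch using Node's built-in `crypto` (the field selection here is illustrative; a real key must also cover tools, system prompts, and anything else that changes generation):

```typescript
import { createHash } from "node:crypto";

// Hash the normalized request; identical requests collide by design,
// and any change to model group, messages, or sampling params yields
// a different key.
function exactCacheKey(req: {
  modelGroup: string;
  messages: { role: string; content: string }[];
  temperature: number;
  maxTokens: number;
}): string {
  return createHash("sha256").update(JSON.stringify(req)).digest("hex");
}
```

One caveat worth noting: `JSON.stringify` is key-order-sensitive, so the gateway must serialize from a canonically ordered normalized request, not from raw client input.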
10. Deployment Checklist
Security
- Provider credentials in a vault (AWS Secrets Manager / Azure Key Vault / GCP Secret Manager)
- Virtual keys per tenant/project
- RBAC: model/tool allowlists
- PII scrubbing for logs (default OFF for prompt logging)
Reliability
- Circuit breakers configured per provider
- Cooldowns for throttling
- Timeouts tuned per provider
- Retries capped and bounded
Observability
- OTel traces end-to-end
- Metrics dashboards: latency/error/tokens/cost
- Alerting on error spikes and budget thresholds
Governance
- Strict Mode available for production workloads
- Policy evaluation unit-tested
- Audit logs for policy decisions
11. Testing + Chaos
You don’t have a gateway until it survives bad days.
- Integration tests per provider (including streaming)
- Tool calling tests across providers
- Error simulation: 429/5xx/timeouts
- Load testing with realistic token distributions
- Chaos testing: randomly fail providers, randomly inject latency
12. Results + Measurement Methodology
If you publish numbers, publish how they were measured.
Example outcomes (illustrative)
| Metric | Before | After | Notes |
|---|---|---|---|
| Availability | 99.5% | 99.95% | measured at gateway success rate |
| P95 latency | 3.2s | 2.1s | routing avoided degraded providers |
| 429 rate | 4.2% | 0.3% | context-aware routing + cooldown |
| Cost / 1M tokens | $8.50 | $5.20 | depends heavily on mix + caching |
Measurement notes
- Time window (e.g., 14 days)
- Traffic volume (requests/day, avg tokens)
- Provider mix before/after
- Definition of “availability” (successful responses / total)
- Whether caching is enabled and hit rate
- Whether hedging is enabled
13. Conclusion
A production-grade LLM Gateway is a control plane for AI calls:

- Policy decides what's allowed (Strict Mode, tools, models)
- Router decides what's optimal (weights + health + cost)
- Resiliency keeps you alive (CB + cooldown + backoff)
- Observability keeps you sane (OTel + metrics + request-level cost)
- Providers stay replaceable (canonical streaming + canonical tools)
The future is multi-provider and multi-model. If you design the gateway as an operational system—not a wrapper—you’ll ship reliability and leverage from day one.
Appendix: Configuration Example
```
# Mode
ENABLE_GATEWAY_MODE=true

# Strict Mode (server overrides client model)
STRICT_MODE=true
STRICT_MODEL_GROUP=prod-safe
STRICT_MAX_TOKENS=2000
STRICT_TEMPERATURE=0.7
STRICT_DISALLOW_TOOLS=false

# Weighted distribution for non-strict groups (or for strict groups)
GATEWAY_DISTRIBUTION=bedrock-sonnet-4.5:4,vertex-gemini-3:2,openai-direct-5.2:2,azure-openai-5.2:1,anthropic-direct:1

# Provider creds (store in vault in real prod)
AWS_ACCESS_KEY_ID=xxxxx
AWS_SECRET_ACCESS_KEY=xxxxx
AWS_REGION=us-east-1
AZURE_OPENAI_API_KEY=xxxxx
AZURE_OPENAI_ENDPOINT=https://xxx.openai.azure.com
ANTHROPIC_API_KEY=xxxxx
OPENAI_API_KEY=xxxxx
VERTEX_PROJECT_ID=xxxxx
GEMINI_API_KEY=xxxxx
```
Appendix: Minimal “Strict Mode” sentence you can add early in the post
Strict Mode ensures production safety: when enabled, the gateway ignores client model selection and enforces server-owned model groups and policies (tools, max tokens, allowlists), guaranteeing compliance even if the client is misconfigured.