Building a Production-Grade LLM Gateway (95+): Architecture, Governance, Resiliency, and Observability

A production LLM Gateway is not “a multi-provider SDK wrapper.”
The real moat is governance + tenancy + resiliency + observability—with streaming and tool calling that don’t leak provider quirks.

This post lays out a battle-tested architecture for an LLM Gateway that supports AWS Bedrock, Azure OpenAI, Anthropic, OpenAI, Vertex AI, and Gemini, with:

- Strict Mode (server-enforced model overrides)
- Weighted routing + intelligent failover
- Canonical streaming + canonical tool schema
- Circuit breakers + cooldowns + backoff
- OpenTelemetry-first observability
- Multi-tenant budgets + quotas + virtual keys
- Optional cache layers (exact + semantic)

If you’ve looked at gateways like Portkey / Bifrost, the shape below is aligned with what “production” typically means: a control plane + policy + telemetry + reliability, not just routing.



1. The Problem Space

Modern LLM apps hit the same wall:

  1. Vendor lock-in risk: Switching providers late is painful and expensive.
  2. Reliability: Provider outages, throttling, regional issues, intermittent 5xx.
  3. Cost: Pricing and throughput vary widely; smart routing can materially reduce spend.
  4. Rate limits: RPM/TPM/quotas differ; naive retry strategies amplify failure.
  5. Model churn: Model IDs change, versions deprecate, capabilities differ.
  6. Streaming differences: Each provider streams differently; the UI expects one contract.
  7. Tool calling differences: Schema mismatches break compatibility across vendors.
  8. Governance: Real production needs policy enforcement and auditability.
  9. Tenancy: Budgets, per-tenant quotas, and key separation are not “nice to have.”

2. Design Goals

A “95+/100” gateway does these well:

- Governance first: policy constrains every request before routing
- Provider-agnostic contracts: one streaming format, one tool schema
- Resiliency by design: circuit breakers, cooldowns, backoff, failover
- Tenancy built in: budgets, quotas, and virtual keys per tenant/project
- Observability everywhere: traces, metrics, and per-request cost


3. Architecture Overview

graph TB
  subgraph Client
    A[App / UI / Services]
  end

  subgraph Gateway
    G1[API Layer<br/>HTTP/SSE]
    G2[Auth + Tenant Context]
    G3[Policy Engine<br/>Strict Mode, Allow/Block, Tools Policy]
    G8[Optional Cache<br/>Exact + Semantic]
    G4[Router<br/>Weighted + Health + Cost + SLO]
    G5[Resiliency Layer<br/>CB + Backoff + Cooldown + Hedging]
    G6[Provider Abstraction<br/>Unified Interface]
    G7[Telemetry<br/>OTel Traces + Metrics + Logs]
  end

  subgraph Providers
    P1[AWS Bedrock]
    P2[Azure OpenAI]
    P3[Anthropic Direct]
    P4[OpenAI Direct]
    P5[Vertex AI]
    P6[Gemini Direct]
  end

  A --> G1 --> G2 --> G3 --> G8 --> G4 --> G5 --> G6
  G6 --> P1
  G6 --> P2
  G6 --> P3
  G6 --> P4
  G6 --> P5
  G6 --> P6

  G1 -.-> G7
  G2 -.-> G7
  G3 -.-> G7
  G4 -.-> G7
  G5 -.-> G7
  G6 -.-> G7
  G8 -.-> G7

Key idea: Policy sits before routing. The router cannot be trusted to “do the right thing” unless governance already constrained the space.


4. Request Flow (Step-by-Step)

sequenceDiagram
  participant Client
  participant API as Gateway API
  participant Auth as Auth/Tenant
  participant Policy as Policy Engine
  participant Cache as Cache (optional)
  participant Router as Router
  participant Res as Resiliency
  participant Prov as Provider
  participant OTel as Telemetry

  Client->>API: POST /v1/chat/completions (stream=true)
  API->>OTel: start trace/span + requestId
  API->>Auth: validate key/JWT, resolve tenant/project
  Auth-->>API: tenantCtx + limits + policyRefs
  API->>Policy: enforce Strict Mode, allowlists, tools policy
  Policy-->>API: normalizedRequest (server-approved)
  API->>Cache: lookup (exact/semantic) if eligible
  alt cache hit
    Cache-->>API: cached stream/response
    API-->>Client: stream tokens
  else cache miss
    Cache-->>API: miss
    API->>Router: pick candidate providers (weights + health + cost)
    Router-->>API: ordered candidates
    API->>Res: execute with CB/backoff/cooldown
    Res->>Prov: call generateStream()
    Prov-->>Client: stream tokens (normalized chunks)
  end
  API->>OTel: emit metrics (latency, tokens, provider, cost)
  API-->>Client: final stopReason + usage

5. Core Components

5.1 Gateway API

You want a single gateway surface that’s stable for apps:

Non-negotiable: return a canonical stream chunk format even if providers differ.


5.2 Auth + Tenant Context

Production gateways are almost always multi-tenant.

Common patterns:

- Virtual keys: one key per tenant/project that maps to provider keys internally
- RBAC: which models/tools a tenant can use
- Rate limits: requests/min + tokens/min per tenant/project

graph LR
  A[Incoming API Key/JWT] --> B[Auth Service]
  B --> C[Tenant Resolver]
  C --> D[Policy Ref]
  C --> E[Quota/Budget Ref]
  C --> F[Provider Key Vault Ref]
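
The rate-limit side of tenant context can be sketched as a token bucket. This is illustrative only: the `TenantRateLimiter` class and its parameters are assumptions, and a real deployment would back the bucket state with Redis so limits hold across gateway instances.

```typescript
// Minimal per-tenant token-bucket limiter (illustrative; production
// deployments typically keep this state in Redis for multi-instance safety).
type Bucket = { tokens: number; lastRefillMs: number };

class TenantRateLimiter {
  private buckets = new Map<string, Bucket>();

  constructor(private ratePerMin: number, private burst: number) {}

  // Returns true if the request is admitted, false if it should be rejected (429).
  tryAcquire(tenantId: string, nowMs: number = Date.now()): boolean {
    const bucket =
      this.buckets.get(tenantId) ?? { tokens: this.burst, lastRefillMs: nowMs };
    // Refill proportionally to elapsed time, capped at the burst size.
    const elapsedMin = (nowMs - bucket.lastRefillMs) / 60_000;
    bucket.tokens = Math.min(this.burst, bucket.tokens + elapsedMin * this.ratePerMin);
    bucket.lastRefillMs = nowMs;
    if (bucket.tokens < 1) {
      this.buckets.set(tenantId, bucket);
      return false;
    }
    bucket.tokens -= 1;
    this.buckets.set(tenantId, bucket);
    return true;
  }
}
```

The same shape works for TPM limits by acquiring `estimatedTokens` instead of 1 per request.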

5.3 Policy Engine (Strict Mode)

This is where your gateway becomes “enterprise-grade”.

| Priority | Source | Rule |
|---|---|---|
| 1 | STRICT_MODE (server) | Overrides client model/provider/tool choices |
| 2 | Tenant policy | Allowlists/denylists, tool restrictions, max tokens |
| 3 | Client request | Only within allowed policy bounds |
| 4 | Defaults | Safe fallback |

Strict Mode behavior

STRICT_MODE=true
STRICT_MODEL_GROUP=prod-safe
STRICT_MAX_TOKENS=2000
STRICT_TEMPERATURE=0.7
STRICT_DISALLOW_TOOLS=false

flowchart TD
  R[Client Request] --> P{STRICT_MODE?}
  P -->|Yes| S[Force server model group<br/>override model/tools/maxTokens]
  P -->|No| N[Validate within tenant policy<br/>allowlists + tool policy]
  S --> O[Normalized, server-approved request]
  N --> O

Code sketch (conceptual):

type NormalizedRequest = {
  modelGroup: string;           // e.g., "prod-safe"
  resolvedModelHint?: string;   // optional, for non-strict mode
  toolsAllowed: boolean;
  maxTokens: number;
  temperature: number;
};

function applyPolicy(req: any, tenantPolicy: any, env: any): NormalizedRequest {
  if (env.STRICT_MODE === "true") {
    return {
      modelGroup: env.STRICT_MODEL_GROUP ?? "prod-safe",
      toolsAllowed: env.STRICT_DISALLOW_TOOLS !== "true",
      maxTokens: Number(env.STRICT_MAX_TOKENS ?? 2000),
      temperature: Number(env.STRICT_TEMPERATURE ?? 0.7),
    };
  }

  return {
    modelGroup: tenantPolicy.defaultModelGroup,
    resolvedModelHint: req.model,
    toolsAllowed: tenantPolicy.toolsAllowed && !!req.tools?.length,
    maxTokens: Math.min(req.max_tokens ?? 4000, tenantPolicy.maxTokens),
    temperature: Math.max(0, Math.min(req.temperature ?? 0.7, 2.0)),
  };
}


5.4 Router (Weighted + Health + Cost)

Weighted routing is table stakes. What makes it production-grade is:
weight × health × SLO × cost.

Example distribution

ENABLE_GATEWAY_MODE=true
GATEWAY_DISTRIBUTION=bedrock-sonnet-4.5:4,vertex-gemini-3:2,openai-direct-5.2:2,azure-openai-5.2:1,anthropic-direct:1
Routing algorithm:

  1. Start from the policy-approved model group → list allowed providers/models
  2. Filter by:
     - circuit breaker state (OPEN → excluded)
     - hard tenant restrictions
     - regional constraints
  3. Compute selection weights:
     - base weight from config
     - multiplied by health score (recent success rate)
     - optionally adjusted by cost band for the request type
  4. Return ordered candidates for the resiliency executor

graph TD
  A[Normalized Request] --> B[Allowed Provider Pool]
  B --> C[Filter: Policy + Tenant + Region]
  C --> D[Filter: Circuit Breaker OPEN]
  D --> E[Score: Weight x Health x SLO x Cost]
  E --> F[Ordered Candidate List]
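
The selection steps above can be sketched as two small functions. This is conceptual: `parseDistribution` and `rankCandidates` are assumed names, and the health and circuit-breaker inputs are stubbed as callbacks (cost/SLO factors plug in the same way).

```typescript
// Parse a GATEWAY_DISTRIBUTION string ("name:weight,...") and rank candidates
// by base weight × recent health score.
type Candidate = { name: string; weight: number };

function parseDistribution(spec: string): Candidate[] {
  return spec.split(",").map((entry) => {
    const [name, w] = entry.split(":");
    return { name: name.trim(), weight: Number(w ?? 1) };
  });
}

function rankCandidates(
  candidates: Candidate[],
  healthScore: (name: string) => number, // recent success rate in [0, 1]
  circuitOpen: (name: string) => boolean
): string[] {
  return candidates
    .filter((c) => !circuitOpen(c.name))        // exclude tripped breakers
    .map((c) => ({ name: c.name, score: c.weight * healthScore(c.name) }))
    .sort((a, b) => b.score - a.score)          // highest score first
    .map((c) => c.name);
}
```

A deterministic ordered list (rather than random weighted sampling) keeps the resiliency executor's failover order reproducible; swap in weighted sampling if you need load spreading.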

5.5 Provider Abstraction

Canonical interface:

export interface ILLMProvider {
  generate(request: NormalizedProviderRequest): Promise<LLMResponse>;
  generateStream(request: NormalizedProviderRequest): AsyncIterable<StreamChunk>;
  getProviderInfo(): ProviderInfo;
  testConnection?(): Promise<boolean>;
}

Providers implement:

- request conversion (canonical → provider-specific)
- response normalization (provider-specific → canonical)
- streaming normalization (provider-specific → canonical chunks)
- error normalization


5.6 Universal Streaming Adapter

Return a single gateway streaming contract.

export type StreamChunk = {
  content: string;
  role: "assistant";
  stopReason?: string;
  usage?: { inputTokens?: number; outputTokens?: number; totalTokens?: number };
};

Implementation pattern:

async *generateStream(req: NormalizedProviderRequest): AsyncIterable<StreamChunk> {
  const raw = await this.providerSpecificStream(req);

  for await (const rawChunk of raw) {
    const chunk = this.normalizeStreamChunk(rawChunk);
    if (chunk) yield chunk; // never buffer whole output
  }
}
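
As a concrete instance of `normalizeStreamChunk`, here is a sketch for OpenAI-style chat-completion deltas. The raw shape follows OpenAI's streaming format; `normalizeOpenAIChunk` is an assumed name, and each provider (Bedrock event stream, Gemini chunks) gets its own normalizer.

```typescript
// Canonical chunk shape, as defined in the gateway streaming contract.
type StreamChunk = {
  content: string;
  role: "assistant";
  stopReason?: string;
  usage?: { inputTokens?: number; outputTokens?: number; totalTokens?: number };
};

// Map an OpenAI-style SSE delta onto the canonical chunk; return null for
// keep-alive deltas that carry neither content nor a finish reason.
function normalizeOpenAIChunk(raw: any): StreamChunk | null {
  const choice = raw?.choices?.[0];
  if (!choice) return null;
  const content = choice.delta?.content ?? "";
  const stopReason = choice.finish_reason ?? undefined;
  if (!content && !stopReason) return null; // drop empty keep-alives
  return { content, role: "assistant", stopReason };
}
```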

Rule: The app should never see Bedrock event types, OpenAI SSE deltas, or Gemini chunk formats.


5.7 Canonical Tool Schema

Define one canonical schema and translate per provider.

export type UniversalTool = {
  name: string;
  description: string;
  input_schema: {
    type: "object";
    properties: Record<string, any>;
    required?: string[];
    additionalProperties?: boolean;
  };
};

Provider transforms:

- Anthropic: tools: [{ name, description, input_schema }]
- OpenAI: tools: [{ type:"function", function:{ name, description, parameters } }]
- Gemini: function_declarations format
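
The OpenAI transform, for instance, is a one-to-one mapping; `toOpenAITool` is an assumed name in this sketch:

```typescript
// UniversalTool as defined above, translated into the OpenAI tools format.
type UniversalTool = {
  name: string;
  description: string;
  input_schema: {
    type: "object";
    properties: Record<string, any>;
    required?: string[];
    additionalProperties?: boolean;
  };
};

function toOpenAITool(tool: UniversalTool) {
  return {
    type: "function" as const,
    function: {
      name: tool.name,
      description: tool.description,
      parameters: tool.input_schema, // JSON Schema passes through unchanged
    },
  };
}
```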

Rule: tools policy should live in Policy Engine (disable, allowlist, schema size, etc.)


5.8 Error Taxonomy

Retries only work when errors are categorized correctly.

Recommended canonical error types:

- RATE_LIMIT (429, includes retryAfter, rpm/tpm/quota)
- TIMEOUT
- PROVIDER_5XX
- BAD_REQUEST (4xx non-429)
- AUTH_ERROR
- POLICY_VIOLATION
- UNKNOWN

export type GatewayError = {
  type: "RATE_LIMIT"|"TIMEOUT"|"PROVIDER_5XX"|"BAD_REQUEST"|"AUTH_ERROR"|"POLICY_VIOLATION"|"UNKNOWN";
  provider?: string;
  model?: string;
  retryAfterMs?: number;
  canRetry: boolean;
  details?: any;
};
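
A minimal classifier mapping HTTP status codes onto this taxonomy might look like the following. `classifyHttpError` is an assumed name, and real providers also need provider-specific error-body parsing (e.g. timeout and policy cases) on top of status codes:

```typescript
// Canonical error shape, as defined above (trimmed to the fields used here).
type GatewayError = {
  type: "RATE_LIMIT" | "TIMEOUT" | "PROVIDER_5XX" | "BAD_REQUEST" | "AUTH_ERROR" | "POLICY_VIOLATION" | "UNKNOWN";
  retryAfterMs?: number;
  canRetry: boolean;
};

function classifyHttpError(status: number, retryAfterHeader?: string): GatewayError {
  if (status === 429) {
    // Respect Retry-After (seconds) so backoff doesn't hammer a throttled provider.
    const retryAfterMs = retryAfterHeader ? Number(retryAfterHeader) * 1000 : undefined;
    return { type: "RATE_LIMIT", retryAfterMs, canRetry: true };
  }
  if (status === 401 || status === 403) return { type: "AUTH_ERROR", canRetry: false };
  if (status >= 500) return { type: "PROVIDER_5XX", canRetry: true };
  if (status >= 400) return { type: "BAD_REQUEST", canRetry: false }; // retrying won't help
  return { type: "UNKNOWN", canRetry: false };
}
```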

6. Resiliency Patterns

Retries alone are not resiliency. You need circuit breakers and cooldowns to avoid stampedes.

Circuit breaker state machine

stateDiagram-v2
  [*] --> Closed
  Closed --> Open: failure rate > threshold
  Open --> HalfOpen: cooldown elapsed
  HalfOpen --> Closed: success count met
  HalfOpen --> Open: failure occurs

flowchart TD
  A[Candidate Providers] --> B[Try Provider #1]
  B --> C{Success?}
  C -->|Yes| R[Return]
  C -->|No| D[Classify Error]
  D --> E{Can retry quickly?}
  E -->|Yes| F[Backoff / Respect Retry-After]
  E -->|No| G[Trip CB / Cooldown]
  F --> H[Try next provider]
  G --> H
  H --> C
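
The state machine above can be sketched as a small class. Naming and thresholds are simplifying assumptions (`CircuitBreaker`, consecutive-failure counting rather than a windowed failure rate):

```typescript
// Closed → Open on repeated failures, Open → HalfOpen after cooldown,
// HalfOpen → Closed after enough probe successes, HalfOpen → Open on failure.
type CBState = "CLOSED" | "OPEN" | "HALF_OPEN";

class CircuitBreaker {
  private state: CBState = "CLOSED";
  private failures = 0;
  private successes = 0;
  private openedAtMs = 0;

  constructor(
    private failureThreshold: number,
    private cooldownMs: number,
    private halfOpenSuccesses: number
  ) {}

  canRequest(nowMs: number = Date.now()): boolean {
    if (this.state === "OPEN" && nowMs - this.openedAtMs >= this.cooldownMs) {
      this.state = "HALF_OPEN"; // cooldown elapsed: allow probe traffic
      this.successes = 0;
    }
    return this.state !== "OPEN";
  }

  recordSuccess(): void {
    if (this.state === "HALF_OPEN" && ++this.successes >= this.halfOpenSuccesses) {
      this.state = "CLOSED";
      this.failures = 0;
    }
  }

  recordFailure(nowMs: number = Date.now()): void {
    // A half-open failure trips immediately; otherwise trip at the threshold.
    if (this.state === "HALF_OPEN" || ++this.failures >= this.failureThreshold) {
      this.state = "OPEN";
      this.openedAtMs = nowMs;
      this.failures = 0;
    }
  }

  current(): CBState {
    return this.state;
  }
}
```

The router checks `canRequest()` when filtering candidates; the resiliency executor calls `recordSuccess()`/`recordFailure()` after each attempt.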

7. Observability (OpenTelemetry-First)

If you can’t see it, you can’t run it.

What to emit on every request

Traces:

- span: gateway.request
- attributes: tenantId, projectId, provider, model, stream, toolsUsed

Metrics:

- request count by provider/model/tenant
- error count by type/provider
- latency p50/p95/p99
- token usage input/output/total
- estimated cost

Logs:

- structured logs at boundaries only (avoid logging raw prompts unless explicitly enabled and scrubbed)

graph LR
  A[Gateway] --> B[OTel Collector]
  B --> C[Tracing: Tempo/Jaeger]
  B --> D[Metrics: Prometheus]
  B --> E[Logs: Loki/ELK]

8. Cost Controls: Budgets, Quotas, and Attribution

A real gateway needs guardrails:

Recommended tables (conceptual):

- tenant_budgets(tenantId, month, usdLimit, usdUsed)
- tenant_quotas(tenantId, rpmLimit, tpmLimit)
- request_costs(requestId, provider, model, inputTokens, outputTokens, costUsd)
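
Against those conceptual tables, the pre-call enforcement check and post-call attribution are simple. `checkBudget` and `requestCostUsd` are assumed names, and per-model prices would come from a pricing table you maintain:

```typescript
// Pre-call guardrail: reject before the provider call when the estimated
// cost would exceed the tenant's monthly cap (tenant_budgets row).
type TenantBudget = { usdLimit: number; usdUsed: number };

function checkBudget(
  budget: TenantBudget,
  estimatedCostUsd: number
): { allowed: boolean; remainingUsd: number } {
  const remainingUsd = budget.usdLimit - budget.usdUsed;
  return { allowed: estimatedCostUsd <= remainingUsd, remainingUsd };
}

// Post-call attribution: token-based cost for the request_costs row,
// with prices expressed per million tokens.
function requestCostUsd(
  inputTokens: number,
  outputTokens: number,
  pricePerMInput: number,
  pricePerMOutput: number
): number {
  return (inputTokens / 1e6) * pricePerMInput + (outputTokens / 1e6) * pricePerMOutput;
}
```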


9. Caching (Optional, but Often Huge)

If you want to claim big cost savings credibly, caching is usually part of the story.

Cache types

  1. Exact match cache (safe default)
  2. Semantic cache (opt-in, requires careful policy)

When NOT to cache

graph TD
  A[Normalized Request] --> B{Cache Eligible?}
  B -->|No| C[Skip Cache]
  B -->|Yes| D[Exact Cache Lookup]
  D -->|Hit| E[Return Cached Response]
  D -->|Miss| F[Optional Semantic Cache]
  F -->|Hit| E
  F -->|Miss| G[Call Provider + Store]
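
For the exact-match path, the cache key should hash only the normalized fields that determine the output, with tenant isolation baked in so one tenant can never hit another's entries. `exactCacheKey` is an assumed name in this sketch:

```typescript
import { createHash } from "node:crypto";

// Exact-match cache key over the normalized request. Keying on tenantId
// prevents cross-tenant cache hits; hashing keeps prompt text out of key stores.
function exactCacheKey(
  tenantId: string,
  modelGroup: string,
  messages: { role: string; content: string }[],
  temperature: number
): string {
  const payload = JSON.stringify({ tenantId, modelGroup, messages, temperature });
  return createHash("sha256").update(payload).digest("hex");
}
```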

10. Deployment Checklist

Security

Reliability

Observability

Governance


11. Testing + Chaos

You don’t have a gateway until it survives bad days.


12. Results + Measurement Methodology

If you publish numbers, publish how they were measured.

Example outcomes (illustrative)

| Metric | Before | After | Notes |
|---|---|---|---|
| Availability | 99.5% | 99.95% | measured at gateway success rate |
| P95 latency | 3.2s | 2.1s | routing avoided degraded providers |
| 429 rate | 4.2% | 0.3% | context-aware routing + cooldown |
| Cost / 1M tokens | $8.50 | $5.20 | depends heavily on mix + caching |

Measurement notes


13. Conclusion

A production-grade LLM Gateway is a control plane for AI calls:

- Policy decides what’s allowed (Strict Mode, tools, models)
- Router decides what’s optimal (weights + health + cost)
- Resiliency keeps you alive (CB + cooldown + backoff)
- Observability keeps you sane (OTel + metrics + request-level cost)
- Providers stay replaceable (canonical streaming + canonical tools)

The future is multi-provider and multi-model. If you design the gateway as an operational system—not a wrapper—you’ll ship reliability and leverage from day one.


Appendix: Configuration Example

# Mode
ENABLE_GATEWAY_MODE=true

# Strict Mode (server overrides client model)
STRICT_MODE=true
STRICT_MODEL_GROUP=prod-safe
STRICT_MAX_TOKENS=2000
STRICT_TEMPERATURE=0.7
STRICT_DISALLOW_TOOLS=false

# Weighted distribution for non-strict groups (or for strict groups)
GATEWAY_DISTRIBUTION=bedrock-sonnet-4.5:4,vertex-gemini-3:2,openai-direct-5.2:2,azure-openai-5.2:1,anthropic-direct:1

# Provider creds (store in vault in real prod)
AWS_ACCESS_KEY_ID=xxxxx
AWS_SECRET_ACCESS_KEY=xxxxx
AWS_REGION=us-east-1

AZURE_OPENAI_API_KEY=xxxxx
AZURE_OPENAI_ENDPOINT=https://xxx.openai.azure.com

ANTHROPIC_API_KEY=xxxxx
OPENAI_API_KEY=xxxxx
VERTEX_PROJECT_ID=xxxxx
GEMINI_API_KEY=xxxxx

Appendix: Minimal “Strict Mode” sentence you can add early in the post

Strict Mode ensures production safety: when enabled, the gateway ignores client model selection and enforces server-owned model groups and policies (tools, max tokens, allowlists), guaranteeing compliance even if the client is misconfigured.