Architecture Overview
HybridInference is designed as a modular, high-performance inference gateway.
System Architecture
┌─────────────┐
│ Clients │
└──────┬──────┘
│
▼
┌─────────────────┐
│ FastAPI Gateway│
│ (serving/) │
└────────┬────────┘
│
┌────┴────┐
│ │
▼ ▼
┌────────┐ ┌─────────┐
│Routing │ │Adapters │
│Manager │ │ │
└────┬───┘ └────┬────┘
│ │
▼ ▼
┌───────────────────────┐
│ LLM Providers │
│ ┌───────────────────┐ │
│ │ Local vLLM │ │
│ │ OpenAI API │ │
│ │ Gemini API │ │
│ │ Claude Sub (OAuth)│ │
│ │ Codex Sub (OAuth) │ │
│ └───────────────────┘ │
└───────────────────────┘
Network Layer
External traffic passes through three layers before reaching the application logic:
Client ──▶ Cloudflare (CDN + DDoS) ──▶ Nginx (:443) ──▶ FastAPI (:8080)
└──▶ Frontend (:3001)
Cloudflare: Edge CDN, DDoS protection, SSL termination (Full strict mode).
Nginx: Origin TLS, path-based routing, body size limits, WebSocket upgrade.
FastAPI: API authentication, model routing, rate limiting, observability.
The network layer handles external connectivity and request delivery. The sections below describe the internal inference pipeline that runs inside FastAPI.
Core Components
Serving Layer (serving/)
The serving layer provides the FastAPI-based gateway:
Gateway: HTTP API endpoints for inference requests
Adapters: Provider-specific API adapters
Observability: Logging, metrics, and tracing
Storage: PostgreSQL integration for request/response logging
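The adapter role above can be pictured as a small translation interface. The sketch below is illustrative only: the class and method names (`ProviderAdapter`, `translate_request`, `parse_response`) are assumptions, not the actual `serving/` API.

```python
from abc import ABC, abstractmethod

# Hypothetical adapter interface; names are illustrative, not the real serving/ API.
class ProviderAdapter(ABC):
    @abstractmethod
    def translate_request(self, request: dict) -> dict:
        """Convert a gateway request into the provider's wire format."""

    @abstractmethod
    def parse_response(self, raw: dict) -> dict:
        """Normalize the provider response back into the gateway schema."""

class OpenAIAdapter(ProviderAdapter):
    def translate_request(self, request: dict) -> dict:
        # Chat Completions is the gateway's native schema, so translation
        # for this provider is (nearly) the identity.
        return {"model": request["model"], "messages": request["messages"]}

    def parse_response(self, raw: dict) -> dict:
        return {"text": raw["choices"][0]["message"]["content"]}
```

Each provider gets one such adapter, keeping provider quirks out of the gateway and routing layers.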
Routing Layer (routing/)
Intelligent routing and load balancing:
Manager: Routes requests to optimal providers
Strategies: Different routing algorithms (round-robin, cost-based, latency-based)
Health Checks: Monitor provider availability and performance
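A minimal sketch of two of the strategies named above, with health checks folded in as a `healthy` flag on each provider. The interfaces are assumptions; the real `routing/` module may differ.

```python
import itertools

# Illustrative routing-strategy sketch; not the real routing/ interfaces.
class Provider:
    def __init__(self, name, avg_latency_ms, healthy=True):
        self.name = name
        self.avg_latency_ms = avg_latency_ms
        self.healthy = healthy  # updated by periodic health checks

class RoundRobinStrategy:
    def __init__(self, providers):
        self.providers = providers
        self._cycle = itertools.cycle(providers)

    def select(self):
        # Advance the cycle, skipping providers marked unhealthy.
        for _ in range(len(self.providers)):
            candidate = next(self._cycle)
            if candidate.healthy:
                return candidate
        raise RuntimeError("no healthy providers")

class LatencyStrategy:
    def select(self, providers):
        # Pick the healthy provider with the lowest observed latency.
        healthy = [p for p in providers if p.healthy]
        if not healthy:
            raise RuntimeError("no healthy providers")
        return min(healthy, key=lambda p: p.avg_latency_ms)
```

A cost-based strategy has the same shape, substituting per-token cost for latency in the `min()` key.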
Configuration (config/)
Centralized configuration management:
Model configurations
Provider settings
Routing policies
Feature flags
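One plausible shape for this configuration, sketched with dataclasses; every field name here is an assumption about `config/`, not its actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical config schema; field names are assumptions, not the real config/ layout.
@dataclass
class ProviderSettings:
    name: str
    base_url: str
    timeout_s: float = 30.0

@dataclass
class ModelConfig:
    name: str         # client-facing model name
    provider: str     # which ProviderSettings entry serves it
    max_tokens: int = 4096

@dataclass
class GatewayConfig:
    providers: dict[str, ProviderSettings] = field(default_factory=dict)
    models: dict[str, ModelConfig] = field(default_factory=dict)
    feature_flags: dict[str, bool] = field(default_factory=dict)

    def resolve(self, model_name: str) -> ProviderSettings:
        # Model name -> provider settings: the core lookup routing relies on.
        return self.providers[self.models[model_name].provider]
```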
Infrastructure (infrastructure/)
Deployment and observability:
Docker Compose service definitions and Dockerfiles
Prometheus metrics collection
Grafana dashboards and provisioning
Alertmanager rules and alert logger
Subscription Adapters
Some providers are accessed through OAuth subscription accounts rather than static API keys. These adapters have a layered credential architecture:
┌──────────────────────────────────────────────────────────────┐
│ Subscription Adapter (e.g. claude_sub) │
│ ┌────────────────────┐ ┌────────────────────────────┐ │
│ │ AccountPool │ │ CredentialProvider │ │
│ │ (codex_token.py) │ │ (claude_token.py) │ │
│ │ │ │ │ │
│ │ Health-aware │ │ OAuth token lifecycle: │ │
│ │ round-robin │ │ - refresh before expiry │ │
│ │ rotation │ │ - invalid_grant detection │ │
│ │ │ │ - state persistence (JSON) │ │
│ │ deactivate/activate│ │ - transition_state() │ │
│ └────────────────────┘ └────────────────────────────┘ │
│ │
│ Account states: active → cooldown → revoked/disabled │
│ Fallback: optional paid API key when all accounts down │
└──────────────────────────────────────────────────────────────┘
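The account state machine in the diagram can be sketched as follows. This mirrors the states and the fallback behavior described above, but the class and method names are assumptions about `codex_token.py`, not its real API.

```python
import itertools
import time
from enum import Enum

# Illustrative sketch of the account pool's state machine; names are assumptions.
class AccountState(Enum):
    ACTIVE = "active"
    COOLDOWN = "cooldown"
    REVOKED = "revoked"
    DISABLED = "disabled"

class Account:
    def __init__(self, account_id):
        self.account_id = account_id
        self.state = AccountState.ACTIVE
        self.cooldown_until = 0.0

class AccountPool:
    def __init__(self, accounts, fallback_api_key=None):
        self.accounts = accounts
        self.fallback_api_key = fallback_api_key
        self._rr = itertools.cycle(accounts)  # health-aware round-robin

    def deactivate(self, account, state=AccountState.COOLDOWN, cooldown_s=300):
        account.state = state
        if state is AccountState.COOLDOWN:
            account.cooldown_until = time.monotonic() + cooldown_s

    def acquire(self):
        for _ in range(len(self.accounts)):
            acct = next(self._rr)
            # Cooldowns expire automatically; revoked/disabled accounts never return.
            if acct.state is AccountState.COOLDOWN and time.monotonic() >= acct.cooldown_until:
                acct.state = AccountState.ACTIVE
            if acct.state is AccountState.ACTIVE:
                return acct
        if self.fallback_api_key is not None:
            return self.fallback_api_key  # all accounts down: optional paid API key
        raise RuntimeError("no usable subscription accounts")
```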
Two subscription adapters currently exist:
| Adapter | Provider | Protocol | Credential File |
|---|---|---|---|
| claude_sub | Anthropic (Claude Code OAuth) | Messages API | claude_token.py |
| codex_sub | OpenAI (Codex CLI OAuth) | Responses API | codex_token.py |
Both share AccountPool (from codex_token.py) for health-aware rotation. Each has its own CredentialProvider for provider-specific OAuth endpoints.
For Claude, the account pool is shared process-wide between:
ClaudeSubscriptionAdapter (OpenAI-compatible /v1/chat/completions)
anthropic_proxy.py (POST /anthropic/v1/messages)
This shared singleton keeps cooldown and revoke state consistent across both surfaces.
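A minimal sketch of that process-wide sharing, assuming a module-level accessor (all names here are hypothetical):

```python
# Hypothetical module-level singleton; both the adapter and the Anthropic proxy
# would import this accessor, so they observe the same cooldown/revoke state.
_claude_pool = None

class _Pool:
    # Stand-in for the real AccountPool in codex_token.py.
    def __init__(self):
        self.cooldowns = {}

def get_claude_account_pool():
    global _claude_pool
    if _claude_pool is None:
        _claude_pool = _Pool()
    return _claude_pool
```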
Northbound API Surfaces
The gateway currently exposes more than one client-facing protocol surface:
| Surface | Endpoint | Typical clients | Notes |
|---|---|---|---|
| OpenAI-compatible Chat Completions | /v1/chat/completions | SDKs, OpenAI-compatible tools | Primary public surface |
| Anthropic-compatible Messages | /anthropic/v1/messages | Claude Code CLI | Only models routed through claude_sub |
The Anthropic surface is effectively an identity translator: the client-facing and upstream protocols are both the Anthropic Messages API, so the route mainly performs auth, rate limiting, model resolution, credential injection, and usage logging.
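That pass-through shape can be sketched as a single function. Every helper name here (`authenticate`, `resolve_model`, and so on) is an illustrative stand-in, not the real anthropic_proxy.py code.

```python
# Illustrative pass-through sketch of the Anthropic surface; helper names are hypothetical.
def proxy_messages(body, api_key, *, authenticate, check_rate_limit,
                   resolve_model, get_credential, forward, log_usage):
    caller = authenticate(api_key)                 # gateway-level API auth
    check_rate_limit(caller)                       # per-caller rate limiting
    body = dict(body)
    body["model"] = resolve_model(body["model"])   # gateway alias -> upstream model id
    token = get_credential()                       # inject subscription OAuth token
    # Identity translation: the body is already Anthropic Messages format,
    # so it is forwarded unchanged apart from the resolved model name.
    response = forward(body, headers={"Authorization": f"Bearer {token}"})
    log_usage(caller, body["model"], response.get("usage", {}))
    return response
```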
For design details, see the repository design docs:
docs/claude-account-lifecycle.md: account state machine, error handling, data model
docs/subscription-adapter-architecture.md: long-term multi-provider architecture
Key Design Principles
Modularity: Clear separation between serving, routing, and provider layers
Extensibility: Easy to add new providers and routing strategies
Observability: Comprehensive logging and metrics at every layer
Performance: Optimized for low-latency, high-throughput inference
Reliability: Health checks, retries, and fallback mechanisms
Data Flow
Client sends inference request to Gateway
Gateway validates and preprocesses request
Routing Manager selects optimal provider
Adapter translates request to provider-specific format
Provider processes inference
Response is logged to PostgreSQL
Metrics are exported to Prometheus
Response is returned to client
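The numbered flow above can be sketched end to end, with each external system (provider, PostgreSQL, Prometheus) replaced by an injected stub; the function and parameter names are illustrative, not the actual gateway code.

```python
# Illustrative end-to-end sketch of the data flow; names are hypothetical.
def handle_request(request, *, validate, route, adapter, infer, store, metrics):
    validate(request)                       # 2. validate and preprocess
    provider = route(request)               # 3. routing manager selects provider
    upstream = adapter(request, provider)   # 4. translate to provider format
    response = infer(provider, upstream)    # 5. provider processes inference
    store(request, response)                # 6. log to PostgreSQL
    metrics(provider, response)             # 7. export to Prometheus
    return response                         # 8. return to client
```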