# Architecture Overview
HybridInference is a modular, high-performance inference gateway that routes requests across local (vLLM) and hosted (OpenAI, Gemini) LLM providers.
## System Architecture
```
┌─────────────┐
│   Clients   │
└──────┬──────┘
       │
       ▼
┌─────────────────┐
│ FastAPI Gateway │
│   (serving/)    │
└────────┬────────┘
         │
    ┌────┴────┐
    │         │
    ▼         ▼
┌────────┐ ┌─────────┐
│Routing │ │Adapters │
│Manager │ │         │
└────┬───┘ └────┬────┘
     │          │
     ▼          ▼
┌──────────────────┐
│  LLM Providers   │
│ ┌──────────────┐ │
│ │  Local vLLM  │ │
│ │  OpenAI API  │ │
│ │  Gemini API  │ │
│ └──────────────┘ │
└──────────────────┘
```
## Core Components
### Serving Layer (`serving/`)

The serving layer provides the FastAPI-based gateway:

- **Gateway**: HTTP API endpoints for inference requests
- **Adapters**: Provider-specific request/response translation (local vLLM, OpenAI, Gemini)
- **Observability**: Logging, metrics, and tracing
- **Storage**: PostgreSQL integration for request/response logging
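
A minimal sketch of the request path through the gateway. The route, model names, and the `route_request`/`call_provider` helpers are hypothetical stand-ins for the routing and adapter layers, not the actual `serving/` code:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class InferenceRequest(BaseModel):
    model: str
    prompt: str
    max_tokens: int = 256

def route_request(req: InferenceRequest) -> str:
    # Stand-in for the routing manager: pick a provider for this model.
    return "local_vllm" if req.model.startswith("local/") else "openai"

async def call_provider(provider: str, req: InferenceRequest) -> str:
    # Stand-in for a provider adapter: translate and forward the request.
    return f"[{provider}] completion for {req.model}"

@app.post("/v1/completions")
async def completions(req: InferenceRequest) -> dict:
    provider = route_request(req)                # routing layer
    output = await call_provider(provider, req)  # adapter layer
    return {"provider": provider, "output": output}
```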
### Routing Layer (`routing/`)

The routing layer handles intelligent routing and load balancing:

- **Manager**: Routes requests to optimal providers
- **Strategies**: Pluggable routing algorithms (round-robin, cost-based, latency-based)
- **Health Checks**: Monitor provider availability and performance
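
As an illustration, a latency-based strategy might look like the sketch below; the `Provider` record and its fields are assumptions, not the project's actual types:

```python
from dataclasses import dataclass, field

@dataclass
class Provider:
    name: str
    healthy: bool = True  # flipped by the health checker
    latencies_ms: list[float] = field(default_factory=list)

    def avg_latency(self) -> float:
        # Providers with no samples yet sort last.
        if not self.latencies_ms:
            return float("inf")
        return sum(self.latencies_ms) / len(self.latencies_ms)

def pick_latency_based(providers: list[Provider]) -> Provider:
    # Latency-based strategy: lowest average observed latency wins.
    healthy = [p for p in providers if p.healthy]
    if not healthy:
        raise RuntimeError("no healthy providers available")
    return min(healthy, key=lambda p: p.avg_latency())
```

A round-robin strategy would rotate through the same healthy list, and a cost-based one would rank providers by per-token price instead.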
### Configuration (`config/`)

Centralized configuration management:

- Model configurations
- Provider settings
- Routing policies
- Feature flags
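
One way these pieces could fit together is a typed schema along the lines of the Pydantic sketch below; every class and field name here is illustrative, not the project's actual schema:

```python
from pydantic import BaseModel

class ProviderSettings(BaseModel):
    api_base: str
    api_key_env: str           # env var that holds the credential
    timeout_s: float = 30.0

class RoutingPolicy(BaseModel):
    strategy: str = "latency"  # "round_robin" | "cost" | "latency"
    fallback_order: list[str] = []

class AppConfig(BaseModel):
    models: dict[str, str]     # model name -> provider name
    providers: dict[str, ProviderSettings]
    routing: RoutingPolicy = RoutingPolicy()
    feature_flags: dict[str, bool] = {}
```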
### Infrastructure (`infrastructure/`)

Deployment and observability:

- Systemd service definitions
- Prometheus metrics collection
- Grafana dashboards
- Alertmanager rules
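
On the application side of Prometheus collection, the gateway can export counters and histograms via the standard `prometheus_client` library; the metric names below are placeholders:

```python
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "inference_requests_total",
    "Inference requests by provider and outcome",
    ["provider", "status"],
)
LATENCY = Histogram(
    "inference_request_latency_seconds",
    "End-to-end request latency by provider",
    ["provider"],
)

def record_request(provider: str, status: str, seconds: float) -> None:
    # One call per completed request; Prometheus scrapes the exposed port.
    REQUESTS.labels(provider=provider, status=status).inc()
    LATENCY.labels(provider=provider).observe(seconds)

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for the Prometheus scraper
```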
## Key Design Principles

- **Modularity**: Clear separation between serving, routing, and provider layers
- **Extensibility**: Easy to add new providers and routing strategies
- **Observability**: Comprehensive logging and metrics at every layer
- **Performance**: Optimized for low-latency, high-throughput inference
- **Reliability**: Health checks, retries, and fallback mechanisms
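
The reliability principle combines retries with provider fallback. A minimal sketch, assuming each provider object exposes a hypothetical `complete` coroutine:

```python
import asyncio

async def call_with_fallback(providers, request, retries: int = 2):
    # Try providers in preference order; retry transient failures with
    # exponential backoff, then fall back to the next provider.
    last_error = None
    for provider in providers:
        for attempt in range(retries):
            try:
                return await provider.complete(request)
            except Exception as err:  # narrow to transport errors in real code
                last_error = err
                await asyncio.sleep(0.2 * 2 ** attempt)
    raise RuntimeError("all providers exhausted") from last_error
```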
## Data Flow

1. Client sends an inference request to the Gateway
2. Gateway validates and preprocesses the request
3. Routing Manager selects the optimal provider
4. Adapter translates the request into the provider-specific format
5. Provider runs the inference
6. Response is logged to PostgreSQL
7. Metrics are exported to Prometheus
8. Response is returned to the client
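
To make steps 4 and 5 concrete, here is a sketch of an adapter forwarding a request to an OpenAI-compatible completions endpoint over `httpx`. The request type mirrors the gateway sketch above, and the parameter names are illustrative:

```python
from dataclasses import dataclass

import httpx

@dataclass
class InferenceRequest:
    model: str
    prompt: str
    max_tokens: int = 256

async def openai_complete(req: InferenceRequest, api_base: str, api_key: str) -> dict:
    # Step 4: translate the internal request into the provider's JSON format.
    payload = {"model": req.model, "prompt": req.prompt, "max_tokens": req.max_tokens}
    # Step 5: forward to the provider and surface HTTP errors to the caller.
    async with httpx.AsyncClient(base_url=api_base, timeout=30.0) as client:
        resp = await client.post(
            "/v1/completions",
            json=payload,
            headers={"Authorization": f"Bearer {api_key}"},
        )
        resp.raise_for_status()
        return resp.json()
```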