# OpenRouter-Compatible API Gateway

A FastAPI-based gateway that serves OpenRouter-compatible traffic, fans out to local and remote LLM adapters, and exposes observability interfaces for operations. In production the application runs as a Docker container on port 8080 behind Nginx; for local development it can run on any port via uvicorn directly.
## Architecture

```
hybridInference/
├── docs/                          # Deployment and integration guides
├── serving/
│   ├── servers/
│   │   ├── app.py                 # FastAPI entry point (exposes /v1/*)
│   │   ├── bootstrap.py           # Service bootstrap: models, routing, DB, rate limits
│   │   └── routers/               # API routers (health, models, completions, admin)
│   ├── adapters/                  # Provider adapters (local vLLM, DeepSeek, Gemini, Llama, ...)
│   ├── storage/                   # Database loggers (SQLite/PostgreSQL)
│   ├── observability/             # Metrics export (Prometheus, traces)
│   └── utils/                     # Logging, configuration helpers
├── routing/                       # Routing manager and execution strategies
├── config/
│   ├── models.yaml                # Canonical model definitions + adapters
│   └── routing.yaml               # Weighted routing configuration (optional)
├── infrastructure/docker/         # Dockerfiles and docker-compose.yml
└── var/db/openrouter_logs.db      # Default SQLite request log (created at runtime)
```
## Key Components

- **FastAPI app** (`serving.servers.app:create_app`): Hosts OpenRouter-compatible endpoints plus admin and metrics routes.
- **Bootstrap** (`serving.servers.bootstrap`): Loads environment, registers models, applies routing weights, wires database logging, and configures rate limits.
- **Adapters** (`serving.adapters.*`): Translate requests to providers such as local vLLM, DeepSeek, Gemini, and Llama API.
- **Routing** (`routing.*`): Supports fixed-ratio and future strategies for splitting traffic across adapters.
- **Observability** (`serving.observability.metrics`): Prometheus metrics and structured request logging.
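The adapter layer can be pictured as a small translation interface between the gateway's canonical request shape and each provider's payload. The sketch below is illustrative only: the class and method names (`ProviderAdapter`, `build_payload`) are assumptions, not the repository's actual API.

```python
from abc import ABC, abstractmethod


class ProviderAdapter(ABC):
    """Hypothetical adapter contract: translate an OpenRouter-style
    request into a provider-specific payload."""

    @abstractmethod
    def build_payload(self, request: dict) -> dict: ...


class DeepSeekAdapter(ProviderAdapter):
    def build_payload(self, request: dict) -> dict:
        # Map the gateway's canonical model id to the provider's id
        # and pass messages through unchanged.
        return {
            "model": "deepseek-chat",
            "messages": request["messages"],
            "max_tokens": request.get("max_tokens", 256),
        }


payload = DeepSeekAdapter().build_payload(
    {"model": "deepseek-chat", "messages": [{"role": "user", "content": "Ping"}]}
)
```

Keeping the contract this narrow lets the routing layer treat every provider uniformly, whatever the upstream API looks like.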
## Features

- **OpenRouter API compatibility**: Implements `/v1/chat/completions`, `/v1/models`, and related schemas.
- **Hybrid routing**: Combines local vLLM workers with hosted APIs; supports hard/soft offload.
- **Resilient adapters**: Automatic retry/fallback when a provider returns errors.
- **Usage accounting**: Prompt/completion token tracking and persisted request logs.
- **Streaming responses**: Server-Sent Events (SSE) for incremental output.
- **Observability hooks**: Prometheus metrics endpoint and structured request logs (SQLite/PostgreSQL).
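Fixed-ratio hybrid routing amounts to a weighted random choice over adapters. A minimal sketch of the idea (not the repository's implementation; the adapter names and weight values are invented for illustration):

```python
import random


def pick_adapter(weights: dict[str, float], rng: random.Random) -> str:
    """Pick one adapter name according to fixed-ratio weights."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Example: send ~70% of traffic to the local vLLM worker, ~30% to a hosted API.
rng = random.Random(0)
choices = [
    pick_adapter({"local_vllm": 0.7, "deepseek": 0.3}, rng) for _ in range(1000)
]
local_share = choices.count("local_vllm") / len(choices)
```

Because the choice is per-request, the observed split converges to the configured ratio over many requests without any shared state between workers.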
## Development Setup

### Prerequisites

- Python 3.10 or newer
- uv (recommended) or conda

### Create Environment

```bash
# Clone and bootstrap
git clone <repository-url>
cd hybridInference
uv venv -p 3.10
source .venv/bin/activate
uv sync
```
### Local Environment Variables

Create `.env` from the template:

```bash
cp .env.example .env
```

Populate it with provider credentials and runtime configuration:

```bash
LOCAL_BASE_URL=https://freeinference.org/v1
OFFLOAD=0
DEEPSEEK_API_KEY=your-deepseek-api-key
GEMINI_API_KEY=your-gemini-api-key
LLAMA_API_KEY=your-llama-api-key
LLAMA_BASE_URL=https://your-llama-api-base/v1
USE_SQLITE_LOG=true
```
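Boolean flags like `USE_SQLITE_LOG` and `OFFLOAD` above arrive as strings and have to be coerced. One common pattern is sketched below; the exact parsing rules inside the gateway may differ, and `env_bool` is a hypothetical helper:

```python
import os


def env_bool(name: str, default: bool = False) -> bool:
    """Interpret common truthy strings ("1", "true", "yes", "on")."""
    return os.getenv(name, str(default)).strip().lower() in {"1", "true", "yes", "on"}


# Simulate the .env values shown above.
os.environ["USE_SQLITE_LOG"] = "true"
os.environ["OFFLOAD"] = "0"

use_sqlite = env_bool("USE_SQLITE_LOG")
offload = env_bool("OFFLOAD")
local_base = os.getenv("LOCAL_BASE_URL", "http://localhost:8000/v1")
```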
### Run Locally

```bash
# Development server with reload on port 8080
uvicorn serving.servers.app:app --reload --host 0.0.0.0 --port 8080

# Alternate: respect the PORT env var
export PORT=9000
uvicorn serving.servers.app:app --host 0.0.0.0 --port "$PORT"
```
When the app starts it will:

1. Load environment variables (dotenv).
2. Register models from `config/models.yaml`.
3. Apply routing overrides from `config/routing.yaml` if present.
4. Initialize the database logger (SQLite under `var/db` by default).
5. Configure per-provider rate limits when API keys are supplied.
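The startup sequence above can be sketched as a single bootstrap function. Everything here is an illustrative simplification: the function name, the plain-dict model registry, and the environment variable lookups are assumptions layered on the steps listed above, not the repository's code.

```python
import os
from pathlib import Path


def bootstrap(models: dict, routing_path: Path,
              default_db: str = "var/db/openrouter_logs.db"):
    """Illustrative bootstrap mirroring the five startup steps."""
    registry = dict(models)                             # steps 1-2: register models
    weights: dict = {}
    if routing_path.exists():                           # step 3: optional overrides
        ...  # parse routing weights here
    db_path = os.getenv("SQLITE_DB_PATH", default_db)   # step 4: DB logger
    rate_limits = {                                     # step 5: only keyed providers
        p: os.getenv(f"{p.upper()}_TPM_LIMIT")
        for p in ("gemini", "deepseek")
        if os.getenv(f"{p.upper()}_API_KEY")
    }
    return registry, weights, db_path, rate_limits


reg, weights, db, limits = bootstrap(
    {"llama-4-scout": {"adapter": "llama"}}, Path("config/routing.yaml")
)
```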
### Quick Checks

```bash
# Health
curl http://localhost:8080/health

# Models (OpenRouter schema)
curl http://localhost:8080/v1/models | jq

# Chat completion (http_proxy= bypasses any local proxy for localhost traffic)
env http_proxy= curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-4-scout",
    "messages": [{"role": "user", "content": "Ping"}],
    "max_tokens": 64
  }'
```
## Production Deployment

All services run via Docker Compose. Nginx on the host terminates TLS; Cloudflare provides CDN and DDoS protection in front of Nginx.

```bash
make up    # Start all services
make ps    # Verify health
```

Runtime operations:

- Restart: `make restart` or `make restart s=backend`
- Logs: `make logs` or `make logs s=backend`
- Health: `curl https://freeinference.org/health`

See the deployment guide in `docs/` for full details.
## API Surface

| Method | Path | Description |
|---|---|---|
| GET | `/v1/models` | Enumerate available models with OpenRouter metadata |
| POST | `/v1/chat/completions` | OpenRouter/OpenAI-compatible chat completion |
| GET | `/health` | Liveness and dependency checks |
| GET | `/metrics` | Prometheus metrics (requires auth upstream) |
| GET | | Current routing weights (admin scope) |
| GET | | Aggregated usage statistics |
### Example Requests

```bash
# Streaming response (http_proxy= bypasses any local proxy for localhost traffic)
env http_proxy= curl -N -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-chat",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Describe the architecture."}
    ],
    "stream": true,
    "temperature": 0.7,
    "max_tokens": 256
  }'
```
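A streaming response like the one above arrives as SSE lines of the form `data: {json chunk}`, terminated by `data: [DONE]`. A minimal client-side parser, assuming OpenAI-style delta chunks (the `iter_deltas` helper is illustrative, not part of the repository):

```python
import json
from typing import Iterable, Iterator


def iter_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from OpenAI-style SSE lines."""
    for line in lines:
        if not line.startswith("data: "):
            continue                       # skip blank keep-alive lines
        data = line[len("data: "):].strip()
        if data == "[DONE]":
            return                         # end-of-stream sentinel
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            yield delta


# Simulated stream fragments, as curl -N would print them.
sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
text = "".join(iter_deltas(sample))
```

In a real client the `lines` iterable would come from the HTTP response body (e.g. `iter_lines()` on a streaming response object).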
## Logging and Metrics

- **SQLite (default)**: When `USE_SQLITE_LOG=true`, logs persist to `var/db/openrouter_logs.db`. Override with `SQLITE_DB_PATH` or `OPENROUTER_SQLITE_DB`.
- **PostgreSQL**: Set `USE_SQLITE_LOG=false` and `DATABASE_URL=<dsn>` to stream logs into PostgreSQL for analytics.
- **Metrics**: `/metrics` exposes Prometheus counters and latencies. Enable scraping through infrastructure (e.g., Prometheus + Grafana).

Inspect logs locally:

```bash
python scripts/view_logs.py
sqlite3 var/db/openrouter_logs.db 'SELECT model_id, COUNT(*) FROM api_logs GROUP BY model_id;'
```
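The same `GROUP BY` query can be run from Python with the stdlib `sqlite3` module. The sketch below uses an in-memory database and an assumed, simplified `api_logs` schema (the real table likely has more columns):

```python
import sqlite3

# In-memory stand-in for var/db/openrouter_logs.db; the real schema may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE api_logs (model_id TEXT, prompt_tokens INT, completion_tokens INT)"
)
conn.executemany(
    "INSERT INTO api_logs VALUES (?, ?, ?)",
    [
        ("llama-4-scout", 12, 40),
        ("deepseek-chat", 8, 30),
        ("llama-4-scout", 5, 20),
    ],
)

# Requests per model, mirroring the sqlite3 CLI one-liner above.
counts = dict(
    conn.execute("SELECT model_id, COUNT(*) FROM api_logs GROUP BY model_id").fetchall()
)
```

Pointing `sqlite3.connect` at the real log file instead of `:memory:` gives the same aggregation over production data.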
## Testing

```bash
# Fast unit/integration tests
pytest -m "not external" -q

# Focused server tests
pytest test/servers/test_bootstrap.py -q
```
## Troubleshooting

- **Port already in use**: `sudo lsof -ti :80 | xargs sudo kill -9`
- **Missing models**: Verify `config/models.yaml` contains the expected entries and that `LOCAL_BASE_URL` is reachable.
- **No logs written**: Confirm `USE_SQLITE_LOG` and filesystem permissions for `var/db/`.
- **Provider rate limiting**: Adjust `GEMINI_TPM_LIMIT`, `DEEPSEEK_TPM_LIMIT`, or equivalent environment variables as needed.