Hybrid Inference Routing System
The routing system uses a two-layer architecture to distribute traffic between local and remote deployments:
Architecture
Decision Layer (routing/manager.py + routing/strategies/weight.py)
Reads config/routing.yaml and computes weight distributions between local and remote deployments. Currently supports a fixed-ratio strategy with plans for expansion.
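As a rough illustration of what a fixed-ratio computation might look like, the sketch below gives local_fraction of the traffic to the local adapters and the remainder to the remote ones. The helper name fixed_ratio_weights, the even split inside each group, and the handling of an empty group are assumptions made for this example, not the project's FixedRatioStrategy code.

```python
from typing import Dict, Hashable, Sequence


def fixed_ratio_weights(
    local: Sequence[Hashable],
    remote: Sequence[Hashable],
    local_fraction: float,
) -> Dict[Hashable, float]:
    """Hypothetical helper: split traffic between two adapter groups."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be within [0, 1]")
    # Assumption: if one group is empty, the other group receives all traffic.
    if not local:
        local_fraction = 0.0
    elif not remote:
        local_fraction = 1.0
    weights: Dict[Hashable, float] = {}
    for adapter in local:
        weights[adapter] = local_fraction / len(local)
    for adapter in remote:
        weights[adapter] = (1.0 - local_fraction) / len(remote)
    return weights


# A 60/40 split with one local and two remote adapters.
weights = fixed_ratio_weights(["local-a"], ["remote-a", "remote-b"], 0.6)
```

With these inputs the local adapter carries 0.6 of the traffic and each remote adapter carries roughly 0.2.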
Execution Layer (routing/routers.py)
Performs weighted random selection based on computed weights and provides automatic fallback to alternative adapters on request failure. The concrete FixedRouter lives in routing/routers.py; routing/executor.py is a backward-compatibility shim that re-exports FixedRouter as RouteExecutor along with ProviderPinError and RouteConfig.
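The selection-plus-fallback behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the FixedRouter implementation: route_with_fallback, the send callback, and the choice to drop a failed adapter and re-draw among the rest are all assumptions.

```python
import random
from typing import Callable, Dict, Optional


def route_with_fallback(
    weights: Dict[str, float],
    send: Callable[[str], str],
    rng: Optional[random.Random] = None,
) -> str:
    """Hypothetical sketch: pick an adapter by weight, fall back on failure."""
    rng = rng or random.Random()
    # Only adapters with a positive weight are eligible.
    remaining = {name: w for name, w in weights.items() if w > 0}
    last_error: Optional[Exception] = None
    while remaining:
        names = list(remaining)
        chosen = rng.choices(names, weights=[remaining[n] for n in names], k=1)[0]
        try:
            return send(chosen)
        except Exception as exc:  # a real router would likely catch narrower errors
            last_error = exc
            del remaining[chosen]  # do not retry the adapter that just failed
    raise RuntimeError("all adapters failed") from last_error


def flaky_send(name: str) -> str:
    # Stand-in for a real request: the local adapter always fails here.
    if name == "local":
        raise ConnectionError("local is down")
    return f"served by {name}"


result = route_with_fallback({"local": 0.6, "remote": 0.4}, flaky_send)
```

Because the simulated local adapter always fails, the request ends up on the remote adapter whichever one is drawn first.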
Features
Fixed-ratio routing: Configurable traffic split between local and remote deployments
Health monitoring (optional): Simple health checks with automatic weight adjustment
Automatic fallback: Seamless failover when primary adapter fails
Environment variable support: Configuration with ${VAR} and ${VAR:-default} syntax
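To make the two placeholder forms concrete, here is a small sketch of how such expansion could work. The expand_env function is hypothetical. It follows the POSIX shell meaning of ":-" (use the default when the variable is unset or empty), and leaving an unresolved ${VAR} untouched is an assumption; the project's loader may behave differently.

```python
import os
import re
from typing import Mapping, Optional

# Matches ${NAME} and ${NAME:-default}.
_VAR_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def expand_env(text: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical helper: expand ${VAR} and ${VAR:-default} in a string."""
    source = os.environ if env is None else env

    def replace(match: "re.Match[str]") -> str:
        name, default = match.group(1), match.group(2)
        value = source.get(name)
        if default is not None:
            # ':-' semantics: fall back when the variable is unset or empty.
            return value if value else default
        if value is None:
            # Assumption: leave an unresolved placeholder as it is.
            return match.group(0)
        return value

    return _VAR_PATTERN.sub(replace, text)


endpoint = expand_env("${LOCAL_DEPLOYMENT_URL:-http://localhost:8000}", env={})
```

With an empty environment, endpoint falls back to the default URL given in the placeholder.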
Configuration
See the Configuration guide for detailed options and examples.
Required Files
config/models.yaml: Registers available models and adapters
Optional Files
config/routing.yaml: Configures local/remote deployment split and health checking
Example Configuration (60/40 split):
routing_strategy: fixed
routing_parameter:
local_fraction: 0.6
timeout: 2
health_check: 30
logging:
output: output.log
local_deployment:
- endpoint: ${LOCAL_DEPLOYMENT_URL:-http://localhost:8000}
models:
remote_deployment:
models:
Running the Server
# Development: run FastAPI app with routing enabled
uvicorn serving.servers.app:app --host 0.0.0.0 --port 8080
# Or use a custom port for local development
PORT=9000
uvicorn serving.servers.app:app --host 0.0.0.0 --port "$PORT"
When the application starts, serving.servers.bootstrap loads config/models.yaml and optionally config/routing.yaml. If routing.yaml is present, the RoutingManager applies the configured weights; otherwise, the default weights from models.yaml are used.
API Endpoints
GET /v1/models - List available models
POST /v1/chat/completions - Chat completion with automatic routing
GET /routing - View current routing configuration and weights
GET /health - Health check endpoint
Extending the System
Adding New Strategies
Create a new strategy class in routing/strategies/weight.py:
from typing import Dict, List

class RoundRobinStrategy:
    def assign(self, local: List, remote: List) -> Dict[object, float]:
        # Implementation: return a weight for each adapter
        raise NotImplementedError
Update routing/manager.py to use the new strategy based on the routing_strategy config.
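One way to select a strategy from the routing_strategy value is a small name-to-class registry, sketched below. Every name here (EqualSplitStrategy, STRATEGIES, build_strategy, the "equal" key) is hypothetical and chosen only for illustration; routing/manager.py may wire strategies differently.

```python
from typing import Dict, Hashable, List


class EqualSplitStrategy:
    """Hypothetical strategy: give every adapter the same weight."""

    def assign(
        self, local: List[Hashable], remote: List[Hashable]
    ) -> Dict[Hashable, float]:
        adapters = list(local) + list(remote)
        if not adapters:
            return {}
        return {adapter: 1.0 / len(adapters) for adapter in adapters}


# Hypothetical registry mapping routing_strategy values to strategy classes.
STRATEGIES: Dict[str, type] = {"equal": EqualSplitStrategy}


def build_strategy(name: str):
    """Look up and instantiate the strategy named in the config."""
    try:
        strategy_cls = STRATEGIES[name]
    except KeyError:
        raise ValueError(f"unknown routing_strategy: {name!r}") from None
    return strategy_cls()


config = {"routing_strategy": "equal"}
weights = build_strategy(config["routing_strategy"]).assign(["local-a"], ["remote-a"])
```

A registry keeps the manager free of strategy-specific branches, and an unknown name fails with a clear error instead of silently falling back.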
RouteWise Strategy
In addition to the deployment-wide fixed-ratio strategy in routing.yaml, a
cost-aware routewise strategy is available as a per-model opt-in, enabled by
adding router: routewise to a model entry in config/models.yaml. Each
opted-in model gets its own RouteWiseRouter instance, constructed by
routing/model_router_registry.py from that model’s router_params: block
(validated by the RouteWiseParams schema in
routing/strategies/routewise.py).
Configuration splits by ownership:
router_params: (per model) holds algorithm knobs only: budget_alpha (LP cost budget), latency_hedge_mode, latency SLO/window, envelope percentiles/window/min-samples, output predictor, prefix-cache flag.
Route entries (per provider) hold resource semantics: provider_type: on_demand | quota | concurrency, pricing:, quota: {limit} plus a required quota_source: block (the provider usage API is the quota truth source, including window/reset semantics), concurrency: {limit}, and optional quota_pool: / concurrency_pool: ids for routes that share a subscription.
Quota-bearing models refuse to start until the cost envelope is calibrated
from recent api_logs traffic (see envelope_min_samples); a cold deploy
with no history fails fast by design. The reference block in
config/models.yaml (under minimax-fast) lists every option and is
completeness-tested against the RouteWiseConfig dataclass. Design specs
live under docs/agents/specs/.
Health Monitoring
Health checks are optional and can be enabled by setting health_check > 0 in the configuration. The system performs simple GET requests to /health endpoints and adjusts weights accordingly.
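The source does not spell out how weights are adjusted, so the following is only one plausible reading: unhealthy endpoints drop to weight 0 and the remaining weights are renormalized. The names http_probe and apply_health are hypothetical, and keeping the configured weights when nothing is healthy is an assumption.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict


def http_probe(base_url: str, timeout: float = 2.0) -> bool:
    """Hypothetical probe: GET <base_url>/health and report a 2xx status."""
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        return False


def apply_health(
    weights: Dict[str, float],
    is_healthy: Callable[[str], bool],
) -> Dict[str, float]:
    """Zero out unhealthy endpoints and renormalize the rest."""
    kept = {ep: (w if is_healthy(ep) else 0.0) for ep, w in weights.items()}
    total = sum(kept.values())
    if total == 0:
        # Assumption: with nothing healthy, keep the configured weights.
        return dict(weights)
    return {ep: w / total for ep, w in kept.items()}


# Simulated probe so the example runs without a network.
adjusted = apply_health({"local": 0.6, "remote": 0.4}, lambda ep: ep != "local")
```

In a running service, http_probe would be passed as is_healthy and the adjustment repeated every health_check seconds.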
Session Affinity
FixedRouter keeps a per-(user, model) pin to the last-selected provider for
five minutes (sliding TTL). Goals:
Keep one conversation on one backend so prompt caches stay warm and latency stays consistent.
Drop the pin the moment that backend errors, so users don’t get stuck on a failing provider.
Affinity key:
Authenticated requests: the user’s auth_key_hash.
Anonymous requests: f"ip:{client_ip}".
Pin lifecycle:
First request from (key, model) → weighted random pick → entry stored.
Subsequent requests within 300 s on the same (key, model) reuse the same endpoint and refresh the TTL.
Any exception from the pinned provider drops the entry; fallback runs; the next request creates a fresh pin.
If the pinned endpoint is no longer in the allowed pool (weight-0 in routing.yaml or its circuit is open), the entry is dropped and a fresh weighted-random pick runs.
After 300 s of inactivity the entry expires.
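The affinity key and pin lifecycle above can be sketched as a small in-process table with a sliding TTL. This is an illustrative model under stated assumptions, not FixedRouter's code: affinity_key, AffinityTable, and its method names are hypothetical, and the clock is injectable only so the example can simulate time.

```python
import time
from typing import Callable, Dict, Optional, Set, Tuple

TTL_SECONDS = 300.0


def affinity_key(auth_key_hash: Optional[str], client_ip: str) -> str:
    """Authenticated requests use the key hash; anonymous ones use the IP."""
    return auth_key_hash if auth_key_hash else f"ip:{client_ip}"


class AffinityTable:
    """Hypothetical in-process pin table keyed by (affinity key, model)."""

    def __init__(
        self,
        ttl: float = TTL_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl
        self._clock = clock
        self._pins: Dict[Tuple[str, str], Tuple[str, float]] = {}

    def lookup(self, key: str, model: str, allowed: Set[str]) -> Optional[str]:
        """Return the pinned endpoint, or None if expired or not allowed."""
        entry = self._pins.get((key, model))
        if entry is None:
            return None
        endpoint, expires_at = entry
        now = self._clock()
        if now >= expires_at or endpoint not in allowed:
            del self._pins[(key, model)]
            return None
        # Sliding TTL: every hit pushes the expiry forward.
        self._pins[(key, model)] = (endpoint, now + self._ttl)
        return endpoint

    def pin(self, key: str, model: str, endpoint: str) -> None:
        self._pins[(key, model)] = (endpoint, self._clock() + self._ttl)

    def drop(self, key: str, model: str) -> None:
        """Called when the pinned provider raises."""
        self._pins.pop((key, model), None)


now = [0.0]  # simulated clock, in seconds
table = AffinityTable(clock=lambda: now[0])
key = affinity_key(None, "203.0.113.7")
table.pin(key, "model-a", "local")
now[0] = 200.0
first_hit = table.lookup(key, "model-a", {"local", "remote"})
now[0] = 450.0  # 250 s after the last hit, still inside the refreshed TTL
second_hit = table.lookup(key, "model-a", {"local", "remote"})
now[0] = 800.0  # 350 s of inactivity
after_idle = table.lookup(key, "model-a", {"local", "remote"})
```

In this run both early lookups return the pinned endpoint, and the lookup after 350 s of inactivity returns None, at which point a caller would make a fresh weighted-random pick and pin it.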
Scope and limits:
State is in-process. Each Uvicorn worker tracks its own table. Same as key_pool.py.
Affinity does not survive a restart.
pin_provider (admin override via X-Route-Pin) bypasses affinity.
Kill switch: set ROUTING_AFFINITY_ENABLED=0 to disable.
Metrics: routing_affinity_events_total{event,model} with events
hit | miss | created | expired | dropped_error | dropped_unavailable.
Migration Notes
For users migrating from older versions:
The old deployment.example.yaml format is deprecated
Use the simplified config/routing.yaml structure shown above
Legacy RoutingStrategy / select_deployment patterns have been replaced with the current FixedRatioStrategy.assign() approach