# Hybrid Inference Routing System

The routing system implements a two-layer architecture for intelligent traffic distribution:

## Architecture

### Decision Layer (`routing/manager.py` + `routing/strategies/weight.py`)

Reads `config/routing.yaml` and computes weight distributions between local and remote deployments. Currently supports a fixed-ratio strategy with plans for expansion.

### Execution Layer (`routing/routers.py`)

Performs weighted random selection based on computed weights and provides automatic fallback to alternative adapters on request failure. The concrete `FixedRouter` lives in `routing/routers.py`; `routing/executor.py` is a backward-compatibility shim that re-exports `FixedRouter` as `RouteExecutor` along with `ProviderPinError` and `RouteConfig`.

## Features

- **Fixed-ratio routing**: Configurable traffic split between local and remote deployments
- **Health monitoring** (optional): Simple health checks with automatic weight adjustment
- **Automatic fallback**: Seamless failover when primary adapter fails
- **Environment variable support**: Configuration with `${VAR}` and `${VAR:-default}` syntax

## Configuration

See the [Configuration guide](configuration.md) for detailed options and examples.
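The `${VAR}` and `${VAR:-default}` syntax listed under Features follows the familiar shell convention. The sketch below shows one way such references could be expanded in a config string. It is not the project's loader: `expand_env` is a hypothetical helper, and leaving an unset variable without a default untouched is an assumption made here for illustration.

```python
import os
import re
from typing import Mapping, Optional

# Matches ${VAR} and ${VAR:-default}; group 1 is the name, group 2 the optional default.
_ENV_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def expand_env(text: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Expand ${VAR} and ${VAR:-default} references in a config string (hypothetical helper)."""
    env = os.environ if env is None else env

    def _replace(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        value = env.get(name)
        if default is not None:
            # Shell-style ":-": fall back when the variable is unset OR empty.
            return value if value else default
        if value is None:
            # Assumption: an unset variable with no default is left as written.
            return match.group(0)
        return value

    return _ENV_PATTERN.sub(_replace, text)


if __name__ == "__main__":
    line = "endpoint: ${LOCAL_DEPLOYMENT_URL:-http://localhost:8000}"
    print(expand_env(line, env={}))
    # → endpoint: http://localhost:8000
```

Passing `env` explicitly keeps the helper easy to test; in normal use it falls back to `os.environ`.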
### Required Files

- `config/models.yaml`: Registers available models and adapters

### Optional Files

- `config/routing.yaml`: Configures local/remote deployment split and health checking

### Example Configuration (60/40 split):

```yaml
routing_strategy: fixed
routing_parameter:
  local_fraction: 0.6
timeout: 2
health_check: 30
logging:
  output: output.log
local_deployment:
  - endpoint: ${LOCAL_DEPLOYMENT_URL:-http://localhost:8000}
    models:
remote_deployment:
  models:
```

## Running the Server

```bash
# Development: run FastAPI app with routing enabled
uvicorn serving.servers.app:app --host 0.0.0.0 --port 8080

# Or use a custom port for local development.
# Export PORT first: in `PORT=9000 uvicorn ... --port $PORT` the shell expands
# $PORT before the assignment takes effect, so the flag would be empty.
export PORT=9000
uvicorn serving.servers.app:app --host 0.0.0.0 --port "$PORT"
```

When the application starts, `serving.servers.bootstrap` loads `config/models.yaml` and optionally `config/routing.yaml`. If `routing.yaml` is present, the `RoutingManager` applies the configured weights; otherwise the default weights from `models.yaml` are used.

## API Endpoints

- `GET /v1/models` - List available models
- `POST /v1/chat/completions` - Chat completion with automatic routing
- `GET /routing` - View current routing configuration and weights
- `GET /health` - Health check endpoint

## Extending the System

### Adding New Strategies

1. Create a new strategy class in `routing/strategies/weight.py`:

   ```python
   from typing import Dict, List


   class RoundRobinStrategy:
       def assign(self, local: List, remote: List) -> Dict[object, float]:
           # Implementation: return a weight for each deployment
           raise NotImplementedError
   ```

2. Update `routing/manager.py` to use the new strategy based on the `routing_strategy` config value.

### RouteWise Strategy

In addition to the deployment-wide fixed-ratio strategy in `routing.yaml`, a cost-aware `routewise` strategy is available as a per-model opt-in, enabled by adding `router: routewise` to a model entry in `config/models.yaml`.
Each opted-in model gets its own `RouteWiseRouter` instance, constructed by `routing/model_router_registry.py` from that model's `router_params:` block (validated by the `RouteWiseParams` schema in `routing/strategies/routewise.py`). Configuration splits by ownership:

- **`router_params:` (per model)**: algorithm knobs only: `budget_alpha` (LP cost budget), `latency_hedge_mode`, latency SLO/window, envelope percentiles/window/min-samples, output predictor, prefix-cache flag.
- **Route entries (per provider)**: resource semantics: `provider_type: on_demand | quota | concurrency`, `pricing:`, `quota: {limit}` plus a required `quota_source:` block (the provider usage API is the quota truth source, including window/reset semantics), `concurrency: {limit}`, and optional `quota_pool:` / `concurrency_pool:` ids for routes that share a subscription.

Quota-bearing models refuse to start until the cost envelope is calibrated from recent `api_logs` traffic (see `envelope_min_samples`); a cold deploy with no history fails fast by design.

The reference block in `config/models.yaml` (under `minimax-fast`) lists every option and is completeness-tested against the `RouteWiseConfig` dataclass. Design specs live under `docs/agents/specs/`.

### Health Monitoring

Health checks are optional and can be enabled by setting `health_check > 0` in the configuration. The system performs simple GET requests to `/health` endpoints and adjusts weights accordingly.

## Session affinity

`FixedRouter` keeps a per-(user, model) pin to the last-selected provider for five minutes (sliding TTL). Goals:

- Keep one conversation on one backend so prompt caches stay warm and latency stays consistent.
- Drop the pin the moment that backend errors, so users don't get stuck on a failing provider.

**Affinity key:**

- Authenticated requests: the user's `auth_key_hash`.
- Anonymous requests: `f"ip:{client_ip}"`.

**Pin lifecycle:**

1. First request from `(key, model)` → weighted random pick → entry stored.
2. Subsequent requests within 300 s on the same `(key, model)` reuse the same endpoint and refresh the TTL.
3. Any exception from the pinned provider drops the entry; fallback runs; the next request creates a fresh pin.
4. If the pinned endpoint is no longer in the allowed pool (weight-0 in `routing.yaml` or its circuit is open), the entry is dropped and a fresh weighted-random pick runs.
5. After 300 s of inactivity the entry expires.

**Scope and limits:**

- State is in-process. Each Uvicorn worker tracks its own table. Same as `key_pool.py`.
- Affinity does not survive a restart.
- `pin_provider` (admin override via `X-Route-Pin`) bypasses affinity.

**Kill switch:** set `ROUTING_AFFINITY_ENABLED=0` to disable.

**Metrics:** `routing_affinity_events_total{event,model}` with events `hit | miss | created | expired | dropped_error | dropped_unavailable`.

## Migration Notes

For users migrating from older versions:

- The old `deployment.example.yaml` format is deprecated
- Use the simplified `config/routing.yaml` structure shown above
- Legacy `RoutingStrategy/select_deployment` patterns have been replaced with the current `FixedRatioStrategy.assign()` approach
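
For readers moving off the legacy `select_deployment` pattern, the following is a minimal sketch of what a fixed-ratio `assign()` could look like. It assumes the `assign(local, remote) -> Dict[object, float]` signature shown under "Adding New Strategies" and the `local_fraction` parameter from the example configuration. The constructor, the even split within each side, and the handling of an empty side are assumptions for illustration; the real class in `routing/strategies/weight.py` may differ.

```python
import random
from typing import Dict, List


class FixedRatioStrategy:
    """Illustrative sketch only, not the project's implementation."""

    def __init__(self, local_fraction: float) -> None:
        if not 0.0 <= local_fraction <= 1.0:
            raise ValueError("local_fraction must be between 0 and 1")
        self.local_fraction = local_fraction

    def assign(self, local: List, remote: List) -> Dict[object, float]:
        # Assumption: if one side has no deployments, the other takes all traffic.
        local_share = self.local_fraction
        if not local:
            local_share = 0.0
        elif not remote:
            local_share = 1.0

        weights: Dict[object, float] = {}
        # Assumption: each side's share is split evenly among its deployments.
        for deployment in local:
            weights[deployment] = local_share / len(local)
        for deployment in remote:
            weights[deployment] = (1.0 - local_share) / len(remote)
        return weights


if __name__ == "__main__":
    # Deployments are plain strings here; any hashable object works as a key.
    weights = FixedRatioStrategy(0.6).assign(["local-a", "local-b"], ["remote-a"])
    # The execution layer's weighted random selection could then be as simple as:
    choice = random.choices(list(weights), weights=list(weights.values()), k=1)[0]
    print(weights, choice)
```

Returning a plain mapping of deployment to weight keeps the decision layer independent of how the execution layer samples from it.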