Adding a New Model (OpenRouter-Compatible)

This guide explains how to add support for new LLM models and providers to the hybridInference gateway while keeping full OpenRouter/OpenAI API compatibility.

Reference PR for provider integration example: https://github.com/HarvardSys/hybridInference/pull/34

Overview

This guide covers both scenarios. Depending on your case, follow one of:

  1. Use an existing provider adapter (vLLM, DeepSeek, Gemini, Llama, Zhipu) — only YAML + env changes.

  2. Integrate a new provider — add an adapter class + small registration changes, then YAML + env.

Quick Start

Adding a Model with an Existing Provider

If the provider is already supported (vLLM, DeepSeek, Gemini, Llama, Zhipu), you only need to add configuration.

Subscription providers (Claude, Codex) use OAuth account pools instead of API keys. See developer/configuration.md, section Subscription Adapters (Claude / Codex), for setup instructions. The rest of this guide covers API-key-based providers.

  1. Add model configuration in config/models.yaml:

models:
  - id: your-model-id
    name: Your Model Display Name
    provider: existing_provider  # e.g., "gemini", "deepseek"
    provider_model_id: "actual-provider-model-id"
    base_url: ${PROVIDER_BASE_URL}
    api_key: ${PROVIDER_API_KEY}
    quantization: "bf16"
    input_modalities: ["text"]
    output_modalities: ["text"]
    context_length: 8192
    max_output_length: 4096
    supports_tools: true
    supports_structured_output: true
    supported_params: [temperature, top_p, max_tokens, stop]
    pricing:
      prompt: "0"
      completion: "0"
      image: "0"
      request: "0"
      input_cache_reads: "0"
      input_cache_writes: "0"
    route:
      - kind: existing_provider
        weight: 1.0
        base_url: ${PROVIDER_BASE_URL}
        api_key: ${PROVIDER_API_KEY}
  2. Configure environment variables in .env:

PROVIDER_BASE_URL=https://api.provider.com/v1
PROVIDER_API_KEY=your-api-key
  3. Restart the server to load the new model.

  4. Verify:

curl http://localhost:8080/v1/models | jq
curl -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"your-model-id","messages":[{"role":"user","content":"Hello"}]}' | jq

Note on aliases: If you want the model to be reachable under an additional name (e.g., an OpenRouter-style slug or a local vLLM model path), add it to aliases so clients can call either name, as shown below.
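
For example, extend the model entry like this (the alias values below are illustrative):

models:
  - id: your-model-id
    # ... other fields as in the example above
    aliases:
      - "your-org/your-model-id"   # OpenRouter-style slug (illustrative)
      - "/models/your-model-id"    # local vLLM model path (illustrative)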

OFFLOAD behavior: When OFFLOAD=1, the service will remove all local adapters whose base_url matches LOCAL_BASE_URL and only use remote adapters. See “Hybrid Routing & OFFLOAD” below.

Adding a New Provider

If you need to integrate a completely new provider, follow these steps:

Step 1: Create Provider Adapter

Create a new file in serving/adapters/ (e.g., serving/adapters/your_provider.py):

import json
from collections.abc import AsyncGenerator
from typing import Any

from serving.stream import done_sentinel, make_final_usage_chunk
from serving.utils.tokens import estimate_prompt_tokens, estimate_text_tokens
from .base import BaseAdapter, UsageInfo


class YourProviderAdapter(BaseAdapter):
    """Adapter for YourProvider API.

    This adapter translates OpenAI-compatible requests to YourProvider's
    API format and normalizes responses back to OpenAI format.
    """

    async def chat_completion(
        self, messages: list[dict[str, Any]], **params
    ) -> dict[str, Any]:
        """Execute a non-streaming chat completion request.

        Args:
            messages: List of chat messages in OpenAI format.
            **params: Additional parameters (temperature, max_tokens, etc.).

        Returns:
            OpenAI-compatible response dictionary.
        """
        # Validate and filter parameters
        validated_params = self.validate_params(params)

        # Build provider-specific request payload
        payload = {
            "model": self.config.provider_model_id or self.config.id,
            "messages": messages,
            **validated_params,
        }

        # Add optional features
        if params.get("tools"):
            payload["tools"] = params["tools"]

        if params.get("response_format", {}).get("type") == "json_object":
            payload["response_format"] = {"type": "json_object"}

        # Set up authentication headers
        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {self.config.api_key}",
        }

        # Make API request
        data = await self.http.json_post_with_retry(
            f"{self.config.base_url}/chat/completions",
            json=payload,
            headers=headers,
        )

        # Extract usage information
        usage = UsageInfo(
            prompt_tokens=data.get("usage", {}).get("prompt_tokens", 0),
            completion_tokens=data.get("usage", {}).get("completion_tokens", 0),
            total_tokens=data.get("usage", {}).get("total_tokens", 0),
        )

        # Fallback to estimation if provider doesn't return usage
        if usage.total_tokens == 0:
            content = data["choices"][0]["message"].get("content", "")
            prompt_tokens = estimate_prompt_tokens(messages)
            completion_tokens = estimate_text_tokens(content)
            usage = UsageInfo(
                prompt_tokens=int(prompt_tokens),
                completion_tokens=int(completion_tokens),
                total_tokens=int(prompt_tokens + completion_tokens),
            )

        # Extract tool calls if present
        tool_calls = None
        if "tool_calls" in data["choices"][0]["message"]:
            tool_calls = data["choices"][0]["message"]["tool_calls"]

        # Return normalized response
        return self.format_response(
            content=data["choices"][0]["message"].get("content", ""),
            model=self.config.id,
            usage=usage,
            tool_calls=tool_calls,
            finish_reason=data["choices"][0].get("finish_reason", "stop"),
        )

    async def stream_chat_completion(
        self, messages: list[dict[str, Any]], **params
    ) -> AsyncGenerator[str, None]:
        """Execute a streaming chat completion request.

        Args:
            messages: List of chat messages in OpenAI format.
            **params: Additional parameters.

        Yields:
            Server-sent event formatted strings.
        """
        validated_params = self.validate_params(params)

        payload = {
            "model": self.config.provider_model_id or self.config.id,
            "messages": messages,
            "stream": True,
            **validated_params,
        }

        if params.get("tools"):
            payload["tools"] = params["tools"]

        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {self.config.api_key}",
        }

        total_content = ""
        prompt_tokens = 0

        async for line in self.http.stream_post(
            f"{self.config.base_url}/chat/completions",
            json=payload,
            headers=headers,
        ):
            if not line.startswith("data: "):
                continue

            if line == "data: [DONE]":
                # Emit final usage chunk using shared helper for consistency
                yield make_final_usage_chunk(
                    model=self.config.id,
                    messages=messages,
                    total_content=total_content,
                    prompt_tokens_override=prompt_tokens or None,
                    finish_reason="stop",
                )
                yield done_sentinel()
                break

            try:
                chunk_data = json.loads(line[6:])

                # Extract usage if available (some providers send a usage-only chunk)
                if chunk_data.get("usage"):
                    prompt_tokens = chunk_data["usage"].get("prompt_tokens", prompt_tokens)

                # Extract and yield content delta; guard against chunks with no choices
                choices = chunk_data.get("choices") or []
                if choices and choices[0].get("delta", {}).get("content"):
                    content = choices[0]["delta"]["content"]
                    total_content += content
                    yield self.format_stream_chunk(content, self.config.id)
            except json.JSONDecodeError:
                continue

Step 2: Register the Adapter

  1. Update serving/adapters/__init__.py:

from .your_provider import YourProviderAdapter

__all__ = [
    # ... existing exports
    "YourProviderAdapter",
]
  2. Update serving/servers/registry.py:

Add the import at the top:

from serving.adapters import (
    # ... existing imports
    YourProviderAdapter,
)

Add a branch in the _make_adapter function:

def _make_adapter(kind: str, cfg: dict[str, Any]):
    """Construct a provider adapter from a kind string and model config."""
    model_cfg = ModelConfig(**cfg)
    # ... existing conditions
    if kind == "your_provider":
        return YourProviderAdapter(model_cfg)
    raise ValueError(f"Unknown adapter kind: {kind}")

Step 3: Add Model Configuration

Add your model to config/models.yaml:

models:
  - id: your-model-id
    name: Your Model Name
    provider: your_provider
    provider_model_id: "actual-model-id"
    base_url: ${YOUR_PROVIDER_BASE_URL}
    api_key: ${YOUR_PROVIDER_API_KEY}
    quantization: "bf16"
    input_modalities: ["text"]
    output_modalities: ["text"]
    context_length: 8192
    max_output_length: 4096
    supports_tools: true
    supports_structured_output: true
    supported_params: [temperature, top_p, max_tokens, stop]
    aliases: []  # Optional alternative names
    pricing:
      prompt: "0"
      completion: "0"
      image: "0"
      request: "0"
      input_cache_reads: "0"
      input_cache_writes: "0"
    route:
      - kind: your_provider
        weight: 1.0
        base_url: ${YOUR_PROVIDER_BASE_URL}
        api_key: ${YOUR_PROVIDER_API_KEY}

Step 4: Configure Environment Variables

Add to .env:

YOUR_PROVIDER_BASE_URL=https://api.yourprovider.com/v1
YOUR_PROVIDER_API_KEY=your-api-key-here

Step 5: Test the Integration

# Start the server (Docker)
make build s=backend
# Or locally without Docker:
# uvicorn serving.servers.app:app --host 0.0.0.0 --port 8080

# List available models
curl http://localhost:8080/v1/models

# Test chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-id",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Streaming test:

curl -N -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-id",
    "messages": [{"role": "user", "content": "Stream test"}],
    "stream": true,
    "max_tokens": 64
  }'

Configuration Reference

ModelConfig Fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| id | string | Yes | Unique model identifier |
| name | string | Yes | Display name |
| provider | string | Yes | Provider/adapter kind |
| base_url | string | Yes | API endpoint base URL |
| api_key | string | No | API authentication key |
| provider_model_id | string | No | Provider's model identifier (overrides id) |
| aliases | list[string] | No | Alternative names for routing |
| quantization | string | No | Quantization format (default: "bf16") |
| input_modalities | list[string] | No | Input types: "text", "image" |
| output_modalities | list[string] | No | Output types: "text" |
| context_length | int | No | Maximum context window (default: 8192) |
| max_output_length | int | No | Maximum output tokens (default: 4096) |
| supports_tools | bool | No | Function calling support (default: false) |
| supports_structured_output | bool | No | JSON mode support (default: false) |
| supported_params | list[string] | No | Allowed parameter names |
| pricing | dict | No | Cost information per token/request |

Route Configuration

Routes allow multiple endpoints for a single model with weighted distribution:

route:
  # Local vLLM deployment
  - kind: vllm
    weight: 0.7  # 70% of traffic
    base_url: http://localhost:8000
    provider_model_id: "/models/local-model"

  # Remote API fallback
  - kind: your_provider
    weight: 0.3  # 30% of traffic
    base_url: https://api.provider.com
    api_key: ${API_KEY}

Hybrid Routing & OFFLOAD

  • Weighted routes are applied at registration time. You can further adjust weights or override distribution centrally using config/routing.yaml (loaded by the RoutingManager).

  • If OFFLOAD=1, the bootstrap process removes any adapter whose base_url matches LOCAL_BASE_URL, effectively forcing traffic to remote providers only.

This lets you flip from hybrid to remote-only during incidents without editing YAML.
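
For example, a minimal .env change to force remote-only routing (the LOCAL_BASE_URL value is illustrative):

OFFLOAD=1
LOCAL_BASE_URL=http://localhost:8000

Restart the server for the change to take effect.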

BaseAdapter API Reference

All adapters must inherit from BaseAdapter and implement:

Required Methods

async def chat_completion(
    self, messages: list[dict[str, Any]], **params
) -> dict[str, Any]:
    """Execute non-streaming chat completion."""
    pass

async def stream_chat_completion(
    self, messages: list[dict[str, Any]], **params
) -> AsyncGenerator[str, None]:
    """Execute streaming chat completion."""
    pass

Utility Methods

def validate_params(self, params: dict[str, Any]) -> dict[str, Any]:
    """Validate and clamp parameters to supported ranges."""

def format_response(
    self,
    content: str,
    model: str,
    usage: UsageInfo | None = None,
    tool_calls: list[dict] | None = None,
    finish_reason: str = "stop",
) -> dict[str, Any]:
    """Format response in OpenAI-compatible format."""

def format_stream_chunk(
    self, content: str, model: str, finish_reason: str | None = None
) -> str:
    """Format SSE chunk for streaming responses."""

Available Attributes

self.config       # ModelConfig instance
self.http         # AsyncHTTPClient for API requests

Advanced Features

Multi-Modal Support

For models supporting images:

input_modalities: ["text", "image"]

Implement image handling in your adapter’s chat_completion method.
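
A minimal sketch of such handling, assuming the provider accepts OpenAI-style content parts (text and image_url); the helper name is hypothetical and the pass-through behavior is an assumption, not the repo's implementation:

def _prepare_messages(self, messages: list[dict]) -> list[dict]:
    """Forward multi-part content (text + image_url) when the provider supports it."""
    prepared = []
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):
            # Keep only part types this adapter knows how to forward.
            parts = [p for p in content if p.get("type") in ("text", "image_url")]
            prepared.append({**msg, "content": parts})
        else:
            prepared.append(msg)
    return prepared

Call it on messages before building the payload in chat_completion.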

Tool/Function Calling

For models supporting function calls:

supports_tools: true

Parse and include tool_calls in the response:

tool_calls = []
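# Convert a provider-specific `function_call` into an OpenAI-style tool call.
# Note: requires `import time` at the top of the adapter module for the synthesized id.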
if "function_call" in data:
    tool_calls.append({
        "id": f"call_{int(time.time() * 1000)}",
        "type": "function",
        "function": {
            "name": data["function_call"]["name"],
            "arguments": data["function_call"]["arguments"],
        },
    })

return self.format_response(
    content=content,
    model=self.config.id,
    usage=usage,
    tool_calls=tool_calls,
)

Structured Output (JSON Mode)

For models supporting JSON schema:

supports_structured_output: true

Handle response_format parameter:

if params.get("response_format", {}).get("type") == "json_object":
    payload["response_format"] = {"type": "json_object"}

Rate Limiting (Optional)

If the provider has known token policies and you want server-side fairness controls, add a limiter configuration in serving/servers/bootstrap.py alongside existing examples (Gemini/DeepSeek/Zhipu). This enables per-model queues, burst control, and persistent counters.

Examples

Example 1: OpenAI-Compatible Provider

See serving/adapters/deepseek.py for a simple OpenAI-compatible implementation.

Example 2: Custom API Format

See serving/adapters/gemini.py for handling non-standard API formats with message conversion.

Example 3: Local Deployment

See serving/adapters/vllm.py for integrating local inference servers.

Troubleshooting

Model Not Appearing in /v1/models

  • Check config/models.yaml syntax (a quick check is shown after this list)

  • Verify environment variables are set

  • Check server logs for configuration errors

  • If using aliases, verify the canonical id appears exactly once and aliases do not collide with other model IDs.
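
A quick syntax check before restarting (assumes PyYAML is available in your environment):

python -c "import yaml; yaml.safe_load(open('config/models.yaml')); print('OK')"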

Authentication Failures

  • Verify the API key in .env (a direct provider check is shown after this list)

  • Check if ${ENV_VAR} expansion is working

  • Ensure base_url is correct
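
If the provider exposes an OpenAI-compatible /models endpoint (an assumption; adjust the path for your provider), you can test the key directly:

curl -s -H "Authorization: Bearer $PROVIDER_API_KEY" "$PROVIDER_BASE_URL/models" | jq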

Response Format Errors

  • Ensure format_response() returns OpenAI-compatible structure

  • Validate UsageInfo fields are integers

  • Check finish_reason is valid: "stop", "length", "content_filter"

  • For streaming, ensure the first non-empty content chunk is emitted as soon as available so TTFT metrics record properly.

Streaming Issues

  • Ensure chunks are SSE-formatted: data: {json}\n\n (see the example after this list)

  • Send final usage chunk before data: [DONE]

  • Handle JSON parsing errors gracefully
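
For reference, a well-formed stream looks roughly like this (values are illustrative; the chunk shape follows the OpenAI chat.completion.chunk format):

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","model":"your-model-id","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","model":"your-model-id","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":5,"completion_tokens":12,"total_tokens":17}}

data: [DONE]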

Best Practices

  1. Error handling: Use self.http.json_post_with_retry() and handle provider faults gracefully with useful messages.

  2. Usage accounting: Prefer provider usage when available; otherwise fall back to estimate_prompt_tokens()/estimate_text_tokens().

  3. Streaming helpers: Use format_stream_chunk(), make_final_usage_chunk(), and done_sentinel() for consistent SSE.

  4. Type safety: Provide full type hints and keep request/response shapes aligned with serving/schemas.py.

  5. Testing: Exercise both streaming and non-streaming paths, and try large prompts to validate token clamping (a smoke-test sketch follows this list).

  6. Docs & style: Keep adapter docstrings and comments in English (Google style). Avoid provider-specific logic in shared code.

  7. Env expansion: Use ${ENV_VAR} in YAML instead of hardcoding secrets or endpoints; let dotenv load .env.
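
A minimal smoke-test sketch, assuming the gateway is running locally on port 8080 and httpx is installed (the model id and assertions are placeholders, not part of the repo's test suite):

import httpx

BASE = "http://localhost:8080/v1"
MODEL = "your-model-id"  # placeholder


def test_chat_completion():
    resp = httpx.post(
        f"{BASE}/chat/completions",
        json={"model": MODEL, "messages": [{"role": "user", "content": "Hello"}]},
        timeout=60,
    )
    resp.raise_for_status()
    body = resp.json()
    assert body["choices"][0]["message"]["content"]
    assert body["usage"]["total_tokens"] > 0


def test_streaming():
    with httpx.stream(
        "POST",
        f"{BASE}/chat/completions",
        json={"model": MODEL, "messages": [{"role": "user", "content": "Stream"}], "stream": True},
        timeout=60,
    ) as resp:
        resp.raise_for_status()
        lines = [line for line in resp.iter_lines() if line.startswith("data: ")]
    # The stream must terminate with the sentinel after the final usage chunk.
    assert lines[-1] == "data: [DONE]"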

See Also