Adding a New Model (OpenRouter-Compatible)

This guide explains how to add support for new LLM models and providers to the hybridInference gateway while keeping full OpenRouter/OpenAI API compatibility.

Reference PR for provider integration example: https://github.com/HarvardSys/hybridInference/pull/34

Overview

This guide covers both scenarios. Depending on your case, follow one of:

  1. Use an existing provider adapter (vLLM, DeepSeek, Gemini, Llama, Zhipu) — only YAML + env changes.

  2. Integrate a new provider — add an adapter class + small registration changes, then YAML + env.

Quick Start

Adding a Model with an Existing Provider

If the provider is already supported (vLLM, DeepSeek, Gemini, Llama, Zhipu), you only need to add configuration.

Subscription providers (Claude, Codex) use OAuth account pools instead of API keys. See developer/configuration.md, section Subscription Adapters (Claude / Codex), for setup instructions. The rest of this guide covers API-key-based providers.

  1. Add model configuration in config/models.yaml:

models:
  - id: your-model-id
    name: Your Model Display Name
    provider: existing_provider  # e.g., "gemini", "deepseek"
    provider_model_id: "actual-provider-model-id"
    base_url: ${PROVIDER_BASE_URL}
    api_key: ${PROVIDER_API_KEY}
    quantization: "bf16"
    input_modalities: ["text"]
    output_modalities: ["text"]
    context_length: 8192
    max_output_length: 4096
    supports_tools: true
    supports_structured_output: true
    supported_params: [temperature, top_p, max_tokens, stop]
    pricing:
      prompt: "0"
      completion: "0"
      image: "0"
      request: "0"
      input_cache_reads: "0"
      input_cache_writes: "0"
    route:
      - kind: existing_provider
        weight: 1.0
        base_url: ${PROVIDER_BASE_URL}
        api_key: ${PROVIDER_API_KEY}
  2. Configure environment variables in .env:

PROVIDER_BASE_URL=https://api.provider.com/v1
PROVIDER_API_KEY=your-api-key
  3. Restart the server to load the new model.

  4. Verify:

curl http://localhost:8080/v1/models | jq
curl -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"your-model-id","messages":[{"role":"user","content":"Hello"}]}' | jq

Note on aliases: If you want the model to be reachable under an additional name (e.g., an OpenRouter-style slug or a local vLLM model path), add it to aliases so clients can call either name, as shown below.
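
For example, extend the model entry like this (the alias values below are illustrative):

models:
  - id: your-model-id
    # ... other fields as in the example above
    aliases:
      - "your-org/your-model-id"   # OpenRouter-style slug (illustrative)
      - "/models/your-model-id"    # local vLLM model path (illustrative)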

OFFLOAD behavior: When OFFLOAD=1, the service will remove all local adapters whose base_url matches LOCAL_BASE_URL and only use remote adapters. See “Hybrid Routing & OFFLOAD” below.

Adding a New Provider

If you need to integrate a completely new provider, follow these steps:

Step 1: Create Provider Adapter

Create a new file in serving/adapters/ (e.g., serving/adapters/your_provider.py):

import json
from collections.abc import AsyncGenerator
from typing import Any

from serving.stream import done_sentinel, make_final_usage_chunk
from serving.utils.tokens import estimate_prompt_tokens, estimate_text_tokens
from .base import BaseAdapter, UsageInfo


class YourProviderAdapter(BaseAdapter):
    """Adapter for YourProvider API.

    This adapter translates OpenAI-compatible requests to YourProvider's
    API format and normalizes responses back to OpenAI format.
    """

    async def chat_completion(
        self, messages: list[dict[str, Any]], **params
    ) -> dict[str, Any]:
        """Execute a non-streaming chat completion request.

        Args:
            messages: List of chat messages in OpenAI format.
            **params: Additional parameters (temperature, max_tokens, etc.).

        Returns:
            OpenAI-compatible response dictionary.
        """
        # Validate and filter parameters
        validated_params = self.validate_params(params)

        # Build provider-specific request payload
        payload = {
            "model": self.config.provider_model_id or self.config.id,
            "messages": messages,
            **validated_params,
        }

        # Add optional features
        if params.get("tools"):
            payload["tools"] = params["tools"]

        if params.get("response_format", {}).get("type") == "json_object":
            payload["response_format"] = {"type": "json_object"}

        # Set up authentication headers
        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {self.config.api_key}",
        }

        # Make API request
        data = await self.http.json_post_with_retry(
            f"{self.config.base_url}/chat/completions",
            json=payload,
            headers=headers,
        )

        # Extract usage information
        usage = UsageInfo(
            prompt_tokens=data.get("usage", {}).get("prompt_tokens", 0),
            completion_tokens=data.get("usage", {}).get("completion_tokens", 0),
            total_tokens=data.get("usage", {}).get("total_tokens", 0),
        )

        # Fallback to estimation if provider doesn't return usage
        if usage.total_tokens == 0:
            content = data["choices"][0]["message"].get("content", "")
            prompt_tokens = estimate_prompt_tokens(messages)
            completion_tokens = estimate_text_tokens(content)
            usage = UsageInfo(
                prompt_tokens=int(prompt_tokens),
                completion_tokens=int(completion_tokens),
                total_tokens=int(prompt_tokens + completion_tokens),
            )

        # Extract tool calls if present
        tool_calls = None
        if "tool_calls" in data["choices"][0]["message"]:
            tool_calls = data["choices"][0]["message"]["tool_calls"]

        # Return normalized response
        return self.format_response(
            content=data["choices"][0]["message"].get("content", ""),
            model=self.config.id,
            usage=usage,
            tool_calls=tool_calls,
            finish_reason=data["choices"][0].get("finish_reason", "stop"),
        )

    async def stream_chat_completion(
        self, messages: list[dict[str, Any]], **params
    ) -> AsyncGenerator[str, None]:
        """Execute a streaming chat completion request.

        Args:
            messages: List of chat messages in OpenAI format.
            **params: Additional parameters.

        Yields:
            Server-sent event formatted strings.
        """
        validated_params = self.validate_params(params)

        payload = {
            "model": self.config.provider_model_id or self.config.id,
            "messages": messages,
            "stream": True,
            **validated_params,
        }

        if params.get("tools"):
            payload["tools"] = params["tools"]

        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {self.config.api_key}",
        }

        total_content = ""
        prompt_tokens = 0

        async for line in self.http.stream_post(
            f"{self.config.base_url}/chat/completions",
            json=payload,
            headers=headers,
        ):
            if not line.startswith("data: "):
                continue

            if line == "data: [DONE]":
                # Emit final usage chunk using shared helper for consistency
                yield make_final_usage_chunk(
                    model=self.config.id,
                    messages=messages,
                    total_content=total_content,
                    prompt_tokens_override=prompt_tokens or None,
                    finish_reason="stop",
                )
                yield done_sentinel()
                break

            try:
                chunk_data = json.loads(line[6:])

                # Extract usage if available (some providers send a usage-only chunk)
                if chunk_data.get("usage"):
                    prompt_tokens = chunk_data["usage"].get("prompt_tokens", prompt_tokens)

                # Extract and yield content delta; guard against chunks with no choices
                choices = chunk_data.get("choices") or []
                if choices and choices[0].get("delta", {}).get("content"):
                    content = choices[0]["delta"]["content"]
                    total_content += content
                    yield self.format_stream_chunk(content, self.config.id)
            except json.JSONDecodeError:
                continue

Step 2: Register the Adapter

  1. Update serving/adapters/__init__.py:

from .your_provider import YourProviderAdapter

__all__ = [
    # ... existing exports
    "YourProviderAdapter",
]
  2. Update serving/servers/registry.py:

Add the import at the top:

from serving.adapters import (
    # ... existing imports
    YourProviderAdapter,
)

Add a branch in the _make_adapter function:

def _make_adapter(kind: str, cfg: dict[str, Any]):
    """Construct a provider adapter from a kind string and model config."""
    model_cfg = ModelConfig(**cfg)
    # ... existing conditions
    if kind == "your_provider":
        return YourProviderAdapter(model_cfg)
    raise ValueError(f"Unknown adapter kind: {kind}")

Step 3: Add Model Configuration

Add your model to config/models.yaml:

models:
  - id: your-model-id
    name: Your Model Name
    provider: your_provider
    provider_model_id: "actual-model-id"
    base_url: ${YOUR_PROVIDER_BASE_URL}
    api_key: ${YOUR_PROVIDER_API_KEY}
    quantization: "bf16"
    input_modalities: ["text"]
    output_modalities: ["text"]
    context_length: 8192
    max_output_length: 4096
    supports_tools: true
    supports_structured_output: true
    supported_params: [temperature, top_p, max_tokens, stop]
    aliases: []  # Optional alternative names
    pricing:
      prompt: "0"
      completion: "0"
      image: "0"
      request: "0"
      input_cache_reads: "0"
      input_cache_writes: "0"
    route:
      - kind: your_provider
        weight: 1.0
        base_url: ${YOUR_PROVIDER_BASE_URL}
        api_key: ${YOUR_PROVIDER_API_KEY}

Step 4: Configure Environment Variables

Add to .env:

YOUR_PROVIDER_BASE_URL=https://api.yourprovider.com/v1
YOUR_PROVIDER_API_KEY=your-api-key-here

Step 5: Test the Integration

# Start the server (Docker)
make build s=backend
# Or locally without Docker:
# uvicorn serving.servers.app:app --host 0.0.0.0 --port 8080

# List available models
curl http://localhost:8080/v1/models

# Test chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-id",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Streaming test:

curl -N -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-id",
    "messages": [{"role": "user", "content": "Stream test"}],
    "stream": true,
    "max_tokens": 64
  }'

Configuration Reference

ModelConfig Fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| id | string | Yes | Unique model identifier |
| name | string | Yes | Display name |
| provider | string | Yes | Provider/adapter kind |
| base_url | string | Yes | API endpoint base URL |
| api_key | string | No | API authentication key |
| provider_model_id | string | No | Provider's model identifier (overrides id) |
| aliases | list[string] | No | Alternative names for routing |
| quantization | string | No | Quantization format (default: "bf16") |
| input_modalities | list[string] | No | Input types: "text", "image" |
| output_modalities | list[string] | No | Output types: "text" |
| context_length | int | No | Maximum context window (default: 8192) |
| max_output_length | int | No | Maximum output tokens (default: 4096) |
| supports_tools | bool | No | Function calling support (default: false) |
| supports_structured_output | bool | No | JSON mode support (default: false) |
| supported_params | list[string] | No | Allowed parameter names |
| pricing | dict | No | Cost information per token/request |

Route Configuration

Routes allow multiple endpoints for a single model with weighted distribution:

route:
  # Local vLLM deployment
  - kind: vllm
    weight: 0.7  # 70% of traffic
    base_url: http://localhost:8000
    provider_model_id: "/models/local-model"

  # Remote API fallback
  - kind: your_provider
    weight: 0.3  # 30% of traffic
    base_url: https://api.provider.com
    api_key: ${API_KEY}

Hybrid Routing & OFFLOAD

  • Weighted routes are applied at registration time. You can further adjust weights or override distribution centrally using config/routing.yaml (loaded by the RoutingManager).

  • If OFFLOAD=1, the bootstrap process removes any adapter whose base_url matches LOCAL_BASE_URL, effectively forcing traffic to remote providers only.

This lets you flip from hybrid to remote-only during incidents without editing YAML.
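
For example, a minimal .env change to force remote-only routing (the LOCAL_BASE_URL value is illustrative):

OFFLOAD=1
LOCAL_BASE_URL=http://localhost:8000

Restart the server for the change to take effect.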

BaseAdapter API Reference

All adapters must inherit from BaseAdapter and implement:

Required Methods

async def chat_completion(
    self, messages: list[dict[str, Any]], **params
) -> dict[str, Any]:
    """Execute non-streaming chat completion."""
    pass

async def stream_chat_completion(
    self, messages: list[dict[str, Any]], **params
) -> AsyncGenerator[str, None]:
    """Execute streaming chat completion."""
    pass

Utility Methods

def validate_params(self, params: dict[str, Any]) -> dict[str, Any]:
    """Validate and clamp parameters to supported ranges."""

def format_response(
    self,
    content: str,
    model: str,
    usage: UsageInfo | None = None,
    tool_calls: list[dict] | None = None,
    finish_reason: str = "stop",
) -> dict[str, Any]:
    """Format response in OpenAI-compatible format."""

def format_stream_chunk(
    self, content: str, model: str, finish_reason: str | None = None
) -> str:
    """Format SSE chunk for streaming responses."""

Available Attributes

self.config       # ModelConfig instance
self.http         # AsyncHTTPClient for API requests

Advanced Features

Multi-Modal Support

For models supporting images:

input_modalities: ["text", "image"]

Implement image handling in your adapter’s chat_completion method.
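
A minimal sketch of such handling, assuming the provider accepts OpenAI-style content parts (text and image_url); the helper name is hypothetical and the pass-through behavior is an assumption, not the repo's implementation:

def _prepare_messages(self, messages: list[dict]) -> list[dict]:
    """Forward multi-part content (text + image_url) when the provider supports it."""
    prepared = []
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):
            # Keep only part types this adapter knows how to forward.
            parts = [p for p in content if p.get("type") in ("text", "image_url")]
            prepared.append({**msg, "content": parts})
        else:
            prepared.append(msg)
    return prepared

Call it on messages before building the payload in chat_completion.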

Tool/Function Calling

For models supporting function calls:

supports_tools: true

Parse and include tool_calls in the response:

tool_calls = []
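# Convert a provider-specific `function_call` into an OpenAI-style tool call.
# Note: requires `import time` at the top of the adapter module for the synthesized id.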
if "function_call" in data:
    tool_calls.append({
        "id": f"call_{int(time.time() * 1000)}",
        "type": "function",
        "function": {
            "name": data["function_call"]["name"],
            "arguments": data["function_call"]["arguments"],
        },
    })

return self.format_response(
    content=content,
    model=self.config.id,
    usage=usage,
    tool_calls=tool_calls,
)

Structured Output (JSON Mode)

For models supporting JSON schema:

supports_structured_output: true

Handle response_format parameter:

if params.get("response_format", {}).get("type") == "json_object":
    payload["response_format"] = {"type": "json_object"}

Rate Limiting (Optional)

If the provider has known token policies and you want server-side fairness controls, add a limiter configuration in serving/servers/bootstrap.py alongside existing examples (Gemini/DeepSeek/Zhipu). This enables per-model queues, burst control, and persistent counters.

Examples

Example 1: OpenAI-Compatible Provider

See serving/adapters/deepseek.py for a simple OpenAI-compatible implementation.

Example 2: Custom API Format

See serving/adapters/gemini.py for handling non-standard API formats with message conversion.

Example 3: Local Deployment

See serving/adapters/vllm.py for integrating local inference servers.

Troubleshooting

Model Not Appearing in /v1/models

  • Check config/models.yaml syntax (a quick check is shown after this list)

  • Verify environment variables are set

  • Check server logs for configuration errors

  • If using aliases, verify the canonical id appears exactly once and aliases do not collide with other model IDs.
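
A quick syntax check before restarting (assumes PyYAML is available in your environment):

python -c "import yaml; yaml.safe_load(open('config/models.yaml')); print('OK')"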

Authentication Failures

  • Verify the API key in .env (a direct provider check is shown after this list)

  • Check if ${ENV_VAR} expansion is working

  • Ensure base_url is correct
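
If the provider exposes an OpenAI-compatible /models endpoint (an assumption; adjust the path for your provider), you can test the key directly:

curl -s -H "Authorization: Bearer $PROVIDER_API_KEY" "$PROVIDER_BASE_URL/models" | jq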

Response Format Errors

  • Ensure format_response() returns OpenAI-compatible structure

  • Validate UsageInfo fields are integers

  • Check finish_reason is valid: "stop", "length", "content_filter"

  • For streaming, ensure the first non-empty content chunk is emitted as soon as available so TTFT metrics record properly.

Streaming Issues

  • Ensure chunks are SSE-formatted: data: {json}\n\n (see the example after this list)

  • Send final usage chunk before data: [DONE]

  • Handle JSON parsing errors gracefully
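
For reference, a well-formed stream looks roughly like this (values are illustrative; the chunk shape follows the OpenAI chat.completion.chunk format):

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","model":"your-model-id","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","model":"your-model-id","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":5,"completion_tokens":12,"total_tokens":17}}

data: [DONE]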

Best Practices

  1. Error handling: Use self.http.json_post_with_retry() and handle provider faults gracefully with useful messages.

  2. Usage accounting: Prefer provider usage when available; otherwise fall back to estimate_prompt_tokens()/estimate_text_tokens().

  3. Streaming helpers: Use format_stream_chunk(), make_final_usage_chunk(), and done_sentinel() for consistent SSE.

  4. Type safety: Provide full type hints and keep request/response shapes aligned with serving/schemas.py.

  5. Testing: Exercise both streaming and non-streaming paths, and try large prompts to validate token clamping (a smoke-test sketch follows this list).

  6. Docs & style: Keep adapter docstrings and comments in English (Google style). Avoid provider-specific logic in shared code.

  7. Env expansion: Use ${ENV_VAR} in YAML instead of hardcoding secrets or endpoints; let dotenv load .env.
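
A minimal smoke-test sketch, assuming the gateway is running locally on port 8080 and httpx is installed (the model id and assertions are placeholders, not part of the repo's test suite):

import httpx

BASE = "http://localhost:8080/v1"
MODEL = "your-model-id"  # placeholder


def test_chat_completion():
    resp = httpx.post(
        f"{BASE}/chat/completions",
        json={"model": MODEL, "messages": [{"role": "user", "content": "Hello"}]},
        timeout=60,
    )
    resp.raise_for_status()
    body = resp.json()
    assert body["choices"][0]["message"]["content"]
    assert body["usage"]["total_tokens"] > 0


def test_streaming():
    with httpx.stream(
        "POST",
        f"{BASE}/chat/completions",
        json={"model": MODEL, "messages": [{"role": "user", "content": "Stream"}], "stream": True},
        timeout=60,
    ) as resp:
        resp.raise_for_status()
        lines = [line for line in resp.iter_lines() if line.startswith("data: ")]
    # The stream must terminate with the sentinel after the final usage chunk.
    assert lines[-1] == "data: [DONE]"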

See Also