Adding a New Local Model
This guide explains how to register a self-hosted model behind the HybridInference
gateway. Use this when the model is already served by a local OpenAI-compatible
server such as vLLM, SGLang, Ollama, or a custom /v1/chat/completions service.
For new remote providers or custom adapters, see Adding a New Model.
Overview
Adding a local model has three parts:
1. Start the local inference server.
2. Add a config/models.yaml entry that points to that server.
3. Restart HybridInference and verify the public /v1 API.
The local server must expose OpenAI-compatible endpoints. The gateway forwards
chat requests to /v1/chat/completions and embedding requests to /v1/embeddings
when the model is registered as an embedding model.
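Concretely, the local server should answer on these OpenAI-compatible paths. /v1/models is only exercised by the verification steps in this guide, and /v1/embeddings applies only if you register the model for embeddings:
GET  /v1/models            # used by the curl checks below
POST /v1/chat/completions  # chat requests
POST /v1/embeddings        # only if registered as an embedding model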
Private Server (No Public Internet)
If your model runs on a different machine that is not exposed to the public internet, keep it private and let the gateway reach it over a trusted network path. Two common options are to point base_url at the server's private IP on an internal network, or to tunnel the server's port to the gateway host over SSH. Example route using a private IP:
route:
- kind: openai_compat
weight: 1.0
base_url: "http://10.0.12.34:8000/v1"
provider_model_id: "your-served-model-name"
Example SSH reverse tunnel (internal host -> FreeInference host):
# Run this on the INTERNAL model host
ssh -N -R 8001:127.0.0.1:8000 <user>@<freeinference-host>
Then set:
base_url: "http://127.0.0.1:8001/v1" # resolved on the FreeInference host
For reverse-tunnel setups, verify from the FreeInference side:
curl http://127.0.0.1:8001/v1/models | jq
Step 1: Start the Local Model Server
Start the model with your preferred serving runtime. Example with vLLM:
vllm serve Qwen/Qwen3.5-27B \
--host 0.0.0.0 \
--port 8007 \
--served-model-name Qwen3.5-27B
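The models.yaml entry in Step 2 uses kind: sglang; if you serve with SGLang instead of vLLM, the launch looks roughly like this (treat it as a sketch, since flag names can differ between SGLang versions):
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-27B \
  --host 0.0.0.0 \
  --port 8007 \
  --served-model-name Qwen3.5-27B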
Check that the local server responds before changing the gateway config:
curl http://localhost:8007/v1/models | jq
curl -s -X POST http://localhost:8007/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3.5-27B",
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 32
}' | jq
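If you intend to register the model for embeddings rather than chat, run the same kind of pre-check against the embeddings endpoint (the model name below is illustrative; use whatever your runtime actually serves):
curl -s -X POST http://localhost:8007/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-embedding-model",
    "input": "Hello"
  }' | jq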
If HybridInference runs in Docker, use http://host.docker.internal:<port> in
config/models.yaml so the container can reach the host. If it runs directly on
the host, http://localhost:<port> is fine.
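To confirm the container-to-host path before touching the config, you can run the same check from inside the gateway container. The container name is an assumption; use whatever your stack calls the backend, and note this only works if curl is installed in the image:
docker exec -it <backend-container> \
  curl http://host.docker.internal:8007/v1/models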
Step 2: Add the Model to config/models.yaml
Add a new entry under models:. Keep the public id short and stable because
clients use it in the model field.
- id: qwen3.5-27b
name: Qwen3.5 27B
provider: sglang
quantization: "unknown"
input_modalities: ["text"]
output_modalities: ["text"]
context_length: 65536
max_output_length: 8192
supports_tools: true
supports_structured_output: true
supported_params: [temperature, top_p, max_tokens, stop, stream]
aliases: ["Qwen3.5-27B"]
pricing:
prompt: "0"
completion: "0"
image: "0"
request: "0"
input_cache_reads: "0"
input_cache_writes: "0"
route:
- kind: sglang
weight: 1.0
base_url: "http://host.docker.internal:8007"
provider_model_id: "Qwen3.5-27B"
pricing:
prompt: "0"
completion: "0"
Use these fields carefully:
- id: Public model ID returned by /v1/models and used by clients.
- provider: Top-level provider label for metadata. For local OpenAI-compatible servers, use vllm, sglang, or openai_compat.
- route[].kind: Adapter kind used by the gateway. Local OpenAI-compatible services can use vllm, sglang, or openai_compat (an Ollama example using openai_compat follows this list).
- base_url: The local server root. It may include /v1, but does not have to.
- provider_model_id: Model name sent to the local server. This must match the serving runtime's model name.
- aliases: Optional extra public names that resolve to the same gateway model.
- supported_params: Only include parameters that the local runtime accepts.
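For comparison, a route for an Ollama-served model would use the openai_compat kind and Ollama's OpenAI-compatible endpoint. The port is Ollama's default and the model name is just an example; adjust both for your setup:
route:
  - kind: openai_compat
    weight: 1.0
    base_url: "http://host.docker.internal:11434/v1"
    provider_model_id: "qwen2.5:7b"
    pricing:
      prompt: "0"
      completion: "0"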
Step 3: Add Optional Remote Fallbacks
If you want automatic fallback, add another route with a lower or equal weight:
route:
- kind: sglang
weight: 1.0
base_url: "http://host.docker.internal:8007"
provider_model_id: "Qwen3.5-27B"
pricing:
prompt: "0"
completion: "0"
- kind: featherless
weight: 0
base_url: ${FEATHERLESS_BASE_URL}
api_key: ${FEATHERLESS_API_KEY}
provider_model_id: "Qwen/Qwen3.5-27B"
pricing:
prompt: "0"
completion: "0"
Set fallback weight to 0 when you want to keep the route configured but
disabled. Set it above 0 to allow weighted routing and failover.
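The ${...} placeholders in the fallback route are typically resolved from environment variables. If that is how your deployment handles them, set the variables where the backend runs; the values below are illustrative, so check your provider's dashboard for the real base URL and key:
# .env (or exported in the backend's environment)
FEATHERLESS_BASE_URL=https://api.featherless.ai/v1
FEATHERLESS_API_KEY=sk-...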
Step 4: Restart the Gateway
Restart the backend so it reloads config/models.yaml:
make restart s=backend
For local development without Docker:
uvicorn serving.servers.app:app --host 0.0.0.0 --port 8080
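After the restart, any registry errors mentioned in Troubleshooting show up in the backend logs, so it is worth tailing them once. The compose service name here is an assumption based on the make target above:
docker compose logs -f backend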
Step 5: Verify Through HybridInference
List registered models:
curl http://localhost:8080/v1/models | jq
Run a chat completion through the gateway:
curl -s -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3.5-27b",
"messages": [{"role": "user", "content": "Hello from the gateway"}],
"max_tokens": 32
}' | jq
Test streaming:
curl -N -s -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3.5-27b",
"messages": [{"role": "user", "content": "Stream one sentence"}],
"stream": true,
"max_tokens": 64
}'
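Because the Step 2 entry declares aliases: ["Qwen3.5-27B"], the same request should also resolve when the alias is used as the model value:
curl -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-27B",
    "messages": [{"role": "user", "content": "Hello via alias"}],
    "max_tokens": 32
  }' | jq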
Routing Notes
If config/routing.yaml is present, it can adjust route weights after models are
registered. Without routing.yaml, the gateway uses the weights in
config/models.yaml.
When OFFLOAD=1, startup removes local adapters whose base_url matches
LOCAL_BASE_URL. Use this for remote-only incidents, and make sure any local
route you want offloaded uses the same base URL value as LOCAL_BASE_URL.
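As a sketch, an offload restart could look like the line below. How OFFLOAD and LOCAL_BASE_URL actually reach the backend depends on your deployment (.env file, compose environment, or the shell), so adapt accordingly:
# Illustrative only: drop local adapters whose base_url matches LOCAL_BASE_URL
OFFLOAD=1 LOCAL_BASE_URL="http://host.docker.internal:8007" make restart s=backend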
Troubleshooting
Model Does Not Appear in /v1/models
- Check YAML indentation under models:.
- Restart the backend after editing config/models.yaml.
- Confirm id and aliases do not collide with another model.
- Check backend logs for model registry errors.
Gateway Cannot Reach the Local Server
- From Docker, use host.docker.internal instead of localhost.
- From bare metal, use localhost or the host IP.
- For private remote servers, use a private IP/hostname or a private tunnel endpoint; avoid public internet exposure.
- Confirm the local server listens on 0.0.0.0, not only 127.0.0.1, if it must be reached from a container.
- Verify curl <base_url>/v1/models works from the same environment as the backend.
Requests Fail After Registration
- Make sure provider_model_id matches the model name exposed by the local runtime.
- Remove unsupported request params from supported_params.
- If the runtime's base URL already ends in /v1, keep it that way; the adapter will use /chat/completions under that base.
- For tool calls or JSON output, set supports_tools and supports_structured_output only when the local runtime supports them (a quick tool-call check is sketched below).
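If you do enable supports_tools, a minimal tool-call request through the gateway is a quick way to confirm the runtime honors it. The tool schema here is a throwaway example:
curl -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-27b",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "max_tokens": 64
  }' | jq '.choices[0].message'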