Adding a New Local Model
This guide explains how to register a self-hosted model behind the HybridInference
gateway. Use this when the model is already served by a local OpenAI-compatible
server such as vLLM, SGLang, Ollama, or a custom /v1/chat/completions service.
For new remote providers or custom adapters, see Adding a New Model.
Overview
Adding a local model has three parts:
1. Start the local inference server.
2. Add a config/models.yaml entry that points to that server.
3. Restart HybridInference and verify the public /v1 API.
The local server must expose OpenAI-compatible endpoints. The gateway forwards
chat requests to /v1/chat/completions and embedding requests to /v1/embeddings
when the model is registered as an embedding model.
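Concretely, the local server should answer on these OpenAI-compatible paths. /v1/models is only exercised by the verification steps in this guide, and /v1/embeddings applies only if you register the model for embeddings:
GET  /v1/models            # used by the curl checks below
POST /v1/chat/completions  # chat requests
POST /v1/embeddings        # only if registered as an embedding model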
Private Server (No Public Internet)
If your model runs on a different machine that is not exposed to the public internet, keep it private and let the gateway reach it over a trusted network path. Two common options are to point base_url at the server's private IP on an internal network, or to tunnel the server's port to the gateway host over SSH. Example route using a private IP:
route:
- kind: openai_compat
weight: 1.0
base_url: "http://10.0.12.34:8000/v1"
provider_model_id: "your-served-model-name"
Example SSH reverse tunnel (internal host -> FreeInference host):
# Run this on the INTERNAL model host
ssh -N -R 8001:127.0.0.1:8000 <user>@<freeinference-host>
Then set:
base_url: "http://127.0.0.1:8001/v1" # resolved on the FreeInference host
For reverse-tunnel setups, verify from the FreeInference side:
curl http://127.0.0.1:8001/v1/models | jq
Step 1: Start the Local Model Server
Start the model with your preferred serving runtime. Example with vLLM:
vllm serve Qwen/Qwen3.5-27B \
--host 0.0.0.0 \
--port 8007 \
--served-model-name Qwen3.5-27B
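The models.yaml entry in Step 2 uses kind: sglang; if you serve with SGLang instead of vLLM, the launch looks roughly like this (treat it as a sketch, since flag names can differ between SGLang versions):
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-27B \
  --host 0.0.0.0 \
  --port 8007 \
  --served-model-name Qwen3.5-27B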
Check that the local server responds before changing the gateway config:
curl http://localhost:8007/v1/models | jq
curl -s -X POST http://localhost:8007/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3.5-27B",
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 32
}' | jq
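If you intend to register the model for embeddings rather than chat, run the same kind of pre-check against the embeddings endpoint (the model name below is illustrative; use whatever your runtime actually serves):
curl -s -X POST http://localhost:8007/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-embedding-model",
    "input": "Hello"
  }' | jq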
If HybridInference runs in Docker, use http://host.docker.internal:<port> in
config/models.yaml so the container can reach the host. If it runs directly on
the host, http://localhost:<port> is fine.
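To confirm the container-to-host path before touching the config, you can run the same check from inside the gateway container. The container name is an assumption; use whatever your stack calls the backend, and note this only works if curl is installed in the image:
docker exec -it <backend-container> \
  curl http://host.docker.internal:8007/v1/models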
Step 2: Add the Model to config/models.yaml
Add a new entry under models:. Keep the public id short and stable because
clients use it in the model field.
- id: qwen3.5-27b
name: Qwen3.5 27B
provider: sglang
quantization: "unknown"
input_modalities: ["text"]
output_modalities: ["text"]
context_length: 65536
max_output_length: 8192
supports_tools: true
supports_structured_output: true
supported_params: [temperature, top_p, max_tokens, stop, stream]
aliases: ["Qwen3.5-27B"]
pricing:
prompt: "0"
completion: "0"
image: "0"
request: "0"
input_cache_reads: "0"
input_cache_writes: "0"
route:
- kind: sglang
weight: 1.0
base_url: "http://host.docker.internal:8007"
provider_model_id: "Qwen3.5-27B"
pricing:
prompt: "0"
completion: "0"
Use these fields carefully:
- id: Public model ID returned by /v1/models and used by clients.
- provider: Top-level provider label for metadata. For local OpenAI-compatible servers, use vllm, sglang, or openai_compat.
- route[].kind: Adapter kind used by the gateway. Local OpenAI-compatible services can use vllm, sglang, or openai_compat (an Ollama example using openai_compat follows this list).
- base_url: The local server root. It may include /v1, but does not have to.
- provider_model_id: Model name sent to the local server. This must match the serving runtime's model name.
- aliases: Optional extra public names that resolve to the same gateway model.
- supported_params: Only include parameters that the local runtime accepts.
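For comparison, a route for an Ollama-served model would use the openai_compat kind and Ollama's OpenAI-compatible endpoint. The port is Ollama's default and the model name is just an example; adjust both for your setup:
route:
  - kind: openai_compat
    weight: 1.0
    base_url: "http://host.docker.internal:11434/v1"
    provider_model_id: "qwen2.5:7b"
    pricing:
      prompt: "0"
      completion: "0"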
Step 3: Add Optional Remote Fallbacks
If you want automatic fallback, add another route with a lower or equal weight:
route:
- kind: sglang
weight: 1.0
base_url: "http://host.docker.internal:8007"
provider_model_id: "Qwen3.5-27B"
pricing:
prompt: "0"
completion: "0"
- kind: featherless
weight: 0
base_url: ${FEATHERLESS_BASE_URL}
api_key: ${FEATHERLESS_API_KEY}
provider_model_id: "Qwen/Qwen3.5-27B"
pricing:
prompt: "0"
completion: "0"
Set fallback weight to 0 when you want to keep the route configured but
disabled. Set it above 0 to allow weighted routing and failover.
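The ${...} placeholders in the fallback route are typically resolved from environment variables. If that is how your deployment handles them, set the variables where the backend runs; the values below are illustrative, so check your provider's dashboard for the real base URL and key:
# .env (or exported in the backend's environment)
FEATHERLESS_BASE_URL=https://api.featherless.ai/v1
FEATHERLESS_API_KEY=sk-...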
Step 4: Restart the Gateway
Restart the backend so it reloads config/models.yaml:
make restart s=backend
For local development without Docker:
uvicorn serving.servers.app:app --host 0.0.0.0 --port 8080
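After the restart, any registry errors mentioned in Troubleshooting show up in the backend logs, so it is worth tailing them once. The compose service name here is an assumption based on the make target above:
docker compose logs -f backend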
Step 5: Verify Through HybridInference
List registered models:
curl http://localhost:8080/v1/models | jq
Run a chat completion through the gateway:
curl -s -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3.5-27b",
"messages": [{"role": "user", "content": "Hello from the gateway"}],
"max_tokens": 32
}' | jq
Test streaming:
curl -N -s -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3.5-27b",
"messages": [{"role": "user", "content": "Stream one sentence"}],
"stream": true,
"max_tokens": 64
}'
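Because the Step 2 entry declares aliases: ["Qwen3.5-27B"], the same request should also resolve when the alias is used as the model value:
curl -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-27B",
    "messages": [{"role": "user", "content": "Hello via alias"}],
    "max_tokens": 32
  }' | jq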
Routing Notes
If config/routing.yaml is present, it can adjust route weights after models are
registered. Without routing.yaml, the gateway uses the weights in
config/models.yaml.
When OFFLOAD=1, startup removes local adapters whose base_url matches
LOCAL_BASE_URL. Use this for remote-only incidents, and make sure any local
route you want offloaded uses the same base URL value as LOCAL_BASE_URL.
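As a sketch, an offload restart could look like the line below. How OFFLOAD and LOCAL_BASE_URL actually reach the backend depends on your deployment (.env file, compose environment, or the shell), so adapt accordingly:
# Illustrative only: drop local adapters whose base_url matches LOCAL_BASE_URL
OFFLOAD=1 LOCAL_BASE_URL="http://host.docker.internal:8007" make restart s=backend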
Troubleshooting
Model Does Not Appear in /v1/models
- Check YAML indentation under models:.
- Restart the backend after editing config/models.yaml.
- Confirm id and aliases do not collide with another model.
- Check backend logs for model registry errors.
Gateway Cannot Reach the Local Server
- From Docker, use host.docker.internal instead of localhost.
- From bare metal, use localhost or the host IP.
- For private remote servers, use a private IP/hostname or a private tunnel endpoint; avoid public internet exposure.
- Confirm the local server listens on 0.0.0.0, not only 127.0.0.1, if it must be reached from a container.
- Verify curl <base_url>/v1/models works from the same environment as the backend.
Requests Fail After Registration
- Make sure provider_model_id matches the model name exposed by the local runtime.
- Remove unsupported request params from supported_params.
- If the runtime's base URL already ends in /v1, keep it that way; the adapter will use /chat/completions under that base.
- For tool calls or JSON output, set supports_tools and supports_structured_output only when the local runtime supports them (a quick tool-call check is sketched below).
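If you do enable supports_tools, a minimal tool-call request through the gateway is a quick way to confirm the runtime honors it. The tool schema here is a throwaway example:
curl -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-27b",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "max_tokens": 64
  }' | jq '.choices[0].message'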