FreeInference Deployment

FastAPI + systemd (current)

We serve OpenRouter-compatible traffic directly through a FastAPI application listening on port 80. Removing Nginx reduces operational overhead, keeps debugging straightforward, and lets systemd own the lifecycle of the gateway process.

Overview

┌─────────────┐      ┌─────────────────┐      ┌────────────────────┐
│  OpenRouter │─────▶│ FastAPI Gateway │─────▶│ Model Executors... │
└─────────────┘      └─────────────────┘      └────────────────────┘
  • FastAPI binds to 0.0.0.0:80 and exposes the /v1 endpoints consumed by OpenRouter clients (an example request follows this list).

  • The gateway handles request authentication, routing, and backpressure before invoking the selected model adapter.

  • systemd supervises the process, ensuring automatic restarts after crashes or host reboots.
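
For a concrete picture of the /v1 surface, an OpenRouter-style chat completion against the current gateway looks like the request below. The model identifier is taken from the legacy examples later in this document and may differ on a given deployment; list the live ones via /v1/models, and add an Authorization header if the gateway requires an API key.

# Example chat completion against the current gateway (model name illustrative)
curl -X POST https://freeinference.org/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/models/Qwen_Qwen3-Coder-480B-A35B-Instruct-FP8", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}'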

Deployment Steps

  1. Install runtime dependencies

    Ensure the Python environment and model weights are ready. Confirm the FastAPI entry point (serving.servers.bootstrap:app) is reachable via uvicorn or the configured launcher script.
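
    A quick foreground run is an easy smoke test before wiring up systemd. This assumes the repository root is the current directory and uses an unprivileged local port; the /health path matches the health check documented under Runtime Operations below.

    # Run the gateway locally on an unprivileged port
    uvicorn serving.servers.bootstrap:app --host 127.0.0.1 --port 8080

    # In another shell, hit the health endpoint
    curl http://127.0.0.1:8080/health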

  2. Create the unit file

    sudo tee /etc/systemd/system/freeinference.service <<'UNIT'
    [Unit]
    Description=FreeInference FastAPI service
    After=network-online.target
    Wants=network-online.target
    
    [Service]
    Type=simple
    User=ubuntu
    # Allow binding to port 80 without running the service as root
    AmbientCapabilities=CAP_NET_BIND_SERVICE
    WorkingDirectory=/home/ubuntu/hybridInference
    ExecStart=/usr/bin/env uvicorn serving.servers.bootstrap:app --host 0.0.0.0 --port 80
    Restart=always
    RestartSec=5
    Environment=PYTHONUNBUFFERED=1
    
    [Install]
    WantedBy=multi-user.target
    UNIT
    

    Replace User, WorkingDirectory, and Environment entries as needed for the target host. The repository carries a maintained version of this unit at infrastructure/systemd/hybrid_inference.service; copy or symlink it into /etc/systemd/system/freeinference.service during deploys.
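
    For example, assuming the repository is checked out at the WorkingDirectory used above:

    sudo cp /home/ubuntu/hybridInference/infrastructure/systemd/hybrid_inference.service \
            /etc/systemd/system/freeinference.service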

  3. Reload and enable the service

    sudo systemctl daemon-reload
    sudo systemctl enable freeinference.service
    sudo systemctl start freeinference.service
    sudo systemctl status freeinference.service
    

Runtime Operations

  • Restart on demand: sudo systemctl restart freeinference.service

  • Follow logs: journalctl -u freeinference.service -f

  • Health check: curl https://freeinference.org/health

  • List registered models: curl https://freeinference.org/v1/models | jq (both checks are combined into a probe sketch below)
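
The two checks above can be combined into a small probe for cron or an external monitor. This is a sketch built only from the endpoints listed in this section; it assumes curl and jq are available on the monitoring host.

#!/usr/bin/env bash
# Probe sketch: exits non-zero if either endpoint is unreachable
# or returns a non-2xx status.
set -euo pipefail

BASE_URL="https://freeinference.org"

# -f makes curl fail on HTTP errors; --max-time bounds each check.
curl -fsS --max-time 10 "${BASE_URL}/health" > /dev/null
curl -fsS --max-time 10 "${BASE_URL}/v1/models" | jq . > /dev/null

echo "freeinference gateway healthy"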

Why We Dropped Nginx

  • FastAPI already terminates HTTP and exposes the required OpenRouter-compatible endpoints.

  • Nginx added another moving part, increasing failover complexity and opaque error handling.

  • Debugging latency or request routing is simpler when traffic is handled in a single process.

Legacy Architectures

Nginx (v2, abandoned)

We briefly fronted FastAPI (running on port 8080) with vanilla Nginx, which exposed freeinference.org on port 80 and terminated TLS for the public endpoint. Once Cloudflare took over edge TLS duties, the extra hop mostly added deployment and observability complexity without material benefit, so the setup was removed.

Nginx + Lua via OpenResty (v1, abandoned)

We previously relied on OpenResty (Nginx + Lua) to provide a production routing tier across multiple LLM backends. The stack handled model mapping, load balancing, health checks, and error handling. We keep the installation notes for posterity.

Overview

┌─────────────┐      ┌──────────────────┐      ┌─────────────────┐
│   Client    │─────▶│     OpenResty    │─────▶│    Backend 1    │
│  (API Call) │      │     (Router)     │      │   (Qwen@8000)   │
└─────────────┘      │ - Model Mapping  │      └─────────────────┘
                     │ - Load Balancing │      ┌─────────────────┐
                     │ - Health Checks  │─────▶│    Backend 2    │
                     │ - Error Handling │      │  (Llama@8001)   │
                     └──────────────────┘      └─────────────────┘

Installation Notes

# Add repository
wget -O - https://openresty.org/package/pubkey.gpg | sudo apt-key add -
echo "deb http://openresty.org/package/ubuntu $(lsb_release -sc) main" | \
    sudo tee /etc/apt/sources.list.d/openresty.list

# Install
sudo apt-get update
sudo apt-get install openresty

# Create directories
sudo mkdir -p /usr/local/openresty/nginx/conf/sites-available
sudo mkdir -p /usr/local/openresty/nginx/conf/sites-enabled

# Copy the site config file
sudo cp <your config file> /usr/local/openresty/nginx/conf/sites-available/vllm

# Enable the site
sudo ln -s /usr/local/openresty/nginx/conf/sites-available/vllm \
           /usr/local/openresty/nginx/conf/sites-enabled/vllm

# In the main config (/usr/local/openresty/nginx/conf/nginx.conf),
# add the Lua settings and include the enabled sites inside the http block:
http {
    # ... Others ...

    # Lua settings
    lua_package_path "/usr/local/openresty/lualib/?.lua;;";
    lua_shared_dict model_cache 10m;

    # Include Site Configuration
    include /usr/local/openresty/nginx/conf/sites-enabled/*;
}
# Test the OpenResty config
sudo openresty -t

# Start
sudo systemctl start openresty

# Enable auto-start
sudo systemctl enable openresty

# Reload OpenResty
sudo openresty -s reload

# Check the health endpoint
curl https://freeinference.org/health

# List all models
curl https://freeinference.org/v1/models | jq

# Chat with Qwen3-Coder
curl -X POST http://freeinference.org/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/models/Qwen_Qwen3-Coder-480B-A35B-Instruct-FP8", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}'

# Chat with llama4-scout
curl -X POST http://freeinference.org/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/models/meta-llama_Llama-4-Scout-17B-16E", "messages": [{"role": "user", "content": "Hi"}], "max_tokens": 50}'

Nginx (v0, abandoned)

The original setup used stock Nginx with a hand-edited site config:

sudo vim /etc/nginx/sites-available/vllm
sudo nginx -t
sudo systemctl reload nginx

# Test the endpoint
curl https://freeinference.org/v1/models