FreeInference Deployment

Cloudflare + Nginx + FastAPI (current)

Traffic flows through three layers before reaching the application:

Client ──▶ Cloudflare ──▶ Nginx (:443) ──▶ FastAPI (:8080)
                                      └──▶ Frontend (:3001)

| Layer      | Role |
| ---------- | ---- |
| Cloudflare | CDN, DDoS protection, edge SSL termination. SSL/TLS mode set to Full (strict) so Cloudflare verifies the origin certificate. The CF-Connecting-IP header carries the real client IP. |
| Nginx      | TLS termination (Let's Encrypt cert), path-based routing (/v1/, /auth/, /user/, /admin/ → FastAPI; everything else → frontend), request body limits (client_max_body_size), WebSocket upgrade. |
| FastAPI    | API logic: request authentication, model routing, rate limiting, backpressure, Qdrant proxy, and observability. Listens on 127.0.0.1:8080. |
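
The Nginx layer's responsibilities boil down to one server block. A minimal sketch, not the production config: the certificate paths and the body-size value are assumptions; the ports and path list come from the table above.

```nginx
server {
    listen 443 ssl;
    server_name freeinference.org;

    # Let's Encrypt certificate (paths are assumptions)
    ssl_certificate     /etc/letsencrypt/live/freeinference.org/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/freeinference.org/privkey.pem;

    # Gate large bodies (e.g. Qdrant upserts) before they reach FastAPI
    client_max_body_size 32m;   # illustrative value

    # API paths -> FastAPI
    location ~ ^/(v1|auth|user|admin)/ {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        # Real client IP as supplied by Cloudflare
        proxy_set_header CF-Connecting-IP $http_cf_connecting_ip;
    }

    # Everything else -> Next.js frontend
    location / {
        proxy_pass http://127.0.0.1:3001;
        proxy_set_header Host $host;
    }
}
```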

Docker Compose manages all services (backend, frontend, PostgreSQL, Prometheus, Alertmanager, alert-logger, Grafana) with automatic restarts via restart: unless-stopped.
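
The compose file ties those services together along these lines (a sketch only: build paths, image tags, and port bindings are assumptions; the service list and restart policy are from above; the monitoring services are elided for brevity):

```yaml
services:
  backend:
    build: ./backend              # path is an assumption
    ports:
      - "127.0.0.1:8080:8080"     # loopback only; host Nginx fronts it
    env_file: .env
    restart: unless-stopped

  frontend:
    build: ./frontend
    ports:
      - "127.0.0.1:3001:3001"
    restart: unless-stopped

  postgres:
    image: postgres:16
    env_file: .env
    volumes:
      - pgdata:/var/lib/postgresql/data
    restart: unless-stopped

  # prometheus, alertmanager, alert-logger, grafana: analogous entries

volumes:
  pgdata:
```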

Deployment

All services are defined in infrastructure/docker/docker-compose.yml. From the project root:

cp .env.example .env   # Configure secrets
make up                # Start all services
make ps                # Verify health
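
The Make targets are thin wrappers over Docker Compose; roughly (a sketch, the real Makefile recipes are assumptions beyond the compose file path stated above):

```shell
# Roughly what the Make targets wrap (a sketch; recipes are assumptions).
COMPOSE="docker compose -f infrastructure/docker/docker-compose.yml"

up()      { $COMPOSE up -d; }          # make up
status()  { $COMPOSE ps; }             # make ps
logs()    { $COMPOSE logs -f "$@"; }   # make logs [s=<service>]
restart() { $COMPOSE restart "$@"; }   # make restart [s=<service>]
```

`logs backend` then mirrors `make logs s=backend`; with no argument it follows all services.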

Nginx runs on the host (not containerized) for SSL termination. See Deployment for the full guide.

Runtime Operations

  • Restart: make restart or make restart s=backend

  • Follow logs: make logs or make logs s=backend

  • Health check: curl https://freeinference.org/health

  • List registered models: curl https://freeinference.org/v1/models | jq
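
The `| jq` pipe above only pretty-prints. When scripting against the model list, pull out just the IDs; a sketch against a canned response (the OpenAI-style `{"data": [{"id": ...}]}` shape and the model IDs are assumptions), with a plain-sed fallback so it runs where jq is absent:

```shell
# Canned example of what /v1/models returns (shape assumed; in production:
#   response=$(curl -s https://freeinference.org/v1/models)  )
response='{"object":"list","data":[{"id":"qwen3-coder"},{"id":"llama-4-scout"}]}'

if command -v jq >/dev/null 2>&1; then
    ids=$(printf '%s' "$response" | jq -r '.data[].id')
else
    # POSIX fallback: split on commas, grab every "id" value
    ids=$(printf '%s' "$response" | tr ',' '\n' \
          | sed -n 's/.*"id":"\([^"]*\)".*/\1/p')
fi

printf '%s\n' "$ids"
```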

Why Nginx Is Back

Nginx was briefly removed (see Legacy section below) when FreeInference was API-only and Cloudflare handled all edge concerns. It was re-introduced when we added:

  • Frontend: The Next.js web UI runs on port 3001 and needs to share the freeinference.org domain with the API. Path-based routing (/v1/* → backend, /* → frontend) is a natural fit for Nginx.

  • Body size limits: Qdrant vector upserts can be large. Nginx’s client_max_body_size gives a clear, configurable gate before traffic hits FastAPI.

  • WebSocket upgrade: Nginx handles the Upgrade / Connection headers cleanly for SSE and WebSocket-based streaming.
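
The upgrade handling in the last bullet comes down to a few directives. This is the standard Nginx idiom, not copied from the production config; the timeout value is illustrative:

```nginx
# In the http block: send "Connection: upgrade" only when the client
# actually asked for an upgrade
map $http_upgrade $connection_upgrade {
    default upgrade;
    ''      close;
}

# In the API location block
proxy_http_version 1.1;
proxy_set_header Upgrade    $http_upgrade;
proxy_set_header Connection $connection_upgrade;
proxy_buffering  off;       # lets SSE chunks flush to the client promptly
proxy_read_timeout 1h;      # long-lived streams (illustrative value)
```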

Legacy Architectures

FastAPI direct (v3, abandoned)

We previously served OpenRouter-compatible traffic directly through FastAPI listening on port 80, without Nginx. This was simpler but could not support frontend co-hosting or fine-grained body size limits. Once the frontend was added, we moved back to Nginx.

Nginx (v2, abandoned)

We briefly fronted FastAPI (running on port 8080) with vanilla Nginx that exposed http://freeinference.org on port 80 and terminated TLS for the public endpoint. Once Cloudflare took over edge SSL duties, the extra hop mostly added deployment and observability complexity without material benefit, so the setup was removed.

Nginx + Lua via OpenResty (v1, abandoned)

We previously relied on OpenResty (Nginx + Lua) to provide a production routing tier across multiple LLM backends. The stack handled model mapping, load balancing, health checks, and error handling. We keep the installation notes for posterity.

Overview

┌─────────────┐      ┌──────────────────┐      ┌─────────────────┐
│   Client    │─────▶│  OpenResty       │─────▶│  Backend 1      │
│  (API Call) │      │  (Router)        │      │  (Qwen@8000)    │
└─────────────┘      │                  │      └─────────────────┘
                     │ - Model Mapping  │
                     │ - Load Balancing │      ┌─────────────────┐
                     │ - Health Checks  │─────▶│  Backend 2      │
                     │ - Error Handling │      │  (Llama@8001)   │
                     └──────────────────┘      └─────────────────┘
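
The routing tier above maps roughly onto an upstream-based OpenResty config. This is a sketch reconstructed from the feature list: the upstream names, backend addresses, and the body-sniffing Lua are all assumptions, not the retired production config.

```nginx
upstream qwen_backend {
    server 127.0.0.1:8000 max_fails=3 fail_timeout=30s;  # passive health checks
}
upstream llama_backend {
    server 127.0.0.1:8001 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;

    location /v1/chat/completions {
        set $target qwen_backend;        # default backend

        # Model mapping: route by the "model" field in the JSON body
        access_by_lua_block {
            ngx.req.read_body()
            local body = ngx.req.get_body_data() or ""
            if body:find("Llama", 1, true) then
                ngx.var.target = "llama_backend"
            end
        }

        proxy_pass http://$target;
    }
}
```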

Installation Notes

# Add repository
wget -O - https://openresty.org/package/pubkey.gpg | sudo apt-key add -
echo "deb http://openresty.org/package/ubuntu $(lsb_release -sc) main" | \
    sudo tee /etc/apt/sources.list.d/openresty.list

# Install
sudo apt-get update
sudo apt-get install openresty
# Create directory
sudo mkdir -p /usr/local/openresty/nginx/conf/sites-available
sudo mkdir -p /usr/local/openresty/nginx/conf/sites-enabled

# Copy Config file
sudo cp <your config file> /usr/local/openresty/nginx/conf/sites-available/vllm

# Enable the site
sudo ln -s /usr/local/openresty/nginx/conf/sites-available/vllm \
           /usr/local/openresty/nginx/conf/sites-enabled/vllm
# Reference the enabled sites from the main config
# (/usr/local/openresty/nginx/conf/nginx.conf):
http {
    # ... Others ...

    # Lua settings
    lua_package_path "/usr/local/openresty/lualib/?.lua;;";
    lua_shared_dict model_cache 10m;

    # Include Site Configuration
    include /usr/local/openresty/nginx/conf/sites-enabled/*;
}
# test openresty config
sudo openresty -t

# Start
sudo systemctl start openresty

# Enable auto-start
sudo systemctl enable openresty

# reload openresty
sudo openresty -s reload
# check service status
curl https://freeinference.org/health

# list all models
curl https://freeinference.org/v1/models | jq

# Chat with Qwen3-Coder
curl -X POST http://freeinference.org/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/models/Qwen_Qwen3-Coder-480B-A35B-Instruct-FP8", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}'

# Chat with llama4-scout
curl -X POST http://freeinference.org/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/models/meta-llama_Llama-4-Scout-17B-16E", "messages": [{"role": "user", "content": "Hi"}], "max_tokens": 50}'

Nginx (v0, abandoned)

sudo vim /etc/nginx/sites-available/vllm
sudo nginx -t
sudo systemctl reload nginx

# to test the endpoint
curl https://freeinference.org/v1/models