FreeInference Deployment
Cloudflare + Nginx + FastAPI (current)
Traffic flows through three layers before reaching the application:
```
Client ──▶ Cloudflare ──▶ Nginx (:443) ──▶ FastAPI (:8080)
                                       └──▶ Frontend (:3001)
```
| Layer | Role |
|---|---|
| Cloudflare | CDN, DDoS protection, edge SSL termination. SSL/TLS mode is set to Full (strict) so Cloudflare verifies the origin certificate. |
| Nginx | TLS termination (Let's Encrypt cert), path-based routing (`/v1/*` → backend, `/*` → frontend), body size limits, WebSocket upgrade. |
| FastAPI | API logic: request authentication, model routing, rate limiting, backpressure, Qdrant proxy, and observability. Listens on port 8080. |
Docker Compose manages all services (backend, frontend, PostgreSQL, Prometheus, Alertmanager, alert-logger, Grafana) with automatic restarts via `restart: unless-stopped`.
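A trimmed sketch of how the restart policy applies per service (service names come from the list above; the image tags and port mappings shown here are illustrative, not copied from the real Compose file):

```yaml
services:
  backend:
    restart: unless-stopped
    ports:
      - "127.0.0.1:8080:8080"   # FastAPI, reached only via Nginx on the host
  frontend:
    restart: unless-stopped
    ports:
      - "127.0.0.1:3001:3001"   # Next.js web UI
  postgres:
    image: postgres:16          # version is illustrative
    restart: unless-stopped
```

Binding to `127.0.0.1` keeps the containers off the public interface, so all external traffic must pass through Cloudflare and Nginx.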
Deployment
All services are defined in `infrastructure/docker/docker-compose.yml`. From the project root:

```bash
cp .env.example .env   # Configure secrets
make up                # Start all services
make ps                # Verify health
```
Nginx runs on the host (not containerized) for SSL termination. See Deployment for the full guide.
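The `make` targets are thin wrappers around Docker Compose. A plausible sketch of how they might be defined, with the optional `s=` variable selecting a single service (the real Makefile may differ):

```make
COMPOSE := docker compose -f infrastructure/docker/docker-compose.yml

up:
	$(COMPOSE) up -d

ps:
	$(COMPOSE) ps

restart:
	$(COMPOSE) restart $(s)

logs:
	$(COMPOSE) logs -f $(s)
```

With this shape, `make restart` (no `s=`) restarts every service, while `make restart s=backend` restarts only the backend container.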
Runtime Operations
- Restart: `make restart` or `make restart s=backend`
- Follow logs: `make logs` or `make logs s=backend`
- Health check: `curl https://freeinference.org/health`
- List registered models: `curl https://freeinference.org/v1/models | jq`
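For scripted checks, the same model-list endpoint can be queried from Python. This sketch assumes the response uses the OpenAI-compatible `{"object": "list", "data": [...]}` shape (consistent with the OpenRouter-compatible API described below, but not verified against the live service):

```python
import json
from urllib.request import urlopen


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style /v1/models response."""
    # Assumed shape: {"object": "list", "data": [{"id": "..."}, ...]}
    return [m["id"] for m in payload.get("data", [])]


def list_model_ids(base_url: str = "https://freeinference.org") -> list[str]:
    """Fetch /v1/models and return the registered model IDs."""
    with urlopen(f"{base_url}/v1/models") as resp:
        return parse_model_ids(json.load(resp))
```

This mirrors the `curl ... | jq` check above, but returns a Python list suitable for monitoring scripts or smoke tests.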
Why Nginx Is Back
Nginx was briefly removed (see Legacy section below) when FreeInference was API-only and Cloudflare handled all edge concerns. It was re-introduced when we added:
- Frontend: The Next.js web UI runs on port 3001 and needs to share the `freeinference.org` domain with the API. Path-based routing (`/v1/*` → backend, `/*` → frontend) is a natural fit for Nginx.
- Body size limits: Qdrant vector upserts can be large. Nginx's `client_max_body_size` gives a clear, configurable gate before traffic hits FastAPI.
- WebSocket upgrade: Nginx handles the `Upgrade`/`Connection` headers cleanly for SSE and WebSocket-based streaming.
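These three requirements map onto a handful of Nginx directives. A minimal sketch of the server block (the certificate paths and the 100m limit are illustrative, not taken from the production config):

```nginx
server {
    listen 443 ssl;
    server_name freeinference.org;

    # TLS termination with the Let's Encrypt cert (paths are illustrative)
    ssl_certificate     /etc/letsencrypt/live/freeinference.org/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/freeinference.org/privkey.pem;

    # Gate for large Qdrant vector upserts (limit is illustrative)
    client_max_body_size 100m;

    # API: /v1/* -> FastAPI backend
    location /v1/ {
        proxy_pass http://127.0.0.1:8080;
        proxy_http_version 1.1;
        # WebSocket upgrade and unbuffered streaming for SSE
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_buffering off;
    }

    # Everything else -> Next.js frontend
    location / {
        proxy_pass http://127.0.0.1:3001;
    }
}
```

The `location` ordering gives `/v1/` priority over the catch-all `/`, which is exactly the path-based split the bullets above describe.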
Legacy Architectures
FastAPI direct (v3, abandoned)
We previously served OpenRouter-compatible traffic directly through FastAPI listening on port 80, without Nginx. This was simpler but could not support frontend co-hosting or fine-grained body size limits. Once the frontend was added, we moved back to Nginx.
Nginx (v2, abandoned)
We briefly fronted FastAPI (running on port 8080) with vanilla Nginx that exposed http://freeinference.org on port 80 and terminated TLS for the public endpoint. Once Cloudflare took over edge SSL duties, the extra hop mostly added deployment and observability complexity without material benefit, so the setup was removed.
Nginx + Lua via OpenResty (v1, abandoned)
We previously relied on OpenResty (Nginx + Lua) to provide a production routing tier across multiple LLM backends. The stack handled model mapping, load balancing, health checks, and error handling. We keep the installation notes for posterity.
Overview
```
┌─────────────┐      ┌──────────────────┐      ┌─────────────────┐
│   Client    │─────▶│    OpenResty     │─────▶│    Backend 1    │
│ (API Call)  │      │     (Router)     │      │   (Qwen@8000)   │
└─────────────┘      │                  │      └─────────────────┘
                     │ - Model Mapping  │
                     │ - Load Balancing │      ┌─────────────────┐
                     │ - Health Checks  │─────▶│    Backend 2    │
                     │ - Error Handling │      │  (Llama@8001)   │
                     └──────────────────┘      └─────────────────┘
```
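The routing logic lived in the `sites-available/vllm` config. A hypothetical reconstruction of its shape, using the backend ports from the diagram; the model-matching shown here is a stand-in for the real Lua code, which also used the `model_cache` shared dict and handled health checks and errors:

```nginx
upstream qwen_backend  { server 127.0.0.1:8000; }
upstream llama_backend { server 127.0.0.1:8001; }

server {
    listen 80;
    server_name freeinference.org;

    location /v1/ {
        set $target "";
        access_by_lua_block {
            -- Illustrative model mapping: pick an upstream based on the
            -- "model" field in the request body.
            ngx.req.read_body()
            local body = ngx.req.get_body_data() or ""
            if body:find("Llama") then
                ngx.var.target = "llama_backend"
            else
                ngx.var.target = "qwen_backend"
            end
        }
        proxy_pass http://$target;
    }
}
```

The `set $target` / `access_by_lua_block` / `proxy_pass http://$target` pattern is the standard OpenResty idiom for choosing an upstream dynamically from Lua.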
Installation Notes
```bash
# Add repository
wget -O - https://openresty.org/package/pubkey.gpg | sudo apt-key add -
echo "deb http://openresty.org/package/ubuntu $(lsb_release -sc) main" | \
    sudo tee /etc/apt/sources.list.d/openresty.list

# Install
sudo apt-get update
sudo apt-get install openresty

# Create config directories
sudo mkdir -p /usr/local/openresty/nginx/conf/sites-available
sudo mkdir -p /usr/local/openresty/nginx/conf/sites-enabled

# Copy config file
sudo cp <your config file> /usr/local/openresty/nginx/conf/sites-available/vllm

# Enable the site
sudo ln -s /usr/local/openresty/nginx/conf/sites-available/vllm \
    /usr/local/openresty/nginx/conf/sites-enabled/vllm
```
Then reference the enabled site from the `http` block of `nginx.conf`:

```nginx
http {
    # ... other settings ...

    # Lua settings
    lua_package_path "/usr/local/openresty/lualib/?.lua;;";
    lua_shared_dict model_cache 10m;

    # Include site configuration
    include /usr/local/openresty/nginx/conf/sites-enabled/*;
}
```
```bash
# Test the OpenResty config
sudo openresty -t

# Start
sudo systemctl start openresty

# Enable auto-start on boot
sudo systemctl enable openresty

# Reload after config changes
sudo openresty -s reload

# Check service status
curl https://freeinference.org/health

# List all models
curl https://freeinference.org/v1/models | jq

# Chat with Qwen3-Coder
curl -X POST http://freeinference.org/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "/models/Qwen_Qwen3-Coder-480B-A35B-Instruct-FP8", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}'

# Chat with llama4-scout
curl -X POST http://freeinference.org/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "/models/meta-llama_Llama-4-Scout-17B-16E", "messages": [{"role": "user", "content": "Hi"}], "max_tokens": 50}'
```
Nginx (v0, abandoned)
The very first iteration was a hand-edited vanilla Nginx site:

```bash
sudo vim /etc/nginx/sites-available/vllm
sudo nginx -t
sudo systemctl reload nginx

# Test the endpoint
curl https://freeinference.org/v1/models
```