# Available Models

HybridInference provides access to multiple state-of-the-art large language models.
## Model Overview

| Model ID | Name | Context Length | Pricing |
|---|---|---|---|
| `llama-3.3-70b-instruct` | Llama 3.3 70B Instruct | 131K tokens | Free |
| `llama-4-scout` | Llama 4 Scout | 128K tokens | Free |
| `llama-4-maverick` | Llama 4 Maverick | 128K tokens | Free |
| `gemini-2.5-flash` | Gemini 2.5 Flash | 1M tokens | Free |
| | Gemini 2.5 Flash Preview | 1M tokens | Free |
| `glm-4.5` | GLM-4.5 | 128K tokens | Free |
| `gpt-5` | GPT-5 | 128K tokens | Free |
| `custom-model-alpha` | Claude Opus 4.1 | 200K tokens | Free |
| | GPT-5 (Azure) | 400K tokens | Free |
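The IDs in this table can also be fetched at runtime, assuming the service exposes the standard OpenAI-compatible `GET /v1/models` route (an assumption; the base URL matches the example later on this page). A minimal sketch using only the standard library:

```python
import json
import urllib.request

# Assumption: the service exposes the OpenAI-compatible GET /v1/models route.
def build_models_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build the authenticated request for the model listing endpoint."""
    return urllib.request.Request(
        f"{base_url}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )

def list_model_ids(base_url: str, api_key: str) -> list[str]:
    """Fetch and return the available model IDs."""
    with urllib.request.urlopen(build_models_request(base_url, api_key)) as resp:
        return [m["id"] for m in json.load(resp)["data"]]

# Usage:
# print(list_model_ids("https://freeinference.org/v1", "your-api-key-here"))
```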
## Model Details
### Llama 3.3 70B Instruct

**Model ID:** `llama-3.3-70b-instruct`

High-performance open-source model optimized for instruction following.

**Key Features:**
- Context length: 131,072 tokens
- Max output: 8,192 tokens
- Function calling support
- Structured output (JSON mode)
- Quantization: bf16

**Best For:**
- General purpose chat
- Long-form content generation
- Code generation
- Instruction following
**Example:**

```python
# Assumes `client` is configured as shown under "Using Different Models" below.
response = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    max_tokens=2048,
)
```
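The feature list above mentions function calling; in the OpenAI-compatible API this is requested through the `tools` parameter. A minimal sketch of such a request payload — the `get_weather` tool is a made-up illustration, not part of the HybridInference API:

```python
# Sketch of a function-calling request payload for llama-3.3-70b-instruct.
# The `get_weather` tool is a hypothetical example, not an API built-in.
def build_tool_request(question: str) -> dict:
    return {
        "model": "llama-3.3-70b-instruct",
        "messages": [{"role": "user", "content": question}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
        "max_tokens": 1024,
    }

# Usage (assumes `client` configured as in "Using Different Models"):
# response = client.chat.completions.create(**build_tool_request("Weather in Oslo?"))
```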
### Llama 4 Scout

**Model ID:** `llama-4-scout`

Efficient MoE (Mixture of Experts) model for fast inference.

**Key Features:**
- Context length: 128,000 tokens
- Max output: 16,384 tokens
- Function calling support
- Structured output
- Quantization: fp8

**Best For:**
- Fast inference scenarios
- Cost-effective deployments
- Production workloads
### Llama 4 Maverick

**Model ID:** `llama-4-maverick`

Advanced MoE (Mixture of Experts) model for complex tasks.

**Key Features:**
- Context length: 128,000 tokens
- Max output: 16,384 tokens
- Function calling support
- Structured output
- Quantization: fp8

**Best For:**
- Complex reasoning tasks
- Long-form generation
- Production workloads with high quality requirements
### Gemini 2.5 Flash

**Model ID:** `gemini-2.5-flash`

Google’s fast and efficient model.

**Key Features:**
- Fast inference speed
- High throughput
- Large context window
- Production-ready

**Best For:**
- Real-time applications
- High-volume workloads
- Low-latency responses
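For real-time applications, responses are usually consumed incrementally. `stream=True` is the standard OpenAI-compatible streaming flag; whether every model on this endpoint supports it is an assumption. A minimal sketch of assembling a streamed reply:

```python
# Joins the incremental deltas of an OpenAI-style chat-completions stream.
def collect_stream(chunks) -> str:
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:  # the final chunk's delta content may be None
            parts.append(delta)
    return "".join(parts)

# Usage (assumes `client` configured as in "Using Different Models";
# streaming support on this endpoint is an assumption):
# stream = client.chat.completions.create(
#     model="gemini-2.5-flash",
#     messages=[{"role": "user", "content": "Hello!"}],
#     stream=True,
# )
# print(collect_stream(stream))
```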
### GLM-4.5

**Model ID:** `glm-4.5`

Bilingual model optimized for Chinese and English.

**Best For:**
- Chinese language tasks
- Bilingual applications
- Cross-language translation
### GPT-5

**Model ID:** `gpt-5`

Latest OpenAI flagship model.

**Best For:**
- Complex reasoning
- Advanced code generation
- Research applications
### Claude Opus 4.1

**Model ID:** `custom-model-alpha`

Anthropic’s most capable model for complex tasks.

**Best For:**
- Long-form writing
- Advanced analysis
- Research and development
## Using Different Models

Simply change the `model` parameter in your request:

```python
import openai

client = openai.OpenAI(
    base_url="https://freeinference.org/v1",
    api_key="your-api-key-here",
)

# Use Llama 3.3
response = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)

# Switch to Gemini
response = client.chat.completions.create(
    model="gemini-2.5-flash",
    messages=[{"role": "user", "content": "Hello!"}],
)
```
## Model Selection Guide

**For general chat and instructions:**
- `llama-3.3-70b-instruct` - Best balance of quality and speed
- `llama-4-maverick` - High quality, complex tasks

**For fast inference:**
- `llama-4-scout` - Optimized for speed
- `gemini-2.5-flash` - High throughput, real-time use

**For Chinese language:**
- `glm-4.5` - Bilingual Chinese/English support

**For advanced reasoning:**
- `gpt-5` - Latest OpenAI capabilities
- `custom-model-alpha` (Claude Opus 4.1) - Complex analysis and long-form writing
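The guide above can be condensed into a small lookup helper. The category keys below are illustrative labels for the groupings on this page, not API concepts:

```python
# Maps use cases from the selection guide above to suggested model IDs.
# The category keys are illustrative labels, not part of the API.
MODEL_GUIDE = {
    "general": "llama-3.3-70b-instruct",
    "complex": "llama-4-maverick",
    "fast": "llama-4-scout",
    "realtime": "gemini-2.5-flash",
    "chinese": "glm-4.5",
    "reasoning": "gpt-5",
    "long_form": "custom-model-alpha",  # Claude Opus 4.1
}

def pick_model(use_case: str) -> str:
    """Return a suggested model ID, defaulting to the general-purpose model."""
    return MODEL_GUIDE.get(use_case, "llama-3.3-70b-instruct")
```

This keeps model choice in one place, so switching a workload to a different model is a one-line change.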