# Available Models

HybridInference provides access to multiple state-of-the-art large language models.
## Model Overview

| Model ID | Name | Context Length | Pricing |
|---|---|---|---|
| `llama-3.3-70b-instruct` | Llama 3.3 70B Instruct | 131K tokens | Free |
| `llama-4-scout` | Llama 4 Scout | 128K tokens | Free |
| `llama-4-maverick` | Llama 4 Maverick | 128K tokens | Free |
| `gemini-2.5-flash` | Gemini 2.5 Flash | 1M tokens | Free |
| | Gemini 2.5 Flash Preview | 1M tokens | Free |
| `glm-4.5` | GLM-4.5 | 128K tokens | Free |
| `gpt-5` | GPT-5 | 128K tokens | Free |
| `custom-model-alpha` | Claude Opus 4.1 | 200K tokens | Free |
| | GPT-5 (Azure) | 400K tokens | Free |
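The IDs in this table can also be fetched at runtime, assuming the service exposes the standard OpenAI-compatible `GET /v1/models` route (an assumption; the base URL matches the example later on this page). A minimal sketch using only the standard library:

```python
import json
import urllib.request

# Assumption: the service exposes the OpenAI-compatible GET /v1/models route.
def build_models_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build the authenticated request for the model listing endpoint."""
    return urllib.request.Request(
        f"{base_url}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )

def list_model_ids(base_url: str, api_key: str) -> list[str]:
    """Fetch and return the available model IDs."""
    with urllib.request.urlopen(build_models_request(base_url, api_key)) as resp:
        return [m["id"] for m in json.load(resp)["data"]]

# Usage:
# print(list_model_ids("https://freeinference.org/v1", "your-api-key-here"))
```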
## Model Details
### Llama 3.3 70B Instruct

**Model ID:** `llama-3.3-70b-instruct`

High-performance open-source model optimized for instruction following.

**Key Features:**
- Context length: 131,072 tokens
- Max output: 8,192 tokens
- Function calling support
- Structured output (JSON mode)
- Quantization: bf16

**Best For:**
- General purpose chat
- Long-form content generation
- Code generation
- Instruction following
**Example:**

```python
# Assumes `client` is configured as shown under "Using Different Models" below.
response = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    max_tokens=2048,
)
```
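The feature list above mentions function calling; in the OpenAI-compatible API this is requested through the `tools` parameter. A minimal sketch of such a request payload — the `get_weather` tool is a made-up illustration, not part of the HybridInference API:

```python
# Sketch of a function-calling request payload for llama-3.3-70b-instruct.
# The `get_weather` tool is a hypothetical example, not an API built-in.
def build_tool_request(question: str) -> dict:
    return {
        "model": "llama-3.3-70b-instruct",
        "messages": [{"role": "user", "content": question}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
        "max_tokens": 1024,
    }

# Usage (assumes `client` configured as in "Using Different Models"):
# response = client.chat.completions.create(**build_tool_request("Weather in Oslo?"))
```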
### Llama 4 Scout

**Model ID:** `llama-4-scout`

Efficient MoE (Mixture of Experts) model for fast inference.

**Key Features:**
- Context length: 128,000 tokens
- Max output: 16,384 tokens
- Function calling support
- Structured output
- Quantization: fp8

**Best For:**
- Fast inference scenarios
- Cost-effective deployments
- Production workloads
### Llama 4 Maverick

**Model ID:** `llama-4-maverick`

Advanced MoE (Mixture of Experts) model for complex tasks.

**Key Features:**
- Context length: 128,000 tokens
- Max output: 16,384 tokens
- Function calling support
- Structured output
- Quantization: fp8

**Best For:**
- Complex reasoning tasks
- Long-form generation
- Production workloads with high quality requirements
### Gemini 2.5 Flash

**Model ID:** `gemini-2.5-flash`

Google’s fast and efficient model.

**Key Features:**
- Fast inference speed
- High throughput
- Large context window
- Production-ready

**Best For:**
- Real-time applications
- High-volume workloads
- Low-latency responses
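For real-time applications, responses are usually consumed incrementally. `stream=True` is the standard OpenAI-compatible streaming flag; whether every model on this endpoint supports it is an assumption. A minimal sketch of assembling a streamed reply:

```python
# Joins the incremental deltas of an OpenAI-style chat-completions stream.
def collect_stream(chunks) -> str:
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:  # the final chunk's delta content may be None
            parts.append(delta)
    return "".join(parts)

# Usage (assumes `client` configured as in "Using Different Models";
# streaming support on this endpoint is an assumption):
# stream = client.chat.completions.create(
#     model="gemini-2.5-flash",
#     messages=[{"role": "user", "content": "Hello!"}],
#     stream=True,
# )
# print(collect_stream(stream))
```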
### GLM-4.5

**Model ID:** `glm-4.5`

Bilingual model optimized for Chinese and English.

**Best For:**
- Chinese language tasks
- Bilingual applications
- Cross-language translation
### GPT-5

**Model ID:** `gpt-5`

Latest OpenAI flagship model.

**Best For:**
- Complex reasoning
- Advanced code generation
- Research applications
### Claude Opus 4.1

**Model ID:** `custom-model-alpha`

Anthropic’s most capable model for complex tasks.

**Best For:**
- Long-form writing
- Advanced analysis
- Research and development
## Using Different Models

Simply change the `model` parameter in your request:

```python
import openai

client = openai.OpenAI(
    base_url="https://freeinference.org/v1",
    api_key="your-api-key-here",
)

# Use Llama 3.3
response = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)

# Switch to Gemini
response = client.chat.completions.create(
    model="gemini-2.5-flash",
    messages=[{"role": "user", "content": "Hello!"}],
)
```
## Model Selection Guide

**For general chat and instructions:**
- `llama-3.3-70b-instruct` - Best balance of quality and speed
- `llama-4-maverick` - High quality, complex tasks

**For fast inference:**
- `llama-4-scout` - Optimized for speed
- `gemini-2.5-flash` - High throughput, real-time use

**For Chinese language:**
- `glm-4.5` - Bilingual Chinese/English support

**For advanced reasoning:**
- `gpt-5` - Latest OpenAI capabilities
- `custom-model-alpha` (Claude Opus 4.1) - Complex analysis and long-form writing
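The guide above can be condensed into a small lookup helper. The category keys below are illustrative labels for the groupings on this page, not API concepts:

```python
# Maps use cases from the selection guide above to suggested model IDs.
# The category keys are illustrative labels, not part of the API.
MODEL_GUIDE = {
    "general": "llama-3.3-70b-instruct",
    "complex": "llama-4-maverick",
    "fast": "llama-4-scout",
    "realtime": "gemini-2.5-flash",
    "chinese": "glm-4.5",
    "reasoning": "gpt-5",
    "long_form": "custom-model-alpha",  # Claude Opus 4.1
}

def pick_model(use_case: str) -> str:
    """Return a suggested model ID, defaulting to the general-purpose model."""
    return MODEL_GUIDE.get(use_case, "llama-3.3-70b-instruct")
```

This keeps model choice in one place, so switching a workload to a different model is a one-line change.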