Available Models

HybridInference provides access to multiple state-of-the-art large language models.

Model Overview

Model ID                            Name                        Context Length   Pricing
llama-3.3-70b-instruct              Llama 3.3 70B Instruct      131K tokens      Free
llama-4-scout                       Llama 4 Scout               128K tokens      Free
llama-4-maverick                    Llama 4 Maverick            128K tokens      Free
gemini-2.5-flash                    Gemini 2.5 Flash            1M tokens        Free
gemini-2.5-flash-preview-09-2025    Gemini 2.5 Flash Preview    1M tokens        Free
glm-4.5                             GLM-4.5                     128K tokens      Free
gpt-5                               GPT-5                       128K tokens      Free
custom-model-alpha                  Claude Opus 4.1             200K tokens      Free
custom-model-beta                   GPT-5 (Azure)               400K tokens      Free

Model Details

Llama 3.3 70B Instruct

Model ID: llama-3.3-70b-instruct

High-performance open-source model optimized for instruction following.

Key Features:

  • Context length: 131,072 tokens

  • Max output: 8,192 tokens

  • Function calling support

  • Structured output (JSON mode; see the second example below)

  • Quantization: bf16

Best For:

  • General purpose chat

  • Long-form content generation

  • Code generation

  • Instruction following

Example:

# `client` is an openai.OpenAI instance pointed at the gateway
# (see "Using Different Models" below for the setup).
response = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    max_tokens=2048
)

Llama 4 Scout

Model ID: llama-4-scout

Efficient MoE (Mixture of Experts) model for fast inference.

Key Features:

  • Context length: 128,000 tokens

  • Max output: 16,384 tokens

  • Function calling support (see the sketch at the end of this section)

  • Structured output

  • Quantization: fp8

Best For:

  • Fast inference scenarios

  • Cost-effective deployments

  • Production workloads
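
Since the feature list above includes function calling, a tool-call request can look like the sketch below. The get_weather tool is hypothetical, and the example assumes the gateway accepts the OpenAI-style tools parameter:

# Hypothetical tool definition, for illustration only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="llama-4-scout",
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,
)

# If the model chose to call the tool, the call and its
# arguments appear in tool_calls; otherwise this is None.
print(response.choices[0].message.tool_calls)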


Llama 4 Maverick

Model ID: llama-4-maverick

Advanced MoE (Mixture of Experts) model for complex tasks.

Key Features:

  • Context length: 128,000 tokens

  • Max output: 16,384 tokens

  • Function calling support

  • Structured output

  • Quantization: fp8

Best For:

  • Complex reasoning tasks

  • Long-form generation

  • Production workloads with high quality requirements
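
Given the 16,384-token max output listed above, long-form requests can raise max_tokens accordingly. A minimal sketch, using the same OpenAI-compatible client as the other examples on this page:

response = client.chat.completions.create(
    model="llama-4-maverick",
    messages=[{"role": "user", "content": "Write a detailed design document for a distributed rate limiter."}],
    max_tokens=16384,  # the documented max output for this model
)
print(response.choices[0].message.content)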


Gemini 2.5 Flash

Model ID: gemini-2.5-flash

Google’s fast and efficient model.

Key Features:

  • Fast inference speed

  • High throughput

  • Large context window

  • Production-ready

Best For:

  • Real-time applications

  • High-volume workloads

  • Applications that need quick responses
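
For the real-time use cases above, streaming tokens as they are generated keeps perceived latency low. This sketch assumes the gateway supports OpenAI-style streaming (stream=True), which is not confirmed above:

stream = client.chat.completions.create(
    model="gemini-2.5-flash",
    messages=[{"role": "user", "content": "Summarize the CAP theorem."}],
    stream=True,  # assumes the gateway implements streaming
)
for chunk in stream:
    # Each chunk carries an incremental delta; content can be None.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)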


GLM-4.5

Model ID: glm-4.5

Bilingual model optimized for Chinese and English.

Best For:

  • Chinese language tasks

  • Bilingual applications

  • Cross-language translation
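
Prompts can be sent in either language. A minimal example with a Chinese prompt (it asks the model to translate "machine learning is changing the world" into English):

response = client.chat.completions.create(
    model="glm-4.5",
    messages=[{"role": "user", "content": "请把下面这句话翻译成英文：机器学习正在改变世界。"}],
)
print(response.choices[0].message.content)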


GPT-5

Model ID: gpt-5

OpenAI’s latest flagship model.

Best For:

  • Complex reasoning

  • Advanced code generation

  • Research applications


Claude Opus 4.1

Model ID: custom-model-alpha

Anthropic’s most capable model for complex tasks.

Best For:

  • Long-form writing

  • Advanced analysis

  • Research and development
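
Note that requests must use the custom-model-alpha ID from the table above, not a Claude-style model name:

response = client.chat.completions.create(
    model="custom-model-alpha",  # routes to Claude Opus 4.1
    messages=[{"role": "user", "content": "Outline a whitepaper on zero-trust networking."}],
)
print(response.choices[0].message.content)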


Using Different Models

Simply change the model parameter in your request:

import openai

client = openai.OpenAI(
    base_url="https://freeinference.org/v1",
    api_key="your-api-key-here"
)

# Use Llama 3.3
response = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Switch to Gemini
response = client.chat.completions.create(
    model="gemini-2.5-flash",
    messages=[{"role": "user", "content": "Hello!"}]
)
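
If the gateway also implements the OpenAI-compatible /v1/models endpoint (an assumption; it is not documented above), you can enumerate the available model IDs at runtime instead of hard-coding them:

# Assumes the gateway exposes GET /v1/models.
for model in client.models.list():
    print(model.id)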

Model Selection Guide

For general chat and instructions:

  • llama-3.3-70b-instruct - Best balance of quality and speed

  • llama-4-maverick - High quality, complex tasks

For fast inference:

  • llama-4-scout - Optimized for speed

  • gemini-2.5-flash - High throughput, real-time use

For Chinese language:

  • glm-4.5 - Bilingual Chinese/English support

For advanced reasoning:

  • gpt-5 - Latest OpenAI capabilities

  • custom-model-alpha (Claude Opus 4.1) - Complex analysis and long-form writing
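
When a preferred model is unavailable, falling back through an ordered list keeps requests flowing. This is a sketch, assuming gateway errors surface as the SDK's standard openai.APIError; the preference order below is illustrative only:

import openai

client = openai.OpenAI(
    base_url="https://freeinference.org/v1",
    api_key="your-api-key-here"
)

# Illustrative preference order; adjust for your workload.
PREFERRED_MODELS = ["llama-4-maverick", "llama-3.3-70b-instruct", "gemini-2.5-flash"]

def chat_with_fallback(messages):
    last_error = None
    for model_id in PREFERRED_MODELS:
        try:
            return client.chat.completions.create(model=model_id, messages=messages)
        except openai.APIError as exc:  # e.g. model unavailable or overloaded
            last_error = exc
    raise last_error

response = chat_with_fallback([{"role": "user", "content": "Hello!"}])
print(response.choices[0].message.content)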