
OpenRouter API: One Key for 500+ LLM Models

APIScout Team
Tags: openrouter · llm · api gateway · openai · anthropic · 2026

The Problem OpenRouter Solves

Building an AI application in 2026 means navigating a fragmented model landscape. OpenAI releases GPT-4.1. Anthropic ships Claude Opus 4.6. Google releases Gemini 3.1 Pro. Meta drops Llama 4. Each requires a separate API key, separate billing, separate integration code, and separate rate limit management.

OpenRouter solves this with a single API gateway: one key, one endpoint, one billing account — access to every major frontier model and hundreds of open-source alternatives.

TL;DR

OpenRouter is an OpenAI-compatible API router that sits in front of every major LLM provider. You call one endpoint (https://openrouter.ai/api/v1), pass any model ID, and OpenRouter handles routing, authentication, and billing. It's genuinely useful for production systems that need model fallbacks, cost routing, or access to models across multiple providers without managing multiple accounts.

Key Takeaways

  • Models: 500+ LLMs from OpenAI, Anthropic, Google, Meta, Mistral, Cohere, and dozens of open-source providers
  • API format: Fully OpenAI-compatible — just change base_url and api_key
  • Pricing: No per-token markup — you pay provider prices; 5.5% fee on credit purchases only
  • Free models: 29 :free models (rate-limited: 20 req/min, 50 req/day without credits)
  • Killer features: Model fallbacks, provider routing preferences, cost-based auto-routing
  • No lock-in: Your code works identically against provider APIs; switching is a one-line change
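The "one-line change" in the last bullet is literal. A minimal sketch of the switch (key values are placeholders; only the base URL, key, and model-ID prefix differ between the two targets):

```python
# Toggling between OpenRouter and a direct provider is a config change,
# not a code change. (Keys below are placeholders.)
USE_OPENROUTER = True  # flip to call OpenAI directly instead

config = {
    "base_url": "https://openrouter.ai/api/v1" if USE_OPENROUTER else "https://api.openai.com/v1",
    "api_key": "sk-or-v1-..." if USE_OPENROUTER else "sk-...",
}
# OpenRouter model IDs carry a provider prefix; direct APIs do not.
model = "openai/gpt-4.1" if USE_OPENROUTER else "gpt-4.1"
```

Everything downstream of the client constructor stays identical.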

How OpenRouter Works

When you make a request to OpenRouter:

  1. You specify a model ID (e.g., anthropic/claude-opus-4-6)
  2. OpenRouter authenticates your request using your single API key
  3. It routes to the appropriate provider (Anthropic, in this case), using your preferred provider settings
  4. The response is returned in OpenAI chat completions format — regardless of which provider served it
  5. OpenRouter bills your account; you manage one credit balance
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-v1-your-openrouter-key",
)

response = client.chat.completions.create(
    model="anthropic/claude-opus-4-6",
    messages=[{"role": "user", "content": "Explain transformer attention in one paragraph."}],
    extra_headers={
        "HTTP-Referer": "https://yourapp.com",  # Optional — attributes traffic to your app
        "X-Title": "Your App Name",             # Shown in the OpenRouter dashboard and rankings
    }
)

print(response.choices[0].message.content)

No Anthropic SDK, no Anthropic API key, no Anthropic billing account. Just OpenRouter.

Supported Models

OpenRouter supports 500+ models across providers. Highlights:

OpenAI

  • openai/gpt-4.1 — Latest flagship
  • openai/gpt-4o — Multimodal
  • openai/gpt-4o-mini — Budget option
  • openai/o3 — Reasoning model
  • openai/o4-mini — Budget reasoning

Anthropic

  • anthropic/claude-opus-4-6 — Most capable Claude
  • anthropic/claude-sonnet-4-6 — Balanced
  • anthropic/claude-haiku-4-5 — Fast and cheap

Google

  • google/gemini-2.5-pro — 1M context, strong reasoning
  • google/gemini-3-flash-preview — Newest Flash
  • google/gemini-2.5-flash — Speed/cost balance

Meta / Open Source

  • meta-llama/llama-4-scout — Latest Llama 4
  • meta-llama/llama-4-maverick — Llama 4 large variant
  • meta-llama/llama-3.3-70b-instruct — Proven workhorse

Mistral

  • mistralai/mistral-large-2 — Strong multilingual
  • mistralai/codestral — Code specialist
  • mistralai/mistral-small — Cheapest Mistral

Free Models

Many open-source models are available at $0/token (with rate limits):

  • meta-llama/llama-4-scout:free
  • mistralai/mistral-7b-instruct:free
  • google/gemma-3-12b-it:free

Free models share a rate-limited pool — good for development and low-volume use, not production.
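Because the free pool is shared, 429s are routine; the sensible pattern is backoff-and-retry. A sketch with the retry logic isolated — `call_model` stands in for `client.chat.completions.create`, and `RuntimeError` stands in for the SDK's `RateLimitError`:

```python
import time

def with_backoff(call_model, prompt: str, retries: int = 3):
    """Retry a :free-model call with exponential backoff on rate limits."""
    for attempt in range(retries):
        try:
            return call_model("meta-llama/llama-4-scout:free", prompt)
        except RuntimeError:  # stand-in for openai.RateLimitError
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # free pool: 20 req/min — wait and retry

# Demo with a stub that rate-limits once, then succeeds:
calls = {"n": 0}
def stub(model, prompt):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("429")
    return f"[{model}] ok"

print(with_backoff(stub, "hi"))  # succeeds on the second attempt
```

In production you would fall back to a paid model instead of raising after the last retry.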

Pricing

OpenRouter's pricing model is often misunderstood: there is no per-token markup. You pay the same per-token rate as calling the provider directly.

The only fees:

  • Credit purchase fee: 5.5% (minimum $0.80) when buying credits upfront
  • BYOK (Bring Your Own Key): First 1M requests/month free; 5% fee on provider cost beyond that
  • Free models (:free suffix): 29 models with no token cost; rate-limited to 20 req/min and 50 req/day (1,000/day after purchasing $10+ in credits)

For most workloads, the 5.5% credit purchase fee is the only overhead — you amortize it across your entire credit balance. If you're spending $1,000 on API credits, you pay ~$55 to OpenRouter. If you use BYOK with your own API keys, you pay OpenRouter nothing for the first 1M requests.

# Check model pricing programmatically
import httpx

response = httpx.get("https://openrouter.ai/api/v1/models")
models = response.json()["data"]

for model in models[:5]:
    pricing = model.get("pricing", {})
    print(f"{model['id']}: ${pricing.get('prompt', 'N/A')}/token input")
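To make the arithmetic concrete, here is how per-token prices turn into a request cost. The rates below are placeholders, not live OpenRouter prices — the `/models` response above carries the real per-token figures (as strings):

```python
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  prompt_price: float, completion_price: float) -> float:
    """Dollar cost of one request at the given per-token prices."""
    return prompt_tokens * prompt_price + completion_tokens * completion_price

# e.g. a model priced at $3/M input tokens and $15/M output tokens:
cost = estimate_cost(1_200, 400, 3 / 1_000_000, 15 / 1_000_000)
print(f"${cost:.4f}")  # → $0.0096
```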

Model Fallbacks

The most powerful OpenRouter feature for production: automatic failover when a provider has an outage or rate limits you.

response = client.chat.completions.create(
    model="openai/gpt-4.1",
    messages=[{"role": "user", "content": "Hello"}],
    extra_body={
        "route": "fallback",
        "models": [
            "openai/gpt-4.1",
            "anthropic/claude-sonnet-4-6",  # Fallback 1
            "google/gemini-2.5-pro",         # Fallback 2
        ],
    }
)

If GPT-4.1 is rate-limited or down, OpenRouter automatically retries with Claude Sonnet 4.6, then Gemini 2.5 Pro. Your application never sees the error.

This is genuinely difficult to build yourself — you'd need error handling, retries, and provider health checking. OpenRouter gives it to you in one parameter.
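For contrast, here is roughly the DIY version — and even this sketch omits provider health checks, per-provider auth, and response-format normalization, all of which OpenRouter handles for you. `call` is a stand-in for a real provider SDK call:

```python
def complete_with_fallback(call, models: list[str], prompt: str) -> str:
    """Try each model in order until one succeeds."""
    errors = []
    for model in models:
        try:
            return call(model, prompt)
        except Exception as e:  # rate limit, outage, timeout...
            errors.append((model, e))
    raise RuntimeError(f"all models failed: {errors}")

# Demo: the primary is "down", so the fallback answers.
def flaky(model, prompt):
    if model == "openai/gpt-4.1":
        raise TimeoutError("provider down")
    return f"{model}: response"

print(complete_with_fallback(flaky, ["openai/gpt-4.1", "anthropic/claude-sonnet-4-6"], "hi"))
```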

Provider Routing Preferences

When a model is available from multiple providers (e.g., Llama 4 runs on Groq, Together AI, Fireworks, and others), you can control which provider OpenRouter uses:

response = client.chat.completions.create(
    model="meta-llama/llama-4-maverick",
    messages=[{"role": "user", "content": "Hello"}],
    extra_body={
        "provider": {
            "order": ["Groq", "Together", "Fireworks"],  # Prefer in this order
            "allow_fallbacks": True,
        }
    }
)

Or use model routing variants — shorthand suffixes that let OpenRouter pick the best provider for your goal:

# :nitro — routes to fastest available provider for this model
response = client.chat.completions.create(
    model="meta-llama/llama-4-scout:nitro",  # Maximum speed
    messages=[...]
)

# :floor — routes to cheapest available provider for this model
response = client.chat.completions.create(
    model="meta-llama/llama-4-scout:floor",  # Minimum cost
    messages=[...]
)

# :free — free tier (rate-limited, no cost)
response = client.chat.completions.create(
    model="meta-llama/llama-4-scout:free",   # Free with rate limits
    messages=[...]
)


This lets you optimize for cost or speed across the same model without changing your code.
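If you'd rather express the goal than a fixed order, OpenRouter's provider routing also documents a sort preference — effectively the programmatic form of :floor and :nitro. The field names below are from memory of the routing docs; verify them against openrouter.ai before relying on them:

```python
# Payload-only sketch: pass this as extra_body on a chat.completions call.
extra_body = {
    "provider": {
        "sort": "price",        # or "throughput" — analogous to :floor / :nitro
        "allow_fallbacks": True,
    }
}
```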

Streaming

OpenRouter supports streaming with the same interface as the OpenAI SDK:

stream = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Write a poem about APIs."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Streaming works across all major models. OpenRouter normalizes the stream format so you get consistent delta.content chunks regardless of whether the underlying model is GPT, Claude, or Gemini.
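The accumulation pattern for the loop above, shown against stubbed chunks so the logic is visible offline — concatenate `delta.content` pieces and skip the `None` deltas that open and close a stream:

```python
from types import SimpleNamespace

def collect(stream) -> str:
    """Join the content deltas of a chat-completions stream into one string."""
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta
        if delta.content:  # role-only and terminal chunks carry content=None
            parts.append(delta.content)
    return "".join(parts)

# Fake chunks mirroring the SDK's chunk.choices[0].delta.content shape:
def fake_chunk(text):
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])

print(collect([fake_chunk(None), fake_chunk("Hello, "), fake_chunk("world."), fake_chunk(None)]))
# → Hello, world.
```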

Context and System Prompts

Context and system prompts work exactly as they do with the OpenAI SDK:

response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-6",
    messages=[
        {
            "role": "system",
            "content": "You are a senior backend engineer. Be concise and precise.",
        },
        {
            "role": "user",
            "content": "What's the difference between optimistic and pessimistic locking?",
        },
    ],
    temperature=0.2,
    max_tokens=512,
)

Multi-turn conversations work identically — append messages to the list and send the full history each time. OpenRouter passes the conversation to the upstream provider as-is.
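The bookkeeping for a multi-turn loop is just list management. A sketch with `ask` standing in for the client call:

```python
def ask(messages):  # stand-in for client.chat.completions.create(...)
    return f"(answer to: {messages[-1]['content']})"

history = [{"role": "system", "content": "Be concise."}]

for question in ["What is optimistic locking?", "And pessimistic?"]:
    history.append({"role": "user", "content": question})
    answer = ask(history)  # full history is resent every turn
    history.append({"role": "assistant", "content": answer})

print(len(history))  # system + 2 × (user + assistant) = 5 messages
```

Remember that you pay for the whole resent history each turn, so long conversations get progressively more expensive.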

Model Benchmarking Workflow

One of the best uses of OpenRouter: rapidly benchmarking which model performs best for your specific task.

MODELS_TO_TEST = [
    "openai/gpt-4.1",
    "anthropic/claude-opus-4-6",
    "google/gemini-2.5-pro",
    "meta-llama/llama-4-maverick",
    "mistralai/mistral-large-2",
]

TEST_PROMPTS = [
    "Extract the company name, date, and total amount from this invoice: [invoice text]",
    "Classify this support ticket as bug/feature/question: [ticket text]",
    "Summarize this 5-page contract in 3 bullet points: [contract text]",
]

def benchmark_models():
    results = {}
    for model in MODELS_TO_TEST:
        model_results = []
        for prompt in TEST_PROMPTS:
            try:
                response = client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                    max_tokens=256,
                )
                model_results.append({
                    "output": response.choices[0].message.content,
                    "tokens": response.usage.total_tokens,
                    "cost_estimate": calculate_cost(model, response.usage),  # your pricing helper
                })
            except Exception as e:
                model_results.append({"error": str(e)})
        results[model] = model_results
    return results

With one API key, you get cost-normalized comparisons across every major frontier model. Migrating from one model to another becomes a one-line change.
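The `calculate_cost` helper in the benchmark is yours to supply. One possible shape, fed by a pricing table (the rates below are illustrative — in practice you would cache the live /models pricing response):

```python
PRICING = {  # $/token — placeholder values, not live rates
    "openai/gpt-4.1": {"prompt": 2e-06, "completion": 8e-06},
}

def calculate_cost(model: str, usage) -> float:
    """Dollar cost for one response, given SDK-style usage counts."""
    rates = PRICING.get(model)
    if rates is None:
        return 0.0  # unknown model — treat as unpriced
    return (usage.prompt_tokens * rates["prompt"]
            + usage.completion_tokens * rates["completion"])

class Usage:  # mirrors the fields on the SDK's usage object
    prompt_tokens, completion_tokens, total_tokens = 500, 100, 600

print(f"${calculate_cost('openai/gpt-4.1', Usage()):.6f}")  # → $0.001800
```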

Rate Limits and Production Considerations

OpenRouter's rate limits work in layers:

  1. Your OpenRouter account limits — based on your plan and credit balance
  2. Provider-specific limits — OpenRouter can hit provider rate limits independently
  3. Per-model limits — some models have tighter limits than others

For production, set up error handling that distinguishes between OpenRouter errors (your account rate-limited) and provider errors (upstream capacity):

import time
from openai import RateLimitError, APIError

def chat_with_retry(model: str, messages: list, max_retries: int = 3) -> str:
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
            )
            return response.choices[0].message.content

        except RateLimitError:
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt  # Exponential backoff
                time.sleep(wait_time)
            else:
                raise

        except APIError as e:
            if "provider" in str(e).lower():
                # Provider-side error — try fallback model
                fallback_models = get_fallback_models(model)  # your own model → fallback-list mapping
                if fallback_models:
                    return chat_with_retry(fallback_models[0], messages, max_retries)
            raise

    raise Exception("Max retries exceeded")

Credits and Billing

OpenRouter uses a credit system:

  • Buy credits upfront or set up auto-reload
  • Each API call deducts from your balance based on token count × model price
  • The OpenRouter dashboard shows real-time spend per model
  • Per-model cost breakdown — useful for understanding which models drive your bill

For teams: OpenRouter supports multiple API keys under one account, letting you track spend by project or team.
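If you want finer attribution than per-key dashboards, per-project spend tracking is plain bookkeeping on your side. A minimal sketch (project names and costs are illustrative):

```python
from collections import defaultdict

spend = defaultdict(float)  # project name → dollars spent

def record(project: str, cost: float) -> None:
    spend[project] += cost

record("search-bot", 0.012)
record("summarizer", 0.034)
record("search-bot", 0.008)

for project, total in sorted(spend.items()):
    print(f"{project}: ${total:.3f}")
```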

TypeScript Usage

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://openrouter.ai/api/v1",
  apiKey: process.env.OPENROUTER_API_KEY,
  defaultHeaders: {
    "HTTP-Referer": "https://yourapp.com",
    "X-Title": "Your App",
  },
});

async function chat(message: string, model: string = "openai/gpt-4.1") {
  const response = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: message }],
  });
  return response.choices[0].message.content;
}

// Switch models with a single variable change
const cheapResponse = await chat("Hello", "meta-llama/llama-4-scout:free");
const premiumResponse = await chat("Hello", "anthropic/claude-opus-4-6");

OpenRouter vs. Direct Provider Access

| Factor | OpenRouter | Direct Provider |
| --- | --- | --- |
| Setup | One key, one account | Separate key + billing per provider |
| Cost | ~5.5% credit-purchase fee, no token markup | Direct pricing |
| Model access | 500+ models instantly | Only that provider's models |
| Fallbacks | Built-in | DIY (complex) |
| Observability | Unified dashboard | Separate per provider |
| Rate limits | Pooled across providers | Per-provider |
| Data privacy | Traffic through OpenRouter | Direct to provider |
| Fine-tuned models | Your OpenAI fine-tunes only | Full fine-tune access |

Choose OpenRouter when:

  • You want to experiment with many models quickly
  • You need model fallbacks in production
  • You're building a multi-model product (let users choose their AI)
  • You don't want to manage multiple API accounts

Go direct when:

  • You're at high volume (optimize out the margin)
  • You need fine-tuned models
  • Data privacy requires direct-to-provider routing
  • You're 100% committed to one provider

Building a Model-Agnostic Abstraction Layer

The pattern most OpenRouter users end up with in production: a thin wrapper that lets you swap models with config changes, not code changes.

import os
from openai import OpenAI
from typing import Optional

class LLMClient:
    """OpenRouter-backed LLM client with model flexibility."""

    def __init__(self):
        self.client = OpenAI(
            base_url="https://openrouter.ai/api/v1",
            api_key=os.environ["OPENROUTER_API_KEY"],
        )
        # Models configured via env — change without code deploys
        self.default_model = os.getenv("LLM_DEFAULT_MODEL", "openai/gpt-4.1")
        self.fast_model = os.getenv("LLM_FAST_MODEL", "meta-llama/llama-4-scout")
        self.cheap_model = os.getenv("LLM_CHEAP_MODEL", "openai/gpt-4o-mini")

    def complete(
        self,
        prompt: str,
        mode: str = "default",
        system: Optional[str] = None,
    ) -> str:
        model = {
            "default": self.default_model,
            "fast": self.fast_model,
            "cheap": self.cheap_model,
        }.get(mode, self.default_model)

        messages = []
        if system:
            messages.append({"role": "system", "content": system})
        messages.append({"role": "user", "content": prompt})

        response = self.client.chat.completions.create(
            model=model,
            messages=messages,
        )
        return response.choices[0].message.content

# Usage — model selection via config, not code
llm = LLMClient()
result = llm.complete("Summarize this doc", mode="cheap")  # Uses configured cheap model

When a new model launches that's 30% cheaper for your use case, you update an environment variable and redeploy — no code changes.

Limits and Gotchas

Context window limits: OpenRouter enforces the upstream model's context limits — no magic here.

Model-specific features: Extended thinking (Claude), reasoning tokens (o3), etc. — some are passed through, others aren't. Check the model's page on openrouter.ai.

Latency overhead: OpenRouter adds ~50–150ms of routing latency. For real-time voice applications, go direct.

Free tier throttling: Free models can queue during peak hours. Not suitable for production SLAs.

HTTP-Referer and X-Title headers: optional, but set them anyway — OpenRouter uses them to attribute traffic to your app in its dashboard and public model rankings.

Bottom Line

OpenRouter is a genuine time saver for teams that need multi-model flexibility. The overhead is minimal at typical API spend levels, and features like fallbacks and provider routing are hard to replicate yourself. For a startup exploring which model works best for their use case — or building a product that supports multiple AI backends — OpenRouter is the fastest path to production.

For committed, high-volume production workloads on a single provider, go direct and skip the margin.

The typical lifecycle: start on OpenRouter, benchmark models, find your winner, then evaluate whether direct access saves enough to justify the migration at your volume. For most teams at $1K–$10K/month in API spend, OpenRouter's overhead is noise. At $100K+/month, it's worth the switch to direct.

Either way, designing your codebase around a simple abstraction layer (swap base_url and api_key) means the migration is always a one-hour task, not a refactor.


Browse all supported models and live pricing at APIScout.

Related: How to Choose an LLM API in 2026 · Groq API: Fastest LLM Inference 2026
