Two years ago, using an AI model meant calling OpenAI's API. Today, open-source models match or beat closed models on many tasks — and you can run them anywhere: your own servers, edge devices, or through inference providers at a fraction of the cost. The closed API monopoly is over.
| Model | Type | Parameters | Quality (MMLU) | Cost (1M tokens) | License |
|---|---|---|---|---|---|
| GPT-4o | Closed | Unknown | ~88% | $5 input / $15 output | Proprietary |
| Claude Sonnet | Closed | Unknown | ~87% | $3 input / $15 output | Proprietary |
| Gemini 2.0 Pro | Closed | Unknown | ~86% | $1.25 input / $5 output | Proprietary |
| Llama 3.3 70B | Open | 70B | ~86% | $0.20-0.80 (hosted) | Llama License |
| Qwen 2.5 72B | Open | 72B | ~85% | $0.20-0.60 (hosted) | Apache 2.0 |
| Mistral Large | Open-ish | Unknown | ~84% | $2 input / $6 output | Commercial |
| DeepSeek V3 | Open | 671B MoE | ~87% | $0.27 input / $1.10 output | MIT |
| Llama 3.1 405B | Open | 405B | ~88% | $1-3 (hosted) | Llama License |
Key insight: Open-source models have reached 95-100% of closed model quality on standard benchmarks. The gap that was massive in 2023 is nearly closed in 2026.
| Dimension | Open-Source Advantage |
|---|---|
| Cost | 5-20x cheaper than closed APIs at scale |
| Privacy | Data never leaves your infrastructure |
| Customization | Fine-tune for your domain |
| No vendor lock-in | Switch providers freely |
| Latency | Self-hosted = no network hop to API provider |
| Availability | No rate limits, no outages from provider |
| Compliance | Full control for regulated industries |
| Dimension | Closed API Advantage |
|---|---|
| Frontier intelligence | Best reasoning (o3, Claude Opus) still closed |
| Zero ops | No infrastructure to manage |
| Multimodal | Best vision + audio + video models |
| Safety | More extensive RLHF and safety testing |
| Features | Tool use, structured output, caching |
| Speed of innovation | New capabilities ship as API updates |
| Family | Creator | Key Models | Strength |
|---|---|---|---|
| Llama | Meta | Llama 3.3 70B, 3.1 405B | General-purpose, huge community |
| Qwen | Alibaba | Qwen 2.5 72B, QwQ-32B | Multilingual, strong reasoning |
| Mistral | Mistral AI | Mistral Large, Codestral | European, code-focused |
| DeepSeek | DeepSeek | DeepSeek V3, DeepSeek R1 | Cost-efficient, MoE architecture |
| Gemma | Google | Gemma 2 27B | Compact, efficient |
| Phi | Microsoft | Phi-4 | Small model, punches above weight |
| Command R | Cohere | Command R+ | RAG-optimized, enterprise |
| Provider | Models Available | Pricing Model | Best For |
|---|---|---|---|
| Together AI | 100+ open models | Per-token | Variety, competitive pricing |
| Groq | Llama, Mistral, Gemma | Per-token | Ultra-fast inference (LPU) |
| Fireworks AI | Major open models | Per-token | Production workloads |
| Replicate | Thousands of models | Per-second | Experimentation, diverse models |
| Anyscale | Major open models | Per-token | Enterprise, fine-tuning |
| AWS Bedrock | Llama, Mistral, Cohere | Per-token | AWS ecosystem |
| Google Vertex | Llama, Mistral, Gemma | Per-token | GCP ecosystem |
| Azure AI Studio | Llama, Mistral, Phi | Per-token | Azure ecosystem |
| Tool | What It Does | Best For |
|---|---|---|
| vLLM | High-throughput inference server | Production self-hosting |
| Ollama | Local model running | Development, testing |
| llama.cpp | CPU/GPU inference (C++) | Edge devices, laptops |
| TGI (HuggingFace) | Text generation server | HuggingFace ecosystem |
| SGLang | Fast inference runtime | Structured generation |
Most of these tools expose an OpenAI-compatible endpoint, so existing client code needs only a new `base_url`. For example, against a local vLLM server:

```python
from openai import OpenAI

# Point the standard OpenAI client at a local vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # self-hosted servers typically ignore the key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
)
```
Scenario: 10M API calls/month, averaging 1,000 tokens each (≈5B input + 5B output tokens).

| Option | Cost Breakdown | Monthly Total | Savings vs GPT-4o |
|---|---|---|---|
| OpenAI GPT-4o | 5B × $5/1M input + 5B × $15/1M output | ~$100,000 | baseline |
| Anthropic Claude Sonnet | 5B × $3/1M input + 5B × $15/1M output | ~$90,000 | 10% |
| A: Hosted inference (Together AI, Llama 3.3 70B) | 5B × $0.80/1M input + 5B × $0.80/1M output | ~$8,000 | 92% |
| B: Self-hosted (4× A100 80GB, Llama 3.3 70B) | GPU rental 4 × $2/hr = $5,760 + ~$500 infrastructure | ~$6,260 | 94% |
| C: Smaller model for simple tasks (Llama 3.1 8B, 1× A100) | ~$1,440 GPU rental + overhead | ~$1,500 | 98.5% |
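The scenario arithmetic above can be reproduced with a small cost model. This is a sketch: the function name is ours, and the 50/50 input/output token split is an assumption carried over from the scenario.

```python
def monthly_api_cost(calls: int, tokens_per_call: int,
                     input_price_per_m: float, output_price_per_m: float,
                     input_fraction: float = 0.5) -> float:
    """Estimated monthly spend, assuming a fixed input/output token split."""
    total_tokens = calls * tokens_per_call
    input_tokens = total_tokens * input_fraction
    output_tokens = total_tokens - input_tokens
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1e6

# GPT-4o at $5/$15 per 1M tokens
print(monthly_api_cost(10_000_000, 1000, 5.00, 15.00))  # 100000.0
# Llama 3.3 70B hosted at ~$0.80/1M each way
print(monthly_api_cost(10_000_000, 1000, 0.80, 0.80))   # ~8000
```

Plugging in any provider's published per-token prices makes the comparison concrete for your own traffic shape.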
| Scenario | Why Self-Hosting Costs More |
|---|---|
| Low volume (<100K calls/month) | Infrastructure minimum cost exceeds API cost |
| Spiky traffic | Need to provision for peak, pay for idle |
| Need multiple model sizes | Multiple deployments, more infrastructure |
| DevOps cost | Engineers maintaining infrastructure |
Rule of thumb: Below $2,000/month in API costs, use hosted APIs. Above $10,000/month, evaluate self-hosting.
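The rule of thumb above comes from a break-even calculation: at what call volume does a flat self-hosting bill undercut per-token API pricing? A minimal sketch (our own helper, again assuming a 50/50 input/output split):

```python
def breakeven_calls_per_month(selfhost_monthly: float, tokens_per_call: int,
                              api_input_price_per_m: float,
                              api_output_price_per_m: float) -> float:
    """Call volume at which flat self-hosting cost equals per-token API spend."""
    # Per-call API cost: half the tokens billed at input rate, half at output rate
    cost_per_call = (tokens_per_call / 2 / 1e6) * (
        api_input_price_per_m + api_output_price_per_m)
    return selfhost_monthly / cost_per_call

# Self-hosted Llama 3.3 70B (~$6,260/mo) vs GPT-4o at $5/$15 per 1M tokens
print(breakeven_calls_per_month(6260, 1000, 5.00, 15.00))  # 626000.0
```

Below roughly 626K calls/month of 1,000-token traffic, the API is cheaper in this scenario; above it, self-hosting wins (before engineering time is counted).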
Open-source forces closed providers to compete on price:
| Timeline | GPT-4 Class Pricing (1M input tokens) |
|---|---|
| March 2023 | $30 (GPT-4) |
| November 2023 | $10 (GPT-4 Turbo) |
| May 2024 | $5 (GPT-4o) |
| January 2025 | $1.25 (Gemini 2.0 Pro) |
| 2026 | Race to bottom continues |
A price drop of over 95% in under three years. Open-source models set the floor: closed APIs can't charge much more than the cost of running an equivalent open model.
Closed APIs differentiate through features open-source can't easily match:
| Feature | Closed API Advantage | Open-Source Gap |
|---|---|---|
| Tool calling | Polished, reliable | Improving but inconsistent |
| Structured output | Guaranteed JSON | Needs constrained decoding |
| Prompt caching | Built-in, automatic | Manual KV cache management |
| Batch API | 50% discount, async | DIY queuing |
| Content moderation | Built-in safety | Add separate moderation layer |
| Fine-tuning | Managed service | More control but more work |
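The structured-output gap, for instance, is often bridged with a validate-and-retry loop: scan the model's reply for a JSON object, parse it, and re-prompt on failure. Below is a minimal, hypothetical extraction helper; constrained decoding (as offered by vLLM and SGLang) is the more robust fix.

```python
import json

def extract_json(text: str):
    """Best-effort: find the first balanced {...} block in a model reply
    and parse it. Note: braces inside string values would confuse the
    depth counter; this is a sketch, not production code."""
    start = text.find("{")
    if start == -1:
        return None
    depth = 0
    for i, ch in enumerate(text[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                try:
                    return json.loads(text[start:i + 1])
                except json.JSONDecodeError:
                    return None
    return None

print(extract_json('Sure! Here you go: {"sentiment": "positive"}'))
```

If `extract_json` returns `None`, the caller re-prompts the model with the parse error appended, which converges quickly in practice for most instruction-tuned models.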
Most production systems use both:
```typescript
interface Task {
  requiresReasoning: boolean;
  requiresPrivacy: boolean;
  isSimple: boolean;
}

// Route each task to the cheapest model that meets its requirements
function selectModel(task: Task) {
  if (task.requiresReasoning) {
    return { provider: 'anthropic', model: 'claude-opus-4-20250514' }; // frontier closed model
  }
  if (task.requiresPrivacy) {
    return { provider: 'self-hosted', model: 'llama-3.3-70b' }; // data stays in-house
  }
  if (task.isSimple) {
    return { provider: 'groq', model: 'llama-3.1-8b' }; // small, fast, cheap
  }
  return { provider: 'together', model: 'llama-3.3-70b' }; // sensible default
}
```
| Question | If Yes → | If No → |
|---|---|---|
| Need absolute best quality? | Closed API (Claude, GPT-4o) | Open-source likely sufficient |
| Processing sensitive data? | Self-hosted open model | Either works |
| AI spend > $10K/month? | Evaluate open-source | Hosted APIs are fine |
| Need fine-tuning control? | Open-source | Closed API fine-tuning |
| Regulated industry? | Self-hosted for compliance | Either works |
| Latency critical? | Self-hosted or edge | Depends on region |
Run an open model locally with Ollama:

```bash
ollama run llama3.3
```
Or call a hosted open model over plain HTTP:

```bash
curl https://api.together.xyz/v1/chat/completions \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```
| Mistake | Impact | Fix |
|---|---|---|
| Using closed API for all tasks | 5-20x overspending | Route simple tasks to open models |
| Self-hosting without GPU expertise | Downtime, poor performance | Start with hosted inference, graduate to self-hosted |
| Ignoring total cost of self-hosting | Hidden ops cost | Factor in engineering time, not just GPU cost |
| Using largest model for everything | Wasted compute | Match model size to task complexity |
| Not benchmarking on YOUR data | Open model might be worse for your use case | Test on representative samples before switching |
| Ignoring licensing | Legal risk | Check license (Llama license ≠ Apache 2.0) |
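Acting on the "benchmark on YOUR data" advice can start very simply: collect outputs from your current model on representative samples, run the candidate open model on the same inputs, and score agreement. A minimal sketch (our own helper; real evaluations should use task-appropriate metrics, not just exact match):

```python
def exact_match_rate(references: list[str], candidates: list[str]) -> float:
    """Fraction of candidate outputs matching the reference,
    ignoring case and surrounding whitespace."""
    if len(references) != len(candidates):
        raise ValueError("need one candidate per reference")
    hits = sum(r.strip().lower() == c.strip().lower()
               for r, c in zip(references, candidates))
    return hits / len(references)

# e.g. labels from your current GPT-4o pipeline vs. a Llama 3.3 trial run
refs = ["positive", "negative", "neutral"]
cand = ["Positive", "negative", "positive"]
print(exact_match_rate(refs, cand))
```

A candidate that scores within a few points of your incumbent on a few hundred representative samples is usually worth a deeper evaluation; one that collapses on your data is ruled out before any migration work.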
Compare open-source and closed AI model APIs on APIScout — pricing, benchmarks, and feature comparisons across every provider.