The Rise of AI Gateway APIs: LiteLLM, Portkey, and Beyond
APIScout Team
ai gateway · litellm · portkey · llm · ai infrastructure
Managing multiple AI providers is a mess. Different SDKs, different response formats, different error codes, different rate limits. AI gateways solve this with a unified API layer — one interface to call any model from any provider, with built-in fallbacks, caching, and cost tracking.
Why AI Gateways Exist
The Multi-Model Problem
Most production apps use multiple AI models:
Simple queries → Gemini Flash ($0.075/1M tokens) — cheap, fast
Complex reasoning → Claude Opus ($15/1M input) — highest quality
Code generation → Claude Sonnet ($3/1M input) — good balance
Embeddings → Cohere Embed ($0.10/1M tokens) — specialized
Image analysis → GPT-4o ($5/1M input) — best multimodal
Without a gateway, you need 5 different SDKs, 5 different auth mechanisms, 5 different error handling patterns, and manual routing logic.
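For concreteness, here is what just the routing piece of that hand-rolled plumbing looks like — a sketch only, with illustrative task names and model IDs (the auth, retry, and error-mapping code would sit alongside it):

```python
# Hand-rolled routing table: one of several pieces (alongside auth,
# retries, and error mapping) that a gateway replaces.
ROUTES = {
    "simple_query": ("google", "gemini-2.0-flash"),
    "reasoning": ("anthropic", "claude-opus-4"),
    "codegen": ("anthropic", "claude-sonnet-4"),
    "embedding": ("cohere", "embed-v4"),
    "vision": ("openai", "gpt-4o"),
}

def pick_model(task: str) -> tuple[str, str]:
    """Map a task type to (provider, model), defaulting to the cheap model."""
    return ROUTES.get(task, ROUTES["simple_query"])
```

Every new provider or task type means another branch here — and matching changes in five SDK integrations.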
What Gateways Provide
| Feature | Without Gateway | With Gateway |
|---|---|---|
| API interface | 5 different SDKs | 1 unified API |
| Fallback | Manual try/catch chains | Automatic failover |
| Cost tracking | Parse 5 different billing pages | Single dashboard |
| Caching | Build your own | Built-in semantic cache |
| Rate limiting | Handle per-provider | Unified rate management |
| Observability | 5 logging integrations | Single observability layer |
The Gateway Landscape
Open Source
| Gateway | Type | Key Feature | Stars |
|---|---|---|---|
| LiteLLM | Python proxy | 100+ model support, OpenAI-compatible | 15K+ |
| Portkey Gateway | Node.js proxy | Reliability, guardrails | 5K+ |
| Jan | Desktop app | Local + cloud models | 20K+ |
| AI Gateway (CF) | Edge proxy | Cloudflare-integrated | N/A |
Managed Platforms
| Platform | Focus | Pricing |
|---|---|---|
| Portkey | Reliability + observability | Free tier, then usage-based |
| Helicone | Observability + analytics | Free tier, then $50+/month |
| Braintrust | Evaluation + gateway | Free tier, then usage-based |
| Martian | Smart model routing | Usage-based |
| Not Diamond | Intelligent model selection | Per-request |
Cloud Provider Gateways
| Provider | Product | Models Available |
|---|---|---|
| AWS | Bedrock | Claude, Llama, Cohere, Mistral |
| Azure | AI Studio | GPT-4o, o3, Llama, Mistral |
| GCP | Vertex AI | Gemini, Claude, Llama |
How AI Gateways Work
LiteLLM Example
```python
from litellm import completion

# Same interface for any provider
response = completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Hello"}],
)

# Switch provider — same code
response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)

# Or Gemini
response = completion(
    model="gemini/gemini-2.0-flash",
    messages=[{"role": "user", "content": "Hello"}],
)
```
LiteLLM Proxy (OpenAI-Compatible Server)
```shell
litellm --model anthropic/claude-sonnet-4-20250514

# Now any OpenAI-compatible client can connect
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic/claude-sonnet-4-20250514",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```
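Because the proxy speaks the OpenAI wire format, any HTTP client works against it. A stdlib-only sketch of the same request the curl command makes — the port and model assume the proxy command above, and the response parsing assumes an OpenAI-shaped reply:

```python
import json
import urllib.request

# Assumes the LiteLLM proxy from the command above is running locally.
PROXY_URL = "http://localhost:4000/v1/chat/completions"

def build_payload(prompt: str,
                  model: str = "anthropic/claude-sonnet-4-20250514") -> dict:
    """OpenAI-format chat body, as the proxy expects."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(prompt: str) -> str:
    """POST to the proxy and pull the assistant text from the reply."""
    req = urllib.request.Request(
        PROXY_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

The same pattern means the official OpenAI SDKs also work unchanged — just point their base URL at the proxy.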
Portkey Example
```javascript
import Portkey from 'portkey-ai';

const portkey = new Portkey({
  apiKey: process.env.PORTKEY_API_KEY,
});

// Unified call with automatic retry + fallback
const response = await portkey.chat.completions.create({
  model: 'claude-sonnet-4-20250514',
  messages: [{ role: 'user', content: 'Hello' }],
  // Portkey-specific config
  config: {
    retry: { attempts: 3, on_status_codes: [429, 500] },
    cache: { mode: 'semantic', max_age: 3600 },
  },
});
```
Key Gateway Features
1. Automatic Fallback
```javascript
// If the primary model fails, try fallbacks automatically
const config = {
  strategy: {
    mode: 'fallback',
    on_status_codes: [429, 500, 503],
  },
  targets: [
    { provider: 'anthropic', model: 'claude-sonnet-4-20250514' },
    { provider: 'openai', model: 'gpt-4o' },
    { provider: 'google', model: 'gemini-2.0-flash' },
  ],
};
```
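What that config automates is essentially this loop — a minimal sketch (the `ApiError` class and `send` callback are stand-ins for a real provider client):

```python
class ApiError(Exception):
    """Stand-in for a provider HTTP error with a status code."""
    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status

TARGETS = [
    ("anthropic", "claude-sonnet-4-20250514"),
    ("openai", "gpt-4o"),
    ("google", "gemini-2.0-flash"),
]

def call_with_fallback(send, targets=TARGETS, retryable=(429, 500, 503)):
    """Try each target in order; fall through only on retryable statuses."""
    last = None
    for provider, model in targets:
        try:
            return send(provider, model)
        except ApiError as err:
            if err.status not in retryable:
                raise  # e.g. 401 — failing over to another provider won't help
            last = err
    raise last  # every target failed with a retryable error
```

Note the non-retryable branch: a bad API key should surface immediately rather than burn through the fallback chain.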
2. Load Balancing
```javascript
// Distribute requests across providers
const config = {
  strategy: {
    mode: 'loadbalance',
  },
  targets: [
    { provider: 'anthropic', weight: 60 },
    { provider: 'openai', weight: 30 },
    { provider: 'google', weight: 10 },
  ],
};
```
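Under the hood this is weighted random selection. A sketch of the idea, mirroring the 60/30/10 weights above (the `rng` parameter is only there to make the split testable):

```python
import random

# Weights mirror the loadbalance config above.
WEIGHTED_TARGETS = [("anthropic", 60), ("openai", 30), ("google", 10)]

def pick_provider(targets=WEIGHTED_TARGETS, rng=random):
    """Weighted random pick: roughly a 60/30/10 split over many requests."""
    providers = [name for name, _ in targets]
    weights = [weight for _, weight in targets]
    return rng.choices(providers, weights=weights, k=1)[0]
```

Production gateways typically add health checks on top, so a provider that starts erroring is temporarily weighted down or skipped.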
3. Semantic Caching
```javascript
// Cache similar queries to save money:
// "What's the capital of France?" and "capital of France?"
// → same cached response
const config = {
  cache: {
    mode: 'semantic',
    max_age: 3600,
    similarity_threshold: 0.95,
  },
};
// Savings: 40-60% on repeated/similar queries
```
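The mechanics are: compare each incoming query against cached ones, and serve a stored response when similarity clears the threshold. A toy sketch — real semantic caches embed queries and compare cosine similarity; here `difflib`'s string ratio (with a correspondingly lower threshold) stands in so the idea is runnable:

```python
import difflib

class ToySemanticCache:
    """Toy sketch only: string similarity stands in for embedding similarity."""

    def __init__(self, threshold: float = 0.7):
        self.threshold = threshold
        self.entries = {}  # normalized query -> cached response

    @staticmethod
    def _norm(query: str) -> str:
        """Lowercase and strip punctuation so trivial variants line up."""
        return "".join(c for c in query.lower()
                       if c.isalnum() or c == " ").strip()

    def get(self, query: str):
        """Return a cached response if any stored query is similar enough."""
        q = self._norm(query)
        for cached_q, response in self.entries.items():
            if difflib.SequenceMatcher(None, q, cached_q).ratio() >= self.threshold:
                return response  # cache hit
        return None  # cache miss: call the model, then put()

    def put(self, query: str, response: str) -> None:
        self.entries[self._norm(query)] = response
```

The threshold is the key tuning knob in either version: too low and users get stale or wrong answers to genuinely different questions; too high and the hit rate (and the savings) collapses.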
4. Cost Tracking
```javascript
// Track spending per model, per user, per feature
const analytics = await gateway.getUsage({
  timeRange: 'last_30_days',
  groupBy: ['model', 'user', 'feature'],
});

// Result:
// {
//   total_cost: $342.50,
//   by_model: {
//     'claude-sonnet': { tokens: 5M, cost: $180 },
//     'gpt-4o': { tokens: 2M, cost: $100 },
//     'gemini-flash': { tokens: 10M, cost: $62.50 },
//   },
//   by_feature: {
//     'chat': $200,
//     'search': $80,
//     'summarization': $62.50,
//   }
// }
```
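The rollup itself is simple once every request flows through one place: multiply token counts by a price table and group. A sketch with illustrative blended per-1M-token prices (real tracking prices input and output tokens separately):

```python
from collections import defaultdict

# Illustrative blended prices per 1M tokens — assumptions, not quotes.
PRICE_PER_M = {"claude-sonnet": 3.00, "gpt-4o": 5.00, "gemini-flash": 0.075}

def summarize(requests):
    """Roll up (model, tokens, feature) request logs into cost views."""
    by_model = defaultdict(float)
    by_feature = defaultdict(float)
    for model, tokens, feature in requests:
        cost = tokens / 1_000_000 * PRICE_PER_M[model]
        by_model[model] += cost
        by_feature[feature] += cost
    return {
        "total_cost": round(sum(by_model.values()), 2),
        "by_model": dict(by_model),
        "by_feature": dict(by_feature),
    }
```

The per-feature view is the one teams usually lack: it is what tells you a low-value feature is quietly consuming a third of the budget.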
5. Guardrails
```javascript
// Block harmful content, PII, or off-topic responses
const config = {
  guardrails: {
    input: {
      block_pii: true,
      block_topics: ['violence', 'illegal'],
      max_tokens: 4000,
    },
    output: {
      block_pii: true,
      require_citation: true,
      max_tokens: 2000,
    },
  },
};
```
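To make the input side concrete, here is a deliberately crude sketch of what `block_pii` plus `max_tokens` might check — regexes for obvious PII shapes and a rough length cap. Production guardrails use trained detectors and real tokenizers, not this:

```python
import re

# Crude PII shapes: email, US-style SSN, 16-digit card numbers.
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),   # email address
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),     # SSN-like
    re.compile(r"\b(?:\d[ -]?){15}\d\b"),     # card-number-like
]

def screen_input(prompt: str, max_tokens: int = 4000):
    """Return (allowed, reason); tokens approximated as ~4 chars each."""
    for pattern in PII_PATTERNS:
        if pattern.search(prompt):
            return False, "pii_detected"
    if len(prompt) / 4 > max_tokens:
        return False, "too_long"
    return True, "ok"
```

Output guardrails run the same kind of screen on the model's response before it reaches the user, which is why they add latency and are usually applied selectively.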
Choosing a Gateway
| If You Need... | Choose | Why |
|---|---|---|
| Maximum model support | LiteLLM | 100+ models |
| Production reliability | Portkey | Enterprise-grade fallback + retry |
| Observability focus | Helicone | Best analytics and logging |
| Smart routing | Martian / Not Diamond | AI selects best model per request |
| AWS ecosystem | Bedrock | Native integration |
| Self-hosted, open-source | LiteLLM Proxy | Full control |
| Edge deployment | Cloudflare AI Gateway | Global edge, no origin |
Cost Impact
Before Gateway
Dev time managing 3 providers: 2 hours/week
Wasted spend from no caching: ~30% of AI budget
Downtime from single-provider dependency: 2-3 incidents/quarter
No visibility into per-feature costs: over-spending on non-critical features
After Gateway
Provider management: automated
Cache hit rate: 40-60% → direct cost savings
Uptime: automatic failover prevents most outages
Cost visibility: per-feature, per-user, per-model tracking
Typical savings: 30-50% on AI API costs through caching + model routing alone.
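A back-of-envelope check on that range, using deliberately conservative made-up inputs (every number below is an illustrative assumption, not measured data):

```python
# Illustrative assumptions, not measured data.
monthly_spend = 1000.00   # baseline AI API spend ($)
cache_hit_rate = 0.30     # below the 40-60% range above, to be conservative
routable_share = 0.30     # traffic a cheap model can handle instead of frontier
cheap_price_ratio = 0.05  # cheap model price as a fraction of frontier price

# Caching removes hit traffic entirely; routing reprices part of the rest.
after_cache = monthly_spend * (1 - cache_hit_rate)
after_routing = after_cache * (
    (1 - routable_share) + routable_share * cheap_price_ratio
)
savings = 1 - after_routing / monthly_spend
```

Even with inputs below the ranges quoted above, combined savings land near 50% — which is why the two levers together dominate most AI cost-optimization work.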
Common Mistakes
| Mistake | Impact | Fix |
|---|---|---|
| Not caching identical requests | 30-50% wasted spend | Enable semantic caching |
| Using frontier model for all tasks | 10x overspend | Route simple tasks to cheap models |
| No fallback configured | Outage when primary provider goes down | Set up at least 2 fallback providers |
| Ignoring token usage by feature | Can't optimize | Track per-feature costs |
| Gateway as single point of failure | Gateway down = everything down | Self-host or use multiple gateway instances |
Compare AI gateways and model providers on APIScout — pricing, model support, reliability, and developer experience.