How to Build Resilient API Integrations That Don't Break
Every API you depend on will go down. It will have bugs. It will change its response format. It will rate-limit you at the worst possible time. The question isn't whether your API integrations will face problems — it's whether your application survives them gracefully.
The Failure Modes
What Goes Wrong with API Integrations
| Failure Mode | Frequency | Impact |
|---|---|---|
| Timeout | Daily | Slow responses cascade through your system |
| Rate limiting (429) | Daily-weekly | Requests fail until rate resets |
| Server error (5xx) | Weekly | Temporary failures, usually recoverable |
| DNS resolution failure | Monthly | Complete inability to connect |
| Certificate expiry | Rare but devastating | HTTPS connections fail |
| Breaking API change | Quarterly | Integration stops working |
| Response format change | Quarterly | Parsing errors, data corruption |
| Deprecation | Annually | Endpoints removed, features dropped |
| Provider shutdown | Rare | Complete integration loss |
Pattern 1: Timeouts on Everything
The most common cause of cascading failure: no timeouts.
// ❌ No timeout: the request can hang for minutes (or indefinitely) if the API stalls
const unguarded = await fetch('https://api.example.com/data');

// ✅ Always set a timeout
const response = await fetch('https://api.example.com/data', {
  signal: AbortSignal.timeout(5000), // 5 second timeout
});

// ✅ Even better: different timeouts for different operations
const TIMEOUTS = {
  read: 5000,    // 5s for reads
  write: 10000,  // 10s for writes
  upload: 60000, // 60s for file uploads
  webhook: 3000, // 3s for webhook delivery
};

async function apiCall(path: string, type: keyof typeof TIMEOUTS) {
  return fetch(`https://api.example.com${path}`, {
    signal: AbortSignal.timeout(TIMEOUTS[type]),
  });
}
Rule of thumb: Set the timeout to roughly 2-5x the expected response time. If the API normally responds in 200ms, time out at 500ms-1s.
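`fetch` accepts an abort signal, but many SDK calls and database drivers do not. For those, a generic deadline wrapper is one option. The following is a minimal sketch, not a library API: `withTimeout` and `TimeoutError` are hypothetical names introduced here for illustration. Note that it only stops waiting; the underlying operation keeps running in the background.

```typescript
// Hypothetical error type so callers can tell timeouts apart from other failures.
class TimeoutError extends Error {
  constructor(ms: number) {
    super(`Operation timed out after ${ms}ms`);
    this.name = 'TimeoutError';
  }
}

// Hypothetical helper: rejects if `promise` does not settle within `ms`.
// It cannot cancel the underlying work, it only stops waiting for it.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(ms)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, deadline]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}
```

Callers can branch on `error.name === 'TimeoutError'` to treat a timeout differently from, say, a validation error.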
Pattern 2: Circuit Breaker
Stop calling a broken API. Let it recover instead of overwhelming it with retries.
class CircuitBreaker {
private failures = 0;
private lastFailure = 0;
private state: 'closed' | 'open' | 'half-open' = 'closed';
constructor(
private failureThreshold: number = 5,
private resetTimeMs: number = 30000, // 30 seconds
) {}
async execute<T>(fn: () => Promise<T>): Promise<T> {
if (this.state === 'open') {
// Check if enough time has passed to try again
if (Date.now() - this.lastFailure > this.resetTimeMs) {
this.state = 'half-open';
} else {
throw new CircuitOpenError('Circuit breaker is open');
}
}
try {
const result = await fn();
this.onSuccess();
return result;
} catch (error) {
this.onFailure();
throw error;
}
}
private onSuccess() {
this.failures = 0;
this.state = 'closed';
}
private onFailure() {
this.failures++;
this.lastFailure = Date.now();
if (this.failures >= this.failureThreshold) {
this.state = 'open';
}
}
getState() {
return {
state: this.state,
failures: this.failures,
};
}
}
class CircuitOpenError extends Error {
constructor(message: string) {
super(message);
this.name = 'CircuitOpenError';
}
}
// Usage
const paymentCircuit = new CircuitBreaker(5, 30000);
async function processPayment(amount: number) {
try {
return await paymentCircuit.execute(() =>
stripe.charges.create({ amount, currency: 'usd' })
);
} catch (error) {
if (error instanceof CircuitOpenError) {
// Payment provider is down — queue for later
await queueForRetry({ amount, type: 'payment' });
return { status: 'queued', message: 'Payment will be processed shortly' };
}
throw error;
}
}
Pattern 3: Graceful Degradation
When an API is down, serve reduced functionality instead of breaking entirely.
// Example: Product page with reviews from external API
async function getProductPage(productId: string) {
// Core data — from your database (must succeed)
const product = await db.products.findById(productId);
// Enhanced data — from external APIs (can fail gracefully)
const [reviews, recommendations, inventory] = await Promise.allSettled([
fetchReviews(productId), // Third-party reviews API
fetchRecommendations(productId), // ML recommendation API
fetchInventory(productId), // Warehouse API
]);
return {
product,
reviews: reviews.status === 'fulfilled'
? reviews.value
: { items: [], message: 'Reviews temporarily unavailable' },
recommendations: recommendations.status === 'fulfilled'
? recommendations.value
: [],
inventory: inventory.status === 'fulfilled'
? inventory.value
: { available: true, message: 'Check store for availability' },
};
}
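The per-field `status === 'fulfilled'` checks above can be folded into a small helper when many optional calls need the same treatment. This is a sketch under assumptions: `withFallback` and `Degradable` are hypothetical names, and the `degraded` flag is one possible way to let the UI show an "unavailable" notice.

```typescript
// Hypothetical wrapper type: the value plus a flag saying whether it is a fallback.
interface Degradable<T> {
  value: T;
  degraded: boolean; // true when the fallback was used
}

// Hypothetical helper: run an optional call and substitute `fallback` if it fails.
async function withFallback<T>(
  call: () => Promise<T>,
  fallback: T,
  onError: (error: unknown) => void = () => {},
): Promise<Degradable<T>> {
  try {
    return { value: await call(), degraded: false };
  } catch (error) {
    onError(error); // report the failure (log, metric) without surfacing it to the user
    return { value: fallback, degraded: true };
  }
}
```

Core data such as the product record should stay outside this helper, since a failure there should still fail the request.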
Degradation Levels
| Level | What Works | What's Degraded | User Experience |
|---|---|---|---|
| Full | Everything | Nothing | Normal |
| Partial | Core features | Enhancements (reviews, recommendations) | Minor loss |
| Minimal | Read operations | Write operations queued | Can browse, can't act |
| Cached | Stale data served | No fresh data | "Data as of X minutes ago" |
| Maintenance | Nothing | Everything | Maintenance page |
Pattern 4: Caching and Stale Data
Serve cached data when the API is unavailable:
class CachedAPIClient {
constructor(
private cache: Map<string, { data: any; timestamp: number }> = new Map(),
private maxAge: number = 300000, // 5 minutes
private staleMaxAge: number = 3600000, // 1 hour (serve stale if API is down)
) {}
async fetch<T>(url: string, options?: RequestInit): Promise<T & { _cached?: boolean; _stale?: boolean }> {
const cached = this.cache.get(url);
// Fresh cache — serve immediately
if (cached && Date.now() - cached.timestamp < this.maxAge) {
return { ...cached.data, _cached: true };
}
// Try fresh fetch
try {
const response = await fetch(url, {
...options,
signal: AbortSignal.timeout(5000),
});
if (!response.ok) throw new Error(`HTTP ${response.status}`);
const data = await response.json();
this.cache.set(url, { data, timestamp: Date.now() });
return data;
} catch (error) {
// Fetch failed — serve stale cache if available
if (cached && Date.now() - cached.timestamp < this.staleMaxAge) {
console.warn(`Serving stale cache for ${url} (age: ${Date.now() - cached.timestamp}ms)`);
return { ...cached.data, _cached: true, _stale: true };
}
throw error; // No cache available — propagate error
}
}
}
Pattern 5: Idempotent Retry with Deduplication
Safe to retry without duplicate side effects:
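// Small helper used for backoff delays
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function createOrderWithRetry(orderData: OrderInput): Promise<Order> {
  // Generate the idempotency key ONCE, before the first attempt.
  // A random UUID avoids collisions between orders created in the same millisecond.
  // (crypto.randomUUID is global in browsers and recent Node versions;
  // otherwise import { randomUUID } from 'node:crypto'.)
  const idempotencyKey = `order_${orderData.userId}_${crypto.randomUUID()}`;
  const maxAttempts = 3;

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    let response: Response;
    try {
      response = await fetch('https://api.payments.com/v1/orders', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'Idempotency-Key': idempotencyKey, // Same key for all retries
        },
        body: JSON.stringify(orderData),
        signal: AbortSignal.timeout(10000),
      });
    } catch (error) {
      // Network error or timeout: retryable, unless this was the last attempt
      if (attempt === maxAttempts - 1) throw error;
      await sleep(Math.pow(2, attempt) * 1000);
      continue;
    }

    if (response.ok) return response.json();

    // 4xx (except 429) is a client error: retrying will not help, so fail immediately
    if (response.status !== 429 && response.status < 500) {
      throw new Error(`HTTP ${response.status}: ${await response.text()}`);
    }

    // 429 or 5xx: retryable, and the shared idempotency key prevents duplicate orders
    if (attempt < maxAttempts - 1) {
      await sleep(Math.pow(2, attempt) * 1000);
    }
  }
  throw new Error('Max retries exceeded');
}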
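The retry loop above waits 1s, then 2s, with no randomness, so many clients that fail at the same moment will also retry at the same moment. Adding jitter spreads them out. The sketch below uses the "full jitter" variant (a random delay between zero and the exponential ceiling). `backoffDelay` and `retryAfterMs` are hypothetical helper names, and `retryAfterMs` only handles the seconds form of the `Retry-After` header, not the HTTP-date form.

```typescript
// Full jitter: pick a random delay in [0, min(capMs, baseMs * 2^attempt)).
function backoffDelay(
  attempt: number,
  baseMs: number = 1000,
  capMs: number = 30000,
  random: () => number = Math.random, // injectable so the function can be tested
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

// Parse a Retry-After header given in seconds.
// Returns null when the header is absent, empty, or not a non-negative number.
function retryAfterMs(headerValue: string | null): number | null {
  if (headerValue === null || headerValue.trim() === '') return null;
  const seconds = Number(headerValue);
  return isFinite(seconds) && seconds >= 0 ? seconds * 1000 : null;
}
```

In a retry loop, one reasonable policy is `retryAfterMs(response.headers.get('Retry-After')) ?? backoffDelay(attempt)`, so the server's own hint wins when it provides one.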
Pattern 6: Health Check Monitoring
Detect issues before they hit users:
class APIHealthChecker {
private healthStatus: Map<string, {
healthy: boolean;
lastCheck: number;
latency: number;
consecutiveFailures: number;
}> = new Map();
async check(name: string, healthUrl: string): Promise<boolean> {
const start = Date.now();
try {
const response = await fetch(healthUrl, {
signal: AbortSignal.timeout(3000),
});
const healthy = response.ok;
const latency = Date.now() - start;
this.healthStatus.set(name, {
healthy,
lastCheck: Date.now(),
latency,
consecutiveFailures: healthy ? 0 : (this.healthStatus.get(name)?.consecutiveFailures ?? 0) + 1,
});
return healthy;
} catch {
const current = this.healthStatus.get(name);
this.healthStatus.set(name, {
healthy: false,
lastCheck: Date.now(),
latency: Date.now() - start,
consecutiveFailures: (current?.consecutiveFailures ?? 0) + 1,
});
return false;
}
}
getStatus() {
return Object.fromEntries(this.healthStatus);
}
}
// Usage: check every 30 seconds.
// The URLs below are illustrative; use each provider's documented health or status endpoint.
const checker = new APIHealthChecker();
setInterval(async () => {
await Promise.all([
checker.check('stripe', 'https://api.stripe.com/v1'),
checker.check('resend', 'https://api.resend.com/health'),
checker.check('auth', 'https://api.clerk.com/v1/health'),
]);
const status = checker.getStatus();
// Alert if any API has 3+ consecutive failures
for (const [name, state] of Object.entries(status)) {
if (state.consecutiveFailures >= 3) {
await alertOps(`${name} API is unhealthy: ${state.consecutiveFailures} consecutive failures`);
}
}
}, 30000);
Pattern 7: Response Validation
Don't trust API responses — validate them:
import { z } from 'zod';
// Define expected response shape
const UserResponseSchema = z.object({
id: z.string(),
email: z.string().email(),
name: z.string(),
created_at: z.string().datetime(),
});
type UserResponse = z.infer<typeof UserResponseSchema>;
async function getUser(userId: string): Promise<UserResponse> {
  const response = await fetch(`/api/users/${userId}`, {
    signal: AbortSignal.timeout(5000),
  });
  // Fail on HTTP errors first, so an error body is not reported as a schema change
  if (!response.ok) throw new Error(`HTTP ${response.status}`);
  const data = await response.json();

  // Validate that the response matches the expected schema
  const result = UserResponseSchema.safeParse(data);
  if (!result.success) {
    // API response format changed: log and alert
    console.error('API response validation failed:', {
      endpoint: `/api/users/${userId}`,
      errors: result.error.issues,
      received: data,
    });
    // Option 1: Throw (fail fast)
    throw new Error('API response format changed');
    // Option 2: Use with defaults (graceful)
    // return { ...defaults, ...data };
  }
  return result.data;
}
The Resilience Checklist
| Pattern | Priority | Impact |
|---|---|---|
| Timeouts on all API calls | P0 | Prevents cascading failures |
| Exponential backoff with jitter | P0 | Handles rate limits and transient errors |
| Input/output validation | P0 | Catches API changes early |
| Circuit breaker | P1 | Stops hammering failing APIs |
| Graceful degradation | P1 | Users get partial functionality vs errors |
| Response caching (stale-while-error) | P1 | Serves data during outages |
| Idempotency keys on writes | P1 | Safe retries without duplicates |
| Health check monitoring | P2 | Early detection of issues |
| Multi-provider fallback | P2 | Survive provider outages |
| Response schema validation | P2 | Detect breaking changes |
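The checklist includes multi-provider fallback, which none of the patterns above demonstrate. One way to approach it is a thin interface that each provider implements, plus a function that tries them in order. This is a sketch only: `EmailProvider` and `sendWithFallback` are hypothetical names, not part of any real SDK, and a production version would likely combine this with the circuit breaker so a known-bad provider is skipped.

```typescript
// Hypothetical provider abstraction; real SDK calls would be wrapped behind `send`.
interface EmailProvider {
  name: string;
  send(to: string, subject: string, body: string): Promise<void>;
}

// Try each provider in order and return the name of the one that succeeded.
async function sendWithFallback(
  providers: EmailProvider[],
  to: string,
  subject: string,
  body: string,
): Promise<string> {
  const errors: string[] = [];
  for (const provider of providers) {
    try {
      await provider.send(to, subject, body);
      return provider.name;
    } catch (error) {
      // Record the failure and move on to the next provider
      const reason = error instanceof Error ? error.message : String(error);
      errors.push(`${provider.name}: ${reason}`);
    }
  }
  throw new Error(`All providers failed (${errors.join('; ')})`);
}
```

Falling back is only safe for operations that can be repeated without harm; for payments or other writes, pair it with idempotency keys.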
Common Mistakes
| Mistake | Impact | Fix |
|---|---|---|
| No timeouts | One slow API freezes entire app | Set timeouts on every external call |
| Retry without backoff | Makes outages worse | Exponential backoff + jitter |
| Same code path for all errors | Retrying non-retryable errors | Handle 4xx vs 5xx vs network errors differently |
| No fallback for external APIs | Single point of failure | Cache, degrade, or use backup provider |
| Trusting API response format | Breaks when API changes | Validate responses with Zod/schemas |
| No monitoring of API health | Issues discovered by users | Health checks + alerting |
| Tight coupling to one provider | Locked in when problems arise | Abstraction layer for critical APIs |
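The "same code path for all errors" mistake is easier to avoid when failures are classified in one place. The sketch below shows one possible split; `classifyFailure` and the `FailureClass` labels are hypothetical, and treating 408 as retryable is an assumption that may not suit every API.

```typescript
type FailureClass = 'network' | 'rate_limited' | 'retryable' | 'client_error';

// Classify a FAILED call. Pass null when no HTTP response was received at all
// (DNS failure, TLS error, timeout). Not meant for 2xx/3xx responses.
function classifyFailure(status: number | null): FailureClass {
  if (status === null) return 'network';      // retry with backoff
  if (status === 429) return 'rate_limited';  // back off, honor Retry-After if present
  if (status === 408 || status >= 500) return 'retryable';
  return 'client_error';                      // other 4xx: fix the request, do not retry
}
```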