

By the APIScout Team
Tags: api, resilience, circuit-breaker, nodejs, reliability

API Resilience Patterns: Circuit Breakers, Retries, and Bulkheads in 2026

TL;DR

Every production API depends on external services — databases, third-party APIs, message queues — that will occasionally be slow, unavailable, or overloaded. Without resilience patterns, one slow downstream service cascades into a full outage. The four patterns that prevent cascade failures: retries with exponential backoff (don't hammer a struggling service), circuit breakers (stop calling a service that's consistently failing), bulkheads (limit concurrency so one slow integration doesn't exhaust all resources), and timeouts (never wait forever). In Node.js, Cockatiel is the go-to library for all four patterns with excellent TypeScript support. Axios-retry handles the simple retry case for HTTP clients.

Key Takeaways

  • Retries without backoff make outages worse — if 1,000 requests fail and all retry immediately, you send 2,000 requests to a service that's already struggling
  • Circuit breakers have three states: Closed (normal), Open (failing fast, not calling the service), Half-Open (testing if recovery is possible)
  • Bulkheads limit concurrency — cap how many simultaneous calls can go to a single downstream service; prevents one slow integration from blocking all your workers
  • Always set timeouts — a request with no timeout can hang forever, holding a Node.js event loop slot, a DB connection, and worker memory
  • Cockatiel is the best resilience library for Node.js — composable policies, TypeScript-first, covers retry/circuit breaker/bulkhead/timeout/fallback in one package
  • Jitter prevents thundering herd — add random variance to retry delays so retrying clients don't all slam the recovering service at the same instant

Why APIs Fail: The Cascade Problem

Consider a typical microservice API with four dependencies:

User Request → Orders API → [PostgreSQL, Stripe, Shipping API, Email Service]

If the Shipping API takes 10 seconds instead of 100ms, and there are 500 concurrent requests in flight, you have 500 requests each waiting 10 seconds, holding 500 database connections and 500 worker slots. The database connection pool is exhausted, so PostgreSQL queries start timing out too. Email service calls queue up behind them. The entire system fails because of one slow external API.

Resilience patterns break this cascade:

  1. Timeout: Give up on the Shipping API after 2 seconds, not 10
  2. Circuit breaker: After 10 failures, stop calling Shipping API entirely for 30 seconds
  3. Bulkhead: Only allow 20 simultaneous Shipping API calls; queue or reject the rest
  4. Retry: If a single Shipping API call fails, retry it up to 3 times with exponential backoff

Pattern 1: Retries with Exponential Backoff

The Simple Case: axios-retry

For HTTP clients using Axios, axios-retry adds retry logic in a few lines:

import axios from 'axios'
import axiosRetry from 'axios-retry'

const client = axios.create({
  baseURL: 'https://shipping-api.com',
  timeout: 5000,  // Always set a timeout
})

axiosRetry(client, {
  retries: 3,
  retryDelay: axiosRetry.exponentialDelay,  // ~100ms, 200ms, 400ms, plus built-in jitter
  retryCondition: (error) => {
    // Retry network errors, rate limits (429), and 5xx responses
    return axiosRetry.isNetworkOrIdempotentRequestError(error)
      || error.response?.status === 429  // Rate limit — retry after
      || (error.response?.status ?? 0) >= 500
  },
  onRetry: (retryCount, error, config) => {
    console.log(`Retry ${retryCount} for ${config.url}: ${error.message}`)
  },
})

// Now every request through `client` automatically retries
const rates = await client.post('/calculate', { orderId, destination })

Adding Jitter to Prevent Thundering Herd

Exponential backoff without jitter means all retrying clients synchronize: they all failed at t=0, they all retry at t=1s, they all fail again, they all retry at t=2s. This creates burst traffic exactly when the service is trying to recover.

Jitter adds random variance:

function exponentialBackoffWithJitter(retryCount: number): number {
  const baseDelay = 1000  // 1 second
  const exponentialDelay = baseDelay * Math.pow(2, retryCount)
  // Add up to 30% random jitter
  const jitter = Math.random() * exponentialDelay * 0.3
  return exponentialDelay + jitter
}

axiosRetry(client, {
  retries: 3,
  retryDelay: (retryCount) => exponentialBackoffWithJitter(retryCount),
})

Which Errors Should Retry?

Not all errors should retry — only transient failures:

| Error type | Retry? | Reason |
| --- | --- | --- |
| 5xx (500, 502, 503, 504) | ✅ Yes | Server-side transient errors |
| 429 (Rate Limited) | ✅ Yes, with Retry-After | Wait for the rate limit window |
| 408 (Timeout) | ✅ Yes | Transient |
| 4xx (400, 401, 403, 404) | ❌ No | Client errors — retrying won't help |
| Network error (ECONNREFUSED) | ✅ Yes | Transient connectivity |
| Data validation error | ❌ No | Your data is wrong — retrying won't fix it |
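The 429 row deserves code: when the server sends a Retry-After header, waiting exactly that long beats guessing with backoff. Below is a sketch of a Retry-After-aware delay function suitable for axios-retry's retryDelay hook; the error shape follows Axios conventions, and the function name is ours:

```typescript
// Sketch: honor a 429's Retry-After header when present, otherwise fall back
// to exponential backoff with up to 30% jitter. Axios errors expose response
// headers at error.response.headers.
function retryAfterAwareDelay(retryCount: number, error: any): number {
  const retryAfter = error?.response?.headers?.['retry-after']
  if (error?.response?.status === 429 && retryAfter) {
    const seconds = Number(retryAfter)
    if (!Number.isNaN(seconds)) return seconds * 1000
    // Retry-After may also be an HTTP date
    const untilMs = Date.parse(retryAfter) - Date.now()
    if (!Number.isNaN(untilMs) && untilMs > 0) return untilMs
  }
  const base = 1000 * Math.pow(2, retryCount)
  return base + Math.random() * base * 0.3
}
```

Wire it in with `axiosRetry(client, { retries: 3, retryDelay: retryAfterAwareDelay })`.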

Pattern 2: Circuit Breakers

A circuit breaker wraps calls to an external service and tracks the failure rate. When failures exceed a threshold, it "opens" — immediately returning an error without calling the service. After a cooldown period, it allows a test request through ("half-open") and closes if that succeeds.
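The state machine is small enough to sketch before reaching for a library. This is a minimal illustration of the three states described above, not Cockatiel's implementation:

```typescript
// Minimal circuit breaker sketch: Closed -> Open after N consecutive
// failures, Open -> Half-Open after a cooldown, Half-Open -> Closed on
// success or back to Open on failure.
type BreakerState = 'closed' | 'open' | 'half-open'

class SimpleBreaker {
  private state: BreakerState = 'closed'
  private failures = 0
  private openedAt = 0

  constructor(private threshold: number, private cooldownMs: number) {}

  getState(): BreakerState {
    // Lazily transition Open -> Half-Open once the cooldown elapses
    if (this.state === 'open' && Date.now() - this.openedAt >= this.cooldownMs) {
      this.state = 'half-open'
    }
    return this.state
  }

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.getState() === 'open') {
      throw new Error('Circuit open, failing fast')
    }
    try {
      const result = await fn()
      // Any success closes the circuit and resets the failure count
      this.state = 'closed'
      this.failures = 0
      return result
    } catch (err) {
      this.failures++
      // A failure in Half-Open, or crossing the threshold, opens the circuit
      if (this.state === 'half-open' || this.failures >= this.threshold) {
        this.state = 'open'
        this.openedAt = Date.now()
      }
      throw err
    }
  }
}
```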

Implementing with Cockatiel

Cockatiel is the best TypeScript-native resilience library for Node.js:

npm install cockatiel

import {
  ConsecutiveBreaker,
  ExponentialBackoff,
  retry,
  circuitBreaker,
  handleAll,
  wrap,
} from 'cockatiel'

// Create a circuit breaker that opens after 5 consecutive failures
const shippingCircuitBreaker = circuitBreaker(handleAll, {
  halfOpenAfter: 30_000,  // Test recovery after 30 seconds
  breaker: new ConsecutiveBreaker(5),  // Open after 5 consecutive failures
})

// Create a retry policy with exponential backoff
const retryPolicy = retry(handleAll, {
  maxAttempts: 3,
  backoff: new ExponentialBackoff({ initialDelay: 500, maxDelay: 5000 }),
})

// Combine retry + circuit breaker (retry is outermost, so each attempt passes through the breaker)
const resilientShippingCall = wrap(retryPolicy, shippingCircuitBreaker)

// Usage
async function calculateShipping(orderId: string, destination: string) {
  try {
    return await resilientShippingCall.execute(async () => {
      const response = await fetch('https://shipping-api.com/calculate', {
        method: 'POST',
        body: JSON.stringify({ orderId, destination }),
        signal: AbortSignal.timeout(3000),  // 3 second timeout
      })
      if (!response.ok) throw new Error(`HTTP ${response.status}`)
      return response.json()
    })
  } catch (error) {
    // Circuit is open or all retries exhausted
    // Return a fallback value or throw a user-friendly error
    console.error('Shipping calculation failed, using fallback:', error)
    return { rate: 9.99, method: 'standard', estimate: '5-7 business days' }
  }
}

Monitoring Circuit Breaker State

shippingCircuitBreaker.onBreak(() => {
  console.error('Shipping API circuit OPENED — too many failures')
  metrics.circuitBreakerOpened.inc({ service: 'shipping-api' })
})

shippingCircuitBreaker.onReset(() => {
  console.info('Shipping API circuit CLOSED — service recovered')
  metrics.circuitBreakerReset.inc({ service: 'shipping-api' })
})

shippingCircuitBreaker.onHalfOpen(() => {
  console.info('Shipping API circuit HALF-OPEN — testing recovery')
})

Alerting on circuit breaker state changes is far more actionable than generic error rate alerts — you know exactly which service is failing and when it recovers.


Pattern 3: Bulkheads

A bulkhead limits concurrency to a downstream service. Without bulkheads, if your API receives 1,000 concurrent requests and all of them need to call the shipping API, you send 1,000 simultaneous requests to shipping. With a bulkhead of 20, you allow 20 concurrent calls and queue (or reject) the rest.

import { bulkhead, BulkheadRejectedError } from 'cockatiel'

// Allow max 20 concurrent calls to shipping API; queue up to 50 more
const shippingBulkhead = bulkhead(20, 50)  // bulkhead(limit, queueSize)

// Combined: retry + circuit breaker + bulkhead
const resilientShipping = wrap(retryPolicy, shippingCircuitBreaker, shippingBulkhead)

async function calculateShipping(orderId: string, destination: string) {
  try {
    return await resilientShipping.execute(async () => {
      return shippingApiClient.calculate({ orderId, destination })
    })
  } catch (error) {
    if (error instanceof BulkheadRejectedError) {
      // Queue full — return a degraded response immediately
      return { rate: 12.99, method: 'standard', estimate: '7-10 days', degraded: true }
    }
    throw error
  }
}

Bulkhead sizing heuristics:

  • Start with concurrency = expected_max_concurrent * 1.5
  • Monitor queue depth — if consistently > 0, increase concurrency limit
  • Monitor rejection rate — if non-zero, increase queue size or improve upstream throughput
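To make the limit/queue/reject mechanics concrete, here is a minimal bulkhead sketch: a counting semaphore with a bounded wait queue. Cockatiel's bulkhead adds cancellation and instrumentation on top of the same idea:

```typescript
// Minimal bulkhead sketch: at most `limit` callers run at once; up to
// `queueSize` more wait for a slot; anything beyond that is rejected.
class SimpleBulkhead {
  private active = 0
  private waiting: Array<() => void> = []

  constructor(private limit: number, private queueSize: number) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.active >= this.limit) {
      if (this.waiting.length >= this.queueSize) {
        throw new Error('Bulkhead queue full, rejecting')
      }
      // Park this caller until a slot frees up
      await new Promise<void>((resolve) => this.waiting.push(resolve))
    }
    this.active++
    try {
      return await fn()
    } finally {
      this.active--
      this.waiting.shift()?.()  // Wake the next queued caller, if any
    }
  }
}
```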

Pattern 4: Timeouts

Every external call needs a timeout. No exceptions. A request with no timeout can hold a database connection, a worker process, and memory indefinitely.

// Fetch: use AbortSignal.timeout (Node.js 18+)
const response = await fetch(url, {
  signal: AbortSignal.timeout(3000),  // 3 second timeout
})

// Axios: use timeout option
const client = axios.create({ timeout: 3000 })

// Cockatiel: timeout policy
import { timeout, TimeoutStrategy } from 'cockatiel'

const timeoutPolicy = timeout(3000, TimeoutStrategy.Aggressive)

// Aggressive vs Cooperative:
// Aggressive: rejects as soon as the timeout elapses, even if the underlying
//   operation is still running (it only actually stops if it honors the signal)
// Cooperative: waits for the operation itself to observe the cancellation

Timeout sizing:

  • Set timeouts at the p99 latency + buffer of the healthy service
  • If the shipping API's p99 is 800ms when healthy, set timeout at ~2000ms
  • Too tight: false positives on slow-but-successful calls
  • Too loose: you wait too long before activating fallback behavior
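Conceptually, a timeout is just a race between the operation and a timer, with an AbortSignal so cooperative operations can stop doing work early. A self-contained sketch of that idea (illustrative only; AbortSignal.timeout or Cockatiel's timeout policy does this for you):

```typescript
// Sketch: race an operation against a timer. On timeout we reject and also
// abort the signal so a cooperative operation can bail out early.
class TimeoutError extends Error {}

async function withTimeout<T>(
  ms: number,
  fn: (signal: AbortSignal) => Promise<T>
): Promise<T> {
  const controller = new AbortController()
  let timer: ReturnType<typeof setTimeout> | undefined
  const timeoutPromise = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort()  // Cooperative cancellation for the operation
      reject(new TimeoutError(`Timed out after ${ms}ms`))
    }, ms)
  })
  try {
    return await Promise.race([fn(controller.signal), timeoutPromise])
  } finally {
    if (timer) clearTimeout(timer)  // Don't leave the timer pending on success
  }
}
```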

Putting It All Together: A Production API Call

Here's how all four patterns compose for a single downstream integration:

import { retry, circuitBreaker, bulkhead, timeout, wrap,
  ExponentialBackoff, ConsecutiveBreaker, TimeoutStrategy,
  handleAll, BulkheadRejectedError } from 'cockatiel'

function createResilientPolicy(options: {
  maxRetries: number
  timeoutMs: number
  maxConcurrent: number
  queueSize: number
  breakAfterFailures: number
  cooldownMs: number
}) {
  const retryPolicy = retry(handleAll, {
    maxAttempts: options.maxRetries,
    backoff: new ExponentialBackoff({ initialDelay: 200, maxDelay: 5000 }),
  })

  const cbPolicy = circuitBreaker(handleAll, {
    halfOpenAfter: options.cooldownMs,
    breaker: new ConsecutiveBreaker(options.breakAfterFailures),
  })

  const bulkheadPolicy = bulkhead(options.maxConcurrent, options.queueSize)

  const timeoutPolicy = timeout(options.timeoutMs, TimeoutStrategy.Aggressive)

  // Order: leftmost is outermost. The timeout bounds the whole call (including
  // retries); each attempt then passes through the circuit breaker and bulkhead
  return wrap(timeoutPolicy, retryPolicy, cbPolicy, bulkheadPolicy)
}

// Create per-service resilience policies
const shippingPolicy = createResilientPolicy({
  maxRetries: 3,
  timeoutMs: 3000,
  maxConcurrent: 20,
  queueSize: 50,
  breakAfterFailures: 5,
  cooldownMs: 30_000,
})

const stripePolicy = createResilientPolicy({
  maxRetries: 2,     // Fewer retries — payments must be idempotent
  timeoutMs: 10000,  // Longer timeout — Stripe is slow sometimes
  maxConcurrent: 10,
  queueSize: 20,
  breakAfterFailures: 3,
  cooldownMs: 60_000,
})

Pattern 5: Fallbacks and Degraded Mode

When a service is unavailable, you have three options:

  1. Fail fast — return an error immediately (good for critical operations like payments)
  2. Return a default — return a safe fallback value (good for non-critical enrichment)
  3. Use a cache — return the last-known-good value from cache (good for read-heavy data)

import {
  fallback,
  circuitBreaker,
  timeout,
  wrap,
  handleAll,
  ConsecutiveBreaker,
  TimeoutStrategy,
} from 'cockatiel'
import { redis } from './redis'

// Fallback: return cached value if service is unavailable
const exchangeRateWithFallback = fallback(handleAll, async () => {
  // Try cache first
  const cached = await redis.get('exchange_rate:USD:EUR')
  if (cached) return JSON.parse(cached)

  // Hard fallback if cache is also empty
  return { rate: 0.92, source: 'fallback', stale: true }
})

const resilientExchangeRate = wrap(
  exchangeRateWithFallback,  // Outermost, so it catches timeouts and open-circuit errors
  timeout(2000, TimeoutStrategy.Aggressive),
  circuitBreaker(handleAll, { halfOpenAfter: 30_000, breaker: new ConsecutiveBreaker(5) })
)

async function getExchangeRate(from: string, to: string) {
  return resilientExchangeRate.execute(async () => {
    const rate = await currencyApiClient.getRate(from, to)
    // Cache the successful result for future fallbacks
    await redis.setex(`exchange_rate:${from}:${to}`, 3600, JSON.stringify(rate))
    return rate
  })
}

Degraded Mode UI Patterns

When an API returns degraded data, tell the user. A degraded response is better than an error, but users should know the data might be stale:

// API response includes degradation signals
interface ApiResponse<T> {
  data: T
  degraded?: {
    reason: 'cache' | 'fallback' | 'partial'
    staleSince?: string
    affectedFields?: string[]
  }
}

// The frontend shows a non-blocking banner
// "Shipping estimates may be delayed — using cached rates"
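Small helpers keep handlers from hand-rolling the degraded shape by hand. The helper names below are hypothetical; only the interface comes from the snippet above:

```typescript
// Response envelope from the article, plus hypothetical constructor helpers
interface ApiResponse<T> {
  data: T
  degraded?: {
    reason: 'cache' | 'fallback' | 'partial'
    staleSince?: string
    affectedFields?: string[]
  }
}

// Fresh data: no degradation metadata at all
function ok<T>(data: T): ApiResponse<T> {
  return { data }
}

// Degraded data: always tagged with why, and optionally since when
function degraded<T>(
  data: T,
  reason: 'cache' | 'fallback' | 'partial',
  staleSince?: string
): ApiResponse<T> {
  return { data, degraded: { reason, staleSince } }
}
```

A handler can then return `degraded(cachedRates, 'cache', cachedAt)` and the frontend keys the banner off the `degraded` field.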

Testing Resilience Patterns

Resilience patterns need to be tested — not just in production:

import { describe, it, expect, vi } from 'vitest'
import { calculateShipping } from '../shipping'
import { shippingApiClient } from '../clients/shipping-client'

describe('calculateShipping resilience', () => {
  it('falls back to default rate when service is unavailable', async () => {
    vi.spyOn(shippingApiClient, 'calculate').mockRejectedValue(
      new Error('ECONNREFUSED')
    )

    const result = await calculateShipping('order-123', 'New York, NY')

    expect(result.rate).toBe(9.99)
    expect(result.method).toBe('standard')
  })

  it('opens circuit after 5 consecutive failures', async () => {
    vi.spyOn(shippingApiClient, 'calculate').mockRejectedValue(
      new Error('Service unavailable')
    )

    // Trigger 5 failures to open the circuit
    for (let i = 0; i < 5; i++) {
      await calculateShipping('order-123', 'New York').catch(() => {})
    }

    // Now the circuit should be open — it should fail fast without calling the service
    const callCountBefore = vi.mocked(shippingApiClient.calculate).mock.calls.length
    await calculateShipping('order-456', 'Chicago').catch(() => {})
    const callCountAfter = vi.mocked(shippingApiClient.calculate).mock.calls.length

    // Circuit breaker should have prevented the call
    expect(callCountAfter).toBe(callCountBefore)
  })
})

Methodology

  • npm download data from npmjs.com API, March 2026 weekly averages
  • Package versions: Cockatiel v3.x, axios-retry v4.x
  • Sources: "Release It!" by Michael Nygard, Netflix Tech Blog, Cockatiel documentation, AWS re:Invent resilience talks

Explore API reliability tooling on APIScout — find packages for circuit breakers, retry logic, and observability.

Related: API Error Handling Patterns for Production 2026 · API Rate Limiting Best Practices 2026 · OpenTelemetry for API Observability 2026
