

By the APIScout Team
Tags: api, resilience, circuit-breaker, nodejs, reliability

API Resilience Patterns: Circuit Breakers, Retries, and Bulkheads in 2026

TL;DR

Every production API depends on external services — databases, third-party APIs, message queues — that will occasionally be slow, unavailable, or overloaded. Without resilience patterns, one slow downstream service cascades into a full outage. The four patterns that prevent cascade failures: retries with exponential backoff (don't hammer a struggling service), circuit breakers (stop calling a service that's consistently failing), bulkheads (limit concurrency so one slow integration doesn't exhaust all resources), and timeouts (never wait forever). In Node.js, Cockatiel is the go-to library for all four patterns with excellent TypeScript support. Axios-retry handles the simple retry case for HTTP clients.

Key Takeaways

  • Retries without backoff make outages worse — if 1,000 requests fail and all retry immediately, you send 2,000 requests to a service that's already struggling
  • Circuit breakers have three states: Closed (normal), Open (failing fast, not calling the service), Half-Open (testing if recovery is possible)
  • Bulkheads limit concurrency — cap how many simultaneous calls can go to a single downstream service; prevents one slow integration from blocking all your workers
  • Always set timeouts — a request with no timeout can hang forever, holding a Node.js event loop slot, a DB connection, and worker memory
  • Cockatiel is the best resilience library for Node.js — composable policies, TypeScript-first, covers retry/circuit breaker/bulkhead/timeout/fallback in one package
  • Jitter prevents thundering herd — add random variance to retry delays so retrying clients don't all slam the recovering service at the same instant

Why APIs Fail: The Cascade Problem

Consider a typical microservice API with four dependencies:

User Request → Orders API → [PostgreSQL, Stripe, Shipping API, Email Service]

If the Shipping API takes 10 seconds instead of 100ms, and there are 500 concurrent requests in flight, you have 500 requests each waiting 10 seconds, holding 500 database connections and 500 worker slots. The database connection pool is exhausted, so PostgreSQL queries start timing out too. Email service calls queue up behind them. The entire system fails because of one slow external API.

Resilience patterns break this cascade:

  1. Timeout: Give up on the Shipping API after 2 seconds, not 10
  2. Circuit breaker: After 10 failures, stop calling Shipping API entirely for 30 seconds
  3. Bulkhead: Only allow 20 simultaneous Shipping API calls; queue or reject the rest
  4. Retry: If a single Shipping API call fails, retry it up to 3 times with exponential backoff

Pattern 1: Retries with Exponential Backoff

The Simple Case: axios-retry

For HTTP clients using Axios, axios-retry adds retry logic in a few lines:

import axios from 'axios'
import axiosRetry from 'axios-retry'

const client = axios.create({
  baseURL: 'https://shipping-api.com',
  timeout: 5000,  // Always set a timeout
})

axiosRetry(client, {
  retries: 3,
  retryDelay: axiosRetry.exponentialDelay,  // ~100ms, 200ms, 400ms, plus built-in jitter
  retryCondition: (error) => {
    // Retry network errors, rate limits (429), and 5xx responses
    return axiosRetry.isNetworkOrIdempotentRequestError(error)
      || error.response?.status === 429  // Rate limit — retry after
      || (error.response?.status ?? 0) >= 500
  },
  onRetry: (retryCount, error, config) => {
    console.log(`Retry ${retryCount} for ${config.url}: ${error.message}`)
  },
})

// Now every request through `client` automatically retries
const rates = await client.post('/calculate', { orderId, destination })

Adding Jitter to Prevent Thundering Herd

Exponential backoff without jitter means all retrying clients synchronize: they all failed at t=0, they all retry at t=1s, they all fail again, they all retry at t=2s. This creates burst traffic exactly when the service is trying to recover.

Jitter adds random variance:

function exponentialBackoffWithJitter(retryCount: number): number {
  const baseDelay = 1000  // 1 second
  const exponentialDelay = baseDelay * Math.pow(2, retryCount)
  // Add up to 30% random jitter
  const jitter = Math.random() * exponentialDelay * 0.3
  return exponentialDelay + jitter
}

axiosRetry(client, {
  retries: 3,
  retryDelay: (retryCount) => exponentialBackoffWithJitter(retryCount),
})

Which Errors Should Retry?

Not all errors should retry — only transient failures:

| Error type | Retry? | Reason |
| --- | --- | --- |
| 5xx (500, 502, 503, 504) | ✅ Yes | Server-side transient errors |
| 429 (Rate Limited) | ✅ Yes, with Retry-After | Wait for the rate limit window |
| 408 (Timeout) | ✅ Yes | Transient |
| 4xx (400, 401, 403, 404) | ❌ No | Client errors — retrying won't help |
| Network error (ECONNREFUSED) | ✅ Yes | Transient connectivity |
| Data validation error | ❌ No | Your data is wrong — retrying won't fix it |
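The 429 row deserves code: when the server sends a Retry-After header, waiting exactly that long beats guessing with backoff. Below is a sketch of a Retry-After-aware delay function suitable for axios-retry's retryDelay hook; the error shape follows Axios conventions, and the function name is ours:

```typescript
// Sketch: honor a 429's Retry-After header when present, otherwise fall back
// to exponential backoff with up to 30% jitter. Axios errors expose response
// headers at error.response.headers.
function retryAfterAwareDelay(retryCount: number, error: any): number {
  const retryAfter = error?.response?.headers?.['retry-after']
  if (error?.response?.status === 429 && retryAfter) {
    const seconds = Number(retryAfter)
    if (!Number.isNaN(seconds)) return seconds * 1000
    // Retry-After may also be an HTTP date
    const untilMs = Date.parse(retryAfter) - Date.now()
    if (!Number.isNaN(untilMs) && untilMs > 0) return untilMs
  }
  const base = 1000 * Math.pow(2, retryCount)
  return base + Math.random() * base * 0.3
}
```

Wire it in with `axiosRetry(client, { retries: 3, retryDelay: retryAfterAwareDelay })`.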

Pattern 2: Circuit Breakers

A circuit breaker wraps calls to an external service and tracks the failure rate. When failures exceed a threshold, it "opens" — immediately returning an error without calling the service. After a cooldown period, it allows a test request through ("half-open") and closes if that succeeds.
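The state machine is small enough to sketch before reaching for a library. This is a minimal illustration of the three states described above, not Cockatiel's implementation:

```typescript
// Minimal circuit breaker sketch: Closed -> Open after N consecutive
// failures, Open -> Half-Open after a cooldown, Half-Open -> Closed on
// success or back to Open on failure.
type BreakerState = 'closed' | 'open' | 'half-open'

class SimpleBreaker {
  private state: BreakerState = 'closed'
  private failures = 0
  private openedAt = 0

  constructor(private threshold: number, private cooldownMs: number) {}

  getState(): BreakerState {
    // Lazily transition Open -> Half-Open once the cooldown elapses
    if (this.state === 'open' && Date.now() - this.openedAt >= this.cooldownMs) {
      this.state = 'half-open'
    }
    return this.state
  }

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.getState() === 'open') {
      throw new Error('Circuit open, failing fast')
    }
    try {
      const result = await fn()
      // Any success closes the circuit and resets the failure count
      this.state = 'closed'
      this.failures = 0
      return result
    } catch (err) {
      this.failures++
      // A failure in Half-Open, or crossing the threshold, opens the circuit
      if (this.state === 'half-open' || this.failures >= this.threshold) {
        this.state = 'open'
        this.openedAt = Date.now()
      }
      throw err
    }
  }
}
```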

Implementing with Cockatiel

Cockatiel is the best TypeScript-native resilience library for Node.js:

npm install cockatiel

import {
  ConsecutiveBreaker,
  ExponentialBackoff,
  retry,
  circuitBreaker,
  handleAll,
  wrap,
} from 'cockatiel'

// Create a circuit breaker that opens after 5 consecutive failures
const shippingCircuitBreaker = circuitBreaker(handleAll, {
  halfOpenAfter: 30_000,  // Test recovery after 30 seconds
  breaker: new ConsecutiveBreaker(5),  // Open after 5 consecutive failures
})

// Create a retry policy with exponential backoff
const retryPolicy = retry(handleAll, {
  maxAttempts: 3,
  backoff: new ExponentialBackoff({ initialDelay: 500, maxDelay: 5000 }),
})

// Combine retry + circuit breaker (retry is outermost, so each attempt passes through the breaker)
const resilientShippingCall = wrap(retryPolicy, shippingCircuitBreaker)

// Usage
async function calculateShipping(orderId: string, destination: string) {
  try {
    return await resilientShippingCall.execute(async () => {
      const response = await fetch('https://shipping-api.com/calculate', {
        method: 'POST',
        body: JSON.stringify({ orderId, destination }),
        signal: AbortSignal.timeout(3000),  // 3 second timeout
      })
      if (!response.ok) throw new Error(`HTTP ${response.status}`)
      return response.json()
    })
  } catch (error) {
    // Circuit is open or all retries exhausted
    // Return a fallback value or throw a user-friendly error
    console.error('Shipping calculation failed, using fallback:', error)
    return { rate: 9.99, method: 'standard', estimate: '5-7 business days' }
  }
}

Monitoring Circuit Breaker State

shippingCircuitBreaker.onBreak(() => {
  console.error('Shipping API circuit OPENED — too many failures')
  metrics.circuitBreakerOpened.inc({ service: 'shipping-api' })
})

shippingCircuitBreaker.onReset(() => {
  console.info('Shipping API circuit CLOSED — service recovered')
  metrics.circuitBreakerReset.inc({ service: 'shipping-api' })
})

shippingCircuitBreaker.onHalfOpen(() => {
  console.info('Shipping API circuit HALF-OPEN — testing recovery')
})

Alerting on circuit breaker state changes is far more actionable than generic error rate alerts — you know exactly which service is failing and when it recovers.


Pattern 3: Bulkheads

A bulkhead limits concurrency to a downstream service. Without bulkheads, if your API receives 1,000 concurrent requests and all of them need to call the shipping API, you send 1,000 simultaneous requests to shipping. With a bulkhead of 20, you allow 20 concurrent calls and queue (or reject) the rest.

import { bulkhead, BulkheadRejectedError } from 'cockatiel'

// Allow max 20 concurrent calls to shipping API; queue up to 50 more
const shippingBulkhead = bulkhead(20, 50)  // bulkhead(limit, queueSize)

// Combined: retry + circuit breaker + bulkhead
const resilientShipping = wrap(retryPolicy, shippingCircuitBreaker, shippingBulkhead)

async function calculateShipping(orderId: string, destination: string) {
  try {
    return await resilientShipping.execute(async () => {
      return shippingApiClient.calculate({ orderId, destination })
    })
  } catch (error) {
    if (error instanceof BulkheadRejectedError) {
      // Queue full — return a degraded response immediately
      return { rate: 12.99, method: 'standard', estimate: '7-10 days', degraded: true }
    }
    throw error
  }
}

Bulkhead sizing heuristics:

  • Start with concurrency = expected_max_concurrent * 1.5
  • Monitor queue depth — if consistently > 0, increase concurrency limit
  • Monitor rejection rate — if non-zero, increase queue size or improve upstream throughput
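To make the limit/queue/reject mechanics concrete, here is a minimal bulkhead sketch: a counting semaphore with a bounded wait queue. Cockatiel's bulkhead adds cancellation and instrumentation on top of the same idea:

```typescript
// Minimal bulkhead sketch: at most `limit` callers run at once; up to
// `queueSize` more wait for a slot; anything beyond that is rejected.
class SimpleBulkhead {
  private active = 0
  private waiting: Array<() => void> = []

  constructor(private limit: number, private queueSize: number) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.active >= this.limit) {
      if (this.waiting.length >= this.queueSize) {
        throw new Error('Bulkhead queue full, rejecting')
      }
      // Park this caller until a slot frees up
      await new Promise<void>((resolve) => this.waiting.push(resolve))
    }
    this.active++
    try {
      return await fn()
    } finally {
      this.active--
      this.waiting.shift()?.()  // Wake the next queued caller, if any
    }
  }
}
```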

Pattern 4: Timeouts

Every external call needs a timeout. No exceptions. A request with no timeout can hold a database connection, a worker process, and memory indefinitely.

// Fetch: use AbortSignal.timeout (Node.js 18+)
const response = await fetch(url, {
  signal: AbortSignal.timeout(3000),  // 3 second timeout
})

// Axios: use timeout option
const client = axios.create({ timeout: 3000 })

// Cockatiel: timeout policy
import { timeout, TimeoutStrategy } from 'cockatiel'

const timeoutPolicy = timeout(3000, TimeoutStrategy.Aggressive)

// Aggressive vs Cooperative:
// Aggressive: rejects as soon as the timeout elapses, even if the underlying
//   operation is still running (it only actually stops if it honors the signal)
// Cooperative: waits for the operation itself to observe the cancellation

Timeout sizing:

  • Set timeouts at the p99 latency + buffer of the healthy service
  • If the shipping API's p99 is 800ms when healthy, set timeout at ~2000ms
  • Too tight: false positives on slow-but-successful calls
  • Too loose: you wait too long before activating fallback behavior
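Conceptually, a timeout is just a race between the operation and a timer, with an AbortSignal so cooperative operations can stop doing work early. A self-contained sketch of that idea (illustrative only; AbortSignal.timeout or Cockatiel's timeout policy does this for you):

```typescript
// Sketch: race an operation against a timer. On timeout we reject and also
// abort the signal so a cooperative operation can bail out early.
class TimeoutError extends Error {}

async function withTimeout<T>(
  ms: number,
  fn: (signal: AbortSignal) => Promise<T>
): Promise<T> {
  const controller = new AbortController()
  let timer: ReturnType<typeof setTimeout> | undefined
  const timeoutPromise = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort()  // Cooperative cancellation for the operation
      reject(new TimeoutError(`Timed out after ${ms}ms`))
    }, ms)
  })
  try {
    return await Promise.race([fn(controller.signal), timeoutPromise])
  } finally {
    if (timer) clearTimeout(timer)  // Don't leave the timer pending on success
  }
}
```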

Putting It All Together: A Production API Call

Here's how all four patterns compose for a single downstream integration:

import { retry, circuitBreaker, bulkhead, timeout, wrap,
  ExponentialBackoff, ConsecutiveBreaker, TimeoutStrategy,
  handleAll, BulkheadRejectedError } from 'cockatiel'

function createResilientPolicy(options: {
  maxRetries: number
  timeoutMs: number
  maxConcurrent: number
  queueSize: number
  breakAfterFailures: number
  cooldownMs: number
}) {
  const retryPolicy = retry(handleAll, {
    maxAttempts: options.maxRetries,
    backoff: new ExponentialBackoff({ initialDelay: 200, maxDelay: 5000 }),
  })

  const cbPolicy = circuitBreaker(handleAll, {
    halfOpenAfter: options.cooldownMs,
    breaker: new ConsecutiveBreaker(options.breakAfterFailures),
  })

  const bulkheadPolicy = bulkhead(options.maxConcurrent, options.queueSize)

  const timeoutPolicy = timeout(options.timeoutMs, TimeoutStrategy.Aggressive)

  // Order: leftmost is outermost. The timeout bounds the whole call (including
  // retries); each attempt then passes through the circuit breaker and bulkhead
  return wrap(timeoutPolicy, retryPolicy, cbPolicy, bulkheadPolicy)
}

// Create per-service resilience policies
const shippingPolicy = createResilientPolicy({
  maxRetries: 3,
  timeoutMs: 3000,
  maxConcurrent: 20,
  queueSize: 50,
  breakAfterFailures: 5,
  cooldownMs: 30_000,
})

const stripePolicy = createResilientPolicy({
  maxRetries: 2,     // Fewer retries — payments must be idempotent
  timeoutMs: 10000,  // Longer timeout — Stripe is slow sometimes
  maxConcurrent: 10,
  queueSize: 20,
  breakAfterFailures: 3,
  cooldownMs: 60_000,
})

Pattern 5: Fallbacks and Degraded Mode

When a service is unavailable, you have three options:

  1. Fail fast — return an error immediately (good for critical operations like payments)
  2. Return a default — return a safe fallback value (good for non-critical enrichment)
  3. Use a cache — return the last-known-good value from cache (good for read-heavy data)

import {
  fallback,
  circuitBreaker,
  timeout,
  wrap,
  handleAll,
  ConsecutiveBreaker,
  TimeoutStrategy,
} from 'cockatiel'
import { redis } from './redis'

// Fallback: return cached value if service is unavailable
const exchangeRateWithFallback = fallback(handleAll, async () => {
  // Try cache first
  const cached = await redis.get('exchange_rate:USD:EUR')
  if (cached) return JSON.parse(cached)

  // Hard fallback if cache is also empty
  return { rate: 0.92, source: 'fallback', stale: true }
})

const resilientExchangeRate = wrap(
  exchangeRateWithFallback,  // Outermost, so it catches timeouts and open-circuit errors
  timeout(2000, TimeoutStrategy.Aggressive),
  circuitBreaker(handleAll, { halfOpenAfter: 30_000, breaker: new ConsecutiveBreaker(5) })
)

async function getExchangeRate(from: string, to: string) {
  return resilientExchangeRate.execute(async () => {
    const rate = await currencyApiClient.getRate(from, to)
    // Cache the successful result for future fallbacks
    await redis.setex(`exchange_rate:${from}:${to}`, 3600, JSON.stringify(rate))
    return rate
  })
}

Degraded Mode UI Patterns

When an API returns degraded data, tell the user. A degraded response is better than an error, but users should know the data might be stale:

// API response includes degradation signals
interface ApiResponse<T> {
  data: T
  degraded?: {
    reason: 'cache' | 'fallback' | 'partial'
    staleSince?: string
    affectedFields?: string[]
  }
}

// The frontend shows a non-blocking banner
// "Shipping estimates may be delayed — using cached rates"
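Small helpers keep handlers from hand-rolling the degraded shape by hand. The helper names below are hypothetical; only the interface comes from the snippet above:

```typescript
// Response envelope from the article, plus hypothetical constructor helpers
interface ApiResponse<T> {
  data: T
  degraded?: {
    reason: 'cache' | 'fallback' | 'partial'
    staleSince?: string
    affectedFields?: string[]
  }
}

// Fresh data: no degradation metadata at all
function ok<T>(data: T): ApiResponse<T> {
  return { data }
}

// Degraded data: always tagged with why, and optionally since when
function degraded<T>(
  data: T,
  reason: 'cache' | 'fallback' | 'partial',
  staleSince?: string
): ApiResponse<T> {
  return { data, degraded: { reason, staleSince } }
}
```

A handler can then return `degraded(cachedRates, 'cache', cachedAt)` and the frontend keys the banner off the `degraded` field.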

Testing Resilience Patterns

Resilience patterns need to be tested — not just in production:

import { describe, it, expect, vi } from 'vitest'
import { calculateShipping } from '../shipping'
import { shippingApiClient } from '../clients/shipping-client'

describe('calculateShipping resilience', () => {
  it('falls back to default rate when service is unavailable', async () => {
    vi.spyOn(shippingApiClient, 'calculate').mockRejectedValue(
      new Error('ECONNREFUSED')
    )

    const result = await calculateShipping('order-123', 'New York, NY')

    expect(result.rate).toBe(9.99)
    expect(result.method).toBe('standard')
  })

  it('opens circuit after 5 consecutive failures', async () => {
    vi.spyOn(shippingApiClient, 'calculate').mockRejectedValue(
      new Error('Service unavailable')
    )

    // Trigger 5 failures to open the circuit
    for (let i = 0; i < 5; i++) {
      await calculateShipping('order-123', 'New York').catch(() => {})
    }

    // Now the circuit should be open — it should fail fast without calling the service
    const callCountBefore = vi.mocked(shippingApiClient.calculate).mock.calls.length
    await calculateShipping('order-456', 'Chicago').catch(() => {})
    const callCountAfter = vi.mocked(shippingApiClient.calculate).mock.calls.length

    // Circuit breaker should have prevented the call
    expect(callCountAfter).toBe(callCountBefore)
  })
})

Methodology

  • npm download data from npmjs.com API, March 2026 weekly averages
  • Package versions: Cockatiel v3.x, axios-retry v4.x
  • Sources: "Release It!" by Michael Nygard, Netflix Tech Blog, Cockatiel documentation, AWS re:Invent resilience talks

Explore API reliability tooling on APIScout — find packages for circuit breakers, retry logic, and observability.

Related: API Error Handling Patterns for Production 2026 · API Rate Limiting Best Practices 2026 · OpenTelemetry for API Observability 2026
