
How to Handle API Rate Limits Gracefully

APIScout Team

Tags: rate limiting, api integration, best practices, resilience, performance


Every API has rate limits. Hit them, and your requests fail with 429 errors. Handle them poorly, and your users see errors, your batch jobs crash, and your integrations break. Handle them well, and your app stays reliable even when you're pushing limits.

How Rate Limits Work

Common Rate Limit Types

| Type | How It Works | Example |
| --- | --- | --- |
| Requests per second | Fixed window of requests per second | 10 req/s |
| Requests per minute | Fixed window per minute | 100 req/min |
| Token bucket | Tokens refill at a steady rate; bursts allowed | 100 tokens, 10/s refill |
| Sliding window | Rolling time window, no burst edge | 100 req in any 60s window |
| Concurrent | Max simultaneous requests | 5 concurrent connections |
| Daily quota | Fixed daily limit | 10,000 req/day |
| Token-based (AI) | Tokens per minute (TPM) | 100K TPM |
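The practical difference between the window types shows up at the boundaries: a fixed window can admit up to twice the limit in a burst that straddles the reset, while a sliding window never can. A minimal sliding-window check, as an illustrative sketch (not tied to any provider):

```typescript
// Minimal sliding-window counter: at most `limit` calls in any
// rolling `windowMs` interval. Illustrative sketch only.
class SlidingWindow {
  private timestamps: number[] = [];

  constructor(private limit: number, private windowMs: number) {}

  tryAcquire(now: number = Date.now()): boolean {
    // Drop timestamps that have aged out of the rolling window
    this.timestamps = this.timestamps.filter(t => now - t < this.windowMs);
    if (this.timestamps.length >= this.limit) return false;
    this.timestamps.push(now);
    return true;
  }
}
```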

Rate Limit Headers

Most APIs tell you about limits in response headers:

HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 87
X-RateLimit-Reset: 1704067200
Retry-After: 30

# Or the newer draft standard (IETF draft-ietf-httpapi-ratelimit-headers):
RateLimit-Limit: 100
RateLimit-Remaining: 87
RateLimit-Reset: 30
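The two header families differ in more than the X- prefix: X-RateLimit-Reset is typically a Unix timestamp, while the draft standard's RateLimit-Reset is a delta in seconds. A small normalizing helper, sketched under those assumptions (verify the exact semantics against each provider's docs):

```typescript
interface RateLimitInfo {
  limit: number | null;
  remaining: number | null;
  resetSeconds: number | null; // seconds until the window resets
}

// Reads both legacy X-RateLimit-* and newer RateLimit-* headers.
// Assumes X-RateLimit-Reset is a Unix timestamp when it looks like one,
// and a delta in seconds otherwise.
function parseRateLimitHeaders(headers: Headers): RateLimitInfo {
  const get = (name: string) => {
    const v = headers.get(name) ?? headers.get(`X-${name}`);
    return v === null ? null : parseInt(v, 10);
  };
  const reset = get('RateLimit-Reset');
  return {
    limit: get('RateLimit-Limit'),
    remaining: get('RateLimit-Remaining'),
    resetSeconds:
      reset !== null && reset > 1_000_000_000 // looks like a Unix timestamp
        ? Math.max(0, reset - Math.floor(Date.now() / 1000))
        : reset,
  };
}
```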

The 429 Response

HTTP/1.1 429 Too Many Requests
Retry-After: 30
Content-Type: application/json

{
  "error": {
    "code": "rate_limit_exceeded",
    "message": "Rate limit exceeded. Please retry after 30 seconds.",
    "retry_after": 30
  }
}

Pattern 1: Exponential Backoff with Jitter

The most important pattern. Retry failed requests with increasing delays.

async function fetchWithRetry<T>(
  url: string,
  options: RequestInit,
  maxRetries = 5
): Promise<T> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      const response = await fetch(url, options);

      if (response.status === 429) {
        // Respect Retry-After if present (assumes delta-seconds; per
        // RFC 9110 it may also be an HTTP-date, which some APIs use)
        const retryAfter = response.headers.get('Retry-After');
        const waitMs = retryAfter
          ? parseInt(retryAfter, 10) * 1000
          : calculateBackoff(attempt);

        console.log(`Rate limited. Waiting ${waitMs}ms before retry ${attempt + 1}`);
        await sleep(waitMs);
        continue;
      }

      if (response.status >= 500 && attempt < maxRetries) {
        // Server error — also worth retrying
        await sleep(calculateBackoff(attempt));
        continue;
      }

      if (!response.ok) {
        throw new Error(`HTTP ${response.status}: ${await response.text()}`);
      }

      return response.json();
    } catch (error) {
      if (attempt === maxRetries) throw error;
      if (error instanceof TypeError) {
        // Network error — retry
        await sleep(calculateBackoff(attempt));
        continue;
      }
      throw error;
    }
  }

  throw new Error('Max retries exceeded');
}

function calculateBackoff(attempt: number): number {
  // Exponential backoff: 1s, 2s, 4s, 8s, 16s
  const baseMs = Math.pow(2, attempt) * 1000;
  // Add jitter: random ±50% to prevent thundering herd
  const jitter = baseMs * (0.5 + Math.random());
  // Cap at 30 seconds
  return Math.min(jitter, 30000);
}

function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}

Why jitter matters: Without jitter, all retry requests hit the API at the same time (thundering herd). Jitter spreads them out.

Pattern 2: Client-Side Rate Limiting

Don't wait for 429s — prevent them by throttling requests yourself.

class RateLimiter {
  private queue: Array<{
    execute: () => Promise<any>;
    resolve: (value: any) => void;
    reject: (error: any) => void;
  }> = [];
  private activeCount = 0;
  private timestamps: number[] = [];

  constructor(
    private maxPerSecond: number,
    private maxConcurrent: number = 10
  ) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    return new Promise((resolve, reject) => {
      this.queue.push({ execute: fn, resolve, reject });
      this.processQueue();
    });
  }

  private async processQueue() {
    if (this.queue.length === 0) return;
    if (this.activeCount >= this.maxConcurrent) return;

    // Clean old timestamps
    const now = Date.now();
    this.timestamps = this.timestamps.filter(t => now - t < 1000);

    if (this.timestamps.length >= this.maxPerSecond) {
      // Wait until oldest timestamp expires
      const waitMs = 1000 - (now - this.timestamps[0]);
      setTimeout(() => this.processQueue(), waitMs);
      return;
    }

    const item = this.queue.shift();
    if (!item) return;

    this.activeCount++;
    this.timestamps.push(now);

    try {
      const result = await item.execute();
      item.resolve(result);
    } catch (error) {
      item.reject(error);
    } finally {
      this.activeCount--;
      this.processQueue();
    }
  }
}

// Usage
const limiter = new RateLimiter(10, 5); // 10 req/s, 5 concurrent

const results = await Promise.all(
  userIds.map(id =>
    limiter.execute(() => fetch(`/api/users/${id}`).then(r => r.json()))
  )
);

Pattern 3: Token Bucket

For APIs with token-bucket rate limiting (like AI APIs with tokens-per-minute):

class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private maxTokens: number,
    private refillRate: number, // tokens per second
  ) {
    this.tokens = maxTokens;
    this.lastRefill = Date.now();
  }

  private refill() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.maxTokens, this.tokens + elapsed * this.refillRate);
    this.lastRefill = now;
  }

  async consume(count: number): Promise<void> {
    this.refill();

    if (this.tokens >= count) {
      this.tokens -= count;
      return;
    }

    // Not enough tokens: wait until the deficit refills
    // (assumes sequential callers; concurrent consumers could
    // briefly drive the balance negative)
    const deficit = count - this.tokens;
    const waitMs = (deficit / this.refillRate) * 1000;
    await new Promise(resolve => setTimeout(resolve, waitMs));
    this.refill();
    this.tokens -= count;
  }
}

// Usage with AI API (tokens per minute)
const bucket = new TokenBucket(100000, 100000 / 60); // 100K TPM

async function callAI(prompt: string) {
  const estimatedTokens = Math.ceil(prompt.length / 4); // rough estimate: ~4 chars per token
  await bucket.consume(estimatedTokens);
  return openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: prompt }],
  });
}

Pattern 4: Queue-Based Processing

For batch jobs that need to process thousands of items:

class BatchProcessor<T, R> {
  private queue: T[] = [];
  private results: Map<number, R> = new Map();

  constructor(
    private processFn: (item: T) => Promise<R>,
    private options: {
      maxPerSecond: number;
      maxConcurrent: number;
      onProgress?: (completed: number, total: number) => void;
    }
  ) {}

  async process(items: T[]): Promise<R[]> {
    this.queue = [...items];
    const total = items.length;
    let completed = 0;
    let active = 0;
    const results: R[] = new Array(total);

    if (total === 0) return results; // avoid a promise that never resolves

    return new Promise((resolve, reject) => {
      const interval = setInterval(() => {
        // Start at most one request per tick so the interval spacing
        // (1000 / maxPerSecond) actually enforces the rate; a while
        // loop here could burst past the per-second limit
        if (
          active < this.options.maxConcurrent &&
          this.queue.length > 0
        ) {
          const index = total - this.queue.length;
          const item = this.queue.shift()!;
          active++;

          this.processFn(item)
            .then(result => {
              results[index] = result;
              completed++;
              active--;
              this.options.onProgress?.(completed, total);

              if (completed === total) {
                clearInterval(interval);
                resolve(results);
              }
            })
            .catch(error => {
              clearInterval(interval);
              reject(error);
            });
        }
      }, 1000 / this.options.maxPerSecond);
    });
  }
}

// Usage
const processor = new BatchProcessor(
  async (userId: string) => {
    const response = await fetch(`/api/users/${userId}`);
    return response.json();
  },
  {
    maxPerSecond: 10,
    maxConcurrent: 5,
    onProgress: (done, total) => console.log(`${done}/${total}`),
  }
);

const allUsers = await processor.process(userIds);

Pattern 5: Adaptive Rate Limiting

Automatically adjust your request rate based on API responses:

class AdaptiveRateLimiter {
  private requestsPerSecond: number;
  private consecutiveSuccesses = 0;
  private consecutiveFailures = 0;

  constructor(
    private initialRate: number,
    private maxRate: number,
    private minRate: number = 1
  ) {
    this.requestsPerSecond = initialRate;
  }

  onSuccess() {
    this.consecutiveSuccesses++;
    this.consecutiveFailures = 0;

    // Increase rate after 10 consecutive successes
    if (this.consecutiveSuccesses >= 10) {
      this.requestsPerSecond = Math.min(
        this.maxRate,
        this.requestsPerSecond * 1.2
      );
      this.consecutiveSuccesses = 0;
    }
  }

  onRateLimit() {
    this.consecutiveFailures++;
    this.consecutiveSuccesses = 0;

    // Cut rate in half on rate limit
    this.requestsPerSecond = Math.max(
      this.minRate,
      this.requestsPerSecond * 0.5
    );
  }

  getDelayMs(): number {
    return 1000 / this.requestsPerSecond;
  }
}

Provider-Specific Rate Limits

Quick Reference

| Provider | Rate Limit | Headers | Retry Strategy |
| --- | --- | --- | --- |
| Stripe | 100/s (live), 25/s (test) | Standard X-RateLimit-* | Exponential backoff |
| OpenAI | TPM + RPM per model | Standard + usage headers | Exponential backoff, token estimation |
| Anthropic | TPM + RPM per tier | Standard | Backoff + tier upgrade |
| Twilio | 100/s per account | Standard | Backoff + request queuing |
| GitHub | 5,000/hour (auth) | X-RateLimit-* | Respect reset time |
| Shopify | 2/s (REST), cost-based (GraphQL) | X-Shopify-Shop-Api-Call-Limit | Leaky bucket |
| Algolia | Varies by plan | Standard | Client-side limiting |
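"Respect reset time" for GitHub means sleeping until the moment in its X-RateLimit-Reset header (a Unix timestamp in seconds) rather than backing off blindly. A hypothetical helper:

```typescript
// Sketch: compute how long to wait before retrying, based on the
// X-RateLimit-Reset header (Unix timestamp in seconds). Returns 0 if
// the header is missing or the reset moment has already passed.
function msUntilReset(headers: Headers, nowMs: number = Date.now()): number {
  const reset = headers.get('X-RateLimit-Reset');
  if (reset === null) return 0;
  return Math.max(0, parseInt(reset, 10) * 1000 - nowMs);
}
```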

Monitoring Rate Limits

// Track rate limit usage
class RateLimitMonitor {
  private metrics = {
    totalRequests: 0,
    rateLimitedRequests: 0,
    totalRetries: 0,
    avgRetryDelay: 0,
  };

  recordRequest(wasRateLimited: boolean, retryCount: number, retryDelayMs: number) {
    this.metrics.totalRequests++;
    if (wasRateLimited) {
      this.metrics.rateLimitedRequests++;
      this.metrics.totalRetries += retryCount;
      // Running average of retry delay across rate-limited requests
      const n = this.metrics.rateLimitedRequests;
      this.metrics.avgRetryDelay += (retryDelayMs - this.metrics.avgRetryDelay) / n;
    }
  }

  getReport() {
    const { totalRequests, rateLimitedRequests } = this.metrics;
    // Guard against division by zero before any requests are recorded
    const rateLimitRate = totalRequests > 0 ? rateLimitedRequests / totalRequests : 0;
    return {
      ...this.metrics,
      rateLimitRate,
      recommendation: rateLimitRate > 0.05
        ? 'Consider reducing request rate or upgrading API tier'
        : 'Rate limit handling is healthy',
    };
  }
}

Common Mistakes

| Mistake | Impact | Fix |
| --- | --- | --- |
| No retry on 429 | Requests fail permanently | Implement exponential backoff |
| Retry without backoff | Makes rate limiting worse | Add exponential delay + jitter |
| Ignoring Retry-After header | Retrying too soon | Parse and respect Retry-After |
| No client-side throttling | Hit 429s constantly | Pre-limit requests to known rate |
| Fixed delay retries | Thundering herd problem | Add jitter to retry delays |
| No monitoring of 429 rates | Don't know you have a problem | Track rate limit hit percentage |
| Retrying on all errors | Retrying permanent failures | Only retry 429 and 5xx |
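The last two rows boil down to one predicate: retry only 429 and transient 5xx responses; other 4xx client errors will fail identically on every attempt. A sketch:

```typescript
// Decide whether a failed response is worth retrying. Only 429 and
// 5xx statuses qualify; 4xx client errors (bad request, auth failure,
// not found) are permanent and should surface immediately.
function isRetryable(status: number): boolean {
  return status === 429 || (status >= 500 && status < 600);
}
```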

Compare API rate limits across providers on APIScout — find the most generous limits and best rate limit handling documentation.
