
API Rate Limiting Best Practices for 2026

By the APIScout Team

Tags: rate limiting, API design, API security, best practices, API architecture

Rate limiting protects your API from abuse, ensures fair usage, and prevents a single client from consuming all resources. Done well, rate limiting is invisible to normal users. Done poorly, it frustrates legitimate clients and leaks implementation details to attackers.

Why Rate Limit

  1. Protect infrastructure — prevent a single client from overwhelming your servers
  2. Ensure fairness — no client monopolizes shared resources
  3. Control costs — limit expensive operations (AI inference, database queries)
  4. Prevent abuse — slow down brute force, scraping, and enumeration attacks
  5. Enable pricing tiers — different limits for free/pro/enterprise plans

Rate Limiting Algorithms

1. Fixed Window

Count requests in fixed time windows (e.g., 100 requests per minute, resetting at :00, :01, :02...).

Pros: Simple to implement and understand. Cons: Burst problem — a client can send 100 requests at :59 and 100 more at :00 (200 in 2 seconds).
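
As a sketch, a fixed-window limiter is little more than a per-window counter. The class and method names here are illustrative, not from any particular library:

```python
import time


class FixedWindowLimiter:
    """Allow `limit` requests per `window`-second window, resetting at window edges."""

    def __init__(self, limit, window=60):
        self.limit = limit
        self.window = window
        self.counts = {}  # window start time -> request count

    def allow(self, now=None):
        now = time.time() if now is None else now
        window_start = int(now // self.window) * self.window
        if self.counts.get(window_start, 0) >= self.limit:
            return False  # window exhausted: reject
        self.counts[window_start] = self.counts.get(window_start, 0) + 1
        return True
```

Note how the burst problem shows up directly: requests at `now=59` and `now=60` fall into different windows, so both are counted against fresh budgets.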

2. Sliding Window Log

Track timestamps of all requests. Count requests in a sliding window (last 60 seconds from now).

Pros: No burst problem. Accurate rate enforcement. Cons: Memory-intensive — stores every request timestamp.
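
A minimal sketch of the log approach, keeping one timestamp per accepted request (names are illustrative):

```python
from collections import deque


class SlidingWindowLog:
    """Exact sliding-window limiter: stores a timestamp per accepted request."""

    def __init__(self, limit, window=60.0):
        self.limit = limit
        self.window = window
        self.log = deque()  # timestamps of accepted requests, oldest first

    def allow(self, now):
        # Evict timestamps that have slid out of the window
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) >= self.limit:
            return False
        self.log.append(now)
        return True
```

The memory cost is visible here: at 1,000 requests/minute per client, the deque holds up to 1,000 timestamps per client at all times.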

3. Sliding Window Counter

Hybrid of fixed and sliding. Approximate the sliding-window count by weighting the previous fixed window's count by how much of it still overlaps the sliding window, then adding the current window's count.

Pros: Memory-efficient, smooth enforcement. Used by Cloudflare. Cons: Approximate — not exact counts.
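
A sketch of the weighted estimate, assuming two counters per client (the exact bookkeeping varies by implementation; this is one common formulation, not Cloudflare's actual code):

```python
class SlidingWindowCounter:
    """Approximate sliding window: weight the previous fixed window's count
    by how much of it still overlaps the sliding window."""

    def __init__(self, limit, window=60.0):
        self.limit = limit
        self.window = window
        self.prev_count = 0
        self.curr_count = 0
        self.curr_index = 0  # index of the current fixed window

    def allow(self, now):
        index = int(now // self.window)
        if index == self.curr_index + 1:
            # Moved one window forward: current becomes previous
            self.prev_count, self.curr_count = self.curr_count, 0
            self.curr_index = index
        elif index > self.curr_index + 1:
            # Skipped ahead: both stored windows are stale
            self.prev_count = self.curr_count = 0
            self.curr_index = index
        # Fraction of the previous window still inside the sliding window
        overlap = 1.0 - (now - self.curr_index * self.window) / self.window
        estimated = self.prev_count * overlap + self.curr_count
        if estimated >= self.limit:
            return False
        self.curr_count += 1
        return True
```

Two integers per client replace the full timestamp log, at the cost of assuming requests were evenly spread across the previous window.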

4. Token Bucket

A bucket fills with tokens at a steady rate (e.g., 10 tokens/second). Each request consumes a token. Empty bucket = rate limited. Bucket has a max capacity for burst allowance.

Pros: Allows controlled bursts. Smooth average rate. Used by AWS, Stripe. Cons: Slightly more complex to implement.
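
A minimal token bucket sketch with lazy refill (no background timer needed; names are illustrative):

```python
class TokenBucket:
    """Refill `rate` tokens per second up to `capacity`; each request spends one.
    `capacity` is the burst allowance; `rate` is the sustained average."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start full
        self.last = 0.0

    def allow(self, now):
        # Lazily refill based on time elapsed since the last request
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With `rate=10` and `capacity=5`, a client can burst 5 requests instantly, then sustain 10 requests/second.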

5. Leaky Bucket

Requests enter a queue (bucket) and are processed at a fixed rate. If the queue is full, requests are dropped.

Pros: Guarantees a smooth, constant output rate. Cons: Doesn't allow bursts. Introduces latency (queued requests wait).
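
A queue-based sketch of the leaky bucket, draining lazily on each arrival (illustrative names; a production version would process queued requests asynchronously):

```python
from collections import deque


class LeakyBucket:
    """Queue up to `capacity` requests and drain `rate` per second.
    A full queue means the incoming request is dropped (returns False)."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.queue = deque()
        self.last_leak = 0.0

    def offer(self, request, now):
        # Drain whole requests that would have been processed since the last check
        leaked = int((now - self.last_leak) * self.rate)
        if leaked > 0:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()
            self.last_leak = now
        if len(self.queue) >= self.capacity:
            return False  # bucket full: drop
        self.queue.append(request)
        return True
```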

Recommendation: Token bucket for most APIs. It allows reasonable bursts while maintaining average rate limits.

Response Headers

Always include rate limit information in response headers. HTTP doesn't standardize these yet, but the IETF RateLimit header fields draft (draft-ietf-httpapi-ratelimit-headers) is gaining adoption:

RateLimit-Limit: 100
RateLimit-Remaining: 67
RateLimit-Reset: 1710892800

Standard headers to include:

| Header              | Description                       | Example    |
| ------------------- | --------------------------------- | ---------- |
| RateLimit-Limit     | Max requests in window            | 100        |
| RateLimit-Remaining | Requests left in window           | 67         |
| RateLimit-Reset     | Unix timestamp when window resets | 1710892800 |
| Retry-After         | Seconds to wait (on 429 response) | 30         |

The 429 Response

When a client exceeds the limit, return 429 Too Many Requests:

{
  "error": {
    "code": "rate_limit_exceeded",
    "message": "Rate limit exceeded. Retry after 30 seconds.",
    "retry_after": 30
  }
}

Always include:

  • Retry-After header (seconds or HTTP-date)
  • RateLimit-Reset header (when the window resets)
  • Human-readable error message
  • Machine-readable error code
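
Putting the headers and body together, a framework-agnostic sketch of building that 429 response (the function name and return shape are illustrative, not tied to any framework):

```python
import json


def rate_limit_response(retry_after, limit, remaining, reset_ts):
    """Build status code, headers, and JSON body for a 429 response."""
    headers = {
        "Retry-After": str(retry_after),
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": str(remaining),
        "RateLimit-Reset": str(reset_ts),
    }
    body = json.dumps({
        "error": {
            "code": "rate_limit_exceeded",  # machine-readable
            "message": f"Rate limit exceeded. Retry after {retry_after} seconds.",
            "retry_after": retry_after,
        }
    })
    return 429, headers, body
```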

Implementation Patterns

Per-Client Limiting

Rate limit by API key, user ID, or OAuth token. Each client gets their own counter.

Client A: 45/100 requests used
Client B: 12/100 requests used
Client C: 100/100 requests used → 429
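
Per-client state is typically just a map from client key to counter. A sketch using a plain counter for brevity (any algorithm above can be substituted per client; names are illustrative):

```python
from collections import defaultdict


class PerClientLimiter:
    """One independent counter per client key (API key, user ID, or token)."""

    def __init__(self, limit):
        self.limit = limit
        self.counts = defaultdict(int)

    def allow(self, client_key):
        if self.counts[client_key] >= self.limit:
            return False  # this client gets a 429; others are unaffected
        self.counts[client_key] += 1
        return True
```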

Per-Endpoint Limiting

Different limits for different endpoints. Expensive operations (search, AI inference) get lower limits than simple reads.

GET /api/users: 1000/minute
POST /api/search: 10/minute
POST /api/ai/generate: 5/minute

Tiered Limiting

Different limits per pricing tier. Free users get less. Enterprise gets more.

Free: 100 requests/hour
Pro: 1,000 requests/hour
Enterprise: 10,000 requests/hour

Cost-Based Limiting

Assign costs to operations. Complex queries cost more "tokens" than simple reads.

GET /users → 1 token
GET /users?include=orders,reviews → 3 tokens
POST /search → 5 tokens
POST /ai/generate → 10 tokens
Budget: 1000 tokens/minute
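
The budget above can be sketched as a fixed-window counter that deducts a per-operation cost instead of a flat 1 (the cost table and class names here are hypothetical, mirroring the example costs):

```python
# Hypothetical per-operation costs mirroring the example above
COSTS = {
    "GET /users": 1,
    "GET /users?include=orders,reviews": 3,
    "POST /search": 5,
    "POST /ai/generate": 10,
}


class CostBudget:
    """Fixed-window token budget: each operation deducts its cost."""

    def __init__(self, budget, window=60.0):
        self.budget = budget
        self.window = window
        self.spent = 0
        self.window_start = 0.0

    def allow(self, operation, now):
        if now - self.window_start >= self.window:
            # New window: reset the budget
            self.window_start = now - (now % self.window)
            self.spent = 0
        cost = COSTS.get(operation, 1)
        if self.spent + cost > self.budget:
            return False
        self.spent += cost
        return True
```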

Client-Side Best Practices

Respect Rate Limits

  1. Read the headers — check RateLimit-Remaining before sending requests
  2. Handle 429s gracefully — don't retry immediately, wait Retry-After seconds
  3. Implement exponential backoff — 1s, 2s, 4s, 8s, with jitter
  4. Queue requests — buffer outgoing requests and drain at the allowed rate
  5. Cache responses — reduce unnecessary API calls

Exponential Backoff Pattern

Attempt 1: wait 1s + random(0-1s)
Attempt 2: wait 2s + random(0-2s)
Attempt 3: wait 4s + random(0-4s)
Attempt 4: wait 8s + random(0-8s)
Max attempts: 5 (then fail)

The random jitter prevents thundering herd — 1,000 clients all retrying at exactly 1s, 2s, 4s creates synchronized spikes.
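
The retry schedule above can be sketched as a small wrapper. Here `request` is a stand-in that returns an HTTP status code, and `sleep` is injectable for testing; both are illustrative conventions, not a real client library's API:

```python
import random
import time


def call_with_backoff(request, max_attempts=5, base=1.0, sleep=time.sleep):
    """Retry `request` with exponential backoff plus full jitter;
    give up after `max_attempts` attempts."""
    for attempt in range(max_attempts):
        status = request()
        if status != 429:
            return status
        delay = base * (2 ** attempt)            # 1s, 2s, 4s, 8s, ...
        sleep(delay + random.uniform(0, delay))  # jitter desynchronizes clients
    raise RuntimeError("still rate limited after max attempts")
```

A production client should also honor the server's `Retry-After` header when present, using it as a floor for the computed delay.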

Common Mistakes

| Mistake                      | Why It's Bad                                                | Do This Instead                                  |
| ---------------------------- | ----------------------------------------------------------- | ------------------------------------------------ |
| No rate limit headers        | Clients can't self-throttle                                 | Always include RateLimit-* headers               |
| 500 instead of 429           | Client retries thinking it's a server error                 | Return 429 with Retry-After                      |
| Rate limit by IP only        | Penalizes NAT/office users; trivial to bypass with proxies  | Rate limit by API key/token                      |
| Same limit for all endpoints | Expensive operations can still overwhelm                    | Per-endpoint or cost-based limits                |
| No burst allowance           | Rejects legitimate traffic spikes                           | Use token bucket with burst capacity             |
| Hard limits with no warning  | Clients discover limits by hitting them                     | Document limits, send approaching-limit warnings |

Building APIs with rate limiting? Explore API management tools and best practices on APIScout — architecture guides, gateway comparisons, and developer resources.
