API Rate Limiting Best Practices for 2026
Rate limiting protects your API from abuse, ensures fair usage, and prevents a single client from consuming all resources. Done well, rate limiting is invisible to normal users. Done poorly, it frustrates legitimate clients and leaks implementation details to attackers.
Why Rate Limit
- Protect infrastructure — prevent a single client from overwhelming your servers
- Ensure fairness — no client monopolizes shared resources
- Control costs — limit expensive operations (AI inference, database queries)
- Prevent abuse — slow down brute force, scraping, and enumeration attacks
- Enable pricing tiers — different limits for free/pro/enterprise plans
Rate Limiting Algorithms
1. Fixed Window
Count requests in fixed time windows (e.g., 100 requests per minute, resetting at :00, :01, :02...).
Pros: Simple to implement and understand. Cons: Burst problem — a client can send 100 requests at :59 and 100 more at :00 (200 in 2 seconds).
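A fixed window counter fits in a few lines. This in-memory Python sketch (class and method names are illustrative, not any particular library's API) takes an explicit `now` so the windowing is easy to see:

```python
from collections import defaultdict

class FixedWindowLimiter:
    """Counts requests per client in fixed time windows."""

    def __init__(self, limit: int, window_seconds: int = 60):
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)  # (client, window_id) -> count

    def allow(self, client: str, now: float) -> bool:
        window_id = int(now // self.window)  # e.g. minute number since epoch
        key = (client, window_id)
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True
```

The burst problem is visible here: a request at second 59 and one at second 60 land in different windows, so a client can spend two full limits back to back.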
2. Sliding Window Log
Track timestamps of all requests. Count requests in a sliding window (last 60 seconds from now).
Pros: No burst problem. Accurate rate enforcement. Cons: Memory-intensive — stores every request timestamp.
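A sliding window log can be sketched with a deque per client (again, names are illustrative): evict timestamps older than the window, then count what remains.

```python
from collections import defaultdict, deque

class SlidingWindowLog:
    """Stores a timestamp per request; counts those inside the window."""

    def __init__(self, limit: int, window_seconds: float = 60.0):
        self.limit = limit
        self.window = window_seconds
        self.logs = defaultdict(deque)  # client -> request timestamps

    def allow(self, client: str, now: float) -> bool:
        log = self.logs[client]
        # Evict timestamps that have fallen out of the sliding window.
        while log and log[0] <= now - self.window:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True
```

The memory cost is clear from the data structure: one entry per allowed request, per client, for the length of the window.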
3. Sliding Window Counter
Hybrid of fixed and sliding. Estimate the count in the current window based on the previous window's count.
Pros: Memory-efficient, smooth enforcement. Used by Cloudflare. Cons: Approximate — not exact counts.
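The estimate needs only two counters per client. If you are 25% into the current window, 75% of the previous window still overlaps the sliding window, so its count is weighted accordingly:

```python
def sliding_window_estimate(prev_count: int, curr_count: int,
                            elapsed_fraction: float) -> float:
    """Weight the previous window's count by how much of it still
    overlaps the sliding window, then add the current count."""
    return prev_count * (1.0 - elapsed_fraction) + curr_count
```

For example, with 100 requests in the previous minute, 20 so far in the current one, and 25% of the current minute elapsed, the estimate is 100 × 0.75 + 20 = 95.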
4. Token Bucket
A bucket fills with tokens at a steady rate (e.g., 10 tokens/second). Each request consumes a token. Empty bucket = rate limited. Bucket has a max capacity for burst allowance.
Pros: Allows controlled bursts. Smooth average rate. Used by AWS, Stripe. Cons: Slightly more complex to implement.
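A token bucket can be implemented lazily: instead of a timer that adds tokens, refill in proportion to elapsed time on each request. A minimal sketch (the explicit `now` parameter stands in for a monotonic clock):

```python
class TokenBucket:
    """Refills at `rate` tokens/second up to `capacity`, the burst allowance."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full: allows an initial burst
        self.last = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

In production you would pass `time.monotonic()` rather than wall-clock time, so clock adjustments can't mint or destroy tokens.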
5. Leaky Bucket
Requests enter a queue (bucket) and are processed at a fixed rate. If the queue is full, requests are dropped.
Pros: Guarantees a smooth, constant output rate. Cons: Doesn't allow bursts. Introduces latency (queued requests wait).
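The queue-and-drain behavior can be sketched the same way, draining lazily on each arrival (whole requests only, for simplicity):

```python
from collections import deque

class LeakyBucket:
    """Requests queue up and drain at a fixed rate; a full queue drops."""

    def __init__(self, rate: float, capacity: int, now: float = 0.0):
        self.rate = rate          # requests drained per second
        self.capacity = capacity  # maximum queue depth
        self.queue = deque()
        self.last = now

    def offer(self, request: str, now: float) -> bool:
        # Drain whole requests for the time elapsed since the last drain.
        leaked = int((now - self.last) * self.rate)
        if leaked:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()
            self.last = now
        if len(self.queue) >= self.capacity:
            return False  # bucket overflow: drop the request
        self.queue.append(request)
        return True
```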
Recommendation: Token bucket for most APIs. It allows reasonable bursts while maintaining average rate limits.
Response Headers
Always include rate limit information in response headers. HTTP doesn't standardize these yet, but the IETF RateLimit header fields draft (draft-ietf-httpapi-ratelimit-headers) is gaining adoption:
RateLimit-Limit: 100
RateLimit-Remaining: 67
RateLimit-Reset: 1710892800
Standard headers to include:
| Header | Description | Example |
|---|---|---|
| RateLimit-Limit | Max requests in window | 100 |
| RateLimit-Remaining | Requests left in window | 67 |
| RateLimit-Reset | Unix timestamp when window resets | 1710892800 |
| Retry-After | Seconds to wait (on 429 response) | 30 |
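Building the headers is framework-agnostic. A small helper like this (the function name and `reset_at`-as-Unix-timestamp convention are assumptions, not from any specific library) can be called from whatever middleware you use:

```python
import math

def rate_limit_headers(limit: int, remaining: int, reset_at: int,
                       now: float = 0.0) -> dict:
    """Builds RateLimit-* headers; adds Retry-After when the client
    has exhausted its quota (i.e. on a 429 response)."""
    headers = {
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": str(max(0, remaining)),
        "RateLimit-Reset": str(reset_at),
    }
    if remaining <= 0:
        headers["Retry-After"] = str(max(1, math.ceil(reset_at - now)))
    return headers
```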
The 429 Response
When a client exceeds the limit, return 429 Too Many Requests:
{
  "error": {
    "code": "rate_limit_exceeded",
    "message": "Rate limit exceeded. Retry after 30 seconds.",
    "retry_after": 30
  }
}
Always include:
- Retry-After header (seconds or HTTP-date)
- RateLimit-Reset header (when the window resets)
- Human-readable error message
- Machine-readable error code
Implementation Patterns
Per-Client Limiting
Rate limit by API key, user ID, or OAuth token. Each client gets their own counter.
Client A: 45/100 requests used
Client B: 12/100 requests used
Client C: 100/100 requests used → 429
Per-Endpoint Limiting
Different limits for different endpoints. Expensive operations (search, AI inference) get lower limits than simple reads.
GET /api/users: 1000/minute
POST /api/search: 10/minute
POST /api/ai/generate: 5/minute
Tiered Limiting
Different limits per pricing tier. Free users get less. Enterprise gets more.
Free: 100 requests/hour
Pro: 1,000 requests/hour
Enterprise: 10,000 requests/hour
Cost-Based Limiting
Assign costs to operations. Complex queries cost more "tokens" than simple reads.
GET /users → 1 token
GET /users?include=orders,reviews → 3 tokens
POST /search → 5 tokens
POST /ai/generate → 10 tokens
Budget: 1000 tokens/minute
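Cost-based limiting composes naturally with a token bucket: each operation consumes its cost instead of a flat one token. A sketch with a hypothetical cost table (the routes and costs mirror the example above):

```python
# Hypothetical cost table; unknown routes default to 1 token.
COSTS = {
    ("GET", "/users"): 1,
    ("POST", "/search"): 5,
    ("POST", "/ai/generate"): 10,
}

class CostBudget:
    """Token-bucket budget where each operation consumes its own cost."""

    def __init__(self, budget: int, window_seconds: float = 60.0,
                 now: float = 0.0):
        self.rate = budget / window_seconds  # tokens refilled per second
        self.capacity = budget
        self.tokens = float(budget)
        self.last = now

    def charge(self, method: str, path: str, now: float) -> bool:
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        cost = COSTS.get((method, path), 1)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```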
Client-Side Best Practices
Respect Rate Limits
- Read the headers — check RateLimit-Remaining before sending requests
- Handle 429s gracefully — don't retry immediately, wait Retry-After seconds
- Implement exponential backoff — 1s, 2s, 4s, 8s, with jitter
- Queue requests — buffer outgoing requests and drain at the allowed rate
- Cache responses — reduce unnecessary API calls
Exponential Backoff Pattern
Attempt 1: wait 1s + random(0-1s)
Attempt 2: wait 2s + random(0-2s)
Attempt 3: wait 4s + random(0-4s)
Attempt 4: wait 8s + random(0-8s)
Max attempts: 5 (then fail)
The random jitter prevents thundering herd — 1,000 clients all retrying at exactly 1s, 2s, 4s creates synchronized spikes.
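The retry loop above can be sketched as follows. Here `send` is a hypothetical callable returning a `(status, retry_after)` pair; the loop honors the server's Retry-After when present and falls back to exponential backoff with full jitter otherwise:

```python
import random
import time

def retry_with_backoff(send, max_attempts: int = 5, base: float = 1.0):
    """Retries `send()` on 429 with exponential backoff plus full jitter."""
    for attempt in range(max_attempts):
        status, retry_after = send()
        if status != 429:
            return status
        # Prefer the server's Retry-After; otherwise base * 2^attempt.
        delay = retry_after if retry_after else base * (2 ** attempt)
        # Full jitter desynchronizes clients that were limited together.
        time.sleep(delay + random.uniform(0, delay))
    raise RuntimeError(f"rate limited after {max_attempts} attempts")
```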
Common Mistakes
| Mistake | Why It's Bad | Do This Instead |
|---|---|---|
| No rate limit headers | Clients can't self-throttle | Always include RateLimit-* headers |
| 500 instead of 429 | Client retries thinking it's a server error | Return 429 with Retry-After |
| Rate limit by IP only | Penalizes NAT/office users, trivial to bypass with proxies | Rate limit by API key/token |
| Same limit for all endpoints | Expensive operations can still overwhelm | Per-endpoint or cost-based limits |
| No burst allowance | Rejects legitimate traffic spikes | Use token bucket with burst capacity |
| Hard limits with no warning | Clients discover limits by hitting them | Document limits, send approaching-limit warnings |
Building APIs with rate limiting? Explore API management tools and best practices on APIScout — architecture guides, gateway comparisons, and developer resources.