How to Monitor API Performance: Latency, Errors, and SLAs
You can't improve what you don't measure. API performance monitoring tracks latency, error rates, throughput, and availability — the metrics that determine whether your API is meeting its commitments. Here's what to measure, how to measure it, and when to alert.
The Four Golden Signals
The four golden signals from Google's SRE book (latency, traffic, errors, and saturation) apply directly to APIs. Traffic is covered here as throughput:
1. Latency
What: Time from request received to response sent.
Measure percentiles, not averages:
| Percentile | Meaning | Use |
|---|---|---|
| p50 (median) | Half of requests are faster | Typical experience |
| p95 | 95% of requests are faster | Most users' experience |
| p99 | 99% of requests are faster | Worst-case normal experience |
| p99.9 | 99.9% are faster | Tail latency |
Why not averages? An average of 100ms can hide the fact that 1% of requests take 5 seconds. Tail percentiles such as p99 and p99.9 expose those requests.
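To make the difference concrete, here is a minimal sketch that compares the mean with nearest-rank percentiles. The `percentile` helper and the sample latencies are illustrative assumptions, not output from a real service; the sample uses 2 slow requests in 100 so that the nearest-rank p99 lands squarely on the slow tail.

```python
import math
from statistics import mean

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical latencies in ms: 98 fast requests and 2 very slow ones.
latencies_ms = [50] * 98 + [5000] * 2

avg = mean(latencies_ms)              # pulled up by the outliers, but still looks tolerable
p50 = percentile(latencies_ms, 50)    # typical request
p99 = percentile(latencies_ms, 99)    # lands on the slow tail
print(f"avg={avg}ms p50={p50}ms p99={p99}ms")
```

Production systems usually compute percentiles from histograms rather than raw samples, so treat this as an explanation of the idea rather than a recommended implementation.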
Targets:
| Endpoint Type | p50 | p95 | p99 |
|---|---|---|---|
| Simple read | <50ms | <200ms | <500ms |
| Database query | <100ms | <500ms | <1s |
| Search | <200ms | <1s | <2s |
| Write operation | <100ms | <500ms | <1s |
| External API call | <500ms | <2s | <5s |
2. Error Rate
What: Percentage of requests returning errors (4xx/5xx).
| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| 5xx rate | <0.1% | 0.1-1% | >1% |
| 4xx rate | <5% | 5-10% | >10% |
| Total error rate | <1% | 1-5% | >5% |
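Track by status code: distinguish client errors (4xx, usually caused by the caller) from server errors (5xx, caused by your service). A rising 4xx rate can still point to a problem on your side, such as a breaking change or a broken authentication flow.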
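As a rough sketch, the rates in the table above can be derived from a list of response status codes. The function names and the sample traffic are assumptions made for illustration; the 5xx bands mirror the table.

```python
from collections import Counter

def error_rates(status_codes):
    """Return 4xx, 5xx, and total error rates as percentages of all requests."""
    total = len(status_codes)
    if total == 0:
        return {"4xx": 0.0, "5xx": 0.0, "total": 0.0}
    classes = Counter(code // 100 for code in status_codes)
    rate_4xx = 100 * classes[4] / total
    rate_5xx = 100 * classes[5] / total
    return {"4xx": rate_4xx, "5xx": rate_5xx, "total": rate_4xx + rate_5xx}

def classify_5xx(rate_pct):
    """Map a 5xx rate (in percent) to the healthy / warning / critical bands."""
    if rate_pct > 1:
        return "critical"
    if rate_pct >= 0.1:
        return "warning"
    return "healthy"

# Hypothetical window of 1,000 responses.
codes = [200] * 960 + [404] * 30 + [500] * 10
rates = error_rates(codes)
print(rates, classify_5xx(rates["5xx"]))
```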
3. Throughput
What: Requests per second (RPS) or requests per minute (RPM).
Track throughput to:
- Capacity plan (are you approaching limits?)
- Detect anomalies (sudden spike = attack? sudden drop = outage?)
- Correlate with latency (does latency increase with load?)
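The anomaly check in the list above could be sketched like this: bucket request timestamps into per-second counts, then compare current throughput with a baseline. The function names and the spike and drop factors are illustrative assumptions, so tune them to your own traffic.

```python
from collections import Counter

def requests_per_second(timestamps):
    """Count requests in each one-second bucket (timestamps are Unix seconds)."""
    return Counter(int(ts) for ts in timestamps)

def classify_throughput(current_rps, baseline_rps, spike_factor=2.0, drop_factor=0.5):
    """Compare current throughput with a baseline; the factors are illustrative defaults."""
    if baseline_rps <= 0:
        return "no baseline"
    ratio = current_rps / baseline_rps
    if ratio >= spike_factor:
        return "spike"
    if ratio <= drop_factor:
        return "drop"
    return "normal"

# Hypothetical request arrival times.
timestamps = [1700000000.1, 1700000000.7, 1700000001.2, 1700000001.3, 1700000001.9]
per_second = requests_per_second(timestamps)
peak_rps = max(per_second.values())
print(peak_rps, classify_throughput(current_rps=40, baseline_rps=100))
```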
4. Saturation
What: How close your system is to capacity.
| Resource | Metric | Alert Threshold |
|---|---|---|
| CPU | Utilization % | >80% sustained |
| Memory | Usage / available | >85% |
| Database connections | Active / max pool | >80% |
| Disk I/O | IOPS / max IOPS | >70% |
| Network | Bandwidth usage | >70% |
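A minimal sketch of checking saturation against the thresholds in the table. The resource keys and the `(used, capacity)` input shape are assumptions chosen for the example; a real system would read these values from its metrics backend.

```python
# Thresholds from the table above, expressed as fractions of capacity.
THRESHOLDS = {
    "cpu": 0.80,
    "memory": 0.85,
    "db_connections": 0.80,
    "disk_io": 0.70,
    "network": 0.70,
}

def saturated_resources(usage):
    """usage maps resource name -> (used, capacity). Returns the names above threshold, sorted."""
    return sorted(
        name
        for name, (used, capacity) in usage.items()
        if capacity > 0 and used / capacity > THRESHOLDS[name]
    )

# Hypothetical snapshot: 45 of 50 pooled connections are in use.
snapshot = {"cpu": (70, 100), "memory": (6, 8), "db_connections": (45, 50)}
print(saturated_resources(snapshot))
```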
SLA / SLO / SLI
SLI (Service Level Indicator)
The metric you actually measure: "the percentage of requests that complete in under 500ms." A reading such as 99.5% is the current value of that SLI.
SLO (Service Level Objective)
Your internal target: "p99 latency < 500ms, error rate < 0.1%."
SLA (Service Level Agreement)
Your external commitment with consequences: "99.9% uptime or service credits."
Set SLOs tighter than SLAs. If your SLA promises 99.9% uptime, set your SLO at 99.95% so you have a buffer before breaching the SLA.
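The relationship between the three terms can be sketched in a few lines: measure an SLI, then compare it with the internal SLO and the external SLA. The function names, the 500ms threshold, and the target values below are illustrative assumptions.

```python
def latency_sli(latencies_ms, threshold_ms=500):
    """SLI: percentage of requests that completed in under threshold_ms."""
    if not latencies_ms:
        raise ValueError("no requests in window")
    good = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return 100 * good / len(latencies_ms)

def sli_status(sli_pct, slo_pct, sla_pct):
    """Compare a measured SLI with the internal SLO and the external SLA."""
    if sli_pct < sla_pct:
        return "SLA breached"
    if sli_pct < slo_pct:
        return "SLO missed, SLA still met"
    return "within SLO"

# Hypothetical window: 995 fast requests and 5 slow ones.
window = [100] * 995 + [800] * 5
sli = latency_sli(window)
print(sli, sli_status(sli, slo_pct=99.5, sla_pct=99.0))
```

The middle state is the buffer: the SLO has been missed, which should trigger action, but the SLA and its penalties are not yet in play.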
Uptime Targets
| Uptime | Downtime/Year | Downtime/Month |
|---|---|---|
| 99% | 3.65 days | 7.3 hours |
| 99.9% | 8.77 hours | 43.8 minutes |
| 99.95% | 4.38 hours | 21.9 minutes |
| 99.99% | 52.6 minutes | 4.38 minutes |
| 99.999% | 5.26 minutes | 26.3 seconds |
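The figures in this table follow from simple arithmetic, sketched below. The sketch assumes an average year of 365.25 days and a month of one twelfth of that, which is one common convention; other conventions shift the results slightly.

```python
from datetime import timedelta

YEAR = timedelta(days=365.25)  # assumption: average year length
MONTH = YEAR / 12              # assumption: average month length

def allowed_downtime(uptime_pct, period):
    """Downtime allowed within `period` while still meeting uptime_pct."""
    if not 0 <= uptime_pct <= 100:
        raise ValueError("uptime_pct must be between 0 and 100")
    return period * ((100 - uptime_pct) / 100)

yearly_hours = allowed_downtime(99.9, YEAR).total_seconds() / 3600
monthly_minutes = allowed_downtime(99.9, MONTH).total_seconds() / 60
print(f"99.9%: {yearly_hours:.2f} hours/year, {monthly_minutes:.1f} minutes/month")
```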
Alerting Strategy
Alert on Symptoms, Not Causes
Good alerts (symptoms):
- p99 latency > 2s for 5 minutes
- Error rate > 1% for 3 minutes
- Throughput dropped 50% vs same hour last week
Bad alerts (causes). Keep these on dashboards or as low-priority warnings rather than pages:
- CPU > 80% (may not affect users)
- Memory > 90% (may be normal)
- Single health check failed (transient)
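The "for 5 minutes" part of a good alert is what filters out transient blips. A minimal sketch of that rule, assuming one metric sample per minute (the function name and the sample values are hypothetical):

```python
def sustained_breach(samples, threshold, duration):
    """True if the most recent `duration` samples all exceed the threshold."""
    if duration <= 0 or len(samples) < duration:
        return False
    return all(value > threshold for value in samples[-duration:])

# Hypothetical p99 latency in seconds, one sample per minute.
steady_breach = [1.2, 2.5, 2.4, 2.6, 2.2, 2.1]
brief_blip = [2.5, 2.4, 1.9, 2.2, 2.1]

print(sustained_breach(steady_breach, threshold=2.0, duration=5))  # last five samples all above 2s
print(sustained_breach(brief_blip, threshold=2.0, duration=5))     # one sample dipped below 2s
```

Monitoring tools typically offer this as a built-in duration or evaluation-window setting, so the sketch is only meant to show the logic.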
Alert Severity
| Severity | Criteria | Response |
|---|---|---|
| P1 - Critical | Service down, data loss | Page on-call, all hands |
| P2 - High | Degraded performance, partial outage | Page on-call, investigate |
| P3 - Medium | Non-critical service degraded | Next business day |
| P4 - Low | Cosmetic, minor issue | Backlog |
Monitoring Tools
| Tool | Best For | Price |
|---|---|---|
| Datadog | Full observability | From $5/host/mo |
| Grafana + Prometheus | Self-hosted, open source | Free |
| Better Stack | Uptime + incidents | Free (10 monitors) |
| Checkly | Synthetic monitoring | Free (5 checks) |
| Sentry | Error tracking | Free (5K events) |
| PostHog | Product analytics | Free (1M events) |
Dashboard Essentials
Every API monitoring dashboard should show:
- Request volume — RPS over time (detect anomalies)
- Latency percentiles — p50, p95, p99 over time
- Error rate — 4xx and 5xx separately
- Top errors — most frequent error codes/messages
- Slowest endpoints — which endpoints need optimization
- Uptime — current and 30-day availability
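Two of these panels, slowest endpoints and top errors, can be derived from the same request log. The sketch below assumes each log record is an `(endpoint, status_code, latency_ms)` tuple, which is a made-up shape for illustration, and ranks endpoints by nearest-rank p95.

```python
import math
from collections import Counter, defaultdict

def dashboard_summary(requests, top_n=3):
    """Summarize (endpoint, status_code, latency_ms) records into two dashboard panels."""
    latencies = defaultdict(list)
    errors = Counter()
    for endpoint, status, latency_ms in requests:
        latencies[endpoint].append(latency_ms)
        if status >= 400:
            errors[status] += 1

    def p95(values):
        ordered = sorted(values)
        return ordered[math.ceil(95 * len(ordered) / 100) - 1]  # nearest rank

    ranked = sorted(((ep, p95(vals)) for ep, vals in latencies.items()),
                    key=lambda item: item[1], reverse=True)
    return {"slowest_endpoints": ranked[:top_n], "top_errors": errors.most_common(top_n)}

# Hypothetical request log.
log = [
    ("/search", 200, 900), ("/search", 200, 300),
    ("/users", 200, 40), ("/users", 500, 60), ("/users", 404, 50),
    ("/orders", 500, 120),
]
summary = dashboard_summary(log)
print(summary)
```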