OpenTelemetry for API Observability in 2026
TL;DR
Observability for APIs in 2026 means three things: traces (how long did each step take?), metrics (how many requests per second, error rates, latency percentiles), and logs (what happened in this specific request?). OpenTelemetry (OTel) is the open standard that unified the previously fragmented observability ecosystem — instead of instrumenting your API separately for Datadog, New Relic, and Jaeger, you instrument once with OTel and export to any backend. It's now the default choice for API observability: CNCF-graduated, vendor-neutral, and natively supported by every major cloud and APM provider. The learning curve is real — OTel has a lot of terminology — but the payoff is complete observability without vendor lock-in.
Key Takeaways
- OpenTelemetry is vendor-neutral — instrument once, export to Jaeger, Grafana Tempo, Datadog, Honeycomb, New Relic, or any OTLP-compatible backend
- Auto-instrumentation covers the common cases — HTTP, Express, Fastify, pg, Redis, gRPC all have official OTel packages that add traces with zero code changes
- The three pillars work together — a trace ID links a specific slow request's trace, metrics, and logs so you can go from "error spike" to "root cause" in one workflow
- OpenTelemetry Collector is optional but recommended — receives OTel data from your apps, transforms it, and fans it out to multiple backends
- OTel SDK for Node.js is stable (v1.x) — production-ready with active development; ~1.8M weekly downloads for @opentelemetry/sdk-node
- Context propagation is where OTel shines for microservices — a single trace ID follows a request across 10 services automatically
Why API Observability Matters in 2026
A production API without observability is a black box. You know when it's down (users complain), but you don't know why a specific endpoint got slow, which downstream service caused a timeout cascade, or which 3% of requests are failing silently.
The traditional approach was per-vendor instrumentation: Datadog agent for metrics, Sentry for errors, application logs to Elasticsearch. Each tool had its own SDK, its own data model, and its own concept of a "request." When a bug happened, you'd have metrics in Datadog, a trace in New Relic, and logs in Kibana — with no shared identifier to correlate them.
OpenTelemetry solves this by providing a unified data model (traces, metrics, logs) with a shared trace_id that links all three.
The OpenTelemetry Data Model
Traces and Spans
A trace represents the complete journey of one request through your system. It's composed of spans — each span represents one operation:
Trace ID: abc123
│
├── GET /api/orders/:id (span 1 — HTTP handler, 145ms)
│ ├── validateAuth (span 2 — JWT verify, 3ms)
│ ├── db.orders.findById (span 3 — PostgreSQL query, 38ms)
│ │ └── SELECT * FROM orders WHERE id = $1
│ ├── db.users.findById (span 4 — PostgreSQL query, 12ms)
│ └── calculateShipping (span 5 — external API call, 89ms)
│ └── POST https://shipping-api.com/calculate (span 6)
Every span has:
- trace_id — shared across the entire request journey
- span_id — unique to this operation
- parent_span_id — which span created this one
- start_time, end_time — when the operation ran
- attributes — key-value data (HTTP method, DB query, user ID)
- status — OK, Error, or Unset
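Concretely, a finished span is just a structured record. The sketch below is illustrative only — the values are made up, and real field names differ slightly between the SDK's in-memory representation and the OTLP wire format:

```typescript
// Illustrative only: the rough shape of a finished span. Values are made up;
// exact field names vary between the SDK and the OTLP wire format.
const span = {
  traceId: '0af7651916cd43dd8448eb211c80319c', // shared by every span in the request
  spanId: 'b7ad6b7169203331',                  // unique to this operation
  parentSpanId: '00f067aa0ba902b7',            // the span that created this one
  name: 'db.orders.findById',
  startTime: 1_767_225_600_000,                // ms since epoch (example value)
  endTime: 1_767_225_600_038,
  attributes: { 'db.system': 'postgresql' },
  status: { code: 'OK' },
}

console.log(span.endTime - span.startTime) // 38 — the span's duration in ms
```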
Metrics
OTel metrics are time-series measurements: request counts, latency histograms, error rates, active connections. Unlike traces (which are typically sampled), metrics are aggregated — every request is recorded, but only the aggregates are stored, not individual data points.
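To make "aggregated" concrete, here is a toy sketch (not OTel API code — the SDK does this internally) of how a latency histogram stores bucket counts rather than individual samples:

```typescript
// Toy sketch of histogram aggregation: bucket counts, not raw samples.
const boundaries = [50, 100, 250, 500] // bucket upper bounds in ms
const counts = new Array(boundaries.length + 1).fill(0)

function record(latencyMs: number): void {
  let i = boundaries.findIndex((b) => latencyMs <= b)
  if (i === -1) i = boundaries.length // overflow bucket (> 500ms)
  counts[i] += 1
}

;[12, 80, 80, 400, 900].forEach(record)
console.log(counts) // [1, 2, 0, 1, 1] — five requests collapsed into five counters
```

No matter how many requests you record, storage stays fixed at one integer per bucket — which is why metrics can capture 100% of traffic while traces usually cannot.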
Logs
OTel logs connect your existing console.log / pino / winston output to the trace context — adding trace_id and span_id to every log line so you can find the logs for a specific slow request.
Setting Up OpenTelemetry for a Node.js API
Installation
npm install @opentelemetry/sdk-node \
@opentelemetry/auto-instrumentations-node \
@opentelemetry/exporter-trace-otlp-http \
@opentelemetry/exporter-metrics-otlp-http
Instrumentation Setup (tracing.ts)
The key principle: initialize OTel before importing anything else. This is because auto-instrumentation patches modules at import time.
// tracing.ts — must be the FIRST file executed
import { NodeSDK } from '@opentelemetry/sdk-node'
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http'
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics'
import { Resource } from '@opentelemetry/resources'
import { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } from '@opentelemetry/semantic-conventions'
const sdk = new NodeSDK({
resource: new Resource({
[ATTR_SERVICE_NAME]: 'orders-api',
[ATTR_SERVICE_VERSION]: process.env.APP_VERSION ?? '0.0.0',
'deployment.environment': process.env.NODE_ENV ?? 'development',
}),
traceExporter: new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT + '/v1/traces',
headers: { 'Authorization': `Bearer ${process.env.OTEL_AUTH_TOKEN}` },
}),
metricReader: new PeriodicExportingMetricReader({
exporter: new OTLPMetricExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT + '/v1/metrics',
}),
exportIntervalMillis: 10_000, // Export metrics every 10 seconds
}),
instrumentations: [
getNodeAutoInstrumentations({
// Auto-instruments: HTTP, Express/Fastify, pg, Redis, gRPC, fetch
'@opentelemetry/instrumentation-http': { enabled: true },
'@opentelemetry/instrumentation-express': { enabled: true },
'@opentelemetry/instrumentation-pg': { enabled: true },
'@opentelemetry/instrumentation-redis': { enabled: true },
// Disable noisy instrumentations
'@opentelemetry/instrumentation-fs': { enabled: false },
}),
],
})
sdk.start()
process.on('SIGTERM', () => sdk.shutdown())
In package.json:
{
"scripts": {
"start": "node --require ./dist/tracing.js dist/server.js"
}
}
The --require flag loads tracing.js before your server — this ensures OTel patches Node.js modules before they're imported.
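Most exporter settings can also come from the standard OTel environment variables instead of hard-coded config, which keeps tracing.ts generic across environments. A sketch — the endpoint URL is a placeholder for your collector or backend:

```shell
# Standard OTel env vars — the SDK reads these automatically.
# The endpoint below is a placeholder; 4318 is the default OTLP/HTTP port.
export OTEL_SERVICE_NAME=orders-api
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
export OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production

node --require ./dist/tracing.js dist/server.js
```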
What Auto-Instrumentation Adds (Zero Code Changes)
With the above setup and no changes to your API code, you automatically get:
- Every HTTP request as a span with method, URL, status code, and duration
- Every PostgreSQL/MySQL query as a child span with the SQL text (sanitized)
- Every Redis command as a child span
- Every outbound HTTP/fetch call as a child span
- Error recording when exceptions are thrown
- Context propagation via the traceparent header for distributed tracing
Manual Instrumentation
Auto-instrumentation covers I/O. Manual instrumentation adds business logic context:
import { trace, SpanStatusCode, context } from '@opentelemetry/api'
const tracer = trace.getTracer('orders-service', '1.0.0')
async function processOrder(orderId: string, userId: string) {
// Create a custom span for business logic
return tracer.startActiveSpan('processOrder', async (span) => {
try {
// Add business context as attributes
span.setAttributes({
'order.id': orderId,
'user.id': userId,
'order.processor': 'standard',
})
const order = await db.orders.findById(orderId)
// Add computed attributes as you discover them
span.setAttributes({
'order.total': order.total,
'order.item_count': order.items.length,
'order.currency': order.currency,
})
// Record events (milestones within a span)
span.addEvent('inventory_checked', { 'items.available': true })
const result = await fulfillOrder(order)
span.addEvent('fulfillment_queued', { 'queue.id': result.queueId })
span.setStatus({ code: SpanStatusCode.OK })
return result
} catch (error) {
// Record the error — this marks the span as failed
span.recordException(error as Error)
span.setStatus({
code: SpanStatusCode.ERROR,
message: (error as Error).message,
})
throw error
} finally {
span.end()
}
})
}
Custom Metrics
Beyond auto-instrumented HTTP metrics, you can track business metrics:
import { metrics } from '@opentelemetry/api'
const meter = metrics.getMeter('orders-service')
// Counters for event counting
const ordersCreated = meter.createCounter('orders.created', {
description: 'Number of orders created',
unit: 'orders',
})
// Histograms for duration/size distributions
const orderValue = meter.createHistogram('orders.value', {
description: 'Distribution of order values',
unit: 'USD',
advice: { explicitBucketBoundaries: [10, 50, 100, 500, 1000, 5000] },
})
// Observable gauges for current state
const activeConnections = meter.createObservableGauge('db.connections.active', {
description: 'Active database connections',
})
activeConnections.addCallback((result) => {
// `pool` is assumed to be a pg Pool instance; totalCount and idleCount are pg Pool fields
result.observe(pool.totalCount - pool.idleCount)
})
// Usage in business logic
async function createOrder(data: CreateOrderInput) {
const order = await db.orders.create(data)
ordersCreated.add(1, {
'order.type': order.type,
'user.plan': order.user.plan,
})
orderValue.record(order.total, {
'order.currency': order.currency,
'order.type': order.type,
})
return order
}
The OpenTelemetry Collector
For production, running the OTel Collector between your app and your backends provides:
- Protocol translation — your app sends OTLP; the collector translates to Datadog, Prometheus, Jaeger, etc.
- Fan-out — send the same traces to multiple backends (Grafana for SREs, Honeycomb for devs)
- Sampling — drop 99% of successful traces but keep all errors
- Batching and retry — buffer telemetry if your backend is temporarily unavailable
# otel-collector-config.yaml
receivers:
otlp:
protocols:
http:
endpoint: 0.0.0.0:4318
grpc:
endpoint: 0.0.0.0:4317
processors:
batch:
timeout: 10s
send_batch_size: 1024
# Probabilistic (head-based) sampling — keep a fixed percentage of traces.
# To keep all errors and slow requests, use the tail_sampling processor instead.
probabilistic_sampler:
sampling_percentage: 5 # Keep 5% of traces
exporters:
# Send to Grafana Tempo for traces
otlp/tempo:
endpoint: "http://tempo:4317"
tls:
insecure: true
# Send to Prometheus for metrics
prometheusremotewrite:
endpoint: "http://prometheus:9090/api/v1/write"
# Also send to Datadog
datadog:
api:
key: "${DD_API_KEY}"
site: datadoghq.com
service:
pipelines:
traces:
receivers: [otlp]
processors: [batch, probabilistic_sampler]
exporters: [otlp/tempo, datadog]
metrics:
receivers: [otlp]
processors: [batch]
exporters: [prometheusremotewrite, datadog]
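The collector itself is a single binary; a common way to run it locally is the contrib Docker image, mounting the config above. The mount path assumes the otelcol-contrib image defaults:

```shell
# Run the collector with the config above. The contrib image bundles
# vendor exporters such as datadog; the slimmer core image does not.
docker run --rm \
  -p 4317:4317 -p 4318:4318 \
  -v "$(pwd)/otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml" \
  otel/opentelemetry-collector-contrib:latest
```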
Backend Options
| Backend | Best For | Pricing |
|---|---|---|
| Grafana Tempo | Self-hosted, budget-conscious, already using Grafana | Free OSS / Grafana Cloud free tier |
| Jaeger | Self-hosted, Kubernetes-native | Free OSS |
| Honeycomb | Developer-focused, high-cardinality queries | Paid ($) |
| Datadog APM | Enterprise, full-stack observability | Expensive ($$$$) |
| New Relic | Enterprise, full-stack | Expensive ($$$) |
| Lightstep (ServiceNow) | Enterprise reliability workflows | Paid ($$) |
For a startup or mid-size team: Grafana Tempo + Prometheus + Grafana Cloud provides excellent observability at near-zero cost.
Distributed Tracing Across Services
OTel's context propagation automatically handles microservice tracing via the traceparent header:
// Service A — makes a call to Service B
const response = await fetch('https://orders-api.internal/process', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
// OTel auto-instrumentation injects traceparent automatically
// traceparent: '00-abc123trace-def456span-01'
},
body: JSON.stringify(payload),
})
// Service B — receives the request
// OTel auto-instrumentation extracts the traceparent header
// and creates a child span with the same trace_id
// Your handler code needs no changes
The result: in your APM backend, you see a single trace spanning both services, showing the complete latency breakdown across the entire request path.
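Under the hood, traceparent is a plain W3C Trace Context header with the shape version-traceId-parentSpanId-flags. A minimal parser sketch — illustrative only, since OTel's W3CTraceContextPropagator handles this for you:

```typescript
// Minimal sketch of parsing a W3C traceparent header, to demystify
// what actually travels between services. OTel does this automatically.
interface TraceParent {
  version: string
  traceId: string  // 32 hex chars — shared across the whole distributed trace
  parentId: string // 16 hex chars — the caller's span ID
  sampled: boolean // low bit of the flags byte
}

function parseTraceparent(header: string): TraceParent | null {
  const parts = header.split('-')
  if (parts.length !== 4) return null
  const [version, traceId, parentId, flags] = parts
  if (traceId.length !== 32 || parentId.length !== 16) return null
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 }
}

const tp = parseTraceparent('00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01')
console.log(tp?.sampled) // true — the caller sampled this request
```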
Correlating Logs with Traces
The third pillar — logs — becomes far more powerful when every log line includes the current trace_id and span_id. This lets you jump from a trace in Grafana Tempo to the exact log lines for that request in Loki.
Adding OTel Context to Pino Logs
import pino from 'pino'
import { trace, context } from '@opentelemetry/api'
// Custom log serializer that injects trace context
const logger = pino({
mixin() {
const span = trace.getActiveSpan()
if (!span) return {}
const spanContext = span.spanContext()
return {
trace_id: spanContext.traceId,
span_id: spanContext.spanId,
trace_flags: spanContext.traceFlags,
}
},
})
// Now every log line automatically includes trace context
async function processOrder(orderId: string) {
logger.info({ orderId }, 'Processing order') // → includes trace_id, span_id
const order = await db.orders.findById(orderId)
logger.info({ order_total: order.total }, 'Order found') // → same trace_id
return fulfillOrder(order)
}
In your observability backend, you can now:
- See a slow trace in Grafana Tempo
- Click "View logs for this trace"
- See every log line from all services for that exact request
Automatic Log Correlation with Winston
For Winston users, the @opentelemetry/winston-transport package adds trace context automatically:
import winston from 'winston'
import { OpenTelemetryTransportV3 } from '@opentelemetry/winston-transport'
const logger = winston.createLogger({
transports: [
new winston.transports.Console(),
new OpenTelemetryTransportV3(), // Sends logs as OTel log records
],
})
// Every logger.info() call now propagates trace context
Real-World Debugging Workflow
Here's how OTel transforms incident response. A user reports that their checkout is slow:
Before OTel
- Search application logs for the user's ID — find 400 unrelated log lines
- Check Datadog for latency spikes — see the spike but not which operation
- SSH into the server, check pg_stat_activity — the slow query is gone
- Guess that it was the shipping calculation
- Add console.time() calls and wait for it to happen again
After OTel
- Search Honeycomb/Grafana for user.id = 'abc' with duration > 2000ms
- Find the trace immediately — see that calculateShipping took 1,800ms
- Click into the calculateShipping span — see that the external shipping API returned 429 (rate limited)
- Find the same pattern in metrics — the shipping_api.errors counter spiked at 14:32
- Fix: add retry logic with backoff to the shipping API client
The entire process takes 5 minutes instead of 2 hours. This is the real ROI of observability.
OTel vs Vendor Agents
A common question: why use OTel instead of installing the Datadog agent?
| Factor | OpenTelemetry | Datadog Agent |
|---|---|---|
| Vendor lock-in | None — switch backends freely | High — proprietary format |
| Setup complexity | Higher (more config) | Lower (install agent) |
| Cost | Free (OSS) | Pay per host + volume |
| Ecosystem | Universal | Datadog-specific |
| Custom metrics | Full flexibility | Limited to Datadog types |
| Backend choice | Grafana, Jaeger, Honeycomb, etc. | Datadog only |
For startups: start with OTel + Grafana Cloud (generous free tier). For enterprises already on Datadog: use OTel SDK with the OTLP Datadog exporter — you get OTel's flexibility without switching backends.
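That last setup is worth sketching: a Datadog Agent with OTLP ingest enabled can receive OTel data directly, so the only app-side change is the endpoint. The hostname below is a placeholder for a typical Agent deployment:

```shell
# Point the OTel SDK at a Datadog Agent with OTLP ingest enabled.
# Hostname is a placeholder; 4318 is the default OTLP/HTTP port.
export OTEL_EXPORTER_OTLP_ENDPOINT=http://datadog-agent:4318
```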
Methodology
- npm download data from npmjs.com API, March 2026 weekly averages
- Package versions: @opentelemetry/sdk-node v1.x, @opentelemetry/auto-instrumentations-node v0.54.x
- Sources: OpenTelemetry official documentation (opentelemetry.io), CNCF project status, Grafana and Honeycomb blog posts
Explore observability and API tooling alternatives on APIScout — see which observability packages developers are adopting.
Related: API Error Handling Patterns for Production 2026 · API Gateway Patterns for Microservices 2026 · API Rate Limiting Best Practices 2026