
OpenTelemetry for API Observability in 2026

APIScout Team
opentelemetry · observability · tracing · api · nodejs


TL;DR

Observability for APIs in 2026 means three things: traces (how long did each step take?), metrics (how many requests per second, error rates, latency percentiles), and logs (what happened in this specific request?). OpenTelemetry (OTel) is the open standard that unified the previously fragmented observability ecosystem — instead of instrumenting your API separately for Datadog, New Relic, and Jaeger, you instrument once with OTel and export to any backend. It's now the default choice for API observability: CNCF-graduated, vendor-neutral, and natively supported by every major cloud and APM provider. The learning curve is real — OTel has a lot of terminology — but the payoff is complete observability without vendor lock-in.

Key Takeaways

  • OpenTelemetry is vendor-neutral — instrument once, export to Jaeger, Grafana Tempo, Datadog, Honeycomb, New Relic, or any OTLP-compatible backend
  • Auto-instrumentation covers the common cases — HTTP, Express, Fastify, pg, Redis, gRPC all have official OTel packages that add traces with zero code changes
  • The three pillars work together — a trace ID links a specific slow request's trace, metrics, and logs so you can go from "error spike" to "root cause" in one workflow
  • OpenTelemetry Collector is optional but recommended — receives OTel data from your apps, transforms it, and fans it out to multiple backends
  • OTel SDK for Node.js is stable (v1.x) — production-ready with active development; ~1.8M weekly downloads for @opentelemetry/sdk-node
  • Context propagation is where OTel shines for microservices — a single trace ID follows a request across 10 services automatically

Why API Observability Matters in 2026

A production API without observability is a black box. You know when it's down (users complain), but you don't know why a specific endpoint got slow, which downstream service caused a timeout cascade, or which 3% of requests are failing silently.

The traditional approach was per-vendor instrumentation: Datadog agent for metrics, Sentry for errors, application logs to Elasticsearch. Each tool had its own SDK, its own data model, and its own concept of a "request." When a bug happened, you'd have metrics in Datadog, a trace in New Relic, and logs in Kibana — with no shared identifier to correlate them.

OpenTelemetry solves this by providing a unified data model (traces, metrics, logs) with a shared trace_id that links all three.


The OpenTelemetry Data Model

Traces and Spans

A trace represents the complete journey of one request through your system. It's composed of spans — each span represents one operation:

Trace ID: abc123
│
├── GET /api/orders/:id (span 1 — HTTP handler, 145ms)
│   ├── validateAuth (span 2 — JWT verify, 3ms)
│   ├── db.orders.findById (span 3 — PostgreSQL query, 38ms)
│   │   └── SELECT * FROM orders WHERE id = $1
│   ├── db.users.findById (span 4 — PostgreSQL query, 12ms)
│   └── calculateShipping (span 5 — external API call, 89ms)
│       └── POST https://shipping-api.com/calculate (span 6)

Every span has:

  • trace_id — shared across the entire request journey
  • span_id — unique to this operation
  • parent_span_id — which span created this one
  • start_time, end_time — when the operation ran
  • attributes — key-value data (HTTP method, DB query, user ID)
  • status — OK, Error, or Unset
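The fields above can be sketched as a plain TypeScript shape. This is an illustration of the data model, not the SDK's actual Span class, and the timestamp values are made up:

```typescript
// A sketch of the span data model described above (not the SDK's Span class).
interface SpanData {
  traceId: string                                  // shared across the whole request
  spanId: string                                   // unique to this operation
  parentSpanId?: string                            // absent on the root span
  startTime: number                                // ms since epoch here; units vary by exporter
  endTime: number
  attributes: Record<string, string | number | boolean>
  status: 'OK' | 'ERROR' | 'UNSET'
}

// Span 3 from the example trace, expressed as data:
const dbSpan: SpanData = {
  traceId: 'abc123',
  spanId: 'span3',
  parentSpanId: 'span1',
  startTime: 1_700_000_000_000,
  endTime: 1_700_000_000_038,
  attributes: {
    'db.system': 'postgresql',
    'db.statement': 'SELECT * FROM orders WHERE id = $1',
  },
  status: 'OK',
}

console.log(dbSpan.endTime - dbSpan.startTime)  // → 38 (the 38ms query)
```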

Metrics

OTel metrics are time-series measurements: request counts, latency histograms, error rates, active connections. Unlike traces (sampled), metrics are aggregated — you capture every request but store the aggregation, not individual data points.
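To make "aggregated, not individual" concrete, here is a minimal sketch of how a histogram instrument condenses raw durations into bucket counts. The boundaries are illustrative; the real SDK aggregation also tracks sum and count:

```typescript
// Sketch: a histogram stores per-bucket counts, not the raw measurements.
// counts[i] holds the number of values <= boundaries[i]; the final slot
// is the overflow bucket for values above the last boundary.
function toHistogram(durationsMs: number[], boundaries: number[]): number[] {
  const counts = new Array(boundaries.length + 1).fill(0)
  for (const d of durationsMs) {
    let i = boundaries.findIndex((b) => d <= b)
    if (i === -1) i = boundaries.length  // overflow bucket
    counts[i]++
  }
  return counts
}

const buckets = toHistogram([3, 12, 48, 250, 1200], [10, 100, 1000])
console.log(buckets)  // → [1, 2, 1, 1]: one ≤10ms, two ≤100ms, one ≤1000ms, one above
```

Five requests collapse into four integers; a million requests would still collapse into four integers, which is why metrics stay cheap at any traffic level.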

Logs

OTel logs connect your existing console.log / pino / winston output to the trace context — adding trace_id and span_id to every log line so you can find the logs for a specific slow request.


Setting Up OpenTelemetry for a Node.js API

Installation

npm install @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http \
  @opentelemetry/exporter-metrics-otlp-http

Instrumentation Setup (tracing.ts)

The key principle: initialize OTel before importing anything else. This is because auto-instrumentation patches modules at import time.

// tracing.ts — must be the FIRST file executed
import { NodeSDK } from '@opentelemetry/sdk-node'
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http'
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics'
import { Resource } from '@opentelemetry/resources'
import { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } from '@opentelemetry/semantic-conventions'

const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: 'orders-api',
    [ATTR_SERVICE_VERSION]: process.env.APP_VERSION ?? '0.0.0',
    'deployment.environment': process.env.NODE_ENV ?? 'development',
  }),

  traceExporter: new OTLPTraceExporter({
    // Fall back to a local collector so an unset env var doesn't produce "undefined/v1/traces"
    url: `${process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? 'http://localhost:4318'}/v1/traces`,
    headers: { Authorization: `Bearer ${process.env.OTEL_AUTH_TOKEN}` },
  }),

  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: `${process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? 'http://localhost:4318'}/v1/metrics`,
    }),
    exportIntervalMillis: 10_000,  // Export metrics every 10 seconds
  }),

  instrumentations: [
    getNodeAutoInstrumentations({
      // Auto-instruments: HTTP, Express/Fastify, pg, Redis, gRPC, fetch
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
      '@opentelemetry/instrumentation-redis': { enabled: true },
      // Disable noisy instrumentations
      '@opentelemetry/instrumentation-fs': { enabled: false },
    }),
  ],
})

sdk.start()

// Flush any pending telemetry before the process exits
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0))
})

In package.json:

{
  "scripts": {
    "start": "node --require ./dist/tracing.js dist/server.js"
  }
}

The --require flag loads tracing.js before your server — this ensures OTel patches Node.js modules before they're imported.
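Why load order matters: auto-instrumentation works by wrapping module exports, and a wrapper only takes effect for code that looks up the export after wrapping. A toy sketch of the idea (not OTel's actual mechanism, which hooks Node's module loader):

```typescript
// Toy illustration of monkey-patching: the "instrumentation" must run
// before any consumer calls the original function.
const http = {
  get(url: string): string {
    return `GET ${url}`
  },
}

const recorded: string[] = []

// Wrap the original export so every call records a span-like entry first.
function instrument() {
  const original = http.get.bind(http)
  http.get = (url: string) => {
    recorded.push(`span: HTTP GET ${url}`)  // what a real span would capture
    return original(url)
  }
}

instrument()             // runs "first", like tracing.js under --require
http.get('/api/orders')  // now every call is traced
console.log(recorded.length)  // → 1
```

If `instrument()` ran after the first `http.get` call instead, that call would bypass the wrapper entirely, which is exactly the bug you get when the OTel SDK initializes too late.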

What Auto-Instrumentation Adds (Zero Code Changes)

With the above setup and no changes to your API code, you automatically get:

  • Every HTTP request as a span with method, URL, status code, and duration
  • Every PostgreSQL/MySQL query as a child span with the SQL text (sanitized)
  • Every Redis command as a child span
  • Every outbound HTTP/fetch call as a child span
  • Error recording when exceptions are thrown
  • Context propagation via traceparent header for distributed tracing

Manual Instrumentation

Auto-instrumentation covers I/O. Manual instrumentation adds business logic context:

import { trace, SpanStatusCode } from '@opentelemetry/api'

const tracer = trace.getTracer('orders-service', '1.0.0')

async function processOrder(orderId: string, userId: string) {
  // Create a custom span for business logic
  return tracer.startActiveSpan('processOrder', async (span) => {
    try {
      // Add business context as attributes
      span.setAttributes({
        'order.id': orderId,
        'user.id': userId,
        'order.processor': 'standard',
      })

      const order = await db.orders.findById(orderId)

      // Add computed attributes as you discover them
      span.setAttributes({
        'order.total': order.total,
        'order.item_count': order.items.length,
        'order.currency': order.currency,
      })

      // Record events (milestones within a span)
      span.addEvent('inventory_checked', { 'items.available': true })

      const result = await fulfillOrder(order)
      span.addEvent('fulfillment_queued', { 'queue.id': result.queueId })

      span.setStatus({ code: SpanStatusCode.OK })
      return result
    } catch (error) {
      // Record the error — this marks the span as failed
      span.recordException(error as Error)
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: (error as Error).message,
      })
      throw error
    } finally {
      span.end()
    }
  })
}

Custom Metrics

Beyond auto-instrumented HTTP metrics, you can track business metrics:

import { metrics } from '@opentelemetry/api'

const meter = metrics.getMeter('orders-service')

// Counters for event counting
const ordersCreated = meter.createCounter('orders.created', {
  description: 'Number of orders created',
  unit: 'orders',
})

// Histograms for duration/size distributions
const orderValue = meter.createHistogram('orders.value', {
  description: 'Distribution of order values',
  unit: 'USD',
  advice: { explicitBucketBoundaries: [10, 50, 100, 500, 1000, 5000] },
})

// Observable gauges for current state
const activeConnections = meter.createObservableGauge('db.connections.active', {
  description: 'Active database connections',
})
activeConnections.addCallback((result) => {
  result.observe(pool.totalCount - pool.idleCount)
})

// Usage in business logic
async function createOrder(data: CreateOrderInput) {
  const order = await db.orders.create(data)

  ordersCreated.add(1, {
    'order.type': order.type,
    'user.plan': order.user.plan,
  })

  orderValue.record(order.total, {
    'order.currency': order.currency,
    'order.type': order.type,
  })

  return order
}

The OpenTelemetry Collector

For production, running the OTel Collector between your app and your backends provides:

  1. Protocol translation — your app sends OTLP; the collector translates to Datadog, Prometheus, Jaeger, etc.
  2. Fan-out — send the same traces to multiple backends (Grafana for SREs, Honeycomb for devs)
  3. Sampling — drop 99% of successful traces but keep all errors
  4. Batching and retry — buffer telemetry if your backend is temporarily unavailable

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024

  # Probabilistic (head) sampling — keeps a fixed fraction of all traces.
  # To keep all errors and slow requests regardless of the sample rate, use
  # the tail_sampling processor instead, which buffers spans and decides
  # per complete trace.
  probabilistic_sampler:
    sampling_percentage: 5  # Keep 5% of traces

exporters:
  # Send to Grafana Tempo for traces
  otlp/tempo:
    endpoint: "http://tempo:4317"
    tls:
      insecure: true

  # Send to Prometheus for metrics
  prometheusremotewrite:
    endpoint: "http://prometheus:9090/api/v1/write"

  # Also send to Datadog
  datadog:
    api:
      key: "${DD_API_KEY}"
      site: datadoghq.com

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, probabilistic_sampler]
      exporters: [otlp/tempo, datadog]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite, datadog]

Backend Options

| Backend | Best For | Pricing |
| --- | --- | --- |
| Grafana Tempo | Self-hosted, budget-conscious, already using Grafana | Free OSS / Grafana Cloud free tier |
| Jaeger | Self-hosted, Kubernetes-native | Free OSS |
| Honeycomb | Developer-focused, high-cardinality queries | Paid ($) |
| Datadog APM | Enterprise, full-stack observability | Expensive ($$$$) |
| New Relic | Enterprise, full-stack | Expensive ($$$) |
| Lightstep (ServiceNow) | Enterprise reliability workflows | Paid ($$) |

For a startup or mid-size team: Grafana Tempo + Prometheus + Grafana Cloud provides excellent observability at near-zero cost.
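A minimal local sketch of that stack, assuming the official Docker images (Tempo and Grafana still need their own configuration files and data-source provisioning, omitted here):

```yaml
# docker-compose.yaml — local sketch of the budget stack
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml
    ports:
      - "4317:4317"   # OTLP/gRPC from your services
      - "4318:4318"   # OTLP/HTTP from your services
  tempo:
    image: grafana/tempo
    command: ["-config.file=/etc/tempo/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo/tempo.yaml   # Tempo config, not shown here
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"   # Grafana UI, with Tempo added as a data source
```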


Distributed Tracing Across Services

OTel's context propagation automatically handles microservice tracing via the traceparent header:

// Service A — makes a call to Service B
const response = await fetch('https://orders-api.internal/process', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    // OTel auto-instrumentation injects traceparent automatically
    // traceparent: '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'
  },
  body: JSON.stringify(payload),
})

// Service B — receives the request
// OTel auto-instrumentation extracts the traceparent header
// and creates a child span with the same trace_id
// Your handler code needs no changes

The result: in your APM backend, you see a single trace spanning both services, showing the complete latency breakdown across the entire request path.
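The traceparent header itself is a small fixed format from the W3C Trace Context spec: version, trace ID, span ID, and flags, dash-separated. A sketch of parsing it, with a spec-valid example value (the IDs are illustrative):

```typescript
// The W3C Trace Context `traceparent` format that auto-instrumentation
// reads and writes: version-traceId-spanId-flags.
interface TraceParent {
  version: string  // '00' for the current spec revision
  traceId: string  // 32 lowercase hex chars — shared by every service in the trace
  spanId: string   // 16 lowercase hex chars — the calling service's span
  flags: string    // '01' = sampled, '00' = not sampled
}

function parseTraceparent(header: string): TraceParent | null {
  const m = header.match(
    /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/,
  )
  if (!m) return null
  return { version: m[1], traceId: m[2], spanId: m[3], flags: m[4] }
}

const tp = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01')
console.log(tp?.traceId)  // → '4bf92f3577b34da6a3ce929d0e0e4736'
```

Service B creates its spans under the incoming trace ID while generating fresh span IDs, which is what stitches both services into one trace in your backend.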


Correlating Logs with Traces

The third pillar — logs — becomes far more powerful when every log line includes the current trace_id and span_id. This lets you jump from a trace in Grafana Tempo to the exact log lines for that request in Loki.

Adding OTel Context to Pino Logs

import pino from 'pino'
import { trace } from '@opentelemetry/api'

// Pino mixin that injects the active trace context into every log line
const logger = pino({
  mixin() {
    const span = trace.getActiveSpan()
    if (!span) return {}

    const spanContext = span.spanContext()
    return {
      trace_id: spanContext.traceId,
      span_id: spanContext.spanId,
      trace_flags: spanContext.traceFlags,
    }
  },
})

// Now every log line automatically includes trace context
async function processOrder(orderId: string) {
  logger.info({ orderId }, 'Processing order')  // → includes trace_id, span_id

  const order = await db.orders.findById(orderId)
  logger.info({ order_total: order.total }, 'Order found')  // → same trace_id

  return fulfillOrder(order)
}

In your observability backend, you can now:

  1. See a slow trace in Grafana Tempo
  2. Click "View logs for this trace"
  3. See every log line from all services for that exact request

Automatic Log Correlation with Winston

For Winston users, the @opentelemetry/winston-transport package adds trace context automatically:

import winston from 'winston'
import { OpenTelemetryTransportV3 } from '@opentelemetry/winston-transport'

const logger = winston.createLogger({
  transports: [
    new winston.transports.Console(),
    new OpenTelemetryTransportV3(),  // Sends logs as OTel log records
  ],
})
// Every winston.info() call now propagates trace context

Real-World Debugging Workflow

Here's how OTel transforms incident response. A user reports that their checkout is slow:

Before OTel

  1. Search application logs for the user's ID — find 400 unrelated log lines
  2. Check Datadog for latency spikes — see the spike but not which operation
  3. SSH into the server, check pg_stat_activity — the slow query is gone
  4. Guess that it was the shipping calculation
  5. Add console.time() calls and wait for it to happen again

After OTel

  1. Search Honeycomb/Grafana for user.id = 'abc' with duration > 2000ms
  2. Find the trace immediately — see that calculateShipping took 1,800ms
  3. Click into the calculateShipping span — see that the external shipping API returned 429 (rate limited)
  4. Find the same pattern in metrics — shipping_api.errors counter spiked at 14:32
  5. Fix: add retry logic with backoff to the shipping API client

The entire process takes 5 minutes instead of 2 hours. This is the real ROI of observability.


OTel vs Vendor Agents

A common question: why use OTel instead of installing the Datadog agent?

| Factor | OpenTelemetry | Datadog Agent |
| --- | --- | --- |
| Vendor lock-in | None — switch backends freely | High — proprietary format |
| Setup complexity | Higher (more config) | Lower (install agent) |
| Cost | Free (OSS) | Pay per host + volume |
| Ecosystem | Universal | Datadog-specific |
| Custom metrics | Full flexibility | Limited to Datadog types |
| Backend choice | Grafana, Jaeger, Honeycomb, etc. | Datadog only |

For startups: start with OTel + Grafana Cloud (generous free tier). For enterprises already on Datadog: use the OTel SDK and export via the collector's Datadog exporter — you get OTel's flexibility without switching backends.


Methodology

  • npm download data from npmjs.com API, March 2026 weekly averages
  • Package versions: @opentelemetry/sdk-node v1.x, @opentelemetry/auto-instrumentations-node v0.54.x
  • Sources: OpenTelemetry official documentation (opentelemetry.io), CNCF project status, Grafana and Honeycomb blog posts

Explore observability and API tooling alternatives on APIScout — see which observability packages developers are adopting.

Related: API Error Handling Patterns for Production 2026 · API Gateway Patterns for Microservices 2026 · API Rate Limiting Best Practices 2026
