
OpenTelemetry for API Observability in 2026

APIScout Team
opentelemetry · observability · tracing · api · nodejs


TL;DR

Observability for APIs in 2026 means three things: traces (how long did each step take?), metrics (how many requests per second, error rates, latency percentiles), and logs (what happened in this specific request?). OpenTelemetry (OTel) is the open standard that unified the previously fragmented observability ecosystem — instead of instrumenting your API separately for Datadog, New Relic, and Jaeger, you instrument once with OTel and export to any backend. It's now the default choice for API observability: CNCF-graduated, vendor-neutral, and natively supported by every major cloud and APM provider. The learning curve is real — OTel has a lot of terminology — but the payoff is complete observability without vendor lock-in.

Key Takeaways

  • OpenTelemetry is vendor-neutral — instrument once, export to Jaeger, Grafana Tempo, Datadog, Honeycomb, New Relic, or any OTLP-compatible backend
  • Auto-instrumentation covers the common cases — HTTP, Express, Fastify, pg, Redis, gRPC all have official OTel packages that add traces with zero code changes
  • The three pillars work together — a trace ID links a specific slow request's trace, metrics, and logs so you can go from "error spike" to "root cause" in one workflow
  • OpenTelemetry Collector is optional but recommended — receives OTel data from your apps, transforms it, and fans it out to multiple backends
  • OTel SDK for Node.js is stable (v1.x) — production-ready with active development; ~1.8M weekly downloads for @opentelemetry/sdk-node
  • Context propagation is where OTel shines for microservices — a single trace ID follows a request across 10 services automatically

Why API Observability Matters in 2026

A production API without observability is a black box. You know when it's down (users complain), but you don't know why a specific endpoint got slow, which downstream service caused a timeout cascade, or which 3% of requests are failing silently.

The traditional approach was per-vendor instrumentation: Datadog agent for metrics, Sentry for errors, application logs to Elasticsearch. Each tool had its own SDK, its own data model, and its own concept of a "request." When a bug happened, you'd have metrics in Datadog, a trace in New Relic, and logs in Kibana — with no shared identifier to correlate them.

OpenTelemetry solves this by providing a unified data model (traces, metrics, logs) with a shared trace_id that links all three.


The OpenTelemetry Data Model

Traces and Spans

A trace represents the complete journey of one request through your system. It's composed of spans — each span represents one operation:

Trace ID: abc123
│
├── GET /api/orders/:id (span 1 — HTTP handler, 145ms)
│   ├── validateAuth (span 2 — JWT verify, 3ms)
│   ├── db.orders.findById (span 3 — PostgreSQL query, 38ms)
│   │   └── SELECT * FROM orders WHERE id = $1
│   ├── db.users.findById (span 4 — PostgreSQL query, 12ms)
│   └── calculateShipping (span 5 — external API call, 89ms)
│       └── POST https://shipping-api.com/calculate (span 6)

Every span has:

  • trace_id — shared across the entire request journey
  • span_id — unique to this operation
  • parent_span_id — which span created this one
  • start_time, end_time — when the operation ran
  • attributes — key-value data (HTTP method, DB query, user ID)
  • status — OK, Error, or Unset
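The fields above can be sketched as a plain TypeScript shape. This is an illustration of the data model, not the SDK's actual Span class, and the timestamp values are made up:

```typescript
// A sketch of the span data model described above (not the SDK's Span class).
interface SpanData {
  traceId: string                                  // shared across the whole request
  spanId: string                                   // unique to this operation
  parentSpanId?: string                            // absent on the root span
  startTime: number                                // ms since epoch here; units vary by exporter
  endTime: number
  attributes: Record<string, string | number | boolean>
  status: 'OK' | 'ERROR' | 'UNSET'
}

// Span 3 from the example trace, expressed as data:
const dbSpan: SpanData = {
  traceId: 'abc123',
  spanId: 'span3',
  parentSpanId: 'span1',
  startTime: 1_700_000_000_000,
  endTime: 1_700_000_000_038,
  attributes: {
    'db.system': 'postgresql',
    'db.statement': 'SELECT * FROM orders WHERE id = $1',
  },
  status: 'OK',
}

console.log(dbSpan.endTime - dbSpan.startTime)  // → 38 (the 38ms query)
```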

Metrics

OTel metrics are time-series measurements: request counts, latency histograms, error rates, active connections. Unlike traces (sampled), metrics are aggregated — you capture every request but store the aggregation, not individual data points.
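To make "aggregated, not individual" concrete, here is a minimal sketch of how a histogram instrument condenses raw durations into bucket counts. The boundaries are illustrative; the real SDK aggregation also tracks sum and count:

```typescript
// Sketch: a histogram stores per-bucket counts, not the raw measurements.
// counts[i] holds the number of values <= boundaries[i]; the final slot
// is the overflow bucket for values above the last boundary.
function toHistogram(durationsMs: number[], boundaries: number[]): number[] {
  const counts = new Array(boundaries.length + 1).fill(0)
  for (const d of durationsMs) {
    let i = boundaries.findIndex((b) => d <= b)
    if (i === -1) i = boundaries.length  // overflow bucket
    counts[i]++
  }
  return counts
}

const buckets = toHistogram([3, 12, 48, 250, 1200], [10, 100, 1000])
console.log(buckets)  // → [1, 2, 1, 1]: one ≤10ms, two ≤100ms, one ≤1000ms, one above
```

Five requests collapse into four integers; a million requests would still collapse into four integers, which is why metrics stay cheap at any traffic level.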

Logs

OTel logs connect your existing console.log / pino / winston output to the trace context — adding trace_id and span_id to every log line so you can find the logs for a specific slow request.


Setting Up OpenTelemetry for a Node.js API

Installation

npm install @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http \
  @opentelemetry/exporter-metrics-otlp-http

Instrumentation Setup (tracing.ts)

The key principle: initialize OTel before importing anything else. This is because auto-instrumentation patches modules at import time.

// tracing.ts — must be the FIRST file executed
import { NodeSDK } from '@opentelemetry/sdk-node'
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http'
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics'
import { Resource } from '@opentelemetry/resources'
import { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } from '@opentelemetry/semantic-conventions'

const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: 'orders-api',
    [ATTR_SERVICE_VERSION]: process.env.APP_VERSION ?? '0.0.0',
    'deployment.environment': process.env.NODE_ENV ?? 'development',
  }),

  traceExporter: new OTLPTraceExporter({
    // Fall back to a local collector so an unset env var doesn't produce "undefined/v1/traces"
    url: `${process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? 'http://localhost:4318'}/v1/traces`,
    headers: { Authorization: `Bearer ${process.env.OTEL_AUTH_TOKEN}` },
  }),

  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: `${process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? 'http://localhost:4318'}/v1/metrics`,
    }),
    exportIntervalMillis: 10_000,  // Export metrics every 10 seconds
  }),

  instrumentations: [
    getNodeAutoInstrumentations({
      // Auto-instruments: HTTP, Express/Fastify, pg, Redis, gRPC, fetch
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
      '@opentelemetry/instrumentation-redis': { enabled: true },
      // Disable noisy instrumentations
      '@opentelemetry/instrumentation-fs': { enabled: false },
    }),
  ],
})

sdk.start()

// Flush any pending telemetry before the process exits
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0))
})

In package.json:

{
  "scripts": {
    "start": "node --require ./dist/tracing.js dist/server.js"
  }
}

The --require flag loads tracing.js before your server — this ensures OTel patches Node.js modules before they're imported.
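Why load order matters: auto-instrumentation works by wrapping module exports, and a wrapper only takes effect for code that looks up the export after wrapping. A toy sketch of the idea (not OTel's actual mechanism, which hooks Node's module loader):

```typescript
// Toy illustration of monkey-patching: the "instrumentation" must run
// before any consumer calls the original function.
const http = {
  get(url: string): string {
    return `GET ${url}`
  },
}

const recorded: string[] = []

// Wrap the original export so every call records a span-like entry first.
function instrument() {
  const original = http.get.bind(http)
  http.get = (url: string) => {
    recorded.push(`span: HTTP GET ${url}`)  // what a real span would capture
    return original(url)
  }
}

instrument()             // runs "first", like tracing.js under --require
http.get('/api/orders')  // now every call is traced
console.log(recorded.length)  // → 1
```

If `instrument()` ran after the first `http.get` call instead, that call would bypass the wrapper entirely, which is exactly the bug you get when the OTel SDK initializes too late.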

What Auto-Instrumentation Adds (Zero Code Changes)

With the above setup and no changes to your API code, you automatically get:

  • Every HTTP request as a span with method, URL, status code, and duration
  • Every PostgreSQL/MySQL query as a child span with the SQL text (sanitized)
  • Every Redis command as a child span
  • Every outbound HTTP/fetch call as a child span
  • Error recording when exceptions are thrown
  • Context propagation via traceparent header for distributed tracing

Manual Instrumentation

Auto-instrumentation covers I/O. Manual instrumentation adds business logic context:

import { trace, SpanStatusCode } from '@opentelemetry/api'

const tracer = trace.getTracer('orders-service', '1.0.0')

async function processOrder(orderId: string, userId: string) {
  // Create a custom span for business logic
  return tracer.startActiveSpan('processOrder', async (span) => {
    try {
      // Add business context as attributes
      span.setAttributes({
        'order.id': orderId,
        'user.id': userId,
        'order.processor': 'standard',
      })

      const order = await db.orders.findById(orderId)

      // Add computed attributes as you discover them
      span.setAttributes({
        'order.total': order.total,
        'order.item_count': order.items.length,
        'order.currency': order.currency,
      })

      // Record events (milestones within a span)
      span.addEvent('inventory_checked', { 'items.available': true })

      const result = await fulfillOrder(order)
      span.addEvent('fulfillment_queued', { 'queue.id': result.queueId })

      span.setStatus({ code: SpanStatusCode.OK })
      return result
    } catch (error) {
      // Record the error — this marks the span as failed
      span.recordException(error as Error)
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: (error as Error).message,
      })
      throw error
    } finally {
      span.end()
    }
  })
}

Custom Metrics

Beyond auto-instrumented HTTP metrics, you can track business metrics:

import { metrics } from '@opentelemetry/api'

const meter = metrics.getMeter('orders-service')

// Counters for event counting
const ordersCreated = meter.createCounter('orders.created', {
  description: 'Number of orders created',
  unit: 'orders',
})

// Histograms for duration/size distributions
const orderValue = meter.createHistogram('orders.value', {
  description: 'Distribution of order values',
  unit: 'USD',
  advice: { explicitBucketBoundaries: [10, 50, 100, 500, 1000, 5000] },
})

// Observable gauges for current state
const activeConnections = meter.createObservableGauge('db.connections.active', {
  description: 'Active database connections',
})
activeConnections.addCallback((result) => {
  result.observe(pool.totalCount - pool.idleCount)
})

// Usage in business logic
async function createOrder(data: CreateOrderInput) {
  const order = await db.orders.create(data)

  ordersCreated.add(1, {
    'order.type': order.type,
    'user.plan': order.user.plan,
  })

  orderValue.record(order.total, {
    'order.currency': order.currency,
    'order.type': order.type,
  })

  return order
}

The OpenTelemetry Collector

For production, running the OTel Collector between your app and your backends provides:

  1. Protocol translation — your app sends OTLP; the collector translates to Datadog, Prometheus, Jaeger, etc.
  2. Fan-out — send the same traces to multiple backends (Grafana for SREs, Honeycomb for devs)
  3. Sampling — drop 99% of successful traces but keep all errors
  4. Batching and retry — buffer telemetry if your backend is temporarily unavailable

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024

  # Probabilistic (head) sampling — keeps a fixed fraction of all traces.
  # To keep all errors and slow requests regardless of the sample rate, use
  # the tail_sampling processor instead, which buffers spans and decides
  # per complete trace.
  probabilistic_sampler:
    sampling_percentage: 5  # Keep 5% of traces

exporters:
  # Send to Grafana Tempo for traces
  otlp/tempo:
    endpoint: "http://tempo:4317"
    tls:
      insecure: true

  # Send to Prometheus for metrics
  prometheusremotewrite:
    endpoint: "http://prometheus:9090/api/v1/write"

  # Also send to Datadog
  datadog:
    api:
      key: "${DD_API_KEY}"
      site: datadoghq.com

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, probabilistic_sampler]
      exporters: [otlp/tempo, datadog]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite, datadog]

Backend Options

| Backend | Best For | Pricing |
| --- | --- | --- |
| Grafana Tempo | Self-hosted, budget-conscious, already using Grafana | Free OSS / Grafana Cloud free tier |
| Jaeger | Self-hosted, Kubernetes-native | Free OSS |
| Honeycomb | Developer-focused, high-cardinality queries | Paid ($) |
| Datadog APM | Enterprise, full-stack observability | Expensive ($$$$) |
| New Relic | Enterprise, full-stack | Expensive ($$$) |
| Lightstep (ServiceNow) | Enterprise reliability workflows | Paid ($$) |

For a startup or mid-size team: Grafana Tempo + Prometheus + Grafana Cloud provides excellent observability at near-zero cost.
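A minimal local sketch of that stack, assuming the official Docker images (Tempo and Grafana still need their own configuration files and data-source provisioning, omitted here):

```yaml
# docker-compose.yaml — local sketch of the budget stack
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml
    ports:
      - "4317:4317"   # OTLP/gRPC from your services
      - "4318:4318"   # OTLP/HTTP from your services
  tempo:
    image: grafana/tempo
    command: ["-config.file=/etc/tempo/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo/tempo.yaml   # Tempo config, not shown here
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"   # Grafana UI, with Tempo added as a data source
```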


Distributed Tracing Across Services

OTel's context propagation automatically handles microservice tracing via the traceparent header:

// Service A — makes a call to Service B
const response = await fetch('https://orders-api.internal/process', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    // OTel auto-instrumentation injects traceparent automatically
    // traceparent: '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'
  },
  body: JSON.stringify(payload),
})

// Service B — receives the request
// OTel auto-instrumentation extracts the traceparent header
// and creates a child span with the same trace_id
// Your handler code needs no changes

The result: in your APM backend, you see a single trace spanning both services, showing the complete latency breakdown across the entire request path.
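The traceparent header itself is a small fixed format from the W3C Trace Context spec: version, trace ID, span ID, and flags, dash-separated. A sketch of parsing it, with a spec-valid example value (the IDs are illustrative):

```typescript
// The W3C Trace Context `traceparent` format that auto-instrumentation
// reads and writes: version-traceId-spanId-flags.
interface TraceParent {
  version: string  // '00' for the current spec revision
  traceId: string  // 32 lowercase hex chars — shared by every service in the trace
  spanId: string   // 16 lowercase hex chars — the calling service's span
  flags: string    // '01' = sampled, '00' = not sampled
}

function parseTraceparent(header: string): TraceParent | null {
  const m = header.match(
    /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/,
  )
  if (!m) return null
  return { version: m[1], traceId: m[2], spanId: m[3], flags: m[4] }
}

const tp = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01')
console.log(tp?.traceId)  // → '4bf92f3577b34da6a3ce929d0e0e4736'
```

Service B creates its spans under the incoming trace ID while generating fresh span IDs, which is what stitches both services into one trace in your backend.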


Correlating Logs with Traces

The third pillar — logs — becomes far more powerful when every log line includes the current trace_id and span_id. This lets you jump from a trace in Grafana Tempo to the exact log lines for that request in Loki.

Adding OTel Context to Pino Logs

import pino from 'pino'
import { trace } from '@opentelemetry/api'

// Pino mixin that injects the active trace context into every log line
const logger = pino({
  mixin() {
    const span = trace.getActiveSpan()
    if (!span) return {}

    const spanContext = span.spanContext()
    return {
      trace_id: spanContext.traceId,
      span_id: spanContext.spanId,
      trace_flags: spanContext.traceFlags,
    }
  },
})

// Now every log line automatically includes trace context
async function processOrder(orderId: string) {
  logger.info({ orderId }, 'Processing order')  // → includes trace_id, span_id

  const order = await db.orders.findById(orderId)
  logger.info({ order_total: order.total }, 'Order found')  // → same trace_id

  return fulfillOrder(order)
}

In your observability backend, you can now:

  1. See a slow trace in Grafana Tempo
  2. Click "View logs for this trace"
  3. See every log line from all services for that exact request

Automatic Log Correlation with Winston

For Winston users, the @opentelemetry/winston-transport package adds trace context automatically:

import winston from 'winston'
import { OpenTelemetryTransportV3 } from '@opentelemetry/winston-transport'

const logger = winston.createLogger({
  transports: [
    new winston.transports.Console(),
    new OpenTelemetryTransportV3(),  // Sends logs as OTel log records
  ],
})
// Every winston.info() call now propagates trace context

Real-World Debugging Workflow

Here's how OTel transforms incident response. A user reports that their checkout is slow:

Before OTel

  1. Search application logs for the user's ID — find 400 unrelated log lines
  2. Check Datadog for latency spikes — see the spike but not which operation
  3. SSH into the server, check pg_stat_activity — the slow query is gone
  4. Guess that it was the shipping calculation
  5. Add console.time() calls and wait for it to happen again

After OTel

  1. Search Honeycomb/Grafana for user.id = 'abc' with duration > 2000ms
  2. Find the trace immediately — see that calculateShipping took 1,800ms
  3. Click into the calculateShipping span — see that the external shipping API returned 429 (rate limited)
  4. Find the same pattern in metrics — shipping_api.errors counter spiked at 14:32
  5. Fix: add retry logic with backoff to the shipping API client

The entire process takes 5 minutes instead of 2 hours. This is the real ROI of observability.


OTel vs Vendor Agents

A common question: why use OTel instead of installing the Datadog agent?

| Factor | OpenTelemetry | Datadog Agent |
| --- | --- | --- |
| Vendor lock-in | None — switch backends freely | High — proprietary format |
| Setup complexity | Higher (more config) | Lower (install agent) |
| Cost | Free (OSS) | Pay per host + volume |
| Ecosystem | Universal | Datadog-specific |
| Custom metrics | Full flexibility | Limited to Datadog types |
| Backend choice | Grafana, Jaeger, Honeycomb, etc. | Datadog only |

For startups: start with OTel + Grafana Cloud (generous free tier). For enterprises already on Datadog: use the OTel SDK and export via the collector's Datadog exporter — you get OTel's flexibility without switching backends.


Methodology

  • npm download data from npmjs.com API, March 2026 weekly averages
  • Package versions: @opentelemetry/sdk-node v1.x, @opentelemetry/auto-instrumentations-node v0.54.x
  • Sources: OpenTelemetry official documentation (opentelemetry.io), CNCF project status, Grafana and Honeycomb blog posts

Explore observability and API tooling alternatives on APIScout — see which observability packages developers are adopting.

Related: API Error Handling Patterns for Production 2026 · API Gateway Patterns for Microservices 2026 · API Rate Limiting Best Practices 2026
