Lesson 11 of 13

APM

Part of the Monitoring & Observability tutorial series.

The Gap in Infrastructure Monitoring

So far, we have discussed using tools like Prometheus or the ELK stack to gather metrics and logs.

If your backend is a Node.js application running in a Kubernetes Pod, Prometheus will tell you that the Pod is consuming 800MB of RAM and 95% CPU, and Loki will show you the application's console.log() output.

But what if you need to know exactly why the application is slow?

  • Is it spending too much time executing a specific for-loop in the utils.js file?
  • Which exact PostgreSQL database query is taking the longest to resolve?
  • Is the application bottlenecking while parsing a massive JSON payload?

Infrastructure monitoring cannot answer these questions because it treats the application as an impenetrable black box. To see inside the code, you need APM (Application Performance Monitoring).


What is APM?

APM tools trace how your code actually executes at runtime, often down to the individual function or line.

They provide "Code-level Visibility."

Commercial Dominance

While observability tools like Prometheus and Grafana are overwhelmingly Open Source, the APM space is heavily dominated by commercial SaaS vendors.

Building the agents required to hook deeply into Python, Java, Ruby, Node, and Go runtimes without crashing the applications is incredibly difficult. Corporations pay massive sums of money to vendors who have perfected this art.

The APM Giants:

  • Datadog
  • New Relic
  • AppDynamics (Cisco)
  • Dynatrace

How APM Agents Work (Auto-Instrumentation)

The magic of commercial APM tools is Auto-Instrumentation.

Let's look at the Datadog APM setup for a Node.js application. You do not need to rewrite your application to send metrics. You simply install the dd-trace library and require it at the very top of your application entry file:

// index.js (Line 1)
const tracer = require('dd-trace').init();
 
// The rest of your normal application code...
const express = require('express');
const app = express();

That single line of initialization code does something incredibly powerful: it dynamically "monkey-patches" the Node.js runtime.

It wraps itself silently around core libraries like express, pg (PostgreSQL), redis, and the native http module.
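To make "monkey-patching" concrete, here is a toy sketch of the technique (not dd-trace's actual internals): an existing method is replaced with a wrapper that times every call and records a span, while the calling code stays untouched. The `db` object and `patchMethod` helper are invented for illustration.

```javascript
// Stand-in for a third-party library function we don't control,
// analogous to pg's query() or http.request().
const db = {
  query(sql) {
    return `rows for: ${sql}`;
  },
};

const recordedSpans = [];

// Replace obj[methodName] with a timing wrapper that delegates to the
// original implementation -- the essence of auto-instrumentation.
function patchMethod(obj, methodName) {
  const original = obj[methodName];
  obj[methodName] = function (...args) {
    const start = process.hrtime.bigint();
    try {
      return original.apply(this, args);
    } finally {
      const durationNs = process.hrtime.bigint() - start;
      recordedSpans.push({ method: methodName, args, durationNs });
    }
  };
}

patchMethod(db, 'query');

// Application code is unchanged -- it still just calls db.query().
const result = db.query('SELECT * FROM users');
console.log(result); // rows for: SELECT * FROM users
```

After the patch, every call to `db.query()` leaves a timed entry in `recordedSpans`; a real agent would ship those spans to its backend instead of an in-memory array.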

Without writing any custom spans or manual timing logs, the Datadog agent will automatically capture:

  • The execution time of every HTTP route (GET /api/users).
  • The exact raw text of every SQL query generated, and its execution latency.
  • The latency of every Redis cache hit or miss.

APM Profiling and Stack Traces

The most advanced feature of APM tools is Continuous Profiling.

Many times per second, the APM agent captures a snapshot of each thread's stack trace. The web UI aggregates these thousands of snapshots into a "Flame Graph."

Looking at a Flame Graph allows an engineer to say: "Oh wow, 45% of our entire server's CPU time is being spent inside the JSON.parse() call on line 42 of payment_processor.js."

This allows developers to optimize the absolute most expensive lines of code in their massive codebases purely by glancing at a graph, rather than guessing where performance bottlenecks lie.
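The aggregation behind a flame graph can be sketched in a few lines. Profilers fold each sampled call stack into a semicolon-joined string and count how often each path appears; the most frequent paths become the widest "flames." The sample stacks below are invented for illustration.

```javascript
// Each sample is the call stack captured at one instant, root first.
// (Sample data is invented for illustration.)
const samples = [
  ['main', 'handleRequest', 'JSON.parse'],
  ['main', 'handleRequest', 'JSON.parse'],
  ['main', 'handleRequest', 'queryDb'],
  ['main', 'backgroundJob'],
];

// Fold stacks into "root;child;leaf" keys and count occurrences --
// the text format that flame-graph renderers consume.
function foldStacks(stacks) {
  const counts = new Map();
  for (const stack of stacks) {
    const key = stack.join(';');
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

const folded = foldStacks(samples);
console.log(folded.get('main;handleRequest;JSON.parse')); // 2
// JSON.parse appears in 2 of 4 samples, so it accounts for
// roughly 50% of sampled CPU time.
```

A real continuous profiler does exactly this folding, just over millions of samples collected in production.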


The Rise of OpenTelemetry

Historically, if a company used New Relic for APM, their application code was permanently married to the New Relic proprietary agents. Switching to Datadog meant ripping out libraries across 100 repositories.

As mentioned in the Distributed Tracing chapter, OpenTelemetry (OTel) was designed explicitly to break this vendor lock-in.

All of the major APM vendors now support the OpenTelemetry protocol (OTLP). Today, the best practice is to instrument your code once using the open-source OTel auto-instrumentation libraries, send the vendor-neutral data to an OTel Collector, and then configure the Collector to forward the traces to Datadog, New Relic, or an open-source backend like Grafana Tempo.
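A minimal Collector configuration for that pattern might look like the sketch below: receive OTLP from your instrumented applications and export it to a tracing backend. The endpoint addresses are placeholders, and a production setup would add batching, authentication, and TLS.

```yaml
# Sketch of an OTel Collector pipeline (endpoints are placeholders).
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  otlp/tempo:
    endpoint: tempo.example.internal:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]
```

Because the application only ever speaks OTLP to the Collector, swapping Tempo for a commercial vendor is a Collector config change, not a code change across 100 repositories.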