Quick Facts
- Category: Linux & DevOps
- Published: 2026-05-19 15:36:26
Introduction
With the release of Kubernetes v1.36, an important monitoring capability has reached general availability: Pressure Stall Information (PSI) metrics. Originally introduced in the Linux kernel in 2018, PSI provides high-fidelity signals that help operators detect resource saturation before it escalates into an outage. Unlike traditional utilization metrics that only show raw usage percentages, PSI reveals the real impact on workloads by measuring the time tasks spend stalled due to contention for CPU, memory, or I/O. This article explores why PSI matters, how it works, and the rigorous performance testing that confirmed its readiness for production environments.

Beyond Traditional Utilization: Why PSI Matters
Monitoring CPU or memory usage alone can be misleading. A node may report moderate CPU utilization (say 70%) while some critical tasks experience severe latency due to scheduling delays. PSI fills this gap by providing two key types of data:
- Cumulative Totals: Absolute time spent in a stalled state, giving a clear picture of resource contention over the system's lifetime.
- Moving Averages: Windows of 10, 60, and 300 seconds that help operators distinguish between transient spikes and sustained resource pressure.
By exposing these metrics at the node, pod, and container levels, Kubernetes now offers a stable, reliable interface for observing resource contention—enabling proactive tuning and capacity planning.
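To make the two data types concrete, here is a minimal sketch of parsing the text format the Linux kernel uses for pressure files, where each line carries the three moving averages (as percentages) and a cumulative total in microseconds. The `PressureLine` and `parse_pressure` names and the sample values are illustrative assumptions, not part of Kubernetes or the kernel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PressureLine:
    """One line of a PSI pressure file ("some" or "full")."""
    avg10: float   # percent of time stalled over the last 10 seconds
    avg60: float   # same, over the last 60 seconds
    avg300: float  # same, over the last 300 seconds
    total_us: int  # cumulative stall time in microseconds


def parse_pressure(text: str) -> dict:
    """Parse PSI file contents into {"some": PressureLine, "full": PressureLine}."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        kind, *fields = line.split()
        values = dict(field.split("=", 1) for field in fields)
        result[kind] = PressureLine(
            avg10=float(values["avg10"]),
            avg60=float(values["avg60"]),
            avg300=float(values["avg300"]),
            total_us=int(values["total"]),
        )
    return result


# Illustrative sample only, not measured data.
SAMPLE = """\
some avg10=1.50 avg60=0.75 avg300=0.20 total=123456
full avg10=0.00 avg60=0.10 avg300=0.05 total=7890
"""

if __name__ == "__main__":
    for kind, pressure in parse_pressure(SAMPLE).items():
        print(kind, pressure)
```

The "some" line counts time in which at least one task was stalled, while "full" counts time in which all non-idle tasks were stalled at once.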
How PSI Metrics Integrate with Kubernetes
PSI collection in Kubernetes builds on interfaces the Linux kernel already provides: system-wide pressure files under /proc/pressure and, on cgroup v2, per-cgroup files named cpu.pressure, memory.pressure, and io.pressure. The kubelet reads the cgroup-level pressure data and exposes it through its metrics endpoints. This integration required careful engineering to minimize overhead, as discussed in the performance validation below.
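As a rough illustration of what reading cgroup-level pressure involves, the sketch below pulls the cumulative stall total out of each pressure file in a cgroup v2 directory. It is not the kubelet's actual implementation; the helper names are hypothetical, and the directory you pass in (for example a path under /sys/fs/cgroup) depends on how your node lays out its cgroups.

```python
from pathlib import Path
from typing import Dict, Optional

RESOURCES = ("cpu", "memory", "io")


def read_total_us(path: Path, kind: str = "some") -> Optional[int]:
    """Return the cumulative stall time in microseconds, or None if unavailable."""
    try:
        text = path.read_text()
    except OSError:
        # File missing, or the kernel has PSI disabled and refuses the read.
        return None
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == kind:
            for field in fields[1:]:
                key, _, value = field.partition("=")
                if key == "total":
                    return int(value)
    return None


def read_cgroup_totals(cgroup_dir: str) -> Dict[str, Optional[int]]:
    """Read <resource>.pressure for each resource in a cgroup v2 directory."""
    base = Path(cgroup_dir)
    return {name: read_total_us(base / f"{name}.pressure") for name in RESOURCES}
```

Returning None rather than raising keeps the sketch usable on nodes where PSI is switched off, where a caller would typically skip the metric instead of failing.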
Performance Validation: Proving GA Readiness
A common concern when graduating telemetry features is the resource overhead required to collect and serve the metrics. To address this, SIG Node conducted extensive performance testing on high-density workloads with over 80 pods across various machine types. The testing focused on isolating the impact of the kubelet and kernel-level collection separately.
Scenario 1: Kubelet Overhead
In this scenario, the Linux kernel was already tracking pressure on both clusters (with psi=1), but the kubelet feature gate was toggled to measure the impact of actively querying and exposing these metrics. Tests were run on 4-core machines. The results show that the kubelet's CPU usage remained practically identical in both magnitude and frequency, regardless of whether PSI metrics were being collected. As illustrated in the CPU usage rate comparison graph (not shown here), the kubelet consumed approximately 0.1 cores or 2.5% of total node capacity—well within normal operating overhead. This confirms that the collection logic is lightweight and blends seamlessly into standard housekeeping cycles.
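The relationship between the two figures quoted above is simple arithmetic, shown here as a small sketch; the `share_of_node` helper is a hypothetical name introduced only for illustration.

```python
def share_of_node(cores_used: float, node_cores: int) -> float:
    """Return CPU usage as a percentage of total node capacity."""
    if node_cores <= 0:
        raise ValueError("node_cores must be positive")
    return 100.0 * cores_used / node_cores


if __name__ == "__main__":
    # 0.1 cores used by the kubelet on a 4-core test machine.
    print(f"{share_of_node(0.1, 4):.1f}%")  # prints 2.5%
```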
Scenario 2: Kernel Overhead
Next, the team evaluated system-level CPU overhead by comparing two clusters that both had the kubelet feature on: one with kernel PSI enabled (psi=1) and one with kernel PSI disabled (psi=0). System CPU usage on the PSI-enabled cluster followed the same pattern as on the disabled cluster, with only a slight expected increase from the baseline (around 2.5 cores of system CPU). This indicates that kernel-level PSI tracking, together with Kubernetes reading the resulting cgroup metrics, adds little kernel load. The feature is therefore safe for production-scale deployments.
Conclusion and Next Steps
The graduation of PSI metrics to GA in Kubernetes v1.36 marks a significant milestone for observability. Operators now have a trusted, low-overhead mechanism to detect and diagnose resource contention before it impacts users. To use this feature, ensure your nodes run Linux kernel 4.20 or later with PSI enabled and use cgroup v2; the kubelet feature gate is on by default in v1.36. Start by monitoring metrics such as container_pressure_cpu_waiting_seconds_total and container_pressure_memory_stalled_seconds_total from the kubelet's cAdvisor metrics endpoint. The similarly named node_pressure_* metrics come from the Prometheus node_exporter rather than from the kubelet. With PSI, you can move beyond simple utilization and truly understand the pressure your workloads are under.
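Because the *_seconds_total pressure metrics are cumulative counters, they become useful once you look at how fast they grow. The sketch below, with a hypothetical `stall_fraction` helper and made-up sample values, turns two scrapes of such a counter into the fraction of the interval spent stalled. This is similar in spirit to applying a rate function in a monitoring stack, though real implementations handle scrape timing more carefully.

```python
def stall_fraction(prev_total_s: float, curr_total_s: float, interval_s: float) -> float:
    """Fraction of the interval spent stalled, from two cumulative counter samples."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = curr_total_s - prev_total_s
    if delta < 0:
        # Counter went backwards, e.g. after a container restart: treat the
        # current value as the growth since the reset.
        delta = curr_total_s
    return delta / interval_s


if __name__ == "__main__":
    # Illustrative values: the counter grew by 3 seconds across a 30-second window.
    fraction = stall_fraction(120.0, 123.0, 30.0)
    print(f"stalled {fraction:.0%} of the time")
```

A sustained value well above zero over several windows points to lasting contention rather than a transient spike, which is the distinction the 10, 60, and 300 second averages are meant to capture.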
For more details, see the Kubernetes documentation on PSI metrics.