Quick Facts
- Category: Linux & DevOps
- Published: 2026-05-18 10:33:29
Overview
The CUBIC congestion control algorithm, standardized in RFC 9438, is the default in Linux and therefore governs how most TCP and QUIC connections on the public internet probe for bandwidth, react to loss, and recover. Cloudflare's open-source QUIC implementation, quiche, uses CUBIC as its default congestion controller, putting it in the critical path for a significant share of served traffic. This guide tells the story of a bug where CUBIC's congestion window (cwnd) gets permanently pinned at its minimum after a congestion collapse, never recovering. The root cause was a Linux kernel change intended to align CUBIC with the app-limited exclusion described in RFC 9438 §4.2-12 — a valid TCP fix that, when ported to QUIC, exposed unexpected behaviors. The happy ending: an elegant near-one-line fix that broke the cycle.

Prerequisites
To follow this tutorial, you should have:
- A basic understanding of TCP/IP and QUIC protocols
- Familiarity with congestion control concepts (cwnd, pacing, loss detection)
- Some knowledge of the Linux kernel and C programming (for the code examples)
- Access to a QUIC implementation (like quiche) for testing
Step-by-Step Instructions
1. Understand CUBIC's Core Logic
Before diving into the bug, grasp how CUBIC works. The central knob is the congestion window (cwnd), a sender-side limit on bytes in flight. CUBIC, like all loss-based algorithms, grows cwnd when the network appears healthy and shrinks it on loss. Its key premise: no loss means increase the sending rate; loss means capacity was exceeded, so back off. RFC 9438 also specifies an app-limited exclusion: if the sender is not fully using the window (for example, because the application has no more data queued), CUBIC should not grow cwnd, and time spent app-limited should not count toward the elapsed time that drives the cubic growth curve. Otherwise a slow or idle sender could build up a large window that no ACK feedback ever validated and later burst it into the network, which hurts fairness.
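To make the growth curve concrete, here is a minimal sketch of the window function from RFC 9438, W_cubic(t) = C * (t - K)^3 + W_max, with K = cbrt((W_max - cwnd_epoch) / C). Windows are in segments and time is in seconds. The function names, the Newton-iteration cube root (used so the sketch builds without linking libm), and the K = 0 clamp are assumptions made for illustration, not code from quiche or the Linux kernel.

```c
#include <assert.h>

/* Scaling constant of the cubic curve; RFC 9438 recommends 0.4. */
#define CUBIC_C 0.4

/* Cube root of x >= 0 by Newton's method. Illustrative stand-in for cbrt(). */
static double cube_root(double x)
{
    assert(x >= 0.0);
    if (x == 0.0)
        return 0.0;
    double g = x > 1.0 ? x : 1.0;          /* start at or above the root */
    for (int i = 0; i < 200; i++)
        g = (2.0 * g + x / (g * g)) / 3.0; /* Newton step for g^3 = x */
    return g;
}

/* K: seconds the curve needs to climb from cwnd_epoch back to w_max. */
double cubic_k(double w_max, double cwnd_epoch)
{
    if (cwnd_epoch >= w_max)
        return 0.0;                        /* defensive clamp (assumption) */
    return cube_root((w_max - cwnd_epoch) / CUBIC_C);
}

/* W_cubic(t), where t is the time in seconds since the epoch started. */
double cubic_window(double t, double w_max, double cwnd_epoch)
{
    double d = t - cubic_k(w_max, cwnd_epoch);
    return CUBIC_C * d * d * d + w_max;
}
```

With w_max = 100 and cwnd_epoch = 70, K is about 4.2 seconds: the curve starts at 70, flattens as it approaches 100 at t = K, and then accelerates past it. The important detail for the rest of this guide is that t is measured from epoch_start, so anything that keeps moving epoch_start forward keeps t near zero.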
2. Identify the Symptom: Intermittent Test Failures
Our investigation began with reports of erratic failures in the ingress proxy integration test pipeline. Tests involving heavy early loss in the connection showed that CUBIC's cwnd would never recover from congestion collapse. The test failed 61% of the time — a clear sign of a state machine bug. Most congestion control tests exercise steady-state growth; this one probed the rare but critical minimum-cwnd regime after heavy loss.
3. Trace the Root Cause: App-Limited Exclusion in TCP
A prior Linux kernel change aimed to fix CUBIC's compliance with RFC 9438 by adding app-limited exclusion logic. In TCP, this fix worked fine. However, when quiche ported that same logic to QUIC, it introduced a subtle bug. The bug surfaced because QUIC's loss recovery and acknowledgment semantics differ from TCP's. Specifically, the app-limited exclusion condition reset an internal state variable (epoch_start) at the wrong time, preventing cwnd from ever growing after a collapse.
4. Analyze the Bug: How cwnd Gets Pinned
Let's walk through the bug mechanism step by step:
- During heavy loss, CUBIC reduces cwnd to its minimum (typically 2 packets).
- The connection enters recovery; new data may be limited because the application hasn't queued more (app-limited).
- The app-limited exclusion logic, when triggered, sets epoch_start to the current time, effectively restarting the growth phase.
- Because epoch_start keeps getting reset (each time the sender is app-limited during recovery), CUBIC's window growth is constantly restarted; it never accumulates enough time to increase cwnd.
- The cwnd remains stuck at the minimum, even after the network recovers.
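The mechanism above can be reproduced with a toy model. The sketch below is not quiche's code: the struct, the function names, and the simplifications (K = 0 because w_max already sits at the floor, no Reno-friendly region, one ACK per RTT) are all assumptions chosen to isolate the effect of resetting epoch_start on every app-limited ACK.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MIN_CWND 2.0   /* segments; the floor described above */
#define CUBIC_C  0.4   /* RFC 9438 scaling constant */

/* Hypothetical, stripped-down controller state. */
struct toy_cubic {
    double cwnd;         /* segments */
    double w_max;        /* window before the last reduction */
    double epoch_start;  /* seconds; start of the current growth epoch */
};

/* Buggy ACK handler: every app-limited ACK restarts the epoch. */
void toy_on_ack_buggy(struct toy_cubic *cc, double now, bool app_limited)
{
    assert(cc != NULL);
    if (app_limited)
        cc->epoch_start = now;              /* forces t back to zero */

    double t = now - cc->epoch_start;
    double target = CUBIC_C * t * t * t + cc->w_max;  /* K = 0 at the floor */
    if (target > cc->cwnd)
        cc->cwnd = target;
}

/* Feed `acks` ACKs spaced `rtt` seconds apart; return the final cwnd. */
double toy_run_buggy(int acks, double rtt, bool app_limited)
{
    struct toy_cubic cc = { MIN_CWND, MIN_CWND, 0.0 };
    for (int i = 1; i <= acks; i++)
        toy_on_ack_buggy(&cc, i * rtt, app_limited);
    return cc.cwnd;
}
```

In this model, toy_run_buggy(100, 0.05, true) returns exactly the floor, because t is zero on every ACK, while the same run with app_limited set to false lets t reach 5 seconds and the window climb well past 50 segments.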
5. Implement the Near-One-Line Fix
The fix was elegant: only apply the app-limited exclusion when the connection is not in a loss recovery state. Adding a single check — if (!in_recovery) before resetting epoch_start — broke the cycle. In code, this might look like:

    if (app_limited && !in_recovery) {
        // Apply the app-limited exclusion only outside loss recovery
        epoch_start = now;
    }
This ensures that during the critical recovery phase, the CUBIC state machine is not interrupted. Once recovery ends, normal app-limited logic can safely apply.
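As a rough illustration of what the guard changes, the sketch below tracks only the age of the epoch while a stream of app-limited ACKs arrives during recovery. The struct, the function names, and the use_guard switch are hypothetical and exist only to compare the two behaviors side by side; real controllers track much more state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of controller state relevant to the guard. */
struct epoch_state {
    double epoch_start;   /* seconds */
    bool   in_recovery;
};

/* Handle one app-limited ACK. With use_guard set, the reset is skipped
 * during recovery; without it, the epoch restarts unconditionally. */
void on_app_limited_ack(struct epoch_state *s, double now, bool use_guard)
{
    assert(s != NULL);
    if (!use_guard || !s->in_recovery)
        s->epoch_start = now;
}

/* Age of the epoch in seconds after `acks` app-limited ACKs, one per
 * `rtt`, all arriving while the connection is in recovery. */
double epoch_age_in_recovery(int acks, double rtt, bool use_guard)
{
    struct epoch_state s = { 0.0, true };
    double now = 0.0;
    for (int i = 1; i <= acks; i++) {
        now = i * rtt;
        on_app_limited_ack(&s, now, use_guard);
    }
    return now - s.epoch_start;
}
```

Without the guard the epoch age is always zero after the last ACK; with the guard it equals the full time spent in recovery, which is what lets the time-based growth term become non-zero.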
6. Verify the Fix
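After applying the fix, re-run the integration test with heavy early loss. The failure rate dropped to zero. Additionally, monitor throughput and cwnd traces to confirm that cwnd recovers after congestion events. Note that ss only reports the kernel's TCP state (for example, ss -ti shows cwnd for TCP sockets), so it cannot show the cwnd of a user-space QUIC stack such as quiche; for QUIC, use the implementation's own logging, such as qlog output, to observe the cwnd evolution.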
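Once a cwnd trace has been extracted from the logs, the recovery check can be automated. The helper below is a minimal sketch under the assumption that the trace is already available as an array of samples in segments; the function name and the floor and target parameters are illustrative, and parsing the log format is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if the trace drops to `floor` (or below) at some point and
 * later climbs to at least `target`. A trace that never collapses, or that
 * collapses and stays low, returns false. */
bool cwnd_recovers(const double *trace, size_t n, double floor, double target)
{
    assert(trace != NULL || n == 0);
    bool hit_floor = false;
    for (size_t i = 0; i < n; i++) {
        if (trace[i] <= floor)
            hit_floor = true;
        else if (hit_floor && trace[i] >= target)
            return true;
    }
    return false;
}
```

A trace such as 10, 4, 2, 2, 3, 6, 12 passes with floor 2 and target 10, whereas 10, 4, 2, 2, 2, 2 fails, which is the signature of the pinned-cwnd bug.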
Common Mistakes
Mistake 1: Blindly Porting Kernel Code to QUIC
The original Linux kernel fix was correct for TCP, but QUIC's different loss recovery model (e.g., monotonically increasing packet numbers instead of sequence numbers, and ACK frames that carry ranges) made the app-limited exclusion logic behave differently. Always test edge cases when porting congestion control code.
Mistake 2: Ignoring the Recovery State
Many congestion control implementations treat app-limited logic uniformly, without considering whether the connection is in recovery. This can lead to cwnd starvation. Ensure that state transitions are well-defined.
Mistake 3: Insufficient Testing of Minimum cwnd Regimes
Most tests focus on steady-state throughput. As this bug shows, the minimum cwnd regime is fragile. Incorporate soak tests that simulate severe loss and then clear conditions to verify recovery.
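One possible shape for such a soak test is sketched below: a small harness that drives a controller through a lossy phase followed by a clean phase and checks that cwnd leaves the floor again. The callback type and both toy controllers are hypothetical stand-ins used only to exercise the harness; a real test would wrap the actual congestion controller and a network emulator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A controller under test: feed one round (lossy or clean), get cwnd back. */
typedef double (*cc_step_fn)(void *state, bool loss);

/* Run `loss_rounds` lossy rounds, then `clean_rounds` loss-free rounds.
 * Passes only if cwnd reached `floor` and finished above it. */
bool soak_recovers(cc_step_fn step, void *state,
                   int loss_rounds, int clean_rounds, double floor)
{
    assert(step != NULL);
    double cwnd = floor;
    bool hit_floor = false;
    for (int i = 0; i < loss_rounds; i++) {
        cwnd = step(state, true);
        if (cwnd <= floor)
            hit_floor = true;
    }
    for (int i = 0; i < clean_rounds; i++)
        cwnd = step(state, false);
    return hit_floor && cwnd > floor;
}

/* Toy AIMD controller: halve on loss, add one segment per clean round. */
double aimd_step(void *state, bool loss)
{
    double *cwnd = state;
    *cwnd = loss ? *cwnd / 2.0 : *cwnd + 1.0;
    if (*cwnd < 2.0)
        *cwnd = 2.0;
    return *cwnd;
}

/* Toy controller that never grows, mimicking the pinned-cwnd symptom. */
double pinned_step(void *state, bool loss)
{
    double *cwnd = state;
    if (loss) {
        *cwnd /= 2.0;
        if (*cwnd < 2.0)
            *cwnd = 2.0;
    }
    return *cwnd;
}
```

Starting both toy controllers at 64 segments with 10 lossy and 20 clean rounds, the AIMD controller passes and the pinned one fails. Requiring that the floor was actually reached keeps the test from passing trivially when the loss phase is too mild to cause a collapse.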
Summary
This tutorial covered the discovery and fix of a CUBIC bug in QUIC where the congestion window got stuck at its minimum after heavy loss. The culprit was app-limited exclusion logic ported from TCP that reset a critical state variable (epoch_start) during recovery. The fix was a single conditional check to skip the exclusion during recovery. Key takeaways: understand the nuances of protocol differences when porting congestion control code, test recovery scenarios thoroughly, and keep fixes simple. The fix has been merged into quiche and improves resilience for QUIC traffic served by it.