Quick Facts
- Category: Education & Careers
- Published: 2026-05-17 21:52:06
Cloudflare relies heavily on ClickHouse to process usage billing worth hundreds of millions of dollars. When the daily aggregation jobs suddenly slowed to a crawl after a routine migration, the engineering team had to find out why. This article unpacks the bottleneck hidden deep inside ClickHouse’s internals, the investigation that ruled out the usual suspects, and the three patches that restored performance. Here are 10 insights from that investigation.
1. The Billing Pipeline That Powers Millions
Cloudflare’s billing system depends on millions of daily ClickHouse queries to compute usage for hundreds of thousands of customers. These jobs determine how much each user pays for Cloudflare products, and they also feed fraud detection systems. Any delay in the aggregation jobs creates downstream reconciliation nightmares. The pipeline processes hundreds of millions of dollars in usage revenue, making its reliability absolutely critical. When these jobs slowed down, the entire billing timeline was thrown off, and the team knew they had to act fast.

2. The Migration That Sparked the Slowdown
The trouble began right after a scheduled migration of the billing cluster. The team expected some minor performance hiccups, but instead the daily aggregation jobs took hours longer than usual. The typical performance metrics all looked normal: I/O waits were low, memory pressure was negligible, and the number of rows scanned per query had not changed. That ruled out the obvious causes and pointed to something deeper inside ClickHouse, so the search for the real culprit moved into the database engine's internals.
3. Why the Usual Suspects Were Clean
When a ClickHouse query slows down, engineers usually check I/O utilization, memory consumption, and the number of parts read. In this case, every one of those metrics was well within normal bounds. The team even looked at the query profiles—no strange wait events, no unusual memory allocations. This was puzzling; the system seemed healthy everywhere they looked. They eventually realized that the bottleneck was hidden in a part of ClickHouse that rarely gets attention: the sorting and merging logic for the table’s primary key. This discovery set them on a path to understand the internal data structures.
4. The Scale of Cloudflare’s ClickHouse Deployments
Cloudflare stores over 100 petabytes of data across several dozen ClickHouse clusters. The Ready-Analytics system alone had grown to more than 2 petabytes by December 2024, ingesting millions of rows per second. This massive scale amplifies every inefficiency. Even a small change in query execution can cascade into hours of extra processing. Understanding the environment is crucial: the table in question used a standard schema with 20 float fields, 20 string fields, a timestamp, and an indexID, sorted by a primary key of (namespace, indexID, timestamp).
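To make the sort order concrete, here is a minimal Python sketch of the fixed schema and its primary key. The class name, field names, sample values, and the choice of a string for indexID are assumptions for illustration; the source only states the field counts and the key order (namespace, indexID, timestamp).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One row of the shared table. Field names are illustrative only."""
    namespace: str
    index_id: str          # assumed to be a string; the source does not give its type
    timestamp: int         # assumed Unix seconds
    floats: tuple = (0.0,) * 20   # the 20 float fields
    strings: tuple = ("",) * 20   # the 20 string fields

    def sort_key(self):
        # Mirrors the primary key order (namespace, indexID, timestamp).
        return (self.namespace, self.index_id, self.timestamp)


# Hypothetical rows from two namespaces, in arrival order.
rows = [
    Record("dns", "zone-b", 200),
    Record("billing", "acct-2", 150),
    Record("dns", "zone-a", 300),
    Record("billing", "acct-1", 400),
    Record("billing", "acct-1", 100),
]

# Sorting by the key groups each namespace together, then orders by
# indexID, then by time within each indexID.
ordered = sorted(rows, key=Record.sort_key)
keys = [r.sort_key() for r in ordered]
for key in keys:
    print(key)
```

Because namespace leads the key, every dataset occupies a contiguous range of the sorted data, and each namespace can choose indexID values that suit its own queries.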
5. How Ready-Analytics Simplified Onboarding
Introduced in early 2022, Ready-Analytics allowed internal teams to stream data into one giant table instead of designing custom schemas. Datasets were disambiguated by a namespace string, and each record used a fixed schema. The indexID field formed part of the primary key, letting each namespace sort its data optimally for the queries it expected to run. This lowered the barrier for adoption—hundreds of applications signed up. But the simplicity came with a hidden cost: the entire table shared a single retention policy, which became the next big problem.
6. The Single Retention Policy Limitation
Before ClickHouse had native TTL features, Cloudflare built its own retention system based on daily partitions. The Ready-Analytics table retained data for exactly 31 days, dropping older partitions automatically. This one-size-fits-all approach was a dealbreaker for many teams. Some needed to store data for years due to legal or contractual reasons; others needed only a few days. Those teams could not use Ready-Analytics and had to go through a much more complex conventional setup. The growing demand for flexible per-namespace retention forced the team to redesign the system.
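The partition-based retention described above can be sketched in a few lines of Python. This is not Cloudflare's implementation: the function name, the representation of a partition as a `date`, and the boundary rule (a partition is dropped once it is strictly more than 31 days old) are all assumptions.

```python
from datetime import date, timedelta


def expired_partitions(partitions, today, retention_days=31):
    """Return the daily partitions that fall outside the retention window.

    Assumption: a partition is expired when its day is strictly older
    than `today - retention_days`.
    """
    cutoff = today - timedelta(days=retention_days)
    return sorted(p for p in partitions if p < cutoff)


# Hypothetical example: 35 consecutive daily partitions ending today.
today = date(2025, 1, 31)
partitions = [today - timedelta(days=n) for n in range(35)]

to_drop = expired_partitions(partitions, today)
for day in to_drop:
    print("drop partition", day.isoformat())
```

Dropping a whole partition is cheap because no rows need to be rewritten, which is why a single table-wide retention period was easy to operate.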

7. The Hidden Internals Bottleneck
During the investigation, the team discovered that the bottleneck was buried deep inside ClickHouse’s primary key sorting and merging stages. The per-namespace retention requirement meant that old partitions could no longer be dropped uniformly. Instead, the system had to selectively delete rows from within partitions, which triggered unexpected, heavy merges. These merges were not visible in the usual performance counters because they happened as background operations. The engine was spending huge amounts of CPU time re-sorting and merging ranges that contained data from many namespaces, leading to severe query delays.
8. Why Partition-Level Deletion Was Not Sufficient
The team initially tried to solve the retention problem by just dropping old partitions based on the minimum timestamp across all namespaces. But because different namespaces had different retention needs, this approach either deleted data that should be kept or kept data that should be deleted. A more granular deletion mechanism was required, but ClickHouse’s merge-tree engine struggled to efficiently remove individual rows within a partition without causing expensive merges. This tension between per-namespace retention and partition-level operations was the root cause of the slowdown.
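A short Python sketch shows why a single partition-level decision cannot satisfy several retention periods at once. The namespace names, retention values, and function name are hypothetical, chosen only to illustrate the three possible outcomes for a daily partition.

```python
from datetime import date, timedelta


def classify_partition(partition_day, today, retention_by_namespace):
    """Classify one daily partition as 'keep', 'drop', or 'mixed'.

    'mixed' means some namespaces have expired and others have not,
    so dropping the whole partition would delete data that must be kept.
    """
    age_days = (today - partition_day).days
    expired = {
        ns for ns, days in retention_by_namespace.items() if age_days > days
    }
    if not expired:
        return "keep", expired
    if len(expired) == len(retention_by_namespace):
        return "drop", expired
    return "mixed", expired


# Hypothetical retention periods, in days.
retention = {"logs": 7, "billing": 31, "audit": 365}
today = date(2025, 1, 31)

recent = classify_partition(today - timedelta(days=3), today, retention)
middle = classify_partition(today - timedelta(days=10), today, retention)
ancient = classify_partition(today - timedelta(days=400), today, retention)
print(recent, middle, ancient)
```

Under these assumed values, every partition between 8 and 365 days old is "mixed": it can only be cleaned up by removing rows inside the partition, which is the row-level work the merge-tree engine handles poorly.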
9. Three Patches to Restore Performance
The engineering team wrote three targeted patches to address the hidden bottleneck. Patch 1 optimized the merge logic so that selective deletions (using ALTER TABLE … DELETE) did not force full-range re-sorting of large partitions. Patch 2 introduced a new index structure that allowed the engine to skip entire blocks of rows from namespaces that were not targeted by the deletion. Patch 3 improved the concurrency of background merges to prevent them from starving query processing. Together, these patches cut the aggregation job runtime back to normal levels.
10. Lessons Learned for Large-Scale ClickHouse Deployments
This incident taught Cloudflare’s team several valuable lessons. First, always monitor background merge activity, not just query performance. Second, when implementing per-namespace retention, consider the impact on the merge-tree’s sorting structure early in the design. Third, internal engine behaviors that are invisible in standard metrics can become critical bottlenecks at petabyte scale. The three patches not only fixed the immediate problem but also informed best practices for other teams using similar setups. For anyone running ClickHouse at scale, these insights are essential to avoid hidden slowdowns.
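As an illustration of the first lesson, the following Python sketch flags long-running background merges. It assumes the rows were already fetched by some client from ClickHouse's `system.merges` table, using its `table`, `elapsed`, and `is_mutation` columns; the 600-second threshold, the function name, and the sample table names are arbitrary choices for this example, not values from the incident.

```python
def slow_merges(merge_rows, max_elapsed_seconds=600.0):
    """Return (table, elapsed, is_mutation) for merges over the threshold.

    `merge_rows` is a list of dicts shaped like rows from
    `SELECT table, elapsed, is_mutation FROM system.merges`.
    Results are ordered from longest-running to shortest.
    """
    flagged = [
        (row["table"], row["elapsed"], bool(row["is_mutation"]))
        for row in merge_rows
        if row["elapsed"] > max_elapsed_seconds
    ]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical snapshot of in-flight merges.
snapshot = [
    {"table": "ready_analytics", "elapsed": 1250.0, "is_mutation": 1},
    {"table": "ready_analytics", "elapsed": 42.5, "is_mutation": 0},
    {"table": "billing_daily", "elapsed": 900.0, "is_mutation": 0},
]

flagged = slow_merges(snapshot)
for table, elapsed, is_mutation in flagged:
    kind = "mutation" if is_mutation else "merge"
    print(f"{table}: {kind} running for {elapsed:.0f}s")
```

Running a check like this on a schedule, alongside query latency dashboards, would surface merge pressure that per-query metrics do not show.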
In conclusion, Cloudflare’s billing pipeline slowdown showed that a database can look healthy on every standard metric while a bottleneck builds up in its background work. By tracing the interaction between retention policies, primary key design, and merge mechanics, the team restored performance and made the platform better suited to per-namespace retention. The three patches offer a reference point for similar problems in other high-volume analytics workloads.