As a Postdoctoral Research Associate working in computing for high‑energy physics (HEP), I spend a large amount of time thinking about reducing RAM usage, because it’s the most precious resource we usually have on our computing clusters. Decreasing per‑job memory usage by 20% means roughly 25% more jobs fit in the same RAM, and that’s powerful when analyzing billions of detector events.
In a recent 20‑minute RSE group talk, I shared two threads from our work: first, how memory fragmentation in Python‑driven HEP workflows can look like a memory leak; and second, how delivering only the arrays we need—when we need them—lets us analyze more events without buying more RAM.
1) Memory fragmentation isn’t a leak—but it sure looks like one
When a program needs memory, it asks the OS via syscalls like sbrk/mmap. Doing that constantly is expensive, so allocators grab larger chunks (“arenas”) and parcel them out. Over time, those arenas can get fragmented. There’s internal fragmentation (you asked for 15 B, you get 16 B, because allocators round requests up to alignment boundaries, typically multiples of 8 or 16) and external fragmentation (freed holes can’t be reused by differently sized objects). The oversimplified cartoon version: allocate 4 kB blocks A/B/C in a 12 kB arena, free B (leaving a 4 kB hole), then allocate an 8 kB D. D doesn’t fit the hole, so the allocator reserves a second arena: 24 kB in use even though the live data would fit in 16 kB.
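That cartoon can be played out in a few lines of Python. This is a toy first-fit allocator with fixed-size arenas and no compaction; the arena and block sizes are the illustrative ones from the text, not what any real allocator uses:

```python
# Toy first-fit allocator: fixed-size arenas, holes are never compacted.
ARENA_KB = 12

class ToyAllocator:
    def __init__(self):
        self.arenas = []  # each arena: list of (tag, size_kb); tag "free" = hole

    def alloc(self, name, size_kb):
        for arena in self.arenas:
            for i, (tag, hole_kb) in enumerate(arena):
                if tag == "free" and hole_kb >= size_kb:
                    arena[i] = (name, size_kb)  # fill the hole
                    if hole_kb > size_kb:
                        arena.insert(i + 1, ("free", hole_kb - size_kb))
                    return
        # No existing hole fits: reserve a whole new arena from the OS.
        arena = [(name, size_kb)]
        if size_kb < ARENA_KB:
            arena.append(("free", ARENA_KB - size_kb))
        self.arenas.append(arena)

    def free(self, name):
        for arena in self.arenas:
            for i, (tag, size_kb) in enumerate(arena):
                if tag == name:
                    arena[i] = ("free", size_kb)  # hole remains; no compaction

    def live_kb(self):
        return sum(s for a in self.arenas for tag, s in a if tag != "free")

    def reserved_kb(self):
        return ARENA_KB * len(self.arenas)

heap = ToyAllocator()
for name in "ABC":
    heap.alloc(name, 4)    # the first 12 kB arena is now full
heap.free("B")             # 4 kB hole between A and C
heap.alloc("D", 8)         # doesn't fit the hole -> second arena
print(heap.live_kb(), heap.reserved_kb())  # 16 24
```

Only 16 kB is live, but 24 kB is reserved from the OS, and RSS reports the latter.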
Our trigger was reading ROOT TTrees with Uproot: lots of differently sized NumPy arrays interleaved with many tiny Python objects. Iterating a CMS NanoAOD open‑data file with tree.iterate(step_size="20 MB") on macOS showed a steadily increasing resident set size (RSS). That graph screams “leak,” but the culprit is fragmentation. Swapping the default allocator for mimalloc stabilized RSS and cut peak memory roughly in half on macOS. Prior observations of similar symptoms are linked below: an Uproot 5 discussion and a coffea issue.
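Spotting this starts with logging RSS per chunk. Here is a minimal sketch using only the standard library; iterate_with_rss is a hypothetical helper, and in practice you would pass it tree.iterate(step_size="20 MB") instead of the toy generator shown. Note that ru_maxrss is the peak RSS (a high‑water mark), reported in kilobytes on Linux and bytes on macOS:

```python
import resource
import sys

def iterate_with_rss(chunks):
    """Yield (chunk, peak_rss_mb): hypothetical helper to log memory per chunk.

    In practice, `chunks` would be e.g. uproot's tree.iterate(step_size="20 MB").
    """
    scale = 1 if sys.platform == "darwin" else 1024  # ru_maxrss units differ by OS
    for chunk in chunks:
        peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * scale
        yield chunk, peak / 1e6  # peak RSS in MB; it never decreases

# Stand-in workload: each "chunk" is a few MB of freshly written bytes.
for chunk, peak_mb in iterate_with_rss(b"x" * 5_000_000 for _ in range(3)):
    print(f"chunk of {len(chunk)} bytes, peak RSS so far: {peak_mb:.0f} MB")
```

If the per‑chunk curve keeps climbing while your live data stays constant, fragmentation (or a real leak) is worth investigating.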
The twist: this behavior is environment‑dependent. On Linux (e.g. on the coffea‑casa computing infrastructure), fragmentation was smaller to begin with, and mimalloc actually made things slightly worse. Add Python’s garbage collector to the mix—collection timing is stochastic—and you have a debugging problem that’s hard to understand and easy to misdiagnose as a true leak. A forward‑looking note: the new RNTuple format is expected to behave more robustly, and it can already be read with Uproot.
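Trying mimalloc doesn’t require rebuilding Python: the dynamic linker can preload it into an unmodified process. A sketch, assuming mimalloc is already installed; both the library paths and analysis.py are placeholders for your own setup:

```shell
# Linux: preload mimalloc into an unmodified Python job (library path is an example).
LD_PRELOAD=/usr/lib/libmimalloc.so python analysis.py

# macOS: the equivalent mechanism is DYLD_INSERT_LIBRARIES (path is an example).
DYLD_INSERT_LIBRARIES=/usr/local/lib/libmimalloc.dylib python analysis.py
```

Because the effect is platform‑dependent, it’s worth measuring with and without the preload on the machine where the jobs will actually run.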
2) Optimized array delivery: only keep in RAM what an Awkward operation needs
HEP event trees are wide: more than 1000 columns is normal, while a typical analysis touches only around 50. Last year I worked on VirtualArrays in Awkward Array—lazy columns that load only when touched—which already moved us from O(10) GB down to O(1) GB for many analyses. But we can go further by controlling not just which arrays are available, but when they are resident in memory for a given operation.
We implemented three array delivery options, each exposing a Python MutableMapping interface so analyses can swap strategies without refactoring:
BufferCache — basically a dictionary: arrays materialize lazily and stick around. Think of this as “plain” VirtualArrays.
CompressedBufferCache — arrays are always stored compressed in RAM (blosc + ZSTD, clevel=1, bit‑shuffling) and decompressed on access.
HDF5BufferCache — arrays live on disk in an HDF5 file close to the CPU; __getitem__ reads them on demand, and after use they are dropped from RAM again (they remain on disk). On my local Apple SSD I measured ~10.73 GB/s reads and ~2.67 GB/s writes for this benchmark.
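To make the middle option concrete, here is a minimal MutableMapping that keeps NumPy buffers compressed in RAM. zlib at level 1 stands in for the blosc + ZSTD codec we actually use, and the class name mirrors the one above, but this is a sketch of the interface, not the real implementation:

```python
import zlib
from collections.abc import MutableMapping

import numpy as np

class CompressedBufferCache(MutableMapping):
    """Sketch: store arrays compressed in RAM, decompress transparently on access."""

    def __init__(self, level=1):  # level 1: cheap compression, like clevel=1 in blosc
        self._level = level
        self._store = {}  # key -> (compressed bytes, dtype, shape)

    def __setitem__(self, key, array):
        array = np.ascontiguousarray(array)
        self._store[key] = (
            zlib.compress(array.tobytes(), self._level),
            array.dtype,
            array.shape,
        )

    def __getitem__(self, key):
        blob, dtype, shape = self._store[key]
        # Returns a read-only view over the freshly decompressed buffer.
        return np.frombuffer(zlib.decompress(blob), dtype=dtype).reshape(shape)

    def __delitem__(self, key):
        del self._store[key]

    def __iter__(self):
        return iter(self._store)

    def __len__(self):
        return len(self._store)

cache = CompressedBufferCache()
cache["pt"] = np.zeros(1_000_000)   # highly compressible, like many HEP columns
blob, _, _ = cache._store["pt"]     # peek at the compressed payload
print(f"raw {cache['pt'].nbytes / 1e6:.1f} MB -> stored {len(blob) / 1e6:.3f} MB")
```

Because all three strategies present the same mapping interface, an analysis can swap one for another by changing a single constructor call.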
For a realistic micro‑analysis on CMS public data (~1M events), with mimalloc enabled to neutralize the fragmentation discussed earlier in this post, we saw:
- BufferCache: peak RSS 1.51 GB, runtime 13.48 s.
- CompressedBufferCache: peak RSS 1.01 GB, runtime 14.10 s.
- HDF5BufferCache: peak RSS 0.73 GB, runtime 13.34 s.
That last result is the punchline: by keeping arrays out of RAM until the moment of use, peak memory fell to ~730 MB with essentially no runtime penalty relative to the in‑memory baseline. Since SSDs are dramatically cheaper than DRAM in our clusters, this is a practical way to “buy” effective memory: more events fit in a single job, which means bigger Awkward Arrays and thus greater SIMD benefit from our kernels.
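The on‑disk strategy behind that number can be sketched with only NumPy and the standard library; .npy files in a temporary directory stand in for the single HDF5 file, and DiskBufferCache is a hypothetical name for this illustration, not the real HDF5BufferCache:

```python
import tempfile
from collections.abc import MutableMapping
from pathlib import Path

import numpy as np

class DiskBufferCache(MutableMapping):
    """Sketch: keep arrays on disk; load on access, so RAM holds only what's in use.

    A real implementation would use one HDF5 file and clean up after itself.
    """

    def __init__(self, directory=None):
        self._dir = Path(directory or tempfile.mkdtemp(prefix="bufcache-"))
        self._keys = set()

    def _path(self, key):  # assumes filesystem-safe keys, fine for a sketch
        return self._dir / f"{key}.npy"

    def __setitem__(self, key, array):
        np.save(self._path(key), np.asarray(array))
        self._keys.add(key)

    def __getitem__(self, key):
        if key not in self._keys:
            raise KeyError(key)
        return np.load(self._path(key))  # materialized in RAM only on demand

    def __delitem__(self, key):
        self._path(key).unlink()
        self._keys.remove(key)

    def __iter__(self):
        return iter(self._keys)

    def __len__(self):
        return len(self._keys)

disk_cache = DiskBufferCache()
disk_cache["jet_pt"] = np.arange(1000.0)
chunk = disk_cache["jet_pt"]   # resident only while this operation uses it
print(chunk.sum())
```

Once `chunk` goes out of scope, its RAM can be reclaimed while the column stays safely on disk for the next operation.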
Put together with the earlier story, the picture looks like this:
- Eager materialization of all columns: >10 GB.
- Lazy VirtualArrays (status quo): ~1.5 GB (≈10× reduction).
- Lazy on‑disk VirtualArrays (e.g. with HDF5): ~0.73 GB (another ~2×).
What I’m taking forward
Two lessons stuck with me. First, climbing RSS isn’t necessarily a “leak”: inspect allocator behavior and OS differences, try an alternative allocator like mimalloc, and verify on the platform you’ll run at scale. Second, lazy data access, touching data only when an operation needs it, is a powerful pattern. By aligning array lifetimes with actual operations, and by being able to store arrays compressed in memory or on disk (e.g. SSD), we can process more collision events in parallel without changing analysis‑code semantics. If you’re working with Uproot, Awkward Array, or coffea and want to try the cache strategies, I’m happy to compare notes and share code.
Links & attributions
- Uproot TTree discussion on fragmentation: https://github.com/scikit-hep/uproot5/discussions/1535
- Historical coffea issue showing similar symptoms: https://github.com/scikit-hep/coffea/issues/249
- Awkward‑Array: https://github.com/scikit-hep/awkward/tree/main/src/awkward
- Uproot: https://github.com/scikit-hep/uproot5
- mimalloc: https://github.com/microsoft/mimalloc
- RNTuple: https://github.com/root-project/root/blob/master/tree/ntuple/doc/BinaryFormatSpecification.md