SPECFEM++ — A Modern, Unified Rewrite of a Seismology Workhorse

Posted on March 31, 2026 by Lucas Sawade

SPECFEM has been a cornerstone of computational seismology for over two decades. The Fortran-based software suite is used for seismic wave propagation simulations and adjoint tomography, accumulating thousands of citations per year across a broad user base. But the original codebase grew across three separate repositories (2D, 3D, and 3D Globe) with substantial redundancy and architecture-specific GPU kernels, meaning new features rarely made it to all variants.

SPECFEM++ is the solution: a ground-up rewrite in C++ using Kokkos for performance portability across CPUs, NVIDIA GPUs (CUDA), and AMD GPUs (HIP) — all from a single codebase.

Expanded Physics

One of the primary goals of SPECFEM++ is to make adding new physics straightforward. The solver now supports elastic P-SV and SH waves, anisotropic media, poroelastic domains, and Cosserat media (which adds rotational degrees of freedom alongside displacement). Most significantly, 3D elastic isotropic simulation is now fully supported, with end-to-end integration tests confirming sample-by-sample agreement with SPECFEM3D Cartesian.

For problems with large contrasts in physical properties, non-conforming mesh support via a Discontinuous Galerkin approach allows different element sizes on either side of an interface. This enables the code to choose larger timesteps and drastically improve performance.
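The timestep argument can be sketched with a back-of-the-envelope calculation. The snippet below is not SPECFEM++ code. It assumes the usual explicit-scheme stability bound dt <= C * h / v and sizes elements from the shortest wavelength; the wave speeds, frequency, Courant number, and function names are all illustrative assumptions.

```python
# Toy estimate of the stable timestep for a two-material model.
# All numbers and names are illustrative assumptions, not SPECFEM++ values.

COURANT = 0.5        # assumed stability constant C in dt <= C * h / v
MAX_FREQ_HZ = 10.0   # assumed highest frequency to resolve


def resolving_element_size(wave_speed):
    """Largest element size that still fits one shortest wavelength."""
    return wave_speed / MAX_FREQ_HZ


def stable_dt(regions):
    """Global timestep: the most restrictive C * h / v over all regions.

    `regions` is a list of (element_size_m, wave_speed_m_per_s) pairs.
    """
    return min(COURANT * h / v for h, v in regions)


slow, fast = 1500.0, 6000.0   # e.g. water-like and hard-rock-like speeds (m/s)
h_slow = resolving_element_size(slow)   # 150 m
h_fast = resolving_element_size(fast)   # 600 m

# Conforming mesh: the small elements needed in the slow medium are carried
# across the interface into the fast medium.
dt_conforming = stable_dt([(h_slow, slow), (h_slow, fast)])

# Non-conforming mesh: each side keeps the element size it actually needs.
dt_nonconforming = stable_dt([(h_slow, slow), (h_fast, fast)])

speedup = dt_nonconforming / dt_conforming
print(f"conforming dt     = {dt_conforming:.4f} s")
print(f"non-conforming dt = {dt_nonconforming:.4f} s ({speedup:.1f}x larger)")
```

Under these assumptions the allowed timestep grows by the ratio of the two wave speeds. Real meshes and stability limits are more involved, so treat this only as an intuition for where the gain comes from.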

Associated Non-conforming simulation.
The conforming simulation is part of Pipatprathanporn et al., 2024 (DOI). Credit: Kentaro G. Hanson (G3, PACM).

Seismograms for conforming and nonconforming DG simulation as well as associated error. Credit: Kentaro G. Hanson (G3, PACM).

Performance on Par with — and Beyond — SPECFEM2D

After memory layout optimizations, SPECFEM++ now matches SPECFEM2D on CPU for elastic and acoustic problems. On GPU, it pulls ahead: for large elastic-acoustic domains (10M+ spectral elements), SPECFEM++ achieves up to 2× speedup over SPECFEM2D.

Performance comparison between SPECFEM2D Fortran and SPECFEM++ on CPU.

Real-World Benchmarks: Marmousi and Cosserat

Two showcases stand out. The Marmousi model cookbook runs wave propagation on a high-resolution CUBIT mesh derived from a classic seismic benchmark, and is a great stress test of the GPU backend. The Cosserat media simulation illustrates the rotational wavefield alongside displacement — something not possible in the original SPECFEM.

Wave propagation due to an isotropic, explosive source in the complex Marmousi model. A water (acoustic) layer sits on top of a complex solid layer, with Stacey boundary conditions.

Wave propagation in a homogeneous Cosserat medium with Stacey boundary conditions on all sides. Left: Magnitude of the displacement component. Right: Rotational (spin) component. Credit: Max Chien (Junior, PACM).

Tooling and Documentation

Nightly benchmarks run automatically via Jenkins on Intel Gold CPUs and NVIDIA H100 GPUs, with results on a live review dashboard. The API documentation now includes the actual implemented equations, and cookbooks cover everything from basic homogeneous media to non-conforming fluid-solid interfaces.

What’s Next

Active development is focused on MPI support, attenuation, and acoustic-elastic coupling — the remaining pieces needed for large-scale parallel 3D production runs.

SPECFEM++ is a community project with monthly developer meetings (first Wednesday of each month, 12:00 PM Eastern; sign up here). Documentation is at specfem2d-kokkos.readthedocs.io and the code is on GitHub.

Acknowledgements: The initial draft for this post was generated by Anthropic’s Claude Sonnet 4.6 using presentation slides and GitHub release notes.

Taming memory—and making room for more physics

Posted on March 26, 2026 by Peter Fackeldey

As a Postdoctoral Research Associate working in computing for high‑energy physics (HEP), I spend a large amount of time thinking about reducing RAM usage, because it’s usually the most precious resource on our computing clusters. Decreasing memory usage per job by 20% effectively means we can fit about 25% more jobs in the same RAM (1/0.8 = 1.25), and that’s powerful when analyzing billions of detector events.

In a recent 20‑minute RSE group talk, I shared two threads from our work: first, how memory fragmentation in Python‑driven HEP workflows can look like a memory leak; and second, how delivering only the arrays we need—when we need them—lets us analyze more events without buying more RAM.

1) Memory fragmentation isn’t a leak—but it sure looks like one

When a program needs memory, it asks the OS via syscalls like sbrk/mmap. Doing that constantly is expensive, so allocators grab larger chunks (“arenas”) and parcel them out. Over time, those arenas can get fragmented. There’s internal fragmentation (you asked for 15 B, you get 16 B—typically multiples of 4) and external fragmentation (freed holes can’t be reused by differently sized objects). The oversimplified cartoon version: allocate A/B/C in a 12 kB arena, free B (you “waste” 4 kB), then allocate a larger D and you end up needing a second arena—24 kB in use even though your live data could fit in 16 kB.
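The cartoon above can be replayed with a toy model. The sketch below is not how a real allocator such as mimalloc works. It assumes fixed 12 kB arenas, 1 kB granularity, and first-fit placement, and the class and method names are made up for illustration.

```python
class ToyAllocator:
    """First-fit allocator over fixed-size arenas (1 slot = 1 kB)."""

    def __init__(self, arena_kb=12):
        self.arena_kb = arena_kb
        self.arenas = []  # each arena is a list of slots: owner name or None

    def _find_hole(self, arena, size_kb):
        """Return the start index of the first free run of size_kb slots."""
        run = 0
        for index, owner in enumerate(arena):
            run = run + 1 if owner is None else 0
            if run == size_kb:
                return index - size_kb + 1
        return None

    def malloc(self, name, size_kb):
        if size_kb > self.arena_kb:
            raise ValueError("request larger than one arena")
        # Try to reuse a hole in an existing arena first.
        for arena in self.arenas:
            start = self._find_hole(arena, size_kb)
            if start is not None:
                arena[start:start + size_kb] = [name] * size_kb
                return
        # No contiguous hole is big enough: reserve a fresh arena.
        arena = [None] * self.arena_kb
        arena[:size_kb] = [name] * size_kb
        self.arenas.append(arena)

    def free(self, name):
        for arena in self.arenas:
            for index, owner in enumerate(arena):
                if owner == name:
                    arena[index] = None

    def reserved_kb(self):
        return len(self.arenas) * self.arena_kb

    def live_kb(self):
        return sum(owner is not None for arena in self.arenas for owner in arena)


allocator = ToyAllocator()
for name in ("A", "B", "C"):
    allocator.malloc(name, 4)   # fills the first 12 kB arena
allocator.free("B")             # leaves a 4 kB hole between A and C
allocator.malloc("D", 8)        # 8 kB does not fit the hole

print(f"live: {allocator.live_kb()} kB, reserved: {allocator.reserved_kb()} kB")
```

Nothing leaked here: every block is still tracked and freeable. The gap between live and reserved memory is what shows up as a growing RSS.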

Our trigger was Uproot reading of ROOT TTrees—lots of differently sized NumPy arrays interleaved with many tiny Python objects. Iterating a CMS NanoAOD open‑data file with tree.iterate(step_size="20 MB") on macOS showed a steadily increasing resident set size (RSS). That graph screams “leak,” but the culprit is fragmentation. Swapping the default allocator for mimalloc stabilized RSS and cut peak memory roughly in half on macOS. Links to prior observations are here: Uproot 5 discussion and a coffea issue for similar symptoms.

The twist: this behavior is environment‑dependent. On Linux (e.g. using the coffea‑casa computing infrastructure), fragmentation was smaller to begin with and mimalloc actually made things slightly worse. Add Python’s garbage collector to the mix—collection timing is stochastic—and you have a debugging problem that’s hard to understand and easy to misdiagnose as a true leak. A forward‑looking note: the new RNTuple format is expected to behave more robustly, and is ready to be used with Uproot.

2) Optimized array delivery: only keep in RAM what an Awkward operation needs

HEP event trees are wide: >1000 columns is normal, while a typical analysis uses on the order of ~50. Last year I worked on VirtualArrays in Awkward Array—lazy columns that load only when touched—which already moved us from O(10) GB down to O(1) GB for many analyses. But we can go further by controlling not just which arrays are available, but when they’re resident in memory for a given operation.

We implemented three array delivery options, each exposing a Python MutableMapping interface so analyses can swap strategies without refactoring:

BufferCache — basically a dictionary: arrays materialize lazily and stick around. Think of this as “plain” VirtualArrays.

CompressedBufferCache — arrays are always stored compressed in RAM (blosc + ZSTD, clevel=1, bit‑shuffling) and decompressed on access.

HDF5BufferCache — arrays live on disk in an HDF5 file near the CPU; __getitem__ reads them on demand, after usage they’re dropped again from RAM (they’re still on disk). On my local Apple SSD I measured ~10.73 GB/s reads and ~2.67 GB/s writes for this benchmark.
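To illustrate why a shared MutableMapping interface makes the strategies interchangeable, here is a minimal sketch using only the Python standard library. It is not the implementation described in this post: zlib stands in for blosc + ZSTD, plain files in a temporary directory stand in for HDF5, raw bytes stand in for arrays, and all class and function names are hypothetical.

```python
import itertools
import os
import tempfile
import zlib
from collections.abc import MutableMapping


class CompressedCache(MutableMapping):
    """Keeps every buffer compressed in RAM; decompresses on each access."""

    def __init__(self, level=1):
        self._level = level
        self._store = {}

    def __setitem__(self, key, buffer):
        self._store[key] = zlib.compress(buffer, self._level)

    def __getitem__(self, key):
        return zlib.decompress(self._store[key])

    def __delitem__(self, key):
        del self._store[key]

    def __iter__(self):
        return iter(self._store)

    def __len__(self):
        return len(self._store)

    def resident_nbytes(self):
        """Bytes actually held in RAM (compressed size)."""
        return sum(len(blob) for blob in self._store.values())


class DiskCache(MutableMapping):
    """Keeps buffers in files; only the file paths stay in RAM."""

    def __init__(self, directory):
        self._directory = directory
        self._paths = {}
        self._ids = itertools.count()

    def __setitem__(self, key, buffer):
        path = self._paths.get(key)
        if path is None:
            path = os.path.join(self._directory, f"buffer_{next(self._ids)}.bin")
            self._paths[key] = path
        with open(path, "wb") as handle:
            handle.write(buffer)

    def __getitem__(self, key):
        with open(self._paths[key], "rb") as handle:
            return handle.read()

    def __delitem__(self, key):
        os.remove(self._paths.pop(key))

    def __iter__(self):
        return iter(self._paths)

    def __len__(self):
        return len(self._paths)


def total_length(cache, keys):
    """Stand-in for analysis code: it only relies on the mapping interface."""
    return sum(len(cache[key]) for key in keys)


# Two fake "columns" of raw bytes.
columns = {"pt": bytes(80_000), "eta": b"\x01\x02" * 20_000}

compressed = CompressedCache()
compressed.update(columns)
compressed_total = total_length(compressed, columns)

with tempfile.TemporaryDirectory() as scratch:
    on_disk = DiskCache(scratch)
    on_disk.update(columns)
    disk_total = total_length(on_disk, columns)
```

Because `total_length` only uses `cache[key]`, swapping one strategy for another needs no change on the analysis side, which is the point of exposing a common mapping interface.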

For a realistic micro‑analysis on CMS public data (~1M events), with mimalloc enabled to neutralize the fragmentation discussed earlier in this post, we saw:

BufferCache: peak RSS 1.51 GB, runtime 13.48 s.
CompressedBufferCache: peak RSS 1.01 GB, runtime 14.10 s.
HDF5BufferCache: peak RSS 0.73 GB, runtime 13.34 s.

That last result is the punchline: by keeping arrays out of RAM until the moment of use, peak memory fell to ~730 MB with essentially no runtime penalty relative to the in‑memory baseline. Since SSDs are dramatically cheaper than DRAM in our clusters, this is a practical way to “buy” effective memory: more events fit in a single job, which means bigger Awkward Arrays and thus greater SIMD benefits from our kernels.
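The trade-off can be read off the three measurements with a little arithmetic. The snippet below only re-expresses the numbers reported above relative to the BufferCache baseline; the variable names are my own.

```python
# Peak RSS (GB) and runtime (s) as reported above.
results = {
    "BufferCache": (1.51, 13.48),
    "CompressedBufferCache": (1.01, 14.10),
    "HDF5BufferCache": (0.73, 13.34),
}

base_rss, base_time = results["BufferCache"]

relative = {}
for name, (rss, seconds) in results.items():
    memory_saving = 1.0 - rss / base_rss       # fraction of peak RSS saved
    runtime_change = seconds / base_time - 1.0  # positive means slower
    relative[name] = (memory_saving, runtime_change)
    print(f"{name:>22}: {memory_saving:6.1%} less peak RSS, "
          f"{runtime_change:+6.1%} runtime")
```

In other words, in-memory compression saves about a third of the peak memory for roughly 5% more runtime, while the on-disk variant saves about half with a runtime within about 1% of the baseline.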

Put together with the earlier story, the picture looks like this:

Eager materialization of all columns: >10 GB.
Lazy VirtualArrays (status quo): ~1.5 GB (≈10× reduction).
Lazy on‑disk VirtualArrays (e.g. with HDF5): ~0.73 GB (another ~2×).

What I’m taking forward

Two lessons stuck with me. First, a climbing RSS isn’t necessarily a “leak”: inspect allocator behavior and OS differences, try an alternative allocator like mimalloc, and verify on the platform you’ll run at scale. Second, lazy data access, keeping data in memory only when it’s needed, is a powerful pattern. By aligning array lifetimes with actual operations, and by being able to store arrays compressed in memory or on disk (e.g. SSD), we can process more collision events in parallel without changing analysis code semantics. If you’re working with Uproot, Awkward Array, or coffea and want to try the cache strategies, I’m happy to compare notes and share code.

Links & attributions

Uproot TTree discussion on fragmentation: https://github.com/scikit-hep/uproot5/discussions/1535
Historical coffea issue showing similar symptoms: https://github.com/scikit-hep/coffea/issues/249
Awkward‑Array: https://github.com/scikit-hep/awkward/tree/main/src/awkward
Uproot: https://github.com/scikit-hep/uproot5
mimalloc: https://github.com/microsoft/mimalloc
RNTuple: https://github.com/root-project/root/blob/master/tree/ntuple/doc/BinaryFormatSpecification.md

Launching an Interactive Hydrology Dashboard on HPC with Open OnDemand

Posted on February 11, 2026 by Amy Defnet

As a Research Software Engineer, I work with Professor Reed Maxwell’s group, helping scientists better understand how water moves through the environment. Our lab studies computational hydrology, which means we use mathematical models to simulate how groundwater and surface water flow across landscapes. These simulations help researchers explore questions related to current groundwater availability, future scenario predictions, and long‑term water sustainability.

Recently our team, together with collaborators at the University of Arizona, has been building a continental‑scale model of water movement across the entire United States using the integrated groundwater model ParFlow. A model this large produces quite a bit of data, and one of my goals was to make it easier for the research team to explore and interpret the results. To help with this, I built a small interactive dashboard using Dash, a Python framework that lets you create web apps with interactive plots and maps. The dashboard lets us directly compare model results with real‑world observations: for example, streamflow measurements from monitoring stations. This helps researchers quickly see where the model is performing well and where it may need adjustment.

While getting the initial dashboard set up had its quirks, the more challenging step was figuring out how to share it with all of the collaborators.

Our group works on a high‑performance computing (HPC) cluster, because our simulations are too large to run on a laptop. But HPC environments aren’t naturally designed for running web applications, especially not ones that multiple people may want to launch on demand.

We needed a solution that would:

Let anyone on the team launch the dashboard without installing software
Scale up if we wanted to start visualizing the project’s larger datasets
Be friendly to both programmers and non‑programmers

That’s where Open OnDemand came in.

Open OnDemand (OOD) is a web platform that sits on top of an HPC cluster and gives users the ability to run jobs, move files, and launch applications, all from a browser. Instead of everyone needing to memorize Unix commands or write job scripts, they can launch the dashboard in just a couple of clicks.

What I didn’t know at first was that OOD supports two distinct ways of deploying custom applications. Choosing between them turned out to be a key technical decision.

Open OnDemand supports:

1. Passenger Applications

These behave like traditional web apps.

They run on the cluster’s login or “head” node.
They’re ideal for small, lightweight applications with modest computing needs.
They automatically stay online, which is convenient.

For our dashboard, this approach wasn’t ideal. Because our visualizations can involve large model outputs, running everything on the head node could slow things down or interfere with other users.

2. Interactive Applications

These work differently:

Each time a user launches the app, OOD submits a job to the cluster
That job reserves compute resources on one of the worker nodes
OOD then connects the user’s browser to that job so the app runs inside the HPC environment

This design has important advantages:

The dashboard can use as much memory or computing power as the user requests
Each user gets their own isolated instance
It keeps our options flexible in case we want to add more computational complexity to the dashboard in the future

For our use case, we decided the Interactive App approach gave us the most flexibility for future development.

In the final setup, running the dashboard looks like this:

A collaborator logs into our cluster’s Open OnDemand page
They click on the dashboard’s icon under “Interactive Apps”
They choose how many hours their session will be
They click “Launch”
Within a minute or two, a unique dashboard session opens right in their browser

This means hydrologists who may not write code every day can still explore large datasets and investigate model behavior, without ever touching the command line.

Setting this up taught me a lot about how Open OnDemand handles custom applications. To help others who might be trying something similar, I put together a small GitHub repository with several “Hello World” examples to show how both Passenger and Interactive apps can be developed, both with and without Dash.

You can find it here: https://github.com/amy-defnet/hello-world-ood

Making scientific tools accessible is just as important as building them. By turning complex datasets into an interactive dashboard that anyone on the team can launch with a few clicks, we’ve made it easier for researchers to analyze results, spot issues, and ask new scientific questions. I’m excited to continue improving this tool and exploring new ways to help scientists use computational models more effectively.

Advent of Code 2025 in TypeScript

Posted on January 5, 2026 by Henry Schreiner

After two years of Advent of Code in Rust, I thought I’d try TypeScript this year. I’ve always wanted to improve repo-review’s webapp, and that requires knowledge of the packaging systems for JavaScript. I also used this as an opportunity to learn more AI tooling, mostly Copilot in VS Code and ChatGPT. I’d like to share my experience and thoughts! My code is at aoc2025 (and aoc2024, aoc2023).

Background

Since this is my experience with TypeScript, I should start with my background. I’m very familiar with Python, C++, Rust, and Ruby, and have some experience with a few other languages, including JavaScript. I largely interact with JavaScript because I am providing WebAssembly code in the browser, and that’s how you set up a web app running Python or whatever. That’s also why I’m interested in how to do that properly; repo-review’s webapp runs in live JSX, and is not properly bundled. Of course, it uses Python, which is several MB, so it isn’t that important to bundle it up, but I’d like to do better eventually.

I’m pretty heavily involved in Python packaging, having written the Scientific Python Library Development Guide and parts of packaging.python.org, and I maintain a variety of foundational tools, including packaging, build, scikit-build-core, pybind11, and nox, among others.

As part of the Princeton RSE program, I’ve seen a lot of my colleagues starting to use and talk about AI (and also develop it), and I’ve been wanting a chance to do more with it. Several of us also do the Advent of Code each year as a language learning exercise.

Getting started: packaging

In my opinion, “packaging” is the most important software skill. By packaging, I don’t just mean “how to ship code”, I mean how to develop code – the infrastructure you use to run tests, formatters and linters, manage dependencies, etc. I started by asking ChatGPT what was commonly done, and also did some searching, and settled on pnpm, a fast, modern alternative to npm (and yarn, etc). AI tends to be really bad at this, by the way; it doesn’t handle changes to the way things are done all that well. Once I had picked a tool, it produced somewhat useful suggestions on how to set it up, and I also had to consult the documentation a little (but not much). These are approximately the commands I ended up with at first:

brew install pnpm node
pnpm init
pnpm install --save-dev typescript tsx @types/node

Continue reading on ISciNumPy →

NeurIPS spotlight: SWE-smith: Scaling Data for Software Engineering Agents

Posted on December 17, 2025 by Tai Sakuma

The following work received a spotlight at NeurIPS:

John Yang, Kilian Lieret, Carlos E. Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, Diyi Yang

Despite recent progress in Language Models (LMs) for software engineering, collecting training data remains a significant pain point. Existing datasets are small, with at most 1,000s of training instances from 11 or fewer GitHub repositories. The procedures to curate such datasets are often complex, necessitating hundreds of hours of human labor; companion execution environments also take up several terabytes of storage, severely limiting their scalability and usability. To address this pain point, we introduce SWE-smith, a novel pipeline for generating software engineering training data at scale. Given any Python codebase, SWE-smith constructs a corresponding execution environment, then automatically synthesizes 100s to 1,000s of task instances that break existing test(s) in the codebase. Using SWE-smith, we create a dataset of 50k instances sourced from 128 GitHub repositories, an order of magnitude larger than all previous works. We train SWE-agent-LM-32B, achieving 40.2% Pass@1 resolve rate on the SWE-bench Verified benchmark, state of the art among open source models. We open source SWE-smith (collection procedure, task instances, trajectories, models) to lower the barrier of entry for research in LM systems for automated software engineering. All assets available at swesmith.com.

https://arxiv.org/abs/2504.21798

USRSE’25

Posted on November 21, 2025 by jbretheim

Last month many members of the RSE Group and other Princeton colleagues attended USRSE’25, the third annual conference from US-RSE. Hosted this year in Philadelphia, the conference theme was “Code, Practices, and People.” Princeton University (authors in bold) contributions included:

Accelerating Research: Strategies from the Field – Jen Rosiere Reynolds, Lance Parsons, Gail Rosenbaum, Joost Wagenaar, and Sarah Stevens (BoF)
Sustainable Models of RSE Support: The Prospects of Centralization in Institutional Research – Eric Manning, Lori Bougher, Colin Swaney, and Sangyoon Park (BoF)
Undate: computing with uncertain and partially-unknown dates – Rebecca S. Koeser (notebook)
Integrating ATR Software with University HPC Infrastructure: balancing diverse compute needs – Christine Roughan and Rebecca S. Koeser (paper)
INnovative Training Enabled by a Research Software Engineering Community of Trainers (INTERSECT) – Jeffrey Carver and Ian Cosden (poster)
Building Scientific Python Packages – Henry Schreiner (poster)
Community Code Review in the Digital Humanities – Julia Damerow, Rebecca S. Koeser, Jeffrey C. Carver, and Malte Vogl (poster)
Surveying the Digital Humanities Research Software Engineering Landscape – Rebecca S. Koeser and Julia Damerow (poster)
Ten Simple Rules for Catalyzing Collaborations and Building Bridges between Research Software Engineers and Software Engineering Researchers – Nasir Eisty, Jeffrey Carver, Johanna Cohoon, Ian Cosden, Carole Goble, and Samuel Grayson (poster)
Developing a Machine Learning-Augmented Solver for the Hydrologic Model ParFlow – Georgios Artavanis, Laura Condon, Andrew Bennett, and Reed Maxwell (talk)
Everything, All at Once, Yesterday: Creating Research Software with Humanities Faculty – Jeri Wieringa and Mary Naydan (talk)
What happened to Curt’s arm? – Curt Hillegas (RAM)
Agile Foundations for RSEs: Building an AI Assistant with Agile – Tisha Charles and David Luet (workshop)

Additionally, Princeton University Professor Reed Maxwell delivered the first keynote address on Accelerating Continental-Scale Groundwater Simulation With a Fusion of Machine Learning, Integrated Hydrologic Models and Community Platforms. His keynote highlighted three of his lab’s software projects centered around hydrologic data, simulations, and visualizations, and he noted contributions to those projects from five current and past RSE Group members (Vineet Bansal, Calla Chenault, Georgios Artavanis, Amy Defnet, and Bill Hasling). Professor Maxwell emphasized not only RSE contributions to software, but also that “RSEs enable digital education and outreach content.”

All in all, it was inspiring to convene with RSEs from all over the country. We already look forward to next year’s conference to be hosted in the San Francisco Bay Area!

arXiv: CodeClash: Benchmarking Goal-Oriented Software Engineering

Posted on November 2, 2025 by Tai Sakuma

John Yang, Kilian Lieret, Joyce Yang, Carlos E. Jimenez, Ofir Press, Ludwig Schmidt, Diyi Yang

Current benchmarks for coding evaluate language models (LMs) on concrete, well-specified tasks such as fixing specific bugs or writing targeted tests. However, human programmers do not spend all day incessantly addressing isolated tasks. Instead, real-world software development is grounded in the pursuit of high-level goals, like improving user retention or reducing costs. Evaluating whether LMs can also iteratively develop code to better accomplish open-ended objectives without any explicit guidance remains an open challenge. To address this, we introduce CodeClash, a benchmark where LMs compete in multi-round tournaments to build the best codebase for achieving a competitive objective. Each round proceeds in two phases: agents edit their code, then their codebases compete head-to-head in a code arena that determines winners based on objectives like score maximization, resource acquisition, or survival. Whether it’s writing notes, scrutinizing documentation, analyzing competition logs, or creating test suites, models must decide for themselves how to improve their codebases both absolutely and against their opponents. We run 1680 tournaments (25,200 rounds total) to evaluate 8 LMs across 6 arenas. Our results reveal that while models exhibit diverse development styles, they share fundamental limitations in strategic reasoning. Models also struggle with long-term codebase maintenance, as repositories become progressively messy and redundant. These limitations are stark: top models lose every round against expert human programmers. We open-source CodeClash to advance the study of autonomous, goal-oriented code development.

https://arxiv.org/abs/2511.00839

Published in NeurIPS 2025: What Makes a Reward Model a Good Teacher? An Optimization Perspective

Posted on September 27, 2025 by AbhishekB

By Noam Razin, Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D. Lee, Sanjeev Arora

The success of Reinforcement Learning from Human Feedback (RLHF) critically depends on the quality of the reward model. However, while this quality is primarily evaluated through accuracy, it remains unclear whether accuracy fully captures what makes a reward model an effective teacher. We address this question from an optimization perspective. First, we prove that regardless of how accurate a reward model is, if it induces low reward variance, then the RLHF objective suffers from a flat landscape. Consequently, even a perfectly accurate reward model can lead to extremely slow optimization, underperforming less accurate models that induce higher reward variance. We additionally show that a reward model that works well for one language model can induce low reward variance, and thus a flat objective landscape, for another. These results establish a fundamental limitation of evaluating reward models solely based on accuracy or independently of the language model they guide. Experiments using models of up to 8B parameters corroborate our theory, demonstrating the interplay between reward variance, accuracy, and reward maximization rate. Overall, our findings highlight that beyond accuracy, a reward model needs to induce sufficient variance for efficient optimization.

Read the paper: https://arxiv.org/abs/2503.15477

Discovery of a widespread chemical signalling pathway in the Bacteroidota

Posted on August 20, 2025 by AbhishekB

By Luis Linares-Otoya, Jaden D. Shirkey, Bhuwan Khatri Chhetri, Amira Mira, Abhishek Biswas, Samuel L. Neff, Maria V. Linares-Otoya, Ye Chen, Julio V. Campos-Florian, Mayar L. Ganoza-Yupanqui, Philip D. Jeffrey, Frederick M. Hughson & Mohamed S. Donia

Considerable advances have been made in characterizing bioactive molecules secreted by bacteria, yet the regulatory elements controlling their production remain largely understudied. Here we identify and characterize the N-acyl-cyclolysine (ACL) system—a cell-density-dependent chemical signalling system specific to and widespread in the phylum Bacteroidota (formerly Bacteroidetes)—and show that it regulates the expression of co-localized operons encoding diverse secreted molecules. Using genetic and biochemical analyses, combined with structural studies of a key biosynthetic enzyme, AclA, we elucidate the molecular structure of various ACLs and their complete biosynthetic pathway involving l-lysine acylation and ATP-dependent cyclization. Furthermore, we find that secreted ACLs are sensed by a dedicated transcription factor, AclR, resulting in the expression of associated operons and the autoinduction of ACL biosynthesis. Moreover, we show that different Bacteroidota strains produce structurally diverse ACLs and encode transcription factors with varying ligand specificities. Finally, we find that the acl circuit is widely distributed and transcribed in human gut and oral microbiome samples, with clear evidence for an active role in regulating associated operons under host colonization conditions. Understanding the function of the ACL system in different contexts has the potential to reveal details about the biology, ecology and chemistry of the Bacteroidota and how members of this phylum interact with their environments and hosts.

Read the paper: https://www.nature.com/articles/s41586-025-09418-9

Wrapping Up a Successful INTERSECT RSE Bootcamp at Princeton

Posted on July 21, 2025 by Ian Cosden

We’re thrilled to share that the third annual INTERSECT Research Software Engineering Bootcamp, held July 14-18, 2025 at Princeton University, concluded with great success! This immersive 4.5-day event brought together a vibrant cohort of intermediate research software developers from diverse domains, many of whom lack formal computer science training.

Funded by a National Science Foundation (NSF) grant and organized in collaboration with Dr. Jeff Carver from the University of Alabama, the bootcamp focused on core Research Software Engineering (RSE) practices. Led by volunteer instructors from the broader RSE community, participants engaged in hands-on sessions covering:

Software Design

Collaborative Git & Pull Requests

Code Review

Licensing & Documentation

Testing & CI/CD

Packaging & Distribution

The energy and enthusiasm throughout the week were inspiring. Attendees not only sharpened their technical skills but also built lasting connections across institutions and disciplines. We’re proud to support the growth of the RSE community and grateful to everyone who made this event possible.

More information on INTERSECT, including the open-source curriculum, is available here: https://intersect-training.org/.