Program

Day 1: 6 November

Time	Session
13:00–13:15	Opening & Welcome Organiser
Session 1: Performance analysis tools 1
13:15–13:45	An Instrumentation Plugin for Asynchronous Task Based Runtimes and Tools Kevin Huck, Olivier Aumage, Camille Coti, Thomas Herault, Allen Malony, Mohammad Alaul Haque Monil and Joseph Schuchart (slides) abstract Asynchronous task scheduling in parallel computing allows for non-blocking execution of code, messaging, and I/O, enabling multiple operations to run concurrently, improving responsiveness and efficiency of the system. Asynchronous Tasking Model (ATM) runtimes go by many different names and implementations, but are similar in that they are difficult for conventional performance tools to analyze. The program callstack does not relate to the runtime generated task dependency graph. In addition, standard function instrumentation approaches do not match the full set of task life cycle stages. Tools that do support ATM runtimes provide insight and guide optimization, but require code instrumentation. To aid in that goal, we developed extit{TaskStubs}, a generic plugin programming interface that enables coupling of performance tools and ATM runtimes while simplifying integration complexity. In this paper we discuss the motivation for TaskStubs, provide implementation details, and present integration cases with three different ATM runtimes and performance tools.
13:45–14:15	Trace-based, time-resolved analysis of MPI application performance using standard metrics Kingshuk Haldar (slides) abstract Detailed trace analysis of MPI applications is essential for performance engineering, but growing trace sizes and complex communication behaviour often render comprehensive visual inspection impractical. This work presents a trace-based calculation of time-resolved values of standard MPI performance metrics, load balance, serialisation, and transfer efficiency by discretising execution traces into fixed or adaptive time segments. Our implementation processes Paraver traces post-mortem, reconstructing critical execution paths and handling common event anomalies, such as clock inconsistencies and unmatched MPI events, to calculate the metrics in each segment robustly. The calculated per-window metric values expose transient performance bottlenecks that the time-aggregated metrics from existing tools may conceal. Evaluations on a synthetic benchmark and real-world applications (LaMEM and ls1-MarDyn) demonstrate how time-resolved metrics reveal localised performance bottlenecks obscured by global aggregates, offering a lightweight and scalable alternative even when trace visualisation is impractical.
14:15–14:45	HPC-Workspace: A tool for Data-Life-Cycle-Management Christoph Niethammer, Holger Berger, Martin Schroschk and Maria Hampel (slides) abstract High-performance computing (HPC) simulations generate massive amounts of data that must be stored on costly and limited high-performance I/O subsystems. The efficient managing of this data is therefore an essential task in HPC system operation today. In this paper, we present HPC-workspace, a tool for data life-cycle management. HPC-workspace enables users to allocate temporary scratch directories according to administrator-defined policies, including configurable lifetimes and automated cleanup. HPC-workspace is designed to integrate with any Unix-based operating system and complex HPC file system environments. We describe the architecture and implementation of HPC-workspace, highlighting the various challenges posed by diverse HPC infrastructures and our solutions to address them. HPC-workspace is designed with three main objectives in mind: intuitive usability for users, minimal maintenance effort for administrators, and robust security. We illustrate its effectiveness through real-world use cases, deployment experiences at HPC centers, and best practices for adoption.
14:45–15:15	Coffee break
Session 2: Program correctness and data management tools
15:15–15:45	Recent developments in Archer: New analyses and support for OpenMP 6.0 Joachim Jenke, Felix Tomski and Simon Schwitanski (slides) abstract The OpenMP data race detection tool Archer has a history of more than ten years now. Archer by itself does not perform any data race analysis. Instead, it acts as an adapter between the OpenMP tool interface (OMPT) and the data race analysis tool ThreadSanitizer. From the beginning, Archer was designed as a general-purpose data race detection tool, supporting all available OpenMP features as long as the code executes on the CPU. Whenever the OpenMP standard adds new features, like taskwait nowait with depend clause, Archer is one of the first tools adding support for such features. OpenMP 6.0 introduced transparent task dependencies, which allow specifying dependencies between tasks at different nesting levels. As part of this paper, we evaluate which changes are necessary for Archer to support this new type of dependence. Furthermore, we introduce new correctness analyses for OpenMP applications into Archer that go beyond the data race detection currently supported. An example is the misuse of locks, when a different thread unlocks the lock than the thread that locked it. Having Archer available as part of most LLVM-based vendor compiler suites demonstrates the success of developing Archer in the LLVM project. Therefore, all presented changes are available as pull requests towards the LLVM project or are already merged.
15:45–16:15	NCCLSanitizer: Runtime Correctness Checking Stream-based Communication in NCCL Programs Felix Tomski, Joachim Jenke and Simon Schwitanski (slides) abstract GPU-accelerated HPC systems increasingly rely on direct GPU-to-GPU communication. Collective communication libraries for GPU-to-GPU communication, such as NCCL, RCCL, and oneCCL, have gained traction due to their integration into large-scale machine learning frameworks. These libraries support offloading communication to GPU streams, leading to reduced CPU overhead by decoupling communication from the CPU and allowing for more fine-grained control over communication and computation ordering on the GPU. However, stream-based communication semantics also introduce new challenges for correctness checking. The asynchronous nature and weak ordering of GPU stream execution can lead to concurrency bugs between CPU and GPU or within the GPU itself between different streams. In particular, data races and deadlocks may emerge that remain undetected by existing correctness-checking tools that are unaware of stream semantics and assume CPU-driven communication. We present NCCLSanitizer, a runtime correctness-checking tool specifically targeting NCCL programs with stream-based communication. NCCLSanitizer performs on-the-fly detection of data races and deadlocks in GPU communication, accounting for the concurrency semantics introduced by streams. It extends the MPI correctness-checking tool MUST to support hybrid MPI+NCCL programs, and further leverages CuSan, a compiler-assisted data race detector for MPI+CUDA applications, for its data race detection. By explicitly modeling stream-based execution, NCCLSanitizer uncovers correctness issues that remain invisible to existing tools.
Social event
18:30	Dinner: Taverna Elia abstract Taverna Elia Pfaffenwaldring 62 70569 Stuttgart

Day 2: 7 November

Time	Session
Session 3: Performance analysis tools 2
09:00–09:30	A Case Study in Implementing rocprofiler-sdk Support in Score-P Ammar Elwazir, William R. Williams, Jonathan Madsen and Bert Wesarg (slides) abstract We present our experiences in co-designing a new generic accelerator offloading event model and implementing it based on the new ROCProfiler-SDK tools interface in the Score-P measurement system. We discuss how we represent the host-device correlation and aspects of the AMD GPU hardware and software stacks in performance data, and how the new event model and interface has enabled more functionality in the measurement system and, subsequently, in the analysis tools. We present the design principles behind the ROCProfiler-SDK tools interface and how they interact with Score-P. We describe how prototyping a Score-P adapter against early implementa- tions of the tools interface led to additional features in the ROCProfiler-SDK interface, such as a stream information service, that will be helpful to other tools as well. We verify that the overhead of our measurements is not significantly affected by using the new tools interface. We then consider a case study where we analyze the Quicksilver benchmark at small scale on Frontier using both old and new versions of Score-P, demonstrating the utility of these improvements.
09:30–10:00	Scalable Metric Calculations Across Hardware Levels in Vampir Maximilian Knespel, Robert Schöne and Bert Wesarg (slides) abstract Modern HPC systems require sophisticated and scalable tools to monitor the plethora of existing software, operating system, and hardware levels. One of the most crucial issues is the mapping of metrics in hardware and OS, so that a later analysis allows for an attribution of these metrics to their cause in software. This is made even more difficult by metrics that belong to hardware components that are shared by multiple monitored software threads. While the direct influence of a single thread cannot be attributed to the hardware in such cases, the available information can still describe whether its performance is within expected ranges or represents a bottleneck. Bringing together multiple of these hardware-centric information poses another challenge. Monitoring tools must define and record the placement of components within a system so that subsequent analysis tools can evaluate them. In this paper, we present enhancements to the metric plugin interface used by Score-P and lo2s, which lay the groundwork by enabling a proper attribution for metrics to the correct hardware level. Furthermore, we describe how this enables the Vampir performance analysis tool to visualize and process such metrics across the system hierarchy. We demonstrate the new capabilities through a case study for energy usage analysis, utilizing RAPL energy metrics recorded on packages and cores. Our approach enables the comparison of energy usage at different system levels, providing valuable insights into system performance and energy efficiency.
10:00–10:30	Cube Everywhere: Easy Entry into Performance Analysis Pavel Saviankou, Anke Visser and Bernd Mohr (slides) abstract The analysis of High-Performance Computing (HPC) application performance is a crucial, yet often complex, task. While the Cube tool has previously employed a client-server architecture to manage this complexity, its clients have been restricted to traditional desktop operating systems. This limitation has created a significant barrier to entry and hindered its use in broader contexts, such as workshops and large-scale training events. This paper presents a new development that completes the vision of a truly portable and accessible performance analysis platform. We introduce new clients for web browsers, Android, and iOS that seamlessly integrate with the existing cube_server. This strategic expansion of client platforms is what fundamentally unlocks the full potential of the client-server model, offering direct and easy access to the Cube without a tedious local installation. We detail the technical setup of this system, highlighting how it provides significant advantages for setting up training events and offering convenient access to a wider user base. Furthermore, we analyze the specific benefits and challenges of this approach, offering a comprehensive look into its practical application and potential for democratizing HPC performance analysis.
10:30–11:00	Coffee Break
Session 4: Programing models and tool usage experiences
11:00–11:30	Hierarchical Collective Operations for the Asynchronous Many-Task Runtime HPX Alexander Strack, Lukas Zeil, Hartmut Kaiser and Dirk Pflüger (slides) abstract Asynchronous many-task runtimes (AMTRs) are a refreshing alternative to existing synchronous parallelization approaches like MPI and OpenMP. The AMTR HPX not only provides efficient shared-memory parallelization based on task-futurization, but also supports distributed computing leveraging an active global address space (AGAS). The AGAS of HPX allows for implicit communication utilizing different communication backends, including TCP, LCI, and MPI. For compatibility with traditional parallelization approaches, HPX allows for explicit point-to-point communication and collective operations. In a recent work, we discovered that on large scales, these HPX collective operations are a major bottleneck due to their basic implementation. Therefore, we are working on revising the HPX collective using more advanced algorithms. Precisely, we use a hierarchical communicator to allow for tree-based message propagation with different arities. Early benchmarks indicate significant performance improvements with our newly developed collectives. We achieve a speedup of up to 32 for reduce and broadcast operations and of up to 16 for gather and scatter running on 256 processes on a 16-node cluster. Furthermore, our approach shows competitive performance compared to the respective collectives in OpenMPI. We are in the process of overhauling the entire set of HPX collectives using our new hierarchical approach. This marks a crucial step toward making the HPX runtime competitive with state-of-the-art MPI implementations.
11:30–12:00	Performance Evaluation of the Flow Solver TRACE – from Fluid Flows to Fluid-Structure Interaction via the FSTraceInterfaceeak Martin Clemens, Ramandeep Jain, Christian Berthold, Immo Huismann and Ronny Tschüter (slides) abstract The CFD solver TRACE is used for state-of-the-art turbomachinery de- velopment. The code can either be run as a standalone program for pure CFD applications, or be steered via the HPC-capable FlowSimulator framework using the FSTraceInterface. The latter enables coupled Fluid-Structure Interaction (FSI) simulations. To ensure good performance of the existing code base and new addi- tions, performance evaluations using tools such as Score-P, Vampir and Cube are necessary. A performance analysis for the Harmonic-Balance solver of TRACE is conducted and a costly synchronization delay is detected and subsequently removed. Then, the two-way FSI coupling via FlowSimulator is analyzed for the first time. The Score-P Python bindings are used to instrument this complex C++ – Python tool chain.
12:00–13:30	Lunch
13:30–14:00	Accelerating the FlowSimulator: Low-level Analysis and Optimization of CODA Kernels Johannes Wendler, Jana Gericke-Schuster, Ronny Tschüter and Immo Huismann (slides) abstract This paper investigates the performance of the solver CFD by Onera, DLR and Airbus (CODA). Performance of an HPC code mainly consists of two factors: scalability and node-level performance. Previous studies showed good scalability for CODA [5, 6], but low node-level performance [7], with relevant code parts being neither compute nor memory bound. In this study, to now get a deeper look into what the actual bottlenecks are, we use dynamic and static analysis tools. In conjunction they provide very detailed data about the program execution. Using the dynamic analysis tools LIKWID [4], Linux perf-tools and MAQAO and the static analysis tools OSACA and MAQAO, we investigate a selection of the most relevant kernels in CODA. Focusing on assembly instruction level we then develop optimization strategies, implement them and evaluate their impact.
14:00–14:30	Heuristic-Based Merging of HPC Traces to Extend Hardware Counter Coverage Júlia Orteu, Fabio Banchelli, Marc Clascà and Marta Garcia-Gasulla (slides) abstract The work extends a framework for predicting the performance of High-Performance Computing (HPC) workloads using Machine Learning (ML). A common limitation in performance modeling is the restricted number of hardware counters that can be collected simultaneously. To address this, we propose a heuristic-based methodology to merge execution traces from multiple runs, each instrumented with a different set of hardware counters. Our approach matches computation bursts across executions by analyzing MPI structure, timing, and communication patterns. This process enables the construction of a unified dataset that includes a wider set of hardware features without relying on multiplexing. The output is a new synthetic trace with all merged counters, which can be used both for HPC performance prediction and for conventional performance analysis. The methodology has been validated on MareNostrum5 machine with a range of kernels and real applications. Results show that the merged counters maintain acceptable accuracy depending on the application, and can be directly used to train ML models on a richer feature space without prior counter selection.