Characterization and Acceleration of High Performance Compute Workloads

December 19, 2022

Stefano Corda defended his PhD thesis at the department of Electrical Engineering on December 19th.

Modern big-data workloads found in radio-astronomical imaging, machine learning techniques, and bioinformatics algorithms have demanding performance requirements, which lead to computation and memory bottlenecks. For instance, future radio telescopes such as the Square Kilometre Array will have to process vast amounts of data on high-performance computing systems with high energy efficiency. Computing approaches such as Near-Memory Computing have been proposed as solutions, but these still face performance bottlenecks and optimization challenges. For his PhD research, Stefano Corda looked at application profiling and optimization for such high-performance computing systems.

With the demise of Dennard scaling and the slowing of Moore's law, computing performance is approaching a plateau. Furthermore, memory and processor performance have improved at different rates, a growing gap infamously termed the memory wall. These challenges make it difficult to meet the requirements of demanding computing applications such as machine learning and bioinformatics.

A promising solution is High-Performance Computing (HPC), which uses modern architectures such as multi-core central processing units (CPUs), graphics processing units (GPUs), and field-programmable gate arrays (FPGAs) to accelerate workloads by optimizing code until it runs close to the hardware's performance limits.

Furthermore, among today's emerging computing paradigms, Near-Memory Computing (NMC) is gaining in popularity. NMC is a data-centric approach that performs computation near the memory, thus avoiding the data movement that characterizes classical compute-centric systems and making it a promising candidate for high-performance computing.

Bottleneck issues

With the advent of numerous emerging computing systems, it has become crucial to characterize applications so as to highlight performance bottlenecks and optimization opportunities.

Moreover, algorithm optimization and acceleration are key factors for providing high performance on modern computing systems. However, contemporary workloads do not perform equally on different systems, such as GPUs and FPGAs.

This calls for a careful selection of architectures and optimizations per application domain. To support such choices, application characterization techniques must be studied, for instance the use of machine learning for efficient offloading decisions and the optimization of performance bottlenecks in radio-astronomical imaging applications on heterogeneous architectures.

Application profiling and optimization

For his PhD research, Stefano Corda made key contributions to application profiling and optimization on high-performance computing systems. He extended the state-of-the-art Platform-Independent Software Analysis (PISA) tool with memory and parallelism metrics relevant to NMC.

The metrics considered include memory entropy, spatial locality, data-level parallelism, and basic-block-level parallelism. By profiling a set of representative applications and correlating these metrics with each application's performance on a simulated NMC system, Corda and his collaborators showed that the additional metrics improve on state-of-the-art tools in identifying applications suited to NMC architectures.
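To give a flavor of what such a trace-based metric looks like (a minimal sketch for illustration, not PISA's actual implementation), memory entropy can be computed as the Shannon entropy of the addresses in a memory-access trace; recomputing it with low-order address bits dropped hints at the spatial locality of the access pattern:

```python
import math
from collections import Counter

def memory_entropy(addresses, drop_bits=0):
    """Shannon entropy (in bits) of a memory-access trace.

    Dropping low-order address bits groups nearby accesses into one block,
    so comparing entropy at several granularities gives a rough picture of
    spatial locality: the faster the entropy falls as drop_bits grows, the
    more spatially local the access pattern is.
    """
    blocks = Counter(addr >> drop_bits for addr in addresses)
    total = sum(blocks.values())
    return -sum(n / total * math.log2(n / total) for n in blocks.values())

# Toy traces: a sequential (streaming) pattern versus a scattered one.
sequential = list(range(0x1000, 0x1000 + 64 * 8, 8))
scattered = [(i * 2654435761) & 0xFFFFFF for i in range(64)]
for name, trace in [("sequential", sequential), ("scattered", scattered)]:
    print(name, [round(memory_entropy(trace, b), 2) for b in (0, 6, 12)])
```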

Since hardware-independent analysis is expensive in terms of computation time and resources, Corda suggests employing an ensemble machine learning model together with hardware-dependent application analysis. This reduces prediction time by up to three orders of magnitude compared to the state of the art.
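As a rough sketch of that idea (the features and data below are synthetic; this is not Corda's actual model or feature set), an ensemble regressor such as a random forest can be trained on a handful of cheap, hardware-dependent profiling features to predict a quantity that would otherwise require a slow hardware-independent analysis:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_apps = 200
# Hypothetical hardware-dependent features, e.g. cache-miss rate, IPC,
# branch-miss rate, collected from a fast profiling run.
X = rng.random((n_apps, 3))
# Stand-in target: the expensive hardware-independent metric to predict.
y = 5.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 0.1, n_apps)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print("R^2 on held-out applications:", round(model.score(X_test, y_test), 3))
```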

Radio-astronomical imaging case study

While the previous contributions rely on a benchmark-based methodology, Corda also focused on a real-world use case: radio-astronomical imaging, where a CPI (cycles per instruction) breakdown analysis on modern CPUs identifies large 2D FFTs and the gridder as performance bottlenecks.
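In its simplest form, such a breakdown divides the measured cycle count by the instruction count and attributes stall cycles to their causes; the sketch below uses made-up counter values and hypothetical counter names, whereas a real analysis reads hardware performance counters (for example with perf) and follows a full top-down methodology:

```python
# Made-up counters for illustration; not measured data from the thesis.
counters = {
    "cycles": 9.0e9,
    "instructions": 4.5e9,
    "stall_cycles_memory": 3.6e9,    # hypothetical counter name
    "stall_cycles_frontend": 0.9e9,  # hypothetical counter name
}

cpi = counters["cycles"] / counters["instructions"]
memory_share = counters["stall_cycles_memory"] / counters["cycles"]
frontend_share = counters["stall_cycles_frontend"] / counters["cycles"]

print(f"CPI = {cpi:.2f}")
print(f"memory stalls    = {memory_share:.0%} of cycles")
print(f"front-end stalls = {frontend_share:.0%} of cycles")
```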

The thesis presents an NMC accelerator for 2D FFT computation and shows that its FPGA implementation outperforms the CPU counterpart and performs comparably to a high-end GPU.
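One property that makes the 2D FFT attractive for such accelerators is that it separates into 1D FFTs over the rows and then the columns, so hardware can stream data through 1D FFT pipelines; the snippet below merely verifies that decomposition numerically and says nothing about the thesis's actual design:

```python
import numpy as np

x = np.random.default_rng(0).random((256, 256))
# 1D FFTs over the rows, then 1D FFTs over the columns...
row_col = np.fft.fft(np.fft.fft(x, axis=1), axis=0)
# ...equal the full 2D FFT.
assert np.allclose(row_col, np.fft.fft2(x))
```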

To improve the performance of the gridder, the thesis exploits reduced-precision acceleration, contrary to the usual practice of employing high-precision computations in radio-astronomical imaging. The reduced-precision analysis shows that the precision must be selected carefully. The thesis presents the first reduced-precision accelerator for the gridder, employing custom floating-point data types on FPGA. The prototype outperforms a CPU and keeps up with a GPU of similar peak performance and lithography technology.
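A tiny numerical experiment (synthetic data, not the thesis's accelerator or results) illustrates why that care is needed: naively accumulating many weighted samples, as gridding does per grid cell, loses accuracy quickly in half precision, while single precision stays close to a double-precision reference:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
samples = rng.random(n)   # stand-in for visibility magnitudes
weights = rng.random(n)   # stand-in for gridding weights

ref = float(np.sum(samples * weights))  # double-precision reference

for dtype in (np.float32, np.float16):
    acc = dtype(0)
    for v in samples.astype(dtype) * weights.astype(dtype):
        acc = acc + v                   # naive sequential accumulation
    rel_err = abs(float(acc) - ref) / ref
    print(f"{np.dtype(dtype).name}: relative error {rel_err:.1e}")
```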

Title of PhD thesis: Characterization and Acceleration of High-Performance Compute Workloads. Supervisors: Henk Corporaal, Akash Kumar, and Roel Jordans.

Media contact

Barry Fitzgerald
(Science Information Officer)
