Introduction:
High-Performance Linpack (HPL) is a widely used benchmark for measuring the computational performance of computing systems. In this blog post, we describe an HPL run on two ARM64 machines in which the Host-2 machine surprisingly delivered roughly 20% lower performance than the Host-1 machine. Intrigued by this result, we set out to comprehensively diagnose the underlying cause of this performance discrepancy.
Why HPL for Performance Testing?
People use the High-Performance Linpack (HPL) benchmark for performance testing because it provides a standardised and demanding workload that measures the peak processing power of computer systems, particularly in terms of floating-point calculations.
It helps assess and compare the computational capabilities of different hardware configurations.
This benchmark helps in comparing and ranking supercomputers' performance and is often used as a metric for the TOP500 list of the world's most powerful supercomputers. For more information, you can refer to the TOP500 article here: TOP500 List
Objective:
The primary objective of this investigation was to identify the reason behind the 20% performance difference observed in the HPL test between the Host-1 and Host-2 machines. To comprehensively diagnose the performance discrepancy, we conducted additional benchmark tests, including Stream, Lmbench, and bandwidth tests.
1. System Details:
We conducted a fair and controlled experiment using two ARM64 machines, referred to as Host-1 and Host-2.
1.1 Machine Specifications (Host-1 and Host-2):
CPU(s): 96
Architecture: aarch64
Total memory: 96 GB
Memory speed: 3200 MHz
2. Running HPL Benchmark:
To run the HPL benchmark on an ARM64 machine, you can refer to this GitHub repository: https://github.com/AmpereComputing/HPL-on-Ampere-Altra.
The repository provides instructions, scripts, and configurations specific to running HPL on Ampere Altra ARM64-based machines.
Following the guidelines in the repository helps ensure accurate and meaningful benchmarking results.
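Before running, the HPL input has to be sized for the machine. As a rough illustration of the usual rule of thumb (size the N x N double-precision matrix to occupy most of the installed RAM, leaving headroom for the OS and MPI), the sketch below estimates a starting problem size. The 96 GB figure matches our hosts, but the 80% fraction and the NB = 256 block size are assumptions for illustration only; the repository above has its own tuning guidance.

```c
/* Rough starting point for the HPL problem size N.
 * Assumptions (illustrative only): use ~80% of the 96 GB of RAM for the
 * N x N double-precision matrix and round N down to a multiple of NB = 256. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double mem_bytes = 96.0 * 1024 * 1024 * 1024;  /* 96 GB, as on both hosts */
    double usable    = 0.80 * mem_bytes;           /* headroom for OS and MPI  */
    long   n         = (long)sqrt(usable / 8.0);   /* N*N*8 bytes <= usable    */
    n -= n % 256;                                  /* align to the block size  */
    printf("Suggested starting HPL problem size N ~= %ld\n", n);
    return 0;
}
```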
2.1 HPL Scores:
Upon completing the HPL benchmark on both machines, we computed and compared the achieved HPL scores. The Host-1 machine garnered a higher HPL score, signifying better computational performance.
| Machine | Time (sec) | Score |
|---------|------------|-------|
| Host-1  | 619.91     | 1245  |
| Host-2  | 784        | 985   |
This result raised a critical question: why was there such a substantial performance gap? To delve into the root causes behind this discrepancy, we decided to conduct a series of additional tests to comprehensively investigate the issue.
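One quick sanity check on these numbers, assuming the score column is in GFLOPS and counting only the dominant (2/3)*N^3 floating-point operations of the LU factorisation (both of which are assumptions on our part), is to back-estimate the problem size N implied by each score and run time; similar values for the two hosts would indicate that both machines solved a comparably sized problem.

```c
/* Back-estimate the HPL problem size N from a reported score and run time,
 * assuming the score is in GFLOPS and counting only the dominant (2/3)*N^3
 * operations of LU factorisation. Purely a consistency check. */
#include <math.h>
#include <stdio.h>

static long implied_n(double gflops, double seconds) {
    double total_flops = gflops * 1e9 * seconds;
    return (long)cbrt(1.5 * total_flops);          /* N ~ (3/2 * flops)^(1/3) */
}

int main(void) {
    printf("Host-1: N ~= %ld\n", implied_n(1245.0, 619.91));
    printf("Host-2: N ~= %ld\n", implied_n(985.0, 784.0));
    return 0;
}
```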
3. Exploring Additional Tests:
We conducted several other benchmark tests to comprehensively investigate the performance discrepancy between the Host-1 and Host-2 ARM64 machines. These tests aimed to shed light on various aspects of the systems' hardware and memory subsystems, providing a holistic understanding of the observed difference.
Below, we detail the tests and their findings:
3.1 Stream Benchmark:
The Stream benchmark assesses memory bandwidth and measures the system's capability to read from and write to memory.
The benchmark consists of four fundamental tests: Copy, Scale, Add, and Triad.
Copy: Measures the speed of copying one array to another.
Scale: Evaluates the performance of multiplying an array by a constant.
Add: Tests the speed of adding two arrays together.
Triad: Measures the performance of a combination of operations involving three arrays.
The Stream benchmark helps uncover memory bandwidth limitations and assess memory subsystem efficiency.
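For reference, the four kernels reduce to the loops below. This is a simplified, single-threaded sketch of what STREAM measures, not the official benchmark source, which adds OpenMP threading, multiple repetitions, result validation, and more careful timing.

```c
/* Simplified, single-threaded sketch of the four STREAM kernels.
 * The official benchmark (https://www.cs.virginia.edu/stream/) is the
 * authoritative version; this only illustrates what each kernel does. */
#include <stdio.h>
#include <time.h>

#define N 10000000                  /* array length; pick >> last-level cache */

static double a[N], b[N], c[N];

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    const double s = 3.0;
    for (long j = 0; j < N; j++) { a[j] = 1.0; b[j] = 2.0; c[j] = 0.0; }

    double t0 = now_sec();
    for (long j = 0; j < N; j++) c[j] = a[j];            /* Copy  */
    double t1 = now_sec();
    for (long j = 0; j < N; j++) b[j] = s * c[j];        /* Scale */
    double t2 = now_sec();
    for (long j = 0; j < N; j++) c[j] = a[j] + b[j];     /* Add   */
    double t3 = now_sec();
    for (long j = 0; j < N; j++) a[j] = b[j] + s * c[j]; /* Triad */
    double t4 = now_sec();

    /* Copy and Scale move 2 arrays per iteration, Add and Triad move 3. */
    printf("Copy : %10.1f MB/s\n", 2.0 * 8 * N / (t1 - t0) / 1e6);
    printf("Scale: %10.1f MB/s\n", 2.0 * 8 * N / (t2 - t1) / 1e6);
    printf("Add  : %10.1f MB/s\n", 3.0 * 8 * N / (t3 - t2) / 1e6);
    printf("Triad: %10.1f MB/s\n", 3.0 * 8 * N / (t4 - t3) / 1e6);
    printf("checksum: %f\n", a[0] + b[0] + c[0]);        /* keep results live */
    return 0;
}
```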
Host-1 machine result:

| Function | Best Rate (MB/s) | Avg time (s) | Min time (s) | Max time (s) |
|----------|------------------|--------------|--------------|--------------|
| Copy     | 103837.8         | 0.367897     | 0.36192      | 0.373494     |
| Scale    | 102739.4         | 0.369191     | 0.365789     | 0.372439     |
| Add      | 106782.7         | 0.536131     | 0.527908     | 0.542759     |
| Triad    | 106559.1         | 0.533549     | 0.529016     | 0.537881     |
Host-2 machine result:

| Function | Best Rate (MB/s) | Avg time (s) | Min time (s) | Max time (s) |
|----------|------------------|--------------|--------------|--------------|
| Copy     | 66071.3          | 0.572721     | 0.568794     | 0.575953     |
| Scale    | 65708.8          | 0.575758     | 0.571932     | 0.580686     |
| Add      | 67215.5          | 0.843995     | 0.838667     | 0.848371     |
| Triad    | 67668.1          | 0.837109     | 0.833058     | 0.84079      |
[Graph: Best Rate (MB/s) vs Function for Host-1 and Host-2]
In the Stream benchmark results, Host-1 outperformed Host-2 across all four functions (Copy, Scale, Add, Triad).
Host-1 sustained roughly 103-107 GB/s, while Host-2 reached only about 66-68 GB/s, i.e. roughly 35-37% lower bandwidth on every function.
This points to a stronger memory subsystem in Host-1 compared to Host-2.
3.2 Lmbench for Memory Latency:
Lmbench is a suite of micro-benchmarks designed to provide insights into various aspects of system performance.
The suite includes latency tests for system calls, memory accesses, and various operations to quantify the system's responsiveness.
Memory access tests include random read/write latency and bandwidth, helping to identify memory subsystem performance.
File I/O tests evaluate file system performance, providing insights into storage subsystem capabilities.
Result:
Memory Latency: Memory latency refers to the time it takes for the CPU to access a specific memory location. Lower latency values indicate better performance, as data can be fetched more quickly.
| Size (MB) | Host-1 latency (ns) | Host-2 latency (ns) |
|-----------|---------------------|---------------------|
| 0.00049   | 1.43                | 1.429               |
| …         | …                   | …                   |
| 2         | 32.355              | 32.786              |
| 3         | 34.503              | 36.012              |
| 4         | 37.403              | 37.932              |
| 6         | 39.982              | 52.922              |
| 8         | 41.007              | 54.001              |
| 12        | 44.315              | 55.466              |
| 16        | 65.52               | 73.016              |
| 24        | 95.131              | 117.278             |
| 32        | 115.081             | 138.945             |
| 48        | 126.796             | 151.945             |
| 64        | 129.558             | 159.225             |
| 96        | 134.413             | 166.359             |
| 128       | 136.239             | 167.788             |
| 192       | 136.245             | 168.689             |
| 256       | 136.366             | 170.464             |
| 384       | 137.732             | 170.461             |
| …         | …                   | …                   |
| 2048      | 135.61              | 149.809             |
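The step-like shape of this curve, flat within a cache level and rising as the working set spills into the next level and eventually into DRAM, is what pointer-chasing benchmarks such as lmbench's lat_mem_rd produce: each load's address depends on the result of the previous load, so the average time per access approximates the true load-to-use latency. The sketch below only illustrates the idea and is not lmbench's actual implementation.

```c
/* Pointer-chasing latency sketch: link the slots of a buffer into one random
 * cycle, then chase the pointers so every load depends on the previous one.
 * The average time per hop approximates load-to-use latency for that
 * working-set size. lmbench's lat_mem_rd is far more careful than this. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double chase_ns(size_t bytes, long hops) {
    size_t n = bytes / sizeof(void *);
    void **buf = malloc(n * sizeof(void *));
    size_t *idx = malloc(n * sizeof(size_t));

    /* Visit the slots in a random order and link each one to the next,
     * forming a single cycle through the whole buffer. */
    for (size_t i = 0; i < n; i++) idx[i] = i;
    for (size_t i = n - 1; i > 0; i--) {            /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i + 1 < n; i++) buf[idx[i]] = &buf[idx[i + 1]];
    buf[idx[n - 1]] = &buf[idx[0]];

    void **p = &buf[idx[0]];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long h = 0; h < hops; h++) p = (void **)*p;   /* dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    volatile void *sink = p;   /* keep p live so the loop is not optimised away */
    (void)sink;
    free(idx);
    free(buf);
    return ns / (double)hops;
}

int main(void) {
    size_t sizes_mb[] = {1, 4, 16, 64, 256};
    for (int i = 0; i < 5; i++)
        printf("%4zu MB: %6.2f ns per load\n", sizes_mb[i],
               chase_ns(sizes_mb[i] << 20, 20 * 1000 * 1000));
    return 0;
}
```

This dependent-load pattern is also where differences in DRAM configuration, such as single-rank versus dual-rank DIMMs, tend to show up most clearly.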
4. Analysis and Findings:
After conducting these benchmark tests, we observed that the Host-2 machine consistently exhibited lower performance than the Host-1 machine across the different tests.
The most significant finding came from the Lmbench test, which revealed that the Host-2 machine's RAM had notably higher latency compared to the Host-1 machine.
Notably, we identified an additional factor: the RAM rank. The Host-1 machine is equipped with Dual-Rank RAM, while the Host-2 machine has Single-Rank RAM. This difference in RAM rank could contribute to the performance discrepancy.
The observation is in line with findings from various other studies that have examined the influence of RAM rank on system performance. To gain a more comprehensive understanding of this subject, the following articles could be of interest:
Single vs. Dual-Rank RAM: Which Memory Type Will Boost Performance? - This article provides a thorough comparison of single-rank and dual-rank RAM, explaining the differences between the two RAM types, how to tell them apart, and how to choose the most suitable option for your needs. (LINK)
Single Rank vs Dual Rank RAM: Differences & Performance Impact - This article examines the structural differences between Single Rank and Dual Rank RAM modules and assesses their respective impacts on performance. (LINK)
5. Conclusion:
After conducting an extensive series of benchmark tests, we have pinpointed certain factors that contribute to the performance disparity observed in the HPL test between the two ARM64 machines.
The Stream benchmark confirmed that Host-1 sustains significantly higher memory bandwidth than Host-2 across all four functions (Copy, Scale, Add, Triad).
Additionally, the higher memory latency in the Host-2 machine's RAM was identified as a key contributor to the performance gap.
This latency impacted the efficiency of memory operations and had a cascading effect on overall performance.
Another significant factor was the difference in RAM rank configurations: Host-1 had Dual-Rank RAM, while Host-2 had Single-Rank RAM.
This divergence likely contributed to the varying memory access speeds between the two machines.
6. Future Scope:
A natural next step is to extend the investigation with additional benchmark tests, in particular the Lmbench memory bandwidth test.
This test would provide deeper insights into the memory subsystem's performance on both the Host-1 and Host-2 machines.
Additionally, an interesting avenue for investigation could involve modifying the RAM configuration in one of the machines and assessing its impact on performance. This would provide valuable information about the role of memory specifications in influencing the overall system performance.