By Vishvanath Metkari

Investigating Performance Discrepancy in HPL Test on ARM64 Machines

Updated: Nov 12

Introduction:



High-Performance Linpack (HPL) is a widely used benchmark for measuring the computational performance of computing systems. In this blog post, we explore a scenario in which we ran HPL on two otherwise comparable ARM64 machines and the Host-2 machine scored about 20% lower than the Host-1 machine. This post traces how we diagnosed the underlying cause of that performance gap.


Why HPL for Performance Testing?



  • People use the High-Performance Linpack (HPL) benchmark for performance testing because it provides a standardised and demanding workload that measures the peak processing power of computer systems, particularly in terms of floating-point calculations.

  • It helps assess and compare the computational capabilities of different hardware configurations.

  • This benchmark is used to compare and rank supercomputers and serves as the metric for the TOP500 list of the world's most powerful supercomputers. For more information, you can refer to the TOP500 List article. The score itself is derived from a fixed floating-point operation count, as sketched below.
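For reference, HPL derives its score from the fixed operation count of LU factorization with partial pivoting on an N x N system, (2/3)*N^3 + 2*N^2 floating-point operations, divided by the wall time. A minimal sketch of that calculation (the N and time values below are illustrative, not taken from our runs):

```c
#include <stdio.h>

/* HPL reports performance using the standard LU operation count:
 * (2/3)*N^3 + 2*N^2 floating-point operations for an N x N system. */
static double hpl_gflops(double n, double seconds)
{
    double flops = (2.0 / 3.0) * n * n * n + 2.0 * n * n;
    return flops / seconds / 1e9;
}

int main(void)
{
    /* Illustrative values; substitute the N and wall time from your run. */
    double n = 100000.0, seconds = 619.91;
    printf("HPL score: %.1f GFLOPS\n", hpl_gflops(n, seconds));
    return 0;
}
```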


Objective:


The primary objective of this investigation was to identify the reason behind the 20% performance difference observed in the HPL test between the Host-1 and Host-2 machines. To comprehensively diagnose the performance discrepancy, we conducted additional benchmark tests, including Stream, Lmbench, and bandwidth tests.


1. System Details:


We conducted a fair and controlled experiment using two ARM64 machines, referred to as Host-1 and Host-2.


 1.1 Machine Specifications (Host-1 and Host-2):


  • CPU(s): 96

  • Architecture: aarch64

  • Total memory: 96 GB

  • Memory speed: 3200 MHz
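As a quick sanity check before benchmarking, the core count and total memory can be confirmed programmatically. A minimal sketch using POSIX sysconf (on Linux, lscpu and free report the same information):

```c
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Online CPU count and total physical memory via sysconf;
     * _SC_PHYS_PAGES is a common extension available on Linux/glibc. */
    long cpus  = sysconf(_SC_NPROCESSORS_ONLN);
    long pages = sysconf(_SC_PHYS_PAGES);
    long psize = sysconf(_SC_PAGESIZE);

    printf("CPU(s): %ld\n", cpus);
    printf("Total memory: %.1f GB\n",
           (double)pages * (double)psize / (1024.0 * 1024.0 * 1024.0));
    return 0;
}
```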


2. Running HPL Benchmark:


  • To run the HPL benchmark on an ARM64 machine, you can refer to this GitHub repository: https://github.com/AmpereComputing/HPL-on-Ampere-Altra

  • The repository contains instructions, scripts, and configurations specific to running HPL on Ampere Altra ARM64-based machines.

  • It's important to follow the guidelines provided in the repository to ensure accurate and meaningful benchmarking results. One key tuning step, sizing the problem to fit memory, is sketched below.
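A common rule of thumb when editing HPL.dat is to size the matrix to fill roughly 80% of total RAM, N ≈ sqrt(0.80 * mem_bytes / 8), rounded down to a multiple of the block size NB. A minimal sketch of that sizing calculation (the 80% fraction and NB = 256 are typical starting points, not values mandated by the repository):

```c
/* build with: cc hpl_size.c -lm */
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Size the HPL matrix to use ~80% of RAM; each double is 8 bytes,
     * and N is conventionally rounded down to a multiple of NB. */
    double mem_gb   = 96.0;   /* total RAM on our hosts */
    double fraction = 0.80;   /* typical starting point */
    long   nb       = 256;    /* illustrative block size */

    long n = (long)sqrt(fraction * mem_gb * 1024.0 * 1024.0 * 1024.0 / 8.0);
    n -= n % nb;

    printf("Suggested HPL problem size N: %ld\n", n);
    return 0;
}
```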


2.1 HPL Scores:



Upon completing the HPL benchmark on both machines, we compared the achieved HPL scores. The Host-1 machine recorded a higher HPL score, signifying better computational performance.



Machine | Time (sec) | Score (GFLOPS)
Host-1  | 619.91     | 1245
Host-2  | 784        | 985


This result raised a critical question: why was there such a substantial performance gap? Host-2's score of 985 is (1245 - 985) / 1245 ≈ 21% below Host-1's 1245, consistent with the roughly 20% difference noted above. To get to the root cause, we decided to conduct a series of additional tests and investigate the issue comprehensively.


3. Exploring Additional Tests:


We conducted several other benchmark tests to comprehensively investigate the performance discrepancy between the Host-1 and Host-2 ARM64 machines. These tests aimed to shed light on various aspects of the systems' hardware and memory subsystems, providing a holistic understanding of the observed difference.


Below, we detail the tests and their findings:


3.1 Stream Benchmark:



  • The Stream benchmark assesses memory bandwidth and measures the system's capability to read from and write to memory. 

  • The benchmark consists of four fundamental tests: Copy, Scale, Add, and Triad. 

  • Copy: Measures the speed of copying one array to another. 

  • Scale: Evaluates the performance of multiplying an array by a constant.

  • Add: Tests the speed of adding two arrays together. 

  • Triad: Measures the performance of a combination of operations involving three arrays. 

  • The Stream benchmark helps uncover memory bandwidth limitations and assess memory subsystem efficiency; the four loops are sketched below.
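To make the four kernels concrete, here is a minimal single-threaded sketch of the loops STREAM times (the real benchmark adds OpenMP threading, arrays sized to dwarf the caches, and repeated trials):

```c
#include <stddef.h>

#define N 10000000   /* illustrative; STREAM sizes arrays to exceed the caches */

static double a[N], b[N], c[N];

void stream_kernels(double scalar)
{
    /* Copy:  c = a            (16 bytes moved per element) */
    for (size_t i = 0; i < N; i++) c[i] = a[i];

    /* Scale: b = scalar * c   (16 bytes per element) */
    for (size_t i = 0; i < N; i++) b[i] = scalar * c[i];

    /* Add:   c = a + b        (24 bytes per element) */
    for (size_t i = 0; i < N; i++) c[i] = a[i] + b[i];

    /* Triad: a = b + scalar*c (24 bytes per element) */
    for (size_t i = 0; i < N; i++) a[i] = b[i] + scalar * c[i];
}
```

The reported "Best Rate MB/s" divides the bytes each loop moves by the minimum time across trials, so it reflects the fastest observed pass.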


Host-1 machine result:

Function | Best Rate (MB/s) | Avg time (s) | Min time (s) | Max time (s)
Copy     | 103837.8         | 0.367897     | 0.36192      | 0.373494
Scale    | 102739.4         | 0.369191     | 0.365789     | 0.372439
Add      | 106782.7         | 0.536131     | 0.527908     | 0.542759
Triad    | 106559.1         | 0.533549     | 0.529016     | 0.537881

Host-2 machine result:

Function | Best Rate (MB/s) | Avg time (s) | Min time (s) | Max time (s)
Copy     | 66071.3          | 0.572721     | 0.568794     | 0.575953
Scale    | 65708.8          | 0.575758     | 0.571932     | 0.580686
Add      | 67215.5          | 0.843995     | 0.838667     | 0.848371
Triad    | 67668.1          | 0.837109     | 0.833058     | 0.84079

[Graph: Best Rate (MB/s) by function, Host-1 vs Host-2]



[Chart: Stream benchmark performance comparison (%)]

In the Stream benchmark results, Host-1 outperformed Host-2 across all four functions (Copy, Scale, Add, Triad).

  • Host-1 demonstrated higher memory bandwidth in every function; Host-2 delivered roughly 36-37% less bandwidth on each kernel.

  • This points to a stronger memory subsystem in Host-1 compared to Host-2.


3.2 Lmbench for Memory Latency: 



  • Lmbench is a suite of micro-benchmarks designed to provide insights into various aspects of system performance.

  • The suite includes latency tests for system calls, memory accesses, and various operations to quantify the system's responsiveness. 

  • Memory access tests include random read/write latency and bandwidth, helping to identify memory subsystem performance.

  • File I/O tests evaluate file system performance, providing insights into storage subsystem capabilities. The memory latency measurement technique is sketched below.
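Lmbench measures memory latency with a pointer chase (its lat_mem_rd tool): each load's address depends on the previous load's result, so the CPU cannot overlap or prefetch the accesses. A minimal sketch of the technique, simplified relative to the real tool (fixed stride, single pass per size):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Average time per dependent load over a buffer of n pointers.
 * Each load's address comes from the previous load, so loads serialize. */
static double chase_ns(size_t n, size_t iters)
{
    void **buf = malloc(n * sizeof *buf);
    size_t stride = 16;   /* 128 bytes between touched slots */
    for (size_t i = 0; i < n; i++)
        buf[i] = &buf[(i + stride) % n];

    void **p = buf;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++)
        p = (void **)*p;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (!p) fprintf(stderr, "unreachable\n");   /* keep the chase live */
    free(buf);
    return ((t1.tv_sec - t0.tv_sec) * 1e9
            + (t1.tv_nsec - t0.tv_nsec)) / (double)iters;
}

int main(void)
{
    for (size_t mb = 1; mb <= 64; mb *= 2)      /* sweep working-set size */
        printf("%3zu MB: %6.2f ns/load\n", mb,
               chase_ns(mb * 1024 * 1024 / sizeof(void *), 10000000));
    return 0;
}
```

Sweeping the working-set size this way, as in the table below, exposes the memory hierarchy: latency jumps each time the buffer spills out of a cache level and finally settles at DRAM latency.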


Result


  • Memory Latency: Memory latency refers to the time it takes for the CPU to access a specific memory location. Lower latency values indicate better performance, as data can be fetched more quickly.



Size (MB) | Latency (ns), Host-1 | Latency (ns), Host-2
0.00049   | 1.43                 | 1.429
...       | ...                  | ...
2         | 32.355               | 32.786
3         | 34.503               | 36.012
4         | 37.403               | 37.932
6         | 39.982               | 52.922
8         | 41.007               | 54.001
12        | 44.315               | 55.466
16        | 65.52                | 73.016
24        | 95.131               | 117.278
32        | 115.081              | 138.945
48        | 126.796              | 151.945
64        | 129.558              | 159.225
96        | 134.413              | 166.359
128       | 136.239              | 167.788
192       | 136.245              | 168.689
256       | 136.366              | 170.464
384       | 137.732              | 170.461
...       | ...                  | ...
2048      | 135.61               | 149.809


[Graph: latency (ns) vs size (MB), from the Lmbench memory latency test]

4. Analysis and Findings:



  • After conducting these benchmark tests, we observed that the Host-2 machine consistently exhibited lower performance than the Host-1 machine across the different tests.


  • The most significant finding came from the Lmbench test, which revealed that the Host-2 machine's RAM had notably higher latency than the Host-1 machine's.


  • Notably, an additional factor was identified: the RAM rank. The Host-1 machine is equipped with Dual-Rank RAM, while the Host-2 machine has Single-Rank RAM. This RAM rank difference could contribute to the performance discrepancy. (On Linux, the rank of each installed DIMM can be verified with dmidecode -t memory, which reports a Rank field per module.)


  • The observation is in line with findings from various other studies that have examined the influence of RAM rank on system performance. To gain a more comprehensive understanding of this subject, the following articles could be of interest:

  • Single vs. Dual-Rank RAM: Which Memory Type Will Boost Performance? - This article provides a thorough comparison between single and dual-rank RAM, aiding in comprehending the disparities between these two RAM types, methods to distinguish them, and guidance on selecting the most suitable option for your needs. (LINK)


  • Single Rank vs Dual Rank RAM: Differences & Performance Impact: This article delves into the differences between Single Rank and Dual Rank RAM modules, investigating their structural dissimilarities and assessing their respective impacts on performance. (LINK)



5. Conclusion:



  • After conducting an extensive series of benchmark tests, we have pinpointed certain factors that contribute to the performance disparity observed in the HPL test between the two ARM64 machines.


  • In the Stream Benchmark results, Host-1 outperformed Host-2 across all functions (Copy, Scale, Add, Triad). Host-1 demonstrated higher memory bandwidth in each function, achieving significantly faster data transfer rates. 


  • Additionally, the higher memory latency in the Host-2 machine's RAM was identified as a key contributor to the performance gap.


  • This latency impacted the efficiency of memory operations and had a cascading effect on overall performance. 


  • Another significant factor was the difference in RAM rank configurations: Host-1 had Dual-Rank RAM, while Host-2 had Single-Rank RAM.


  • This divergence likely contributed to the varying memory access speeds between the two machines.


6. Future Scope:



  • For further exploration, we recommend extending the investigation with additional benchmark tests, specifically the Lmbench memory bandwidth test (a sketch of the underlying measurement appears after this list).


  • This test would provide deeper insights into the memory subsystem's performance on both the Host-1 and Host-2 machines. 


  • Additionally, an interesting avenue for investigation could involve modifying the RAM configuration in one of the machines and assessing its impact on performance. This would provide valuable information about the role of memory specifications in influencing the overall system performance.
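For that bandwidth follow-up, lmbench's bw_mem is the relevant tool (for example, its rd mode times sequential reads). A minimal sketch of the underlying measurement; the 256 MB buffer size is an illustrative choice meant to exceed the last-level cache:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    /* Time a sequential read over a buffer larger than the caches and
     * report MB/s, in the spirit of lmbench's "bw_mem <size> rd". */
    size_t n = 256 * 1024 * 1024 / sizeof(double);
    double *buf = malloc(n * sizeof(double));
    for (size_t i = 0; i < n; i++)   /* touch pages so they are resident */
        buf[i] = (double)i;

    double sum = 0.0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < n; i++)   /* summing keeps the loads live */
        sum += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("read bandwidth: %.1f MB/s (checksum %g)\n",
           n * sizeof(double) / 1e6 / sec, sum);
    free(buf);
    return 0;
}
```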

