Sumedh Patkar

Server Performance Prediction using ML Models - Part 1

Updated: Dec 3

This blog post is the first in the series "Server Performance Prediction using Machine Learning Models".


OVERVIEW:


Introduction to Server Performance Prediction using ML

In the semiconductor industry, a silicon cycle takes nearly a year or more from conception to the silicon being available. As soon as the concept comes in, there is immense pressure on the marketing and sales teams to come up with performance numbers for the new generation of silicon, i.e. the processor. There is a need for a way to find out what the benchmark scores on this new generation could be. Since the new processor is not physically available, companies run several benchmarks on simulators/emulators to obtain performance scores. This process has two major drawbacks:


  1. Running a benchmark on a simulator takes hours longer than running it on the physical processor

  2. Such simulators are very expensive, and only a limited number of them are available


There should be some way of projecting new scores based on the scores of the older generation of processors. In this project, we aim to predict performance for a benchmark on newer generation processors by training Machine Learning Models on older generation data.


METHODOLOGY:


Data Gathering and Performance Metrics

Performance Parameters:


We identified two performance parameters, namely IPC (Instructions per cycle) and Performance Runtime. The Runtime is actually derived from the Instructions per cycle, assuming a particular clock speed. In the current iteration of the research, we will limit our work to IPC and Runtime prediction.


Data Gathering and pre-processing:


We gathered our data on two generations of the Graviton processors, namely Graviton 2 and Graviton 3. We captured data for benchmark suites named SPECInt and SPECFp designed by the Standard Performance Evaluation Corporation.

For each benchmark in these SPEC benchmark suites, we captured the following counters at 100-millisecond intervals:


  1. L1I_TLB_REFILL
  2. L1D_TLB_REFILL
  3. L3D_CACHE_ALLOCATE
  4. L3D_CACHE_REFILL
  5. L3D_CACHE
  6. l1i_cache
  7. l1i_cache_refill
  8. l1d_cache
  9. l1d_cache_refill
  10. l2d_cache
  11. l2d_cache_refill
  12. br_mis_pred
  13. br_pred
  14. mem_access
  15. stall_backend
  16. stall_frontend
  17. ASE_SPEC
  18. VFP_SPEC
  19. L1I_CACHE_REFILL
  20. L2D_CACHE_REFILL
  21. INST_SPEC
  22. BR_RETIRED
  23. BR_MIS_PRED_RETIRED
  24. branch-loads
  25. MEM_ACCESS_RD


Why SPEC?


The reason we captured data on the SPEC benchmark suites is that these benchmarks are designed to exercise most areas of the CPU. For example, a benchmark like gcc stresses the memory subsystem, whereas another benchmark like x264 stresses the vectorization capabilities of the CPU.

Let us call a set of counters collectively as “The Snapshot” of the System.

In addition, we capture instructions and cycles every 100 milliseconds. During post processing, we calculate the Instructions-Per-Cycle (IPC) for every snapshot.

For our machine learning model discussed further, we need a set of Input variables and an output variable. The Snapshot of the System serves as the input set of variables (X). The output variable is the Ratio of the IPCs of the newer generation of CPUs to the older generation. Let us have a look at how this training data is generated.


Post Processing the counters:


Post Processing and Snapshot analysis

For a benchmark, the captured data is available in 7 CSV files. Each file has a few counters (e.g. L1D_TLB_REFILL) captured at 100-millisecond intervals over the duration of the benchmark. There are 7 files because only a limited number of counters can be captured during a given benchmark run; hence, we run the benchmark several times in order to capture all the necessary counters.

Inside a counter file there are many columns, of which only 3 are important for post processing: time, count, and type. We filter the dataframe read from the CSV file down to these 3 columns. For example, a row in the filtered dataframe would look like 0.300514373,1778078,L1D_TLB_REFILL. This row means that the counter L1D_TLB_REFILL was captured at roughly 300 milliseconds from the start of the test, with a count of 1778078.
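As a minimal sketch of this filtering step (the time, count, and type values come from the example above; the extra column names in the sample CSV are assumptions, since the real files may have different headers):

```python
import pandas as pd
from io import StringIO

# Hypothetical two-row excerpt of one counter CSV file; only the
# time, count, and type columns are taken from the post.
raw = StringIO(
    "time,count,unit,type,runtime,pct\n"
    "0.300514373,1778078,,L1D_TLB_REFILL,100139699,100.00\n"
    "0.300514373,618104386,,instructions,100139699,100.00\n"
)

df = pd.read_csv(raw)
# Keep only the three columns needed for post processing
filtered = df[["time", "count", "type"]]
print(filtered.to_string(index=False))
```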


Once we have the counter files for a benchmark, we post-process the data to calculate a Per-Kilo-Instruction (PKI) metric for each counter. For example, we calculate L1D_TLB_REFILL_PKI for every capture of the L1D_TLB_REFILL counter. If the corresponding instructions capture for the example above is 0.300514373,618104386,instructions, it means that the instructions count was 618104386 at roughly 300 milliseconds. To calculate the PKI count for each counter, we use the following formula:

PKI_Count = (Counter/instructions)*1000

The formula gives us the Per-Kilo-Instructions value of a particular counter.

For example, for the 300 milliseconds capture example above, the PKI Count would be calculated as follows:


L1D_TLB_REFILL_PKI = L1D_TLB_REFILL / instructions * 1000
                   = 1778078 / 618104386 * 1000
                   = 2.876662972


Hence the L1D_TLB_REFILL_PKI for this particular capture at 300 milliseconds is equal to 2.876662972.
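The same calculation in code, using the capture values quoted above:

```python
counter_count = 1_778_078    # L1D_TLB_REFILL count at the ~300 ms capture
instructions = 618_104_386   # instructions count at the same capture

# PKI_Count = (Counter / instructions) * 1000
pki = counter_count / instructions * 1000
print(pki)  # ≈ 2.876663
```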

Additionally, an Instructions Per Cycle (IPC) count is calculated for each capture of the counters. The formula for IPC simply divides the number of Instructions by the number of cycles.


IPC = instructions/cycles

It tells you how many instructions were executed per clock cycle of the CPU. We consider it a measure of the performance of the CPU.

Likewise, the PKI calculation shown above is done for all 25 captured counters. As a result, we have a post-processed dataframe with the following columns: time, IPC, and the 25 PKI counters. This is the 2-dimensional dataframe for a benchmark.

The post processing is repeated for all the benchmarks, thus creating 2-dimensional dataframes for each benchmark. These dataframes are concatenated one after another and are indexed by the benchmark name, which is our 3rd dimension. Hence, we stack all the 2D dataframes into a 3D dataframe indexed by the benchmark name.
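A small sketch of the stacking step with toy dataframes (the counter values here are illustrative; the real frames carry time, IPC, and all 25 PKI columns):

```python
import pandas as pd

# Toy post-processed per-benchmark dataframes (illustrative values only)
bench_a = pd.DataFrame({"time": [0.1, 0.2], "IPC": [1.9, 2.0],
                        "L1D_TLB_REFILL_PKI": [2.8, 2.9]})
bench_b = pd.DataFrame({"time": [0.1, 0.2], "IPC": [1.1, 1.2],
                        "L1D_TLB_REFILL_PKI": [5.0, 5.1]})

# Stack the 2D dataframes into one frame indexed by benchmark name
stacked = pd.concat({"500.perlbench_r": bench_a, "502.gcc_r": bench_b},
                    names=["benchmark", None])
print(stacked.loc["502.gcc_r"])
```

Indexing by benchmark name then recovers each original 2D dataframe, which is the "3rd dimension" described above.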


Plotting a comparative IPC plot

When a benchmark runs on a newer generation processor, it is generally expected that it would take less time to run than on the older generation processor. For example, consider a benchmark named ‘500.perlbench_r_checkspam_0’. It takes less time to run on Graviton 3 as compared to Graviton 2.

Below is a plot of the calculated IPC for the entire duration of the ‘500.perlbench_r_checkspam_0’ benchmark.


As seen above, the benchmark runs in less time on G3 than on G2. The IPC is higher as well.


Dynamic Time Warping, Generating Training Data


Dynamic Time Warping

Let us consider an example, taking Graviton 2 (G2) as the older generation and Graviton 3 (G3) as the newer generation used while training the machine learning model. For each benchmark, we calculate a new column called the “IPC Ratio”. This is the ratio of the IPC value of each row of the G3 dataframe to that of the row of the G2 dataframe it is mapped to for that benchmark. We perform this mapping using Dynamic Time Warping.


Why IPC Ratio?


For a given snapshot of the system (X), the IPC ratio of G3 to G2 is the same irrespective of the benchmark being run. This is because the architectures of the systems are designed such that if the “state” of the older system is the same at some point during any two benchmark runs, then the IPC ratios of the new generation to the older generation are equal for both benchmarks. Note that only the ratio is the same, not the raw IPC numbers.

Hence, we choose IPC Ratio as the output (Y) parameter.


Dynamic Time Warping


Since we need to find the ratio of the IPCs at each capture of the snapshot, there should be an intelligent way to map the benchmark run instances. If we observe the comparative IPC plot above, we notice that the benchmark runs in less time on G3 than on G2. This means that more instructions are executed in the same amount of time on G3 than on G2.

We calculate the cumulative sum of the number of instructions executed up to each row and store it in a column named “instructions_cumulative_sum”. The graph of the cumulative instructions for G3 and G2 can be seen below (the numbers on the Y-axis are on the order of 10^12). The example belongs to the benchmark named '502.gcc_r_gcc-pp_3'.
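The cumulative-sum column itself is a one-liner in pandas (the per-interval instruction counts below are illustrative):

```python
import pandas as pd

# Illustrative per-interval instruction counts for three 100 ms captures
df = pd.DataFrame({"instructions": [460190954, 646293636, 618104386]})

# Running total of instructions executed up to each capture
df["instructions_cumulative_sum"] = df["instructions"].cumsum()
print(df["instructions_cumulative_sum"].tolist())
# → [460190954, 1106484590, 1724588976]
```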


We map the captured instances by matching the instructions_cumulative_sum column. This is because taking the IPC ratio makes sense only when we are taking the ratio of rows at nearly the same time during the benchmark run.

Dynamic time warping is a method that calculates an optimal match between two given sequences with certain rules. For example,


  1. The first value of sequence 1 is mapped to the first value of sequence 2

  2. The last value of sequence 1 is mapped to the last value of sequence 2

  3. Intermediate mappings have monotonically increasing indices


Below is an example of the mapping path generated using the Dynamic Time Warping algorithm


We have time on the X-axis and instructions_cumulative_sum on the Y-axis. The line above represents G2 and the line below represents G3. The orange region of the graph consists of many lines, each representing a mapping from one row of G2 to a row of G3 based on the cumulative instruction count.

Below are a few initial actual mappings for the above example using the DTW algorithm

((row_g2, row_g3), cumulative_sum_g2, cumulative_sum_g3)

((0, 0), 460190954.0, 672156965.0), ((1, 0), 1106484590.0, 672156965.0), ((2, 1), 1724588976.0, 1657376501.0), ((3, 2), 2249696930.0, 2344648605.0), ((4, 3), 2780771235.0, 3021707178.0), ((5, 3), 3313773511.0, 3021707178.0), ((6, 4), 3797070083.0, 3687200150.0), ((7, 5), 4311980235.0, 4247959294.0), ((8, 6), 4851216305.0, 4889782076.0), ((9, 7), 5393885229.0, 5538932327.0), ((10, 8), 5943214302.0, 6192166460.0), ((11, 8), 6520858849.0, 6192166460.0), ((12, 9), 7152711138.0, 7087454578.0)...


As seen in the example, the 0th row of G2 is mapped to the 0th row of G3, the 1st row of G2 to the 0th row of G3, the 2nd row of G2 to the 1st row of G3, and so on. This is done by matching the total number of instructions executed, as explained above.
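The first few mappings above can be reproduced with a minimal, from-scratch DTW sketch. This is not the exact production code (which would more likely use a DTW library); the sequences below are the first few cumulative counts from the example, rounded to millions:

```python
def dtw_path(seq_a, seq_b):
    """Classic DTW: returns the optimal monotonic (i, j) mapping path."""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(seq_a[i - 1] - seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # both sequences advance
                                 cost[i - 1][j],      # only sequence 1 advances
                                 cost[i][j - 1])      # only sequence 2 advances
    # Backtrack from the end to recover the mapping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(moves)
    return path[::-1]

# First few cumulative instruction counts from the mapping above, in millions
g2 = [460, 1106, 1725, 2250]
g3 = [672, 1657, 2345]
print(dtw_path(g2, g3))  # → [(0, 0), (1, 0), (2, 1), (3, 2)]
```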


Calculating IPC Ratio


Now since the mapping is done, calculating the IPC ratio is relatively straightforward. We calculate the IPC Ratio as follows:

For row i belonging to G2 dataframe, and row j belonging to the G3 dataframe:

IPC_Ratio = IPC_j/IPC_i

This is done for each (i, j) mapping calculated above.
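As a small sketch with made-up IPC values (the mapping pairs follow the shape of the DTW output above, but the numbers are not from a real run):

```python
# Hypothetical IPC values at mapped rows (not from a real run)
ipc_g2 = [1.20, 1.25, 1.22]          # older generation (G2)
ipc_g3 = [1.55, 1.60, 1.58]          # newer generation (G3)
mapping = [(0, 0), (1, 0), (2, 1)]   # (row_g2, row_g3) pairs from DTW

# IPC_Ratio = IPC_j / IPC_i for each mapped (i, j) pair
ipc_ratio = [ipc_g3[j] / ipc_g2[i] for i, j in mapping]
print([round(r, 3) for r in ipc_ratio])  # → [1.292, 1.24, 1.311]
```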

A few insights about IPC_Ratio for '502.gcc_r_gcc-pp_3':

The IPC_Ratio distribution calculated for the '502.gcc_r_gcc-pp_3' benchmark looks reasonably accurate: the overall improvement is approximately 30%, which means the ratio should be around 1.3.

The comparative IPC plot for G3 vs G2 for this benchmark is shown below

The IPC_Ratio calculated for each row after Dynamic time warping (DTW) is shown below

Final training data


Training data

The final training data has the columns ‘time’, ‘instructions’, ‘instructions_cumulative_sum’, and ‘IPC’ removed. Hence, the training data consists of the 25 counters (the X variables) and the IPC_Ratio (Y).

We use the K Neighbors Regression model for training and inference of IPC Ratio for a given snapshot X.

X: Set of variables that define the state of the system at a particular time during the benchmark.

Y: IPC Ratio - the ratio of the Instructions per cycle (IPC) of G3 to G2 at the time when approximately the same number of instructions were executed for the benchmark on both CPUs.
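A from-scratch sketch of K-nearest-neighbors regression on synthetic data (in practice one would typically use a library implementation such as scikit-learn's KNeighborsRegressor; the data shapes and target function here are made up for illustration):

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=5):
    """Predict the IPC ratio for a snapshot by averaging the targets
    of its k nearest training snapshots (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

rng = np.random.default_rng(0)
X = rng.random((200, 25))       # 200 synthetic snapshots × 25 PKI counters
y = 1.3 + 0.1 * X[:, 0]         # synthetic IPC ratios in [1.3, 1.4]

pred = knn_regress(X, y, X[0], k=5)
print(round(pred, 3))
```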


Part 1 - Summary


In this part, we covered the captured counters and their post processing, as well as the generation of the final training data.

In the next part, we will cover the machine learning algorithm for predicting the IPC Ratio given a snapshot of the system.


