The Memory Mountain is a graphical representation of a system's read bandwidth (the rate at which a program reads data, in MB/s) as a function of the total size of the data processed and of its access pattern. It provides great insight into the way the CPU cache helps bridge the gap between the high-speed processor and the relatively slow access times of main memory. It was introduced by Stricker and Gross, and I first read about it in Computer Systems: A Programmer’s Perspective (CS:APP).
In this post, we’ll generate a memory mountain with the diagnostic program provided by CS:APP and plot it using Python’s matplotlib. Then we’ll analyze it to gain some insight about the memory hierarchy. In particular, we’ll estimate the CPU cache sizes.
Measuring the memory read bandwidth
As mentioned, the program used to measure the read bandwidth is the one provided by CS:APP. You can get it from their website (file `mountain.tar` in the Chapter 6 section).
Data
The program allocates 128 MB of memory for an array of `long` elements (a `long` is usually 8 bytes).
```c
#define MAXELEMS (1 << 27) / sizeof(long)

long data[MAXELEMS];
```
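Note that `(1 << 27)` bytes is 128 MB; with 8-byte longs, `MAXELEMS` works out to 2^27 / 2^3 = 2^24 = 16,777,216 elements.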
It then initializes the array, setting each element as `data[i] = i`.
```c
for (int i = 0; i < n; i++)
    data[i] = i;
```
Testing memory access
The operation that is actually measured is the `test` function. It iterates over some of the array elements and computes their sum. To take advantage of the parallelism of modern processors, the sum is split across 4 accumulators that run simultaneously.
```c
int test(int elems, int stride)
{
    ...
    for (long i = 0; i < elems; i += sx4) {
        acc0 = acc0 + data[i];
        acc1 = acc1 + data[i+stride];
        acc2 = acc2 + data[i+sx2];
        acc3 = acc3 + data[i+sx3];
    }
    ...
    return ((acc0 + acc1) + (acc2 + acc3));
}
```
A couple of parameters come into play here: the number of elements, `elems`, and the `stride`.
The first one, `elems`, is derived from the portion of the array being processed, expressed in bytes by the variable `size`.
```c
int elems = size / sizeof(long);
```
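For example, with `size` equal to 32 KB and 8-byte longs, `elems` is 32768 / 8 = 4096.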
`size` takes values corresponding to 16 KB, 32 KB, 64 KB, 128 KB, 256 KB, 512 KB, 1 MB, 2 MB, 4 MB, and so on up to 128 MB.
The `stride` defines the access pattern. The simplest one is `stride = 1`, which corresponds to sequential access: we access the array elements in the order i, i + 1, i + 2, etc. In general, a stride-k access pattern accesses the array elements in the order i, i + k, i + 2k, etc.
`stride` takes values between 1 and 15.
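To make the pattern concrete, here is a minimal stride-k traversal in isolation (a sketch of my own, separate from the book's `test` function):

```c
#include <stdio.h>

#define N 16

int main(void)
{
    long data[N], sum = 0;
    int stride = 3; /* a stride-3 access pattern */

    for (long i = 0; i < N; i++)
        data[i] = i;

    /* Touches data[0], data[3], data[6], ... */
    for (long i = 0; i < N; i += stride)
        sum += data[i];

    printf("sum = %ld\n", sum);
    return 0;
}
```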
The other variables in the `test` snippet, `sx2`, `sx3`, and `sx4`, are just multiples of `stride` (2, 3, and 4 times its value) used to run the four sum accumulators (`acc0` to `acc3`) in parallel.
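Putting the pieces together, here is a self-contained sketch of the 4-accumulator strided sum. The scaffolding around the loop is mine; the book's `test` elides these details:

```c
#include <stdio.h>

#define N (1 << 20)

long data[N];

/* Sum every stride-th element using 4 independent accumulators,
 * so consecutive additions don't depend on each other. */
long strided_sum(long *data, int elems, int stride)
{
    long acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    int sx2 = stride * 2, sx3 = stride * 3, sx4 = stride * 4;
    long i;

    /* Main loop: 4 accumulators per iteration */
    for (i = 0; i < elems - sx4; i += sx4) {
        acc0 += data[i];
        acc1 += data[i + stride];
        acc2 += data[i + sx2];
        acc3 += data[i + sx3];
    }
    /* Finish any leftover elements with a single accumulator */
    for (; i < elems; i += stride)
        acc0 += data[i];

    return (acc0 + acc1) + (acc2 + acc3);
}

int main(void)
{
    for (long i = 0; i < N; i++)
        data[i] = i;
    printf("%ld\n", strided_sum(data, N, 1));
    return 0;
}
```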
Measuring performance
The `fcyc2` function measures the number of CPU cycles a call to `test` takes. Combined with the processor's clock rate (obtained from `/proc/cpuinfo` and passed in as `Mhz`), this gives the read bandwidth:
```c
double run(int size, int stride, double Mhz)
{
    ...
    cycles = fcyc2(test, elems, stride, 0);
    return (size / stride) / (cycles / Mhz); // MB/s
}
```
`size / stride` is the number of bytes actually read (with stride k, only every k-th element is touched). `cycles / Mhz` is a time in microseconds (cycles divided by cycles-per-microsecond). Their ratio is therefore bytes per microsecond, which is numerically the read bandwidth in MB/s.
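As a sanity check on the units, take the 128 KB / stride-1 cell from the table in the next section: roughly 15354 MB/s at a 2030.5 MHz clock. Inverting the formula, `test` must have taken about 131072 × 2030.5 / 15354 ≈ 17,300 cycles to sweep that working set.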
The read bandwidth is calculated for every combination of `size` and `stride`.
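As an aside, if you want to reproduce the measurement without the book's `fcyc2` helper, a rough wall-clock substitute can be sketched with POSIX `clock_gettime`. This is my own simplification; it lacks `fcyc2`'s warm-up runs and best-of-k logic:

```c
#include <time.h>

/* Returns the elapsed nanoseconds for one call to test().
 * A crude stand-in for fcyc2: no warm-up, no repetitions. */
double time_test_ns(int (*test)(int, int), int elems, int stride)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    test(elems, stride);
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (end.tv_sec - start.tv_sec) * 1e9
         + (end.tv_nsec - start.tv_nsec);
}
```

Bandwidth would then be bytes read divided by this time, instead of the cycles-based formula above.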
Running the memory mountain program
Executing `mountain` yields the following table.
```
$ ./mountain
Clock frequency is approx. 2030.5 MHz
Memory mountain (MB/sec)
        s1      s2      s3      s4      s5      s6      s7      s8      s9      s10     s11     s12     s13     s14     s15
128m    5412    2849    1984    1523    1243    1039    897     780     702     637     595     545     511     487     461
64m     5509    2892    2007    1525    1254    1056    900     769     709     643     593     542     510     492     470
32m     5538    2856    2001    1575    1262    1059    901     791     708     638     594     548     517     492     465
16m     5483    2897    1998    1574    1276    1037    911     793     712     646     595     549     519     492     474
8m      6704    3575    2486    1934    1599    1331    1147    997     910     847     786     741     720     707     715
4m      12623   8458    6679    5319    4406    3732    3228    2842    2839    2871    2902    2962    3042    3080    3164
2m      14230   10136   8697    6890    5616    4701    4052    3547    3544    3522    3521    3524    3508    3506    3512
1024k   14360   10248   8939    7052    5699    4761    4113    3586    3584    3552    3541    3534    3531    3518    3512
512k    14347   10435   8950    7023    5695    4785    4130    3585    3583    3590    3637    3747    3750    3919    4104
256k    14948   12462   11642   9075    7676    6656    5874    5060    6136    6396    6434    6516    7013    7092    7491
128k    15354   13997   13621   12659   11529   10204   9065    8300    8888    8732    8804    8606    8394    8728    8723
64k     15499   14046   13454   12573   11506   10031   8912    8704    8646    8530    8524    8439    8509    8657    11201
32k     16344   16142   15932   15619   15401   15274   15842   13016   14408   15187   15270   14214   12824   14942   14078
16k     16626   15933   16001   14825   14782   15659   14530   11848   12832   14982   13089   11131   12729   10420   13198
```
The final step is to plot the results.
Graphing the memory mountain
We’ll graph the `./mountain` results we got above with the following script (`graph.py`).
```python
#!/usr/bin/env python3
import sys

import numpy as np
import matplotlib.pyplot as plt


def main():
    x = np.arange(1, 16)  # stride
    y = np.arange(1, 15)  # size ticks
    x, y = np.meshgrid(x, y)

    xticks = ["s" + str(stride) for stride in range(1, 16)]
    yticks = "128m 64m 32m 16m 8m 4m 2m 1024k 512k 256k 128k 64k 32k 16k".split()

    z0 = sys.argv[1]
    z1 = z0.splitlines()[3:]  # skip the two header lines and the stride labels
    # Split by tab; discard the size label and the trailing empty field
    z2 = np.array([line.split("\t")[1:-1] for line in z1])
    z2 = z2.astype(int)  # np.int was removed in recent NumPy versions

    fig = plt.figure()
    ax = plt.axes(projection="3d")
    ax.set_xticks(range(1, 16, 2))
    ax.set_yticks(range(1, 15, 2))

    # Take only every other label
    plt.rcParams.update({"font.size": 16})
    ax.set_xticklabels(xticks[::2])
    ax.set_yticklabels(yticks[::2])
    ax.tick_params(axis="z", pad=10)

    ax.set_xlabel("Stride", fontsize=18, labelpad=20)
    ax.set_ylabel("Size", fontsize=18, labelpad=20)
    ax.set_title("Memory Mountain\n", fontsize=22)

    ax.plot_surface(x, y, z2, cmap="jet")
    plt.tight_layout()
    plt.show()


if __name__ == "__main__":
    main()
```
We can execute it as follows:
```
$ ./graph.py "$(./mountain)"
```
"$(./mountain)"
will evaluate it and pass the results as an argument to graph.py
.
Analyzing the memory mountain
We’ll get a window with the following graph (the black dashed lines were added by me).
There are many interesting results to note. The black dashed lines mark the size limits at which the bandwidth drops sharply for any given stride. For example, as we increase the size from 32 KB (top dashed line) to 64 KB, the bandwidth decreases dramatically. These limits are indicators of the CPU cache sizes.
The top dashed line corresponds to the L1 cache. It’s the fastest one (hence the highest read bandwidths), but also very limited in size. Based on the graph, we estimate it to be 32 KB. Next we have the L2 cache: a bit slower than L1, but larger. We estimate it to be 128 KB. Finally, we have the L3 cache; the graph suggests a size of 4 MB.
A side-view of the plot helps visualize the cache sizes.
The blue zone with the lowest read bandwidth corresponds to main memory (RAM). For sizes larger than 8 MB (given my system’s hardware), the CPU cache can no longer hold all of the data, so the CPU must constantly fetch it from main memory. This is relatively slow and dramatically decreases the read bandwidth.
Spatial locality
A second important result is the decrease in bandwidth as the stride increases, for any given data size. The largest bandwidth, for any size, is obtained through sequential access (stride-1). A more detailed analysis of the effect of spatial locality can be found in my previous post about array reference performance.
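As a quick illustration (my example, not from that post): in C, summing a matrix row by row is a stride-1 pattern, while summing it column by column is effectively a stride-N pattern, so the first loop nest below should run noticeably faster on large matrices.

```c
#include <stdio.h>

#define N 1024

static double a[N][N];

int main(void)
{
    double sum = 0.0;

    /* Row-major traversal: stride-1, good spatial locality */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Column-major traversal: stride-N, poor spatial locality */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("%f\n", sum);
    return 0;
}
```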
Estimated vs. actual CPU cache sizes
My processor is an Intel Core i5. Its CPU cache specifications match the memory mountain results quite well.
| CPU Cache | Size   |
|-----------|--------|
| L1        | 32 KB  |
| L2        | 256 KB |
| L3        | 6 MB   |
The largest divergence is in the L2 cache size. Note that for a size of 256 KB, the read bandwidths sit midway through the sharp drop between 128 KB and 512 KB. This could be because the L2 cache has to hold other data besides our array values, so it is effectively used beyond its capacity when tested with a 256 KB working set.
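For comparison, on Linux the actual cache sizes can be queried with `getconf` (assuming a glibc-based system):

```
$ getconf LEVEL1_DCACHE_SIZE   # L1 data cache, in bytes
$ getconf LEVEL2_CACHE_SIZE
$ getconf LEVEL3_CACHE_SIZE
```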
In conclusion
It’s quite amazing to see the effect of low-level hardware on program performance. The most important takeaway, however, is what we can learn to write better programs: first, process data sequentially as much as possible to exploit spatial locality; and second, keep the working set processed at any given time small to exploit temporal locality.