# Introduction
A couple of weeks ago, during a work conversation about SIMD support on modern ARM processors, I bumped once again into the fact that I had never tried ARM's NEON instruction set extension. Today was the day.
To do so, I decided to use JishinMaster's simd_utils, a header-only library that packs common mathematical functions and supports operations from 128 up to 512 bits wide per instruction, depending on whether the system supports SSE (or NEON), AVX, AVX2, or AVX512.
To get some ballpark numbers, I've created a simple microbenchmark leveraging the google/benchmark framework that computes the min and max over large vectors, which I made available on GitHub at codeperfio/simd_utils_bench.
The explicit change from scalar instructions to NEON instructions proved to be exceptionally good for performance. For the case presented in this blog post, vectorization improved performance by a factor of up to 7.9x on my Apple M1 test bed. Notice that for simpler mathematical functions (i.e. doing a single simple calculation per element) the speedup factor should be limited to around 4x, given that we're moving from a 32-bit-wide instruction to a 128-bit one (thus 4x more SP FP numbers per instruction).
# Preparing the baseline scalar minmax32f()
As I mentioned earlier, the expected outcome of the scalar/vector functions we're using is to compute the min and max over the observed values of a large vector.
To do so, simd_utils provides us with an out-of-the-box function minmax128f with the following signature:

```cpp
static inline void minmax128f(float *src, int len, float *min_value, float *max_value)
```
An equivalent scalar minmax32f can be coded as follows:
```cpp
#include <cfloat>
#include <cstddef>

static inline void minmax32f(float *src, const size_t len, float *min_value,
                             float *max_value) {
  float min = FLT_MAX;
  float max = -FLT_MAX; // note: FLT_MIN is the smallest positive float, not the most negative
  for (size_t i = 0; i < len; i++) {
    if (src[i] < min)
      min = src[i];
    if (src[i] > max)
      max = src[i];
  }
  *max_value = max;
  *min_value = min;
}
```
# Defining the experimental setup and getting a grasp of on-CPU time
Now that we have an equivalent scalar minmax() function, the only thing that remains to do is to prepare the microbenchmark wrapper used to obtain the results while varying the input vector size. To do so, and as mentioned before, I'm using google/benchmark, which has a comprehensive feature overview on its GitHub repo.
The microbenchmark splits into two stages: the first, in which we generate the input data, and the second, in which we measure the time taken to compute the min and max over the entire array.
The code for benchmarking minmax over a scalar/vectorized implementation is the same, with the only variation being the function name.
Below you can find the microbenchmark definition for the scalar implementation.
To prevent the expression from being optimized away by the compiler, we force a read/write barrier on the computed min value.
```cpp
#include <benchmark/benchmark.h>

#include <random>
#include <vector>

static void BM_minmax32f(benchmark::State &state) {
  const int64_t stream_size = state.range(0);
  std::vector<float> input;
  input.resize(stream_size, 0);
  std::mt19937_64 rng;
  rng.seed(std::random_device()());
  std::uniform_real_distribution<float> dist(0, 1);
  for (float &i : input) {
    i = dist(rng);
  }
  while (state.KeepRunning()) {
    std::vector<float> v;
    v.reserve(1);
    float min, max;
    minmax32f(&input[0], stream_size, &min, &max);
    v.push_back(min);
    benchmark::ClobberMemory(); // Force min to be written to memory.
  }
  state.SetItemsProcessed(state.iterations() * stream_size);
}
```
The full microbenchmarks can be consulted and directly compiled from the following GH repo.
To clone the benchmark repo, build the microbenchmark, and run it locally, do as follows:

```shell
git clone --recursive https://github.com/codeperfio/simd_utils_bench
make bench
```
Depending on the machine ISA that you have available (SSE, AVX, AVX2, AVX512, NEON), you will see different minmax<bit operation width>f variants being used.
For an Apple M1 you should see both the minmax32f and minmax128f variations, as exemplified below.
Notice that the last number in the benchmark name refers to the total number of elements of the input vector. As you can see, I went from 128 up to 32768 elements, which is enough to surpass the M1's first-level data cache of 65536 bytes (32768 elements * 4 bytes = 131072 bytes, i.e. 2x as large as the l1dcachesize).
```
build/simd_utils_bench --benchmark_min_time=1
Unable to determine clock rate from sysctl: hw.cpufrequency: No such file or directory
2021-07-18T00:14:01+01:00
Running build/simd_utils_bench
Run on (8 X 24.0825 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x8)
  L1 Instruction 128 KiB (x8)
  L2 Unified 4096 KiB (x8)
Load Average: 4.72, 2.45, 2.27
------------------------------------------------------------------------------
Benchmark                  Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------
BM_minmax32f/128         172 ns          172 ns      7614572 items_per_second=97.6911/s
BM_minmax32f/256         332 ns          332 ns      4214405 items_per_second=182.807/s
BM_minmax32f/512         653 ns          653 ns      2146357 items_per_second=365.45/s
BM_minmax32f/1024       1298 ns         1297 ns      1078740 items_per_second=731.633/s
BM_minmax32f/2048       2578 ns         2577 ns       541456 items_per_second=1.46749k/s
BM_minmax32f/4096       5171 ns         5159 ns       272782 items_per_second=2.91064k/s
BM_minmax32f/8192      10289 ns        10288 ns       136017 items_per_second=5.85402k/s
BM_minmax32f/16384     20557 ns        20556 ns        68047 items_per_second=11.7133k/s
BM_minmax32f/32768     41165 ns        41161 ns        34038 items_per_second=23.3882k/s
BM_minmax128f/128       39.2 ns         39.2 ns     35864698 items_per_second=91.0305/s
BM_minmax128f/256       55.3 ns         55.3 ns     25341843 items_per_second=182.572/s
BM_minmax128f/512       96.8 ns         96.8 ns     14316070 items_per_second=369.38/s
BM_minmax128f/1024       177 ns          177 ns      7914657 items_per_second=730.286/s
BM_minmax128f/2048       338 ns          338 ns      4169993 items_per_second=1.45464k/s
BM_minmax128f/4096       658 ns          657 ns      2116274 items_per_second=2.94373k/s
BM_minmax128f/8192      1303 ns         1302 ns      1076294 items_per_second=5.84427k/s
BM_minmax128f/16384     2590 ns         2590 ns       537946 items_per_second=11.7614k/s
BM_minmax128f/32768     5162 ns         5162 ns       271148 items_per_second=23.4122k/s
```
# Performance results
The explicit change from the scalar minmax32f mathematical function to the NEON-leveraged minmax128f proved to be exceptionally good for performance. As seen in the results above, vectorization improved performance by a factor of up to 7.9x on my Apple M1 test bed.
# Further notes
As stated initially, this is only one set of benchmarks, used primarily so that I could try ARM's NEON instruction set extension and JishinMaster's simd_utils.
All of the source code is available, and ANY contribution/comment is gratefully accepted.