# Introduction

A couple of weeks ago, during a work conversation about SIMD support on modern ARM processors, I bumped once again into the fact that I had never tried ARM's NEON instruction set extension. Today was the day.

To do so, I decided to use JishinMaster's simd_utils, which packs common mathematical functions into a header-only library, allowing from 128- up to 512-bit-wide operations per instruction (depending on whether the system supports SSE (or NEON), AVX, AVX2, or AVX512).

To get some ballpark numbers, I created a simple microbenchmark leveraging the google/benchmark framework that computes the min and max over large vectors, which I made available on GitHub at codeperfio/simd_utils_bench.

The explicit change from scalar instructions to NEON instructions proved to be exceptionally good for performance. For the case presented in this blog post, vectorization improved performance by a factor of up to 7.9x on my Apple M1 test bed. Notice that for simpler mathematical functions (i.e. ones doing a single straightforward calculation) the speedup factor should be capped at around 4x, given we're moving from 32-bit-wide instructions to 128-bit ones, and thus processing 4x more SP FP numbers per instruction.

# Preparing the baseline scalar minmax32f()

As I mentioned earlier, the scalar/vector functions we're comparing are expected to compute the min and max over the values of a large vector.

To do so, simd_utils provides us with an out-of-the-box function minmax128f with the following signature:

```cpp
static inline void minmax128f(float *src, int len, float *min_value, float *max_value)
```
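
Calling it is straightforward. Here's a minimal usage sketch, assuming the simd_utils.h header is on the include path (the library selects its SSE/NEON code paths via compile-time defines):

```cpp
#include <cstdio>
#include <vector>

#include "simd_utils.h" // JishinMaster/simd_utils, header-only

int main() {
  std::vector<float> data = {3.0f, -1.5f, 7.25f, 0.5f};
  float min_value = 0.0f, max_value = 0.0f;
  // Compute the min and max over the whole input in one call.
  minmax128f(data.data(), static_cast<int>(data.size()), &min_value, &max_value);
  std::printf("min=%f max=%f\n", min_value, max_value); // min=-1.500000 max=7.250000
  return 0;
}
```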

An equivalent scalar minmax32f can be coded as follows:

```cpp
#include <cfloat>  // FLT_MAX
#include <cstddef> // size_t

static inline void minmax32f(float *src, const size_t len, float *min_value,
                             float *max_value) {
  float min = FLT_MAX;
  float max = -FLT_MAX; // FLT_MIN is the smallest positive float, not the most negative
  for (size_t i = 0; i < len; i++) {
    if (src[i] < min)
      min = src[i];
    if (src[i] > max)
      max = src[i];
  }
  *max_value = max;
  *min_value = min;
}
```
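
Before benchmarking, it's worth sanity-checking that the scalar version agrees with the vectorized one. A minimal sketch, assuming the minmax32f above and simd_utils' minmax128f are both in scope:

```cpp
#include <cassert>
#include <random>
#include <vector>

int main() {
  std::vector<float> input(10000);
  std::mt19937_64 rng{std::random_device{}()};
  std::uniform_real_distribution<float> dist(0, 1);
  for (float &v : input) v = dist(rng);

  float smin, smax, vmin, vmax;
  minmax32f(input.data(), input.size(), &smin, &smax);
  minmax128f(input.data(), static_cast<int>(input.size()), &vmin, &vmax);

  // Exact float comparison is fine here: both implementations pick
  // elements out of the same array, so no rounding is involved.
  assert(smin == vmin && smax == vmax);
  return 0;
}
```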

# Defining the experimental setup and getting a grasp of on-CPU time

Now that we have an equivalent scalar minmax() function, the only thing that remains to do is to prepare the microbenchmark wrapper used to obtain the results while varying the input vector size. To do so, and as mentioned before, I'm using google/benchmark, which has a comprehensive feature overview in its GitHub repo.

The microbenchmark splits into two stages: the first one, in which we generate the input data, and the second, in which we measure the time taken to compute the min and max over the entire array.

The benchmarking code for the scalar and vectorized implementations is the same, with the only variation being the function name.

Below you can find the microbenchmark definition for the scalar implementation.

To prevent the expressions from being optimized away by the compiler, we force a read/write barrier on the computed min value.

```cpp
#include <benchmark/benchmark.h>

#include <random>
#include <vector>

static void BM_minmax32f(benchmark::State &state) {
  const int64_t stream_size = state.range(0);
  std::vector<float> input;
  input.resize(stream_size, 0);
  std::mt19937_64 rng;
  rng.seed(std::random_device()());
  std::uniform_real_distribution<float> dist(0, 1);

  for (float &i : input) {
    i = dist(rng);
  }

  while (state.KeepRunning()) {
    std::vector<float> v; // float, to hold the computed min without conversion
    v.reserve(1);
    float min, max;
    minmax32f(&input[0], stream_size, &min, &max);
    v.push_back(min);
    benchmark::ClobberMemory(); // Force min to be written to memory.
  }
  state.SetItemsProcessed(state.iterations() * stream_size);
}
```
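
The function above still needs to be registered with the framework. Here's a sketch of the registration that matches the 128-to-32768 power-of-two range used below (the actual repo may organize this differently):

```cpp
// Run each benchmark over input sizes 128, 256, ..., 32768.
BENCHMARK(BM_minmax32f)->RangeMultiplier(2)->Range(128, 32768);
BENCHMARK(BM_minmax128f)->RangeMultiplier(2)->Range(128, 32768);
BENCHMARK_MAIN(); // Expands into a main() that parses flags and runs everything.
```

As a side note, google/benchmark also provides benchmark::DoNotOptimize() as a lighter-weight alternative to the vector-push-back plus ClobberMemory() pattern above.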

The full microbenchmarks can be consulted and directly compiled from the following GH repo.

To clone the benchmark repo, build the microbenchmark, and run it locally do as follows:

```sh
git clone --recursive https://github.com/codeperfio/simd_utils_bench
cd simd_utils_bench
make bench
```

Depending on the machine ISA that you have available (SSE, AVX, AVX2, AVX512, NEON), you will see different minmax<bit operation width>f variants being used. On an Apple M1 you should see both the minmax32f and minmax128f variations, as exemplified below.

Notice that the last number in the benchmark name refers to the total number of elements of the input vector. As you can see, I went from 128 up to 32768 elements, which is enough to surpass the first-level data cache of the M1, which is 65536 bytes (32768 elements * 4 bytes = 131072 bytes, twice the l1dcachesize).
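
If you want to double-check that cache figure on macOS, sysctl exposes it directly:

```sh
# Query the L1 data cache size; on an Apple M1 this should print
# the 65536 bytes mentioned above.
sysctl hw.l1dcachesize
```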

```
build/simd_utils_bench --benchmark_min_time=1
Unable to determine clock rate from sysctl: hw.cpufrequency: No such file or directory
2021-07-18T00:14:01+01:00
Running build/simd_utils_bench
Run on (8 X 24.0825 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x8)
  L1 Instruction 128 KiB (x8)
  L2 Unified 4096 KiB (x8)
Load Average: 4.72, 2.45, 2.27
------------------------------------------------------------------------------
Benchmark                    Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------
BM_minmax32f/128           172 ns          172 ns      7614572 items_per_second=97.6911/s
BM_minmax32f/256           332 ns          332 ns      4214405 items_per_second=182.807/s
BM_minmax32f/512           653 ns          653 ns      2146357 items_per_second=365.45/s
BM_minmax32f/1024         1298 ns         1297 ns      1078740 items_per_second=731.633/s
BM_minmax32f/2048         2578 ns         2577 ns       541456 items_per_second=1.46749k/s
BM_minmax32f/4096         5171 ns         5159 ns       272782 items_per_second=2.91064k/s
BM_minmax32f/8192        10289 ns        10288 ns       136017 items_per_second=5.85402k/s
BM_minmax32f/16384       20557 ns        20556 ns        68047 items_per_second=11.7133k/s
BM_minmax32f/32768       41165 ns        41161 ns        34038 items_per_second=23.3882k/s
BM_minmax128f/128         39.2 ns         39.2 ns     35864698 items_per_second=91.0305/s
BM_minmax128f/256         55.3 ns         55.3 ns     25341843 items_per_second=182.572/s
BM_minmax128f/512         96.8 ns         96.8 ns     14316070 items_per_second=369.38/s
BM_minmax128f/1024         177 ns          177 ns      7914657 items_per_second=730.286/s
BM_minmax128f/2048         338 ns          338 ns      4169993 items_per_second=1.45464k/s
BM_minmax128f/4096         658 ns          657 ns      2116274 items_per_second=2.94373k/s
BM_minmax128f/8192        1303 ns         1302 ns      1076294 items_per_second=5.84427k/s
BM_minmax128f/16384       2590 ns         2590 ns       537946 items_per_second=11.7614k/s
BM_minmax128f/32768       5162 ns         5162 ns       271148 items_per_second=23.4122k/s
```

# Performance results

The explicit change from the scalar minmax32f mathematical function to the NEON-leveraged minmax128f proved to be exceptionally good for performance. As seen in the charts below, vectorization improved performance by a factor of up to 7.9x on my Apple M1 test bed.
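
For intuition on where the speedup comes from, here's a minimal sketch of what a 128-bit NEON minmax kernel can look like on AArch64. This is illustrative only, not simd_utils' exact implementation:

```cpp
#include <arm_neon.h>
#include <cfloat>
#include <cstddef>

static inline void minmax_neon(const float *src, size_t len,
                               float *min_value, float *max_value) {
  float32x4_t vmin = vdupq_n_f32(FLT_MAX);  // 4 running minimums
  float32x4_t vmax = vdupq_n_f32(-FLT_MAX); // 4 running maximums
  size_t i = 0;
  for (; i + 4 <= len; i += 4) {
    float32x4_t v = vld1q_f32(src + i); // load 4 floats (128 bits) at once
    vmin = vminq_f32(vmin, v);          // lane-wise min
    vmax = vmaxq_f32(vmax, v);          // lane-wise max
  }
  // Fold the 4 lanes down to a single value (AArch64-only intrinsics).
  float min = vminvq_f32(vmin);
  float max = vmaxvq_f32(vmax);
  for (; i < len; i++) { // scalar tail when len is not a multiple of 4
    if (src[i] < min) min = src[i];
    if (src[i] > max) max = src[i];
  }
  *min_value = min;
  *max_value = max;
}
```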

# Further notes

As stated initially, this is only one set of benchmarks, used primarily so that I could try ARM's NEON instruction set extension and JishinMaster's simd_utils.

All of the source code is available, and any contribution or comment is gratefully accepted.