# Introduction
As you will see throughout this post, even highly optimized software is very likely to perform “redundant” computation during single or consecutive executions. This is either due to the way we've designed it (algorithms) or to the nature of the values it takes as inputs (dataset). Therefore, to fully understand whether removing redundant computation represents an opportunity for improving the efficiency of our programs, we need to:
- have visibility into where threads are spending CPU cycles while running on-CPU, and whether those cycles are effectively being used for computation.
- understand whether we can reuse that same computation as a means of improving performance -- and this can only be done with a deep understanding of the domain, the problems our program is solving, and how it is being used.
The idea for this post came after I started a POC on my Redis fork (link) to extend the current latency metrics Redis provides. To do so I'm using the C version of HdrHistogram to keep a per-command histogram and calculate both the cumulative latency distribution and pre-compute some "commonly used" percentiles for latency analysis. If you haven't yet come across this type of highly optimized sketching data structure, I suggest you look into one of Gil Tene's many excellent presentations on YouTube.
Now, getting back to our problem and thinking about the way we do latency percentile analysis: we normally require more than one percentile to be calculated for a given histogram at any given moment (for example p50, p95, p99, and p999).
If you look at the C function that provides the percentile calculation implementation, you'll notice that we iterate over the counts array starting from 0 until we've reached a cumulative count of samples that represents the requested percentile.
int64_t hdr_value_at_percentile(const struct hdr_histogram* h, double percentile)
{
    struct hdr_iter iter;
    int64_t total = 0;
    double requested_percentile = percentile < 100.0 ? percentile : 100.0;
    int64_t count_at_percentile =
        (int64_t) (((requested_percentile / 100) * h->total_count) + 0.5);
    count_at_percentile = count_at_percentile > 1 ? count_at_percentile : 1;

    hdr_iter_init(&iter, h);
    while (hdr_iter_next(&iter))
    {
        total += iter.count;
        if (total >= count_at_percentile)
        {
            int64_t value_from_index = iter.value;
            return highest_equivalent_value(h, value_from_index);
        }
    }
    return 0;
}
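To make the redundancy concrete, here is a minimal usage sketch (not from the HdrHistogram codebase; it assumes an already-populated `struct hdr_histogram *h` plus the usual `<stdio.h>`, `<inttypes.h>` and `hdr_histogram.h` includes). Each call below restarts the cumulative scan from index 0:

```c
/* sketch: computing several percentiles with the existing API,
 * where every call re-walks the counts array from the beginning */
const double percentiles[4] = {50.0, 95.0, 99.0, 99.9};
for (size_t i = 0; i < 4; i++)
{
    int64_t value = hdr_value_at_percentile(h, percentiles[i]);
    printf("p%g -> %" PRId64 "\n", percentiles[i], value);
}
```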
# Defining the experimental setup and getting a grasp of on-CPU time
To realize computation reuse, it is necessary to capture the run-time behavior of a program. Given the above function definition, we should start by preparing an example that allows us to profile code execution and determine which functions are consuming the most time and are thus targets for optimization.
To do so, we'll use Google's microbenchmarking framework (google/benchmark), which the C version of HdrHistogram already incorporates, to simulate a dataset representative of real-life scenarios (a deterministic, randomly generated gamma distribution to mimic latencies) and calculate 4 very common percentiles used in any latency analysis (p50, p95, p99, p999).
The microbenchmark below can be consulted in, and compiled directly from, the following GH repo.
static void BM_hdr_value_at_percentile_given_array(benchmark::State &state) {
  /////////////////////
  // benchmark setup //
  /////////////////////
  srand(12345);
  const int64_t precision = state.range(0);
  const int64_t max_value = state.range(1);
  const double percentile_list[4] = {50.0, 95.0, 99.0, 99.9};
  std::default_random_engine generator;
  // gamma distribution with shape 1 and scale 100000 to mimic latencies
  std::gamma_distribution<double> latency_gamma_dist(1.0, 100000);
  struct hdr_histogram *histogram;
  // min_value and generated_datapoints are constants defined in the benchmark file
  hdr_init(min_value, max_value, precision, &histogram);
  for (int64_t i = 1; i < generated_datapoints; i++) {
    int64_t number = int64_t(latency_gamma_dist(generator)) + 1;
    number = number > max_value ? max_value : number;
    hdr_record_value(histogram, number);
  }
  benchmark::DoNotOptimize(histogram->counts);
  int64_t items_processed = 0;
  ///////////////////
  // benchmark run //
  ///////////////////
  for (auto _ : state) {
    for (auto percentile : percentile_list) {
      benchmark::DoNotOptimize(hdr_value_at_percentile(histogram, percentile));
      // read/write barrier
      benchmark::ClobberMemory();
    }
    items_processed += 4;
  }
  state.SetItemsProcessed(items_processed);
}
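For reference, a benchmark like this is registered with google/benchmark roughly as follows; the `Args` pairs below are an assumption chosen to match the `/3/86400000000`-style names in the output further down (precision first, then the maximum trackable value), and the actual repo may wire the arguments differently:

```cpp
// hypothetical registration sketch for the benchmark above
BENCHMARK(BM_hdr_value_at_percentile_given_array)
    ->Args({3, 86400000})
    ->Args({3, 86400000000})
    ->Args({4, 86400000})
    ->Args({4, 86400000000});
BENCHMARK_MAIN();
```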
To clone the HdrHistogram repo and build the microbenchmark locally, do as follows:
git clone https://github.com/RedisBloom/HdrHistogram_c
cd HdrHistogram_c
git checkout value_at_percentiles
mkdir build && cd build
CC=gcc-10 CXX=g++-10 cmake -DHDR_HISTOGRAM_BUILD_BENCHMARK=ON -DCMAKE_BUILD_TYPE=RelWithDebInfo ..
make
To run it:
~/HdrHistogram_c/build$ ./test/hdr_histogram_benchmark --benchmark_filter=BM_hdr_value_at_percentile_given_array/3/86400000000 --benchmark_min_time=60
2021-02-07 14:54:39
Running ./test/hdr_histogram_benchmark
Run on (40 X 1240.4 MHz CPU s)
CPU Caches:
L1 Data 32K (x20)
L1 Instruction 32K (x20)
L2 Unified 1024K (x20)
L3 Unified 28160K (x1)
Load Average: 0.02, 0.16, 0.20
---------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------------------------------------------
BM_hdr_value_at_percentile_given_array/3/86400000000 344046 ns 344042 ns 244556 items_per_second=11.6265k/s
# Where are those CPU cycles being spent
As seen above, it takes us 344 microseconds of CPU time to compute the 4 percentiles, leading to a maximum of approximately 11.6K percentile calculations per second. It is now time to understand where those CPU cycles are being spent within the function defined above. To do so, we'll collect, report, and visualize hotspots using perf. There are several other tools that can achieve the same outcome, such as the bcc/BPF tracing tools or the Intel VTune Profiler, among others.
The following steps rely upon Linux perf_events (aka "perf"). I'll assume you already have the perf tool installed on your system; most Linux distributions package it alongside the kernel. More information about the perf tool can be found on the perf wiki.
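On Ubuntu (the platform used here; see the platform details at the end of the post), installing perf typically looks something like the following, though package names vary across distributions and kernel versions:

```sh
# on Ubuntu, perf ships with the linux-tools packages; package names are distro-specific
sudo apt-get install linux-tools-common linux-tools-generic linux-tools-$(uname -r)
```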
# Profiling build prerequisites
For a proper on-CPU analysis of our example, stack traces need to be available to tracers.
By default our example is compiled with the -O2 switch (which we intend to keep during profiling), meaning compiler optimizations are enabled.
Many compilers omit the frame pointer as a runtime optimization (it saves a register), thus breaking frame-pointer-based stack walking. This makes the executable faster, but it also makes it (like any other program) harder to trace, potentially wrongly attributing on-CPU time to the last available frame pointer of a call stack that is actually much deeper (but impossible to walk).
It's important that you ensure:
- debug information is present: compile with -g
- the frame pointer register is present: -fno-omit-frame-pointer
- optimizations are still enabled, so we get an accurate representation of production run times: keep -O2
To do so, within the project's build folder, configure and build the project as follows:
CXXFLAGS="-g -fno-omit-frame-pointer -O2" CFLAGS="-g -fno-omit-frame-pointer -O2" CC=gcc-10 CXX=g++-10 cmake -DHDR_HISTOGRAM_BUILD_BENCHMARK=ON -DCMAKE_BUILD_TYPE=ReleaseWithDebugInfo ..
# Sampling stack traces using perf
Now that we have our executable properly built for profiling, run the following command to profile both user- and kernel-level stacks for a duration of 60 seconds, at a sampling frequency of 999 samples per second:
~/HdrHistogram_c/build$ sudo perf record -g -F 999 ./test/hdr_histogram_benchmark --benchmark_filter=BM_hdr_value_at_percentile_given_array/3/86400000000 --benchmark_min_time=60
2021-02-07 14:54:39
Running ./test/hdr_histogram_benchmark
Run on (40 X 1240.4 MHz CPU s)
CPU Caches:
L1 Data 32K (x20)
L1 Instruction 32K (x20)
L2 Unified 1024K (x20)
L3 Unified 28160K (x1)
Load Average: 0.02, 0.16, 0.20
***WARNING*** Library was built as DEBUG. Timings may be affected.
---------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------------------------------------------
BM_hdr_value_at_percentile_given_array/3/86400000000 344046 ns 344042 ns 244556 items_per_second=11.6265k/s
[ perf record: Woken up 91 times to write data ]
[ perf record: Captured and wrote 22.751 MB perf.data (126180 samples) ]
# Displaying the recorded profile information using perf report
By default perf record will generate a perf.data file in the current working directory.
You can then generate a report with call-graph output (call chain, stack backtrace) and a minimum call-graph inclusion threshold of 0.5%, with:
perf report -g "graph,0.5,caller"
Note: See the perf report documentation for advanced filtering, sorting and aggregation capabilities.
If we filter by the hdr_value_at_percentile symbol and annotate it, we'll notice that 99% of the CPU cycles of hdr_value_at_percentile are spent on the iterator calculating the cumulative count up until we reach or surpass the count that represents the requested percentile.
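Assuming perf.data sits in the current directory, that filtering and annotation can be done, for example, with the following commands (one of several possible ways to do it; the symbol name is the one from our function above):

```sh
# restrict the report to the symbol we care about
perf report -g "graph,0.5,caller" --symbols=hdr_value_at_percentile
# per-instruction cycle attribution for that symbol
perf annotate hdr_value_at_percentile
```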
This means that if we want to compute p50 and p99, we could reuse the already-computed cumulative count that gives us the p50 and continue from that value (instead of starting again at 0) when calculating the p99, thus eliminating the redundant computation.
# Adding the value_at_percentiles API
As seen above, when we want to compute multiple percentiles for a given histogram, we've identified a reusable partial result in hdr_value_at_percentile, which is also where threads are spending most of their CPU cycles while running on-CPU.
With that in mind, by adding a new API with the following signature and implementation, and assuming the user follows the API prerequisites (a sorted percentiles array as input), we should be able to greatly reduce redundant work and improve the overall function performance.
# New API function signature
/**
 * Get the values at the given percentiles.
 *
 * @param h "This" pointer.
 * @param percentiles The ordered percentiles array to get the values for.
 * @param N Number of elements in the arrays.
 * @param values Destination array containing the values at the given percentiles.
 * @return 0 on success, ENOMEM if malloc failed.
 */
int hdr_value_at_percentiles(const struct hdr_histogram* h, const double* percentiles, const size_t N, int64_t** values);
# New API function definition
int hdr_value_at_percentiles(const struct hdr_histogram* h, const double* percentiles, const size_t N, int64_t** values)
{
    *values = (int64_t*) calloc(N, sizeof(int64_t));
    if (!*values)
    {
        return ENOMEM;
    }
    int64_t* count_at_percentiles = (int64_t*) calloc(N, sizeof(int64_t));
    if (!count_at_percentiles)
    {
        free(*values);
        return ENOMEM;
    }

    struct hdr_iter iter;
    const int64_t total_count = h->total_count;
    /* pre-compute the cumulative count threshold for each requested percentile */
    for (size_t i = 0; i < N; i++)
    {
        const double requested_percentile = percentiles[i] < 100.0 ? percentiles[i] : 100.0;
        int64_t count_at_percentile =
            (int64_t) (((requested_percentile / 100) * total_count) + 0.5);
        count_at_percentiles[i] = count_at_percentile > 1 ? count_at_percentile : 1;
    }

    hdr_iter_init(&iter, h);
    int64_t total = 0;
    size_t at_pos = 0;
    /* single pass over the counts array: the cumulative total is reused
     * across all requested (ascending) percentiles */
    while (hdr_iter_next(&iter) && at_pos < N)
    {
        total += iter.count;
        while (at_pos < N && total >= count_at_percentiles[at_pos])
        {
            (*values)[at_pos] = highest_equivalent_value(h, iter.value);
            at_pos++;
        }
    }

    free(count_at_percentiles);
    return 0;
}
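Here's a minimal usage sketch of the proposed API (an illustration only, assuming an already-populated `struct hdr_histogram *h` and the usual `<stdio.h>`, `<stdlib.h>` and `<inttypes.h>` includes). Note that the percentiles array must be sorted in ascending order:

```c
const double percentiles[4] = {50.0, 95.0, 99.0, 99.9}; /* must be ascending */
int64_t *values = NULL;
if (hdr_value_at_percentiles(h, percentiles, 4, &values) == 0)
{
    for (size_t i = 0; i < 4; i++)
    {
        printf("p%g -> %" PRId64 "\n", percentiles[i], values[i]);
    }
    free(values); /* caller owns the allocated values array */
}
```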
# Benchmarking the new API
Now that we've defined the new API, using the same setup code as the microbenchmark above, we can deterministically and fairly compare the optimization provided by eliminating the redundant computation.
New microbenchmark (same setup as before, but computing all four percentiles with a single hdr_value_at_percentiles call per iteration):
static void BM_hdr_value_at_percentiles_given_array(benchmark::State &state) {
  /////////////////////
  // benchmark setup //
  /////////////////////
  // identical setup to BM_hdr_value_at_percentile_given_array above
  srand(12345);
  const int64_t precision = state.range(0);
  const int64_t max_value = state.range(1);
  const double percentile_list[4] = {50.0, 95.0, 99.0, 99.9};
  std::default_random_engine generator;
  // gamma distribution with shape 1 and scale 100000 to mimic latencies
  std::gamma_distribution<double> latency_gamma_dist(1.0, 100000);
  struct hdr_histogram *histogram;
  hdr_init(min_value, max_value, precision, &histogram);
  for (int64_t i = 1; i < generated_datapoints; i++) {
    int64_t number = int64_t(latency_gamma_dist(generator)) + 1;
    number = number > max_value ? max_value : number;
    hdr_record_value(histogram, number);
  }
  benchmark::DoNotOptimize(histogram->counts);
  int64_t items_processed = 0;
  ///////////////////
  // benchmark run //
  ///////////////////
  for (auto _ : state) {
    int64_t *values = nullptr;
    // a single pass over the counts array yields all 4 percentiles
    benchmark::DoNotOptimize(
        hdr_value_at_percentiles(histogram, percentile_list, 4, &values));
    // read/write barrier
    benchmark::ClobberMemory();
    free(values);
    items_processed += 4;
  }
  state.SetItemsProcessed(items_processed);
}
~/HdrHistogram_c/build$ ./test/hdr_histogram_benchmark --benchmark_filter=BM_hdr_value_at_percentile --benchmark_min_time=60
2021-02-07 16:55:23
Running ./test/hdr_histogram_benchmark
Run on (40 X 1227.32 MHz CPU s)
CPU Caches:
L1 Data 32K (x20)
L1 Instruction 32K (x20)
L2 Unified 1024K (x20)
L3 Unified 28160K (x1)
Load Average: 0.83, 0.37, 0.14
---------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------------------------------------------
BM_hdr_value_at_percentile_given_array/3/86400000 342337 ns 342335 ns 245263 items_per_second=11.6845k/s
BM_hdr_value_at_percentile_given_array/3/86400000000 343298 ns 343295 ns 245245 items_per_second=11.6518k/s
BM_hdr_value_at_percentile_given_array/4/86400000 3064091 ns 3064067 ns 27417 items_per_second=1.30545k/s
BM_hdr_value_at_percentile_given_array/4/86400000000 3062938 ns 3062914 ns 27379 items_per_second=1.30595k/s
BM_hdr_value_at_percentiles_given_array/3/86400000 107060 ns 107059 ns 786567 items_per_second=37.3625k/s
BM_hdr_value_at_percentiles_given_array/3/86400000000 107040 ns 107039 ns 783051 items_per_second=37.3694k/s
BM_hdr_value_at_percentiles_given_array/4/86400000 1043602 ns 1043592 ns 80462 items_per_second=3.83292k/s
BM_hdr_value_at_percentiles_given_array/4/86400000000 1043322 ns 1043312 ns 80458 items_per_second=3.83394k/s
As expected, by simply reusing intermediate computation and consequently reducing the redundant computation, we've moved from taking 342 microseconds of CPU time to compute the 4 percentiles down to 107 microseconds, and from a maximum of approximately 11.6K percentile calculations per second to 37.36K calculations per second, representing a roughly 3.2X speedup (dependent on both the number of percentiles to calculate and on the percentile and data distribution).
The chart below visualizes the above optimization.
# Further notes
At the time this blog post was written, the optimization PRs I've opened on both the C and Go versions of HdrHistogram were still being discussed.
Contributions and corrections to the above PRs are gratefully accepted.
# Used platform details
The performance results provided in this blog were collected using:
- Hardware platform: a physical HPE ProLiant DL380 Gen10 Server, with one Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz and a maximum memory capacity of 3 TB, with CPU frequency scaling disabled and all configurable BIOS and CPU system settings set to performance.
- OS: Ubuntu 18.04, Linux kernel 4.15.0-123.
- Installed Memory: 64GB DDR4-SDRAM @ 2933 MHz
- Compiler: gcc-10 (Ubuntu 10.1.0-2ubuntu1~18.04) 10.1.0