There is an 'erase-remove' idiom in C++ when it comes to removing several elements from containers, and there are discussions about an alternative 'resize-remove' way, e.g., here. People say that 'erase-remove' is better than 'resize-remove', but according to my tests, the latter is (slightly) faster on vectors. So, should I use 'resize-remove' when it comes to vectors?
Here's my benchmarking code:
#include <benchmark/benchmark.h>
#include <algorithm>
#include <functional>
#include <iostream>
#include <random>
#include <vector>
using namespace std;
constexpr size_t N_ELEMS = 1000000;
constexpr int MAX_VAL = N_ELEMS / 10;
constexpr int THRESH = MAX_VAL / 5 * 3;
static vector<int> generate_input() {
vector<int> nums(N_ELEMS);
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_int_distribution<> dist(0, N_ELEMS);
std::generate(nums.begin(), nums.end(), std::bind(dist, std::ref(gen)));
return std::move(nums);
}
static void bm_erase_remove(benchmark::State &state) {
for (auto _ : state) {
state.PauseTiming();
auto nums = generate_input();
state.ResumeTiming();
nums.erase(std::remove_if(nums.begin(), nums.end(),
[](int x) { return x < THRESH; }),
nums.end());
benchmark::DoNotOptimize(nums);
}
}
BENCHMARK(bm_erase_remove);
static void bm_resize_remove(benchmark::State &state) {
for (auto _ : state) {
state.PauseTiming();
auto nums = generate_input();
state.ResumeTiming();
nums.resize(std::distance(
nums.begin(), std::remove_if(nums.begin(), nums.end(),
[](int x) { return x < THRESH; })));
benchmark::DoNotOptimize(nums);
}
}
BENCHMARK(bm_resize_remove);
BENCHMARK_MAIN();
Output:
$ g++ main.cpp -lbenchmark -O3 -pthread
$ ./a.out
2023-05-24T20:07:22+08:00
Running ./a.out
Run on (16 X 3193.91 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 512 KiB (x8)
L3 Unified 16384 KiB (x1)
Load Average: 0.16, 0.14, 0.16
-----------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------
bm_erase_remove 822789 ns 759162 ns 838
bm_resize_remove 818217 ns 754749 ns 935
The difference is larger when using clang++:
$ clang++ main.cpp -lbenchmark -O3 -pthread
$ ./a.out
Load Average: 0.25, 0.18, 0.17
-----------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------
bm_erase_remove 1165085 ns 1074667 ns 611
bm_resize_remove 958856 ns 884584 ns 782
Extra information:
g++'s version is13.1.1,clang++'s version is15.0.7- I'm using Arch Linux on WSL, kernel version is
5.15.90.1-microsoft-standard-WSL2 - CPU Model is
AMD Ryzen 7 6800H with Radeon Graphics
UPDATE: What is interesting is that, when I run the benchmarks individually (using the benchmark_filter option), the results are equivalent. Is this because of caching? If so, how does caching mechanism work?
UPDATE (2023/5/25): If the two BENCHMARK statements get swapped, a totally opposite result is showed.
They are about the same performance, this is because the
std::remove_ifis the only function that modifies the array, then the difference comes form the other functions,std::vector::resizemakes a realloc to fit the new size (std::distancejust returns the size so it's negligible) andstd::vector::erasejust adopts the container making it slightly faster.As pointed by @Peter Cordes "It's actually guaranteed not to reallocate",
std::vector::resizedoes not resize the vector if the new size is smaller, so the difference should be from an extra move(vs, clang) that erase (vs, clang) does and resize(vs, clang) doesn't.To be able to mesure differences
generate_input()needs to return the same vector for all of the tests, in your implementation every call returns a new random vector that makes imposible to tell apart a run to run variance from the vectors changing.With that, @463035818_is_not_a_number makes an interesting point, but for the same reason as before, there is no difference between those functions. That doesn't mean that it's the same case, for a struct the penalty from a bad branch much greater making
std::remove_ifa prime target for optimization, see in the windows figures bellow.All of the figures are from 50 runs each, green is the range between the best run and the average and the red is the range from the average to the worst result, this is to show the variance from run to run.
i5-1155G7 + 3200MHz RAM on manjaro with clang-13 and -O3
1600X + 1067MHz RAM on windows 10 (22H2) with vs2022 and /O2
1600X + 1067MHz RAM on windows 10 (22H2) with clang-16 and -O3 -pthread
1600X + 1067MHz RAM on ubuntu 22 with clang-13 and -O3
With vs, it seems that there is an optimization bug (?), that makes
simple_removeunderperform with u32 and u64.And the code to recreate the plot.