My original C++ demo code is shown below:
int counter = 0;
while (counter < 5) {
    auto start = std::chrono::high_resolution_clock::now();
    // instance and result are pre-defined local variables
    instance.Search(40.055948, 116.411325, &result);
    auto end = std::chrono::high_resolution_clock::now();
    int64_t cost_us = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
    std::cout << "cost_us=" << cost_us << std::endl;
    // usleep(100); // case 1: sleep 100 us
    // sleep(1);    // case 2: sleep 1 second
    // case 3: no sleep at all
    counter++;
}
Some people may suspect that my call to instance.Search() brings in something unknown, so please refer to the self-contained code below:
#include <time.h>
#include <unistd.h>
#include <iostream>
#include <set>
#include <chrono>

void test(const std::set<int>& numbers) {
    for (int counter = 0; counter < 5; ++counter) {
        auto start = std::chrono::high_resolution_clock::now();
        auto it = numbers.lower_bound(5555555);
        auto end = std::chrono::high_resolution_clock::now();
        int64_t cost_us = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
        std::cout << "cost_us=" << cost_us << std::endl;
        // usleep(100);
        // sleep(1);
    }
}

int main() {
    std::set<int> test_set;
    for (int i = 0; i < 100000000; i++) {
        test_set.insert(i);
    }
    test(test_set);
}
Precondition: use perf stat to count cache-misses, instructions, and so on.
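The perf invocation was along the following lines (the event list is reconstructed here from the counters reported below, so the exact command may have differed slightly):

perf stat -e cache-references,cache-misses,cycles,instructions,branches,faults,migrations ./latency_perf_test_sleep_100_us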
Case 1: when usleep(100) is added at the end of the loop, the result after the process finishes is:
cost_us=5
cost_us=5
cost_us=5
cost_us=8
cost_us=6
Performance counter stats for './latency_perf_test_sleep_100_us':
1,785,438 cache-references
419,583 cache-misses # 23.500 % of all cache refs
203,832,235 cycles
118,093,490 instructions # 0.58 insn per cycle
23,198,708 branches
35,092 faults
302 migrations
1.031460583 seconds time elapsed
Case 2: when sleep(1) is added instead, the result is:
cost_us=7
cost_us=65
cost_us=21
cost_us=21
cost_us=32
Performance counter stats for './latency_perf_test_sleep_1_sec':
15,093,302 cache-references
1,303,941 cache-misses # 8.639 % of all cache refs
14,759,103,041 cycles
24,548,401,788 instructions # 1.66 insn per cycle
5,062,488,529 branches
35,372 faults
3,444 migrations
6.033182248 seconds time elapsed
Case 3: with no sleep()/usleep() at all, the result is:
cost_us=5
cost_us=2
cost_us=1
cost_us=1
cost_us=1
Performance counter stats for './latency_perf_test_without_sleep':
1,715,128 cache-references
420,368 cache-misses # 24.509 % of all cache refs
209,238,013 cycles
130,647,626 instructions # 0.62 insn per cycle
25,827,456 branches
35,092 faults
362 migrations
1.032256618 seconds time elapsed
As shown above, the time cost of the same function/method call differs greatly between cases. At first, I was inclined to think that sleep() causes cache misses (evicting the data used by my call). However, after I used taskset to bind my process to a specific CPU core, the difference didn't disappear as I expected. I also wonder why adding sleep()/usleep() leads to such a sharp increase in the instruction count reported by perf stat. I haven't read the source code of sleep() or usleep(), but I guess the kernel may do some trick when a process calls sleep() or usleep() (both call nanosleep() internally). Could anyone explain the reason behind this weird phenomenon? Thanks in advance.
Your sleep looks like the glibc source from 2012, sysv/linux/sleep.c. It blocks SIGCHLD before calling nanosleep because of a kernel bug (or it's intended SysV behavior? I'm not exactly sure...); see this commit and the LKML thread it references. Most probably the additional overhead comes from calling the __sig* related functions. To investigate further, profile the code (thinking of gprof), and/or compile your glibc from source with debug information (or just install the debugging info, if you can) and then profile the code.
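To illustrate the point, here is a minimal, simplified sketch of the kind of work such a sleep() wrapper adds around the raw nanosleep() syscall. This is not the actual glibc code, just an approximation of the structure described above (the real implementation also inspects the current SIGCHLD disposition via sigaction()):

#include <signal.h>
#include <time.h>

// Simplified approximation of a sleep() wrapper that blocks SIGCHLD
// around nanosleep(); not the real glibc implementation.
unsigned int sleep_sketch(unsigned int seconds) {
    struct timespec ts = { static_cast<time_t>(seconds), 0 };
    sigset_t block, old;
    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    sigprocmask(SIG_BLOCK, &block, &old);     // extra signal-mask work per call
    int rc = nanosleep(&ts, &ts);             // the actual sleeping syscall
    sigprocmask(SIG_SETMASK, &old, nullptr);  // restore the previous mask
    return rc == 0 ? 0 : static_cast<unsigned int>(ts.tv_sec);
}

A quick way to check whether this wrapper overhead matters in your benchmark is to replace sleep(1)/usleep(100) in the loop with a direct nanosleep() call and compare the perf stat numbers.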