When I write a simple application, running for 10 minutes, that starts 10 threads once (pthreads), each sleeping for 1 ms in a loop (not doing anything else) the CPU is used ca. 44% (top
reports that). It is a ARM9 CPU with 450 MHz, Linux 2.6.37 is used as OS. There is no other program running, it tried out different kernel configs (Dynamic Ticks, Soft/Hard IRQ, High Resolution Timer, ..., ..., ...), different priorities (up to 99) but the numbers stay the same. /usr/bin/time -v
shows ca. 5'200'000 voluntary context switches and ca. 3 minutes are spent in Kernel space. Sleepin in each thread for ca. 5 ms and the CPU utilization goes down to ca. 9% which is IMO still crazy (40'500'000 cycles to safe some registers). clock_nanosleep was used for sleeping (CLOCK_REALTIME/CLOCK_MONOTONIC did not change anything).
I'm aware that a full context switch is expensive on ARM9 because caches have to be cleared. But a simple thread switch, or switch to the OS shouldn't be that expensive IMHO (address space remains the same, no cache/TLB flushing required). Is this common or should I try to find the bottleneck in the kernel?
You're busily waking up and going back to sleep at 100uS intervals -- 10 threads, 1ms, that's 100uS on average. And keep in mind that you have two context switches for each of those 100uS intervals, so you have a context switch every 50uS on average, or 20,000 times per second.
Perhaps that's the answer you're looking for?