I want to set the CPU affinity of multiple threads using sched_setaffinity,
as follows.
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

void
pin(pid_t t, int cpu)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpu, &cpuset);
    /* t is a Linux thread id (as returned by gettid()), not a pthread_t */
    if (sched_setaffinity(t, sizeof(cpu_set_t), &cpuset) == -1)
        perror("sched_setaffinity");
}
My machine has 32 cores in total: 4 CPU sockets, each with 8 cores.
I want threads 0 to 7 to run on the same CPU socket, threads 8 to 15 on the same socket, and so on.
I wonder what value to pass as cpu to CPU_SET.
If the core numbers are assigned naively, i.e. cpu0 owns cores 0, 1, 2, ..., 7 and cpu1 owns cores 8, 9, ..., then cpu is simply the thread id.
If, on the other hand, the core numbers are assigned round-robin, i.e. cpu0 owns cores 0, 4, 8, ... and cpu1 owns cores 1, 5, 9, ..., then cpu has to follow the round-robin numbering.
Which rule should I use to set cpu, the naive rule or the round-robin rule?
Under Linux (and other OSes) the programmer may set CPU affinity, i.e. the set of CPUs the kernel is allowed to schedule this process on. Upon fork(), processes inherit the parent's CPU affinity. This comes in very handy if one wants to restrict a process to certain CPUs for whatever reason.
In general, it may be beneficial to limit a process/thread to certain cores or a socket so that the OS does not schedule it away -- maximising the benefits of the L1/L2 cache (when pinning to cores) or the L3/LLC cache (when pinning to sockets).
Regarding your question on "thread distribution": processor development has introduced Simultaneous Multithreading (SMT), called Hyper-Threading by Intel, which exposes 2 logical cores per physical core (e.g. Intel Xeon) or even 4 (e.g. Intel Knights Landing, IBM POWER). These logical cores are likewise represented as "CPUs" in the cpuset above. Moreover, some processors impose NUMA domains, where memory access from a core to its "own" memory is fast, while access to memory in another NUMA domain is slower...
So, as some of the comments above suggest: it depends! If your threads communicate with each other (via shared memory), they should be kept close, within the same cache. If your threads exercise the same functional units (e.g. the FPU), scheduling two of them on the same physical core (i.e. on two logical cores, aka Hyper-Threads) may be detrimental to performance.
To play around, please find enclosed the following code:
Edit: I should clarify that Apple, since OS X 10.5 (Leopard), offers an affinity API as described in https://developer.apple.com/library/mac/releasenotes/Performance/RN-AffinityAPI/