Problem: We implement a video recording system on a Windows Server 2012 system. In spite of low CPU and memory consumption, we face serious performance problems.
Short program description: the application (VS2005/C++) creates many network sockets, each receiving a multicast UDP video stream from an Ethernet network. For each stream, the application posts a receive buffer by calling WSARecvFrom() (overlapped operation), waits in MsgWaitForMultipleObjects() for the Windows "data arrived" event, takes the data packet, and repeats this in an endless loop. For testing, to keep CPU and memory consumption minimal apart from the pure socket I/O work, the application does nothing else with the data, not even disk/file I/O. The application process is configured to use all available cores on the machine (default affinity settings unchanged).
Tests run: the test was run on two different machines: a) a Windows 7 machine with 4 physical cores / 8 with hyper-threading, and b) a Windows Server 2012 machine with 12 physical cores / 24 with hyper-threading.
Both systems show the same problem: everything works fine up to a certain number of configured sockets / network streams. Increasing the number further (and we need to) eventually paralyses the Windows desktop (mouse pointer, repainting). At this stage the total CPU load is still very low (e.g. 10-15%) and there is plenty of free memory available. But the Task Manager shows an extremely one-sided CPU load: CPU 0 at nearly 100%, all other CPUs near 0%. Changing the processor affinity for the process in the Task Manager doesn't help.
Question 1: it looks like CPU 0 is doing all of the kernel's network I/O work. Is that likely?
Question 2: if yes, is there a way to control the kernel's use of available CPUs? If yes, how?
Question 3: if no, is there any other way to make Windows distribute the (kernel) network I/O work to other CPUs (e.g. by installing multiple NICs, each receiving only a subset of the network streams, and binding each NIC to a different CPU)?
Most thankful for any hints from anybody out there.
 
                        
I'm not a Windows server guy, but this sounds like an interrupt issue. This often happens in high throughput systems, especially real-time ones.
Background:
Simply speaking, for each packet your network interface generates an interrupt, informing the CPU that it needs to handle the newly arrived data. High throughput network cards (e.g. 10Gbps) that receive small packets can easily overwhelm the CPU with these interrupts.
Just to get a feel for the problem, let's do some math - if you saturate a 10G line with 100-byte packets, that means that (ideally) 12,500,000 packets are sent over the line each second. In reality, it's less due to overhead; say 10,000,000 packets per second (pps). Your 3 GHz CPU generates 3,000,000,000 clock cycles per second, so it needs to handle a packet every 300 clock cycles. That's pretty hard for a general-purpose machine.
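The arithmetic above can be written down as a small sketch. The function names are made up for illustration, and framing overhead is ignored, so the first result is the ideal figure rather than what a real link delivers.

```cpp
#include <cstdint>

// Ideal packets per second on a link of `bits_per_second` when every packet
// occupies `bytes_per_packet` bytes on the wire (framing overhead ignored).
// `bytes_per_packet` must be non-zero.
std::uint64_t packets_per_second(std::uint64_t bits_per_second,
                                 std::uint64_t bytes_per_packet) {
    return bits_per_second / (8 * bytes_per_packet);
}

// Average number of clock cycles a single core can spend on each packet
// if it has to handle `pps` packets per second. `pps` must be non-zero.
std::uint64_t cycles_per_packet(std::uint64_t clock_hz, std::uint64_t pps) {
    return clock_hz / pps;
}
```

With the numbers from the text, packets_per_second(10000000000, 100) gives 12,500,000, and cycles_per_packet(3000000000, 10000000) gives 300.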
Now, I don't know the rate of packet arrival in your case, nor do I know your average packet length. But based on the symptoms you described, you might have run into this issue.
Solutions:
Modern network cards, especially high-throughput ones, support all kinds of useful offloads such as GRO, TOE, and others. These take some network-related work off the CPU (such as checksum calculation, packet fragmentation, etc.) and put it onto the network card, which carries dedicated hardware for performing it. Check out the offloads supported by your card. In Linux, offloading is managed with a tool called ethtool. Since I never played with offloading in Windows, I can only point in the direction of the most relevant Windows article I found, but I can't offer any experience-based advice.
Interrupt throttling is another capability of (some) network cards and their drivers. It limits the number of interrupts your CPU receives, essentially interrupting the core once every few packets instead of once per packet.
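To get a rough feel for what throttling buys you, here is a minimal model that assumes the card raises exactly one interrupt per fixed-size batch of packets. That batching rule and the function name are assumptions for illustration; real cards and drivers may throttle by time interval rather than by packet count.

```cpp
#include <cstdint>

// Interrupts per second if the card raises one interrupt for every
// `packets_per_interrupt` packets (rounded up, since a partial batch
// still needs an interrupt). `packets_per_interrupt` must be non-zero.
std::uint64_t interrupts_per_second(std::uint64_t pps,
                                    std::uint64_t packets_per_interrupt) {
    return (pps + packets_per_interrupt - 1) / packets_per_interrupt;
}
```

Under this model, 10,000,000 pps with 32 packets per interrupt comes down to 312,500 interrupts per second, while a batch size of 1 leaves the rate unchanged.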
Some network cards have multiple (packet) queues, and therefore multiple interrupt lines, one per queue. They split incoming traffic between the queues using a hash function, creating (usually) 8 or 16 flows at roughly 1/8 or 1/16 of the line rate each. Each flow can be tied to a specific CPU core using interrupt affinity, and since the hash function is calculated on IPs and port numbers and is deterministic, each TCP/IP-level session will always be handled by the same core. In Linux, setting the affinity requires writing to
/proc/irq/<interrupt number>/smp_affinity. In Windows, this seems to be the way.
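The queue selection described above can be sketched as follows. This only illustrates the idea of a deterministic hash over addresses and ports: the Flow struct and queue_for_flow are hypothetical names, and FNV-1a is a stand-in for whatever hash function the card actually implements in hardware.

```cpp
#include <cstdint>

// Addresses and ports identifying one stream (hypothetical layout).
struct Flow {
    std::uint32_t src_ip;
    std::uint32_t dst_ip;
    std::uint16_t src_port;
    std::uint16_t dst_port;
};

// Mix one byte into a running 32-bit FNV-1a hash.
static std::uint32_t fnv1a_byte(std::uint32_t h, std::uint8_t b) {
    return (h ^ b) * 16777619u;
}

// Mix a 32-bit value into the hash, least significant byte first.
static std::uint32_t fnv1a_u32(std::uint32_t h, std::uint32_t v) {
    for (int shift = 0; shift < 32; shift += 8)
        h = fnv1a_byte(h, static_cast<std::uint8_t>(v >> shift));
    return h;
}

// Pick a receive queue for a flow. The same flow always maps to the same
// queue, so the same core handles all of its packets.
// `queue_count` must be non-zero.
unsigned queue_for_flow(const Flow& f, unsigned queue_count) {
    std::uint32_t h = 2166136261u;  // FNV-1a offset basis
    h = fnv1a_u32(h, f.src_ip);
    h = fnv1a_u32(h, f.dst_ip);
    h = fnv1a_u32(h, (static_cast<std::uint32_t>(f.src_port) << 16) | f.dst_port);
    return h % queue_count;
}
```

In this model, every packet of one stream lands on the same queue, while streams that differ in any address or port may land on different queues. How evenly a given set of streams spreads out depends on the hash the card really uses.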