Seeing long delays due to EventMachine quantum – how can I speed things up?


I am developing a real-time application using EventMachine. Two clients, A and B, connect to an EventMachine server over standard TCP, or via WebSocket with em-websocket.

Every time data passes through EventMachine, it incurs a delay of about 95ms. When A talks to the server, there is a 95ms delay. When A talks to B through the server, the delay is 190ms.

If many requests occur in rapid succession, the delay disappears, except for the final request in the sequence. So, if I send 10 rapid requests, I'll get 9 responses after about 5ms each, but the 10th response will take 95ms again.

I've deduced that this has something to do with EventMachine.set_quantum. From the docs:

Method: EventMachine.set_quantum

For advanced users. This function sets the default timer granularity, which by default is slightly smaller than 100 milliseconds. Call this function to set a higher or lower granularity. The function affects the behavior of add_timer and add_periodic_timer. Most applications will not need to call this function.

Avoid setting the quantum to very low values because that may reduce performance under some extreme conditions. We recommend that you not use values lower than 10.

Well, that explains where the 95ms came from. Sure enough, the delays change when I call EventMachine.set_quantum, but I am wary of tweaking this value because of the warning in the documentation.

What is set_quantum actually doing? I can't find any documentation or explanation about what the quantum variable means.

What can I do to reduce these delays? I'd like to understand the potential repercussions of decreasing the quantum to, say, 10ms.

Is EventMachine even the right choice? I'm essentially using it as a glorified TCP connection. Maybe I should just stick to raw sockets for inter-process communication, and find a WebSocket server gem that doesn't use EventMachine.


There is 1 best solution below


EventMachine runs a loop that repeatedly checks:

  1. Whether any timers are due.
  2. Whether any of the file descriptors have pending events (for example, data to read).

The second step uses whatever polling mechanism is available under the hood, e.g. the select(..) call. The quantum is the timeout passed to that call. So the loop looks roughly like this:

  1. Fire any timers that are due.
  2. Wait for activity on any of the file descriptors, for up to quantum milliseconds, and handle whatever is ready.
  3. Unless there's a shutdown request, go back to step 1.
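
To make the three steps concrete, here is a minimal sketch in plain Ruby. It is not EventMachine's actual implementation: run_tiny_reactor, its timers: and readers: arguments, and the :stop return convention are all hypothetical names made up for this illustration, and IO.select stands in for whichever polling call the real reactor uses.

```ruby
# Hypothetical miniature reactor loop; an illustration, not EventMachine's code.
# timers:  array of [seconds_from_now, callable]
# readers: hash of io => callable(io)
# A callback returns :stop to request shutdown.
def run_tiny_reactor(quantum_ms, timers: [], readers: {})
  clock = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }
  started = clock.call
  pending = timers.map { |delay, callback| [started + delay, callback] }
  iterations = 0
  stop = false

  until stop
    iterations += 1

    # Step 1: fire every timer whose deadline has passed.
    due, pending = pending.partition { |fire_at, _| fire_at <= clock.call }
    due.each { |_, callback| stop = true if callback.call == :stop }
    break if stop

    # Step 2: wait for readable descriptors, for at most one quantum.
    # IO.select takes its timeout in seconds and returns nil on timeout.
    ready, = IO.select(readers.keys, nil, nil, quantum_ms / 1000.0)
    (ready || []).each { |io| stop = true if readers[io].call(io) == :stop }

    # Step 3: go around again unless a callback asked to stop.
  end

  iterations
end

# Usage: data is already waiting on the pipe before the loop starts.
reader, writer = IO.pipe
writer.write("ping")
received = nil
loops = run_tiny_reactor(95, readers: {
  reader => ->(io) { received = io.read_nonblock(16); :stop }
})
puts "received #{received.inspect} after #{loops} loop iteration(s)"
```

In this sketch a timer can only fire at the top of an iteration, so when nothing else is happening its resolution is limited by the quantum. That matches the documentation's description of the quantum as the timer granularity.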

Therefore, setting the quantum to a lower value makes the loop iterate more often when it is idle, which eats up some CPU cycles. I don't think that would really be an issue, though.
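
If you want a rough feel for that cost, the sketch below counts how often an idle select-based loop wakes up in half a second at two timeout values. It uses only plain Ruby and a pipe that never receives data. idle_wakeups is a hypothetical helper written for this answer, and the numbers only approximate what a real reactor would do.

```ruby
# Counts how many times an idle select loop wakes up within window_s seconds.
# Hypothetical helper for illustration; not part of EventMachine.
def idle_wakeups(quantum_ms, window_s)
  reader, writer = IO.pipe # never written to, so every select times out
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + window_s
  wakeups = 0
  while Process.clock_gettime(Process::CLOCK_MONOTONIC) < deadline
    IO.select([reader], nil, nil, quantum_ms / 1000.0)
    wakeups += 1
  end
  wakeups
ensure
  reader&.close
  writer&.close
end

coarse = idle_wakeups(95, 0.5)
fine = idle_wakeups(10, 0.5)
puts "quantum 95 ms: #{coarse} wakeups in 0.5 s"
puts "quantum 10 ms: #{fine} wakeups in 0.5 s"
```

Each wakeup here does almost no work, which is why I would not expect a 10ms quantum to hurt much on its own.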

What surprises me is that you have that communication delay at all, since all of those querying mechanisms (select, or epoll, or whatever) return immediately if there's an event (e.g. data) on the file descriptor. That basically means that you shouldn't be incurring those delays at all. And if that delay was by design, then numerous Thin users would've already been pretty upset about it.
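
You can check the "returns immediately" claim without EventMachine at all. The sketch below times IO.select on a pipe twice with a 95ms timeout, once with nothing to read and once with a byte already waiting. timed_select is a hypothetical helper name, and the 95ms figure is taken from your measurements rather than from EventMachine's source.

```ruby
# Returns [ready?, elapsed_ms] for a single IO.select call on one descriptor.
# Hypothetical helper for illustration.
def timed_select(io, timeout_s)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  ready = IO.select([io], nil, nil, timeout_s) # nil means the call timed out
  elapsed_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000.0
  [!ready.nil?, elapsed_ms]
end

reader, writer = IO.pipe
quantum_s = 0.095 # assumed value, matching the delay observed in the question

# Nothing to read yet: select should block for roughly the whole timeout.
idle_ready, idle_ms = timed_select(reader, quantum_s)

# One byte is pending: select should come back right away.
writer.write("x")
busy_ready, busy_ms = timed_select(reader, quantum_s)

puts format("idle:         ready=%s after %.1f ms", idle_ready, idle_ms)
puts format("data pending: ready=%s after %.1f ms", busy_ready, busy_ms)
```

If incoming data were the only thing the loop waited on, a pending byte would never cost a full quantum. That is why I suspect the delay comes from something else in how the data is being sent or scheduled.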

All of this makes me think that something in your code is subtly wrong and is causing this behavior. Unfortunately, I can't say much more than that without seeing the code.

Hope it helps!