I'm currently working on a Windows Azure application using WebAPI and SignalR for communication. Both services are hosted via OWIN on a Worker role with multiple instances.
Current solution
Currently we start one OWIN host with WebAPI on port 443 on every machine, and one SignalR OWIN host on the instance input endpoint port (e.g. 10106-1010x) on every machine.
Everything works fine, but some of our customers are behind a firewall that blocks all ports except 80/443, so WebSocket communication is not possible there (WebAPI works fine).
New solution
We now start one OWIN host with both WebAPI and SignalR on every instance. Both HTTP and WebSocket traffic is routed through the load balancer over port 443, so there are no more instance input endpoints (and no more firewall problems).
The problem
The problem now is that the WebSocket connection can sometimes be established and sometimes not, regardless of the browser. If the connection can't be established, the following error appears in the console:
Error during WebSocket handshake: Unexpected response code: 400
No transport could be initialized successfully. Try specifying a different transport or none at all for auto initialization.
I've already added the role instance id to the websocket response messages from the server, but couldn't find any (ir)regularities (e.g. a single instance doesn't respond, ...). All SignalR servers seem to be up and running, but sometimes the connection can't be established.
You can test it yourself by going to the following link. If you don't get an error dialog ("Connection to server lost") it is working, otherwise try to refresh the page several times.
I'm not looking for a scaleout feature for SignalR (as described here or here). The client just connects to one (random) server (worker role instance) and communicates with that server until a close message is sent. If the client connects again, it can be routed to any other server. There is also no communication between the servers.
Update/Solution
halter73 was right: each instance generates its own anti-CSRF token, which the other instances cannot decrypt. To avoid this I implemented my own IDataProtector/IDataProtectionProvider, similar to these two SO questions (see here and here).
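The idea behind such a replacement can be sketched in a few lines. The following is a language-neutral model written in Python, not the OWIN IDataProtector interface itself: `SharedKeyProtector`, `SHARED_SECRET`, and the purpose string are hypothetical names, and the sketch only signs the payload with HMAC-SHA256 instead of encrypting it. The point it illustrates is that every instance derives the same key from a secret they all share, so a token protected on one instance can be unprotected on another.

```python
import base64
import hashlib
import hmac

# Hypothetical secret that every role instance would read from shared configuration.
SHARED_SECRET = b"replace-with-a-long-random-secret"


class SharedKeyProtector:
    """Signs payloads with a key derived from a shared secret and a purpose."""

    def __init__(self, secret: bytes, purpose: str):
        # Deriving a per-purpose key keeps tokens issued for different uses apart.
        self._key = hmac.new(secret, purpose.encode("utf-8"), hashlib.sha256).digest()

    def protect(self, data: bytes) -> bytes:
        # Prepend a 32-byte HMAC tag to the payload and encode the result.
        tag = hmac.new(self._key, data, hashlib.sha256).digest()
        return base64.urlsafe_b64encode(tag + data)

    def unprotect(self, token: bytes) -> bytes:
        # Split tag and payload, then verify the tag with this instance's key.
        raw = base64.urlsafe_b64decode(token)
        tag, data = raw[:32], raw[32:]
        expected = hmac.new(self._key, data, hashlib.sha256).digest()
        if not hmac.compare_digest(tag, expected):
            raise ValueError("token was not issued with this key")
        return data


# Two protectors stand in for two worker role instances.
instance_a = SharedKeyProtector(SHARED_SECRET, "connection-token")
instance_b = SharedKeyProtector(SHARED_SECRET, "connection-token")

token = instance_a.protect(b"connection-id-123")
recovered = instance_b.unprotect(token)  # succeeds because both use the same key
```

A real implementation would also encrypt the payload and keep the secret out of source code; this sketch leaves both out to keep the shared-key idea visible.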
If you can look at the content of the 400 response (this may be difficult, since it is an SSL-encrypted response to a WebSocket request), you will probably see a message similar to "The ConnectionId is in the incorrect format."
SignalR uses the server's machine key to create an anti-CSRF token, so all the servers in your farm must share a machine key for the token to be decrypted properly when SignalR requests hop between servers. The /negotiate request is the one that retrieves the anti-CSRF token. When the SignalR client then uses that token to make a /connect request, the request fails whenever it is processed by a different server from the one that created the token, because that server is unable to decrypt it.
Here is an issue filed on GitHub by someone who experienced a similar problem: https://github.com/SignalR/SignalR/issues/2292.
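The failure mode can be reproduced in a small model. This is a sketch in Python rather than SignalR code: `Server`, `negotiate`, and `connect` are hypothetical stand-ins for a role instance and the two requests, and an HMAC signature stands in for the machine-key protection. In the first half each server has its own random key, as is assumed to happen when the instances do not share a machine key.

```python
import hashlib
import hmac
import os
from typing import Tuple


class Server:
    """Stand-in for one role instance holding its own protection key."""

    def __init__(self, name: str, key: bytes):
        self.name = name
        self._key = key

    def negotiate(self, connection_id: str) -> Tuple[str, bytes]:
        # Issue a token bound to this server's key.
        tag = hmac.new(self._key, connection_id.encode(), hashlib.sha256).digest()
        return connection_id, tag

    def connect(self, token: Tuple[str, bytes]) -> int:
        # Return an HTTP-like status: 200 if the token verifies, otherwise 400.
        connection_id, tag = token
        expected = hmac.new(self._key, connection_id.encode(), hashlib.sha256).digest()
        return 200 if hmac.compare_digest(tag, expected) else 400


# Per-instance keys: every server generates its own.
server_a = Server("instance_0", os.urandom(32))
server_b = Server("instance_1", os.urandom(32))

token = server_a.negotiate("abc")
same_server = server_a.connect(token)   # issuer verifies its own token
other_server = server_b.connect(token)  # different key, verification fails

# Shared key: both servers are constructed with the same key.
shared_key = os.urandom(32)
server_c = Server("instance_0", shared_key)
server_d = Server("instance_1", shared_key)
after_fix = server_d.connect(server_c.negotiate("abc"))
```

If the load balancer distributes requests without affinity, the /connect request reaches the issuing instance only some of the time, which would be consistent with the intermittent failures described in the question.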