As a tip to anyone using nginx to proxy websockets: make sure you increase the proxy read timeout for that connection. Otherwise, nginx will drop the (active) connection much sooner than you'd otherwise expect.
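For reference, the websocket-proxying stanza usually looks something like this -- the location path, upstream name, and timeout value here are just placeholders:

    location /ws/ {
        proxy_pass http://app_backend;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 3600s;  # default is 60s; idle sockets get dropped without this
    }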


Indeed; all the docs I could find referred to the old pre-websockets-support behavior, where you used TCP-passthrough and there was a (hackish) "websocket_read_timeout" property.

I ended up going with

    proxy_read_timeout 31536000;
--meaning that a client can hold open a websocket without the server needing to say anything for a year at a time. I'm not sure whether that's a good idea in all cases, though; it means you won't detect silent backend netsplits (i.e. the network cable getting cut to your appserver box without giving it a chance to FIN) since the TCP connection will still look open.

The more cautious solution might be to use a lower timeout (120s or so) but to have the server send heartbeats over the websocket when it's not doing anything. (If you're using socket.io, you're already getting this behavior for free; if you're using sock.js, you can enable it.) This breaks down at true scale, though: when you have 100,000 idle clients that you might want to announce to at any time, but usually have nothing to say to most of them, sending heartbeats to every client in that pool can saturate your link.
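For concreteness, the server side of that might look like the following -- a minimal sketch assuming the Node "ws" package, with made-up port and interval values:

    import WebSocket, { WebSocketServer } from "ws";

    const wss = new WebSocketServer({ port: 8080 });

    // Ping every connected client every 30s, so nginx always sees upstream
    // traffic well inside a 120s proxy_read_timeout.
    setInterval(() => {
      for (const client of wss.clients) {
        if (client.readyState === WebSocket.OPEN) client.ping();
      }
    }, 30_000);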

A compromise is probably to set the long timeout, but then use a cluster-monitoring service which will push a new config to your load-balancers in response to observed network-topology changes.


Yep. We initially went with a 1 hour timeout — worst case, the client drops every hour, and then immediately reconnects. We actually moved to a heartbeat system recently, because we found issues with spotty connections where we'd have packets lost / delayed (for up to a couple minutes), but the server and client both believed they were connected.

Scaling the heartbeat is actually not a big issue. If you're actively using the socket for things, the overhead of that is likely much higher than supporting pings. You'll actually hit port limitations first (if you can support ~65k connected clients per machine), in our experience. 3000 pings a second isn't too bad (that's 100k clients, pinging every 30 seconds or so). You can also change how fast the client is pinging based on client activity.
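The adaptive part is simple on the client; a browser-side sketch (the activity signal and intervals are made up):

    const ws = new WebSocket("wss://example.com/ws");
    let lastActivity = Date.now();
    document.addEventListener("mousemove", () => { lastActivity = Date.now(); });

    // Ping every 10s while the user is active, every 30s once they go idle.
    function schedulePing() {
      const idle = Date.now() - lastActivity > 60_000;
      setTimeout(() => {
        if (ws.readyState === WebSocket.OPEN) ws.send("ping");
        schedulePing();
      }, idle ? 30_000 : 10_000);
    }
    ws.addEventListener("open", schedulePing);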


> You'll actually hit port limitations first (if you can support ~65k connected clients per machine)

This isn't actually a limit, by the way. The port limitation is a uniqueness constraint on full (source IP, source port, dest IP, dest port) tuples; it just means one client can't have more than ~65k connections open to your server. (This tends to trip people up when benchmarking parallelism, because they're sending all the requests from their own computer and so run into the limit themselves.)

What you'll usually hit first is the open file descriptor ulimit.
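That one is easy to raise, though; e.g. on Linux, something along these lines -- the user and values here are illustrative:

    # /etc/security/limits.conf
    www-data  soft  nofile  200000
    www-data  hard  nofile  200000

    # nginx.conf: let nginx raise its own fd limit as well
    worker_rlimit_nofile 200000;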


You can increase the open file descriptor ulimit very easily. Regarding the port limitation: I think he means the connections from nginx to your websocket app...


Ah cool. Yep, that's what we did, thanks.

edit: I thought too fast. Amfy's reply to you is correct (and what I have seen previously). When proxying with nginx, you use up the ports locally (unless I'm missing something?).


Ah, I got confused with what exactly you meant. This is indeed an actual problem, but it has an easy solution: just make each backend process listen on multiple ports, and list them all as separate entries in nginx's upstream{} block for that backend. One backend with 1024 ports open = 67M connections nginx can make to that backend.
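Something like this, with made-up addresses and ports:

    upstream websocket_backend {
        # Each (IP, port) line gives nginx a fresh ~64k of source ports to use.
        server 10.0.0.2:9000;
        server 10.0.0.2:9001;
        server 10.0.0.2:9002;
        server 10.0.0.2:9003;
    }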

Again, you're not "using up" local ports, just (IP/port, IP/port) pairs--so increasing the number of remote ports you want to talk to allows you to make more connections just as well as if you could increase the number of local ports used to talk to them.

(This might not be so simple for some servers which expect to only listen on one port; the workaround is to use multiple IPs for the backend server, and make sure the backend process is listening on 0.0.0.0. They don't have to be real IPs--you can just as well do port-forwarding on the backend box from one-IP:lots-of-ports to lots-of-virtual-IPs:one-port. It's simple enough to listen on multiple ports in both Node and Erlang, though, so this probably doesn't matter for most people writing websocket servers.)
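In Node, for instance, it's just a loop -- a hypothetical sketch matching the upstream{} entries above (handler and port range are made up):

    import http from "http";

    const handler = (req: http.IncomingMessage, res: http.ServerResponse) => {
      res.end("ok");
    };

    // One process, one handler, four listening sockets.
    for (let port = 9000; port <= 9003; port++) {
      http.createServer(handler).listen(port, "0.0.0.0");
    }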


Is there a reason why a proper websocket connection, with the client pinging the server, is preferable to long-polling?


Yes, for a few reasons — most are application specific, though.

An active websocket connection does have a faster response time than a regular HTTP connection ([1]). The difference here isn't a ton, but may affect real-time applications. The packets are also smaller, so less overhead if you're sending many.

The biggest difference I saw is that when the client or server needs to send several quick requests (many within a couple of seconds), long polling breaks down. From the spec ([2]), "Once the server sends a long poll response, typically the client immediately sends a new long poll request." That round trip adds up, and it's not true full duplex. Chunked responses help for server -> client, but client -> server still has the same issues.

[1] http://eng.42go.com/secure-websockets-vs-https-benchmark/

[2] http://tools.ietf.org/html/draft-loreto-http-bidirectional-0...


Awesome stuff, thanks. Your second point could explain some of the mysterious non-updates that I've seen from time to time when using faye with long-polling behind nginx.


Any pointers on how you implemented the heartbeat with nginx?

Thank you.


You actually need to implement it outside of nginx. The easiest way is just have the client send a message every x seconds, and your application server will immediately respond. If the server does not respond quickly (within a few seconds), close the connection and optionally reconnect.
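In the browser, that's roughly (the message format and timings are made up):

    const ws = new WebSocket("wss://example.com/ws");
    let pongTimer: number | undefined;

    setInterval(() => {
      if (ws.readyState !== WebSocket.OPEN) return;
      ws.send(JSON.stringify({ type: "ping" }));
      // If nothing comes back within 5s, assume the connection is dead.
      pongTimer = window.setTimeout(() => ws.close(), 5_000);
    }, 30_000);

    ws.addEventListener("message", (ev) => {
      if (JSON.parse(ev.data).type === "pong") clearTimeout(pongTimer);
    });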


The default proxy_send_timeout is 60s. If you rely on information about connected clients (chat, anyone?), it's much nicer to ping the server every 30s than to set huge timeouts. If the client's internet connection breaks (mobile, anyone?) and the client has no chance to close the websocket connection, nginx keeps the connection open and your server thinks there are still clients to talk to. Nice side effect: your client knows about the disconnect, because the ping fails.

EDIT: it's the proxy_send_timeout directive.


EDIT: the parent originally said "proxy_read_timeout", and this comment was in response to that.

proxy_read_timeout times out reads from the upstream. It will trigger when a backend server goes AWOL, not when a client falls off the net.

If you want to detect client-side disconnection, you want proxy_send_timeout.

You can emulate this with proxy_read_timeout by having the server send heartbeats and requiring the client to respond before the server sends anything more -- the upstream then goes read-silent whenever the client stops responding. But why not just have the client do the pinging? Then the server doesn't have to say anything in response most of the time.


> but why not just have the client do the pinging?

That's exactly what I meant. Sorry if it was unclear. I accidentally switched proxy_send_timeout and proxy_read_timeout.


Agreed. We have a larger window so that we can adaptively ping (when the client isn't very active, we don't ping as often). We're not making chat though.

If you do want timeouts like this, an alternative is to handle this on the application layer, and not rely on nginx to drop connections.



