Linux TCP/IP Tuning for Scalability

Hi there! I’m Philip (@bluesmoon), the CTO of LogNormal. We’re a performance company, and performance and scalability go hand in hand. Better scalability results in more consistent performance and at LogNormal, we like pushing our hardware as far as it will go. Today’s post is about some of the infrastructure we use and how we tune it to handle a large number of requests.

We have separate components of our software stack to handle different tasks. In this post I’ll only cover the parts that make up our beacon collection component and how we tune it. Only a few of the tuning points are specific to this component.

The Stack

(Side note: someone needs to start a Coffee Shop + Co-working Space called The Stack).

  • The beacon collector runs __Linux__ at its base. We use a combination of Ubuntu 11.10 and 12.04, which for most purposes are the same. If you're going with a new implementation though, I'd suggest 12.04 (or at least the 3.x kernels).
  • Slightly higher up is __iptables__ to restrict inbound connections. This is mainly because we're hosted on shared infrastructure and need to restrict internal communications only to hosts that we trust. iptables is the cheapest way to do this, but it brings in a few caveats that we address in the tuning section later.
  • We then have __nginx__ set up to serve HTTP traffic on ports 80 and 443 and do some amount of filtering (more on this later)
  • Behind nginx is our custom __node.js__ server that handles and processes beacons as they come in. It reads some configuration data from __couchdb__ and then sends these processed beacons out into the ether. Nginx and node talk to each other over a unix domain socket. That's about all that's relevant for this discussion, but at the heart of it, you'll see that there are lots of file handles and sockets in use at any point of time. A large part of this is due to the fact that nginx only uses HTTP/1.0 when it proxies requests to a back end server, and that means it opens a new connection on every request rather than using a persistent connection.

What should we tune?

For the purposes of this post, I’ll talk only about the first two parts of our stack: Linux and iptables.

###Open files

Since we deal with a lot of file handles (each TCP socket requires a file handle), we need to keep our open file limit high. The current value can be seen using ulimit -a (look for open files). We set this value to ‘999999’ and hope that we never need a million or more files open. In practice we never do. We set this limit by putting a file into /etc/security/limits.d/ that contains the following two lines:

*	soft	nofile	999999
*	hard	nofile	999999

(side node: it took me 10 minutes trying to convince Markdown that those asterisks were to be printed as asterisks) If you don’t do this, you’ll run out of open file handles and could see one or more parts of your stack die.

###Ephemeral Ports

The second thing to do is to increase the number of Ephemeral Ports available to your application. By default this is all ports from ‘32768’ to ‘61000’. We change this to all ports from ‘18000’ to ‘65535’. Ports below 18000 are reserved for current and future use of the application itself. This may change in the future, but is sufficient for what we need right now, largely because of what we do next.

###TIME_WAIT state

TCP connections go through various states during their lifetime. There’s the handshake that goes through multiple states, then the ESTABLISHED state, and then a whole bunch of states for either end to terminate the connection, and finally a TIME_WAIT state that lasts a really long time. If you’re interested in all the states, read through the netstat man page, but right now the only one we care about is the TIME_WAIT state, and we care about it mainly because it’s so long. By default, a connection is supposed to stay in the TIME_WAIT state for twice the msl. Its purpose is to make sure any lost packets that arrive after a connection is closed do not confuse the TCP subsystem (the full details of this are beyond the scope of this article, but ask me if you’d like details). The default msl is 60 seconds, which puts the default TIME_WAIT timeout value at 2 minutes. This means you’ll run out of available ports if you receive more than about 400 requests a second, or if we look back to how nginx does proxies, this actually translates to 200 requests per second. Not good for scaling. We fixed this by setting the timeout value to 1 second. I’ll let that sink in a bit. Essentially we reduced the timeout value by 99.16%. This is a huge reduction, and not to be taken lightly. Any documentation you read will recommend against it, but here’s why we did it. Again, remember the point of the TIME_WAIT state is to avoid confusing the transport layer. The transport layer will get confused if it receives an out of order packet on a currently established socket, and send a reset packet in response. The key here is the term established socket. A socket is a tuple of 4 terms. The source and destination IPs and ports. Now for our purposes, our server IP is constant, so that leaves 3 variables. Our port numbers are recycled, and we have 47535 of them. That leaves the other end of the connection. In order for a collision to take place, we’d have to get a new connection from an existing client, AND that client would have to use the same port number that it used for the earlier connection, AND our server would have to assign the same port number to this connection as it did before. Given that we use persistent HTTP connections between clients and nginx, the probability of this happening is so low that we can ignore it. 1 second is a long enough TIME_WAIT timeout. The two TCP tuning parameters were set using sysctl by putting a file into /etc/sysctl.d/with the following:

net.ipv4.ip_local_port_range = 18000    65535
net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait = 1

###Connection Tracking

The next parameter we looked at was Connection Tracking. This is a side effect of using iptables. Since iptables needs to allow two-way communication between established HTTP and ssh connections, it needs to keep track of which connections are established, and it puts these into a connection tracking table. This table grows. And grows. And grows. You can see the current size of this table using sysctl net.netfilter.nf_conntrack_count and its limit using sysctl net.nf_conntrack_max. If count crosses max, your linux system will stop accepting new TCP connections and you’ll never know about this. The only indication that this has happened is a single line hidden somewhere in /var/log/syslog saying that you’re out of connection tracking entries. One line, once, when it first happens. A better indication is if count is always very close to max. You might think, “Hey, we’ve set max exactly right.”, but you’d be wrong. What you need to do (or at least that’s what you first think) is to increase max. Keep in mind though, that the larger this value, the more RAM the kernel will use to keep track of these entries. RAM that could be used by your application. We started down this path, increasing net.nf_conntrack_max, but soon we were just pushing it up every day. It turned out that connections that were getting in there were never getting out.

###nf_conntrack_tcp_timeout_established

It turns out that there’s another timeout value you need to be concerned with. The established connection timeout. Technically this should only apply to connections that are in the ESTABLISHED state, and a connection should get out of this state when a FIN packet goes through in either direction. This doesn’t appear to happen and I’m not entirely sure why. So how long do connections stay in this table then? It turns out that the default value for nf_conntrack_tcp_timeout_established is 432000 seconds. I’ll wait for you to do the long division… Fun times. I changed the timeout value to 10 minutes (600 seconds) and in a few days time I noticed conntrack_countwent down steadily until it sat at a very manageable level of a few thousand. We did this by adding another line to the sysctl file:

(net.netfilter.nf_conntrack_tcp_timeout_established=600)

Speed bump

At this point we were in a pretty good state. Our beacon collectors ran for months (not counting scheduled reboots) without a problem, until a couple of days ago, when one of them just stopped responding to any kind of network requests. No ping responses, no ACK packets to a SYN, nothing. All established ssh and HTTP connections terminated and the box was doing nothing. I still had console access, and couldn’t tell what was wrong. The system was using less than 1% CPU and less than 10% of RAM. All processes that were supposed to be running were running, but nothing was coming in or going out. I looked through syslog, and found one obscure message repeated several times.

IPv4: dst cache overflow

Well, there were other messages, but this was the one that mattered. I did a bit of searching online, and found something about an rt_cache leak in 2.6.18. We’re on 3.5.2, so it shouldn’t have been a problem, but I investigated anyway. The details of the post above related to 2.6, and 3.5 was different, with no ip_dst_cache entry in /proc/slabinfo so I started searching for its equivalent on 3.5, when I came across Vincent Bernat’s post on the IPv4 route cache. This is an excellent resource to understand the route cache on Linux, and that’s where I found out about the lnstat command. This is something that needs to be added to any monitoring and stats gathering scripts that you run. Further reading suggests that the dst cache gc routines are complicated, and a bug anywhere could result in a leak, one which could take several weeks to become apparent. From what I can tell, there doesn’t appear to be a rt_cacheleak. The number of cache entries increases and decreases with traffic, but I’ll keep monitoring it to see if that changes over time.

Other things to tune

There are a few other things you might want to tune, but they’re becoming less of an issue as base system configs evolve.

###TCP Window Sizes

This is related to TCP Slow Start, and I’d love to go into the details, but our friends Sajal and Aaron over at CDN Planet have already done an awesome job explaining how to tune TCP initcwnd for optimum performance. This is not an issue for us because the 3.5 kernel’s default window size is already set to 10.

###Window size after idle

Related to the above is the sysctl setting net.ipv4.tcp_slow_start_after_idle. This tells the system whether it should start at the default window size only for new TCP connections or also for existing TCP connections that have been idle for too long (on 3.5, too long is 1 second, but see net.sctp.rto_initial for its current value on your system). If you’re using persistent HTTP connections, you’re likely to end up in this state, so set net.ipv4.tcp_slow_start_after_idle=0(just put it into the sysctl config file mentioned above).

Endgame

After changing all these settings, a single quad core vm (though using only one core) with 1Gig of RAM has been able to handle all the load that’s been thrown at it. We never run out of open file handles, never run out of ports, never run out of connection tracking entries and never run out of RAM. We have several weeks before another one of our beacon collectors runs into the dst cache issue, and I’ll be ready with the numbers when that happens. Thanks for reading, and let us know how these settings work out for you if you try them out. If you’d like to measure the real user impact of your changes, have a look at our Real User Measurement tool at LogNormal.