SSH keeps disconnecting

24 Dec 2016, 21:16

SSH

This blog post is of a more practical nature, and may be of use for people at home who ssh into servers and then come back later to find their session disconnected. It might also help some people in offices with nasty firewalls!

Basically the scenario goes something like:

ssh into a server
lock your screen, go away for a few hours
come back, unlock your screen
ssh session has been disconnected

So how does this happen, and what can we do to stop it?

Network Address Translation (NAT)

The first thing we need to be aware of is how most home routers work. When you connect to a standard home consumer ISP they give you one IP address¹.

Now every device you have including the router needs an IP address. Your desktop, your laptop, your cellphone, your games console, your iPad… if you go around your house and look at the number of devices you’ve connected to your WiFi (or plugged in with a cable into the router) then you might be surprised. I counted at least 20 devices in my house; there may be more!

So the router has to do some work to let all those devices share the same IP address. It does this by creating a local network on your WiFi/LAN, and devices get their addresses from that. So your laptop may get 192.168.1.10, your cellphone might get 192.168.1.11, your iPad maybe 192.168.1.24 and so on.

Where this gets clever is that the router then “changes” the address whenever any of your devices talks to the internet. So let’s say your iPad hits this web site. Your iPad is 192.168.1.24 and wants to talk to my server. When the traffic gets to the router it will translate the iPad’s source address (192.168.1.24) into the external address provided by your ISP. So now your iPad can talk to my server.

The problem is that the router needs to remember that it’s done this, so that when my server replies then it knows to send the response back to the iPad and not to your desktop or your phone. The router needs to keep a connection table in memory of all the address translations it is doing.
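The bookkeeping described above can be sketched in a few lines of Python. This is a simplified illustration, not how any particular router implements it: the `NatTable` class, the starting port 40000 and the external address 203.0.113.5 (a documentation-only address) are all made up for the example, and a real router tracks more than just the source address and port.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Endpoint = Tuple[str, int]  # (IP address, port)


@dataclass
class NatTable:
    """Toy model of a home router's address translation table."""
    external_ip: str
    next_port: int = 40000  # hypothetical start of the external port range
    outbound: Dict[Endpoint, int] = field(default_factory=dict)
    inbound: Dict[int, Endpoint] = field(default_factory=dict)

    def translate_out(self, src: Endpoint) -> Endpoint:
        """Rewrite an internal source address to the shared external one."""
        if src not in self.outbound:
            port = self.next_port
            self.next_port += 1
            self.outbound[src] = port   # remember the mapping...
            self.inbound[port] = src    # ...in both directions
        return (self.external_ip, self.outbound[src])

    def translate_in(self, external_port: int) -> Optional[Endpoint]:
        """Find which internal device a reply belongs to, if any."""
        return self.inbound.get(external_port)


nat = NatTable(external_ip="203.0.113.5")
ipad = nat.translate_out(("192.168.1.24", 51000))
laptop = nat.translate_out(("192.168.1.10", 51000))
print(ipad, laptop)
print(nat.translate_in(ipad[1]))  # the reply is routed back to the iPad
```

If the entry for a port is missing, `translate_in` returns `None`: the router has no idea which device the reply is for, which is exactly what happens to a session whose table entry has been dropped.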

Dropping connections

There are a number of problems with NAT, but two are relevant here:

The memory in a router is small; if too many connections are made then it might have to drop some traffic (I saw this with the older Verizon FIOS ActionTec routers; they only had a small NAT table).
Connections might break messily; the router needs to have a way of cleaning up the connection table entries for dead connections.

This is normally solved by having a timer associated with each entry; “when did we last see a packet?”. If the table gets full and a new entry needs to be created then the oldest connection could be dropped and closed down. Similarly if the connection is too old (“we haven’t seen a packet for over 2 hours”) then the router might consider this a dead connection and drop it. Some routers are even more aggressive and may drop connections after 30 minutes or even sooner!

This second case is where we start to see idle sessions get dropped. Because the ssh session hasn’t sent any traffic in a few hours the router thinks it is a dead session and will close it.
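To make the timer idea concrete, here is a small Python simulation. It is only a sketch under assumptions: the `ConnectionTable` class and the connection names are hypothetical, the two-hour timeout is simply the figure used above, and real routers apply different timeouts to different kinds of traffic.

```python
from typing import Dict


class ConnectionTable:
    """Toy model of a router's table with a per-entry idle timer."""

    def __init__(self, idle_timeout: float) -> None:
        self.idle_timeout = idle_timeout        # seconds of silence allowed
        self.last_seen: Dict[str, float] = {}   # connection -> last packet time

    def open(self, conn: str, now: float) -> None:
        """A new connection creates a table entry."""
        self.last_seen[conn] = now

    def packet(self, conn: str, now: float) -> bool:
        """Return True if the packet is forwarded, False if the entry is gone."""
        self._expire(now)
        if conn not in self.last_seen:
            return False
        self.last_seen[conn] = now  # reset the idle timer
        return True

    def _expire(self, now: float) -> None:
        """Drop every entry that has been silent for longer than the timeout."""
        for conn, seen in list(self.last_seen.items()):
            if now - seen > self.idle_timeout:
                del self.last_seen[conn]


TWO_HOURS = 2 * 60 * 60
table = ConnectionTable(idle_timeout=TWO_HOURS)
table.open("idle-ssh", now=0)
table.open("busy-ssh", now=0)

# The busy session sends something every 10 minutes for 3 hours.
busy_ok = all(table.packet("busy-ssh", now=t) for t in range(600, 10801, 600))

# The idle session says nothing until you come back 3 hours later.
idle_ok = table.packet("idle-ssh", now=10800)

print(busy_ok, idle_ok)
```

The busy session keeps resetting its timer and survives; the idle one has been expired by the time its first packet arrives, so that packet goes nowhere.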

KeepAlive

The standard workaround for this sort of problem is to make sure the router never thinks your connection is idle; send “fake traffic” at regular intervals. In computer jargon this is called “keep alive” traffic; we’re sending it just to keep the connection alive.

Now the first thing someone notices when reading the ssh manual page is something called TCPKeepAlive. Hey, that sounds good! It turns on keep alive packets at the TCP layer! Unfortunately, it doesn’t help here. TCP KeepAlive packets weren’t designed to handle intermediate router drops; they’re there to detect whether the remote server has gone away (so the local socket can be closed rather than left hanging). They are also typically sent far less often than the router needs (on Linux the default is to wait two hours of idle time before the first probe), and so they don’t prevent the connection being dropped.
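For the curious, this is roughly what TCP keepalive looks like from a program’s point of view. The Python sketch below only illustrates the general socket mechanism and is not what the ssh client itself does; the timer values are arbitrary examples, and the `TCP_KEEP*` options are platform-specific, so the code sets only those that the local Python build provides.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Ask the kernel to send TCP keepalive probes on this socket.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Per-socket timer overrides; these constants are platform-specific,
# so only set the ones this Python build exposes.
timers = {
    "TCP_KEEPIDLE": 300,   # idle seconds before the first probe (illustrative)
    "TCP_KEEPINTVL": 60,   # seconds between probes (illustrative)
    "TCP_KEEPCNT": 5,      # unanswered probes before giving up (illustrative)
}
for name, value in timers.items():
    option = getattr(socket, name, None)
    if option is not None:
        sock.setsockopt(socket.IPPROTO_TCP, option, value)

keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
print("keepalive enabled:", keepalive_on)
sock.close()
```

Without overrides like these the system-wide defaults apply, which is why simply switching TCP keepalive on rarely beats a router’s idle timer.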

The ssh authors noticed this and in version 2 of the protocol (there’s a 99.99% chance you are using SSH2; SSH1 has known security issues and has been turned off almost everywhere) they added a “No Operation” (NOOP) message type. This can be used as a simple Keep Alive packet.

How you turn this on depends on your client.

Unix (and MacOS)

In your $HOME/.ssh/config file you can turn on KeepAlive with the following settings:

Host *
  ServerAliveInterval 600

This tells the ssh client that if it hasn’t received a message from the server within 600 seconds then it should send a NOOP message to get a response. This works pretty well at making the router think the connection is still busy, so it doesn’t drop it as idle. Nicely, this packet is only sent when the session is idle, so it doesn’t cause any extra traffic on active sessions.

Depending on how aggressive your router is, you may need a smaller number.
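If you happen to know (or have guessed) your router’s idle timeout, a tiny helper can pick a number for you. This Python sketch is just one way to do it: the function name is made up, and using half the router timeout is a rule of thumb assumed for the example, not anything the ssh documentation prescribes.

```python
def server_alive_stanza(router_idle_timeout: int, host_pattern: str = "*") -> str:
    """Render an ssh config Host block for a given router idle timeout.

    The interval is set to half the router timeout (an assumed safety
    margin), and never less than one second.
    """
    interval = max(1, router_idle_timeout // 2)
    return f"Host {host_pattern}\n  ServerAliveInterval {interval}\n"


# A router that drops idle connections after 30 minutes:
print(server_alive_stanza(30 * 60))
```

The output can be pasted into $HOME/.ssh/config in place of the block shown earlier.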

Windows, using putty

putty is a pretty good Windows ssh client. It’s the one I use when forced to use Windows! It also has the ability to send keepalives; the setting is in the Connection section of the configuration dialog, labelled “Seconds between keepalives (0 to turn off)”.

You can enter a timeout here, and save it as part of your connection profile.

Others

If you have a favourite ssh client with keepalive ability built in then feel free to drop me a note in the comments!

Conclusion

NAT is nasty, and some of the compromises it requires can cause longer-running idle sessions to break. This doesn’t really impact most users because typical activity (eg web surfing) uses short-lived connections. But long-lived sessions (eg ssh) that sit idle for a long time may see disruption.

We can work around this at the application level by sending application layer keepalive traffic. Many ssh clients have this ability built in.

Connection dropping isn’t solely caused by NAT, but NAT is a common cause, especially for people initiating connections from home with an ISP-provided router.

¹ Not always, but if you’ve requested more than one address then you should know enough to realise this is a simplification.
