The recent AWS outage led to a slew of “It was DNS” jokes.
It reminded me of a time maybe 12 or 13 years back. I was part of the architecture/engineering team for the company’s Unix authentication product. Basically every login, su, privilege escalation call went through our code.
So when things went wrong we got the blame.
One day a call got escalated to me. It went something like:
- Caller (C): It’s taking a long time to log in
- Me (M): How long?
- C: About 10 seconds
- M: (thinking “hmm, that sounds like a DNS timeout”). Does it happen if you log in again?
- C: No, just on first login
- M: It’s DNS
- C: It can’t be, otherwise we’d have got an alert
- M: OK, so what happens is that the SSH daemon is trying to do a reverse DNS lookup on the IP address connecting to it. If DNS isn’t working properly then this can take some time. But the nscd process caches this for a while, which is why the second login is fast.
- C: What can we do about it?
- M: Well, we can change the sshd_config file to disable those lookups, but it might not always work
Then we went through the process of checking that their DNS configs were correct, demonstrating the issue with nslookup, and then configuring SSH to disable the lookup. The caller went away happy; their logins were fast again. I told my team of a potential DNS issue.
Two hours later a global email announcement of DNS issues was made.
Yeah, it was DNS.
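For anyone who wants to see the symptom for themselves, here’s a minimal Python sketch of the kind of check we did with nslookup: time a reverse lookup of the connecting client’s address. This isn’t what we ran back then. The function name and the injectable resolver parameter are my own, purely for illustration.

```python
import socket
import time


def time_reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return (hostname or None, elapsed seconds) for a reverse lookup of ip.

    `resolver` is a hypothetical hook so the timing logic can be exercised
    without touching real DNS; by default it uses the system resolver.
    """
    start = time.monotonic()
    try:
        # gethostbyaddr returns (hostname, aliases, addresses)
        hostname = resolver(ip)[0]
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses
        hostname = None
    elapsed = time.monotonic() - start
    return hostname, elapsed
```

Call it twice in a row with the client’s IP address. If the first call takes several seconds and the second returns almost immediately, that matches the pattern the caller described: a slow or failing lookup the first time, then a cached answer (assuming something like nscd is caching host lookups on that machine).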
How I’ve grown
But going through this again in my mind, I realised where I didn’t go far enough.
Yes, I solved the caller’s problem. People were happy that my team was responsive. The business kept running. I did my job :-)
But what I didn’t do was take this to the next step. Was it worth disabling this configuration globally? It would have been an easy change to make: build a new rpm, push it out, and let it deploy as part of the patching process.
What I should have done was to discuss this with the Linux and Unix engineering teams to come to a consensus on whether to leave the default values as-is, or to disable this lookup. I think it would have made an interesting discussion, especially considering the potential impact. It likely would have had the Linux/Unix teams trying to evaluate the likelihood of that impact (perhaps collecting the config files from the 60,000+ endpoints).
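To give a flavour of what that evaluation might have looked like, here’s a rough Python sketch that tallies the reverse-lookup setting across a pile of collected config files. I’m assuming the option in question is OpenSSH’s UseDNS keyword, and the function names and the hostname-to-text mapping are hypothetical. The parsing is deliberately simplified: it ignores Include directives and stops at the first Match block.

```python
from collections import Counter


def usedns_setting(config_text):
    """Return the global UseDNS value ('yes' or 'no'), or None if not set.

    Simplified sketch: only handles 'Keyword value' lines, ignores Include,
    and stops at the first Match block since settings after it are conditional.
    """
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        keyword = parts[0].lower()  # sshd_config keywords are case-insensitive
        if keyword == "match":
            break
        if keyword == "usedns" and len(parts) == 2:
            # The first value found wins, so return immediately
            return parts[1].strip().lower()
    return None


def tally(configs):
    """Count UseDNS settings; configs maps hostname -> sshd_config text."""
    return Counter(usedns_setting(text) or "unset" for text in configs.values())
```

A large “unset” bucket would mean most hosts are running on whatever default their OpenSSH version ships with, which is exactly the sort of thing that discussion would have needed to pin down before changing anything globally.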
Clearly somewhere in the past decade I learned to look further than just “incident response” and into a “what could be done to prevent this from happening again” mindset.
Because it’s always DNS.
