A personal growth moment · Ramblings of a Unix Geek

26 Oct 2025, 20:39


The recent AWS outage led to a slew of “It was DNS” jokes.

It reminded me of a time maybe 12 or 13 years back. I was part of the architecture/engineering team for the company’s Unix authentication product. Basically every login, su, privilege escalation call went through our code.

So when things went wrong we got the blame.

One day a call got escalated to me. It went something like:

Caller (C): It’s taking a long time to log in
Me (M): How long?
C: About 10 seconds
M: (thinking “hmm, that sounds like a DNS timeout”). Does it happen if you log in again?
C: No, just on first login
M: It’s DNS
C: It can’t be, otherwise we’d have got an alert
M: OK, so what happens is that the SSH daemon is trying to do a reverse DNS lookup on the IP address connecting to it. If DNS isn’t working properly then this can take some time. But the nscd process caches this for a while, which is why the second login is fast.
C: What can we do about it?
M: Well, we can change the sshd_config file to disable those lookups but it might not always work

Then we went through the process of checking that their DNS configs were correct, demonstrating the issue using nslookup, and then configuring SSH to disable the lookup. The caller went away happy; their logins were fast again. I told my team of a potential DNS issue.
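For illustration, the “slow first time, fast second time” symptom can be measured by timing a reverse lookup. This is a minimal Python sketch, not what we ran at the time (that was nslookup). The function name time_reverse_lookup and the injectable resolver parameter are my own inventions for the example; socket.gethostbyaddr is the standard library call, and it goes through the system resolver, so any local cache such as nscd would normally affect the second measurement.

```python
import socket
import time
from typing import Callable, Tuple


def time_reverse_lookup(
    ip: str,
    resolver: Callable[[str], tuple] = socket.gethostbyaddr,
) -> Tuple[float, str]:
    """Return (elapsed_seconds, result) for one reverse lookup of `ip`.

    `resolver` is a hypothetical hook so the function can be exercised
    without touching real DNS; by default it uses the system resolver.
    """
    start = time.monotonic()
    try:
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist)
        result = resolver(ip)[0]
    except OSError as exc:
        # socket.herror and socket.gaierror are both OSError subclasses
        result = f"lookup failed: {exc}"
    return time.monotonic() - start, result
```

Calling time_reverse_lookup twice in a row with the client’s IP address and comparing the two elapsed times would show the pattern the caller described: a long wait on the first attempt and a much shorter one on the second, assuming a cache sits in the lookup path.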

Two hours later, a global email announcement of DNS issues was made.

Yeah, it was DNS.

How I’ve grown

But going through this again in my mind, I realised where I didn’t go far enough.

Yes, I solved the caller’s problem. People were happy that my team was responsive. The business kept running. I did my job :-)

But what I didn’t do was to take this to the next step. Was it worth disabling this lookup globally? It would be an easy change to make: we could build a new rpm, push it out, and have it deploy as part of the patching process.

What I should have done was to discuss this with the Linux and Unix engineering teams to come to a consensus on whether to leave the default values as-is, or to disable this lookup. I think it would have made an interesting discussion, especially considering the potential impact. It likely would have had the Linux/Unix teams trying to evaluate the likelihood of that impact (perhaps collecting the config files from the 60,000+ endpoints).
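As a rough idea of what that evaluation might look like, here is a Python sketch that tallies the UseDNS setting across a set of already-collected sshd_config files. The function names, the dict-of-hostnames input, and the “unknown” bucket are assumptions for illustration; how the files get collected is not shown. It relies on sshd using the first value it finds for a keyword, stops at the first Match block, and ignores Include directives. A file that never sets UseDNS lands in the “unknown” bucket because the built-in default depends on the OpenSSH version and vendor build.

```python
import re
from collections import Counter
from typing import Dict, Optional


def effective_usedns(config_text: str) -> Optional[str]:
    """Return the first global UseDNS value in an sshd_config, or None if unset."""
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and commented-out settings
        # Keyword and value are separated by whitespace and/or "="
        parts = re.split(r"[\s=]+", line, maxsplit=1)
        keyword = parts[0].lower()  # keywords are case-insensitive
        if keyword == "match":
            break  # everything after this is conditional, so stop here
        if keyword == "usedns" and len(parts) == 2:
            return parts[1].strip().strip('"').lower()
    return None


def tally_usedns(configs: Dict[str, str], default: str = "unknown") -> Counter:
    """Count effective UseDNS values across {hostname: sshd_config text}."""
    counts: Counter = Counter()
    for text in configs.values():
        counts[effective_usedns(text) or default] += 1
    return counts
```

Feeding it the collected files would give a quick count of how many hosts explicitly say yes, explicitly say no, or rely on the default, which is the sort of number that discussion would have needed.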

Clearly somewhere in the past decade I learned to look further than just “incident response” and into a “what could be done to prevent this from happening again” mindset.

Because it’s always DNS.