Key man dependencies and resilient processes

What we can learn from Equifax

Unless you’ve been living in a cave for the past couple of months, you’ll have heard that Equifax, one of the ‘big three’ credit reporting agencies, suffered a massive breach that leaked privileged data on over 143 million people in the US (and millions outside the US as well).

The story went from bad to worse as the company completely failed to handle the response properly, with poor communication, staff giving out the URL to phishing sites, web site failures and the story that three executives sold millions of dollars of shares before the leak notification was made.

As a technologist, I’m more interested in how the breach occurred. Reports from the company are that the web site was susceptible to a Struts2 vulnerability that had been known for months, but hadn’t been patched.

This, naturally, led to the question of why it hadn’t been patched. In testimony to Congress, Richard Smith (the Equifax ex-CEO) laid the blame on one unnamed person who failed to require deployment of the patch:

“The human error was that the individual who’s responsible for communicating in the organization to apply the patch, did not,” Smith said in the hearing.

“So does that mean that that individual knew the software was there,” Rep. Greg Walden replied, “and it needed to be patched, and did not communicate that to the team that did the patching?”

“That is my understanding, sir,” Smith said.

This short segment highlights a number of process failures.

  1. An individual who is responsible
    This is an obvious failure mode, and is inherently fragile. It’s almost a textbook example of a “key man dependency.” What if that person was on vacation? Or was sick? Or quit? (Or, for an example of a worst-case scenario, fell under a bus?)
    In any critical process (and vulnerability management is a critical process) all steps need to have redundancy in them. This implies a team with cross-checks, a workflow process (new vulnerability reported from upstream, assigned, worked on, tasks initiated… with reporting and alerting if a workflow shows no progress) and so on.

  2. Asset management
    To a lot of people, asset management merely means tracking what hardware you have out there. However, the true scope is greater than that; it also encompasses software, and the components inside that software. If this was an in-house developed application then there needs to be a record of all the open source components used in the application. This can help from a legal perspective to ensure the license is suitable, but it can also help with identifying applications with vulnerable components. This also implies the need for internal repository services that are used to serve approved code, with firewalls blocking direct access to upstream servers.

  3. Asset discovery
    This works in conjunction with the management piece; people make mistakes, developers take shortcuts, vendor products don’t always list their components. Where possible every OS and appliance should be scanned and the discovered components properly catalogued. Workflows can then be spun off to remediate the catalogue (e.g. license validation). Were you aware that your Oracle product may have Struts2 in it?

  4. Perimeter protection
    You do host your applications behind protection technologies such as WAFs, yes? These WAFs must be kept up to date with the latest signatures. Even if you don’t think you host a given technology, you may find that you do (vendor supplied, developer inclusion) and that your asset reports don’t show it. Vulnerability scanning may not show it either, if the path is sufficiently hidden, but that doesn’t mean you don’t have it. The WAF acts as an additional defense layer, and can also be used to help highlight external attack attempts via SIEM reporting.

  5. Internal protection
    Related to this would be IDS that may have been able to detect the persistent shell sessions, DAMs to detect anomalous query patterns, multi-tier application architectures (a breach at the presentation layer doesn’t give direct access to the database), sending logs to the SIEM, and so on. These may not have prevented the breach, but may have detected it sooner, before data had been exfiltrated.
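
The component record described under asset management and discovery is only useful if it can be queried when an advisory lands. Here is a minimal sketch of that lookup, assuming a catalogue that maps each application to the components found in it. The application names, the `INVENTORY` structure and the `affected_apps` helper are all hypothetical, and the version numbers are illustrative rather than taken from any real scan.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Component:
    name: str
    version: Tuple[int, ...]  # e.g. (2, 3, 31) for "2.3.31"


# Hypothetical catalogue: application name -> components recorded or discovered in it.
INVENTORY: Dict[str, List[Component]] = {
    "dispute-portal": [Component("struts2-core", (2, 3, 31)),
                       Component("commons-io", (2, 5))],
    "hr-intranet": [Component("struts2-core", (2, 3, 32))],
    "billing-api": [Component("spring-core", (4, 3, 9))],
}


def affected_apps(inventory: Dict[str, List[Component]],
                  component: str,
                  fixed_version: Tuple[int, ...]) -> List[str]:
    """Return the apps that ship `component` at a version older than `fixed_version`."""
    return sorted(
        app
        for app, components in inventory.items()
        if any(c.name == component and c.version < fixed_version
               for c in components)
    )


if __name__ == "__main__":
    # Which applications still need the (illustrative) fixed release?
    print(affected_apps(INVENTORY, "struts2-core", (2, 3, 32)))
```

A real catalogue would need to cope with multiple release branches and with vendor products that bundle components under other names, which is exactly why the discovery scans matter. The point is only that "where are we running this?" should be a query, not a question for one person's memory.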

Proper deployment of these processes and technologies would have prevented or mitigated the breach; even if the vulnerability hadn’t been patched then the WAF could have blocked access. If neither patching nor WAFs worked then IDS and DAM may have alerted and limited the exposure.

This is a classic defense-in-depth scenario, with overlapping solutions in place to mitigate failures elsewhere.
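
The same overlapping idea applies to the process side. The "reporting and alerting if a workflow shows no progress" check from the first point can be sketched in a few lines, assuming each vulnerability gets a ticket with a last-update timestamp. The `PatchTicket` type, the `stalled_tickets` helper and the two-day threshold are hypothetical choices for illustration, not a description of any particular ticketing product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class PatchTicket:
    ticket_id: str
    summary: str
    last_update: datetime
    closed: bool = False


def stalled_tickets(tickets: List[PatchTicket],
                    now: datetime,
                    max_idle: timedelta = timedelta(days=2)) -> List[PatchTicket]:
    """Return open tickets that have had no update for longer than `max_idle`."""
    return [t for t in tickets
            if not t.closed and now - t.last_update > max_idle]


if __name__ == "__main__":
    now = datetime(2017, 3, 15, 9, 0)
    tickets = [
        PatchTicket("VULN-1", "Patch Struts2 on web tier", datetime(2017, 3, 9, 9, 0)),
        PatchTicket("VULN-2", "Update WAF signatures", datetime(2017, 3, 14, 17, 0)),
        PatchTicket("VULN-3", "Rotate test certs", datetime(2017, 3, 1, 9, 0), closed=True),
    ]
    for ticket in stalled_tickets(tickets, now):
        # In practice this would page a team queue, not a single named person.
        print(f"ESCALATE {ticket.ticket_id}: {ticket.summary}")
```

The design choice that matters is where the escalation goes: if the alert lands in one person's inbox, the key man dependency has simply moved.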

Aside: One thing that would not have helped here is container technology. I’m getting tired of people telling me “if their app was in a container then they could just redeploy with the fixed version”. This wasn’t the problem. Why would someone redeploy containers if they weren’t told?

Summary

We could argue there was a single person to blame for the Equifax issue, but this was not the person Richard Smith suggested. The blame lies at the CTO layer for not ensuring resilient processes and technologies were deployed. Vulnerability management is a critical function in any organisation that handles PII; it’s a core requirement of PCI for anyone who handles credit card data.

We’ve also learned how little use auditors are!