Resiliency

Crowdstrike issues

I was asked about today’s Crowdstrike issues on Windows. Naturally I have some thoughts… What went wrong? What I know. Crowdstrike is an EDR (Endpoint Detection and Response) tool. (Well, they claim “XDR”, but that’s marketing). It has an agent component and a set of rule sets (called “channel file”). The agent has both user space and kernel space components to better give visibility into what is happening on the machine, and to be able to block bad things.

Can't Patch, Won't Patch

Whenever a new “critical” vulnerability is found, the cry goes out across the land; Patch! Patch! Patch! Whenever a major incident is caused by known vulnerabilities the question is always Why didn’t they patch? We’ve known about this for months! They should have patched! Sometimes this is valid criticism, and learning why the organisation wasn’t patched can lead to some insights into failure modes.

Bottlenecks and SPOFs

If you’ve ever built any enterprise level system you’ll be aware of the needs of performance and resiliency. You may do performance testing on your application; you may have a backup server in a second data center; you may even do regular “Disaster Recovery tests”. And yet, despite all these efforts, your application fails in unexpected ways or isn’t as resilient as you planned. Your primary server dies, the DR server doesn’t work properly.