One of the “hot” things around today is the concept of Site Reliability Engineering (SRE). I’m gonna be slightly provocative and state that this is not a new thing; we were doing this 30 years ago. Indeed, these concepts go back to where we were when I started out in this industry.
Although, to be fair, there is one new factor.
Now I’ll be the first to say that my take on history is very much biased by my personal experiences, and how I worked.
I started in a small company in 1990. I had to deal with the servers, the terminals, the printers, the fax machine, the phone system… even the door entry system. Pretty much if it was vaguely technology related it became my problem. Plus, of course, “systems development” and creation of tools for the users. Most of the communication was via serial terminals (VT220/320, Wyse85/99) and modems running UUCP. Most of the machines didn’t even have an ethernet port when I started. Naturally this led to each machine being custom and unique. I learned a lot. Since I was the only person supporting the London office every failure was mine to fix. Nothing like being called during Sunday Dinner to fix a problem because a ship-broker was unable to dial in to complete a deal.
My next job was at an internet publishing division of a traditional publisher. Here we had a team of 5, all with Suns, all connected to the internet. The original architect had tried to create a resilient design, but didn’t know what he was doing. It was over-engineered and full of single points of failure. I simplified it… and then started to automate tasks; want to register a new DNS domain? Run this command, answer the questions; it will create the DNS entries on master and secondary, create the email form, PGP sign it, send it to Internic/Nominet/whoever. Want to add a new dialup? Run that; it’ll create forward/reverse DNS, tacacs, mailboxes, etc. New web site? Cisco IOS config…
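To give a flavour of what one piece of that “add a new dialup” automation might look like today, here is a minimal Python sketch that builds a matching forward and reverse DNS record from a hostname and an address. It is not the original script; the function name `dialup_dns_records`, the default TTL, and the zone-file line layout are all assumptions made for illustration.

```python
import ipaddress


def dialup_dns_records(hostname, address, ttl=3600):
    """Return (forward, reverse) zone-file lines for one host.

    Hypothetical helper: the name, the TTL default and the output
    layout are illustrative assumptions, not the original tooling.
    """
    ip = ipaddress.ip_address(address)       # validates the address
    fqdn = hostname.rstrip(".") + "."        # fully qualified, trailing dot
    rtype = "A" if ip.version == 4 else "AAAA"
    forward = f"{fqdn} {ttl} IN {rtype} {ip}"
    # reverse_pointer gives the in-addr.arpa / ip6.arpa name for the PTR
    reverse = f"{ip.reverse_pointer}. {ttl} IN PTR {fqdn}"
    return forward, reverse


if __name__ == "__main__":
    for line in dialup_dns_records("dial1.example.com", "192.0.2.10"):
        print(line)
```

The point is the one in the text: both records come from a single answer to a single question, so forward and reverse can’t drift apart.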
When I moved into the central “systems” team I started to automate their processes. I even got zsh working on Windows NT so I could automate tasks the help desk commonly did. Automated monitoring (“Big Brother”, for those who remember that). Automated paging (Kermit, talking to a TAP service).
My third job was in a megacorp; I now had a thousand servers; more than the previous company had people! My first task was in Y2K readiness; how can we test these thousand servers are all working properly so the Y2K command center can tick their box.
I created a tool that could check all the servers, centrally coordinated (via ssh) with tunable (and self-tuning) parallelism. It would run in under 30 minutes. I wasn’t on call for Y2K, but the 2 people who were basically just watched the fireworks from the Embankment (the office was right on the banks of the Thames) while the code ran. (The Windows and Novell teams weren’t so lucky; they had to work).
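For those who want something concrete, the shape of such a tool can be sketched in a few lines of modern Python. This is not the code I wrote back then; it’s a minimal illustration, assuming key-based ssh access to every host, and the names `ssh_check` and `run_checks`, the `uptime` command, and the default parallelism of 20 are all made up for the example. It shows the tunable parallelism only; the self-tuning part is left out.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ssh_check(host, command="uptime", timeout=30):
    """Run one command on a host over ssh; True if it exits 0.

    Assumes non-interactive (key-based) ssh access. The command is
    a placeholder for whatever the real health check would be.
    """
    try:
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", host, command],
            capture_output=True,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def run_checks(hosts, check=ssh_check, parallelism=20):
    """Run check(host) for every host, at most `parallelism` at once.

    Returns a dict mapping host -> True/False. A check that raises
    is recorded as a failure rather than aborting the whole run.
    """
    hosts = list(hosts)

    def safe_check(host):
        try:
            return bool(check(host))
        except Exception:
            return False

    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        results = list(pool.map(safe_check, hosts))
    return dict(zip(hosts, results))
```

Calling `run_checks(hosts, parallelism=50)` raises the number of simultaneous ssh sessions; the `check` argument can be swapped for any callable, which also makes the coordinator easy to test without touching a real server.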
A good sysadmin is a lazy sysadmin; they automate everything. I was a VERY lazy sysadmin.
So lazy that the code remained in daily use for years afterwards, as a general purpose healthcheck. Because I was in operations I wrote code to make my life easier, and as a result every ops guy in my team gained the benefit.
Where it started to break down
The company wanted me to do more and more complicated work. Stuff that required focus. Systems architecture; engineering; deployment. Stuff that I couldn’t do if I was paged out at 3am because a server needed attention. I wasn’t the only one. They split the team into engineering and operations. Naturally the smart coders got moved into engineering; the remaining ops guys were more “press the button” types. Not stupid, by a long stretch; they could do some really detailed problem analysis and solve problems… they were good SAs. But they were more incident management focused, rather than problem management. They might document a solution and then next time it happened another team member could follow the document.
At the same time the engineering team started to be divorced from the daily operations. We were focused on higher level tasks set by management, rather than scripting and automating BAU activities. Barriers between the teams were created by this organizational distinction; the automators didn’t understand the ops problems, so they started to need requirements docs, ops wanted handover docs, and the slow decline into the structured support model (level 1, 2, 3, even 4) was inevitable.
The ops teams were considered by management to be a “cog” and each person replaceable. This led to “cheapest person” type policies, which naturally leads to off-shoring and stagnation; if a person in L2 learns enough then they move into L3, and L2 gains a newbie with no institutional knowledge.
Where the original idea to split engineering from ops was meant to speed up delivery (focused teams), the leaching of talent from ops led to slower processes as formal handover between teams became necessary; just to protect themselves the ops teams required good tested procedures from engineering; waterfall enforced by organizational and managerial boundaries.
So what about SRE?
If you look back at my history, SRE is kinda what I was doing 20 years ago. Find a problem, automate the solution so it never becomes a problem again. Find a task, automate it. Find a multi-step process, automate it to one step.
This is why I say that SRE isn’t anything new.
But there is a change… and it’s a good one. It’s formalization.
When I did my ops work I was working mostly in a vacuum. I was a single contributor, inventing technologies and solutions as I went along. I worked with smart people who were also single contributors in their area. Sometimes the areas overlapped and, if you were lucky, the solutions could be combined. But mostly each person just ran their own way.
SRE, in comparison, is taking a modern software design approach towards the solution. Instead of individual contributors we now have a team approach. There may be agile techniques, but I don’t think that’s necessarily so important.
Today tools and technologies exist that were not around 20 years ago; JIRA queues, wikis, deployment frameworks (ansible, cfengine, salt,…). We can create a common technology stack within the team that anyone in the team can use. We no longer have “Stephen’s magic” and “Fred’s magic” and “Matt’s magic”; we have automation.
This is a massive step forward.
So why will the SRE model work?
If we look at my history and look at SRE today, these common tools aren’t enough to overcome the “I was paged at 3am; I can’t code today” problem.
If SRE was just a “modern technique to an old solution” then it’ll die out for the same reasons that my team got split in 2000.
Fortunately, SRE isn’t the only change. There’s also the DevOps model. Now that has its own problems (a blog post for another day!) but it can mean that the layer that SRE is working on is the engineering team problems of the last 20 years.
When we look at containers and automated application delivery; when we look at the “cattle” model of software delivery; when we look at hands off automation and configuration management… the old school “ops” model is a lot smaller. Some of the work has been pushed onto the Dev teams; some of the work just disappears.
And this is why, in my opinion, SRE will succeed; it’s not in a vacuum, but it lives within a larger operational model change.
Some regulatory environments (e.g. Payment Card Industry - PCI) have restrictions around dev/prod access.
Indeed, PCI says:
Reducing the number of personnel with access to the production environment and cardholder data minimizes risk and helps ensure that access is limited to those individuals with a business need to know.
The intent of this requirement is to separate development and test functions from production functions. For example, a developer may use an administrator-level account with elevated privileges in the development environment, and have a separate account with user-level access to the production environment.
In the SRE model, the operations engineer is a developer. So can SRE be adopted within these constraints? I, personally, think it can… but it requires additional controls around the process. Fortunately, the modern SRE tooling has a lot of process focus; after all, that’s what modern software delivery methodology is all about!
I was being slightly provocative in my opening statements. I do believe that the concept of SRE is nothing new. What has changed, though, is the practice.
Because of the mismanagement discussed earlier, “operations” has a bad name. The SRE branding is a way of avoiding this tainted reputation, but it’s more than just a re-branding. It’s a shift back to the lazy sysadmin role, and you really want lazy sysadmins because they can do things at scale that can’t be done otherwise.
This, along with the role change causing a responsibility shift, means that the SRE model is a potentially viable one for organizations to adopt.