DevOps and Separation of Duties

How to overcome regulatory blocks

Reducing the number of personnel with access to the production environment and cardholder data minimizes risk and helps ensure that access is limited to those individuals with a business need to know.
The intent of this requirement is to separate development and test functions from production functions. For example, a developer may use an administrator-level account with elevated privileges in the development environment, and have a separate account with user-level access to the production environment.

PCI-DSS section 6.4.2 “Separation of duties between development/test and production environments”

So here’s a problem. Regulatory environments, such as PCI, require a strict separation of duties between the development teams and the operate teams. This can lead to “Application Dev” (AppDev) and “Application Operate” (AppOp) teams. In some places AppOp may also be combined with Infrastructure Operate (InfraOps), but in others those two operate teams may be separate. The key, though, is that the Dev and Ops teams are separate.

If something goes wrong at 3am then it’s the operate team that’s paged out; they do the primary analysis and call out to the dev team, who may make a fix and have the ops team deploy it. In some cases the dev team may not have any access to production at all and are doing their work hands off, which can lead to extended times to repair.

This separation also takes us down the path of requiring heavy hand-over documentation so that the AppOp team is happy to take on the support. This can slow down release cycles and tends to lead to a waterfall workflow.

DevOps, as a methodology, is designed to break down these barriers: allow faster problem remediation, speed up the release cycle, and support agile methodologies.

How can we do DevOps when the Dev team appears to need to be segregated from the Operate team?

Interpretation of the problem

Let’s define what admin access means. In my mind it’s the ability to directly make changes to the environment. The important point is that the Dev team are permitted to see what is going on in production; they just can’t change it.

How the dev team sees the data is also pretty flexible. It could mean that they can log into the production server and view the logs; they may be able to trace application execution or transactions as they pass from subsystem to subsystem.

Other concerns may impact this, though. Allowing a developer to run a command such as sudo cat /app/log/file may fall afoul of a corporate definition of ‘privileged access’ and so automatically trip a separation of duties flag. It can be argued that any server access is privileged, because even unprivileged users can impact server operations.

One step removed

So rather than give direct access we can focus on a hands-off approach. Tools such as AppDynamics or DynaTrace can create application monitoring dashboards; log collectors (Splunk, ELK stacks) can provide real-time activity logs; apps can be written to allow for introspection. Now it is clear that our developers can get access to the data they need to analyse issues, without going anywhere near the separation of duties issue.

A nice bonus is that this is the methodology you want to follow for modern elastic compute environments. It can also help with historical analysis (“what happened 3 months ago?”) because the logs are available, which is useful to both the operate and security teams. A central cross-enterprise collector can make life easier for the app teams (no need to stand up unique infrastructure they need to manage), may provide operational and threat insight, and allows common development patterns and libraries to be used.

Just make sure the data that you are collecting is properly redacted of any sensitive content (do not push credit card numbers into your log stream!).
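As a concrete illustration, here is a minimal sketch of that kind of redaction using Python’s standard logging module. The pattern, the filter name and the handler wiring are illustrative; in practice the handler would ship records to your Splunk/ELK collector rather than the console:

    import logging
    import re

    # Deliberately broad approximation of a card number: 13-19 digits, optionally
    # separated by spaces or dashes. Real redaction should also apply a Luhn check
    # and cover whatever formats your applications actually emit.
    PAN_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")

    class RedactPANFilter(logging.Filter):
        """Mask anything that looks like a card number before the record is shipped."""
        def filter(self, record: logging.LogRecord) -> bool:
            record.msg = PAN_RE.sub("[REDACTED-PAN]", record.getMessage())
            record.args = None   # message is now fully formatted; drop the raw args
            return True

    logger = logging.getLogger("payments")
    handler = logging.StreamHandler()        # stand-in for the central log collector
    handler.addFilter(RedactPANFilter())
    logger.addHandler(handler)

    logger.warning("card %s declined", "4111 1111 1111 1111")
    # emitted as: card [REDACTED-PAN] declined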

But what about changes?

We’ve solved the easy part of merging Dev and Ops: monitoring. But what about changing the environment? Pushing out fixes or new code versions?

This is where automation and CI/CD tools really help.

In theory we could create a pipeline that allows the developer to merge a branch into the “prod candidate” branch; this can kick off a build cycle, automated unit tests and integration tests; it could spin up a QA instance to verify all expected functionality works and, if that succeeds, push to production.

But do we really want that? If we do this then the developer effectively has the ability to change production and so we introduce a separation of duties conflict via a side door.

But the concept is good; we just need some controls on it: call-outs to an approval process.

An example might be an emergency fix; the DevOps team member may be able to say “release code with tag breakfix-201806031514 under problem ticket 623493”. The CI/CD tool can call out to the problem ticket system, verify the ticket is valid and covers the servers/application in the code scope, and then perform the deployment. Management oversight performs post-fix reviews.
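A minimal sketch of such a gate, in Python. The ticket-system endpoint, the field names and the deploy step are all hypothetical stand-ins; the point is that the pipeline, not the developer, validates the ticket before anything touches production:

    import requests  # assumes the problem ticket system exposes a simple REST API

    TICKET_API = "https://tickets.example.com/api/problems"   # hypothetical endpoint

    def verify_problem_ticket(ticket_id: str, app: str) -> bool:
        """Accept the release only if the ticket is open and covers this application."""
        resp = requests.get(f"{TICKET_API}/{ticket_id}", timeout=10)
        resp.raise_for_status()
        ticket = resp.json()
        return ticket.get("status") == "open" and app in ticket.get("affected_apps", [])

    def deploy(app: str, tag: str) -> None:
        # Placeholder: in practice the CI/CD tool performs the actual rollout
        # and records the evidence needed for the post-fix management review.
        print(f"deploying {app} at {tag}")

    def emergency_release(app: str, tag: str, ticket_id: str) -> None:
        if not tag.startswith("breakfix-"):
            raise ValueError("emergency releases must use a breakfix tag")
        if not verify_problem_ticket(ticket_id, app):
            raise PermissionError(f"ticket {ticket_id} does not authorise a release of {app}")
        deploy(app, tag)

    # Example: emergency_release("payments", "breakfix-201806031514", "623493")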

A planned release could be done with a similar call, but to the Change Request system, verifying the scope, the time window, and the CR approval state.

A regular cadence may be developed with a “pre-approved regular change”, by pushing the controls further into the dev stack (“merges to the production candidate must have 2 additional developers signing off”).
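One way to express that control in the pipeline itself is a small pre-merge check. This sketch assumes the CI job has already fetched the author and approver list from the review system; the names and the two-approver policy are illustrative:

    REQUIRED_APPROVALS = 2   # policy: two developers beyond the author must sign off

    def approvals_ok(author: str, approvers: list[str]) -> bool:
        """Count distinct approvers other than the author against the policy."""
        independent = {a for a in approvers if a != author}
        return len(independent) >= REQUIRED_APPROVALS

    # The CI job runs this before allowing a merge to the production candidate branch.
    if not approvals_ok(author="alice", approvers=["bob", "carol"]):
        raise SystemExit("merge blocked: pre-approved change needs two independent sign-offs")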

You are (probably) not Google.

(If you are Google, then you have pretty mature solutions to all this, anyway!)

People look at some of the modern internet giants with their claims of tens of thousands of changes per day, then look at the change model I wrote above and cry “That’s not how it’s done!” But remember, one size does not fit all. You are not Google; your organisation isn’t structured to be Google. Google is built around rapid iteration; if a change fails it can be fixed with little harm (“Something went wrong; that’s all we know”). Twitter had the “Fail Whale” when stuff broke.

But if you’re handling thousands of credit card transactions per second, what is the consequence of an outage? What will your clients (all those stores who are paying you to process those card swipes) feel? How many of those clients will you lose, and how much future custom will be lost?

Different organisations have different risk tolerances. You are not Google; can you afford this risk?

Exception processes

The problem with automation is that it only handles the cases you’ve coded for. You just know that something will go wrong and your AppOp person will need privileged access to the server. This may be a “break glass” type process (“I have an emergency; I need the root password; break glass to get it”) with strong oversight. Initially you may find this process executed a lot, but as your tools, your developers and your procedures mature, this should truly become an exception.
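The break-glass step itself can be automated so that emergency access is time-limited and always leaves an audit trail. A minimal sketch, with the credential store and the alerting deliberately reduced to stand-ins:

    import secrets
    from datetime import datetime, timedelta, timezone

    BREAK_GLASS_TTL = timedelta(hours=2)   # emergency access expires automatically

    def break_glass(user: str, reason: str, ticket_id: str) -> dict:
        """Issue a short-lived credential and record who asked, why, and until when."""
        credential = secrets.token_urlsafe(32)      # stand-in for a real vault/PAM call
        expires = datetime.now(timezone.utc) + BREAK_GLASS_TTL
        audit_event = {
            "user": user,
            "reason": reason,
            "ticket": ticket_id,
            "expires": expires.isoformat(),
        }
        print("BREAK-GLASS:", audit_event)          # stand-in for the audit/alert pipeline
        return {"credential": credential, **audit_event}

    # Example: the on-call engineer invokes this under a live incident ticket.
    access = break_glass("dana", "payment service down", "INC-102938")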

If you still feel the need for an InfraOps team (personally, I think this is a good idea; trained DBAs are better than ad-hoc dev-oriented DBAs, and similarly for trained SAs) then they may take over the vestigial AppOp requirements that can’t be met by the new DevOps team. Again, engaging these teams may be on an exception basis so it doesn’t slow down the DevOps velocity.

Conclusion

What I’m describing here isn’t easy. We’re talking about a requirement that is firmly rooted in the traditional compute model. Organisations have a tonne of embedded processes in place to meet regulatory requirements, and one process may meet multiple requirements, so you can’t just skip it. But if you’re looking to pivot towards a DevOps model then investing in the tooling (logging, automation, approvals) can get you a long way towards meeting your regulatory requirements. And it nicely sets you up for newer technologies (e.g. deployment to a PaaS; management of K8S deployments).