Data Loss Prevention (DLP)

Preventing unauthorized exfiltration of data

Working in cyber security, I’m frequently reminded that the reason we do all the things we do is, ultimately, to protect the data. After all, apps are there to process data; servers (and clouds) are there to run apps and store data. So the whole of cyber security is there to protect the data. It may be Identity and Access Management (restrict access to data to those people who should have access to it). It may be anti-virus or endpoint malware detection (prevent ransomware from destroying data). It may be encryption (prevent attackers from bypassing other controls and reading data directly). We have secure software development processes to prevent attackers from getting access to data via application bugs. We have WAFs to detect and block common attack patterns (e.g. SQL injection). And so on.

Traditional DLP

Each of these “pillars” of cyber security is critical and all form part of a DLP posture. But in this blog post I’m going to focus on the area most traditionally associated with the moniker “DLP”: preventing data “theft”, and how the cloud makes this a lot more challenging.

In this realm we’ve traditionally focused on protecting the data at rest and on preventing unauthorized data from leaving the company (the challenge, of course, being to permit authorized data transfers; don’t block the business from working!).

Our threat actors range from staff making mistakes (“oops; I sent this confidential email to the wrong address”), through malicious insiders (e.g. a salesman planning on getting a new job trying to take confidential documents with him), to attackers who have a foothold in your environment and want to use that to exfiltrate your data.


A standard way to assist in DLP is to create a strong border around your assets. There are firewalls between you and the internet, between you and any partners you may have connections to, potentially even between parts of your company (why should HR staff have connectivity to trading systems?), and definitely between desktops and the internet. Of course you need internet access, so you create bridges across the border between the internal and external networks, such as web proxies. Traffic is funneled through these proxies and can be inspected. Bad traffic can be blocked.

The challenge, here, is that these controls are relatively easily bypassed.

Those that work on traffic inspection (“Hmm, this looks like a bunch of credit card numbers; we shouldn’t let that out”) can be defeated by a simple obfuscation (map 1 to !; map 2 to " and so on). Encrypted files are yet another challenge for traffic inspection. There are some clever tricks that can increase coverage (require encrypted files to use a managed solution with keys the DLP agent can access) but in general they’re a “best efforts” type solution. They’ll definitely help in catching mistakes but a competent malicious person can bypass them. The stricter you make the rules the harder you make it for the company to do business.
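To make that concrete, here is a minimal sketch of a content-inspection rule and the obfuscation that defeats it. It is not how any particular DLP product works; the regular expression, the function names and the sample payload are all my own illustrative assumptions, and the card number is the well-known 4111... test number.

```python
import re

# Candidate card numbers: 13-19 digits, optionally separated by spaces or dashes.
# (Illustrative pattern only; real products use far richer rules.)
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(payload: str) -> list[str]:
    """Return digit strings in the payload that look like valid card numbers."""
    hits = []
    for match in CARD_PATTERN.finditer(payload):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits


# The obfuscation described in the text: 1 -> !, 2 -> ", 3 -> # and so on
# (each digit is shifted down by 0x10 in ASCII, so 0 becomes a space).
OBFUSCATE = str.maketrans({str(d): chr(0x20 + d) for d in range(10)})

clear = "card=4111 1111 1111 1111;"
hidden = clear.translate(OBFUSCATE)

print(find_card_numbers(clear))   # the inspection rule fires
print(find_card_numbers(hidden))  # same data, but nothing is detected
```

The receiver only needs the reverse mapping to recover the numbers, which is why this sort of rule mostly catches mistakes rather than intent.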

Similarly, controls that work on volumetric patterns (“Huh, this machine has been consistently sending 10Mbit/s of traffic for the past day”) are defeated by slow exfiltration techniques.
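A quick back-of-the-envelope sketch shows why. 10Mbit/s is 1.25MB/s, or about 4.5GB per hour. The per-hour threshold below is a made-up number for illustration, as are the function and variable names; the point is only that the same total volume slips through when spread over more time.

```python
def flag_high_volume(bytes_per_hour: list[int], threshold_bytes: int) -> list[int]:
    """Return the indexes of hours whose outbound volume exceeds the threshold."""
    return [i for i, sent in enumerate(bytes_per_hour) if sent > threshold_bytes]


THRESHOLD = 1_000_000_000       # hypothetical alert threshold: 1GB per hour

burst = [4_500_000_000] * 24    # 10Mbit/s sustained for one day
slow = [500_000_000] * 216      # the same total, trickled out over nine days

print(len(flag_high_volume(burst, THRESHOLD)))  # every hour is flagged
print(len(flag_high_volume(slow, THRESHOLD)))   # nothing is flagged
print(sum(burst) == sum(slow))                  # yet the same data has left
```

You can lengthen the window or lower the threshold, but the attacker can always go slower, and the false positives from legitimate traffic go up as you tighten.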

And to make things even worse, not all traffic can be proxied, so the firewall itself may need to inspect traffic. TLS 1.3 (and TLS 1.2 with good settings) is designed for Perfect Forward Secrecy (PFS), so the firewall cannot simply decrypt the traffic as it flows through.

Where is your border?

But all this presumes you know where your border is. Do you? The cloud changes all this. If you spin up an AWS VPC and create an Internet Gateway to that VPC then you’ve added a new border to your environment. Every vNet in Azure automatically has a gateway and so a border. If you use an S3 bucket then that’s potentially exposed to the internet; a border! Other cloud native services? Borders!

Even if you think you know all your borders, the posture may change without you knowing. In March 2021 Amazon introduced the EC2 Serial Console, which lets you access the console of your EC2 instance. This is really useful if you have broken your network config; it lets you get in and fix things. The problem is that this serial connection can be exposed to the internet via SSH. It requires a malicious person to do this (since the config needs to be refreshed every minute to persist the gateway) but this connection is direct to your VM, bypassing all your firewalls and your processes. It’s a new border! Oops! (At time of writing there’s no way to disable this SSH gateway without blocking serial console access totally because of how AWS has implemented the technology; there’s an enhancement request in!)

Other borders may be hidden. If you use Lambda functions (pretty common!) then additional triggers may be added. In particular, HTTP API gateways provide a way for the function to be called from the internet. Again it bypasses your firewalls and WAFs; it’s a new border. Amusingly, Alexa triggers can be added to Lambda functions. An intruder could potentially persist a foothold in your environment as an Alexa call (“Alexa, open the backdoor to megacorp”; “Alexa, read out all the stolen credit card numbers”). Are you monitoring for Alexa triggers attached to your Lambda functions? You should be; it’s a border that needs to be controlled!
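One way to approach that monitoring is to look at each function’s resource-based policy, which is where invoke permissions for triggers end up. The sketch below only does the analysis step: it assumes you have already fetched the policy JSON (for example via the Lambda GetPolicy API), and the list of watched principals, the function name and the sample policy are my own assumptions rather than a complete or authoritative set.

```python
import json

# Service principals treated as internet-facing triggers in this sketch.
# This list is an assumption; extend it to match your own environment.
WATCHED_PRINCIPALS = {
    "apigateway.amazonaws.com",
    "alexa-appkit.amazon.com",
    "alexa-connectedhome.amazon.com",
}


def external_triggers(policy_json: str) -> list[tuple[str, str]]:
    """Return (statement id, principal) pairs that grant a watched principal."""
    findings = []
    for stmt in json.loads(policy_json).get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal", {})
        if isinstance(principal, str):      # e.g. "Principal": "*"
            names = [principal]
        else:                               # e.g. {"Service": "..."} or a list
            names = []
            for value in principal.values():
                names.extend([value] if isinstance(value, str) else value)
        for name in names:
            if name == "*" or name in WATCHED_PRINCIPALS:
                findings.append((stmt.get("Sid", ""), name))
    return findings


# Hypothetical policy document for illustration only.
sample_policy = json.dumps({
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "bucket-events", "Effect": "Allow",
         "Principal": {"Service": "s3.amazonaws.com"},
         "Action": "lambda:InvokeFunction"},
        {"Sid": "backdoor", "Effect": "Allow",
         "Principal": {"Service": "alexa-appkit.amazon.com"},
         "Action": "lambda:InvokeFunction"},
    ],
})

print(external_triggers(sample_policy))
```

Run something like this across every function on a schedule and alert on any finding you can’t tie back to an approved deployment.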

So we can see that controlling the border is a lot more complicated in the world of cloud. All of these new borders need new technologies to detect, block and control. Your traditional firewall/web-proxy solution won’t work.

Zero Trust

You may look at Zero Trust Networking as the solution to this; if every endpoint is doing the inspection and validation of traffic then do we need a border? In my view the answer is “yes”. For example, developers make mistakes. Also not all technologies can be fully controlled (e.g. BMC ports, out of band consoles) to the same level as a VM (which can have agents installed).

We also have a much bigger configuration management problem to deal with, including how to be sure you’ve got 100% coverage (which in turn requires an accurate inventory of all assets, which is a hard problem!), activity logging and so on.

We’ve moved from needing to control a small number of border crossing points to controlling tens of thousands of endpoints; every server, every database, memcached, key/value stores… a green-field environment may be able to create an architecture that would work, but for the rest of us in brownfield waste dumps… nope! A lot of enterprise technology of the past 2 decades was designed with the assumption that it would be protected from the outside world (single factor LDAP authn? Ugh) and can’t easily be fixed.

Configuration management

Configuration management also applies to cloud environments. Many cloud providers allow changes to be performed from web consoles. Indeed many changes are easier from the console because it calls dozens of backend APIs to get the job done. Where possible you want to deploy and manage your cloud environments programmatically (e.g. using Terraform) but even there you must validate that the deployed configuration hasn’t drifted because of a mistake by an admin. This is critical for environments such as “Microsoft 365”; even if you use PowerShell to manage the environment it’s possible for an admin on the console to accidentally click the wrong button. Monitor your configurations and alert if they change!
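At its simplest, drift detection is a comparison between the configuration you intended and the configuration you actually find. The sketch below assumes both have already been exported as nested dictionaries; how you obtain them (Terraform state, provider APIs, an M365 export) is out of scope, and the setting names in the example are invented.

```python
def find_drift(expected: dict, deployed: dict, path: str = "") -> list[str]:
    """Return human-readable differences between two nested config dicts."""
    drift = []
    for key in sorted(set(expected) | set(deployed)):
        where = f"{path}.{key}" if path else str(key)
        if key not in deployed:
            drift.append(f"{where}: missing (expected {expected[key]!r})")
        elif key not in expected:
            drift.append(f"{where}: unexpected value {deployed[key]!r}")
        elif isinstance(expected[key], dict) and isinstance(deployed[key], dict):
            drift.extend(find_drift(expected[key], deployed[key], where))
        elif expected[key] != deployed[key]:
            drift.append(f"{where}: expected {expected[key]!r}, found {deployed[key]!r}")
    return drift


# Hypothetical settings, for illustration only.
expected = {"bucket": {"public_access_blocked": True, "versioning": True}}
deployed = {"bucket": {"public_access_blocked": False, "versioning": True}}

for line in find_drift(expected, deployed):
    print("ALERT", line)
```

The hard part in practice isn’t the comparison; it’s getting a complete and trustworthy view of the deployed state in the first place.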


But what about encryption? Can this help with DLP? Yes… and no!

As I wrote in Data At Rest Encryption, encryption can be deployed at different layers.

  • Block Level. This is useful for laptops (“oh no, I left my laptop in the taxi”) or mobile devices where the risk of physical loss of a device is something to be mitigated, but it doesn’t really help in a datacenter environment where the loss of a disk isn’t that high a risk.
  • Database level. This can help against an attacker getting access to the server, but isn’t a strong defense against someone with database credentials.
  • Application level. This now requires the application itself to be compromised (so you need credentials and encryption keys to get the data) but has some downsides (e.g. full text searching may not be possible) so isn’t always a practical solution.

Remember that everywhere you store your data you need to protect it; that includes backups! Depending on how the backup is done and the encryption used, you may need to ensure your backups are fully protected. There has been more than one data leak from badly protected backups.

Cloud services complicate the encryption world even more. Many services have “server side encryption” options; e.g. S3, RDS, EBS in Amazon can all be encrypted. The problem with this encryption is that it’s closely aligned to “block level” or “database level” encryption; if you get valid credentials then the data is transparently decrypted. The Capital One breach in 2019 had data stored encrypted this way but because the attacker was able to get credentials they were able to get access to the unencrypted text.

I, personally, don’t consider cloud server side encryption to be sufficient protection for data at rest. Wherever possible use application level encryption.

Scan on upload

You should also try to scan data once it’s been placed in storage; e.g. putting data in an S3 bucket or in OneDrive/SharePoint can raise an event that can trigger a scan of the object. If it’s been determined to be “bad” then remedial action (“delete!!”) can be taken programmatically. Of course this scanning still has the same problems previously discussed, but it will (again) detect mistakes.
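The shape of such a handler might look like the sketch below. It assumes an S3-style event notification (a list of records, each naming a bucket and a URL-encoded object key); the classifier is a deliberately naive placeholder, and `fetch` and `quarantine` are hypothetical callables standing in for whatever storage client you actually use.

```python
from typing import Callable
from urllib.parse import unquote_plus


def looks_sensitive(data: bytes) -> bool:
    """Placeholder classifier: flags objects carrying a hypothetical marker."""
    return b"CONFIDENTIAL" in data.upper()


def handle_upload_event(
    event: dict,
    fetch: Callable[[str, str], bytes],
    quarantine: Callable[[str, str], None],
) -> list[str]:
    """Scan each object named in the notification and quarantine any hits."""
    flagged = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        if looks_sensitive(fetch(bucket, key)):
            quarantine(bucket, key)
            flagged.append(f"{bucket}/{key}")
    return flagged


# In-memory stand-ins so the sketch runs without any cloud account.
store = {
    ("uploads", "q3 plan.txt"): b"Confidential: pricing for next quarter",
    ("uploads", "menu.txt"): b"lunch options",
}
removed = []
event = {"Records": [
    {"s3": {"bucket": {"name": "uploads"}, "object": {"key": "q3+plan.txt"}}},
    {"s3": {"bucket": {"name": "uploads"}, "object": {"key": "menu.txt"}}},
]}

flagged = handle_upload_event(
    event,
    fetch=lambda b, k: store[(b, k)],
    quarantine=lambda b, k: removed.append((b, k)),
)
print(flagged)
```

Passing the storage operations in as functions keeps the scanning logic testable on its own, and lets you swap “delete” for “move to a quarantine bucket” without touching the scanner.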

Data Destruction

That’s not to say that cloud level encryption is useless. If you can encrypt it with a customer managed key (CMK) that you can delete then you have the ability to delete data (by cryptoshredding) when you no longer need it (e.g. exiting a cloud provider). Remember that data needs to be protected wherever it is; when you leave an environment you should ensure all traces are destroyed, and deleting the encryption key means any data remnants that may remain are unreadable.

This CMK requirement can rule out some services. For example, some parts of the Azure DevOps suite use shared services at the backend that are controlled and managed by Microsoft and so use Microsoft managed keys. Your source code may be confidential data and a corporate asset; do you want to trust it to an environment where you can’t be sure of deletion?

SaaS and DLP

And then we get to SaaS offerings. These providers need access to your data so they can process it. If you just send them encrypted blobs then they won’t be able to do much work. So, remembering that data needs to be protected everywhere, you need to look at how the SaaS provider protects it. Do they let you use a CMK? Are you comfortable with their encryption processes? How is access to the SaaS offering managed? This is effectively a new border! How can you control this one?


As we can see, DLP is not an easy topic. There’s no good answer here, just various levels of “not so good”.

And this is where risk management comes in. What is the impact to the company if this set of data is exposed (including reputational impact, which still applies even if the data was encrypted) versus the benefit to the company of processing data in this way? Because of the limited border crossings in the traditional world we had a pretty good idea of the size of the risk and we knew how to mitigate much of it (e.g. endpoint security controls and agents; remember all the cyber pillars are involved!).

But in this new world where data is everywhere and every location has unique controls and traditional solutions can’t be deployed, we need to look at alternate mitigations and controls and evaluate the risk on a case by case basis.

DLP is hard!