Let’s say you run an ecommerce web site. As part of this people create accounts. Naturally those accounts have passwords. Do you encrypt those passwords before storing them in the database?
All you people who put your hands up and answered yes… put them down; you’re wrong. You should hash the data instead. This may sound like a “tomato/tomato” thing, but it’s not.
Encryption and hashing may both be cryptographic in nature, but there’s a subtle difference between them.
Encryption is the act of taking a plain text message and encoding it. This process is reversible; someone with the right knowledge could take the cipher text (as the encoded data is known) and decrypt it back to the original plain text.
An early example that is sometimes taught to school kids is the “Caesar Cipher”. Old usenet nerds may know of a special case of this; “ROT 13”.
With the Caesar cipher a number is picked, and the characters moved forward that many. So, for example, if the number was “3” then we would move “A”->“D”, “B”->“E”, … “X”->“A”, “Y”->“B”, “Z”->“C”.
In this scenario the plain-text “HELLO THERE” would become “KHOOR WKHUH”.
To decrypt that string you would need to know the magic number (“3”), and then rotate backwards.
Now that’s a pretty simple cipher, and you wouldn’t use it in the real world!
Today you would use algorithms such as AES256 to encrypt the data, which is the AES algorithm with 256bit key sizes.
Typically encryption is used when the original plain text needs to be protected and be recoverable. A password manager would use encryption to protect my passwords from being stolen. Only someone knowing the encryption key (me!) would be able to decode the values and get my password back.
Symmetrical vs asymmetrical.
Algorithms such as AES are considered symmetric because the same key is used to encrypt and decrypt the data. If you wanted to share data with someone else then they also need to know this key. Thus the key is a shared secret.
Asymmetrical encryption uses two different keys, which have a relationship to each other. A common one is the RSA algorithm which is based on some clever facts around prime numbers. We typically call one of these keys the public key, and the other one the private key (hence another name for this is “public key cryptography”). The public key is used to encrypt the data. It normally doesn’t matter that the public key is known (hence the name). The data can only be decrypted if the private key is known. This is the basis for SSL/TLS encryption used on the internet.
Symmetric and Asymmetric working together
Many asymmetric algorithms are slow and not always suitable for large chunks of data. RSA, for example, can’t encrypt data larger than the key size. In situations where we want asymmetric (e.g. the web) what we can do is pick a random key and use a faster symmetric solution for the data. We use the slower RSA system to encrypt that random key, and both the encrypted key and the encrypted data are sent to the other person. Using their private key they can decrypt the random value and then use that to decrypt the main data.
One problem with many cryptographic schemes is that the same plain-text will lead to the same cipher text. This means that an evil person looking at the messages might be able to see patterns; if the word “OK” is encrypted the same way each time then that evil person can make a guess.
To avoid this a salt value can be prepended to the plain-text before encrypting. This salt can be random; it really doesn’t matter what it is. Now the same plain text (“OK”) is made into a different message each time, and so the encrypted data is different each time.
So if we need to use encryption the data back, why did I say encryption was wrong for storing customer passwords at the beginning of this post?
Hashing is a one way function. The idea is to come up with a piece of data that represents the original text. In this scenario the plain text is converted to a hash value.
A simple case, if we said that “A”=1, “B”=2, “C”=3… then we could define a simple hash of just adding up the values. Now the word “HELLO” would be hashed as 52 ( 8+5+12+12+15 ).
There is no way of getting back from “52” to the original plain text, which is why these are called “one way” or “trap door” functions.
There are some common use cases for file integrity and checking. For example, if I have a file I could generate a hash value for it. Now at any point in the future I could re-compute the hash of file; if the hash value is the same then I can be confident the file has not changed. Similarly if I send you a copy of the file then you can compute the hash and if it matches my copy then you can be confident you’ve received the file.
The other main use case are for passwords. We don’t actually need to store the customer password in our database; we can store the hash value. Now when the customer wants to login, we can calculate the hash of their input; if it matches what we have saved we can be confident they entered the right password.
Now in the use cases I said “confident”…
The earlier hash algorithm is bad and many strings would cause the same
result (“OHELL” would also give 52). But if we pick a good hash then
we can give a probability of how likely it is for someone to enter
a different value that collides. For example, the
returns a 128bit number, which means that the chances of two pieces of
plain text colliding are 2-128. If you had 264
pieces of plain-text then you might expect to find a collision; a very small chance indeed.
md5algorithm has a number of weaknesses, so it’s no longer considered a strong hash. Don’t use it.
With the right algorithm (eg SHA256) the risk of collision is considered so tiny that the chance of someone entering the wrong password on a login screen and getting the right hash is so vanishingly small that we can ignore it.
There’s another problem with hashing, though.
If I manage to get a copy of your hash file then I could try and guess what plain text was used to generate the hash. This is relatively simple for password files because people tend to pick easy to remember passwords. I could simply brute force attempt a lot of passwords and see what ones match. But there’s a simpler way; since the hash has to be consistent for a given plain text we could precompute all the common passwords and store the results. The result is called a rainbow table. Sure, they can be large but storage is cheap.
Now, with a rainbow table, converting from the hash back to a password is as simple a looking up the hash value.
To make this harder we can use a salt value. This is randomly chosen each time. For SHA256 this salt is 16 characters, resulting in the same plain text being hashed in one of 296 different ways. This can make rainbow tables impractical, and require the attacker to use brute force.
Now modern GPUs can make brute force attacks fast - over 4.7 billion md5s per second! Even SHA-1 is fast; it may take hours per password… but that’s not much money on an AWS-GPU cluster for a determined attacker!
Fortunately new algorithms (eg bcrypt and PBKDF2) have been designed to be slow to calculate. It doesn’t matter if it takes your login page an extra second to calculate the hash; it won’t do it very often! But slowing down brute force attacks can make them impractically expensive… today.
So if hash files and salts are not resilient to brute force attacks, why did I recommend hashing for customer password files? Simply because encryption requires the server to have a copy of the decryption key available (unless you have a HSM), so theft of an encrypted password file would also likely lead to theft of the decryption key, making it even easier for the attacker.
When you don’t need to recover the plain text, hashing is normally always a better solution. Just remember to pick a good algorithm that’s resistant to brute force attacks!