Let’s say you run an ecommerce web site. As part of this people create accounts. Naturally those accounts have passwords. Do you encrypt those passwords before storing them in the database?

All you people who put your hands up and answered yes… put them down;
you’re wrong. You should *hash* the data instead. This may sound like
a “tomato/tomato” thing, but it’s not.

Encryption and hashing may both be *cryptographic* in nature, but there’s a subtle difference between them.

# Encryption

Encryption is the act of taking a plain text message and encoding it.
This process is *reversible*; someone with the right knowledge could
take the cipher text (as the encoded data is known) and decrypt it back
to the original plain text.

An early example that is sometimes taught to school kids is the “Caesar Cipher”. Old usenet nerds may know of a special case of this; “ROT 13”.

With the Caesar cipher a number is picked, and the characters moved forward that many. So, for example, if the number was “3” then we would move “A”->“D”, “B”->“E”, … “X”->“A”, “Y”->“B”, “Z”->“C”.

In this scenario the plain-text “HELLO THERE” would become “KHOOR WKHUH”.

To decrypt that string you would need to know the magic number (“3”), and
then rotate *backwards*.

Now that’s a pretty simple cipher, and you wouldn’t use it in the real world!

Today you would use algorithms such as AES256 to encrypt the data, which is the AES algorithm with 256bit key sizes.

## Use cases

Typically encryption is used when the original plain text
needs to be protected *and* be recoverable. A password
manager would use
encryption to protect my passwords from being stolen. Only someone
knowing the encryption key (me!) would be able to decode the values and
get my password back.

## Symmetrical vs asymmetrical.

Algorithms such as AES are considered *symmetric* because the same key
is used to encrypt and decrypt the data. If you wanted to share data
with someone else then they also need to know this key. Thus the key
is a *shared secret*.

Asymmetrical encryption uses two different keys, which have a relationship
to each other. A common one is the RSA algorithm which is based on some clever facts around
prime numbers. We typically call one of these keys the *public* key, and
the other one the *private* key (hence another name for this is “public key
cryptography”). The public key is used to encrypt the data. It normally
doesn’t matter that the public key is known (hence the name). The data
can only be decrypted if the private key is known. This is the basis
for SSL/TLS encryption used on the internet.

## Symmetric and Asymmetric working together

Many asymmetric algorithms are slow and not always suitable for large chunks of data. RSA, for example, can’t encrypt data larger than the key size. In situations where we want asymmetric (e.g. the web) what we can do is pick a random key and use a faster symmetric solution for the data. We use the slower RSA system to encrypt that random key, and both the encrypted key and the encrypted data are sent to the other person. Using their private key they can decrypt the random value and then use that to decrypt the main data.

## Salting

One problem with many cryptographic schemes is that the same plain-text will lead to the same cipher text. This means that an evil person looking at the messages might be able to see patterns; if the word “OK” is encrypted the same way each time then that evil person can make a guess.

To avoid this a *salt* value can be prepended to the plain-text before
encrypting. This salt can be random; it really doesn’t matter what
it is. Now the same plain text (“OK”) is made into a different message
each time, and so the encrypted data is different each time.

# Hashing

So if we need to use encryption the data back, why did I say encryption was wrong for storing customer passwords at the beginning of this post?

Hashing is a *one way* function. The idea is to come up with a piece
of data that represents the original text. In this scenario the
plain text is converted to a hash value.

A simple case, if we said that “A”=1, “B”=2, “C”=3… then we could define a simple hash of just adding up the values. Now the word “HELLO” would be hashed as 52 ( 8+5+12+12+15 ).

There is no way of getting back from “52” to the original plain text, which is why these are called “one way” or “trap door” functions.

## Use cases

There are some common use cases for file integrity and checking. For example, if I have a file I could generate a hash value for it. Now at any point in the future I could re-compute the hash of file; if the hash value is the same then I can be confident the file has not changed. Similarly if I send you a copy of the file then you can compute the hash and if it matches my copy then you can be confident you’ve received the file.

The other main use case are for passwords. We don’t actually need to store the customer password in our database; we can store the hash value. Now when the customer wants to login, we can calculate the hash of their input; if it matches what we have saved we can be confident they entered the right password.

## Collisions

Now in the use cases I said “confident”…

The earlier hash algorithm is *bad* and many strings would cause the same
result (“OHELL” would also give 52). But if we pick a good hash then
we can give a probability of how likely it is for someone to enter
a different value that collides. For example, the `md5`

algorithm
returns a 128bit number, which means that the chances of two pieces of
plain text colliding are 2^{-128}. If you had 2^{64}
pieces of plain-text then you might expect to find a collision; a very small chance indeed.

Unfortunately, the

`md5`

algorithm has a number of weaknesses, so it’s no longer considered a strong hash. Don’t use it.

With the right algorithm (eg SHA256) the risk of collision is considered so tiny that the chance of someone entering the wrong password on a login screen and getting the right hash is so vanishingly small that we can ignore it.

## Salting

There’s another problem with hashing, though.

If I manage to get a copy of your hash file then I could try and guess
what plain text was used to generate the hash. This is relatively
simple for password files because people tend to pick easy to remember
passwords. I could simply *brute force* attempt a lot of passwords
and see what ones match. But there’s a simpler way; since the hash
has to be consistent for a given plain text we could precompute all
the common passwords and store the results. The result is called a
rainbow table. Sure,
they can be large but storage is cheap.

Now, with a rainbow table, converting from the hash back to a password is as simple a looking up the hash value.

To make this harder we can use a *salt* value. This is randomly chosen
each time. For SHA256 this salt is 16 characters, resulting in the
same plain text being hashed in one of 2^{96} different ways.
This can make rainbow tables impractical, and require the
attacker to use brute force.

Now modern GPUs can make brute force attacks fast - over 4.7 *billion* md5s per second!
Even SHA-1 is fast; it may take hours per password… but that’s not
much money on an AWS-GPU cluster for a determined attacker!

Fortunately new algorithms (eg bcrypt and PBKDF2) have been designed to be slow to calculate. It doesn’t matter if it takes your login page an extra second to calculate the hash; it won’t do it very often! But slowing down brute force attacks can make them impractically expensive… today.

# Summary

So if hash files and salts are not resilient to brute force attacks,
why did I recommend hashing for customer password files? Simply because
*encryption* requires the server to have a copy of the decryption key
available (unless you have a HSM), so theft
of an encrypted password file would also likely lead to theft of the
decryption key, making it even easier for the attacker.

When you don’t need to recover the plain text, hashing is normally always a better solution. Just remember to pick a good algorithm that’s resistant to brute force attacks!