Understanding the Security of Cryptographic Hash Functions

Hash functions are an extremely versatile tool that you can find nearly everywhere in software engineering.

This article intends to give you a deeper understanding of cryptographic hash functions, when to use them, and how to think about their security properties.

A hash function is a function (in both the math and programming sense) that is “one way” – meaning given the output $y$, it’s near impossible to find the input $x$.

H(x) \rightarrow y

Hash functions are useful for cryptographic and non-cryptographic applications. For example, Python has a non-cryptographic hash() function built into the standard library. If you call it a bunch of times on different inputs, you get different outputs. If you call it with the same input, you get the same output.

# Python's hash() outputs a signed 64-bit integer (on 64-bit platforms).
>>> hash('rutabaga')
-7515648070926626588
>>> hash('fishtaco')
8880674569626935499
# The same input always maps to the same output
# (within one interpreter run; string hashes are randomized per process).
>>> hash('rutabaga')
-7515648070926626588
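
Python’s built-in hash() is designed for hash tables, not security. For the cryptographic hash functions this post is about, the standard library’s hashlib module is the usual starting point. A quick sketch with SHA-256 (the digest strings are omitted here, but the function is deterministic, so you’ll get the same ones every time you run it):

>>> import hashlib
>>> digest = hashlib.sha256(b'rutabaga').hexdigest()
>>> len(digest)  # 256 bits of output, rendered as 64 hex characters
64
>>> hashlib.sha256(b'rutabaga').hexdigest() == digest  # same input, same output
True
>>> hashlib.sha256(b'fishtaco').hexdigest() == digest  # different input, different output
False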

How do you define the security of hash functions?

How do you choose which hash function to use, and what security properties do they guarantee? There are three security models for hash functions, described below.

1. Preimage attacks

A preimage attack is: given an output, find an input that produces it.

\text{given } y = H(x)\text{, find } x

Doing so would require exhaustively checking possible values of $x$. For SHA-256, a hash function with an output space of size $2^{256}$, the cost of this attack is $O(2^{256})$, which is currently computationally infeasible. Here’s a great resource that explains how big $2^{256}$ is.

Resistance to preimage attacks is an absolute minimum requirement for hash functions, since it means the function is, in practice, “one way.”
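
To make the brute-force nature of a preimage attack concrete, here’s a minimal sketch in Python. It truncates SHA-256 down to 20 bits purely so the search finishes quickly; against the full 256-bit output, the same loop is the $O(2^{256})$ search described above. The truncation helper and the way candidates are generated are illustrative assumptions, not part of any real attack.

import hashlib
from itertools import count

def truncated_sha256(data: bytes, bits: int = 20) -> int:
    """SHA-256 truncated to `bits` bits, so that brute force actually terminates."""
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest, 'big') >> (256 - bits)

target = truncated_sha256(b'rutabaga')  # the output y we want to invert

# Exhaustively try candidate inputs until one hashes to the target.
# With 20 bits this takes about 2**20 (~one million) attempts on average;
# with the full 256 bits it would take about 2**256.
for i in count():
    candidate = str(i).encode()
    if truncated_sha256(candidate) == target:
        # Note: the preimage found is almost certainly not b'rutabaga' itself --
        # any input that produces the target output counts.
        print(f'found a preimage {candidate!r} after {i + 1} attempts')
        break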

2. Second preimage attacks

A second preimage attack is: given an input/output pair, find a different input that produces the same output.

\text{given } x \text{ and } y = H(x)\text{, find } x' \text{ such that } H(x') = H(x) = y \text{ (and } x' \neq x\text{)}

For a well-designed hash function, second preimages are just as hard to find as first preimages: the best generic attack is the same exhaustive $O(2^{256})$ search.

When is 2nd preimage resistance important?

The downloaded app checksum example above illustrates this attack scenario. If an attacker can craft another app that hashes to the same value, they can swap the real app for theirs in transit. The checksum would still verify successfully, and you’d end up running the attacker’s code.
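
As a sketch of what that verification step looks like (the file name and published digest below are hypothetical), the downloader recomputes the hash and compares it to the value the developer published. Second preimage resistance is exactly what prevents a different file from passing this check.

import hashlib

# Hypothetical values: in practice the vendor publishes the real digest
# alongside the download link.
EXPECTED_SHA256 = '<hex digest published by the developer>'

def sha256_of_file(path: str) -> str:
    """Hash a file in chunks so large downloads don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 16), b''):
            h.update(chunk)
    return h.hexdigest()

if sha256_of_file('some-app-installer.dmg') == EXPECTED_SHA256:
    print('checksum OK, safe to install')
else:
    print('checksum mismatch, do not run this file')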

2nd preimage resistance likely matters for most applications of hash functions.

3. Hash collisions

A collision happens when an attacker can find two inputs $x_1$ and $x_2$ that both map to the same output.

\text{find any } x_1, x_2 \text{ such that } H(x_1) = H(x_2) = y \text{ (and } x_1 \neq x_2\text{)}

This differs from a second preimage attack because the attacker can choose both inputs, making it easier to accomplish.

Collision resistance is important when you’re relying on a hash function to always map different inputs to different values, like in the content-addressable storage example above. But wait: since our hash function accepts inputs of any size and always returns a fixed-size output, we know collisions must exist.

SHA-1, a hash function with a 160-bit output size, has been broken for collision attacks; the first public SHA-1 collision was demonstrated in 2017.

Collisions are guaranteed to exist

We can prove this using the pigeonhole principle: if the input space is bigger than the output space, then some inputs must map to the same output.
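
You can watch the pigeonhole principle at work with a toy version of a hash. The sketch below (an illustration, not a real construction) truncates SHA-256 to its first byte, leaving only 256 possible outputs; hashing 257 distinct inputs therefore guarantees at least one collision.

import hashlib
from collections import defaultdict

def tiny_hash(data: bytes) -> int:
    """SHA-256 truncated to its first byte: only 256 possible outputs."""
    return hashlib.sha256(data).digest()[0]

buckets = defaultdict(list)
for i in range(257):  # 257 distinct inputs into 256 possible outputs
    x = str(i).encode()
    buckets[tiny_hash(x)].append(x)

colliding = {output: inputs for output, inputs in buckets.items() if len(inputs) > 1}
print(f'{len(colliding)} outputs have two or more inputs mapping to them')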

But even though we know collisions exist, we still use a certain subset of hash functions for cryptography: ones for which nobody knows a practical way to actually find a collision. These are called collision-resistant hash functions or cryptographic hash functions.

Visualizing collision resistance

Imagine that the ground in front of you is covered with pennies, each painted one of 1024 colors. If you wanted to find any two pennies of the same color, you’d only need to look at about 32 of them. Super easy, right? That would take maybe a couple of seconds. This is because of the birthday paradox: the number of pennies you need to examine is roughly $\sqrt{1024} = 32$.

The birthday paradox goes like this: how many people do you need to add to a room before it’s likely that two of them share the same birthday? The answer is lower than you might expect: roughly $\sqrt{365} \approx 19$ people (the exact threshold for a better-than-even chance is 23, but the square-root rule gives the right ballpark).
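
If you want to sanity-check the square-root rule, a small simulation will do it: draw random colors out of 1024 possibilities and count how many draws happen before one repeats. A sketch (the trial count is arbitrary):

import random

def draws_until_repeat(num_colors: int = 1024) -> int:
    """Draw random colors (with replacement) until one repeats; return the draw count."""
    seen = set()
    while True:
        color = random.randrange(num_colors)
        if color in seen:
            return len(seen) + 1
        seen.add(color)

trials = 10_000
average = sum(draws_until_repeat() for _ in range(trials)) / trials
print(f'average draws until a repeated color: {average:.1f}')
# Prints roughly 40 -- the same ballpark as sqrt(1024) = 32.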

Now let’s say the color painted on each penny is determined by the output of the collision-resistant hash function SHA-256. Because the output of SHA-256 is 256 bits, each penny is now painted one of $2^{256}$ possible colors. Now shrink each penny down to the size of an atom.

There are somewhere between $10^{78}$ and $10^{82}$ atoms in the observable universe, and $2^{256} \approx 10^{77}$, so there are roughly as many possible penny colors as there are atoms in the universe.

If you tried to find a penny of one specific color, you’d need to look at roughly every atom in the observable universe. Finding any two pennies of the same color is easier thanks to the birthday paradox, but it still means examining about $\sqrt{2^{256}} = 2^{128} \approx 3.4 \times 10^{38}$ pennies.

A computational perspective: we know computers are much faster than humans at checking for matching pennies. Are they fast enough to make a difference?

The answer, as you may have guessed, is no. This random ASIC Bitcoin miner I picked can compute 65 GH/s, or 65 billion hashes per second. To compute the $\sqrt{2^{256}} = 2^{128}$ hashes needed to produce a collision, even if you somehow acquired 10,000 of these miners (and your own personal hydroelectric dam), you’d have to wait:

\frac{2^{128}\text{ hashes}}{65 \times 10^{9}\text{ hashes/sec} \cdot 10{,}000\text{ miners}} \approx 5.235 \times 10^{23}\text{ seconds}

In other words, you and your 10,000 miners would have to wait about $10^{16}$ years (roughly the number of meters in a parsec) to find a single collision.
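
If you’d like to check that arithmetic yourself, it’s only a few lines (using the same assumed 65 GH/s per miner and 10,000-miner fleet as above):

hashes_needed = 2 ** 128               # birthday bound for a 256-bit output
hash_rate = 65e9 * 10_000              # 65 GH/s per miner * 10,000 miners
seconds = hashes_needed / hash_rate
years = seconds / (60 * 60 * 24 * 365)
print(f'{seconds:.3e} seconds = {years:.3e} years')
# Prints about 5.235e+23 seconds, i.e. roughly 1.66e+16 years.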


A shorter version of this post was initially published on April 1, 2018. This is an expanded and rewritten version. Thanks to the folks on Lobste.rs for feedback and corrections on the initial version of this post.

Also check out The Code Monkey’s Guide to Cryptographic Hashes for Content-based Addressing.