Understanding the Security of Cryptographic Hash Functions

Edited 2018-04-10 to fix second preimage attack definition.

Collision resistant hash functions like SHA256 take an arbitrary input (of length $n$) and map it seemingly randomly into a fixed-length output space (of length $\ell$). They have many purposes, like verifying downloaded programs or guessing the next block in a blockchain.

$$H : \{0,1\}^n \rightarrow \{0,1\}^\ell$$

The security of these functions rest on the property that they’re “one way,” meaning given the output, it’s impossible for an efficient attacker to find out what the input was (this is known as a first preimage attack). We also rely on hash functions to reliably map different inputs to different outputs. Given an input $x$, if an attacker can find another $x’$ such that $x’ \neq x$ and $H(x’) = H(x)$ this is called a second preimage attack. For example, if you’re verifying a program you downloaded, you want to make sure an attacker can’t write a malicious program that would map to the same hash digest or else you’d have no way of knowing the program you’re about to run is the one you intended or the attacker’s. A collision happens when an attacker can find two arbitrary inputs (they choose both $x_0$ and $x_1$, where $x_0 \neq x_1$) that both map to the same output: $H(x_0) = H(x_1)$. As pointed out by hanno on Lobste.rs, some hash functions like SHA-1 and MD5 are vulnerable to collisions but not to second preimage attacks.

But here’s the paradox: we know for sure that many collisions must exist for any possible output. It’s just really difficult to find these collisions. But how difficult is it?

Imagine you have a penny. That penny is painted one of $2^{10} = 1024$ colors. Now imagine that the ground is covered with those pennies. If you had one penny in your hand and wanted to find another penny of the same color on the ground, you’d need to walk around and look at on average $\sqrt{1024} = 32$ pennies. Super easy, right? That would take maybe a couple seconds to do. This is because of the birthday paradox.

The birthday paradox goes like this: how many people do you need to add to a room before two of them share the same birthday? The answer is lower than you might expect, at around $\sqrt{365} = 19.105$ people.

Now let’s say the color painted on the penny is determined by the output of the collision-resistant hash function SHA256. Because the output of SHA256 is 256 bits, now each penny is painted one of $2^{256}$ unique colors. Now shrink the penny down to the size of an atom. If you had an atom-sized penny and tried to find another atom-sized penny of the same color, even if looked at every atom in the observable universe you still would not find a match.

A computational perspective

We know computers are much faster than humans at checking for matching pennies. Are they fast enough to make a difference?

The answer, as you may have guessed, is no. This random ASIC Bitcoin miner I picked can compute 65GH/s, or 65 billion hashes per second. In order to compute the necessary $\sqrt{2^{256}} = 2^{128}$ hashes to produce a collision, if you somehow acquired 10,000 of these miners (and your own personal hydroelectric dam), you’d have to wait:

$$\frac{2^{128}\ \text{hashes}}{65B\ \text{hashes/sec} \cdot 10K\ \text{miners}} = 5.234 \times 10^{23}$$

In other words, you and your 10,000 miners would have to wait $10^{16}$ (or about the number of meters in a parsec) years to find a hash.