# Understanding the Security of Cryptographic Hash Functions

A shorter version of this post was initially published on April 1, 2018. This is an expanded and rewritten version.

A **hash function** is a function (in both the math and programming sense) that is **“one way”** – meaning given the output $y$, it’s near impossible to find the input $x$.

$$H(x) \rightarrow y$$

Hash functions are extremely versatile and are found in all parts of software engineering.

- They’re how a a Bitcoin miner tries to guess the next block in a blockchain.
- git stores files internally using the hash of the content as the filename (content-based addressing).
- When you download an app, you verify that the app you’re about to run is the app you intended to download by computing a checksum (a hash) of the downloaded file and comparing it to one provided by the author.
- They’re used extensively in cryptography. In nearly every cryptographic construction, you’ll find one.

Hash functions are useful for cryptographic and non-cryptographic applications. For example, Python has a non-cryptographic `hash()`

function built into the standard library. If you call it a bunch of times on different inputs, you get different outputs. If you call it with the same input, you get the same output.

```
# Python's hash() always outputs a signed 64 bit integer.
>>> hash('rutabaga')
-7515648070926626588
>>> hash('fishtaco')
8880674569626935499
# The same input always maps to the same output.
>>> hash('rutabaga')
-7515648070926626588
```

## Why is it important to understand the security of hash functions?

How do you know how to choose a secure hash function? (Or in other words, why is it important to never use Python’s `hash()`

for cryptography?) Along your journey as a Software Engineer, you’ll likely be faced with this question. In order to answer this question we need to more clearly define the three security models for hash functions.

### 1. Preimage attacks

A **preimage attack** is, given the output, find the input that produced it.

*given $y = H(x)$, find $x$*

Doing so would require exhaustively checking all possible values of $x$. The cost of attacking SHA256, for instance, is $O(2^{256})$. This is currently computationally infeasible. Here’s a great resource that explains how big $2^{256}$ is.

Resistance to preimage attacks is an absolute minimum requirement for hash functions, since it means the function is, in practice, “one way.”

### 2. Second preimage attacks

A **second preimage attack** is, given a pair of input/output values, find *another* input that produces the same output.

*given $x$ and $y = H(x)$, find $x’$ such that $H(x’) = H(x) = y$ (and $x’ \neq x$)*

**Why is 2nd preimage resistance important?**

The downloaded app checksum example above is an example of this attack scenario. If an attacker can write another app that hashes to the same value, they can replace the real app for their app in transit. The checksum would verify successfully, and then you’d be running the attacker’s code.

### 3. Hash collision

A **collision** happens when an attacker can find two inputs $x_1$ and $x_2$ that both map to the same output.

*find any $x_1,x_2$ such that $H(x_1) = H(x_2) = y$ (and $x_1 \neq x_2$)*

This differs from a **second preimage attack** because the attacker can choose both inputs, making it easier to accomplish.

**Collision resistance** is important when you’re relying on hash functions to always map two inputs to two different values, like in the content-addressable storage example above. But wait, since our hash function takes anything from tiny to huge-sized inputs and always returns a fixed-size output, we know collisions **for sure** exist for all outputs.

## Collisions are guaranteed to exist

We can prove this using the pidgeonhole principle. In other words, if the input space is bigger than the output space, then we know that many inputs must map to the same output.

But even though we know collisions exist, we still use a certain subset of hash functions for cryptography. These are called collision resistant hash functions or cryptographic hash functions.

## Visualizing collision resistance

Imagine that the ground in front of you is covered with pennies, each painted one of $1024$ colors. If you wanted to find any two pennies of the same color on the ground, you’d need look at about $32$ pennies. Super easy, right? That would take maybe a couple seconds to do. This is because of the birthday paradox, which is computed as $\sqrt{1024} = 32$.

### The birthday paradox 🎂 🎉

The **birthday paradox** goes like this: how many people do you need to add to a room before two of them share the same birthday? The answer is lower than you might expect, at around $\sqrt{365} \approx 19$ people.

### Now this time, with more color

Let’s say the color painted on the penny is determined by the output of the collision-resistant hash function SHA-256. Because the output of SHA-256 is 256 bits, now each penny is painted one of $2^{256}$ unique colors. Now shrink the penny down to the size of an atom.

If you tried to find two atom-sized pennies of the same color, you’d roughly need to look at **every atom in the observable universe.**

## A computational perspective

We know computers are much faster than humans at checking for matching pennies. Are they fast enough to make a difference?

The answer, as you may have guessed, is no. This random ASIC Bitcoin miner I picked can compute 65GH/s, or 65 billion hashes per second. In order to compute the necessary $\sqrt{2^{256}} = 2^{128}$ hashes to produce a collision, if you somehow acquired 10,000 of these miners (and your own personal hydroelectric dam), you’d have to wait:

$$\frac{2^{128}\ \text{hashes}}{65B\ \text{hashes/sec} \cdot 10K\ \text{miners}} = 5.234 \times 10^{23}\ \text{seconds}$$

In other words, you and your 10,000 miners would have to wait $10^{16}$ **years** (or about the number of meters in a parsec) to find a hash.

Thanks to the folks on Lobste.rs for feedback and corrections on the initial version of this post.

Also check out The Code Monkey’s Guide to Cryptographic Hashes for Content-based Addressing.