# Understanding the Security of Cryptographic Hash Functions

Hash functions are an extremely versatile tool that you can find nearly everywhere in software engineering.

- Bitcoin mining is done by repeatedly computing SHA256 hashes.
- git stores files internally using the SHA-1 hash of the content as the filename (content-based addressing).
- When you download an app, your phone verifies that the app you’re about to run is the app you intended to download by computing a checksum (a hash of the entire app) and comparing it to one provided by the app store. If the values are different, that means someone might have changed the contents of the app.

This article intends to give you a deeper understanding of cryptographic hash functions, when to use them, and how to think about their security properties.

A **hash function** is a function (in both the math and programming sense) that
is **“one way”** – meaning given the output $y$
, it’s
near impossible to find the input $x$
.

$H(x) \rightarrow y$

Hash functions are useful for cryptographic and non-cryptographic applications.
For example, Python has a non-cryptographic `hash()`

function built into the
standard library. It’s used for comparing values and building dictionaries.

Python’s `hash()`

demonstrates some of the basic requirements for a hash
function: if you call it a with different inputs, you get different outputs. If
you call it with the same input, you always get the same output.

```
# Python's hash() always outputs a signed 64 bit integer.
>>> hash('rutabaga')
-7515648070926626588
>>> hash('fishtaco')
8880674569626935499
# The same input always maps to the same output.
>>> hash('rutabaga')
-7515648070926626588
```

# How do you define the security of hash functions?

How do you know how to choose which secure hash functions to use and what security properites to they guarantee? There are three security models for hash functions, described below.

# 1. Preimage resistance

A **preimage attack** is, given the output, find the input that produced it.

$\text{given }y = H(x)\text{, find }x$

Doing so would require exhaustively checking all possible values of $x$ . The cost of attacking SHA256, a hash function with the output space the size of $|2^{256}|$ for instance, is $O(2^{256})$ . This is currently computationally infeasible. Here’s a great resource that explains how big $2^{256}$ is.

Preimage resistance is an absolute minimum requirement for hash functions, since it means the function is, in practice, “one way.”

# 2. Second preimage resistance

A **second preimage attack** is, given a pair of input/output values, find
*another* input that produces the same output.

Second preimages are substantially more difficult to find than first preimages.

## When is 2nd preimage resistance important?

The downloaded app checksum example above is an example of this attack scenario. If an attacker can write another app that hashes to the same value, they can replace the real app for their app in transit. The checksum would verify successfully, and then you’d be running the attacker’s code.

2nd preimage resistance likely applies to most applications of hash functions.

# 3. Hash collisions

A **collision** happens when an attacker can find two inputs
$x_1$
and $x_2$
that both map to the
same output.

This differs from a **second preimage attack** because the attacker can choose
both inputs, making it easier to accomplish.

**Collision resistance** is important when you’re relying on hash functions to
always map two inputs to two different values, like in the content-addressable
storage example above. But wait, since our hash function takes anything from
tiny to huge-sized inputs and always returns a fixed-size output, we know
collisions **for sure** exist for all outputs.

SHA-1, a hash function with a 160-bit output size, has been broken for collision attacks.

## Collisions are guaranteed to exist

We can prove this using the pidgeonhole principle. In other words, if the input space is bigger than the output space, then we know that many inputs must map to the same output.

But even though we know collisions exist, we still use a certain subset of hash functions for cryptography. These are called collision resistant hash functions or cryptographic hash functions.

## Visualizing collision resistance

Imagine that the ground in front of you is covered with pennies, each painted one of 1024 colors. If you wanted to find any two pennies of the same color on the ground, you’d need look at about 32 pennies. Super easy, right? That would take maybe a couple seconds to do. This is because of the birthday paradox, which is computed as $\sqrt{1024} = 32$ .

The **birthday paradox** goes like this: how many people do you need to add to
a room before two of them share the same birthday? The answer is lower than you
might expect, at around $\sqrt{365} \approx 19$
people.

Now let’s say the color painted on the penny is determined by the output of the collision-resistant hash function SHA-256. Because the output of SHA-256 is 256 bits, now each penny is painted one of $2^{256}$ unique colors. Now shrink the penny down to the size of an atom.

If you tried to find two atom-sized pennies of the same color, you’d roughly need to look at **every atom in the observable universe.**

**A computational perspective:** we know computers are much faster than humans
at checking for matching pennies. Are they fast enough to make a difference?

The answer, as you may have guessed, is no. This random ASIC Bitcoin miner I picked can compute 65GH/s, or 65 billion hashes per second. In order to compute the necessary $\sqrt{2^{256}} = 2^{128}$ hashes to produce a collision, if you somehow acquired 10,000 of these miners (and your own personal hydroelectric dam), you’d have to wait:

$\frac{2^{128}\text{ hashes}}{65B\text{ hashes/sec} \cdot 10K \text{ miners}} = 5.234 \times 10^{23} \text{ seconds}$

In other words, you and your 10,000 miners would have to wait {{ katex_inline
“10^{16}” >}} **years** (or about the number of meters in a
parsec) to find a hash.

A shorter version of this post was initially published on April 1, 2018. This is an expanded and rewritten version. Thanks to the folks on Lobste.rs for feedback and corrections on the initial version of this post.

Also check out The Code Monkey’s Guide to Cryptographic Hashes for Content-based Addressing.