Understanding the Security of Cryptographic Hash Functions

Hash functions are an extremely versatile tool that you can find nearly everywhere in software engineering.

This article intends to give you a deeper understanding of cryptographic hash functions, when to use them, and how to think about their security properties.

A hash function is a function (in both the math and programming sense) that is “one way” – meaning given the output , it’s near impossible to find the input .

Hash functions are useful for cryptographic and non-cryptographic applications. For example, Python has a non-cryptographic hash() function built into the standard library. It’s used for comparing values and building dictionaries.

Python’s hash() demonstrates some of the basic requirements for a hash function: if you call it a with different inputs, you get different outputs. If you call it with the same input, you always get the same output.

# Python's hash() always outputs a signed 64 bit integer.
>>> hash('rutabaga')
-7515648070926626588
>>> hash('fishtaco')
8880674569626935499
# The same input always maps to the same output.
>>> hash('rutabaga')
-7515648070926626588

How do you define the security of hash functions?

How do you know how to choose which secure hash functions to use and what security properites to they guarantee? There are three security models for hash functions, described below.

1. Preimage resistance

A preimage attack is, given the output, find the input that produced it.

Doing so would require exhaustively checking all possible values of . The cost of attacking SHA256, a hash function with the output space the size of for instance, is . This is currently computationally infeasible. Here’s a great resource that explains how big is.

Preimage resistance is an absolute minimum requirement for hash functions, since it means the function is, in practice, “one way.”

2. Second preimage resistance

A second preimage attack is, given a pair of input/output values, find another input that produces the same output.

given and , find such that (and )

Second preimages are substantially more difficult to find than first preimages.

When is 2nd preimage resistance important?

The downloaded app checksum example above is an example of this attack scenario. If an attacker can write another app that hashes to the same value, they can replace the real app for their app in transit. The checksum would verify successfully, and then you’d be running the attacker’s code.

2nd preimage resistance likely applies to most applications of hash functions.

3. Hash collisions

A collision happens when an attacker can find two inputs and that both map to the same output.

find any such that (and )

This differs from a second preimage attack because the attacker can choose both inputs, making it easier to accomplish.

Collision resistance is important when you’re relying on hash functions to always map two inputs to two different values, like in the content-addressable storage example above. But wait, since our hash function takes anything from tiny to huge-sized inputs and always returns a fixed-size output, we know collisions for sure exist for all outputs.

SHA-1, a hash function with a 160-bit output size, has been broken for collision attacks.

Collisions are guaranteed to exist

We can prove this using the pidgeonhole principle. In other words, if the input space is bigger than the output space, then we know that many inputs must map to the same output.

But even though we know collisions exist, we still use a certain subset of hash functions for cryptography. These are called collision resistant hash functions or cryptographic hash functions.

Visualizing collision resistance

A very badly drawn green bunny behind a door..

Imagine that you’re presented with a vast number of doors, behind each of which is a bunny colored one of 1024 colors. How many doors would you need to look behind to find a matching pair of same-color bunnies? A lot? Actually only 32 doors!! This is because of the birthday paradox, which is computed as .

Now let’s say each bunny’s color is determined by the output of the collision-resistant hash function SHA-256. Because the output of SHA-256 is 256 bits, now each bunny is one of unique colors. Now shrink the bunny down to the size of an atom.

if there are to atoms in the universe

and

If you tried to find two atom-sized bunnies of the same color, you’d roughly need to look at every atom in the observable universe.

A computational perspective: we know computers are much faster than humans at checking for matching bunnies. Are they fast enough to make a difference?

The answer, as you may have guessed, is no. This ASIC Bitcoin miner I picked can compute 65GH/s, or 65 billion hashes per second. In order to compute the necessary hashes to produce a collision, if you somehow acquired 10,000 of these miners (and your own personal hydroelectric dam), you’d have to wait:

In other words, you and your 10,000 miners would have to wait {{ katex_inline “10^{16}” >}} years (or about the number of meters in a parsec) to find a hash.


A shorter version of this post was initially published on April 1, 2018. This is an expanded and rewritten version. Thanks to the folks on Lobste.rs for feedback and corrections on the initial version of this post.

Also check out The Code Monkey’s Guide to Cryptographic Hashes for Content-based Addressing.

Contents (top)