Understanding the Security of Cryptographic Hash Functions

Published Apr 16, 2018 in Cryptography, Software Engineering

Hash functions are an extremely versatile tool that you can find nearly everywhere in software engineering.

Bitcoin mining is done by repeatedly computing SHA256 hashes.
git stores files internally using the SHA-1 hash of the content as the filename (content-based addressing).
When you download an app, your phone verifies that the app you’re about to run is the app you intended to download by computing a checksum (a hash of the entire app) and comparing it to one provided by the app store. If the values are different, that means someone might have changed the contents of the app.

This article intends to give you a deeper understanding of cryptographic hash functions, when to use them, and how to think about their security properties.

A hash function is a function (in both the math and programming sense) that is “one way” – meaning given the output $y$ , it’s near impossible to find the input $x$ .

H (x) \to y

Hash functions are useful for cryptographic and non-cryptographic applications. For example, Python has a non-cryptographic hash() function built into the standard library. It’s used for comparing values and building dictionaries.

Python’s hash() demonstrates some of the basic requirements for a hash function: if you call it a with different inputs, you get different outputs. If you call it with the same input, you always get the same output.

# Python's hash() always outputs a signed 64 bit integer.
>>> hash('rutabaga')
-7515648070926626588
>>> hash('fishtaco')
8880674569626935499
# The same input always maps to the same output.
>>> hash('rutabaga')
-7515648070926626588

How do you define the security of hash functions?

How do you know how to choose which secure hash functions to use and what security properites to they guarantee? There are three security models for hash functions, described below.

Security Property 1: Preimage resistance

A preimage attack is, given the output, find the input that produced it.

given y = H (x), find x

Doing so would require exhaustively checking all possible values of $x$ . The cost of attacking SHA256, a hash function with the output space the size of $∣ 2^{256} ∣$ for instance, is $O (2^{256})$ . This is currently computationally infeasible. Here’s a great resource that explains how big $2^{256}$ is.

Preimage resistance is an absolute minimum requirement for hash functions, since it means the function is, in practice, “one way.”

Security Property 2: Second preimage resistance

A second preimage attack is, given a pair of input/output values, find another input that produces the same output.

Second preimages are substantially more difficult to find than first preimages.

When is 2nd preimage resistance important?

The downloaded app checksum example above is an example of this attack scenario. If an attacker can write another app that hashes to the same value, they can replace the real app for their app in transit. The checksum would verify successfully, and then you’d be running the attacker’s code.

2nd preimage resistance applies to most applications of hash functions.

Security Property 3: Collision Resistance

A collision happens when an attacker can find two inputs $x_{1}$ and $x_{2}$ that both map to the same output.

This differs from a second preimage attack because the attacker can choose both inputs, making it easier to accomplish.

Collision resistance is important when you’re relying on hash functions to always map two inputs to two different values, like in the content-addressable storage example above. But wait, since our hash function takes anything from tiny to huge-sized inputs and always returns a fixed-size output, we know collisions for sure exist for all outputs.

SHA-1, a hash function with a 160-bit output size, has been broken for collision attacks.

Collisions are guaranteed to exist

We can prove this using the pidgeonhole principle. In other words, if the input space is bigger than the output space, then we know that many inputs must map to the same output.

But even though we know collisions exist, we still use a certain subset of hash functions for cryptography. These are called collision resistant hash functions or cryptographic hash functions.

Visualizing collision resistance

A very badly drawn green bunny behind a door..

Imagine that you’re presented with a vast number of doors, behind each of which is a bunny colored one of 1024 colors. How many doors would you need to look behind to find a matching pair of same-color bunnies? A lot? Actually only 32 doors!! This is because of the birthday paradox, which is computed as $1024 = 32$ .

Now let’s say each bunny’s color is determined by the output of the collision-resistant hash function SHA-256. Because the output of SHA-256 is 256 bits, now each bunny is one of $2^{256}$ unique colors. Now shrink the bunny down to the size of an atom.

If you tried to find two atom-sized bunnies of the same color, you’d roughly need to look at every atom in the observable universe.

A computational perspective: we know computers are much faster than humans at checking for matching bunnies. Are they fast enough to make a difference?

The answer, as you may have guessed, is no. This ASIC Bitcoin miner I picked can compute 65GH/s, or 65 billion hashes per second. In order to compute the necessary $2^{256} = 2^{128}$ hashes to produce a collision, if you somehow acquired 10,000 of these miners (and your own personal hydroelectric dam), you’d have to wait:

\frac{2 ^{128} hashes}{6 5 B hashes/sec \cdot 1 0 K miners} = 5.234 \times 1 0^{23} seconds

In other words, you and your 10,000 miners would have to wait {{ katex_inline “10^{16}” >}} years (or about the number of meters in a parsec) to find a hash.

Summary

To conclude, the 3 security properties discussed were:

Preimage resistance: the output is not guessable
Second preimage resistance: you can’t find a new input to map to the same output
Collision resistance: you can’t find any two inputs to map to the same output

For a hash function to be cryptographically secure, it must have all three of these properties.

A shorter version of this post was initially published on April 1, 2018. This is an expanded and rewritten version. Thanks to the folks on Lobste.rs for feedback and corrections on the initial version of this post.

Also check out The Code Monkey’s Guide to Cryptographic Hashes for Content-based Addressing.

jeffcarp

Software

Life