Putting the “Fun” in “Hash Function”

There are several different methods for securely hashing a password server-side for storage and future authentication. The most common one (a.k.a. the one that FIPS allows you to use, if compliance matters for you) is called PBKDF2. It stands for Password-Based Key Derivation Function #2.

Why #2? It’s got nothing to do with pencils. There was, in fact, a PBKDF1! But PBKDF1 was fatally insecure in a way I find very interesting. This StackOverflow answer is a great explainer on the difference between the two.

Very Hand-Wavy Description of a Hash Function

Let’s defined a hash function H(m) as any one-way transformation of some arbitrary-length string (m) to a fixed-length, deterministic, pseudo-random output.

Soatok Explains it All
Note: When in doubt, I err on the side of being easily understood by non-experts over pedantic precision. (Art by Swizz)

For example, this is a dumb hash function (uses SipHash-2-4 with a constant key):

function dumb_hash(string $arbitrary, bool $raw_binary = false): string
    $h = sodium_crypto_shorthash($arbitrary, 'SoatokDreamseekr');
    if ($raw_binary) {
        return $h;
    return sodium_bin2hex($h);

You can see the output of this function with some sample inputs here.

Properties of Hash Functions

A hash function is considered secure if it has the following properties:

  1. Pre-image resistance. Given H(m), it should be difficult to find m.
  2. Second pre-image resistance. Given m_1, it should be difficult to find m_2 such that H(m_1) = H(m_2)
  3. Collision resistance. It should be difficult to find any arbitrary pair of messages ((m_{1}, m_{2})) such that H(m_1) = H(m_2)

That last property, collision resistance, is guaranteed up to the Birthday Bound of the hash function. For a hash function with a 256-bit output, you will expect to need on average 2^{128} trial messages to find a collision.

If you’re confused about the difference between collision resistance and second pre-image resistance:

Collision resistance is about finding any two messages that produce the same hash, but you don’t care what the hash is as long as two distinct but known messages produce it.

On the other paw, second pre-image resistance is about finding a second message that produces a given hash.

Exploring PBKDF1’s Insecurity

If you recall, hash functions map an arbitrary-length string to a fixed-length string. If your input size is larger than your output size, collisions are inevitable (albeit computationally infeasible for hash functions such as SHA-256).

But what if your input size is equal to your output size, because you’re taking the output of a hash function and feeding it directly back into the same hash function?

Then, as explained here, you get an depletion of the possible outputs with each successive iteration.

But what does that look like?

Soatok has a eureka moment

Without running the experiments on a given hash function, there are two possibilities that come to mind:

  1. Convergence. This is when H(H(H(...H(m)...))) will, for two arbitrary messages and a sufficient number of iterations, converge on a single hash output.
  2. Cycles. This is when H_{n}(m) = H_{n + k}(m) for some integer k.

The most interesting result would be a quine, which is a cycle where H(m) = m (that is to say, k = 1).

The least interesting result would be for random inputs to converge into large cycles e.g. cycles of size 2^{128} for a 256-bit hash function.

I calculated this lazily as the birthday bound of the birthday bound (so basically the 4th root, which for 2^{256} is 2^{64}).

Update: According to this 1960 paper, the average time to cycle is (1/2) (2\pi n)^{1/2}, and cycle length should be (1/4) (2\pi n)^{1/2}, which means for a 256-bit hash you should expect a cycle after about (1/2) (2\pi 2^{256})^{1/2} \approx 2^{128.326} or about 128 bits, and the average cycle length will be about (1/4) (2\pi 2^{256})^{1/4} \approx 2^{127.326}. Thanks Riastradh for the corrections.

Conjecture: I would expect secure cryptographic hash functions in use today (e.g. SHA-256) to lean towards the least interesting output.

An Experiment Design

Since I don’t have an immense amount of cheap cloud computing at my disposal to run this experiments on a real hash function, I’m going to cheat a little and use my constant-key SipHash code from earlier in this post. In future work, cryptographers may find studying real hash functions (e.g. SHA-256) worthwhile.

Given that SipHash is a keyed pseudo-random function with an output size of 64 bits, my dumb hash function can be treated as a 64-bit hash function.

This means that you should expect your first collision (with 50% probability) after only 2^{32} trial hashes. This is cheap enough to try on a typical laptop.

Here’s a simple experiment for convergence:

  1. Generate two random strings (m1, m2).
  2. Set h1 = H(m1), h2 = H(m2).
  3. Iterate h1 = H(h1), h2 = H(h2) until h1 = h2.

You can get the source code to run this trivial experiment here.

Clone the git repository, run composer install, and then php bin/experiment.php. Note that this may need to run for a very long time before you get a result.

If you get a result, you’ve found a convergence.

If the loop doesn’t terminate even after 2^64 iterations, you’ve definitely found a cycle. (Actually detecting a cycle and analyzing its length would require a lot of memory, and my simple PHP script isn’t suitable for this effort.)

Soatok angrily grasping computer monitor
“What do you mean I don’t have a petabyte of RAM at my disposal?”

What Does This Actually Tell Us?

The obvious lesson: Don’t design key derivation functions like PBKDF1.

But beyond that, unless you can find a hash function that reliably converges or produces short cycles (k < (1/4) (2\pi 2^{n})^{1/2}, for an n-bit hash function), not much. (This is just for fun, after all!)

Definitely fun to think about though! (Art by circuitslime)

If, however, a hash function is discovered to produce interesting results, this may indicate that the chosen hash function’s internal design is exploitable in some subtle ways that, upon further study, may lead to better cryptanalysis techniques. Especially if a hash quine is discovered.

(Header art by Khia)

By Soatok

Security engineer with a fursona. Ask me about dholes or Diffie-Hellman!

5 replies on “Putting the “Fun” in “Hash Function””

At least cycle detection should be quite easy, if you run h2=H(H(h2)) like in Pollard’s Rho method. In that case you get collision even in a cycle. And if running for 2^64 is not a problem you can also run h1=H(h1) again after finding h1=h2 until you get it again and count iterations to get cycle length

Bark My Way

This site uses Akismet to reduce spam. Learn how your comment data is processed.