# Argument for MD5

So, there has been a lot of talk over the past few years about using MD5 hash sums in digital forensics, due to the fact that some collisions have been found for MD5.

First, a hash algorithm/function has the following properties:
1) The algorithm takes in a variable sized input data and transforms the data into a fixed size output (usually smaller).

2) The algorithm is one-way, meaning to calculate a hash sum of the input data is relatively easy, while to reverse the process, to calculate the original input data given the hash sum is hard (a.k.a. computationally infeasible)

3) The algorithm is collision-free, meaning that different inputs to the algorithm will yield different outputs.

A nice write-up from RSA labs about hash functions can be found at http://www.rsasecurity.com/rsalabs/node.asp?id=2176.

Now the third property is somewhat of an issue. Due to the pigeon hole principle (http://en.wikipedia.org/wiki/Pigeonhole_principle) a hash function with a fixed output size is never truly collision-free. Here is the rationale:
– Assume a hash function with a fixed output size N. Call the set of possible values that the hash function maps to (the range) B

– Take the set of unique inputs to the hash function, where there are at least N+1 elements. Call this set A.

– Since a hash function maps A to B, there is guaranteed to be a collision, since there are more inputs than there are unique outputs (|A| > |B|).

So, since we have collisions with MD5 does this mean that all of the criminals convicted with digital evidence will run free? Have they all been wrongly convicted? I would guess the answer is no.

First there is the inherent trust in the data collection, handling, processing, analysis, and presentation. This is where things such as chain of custody, and trust in an examiner come into play. While nothing will create a deductive chain of logic that we can rely on for the integrity of evidence, current practices seem not to fail either.

Second, the attacks against MD5 have been against the aspect of being collision free. Nested in the concept of being collision free are two additional concepts:

3.1) A hash function is weakly collision-free if given an input X, then it is computationally infeasible to find another input Y, where X != Y, yet both produce the same hash sum. This is also called a “pre-image attack” by the folks at NSRL.

3.2) A hash function is strongly collision-free if it is computationally infeasible to find to different inputs X and Y that produce the same hash sum.

The key difference between 3.1 and 3.2 is that 3.1 has a pre-specified initial input X, and consequently a pre-specified hash sum to find a match while 3.2 prescribes that is is computationally infeasible to find two different inputs which yield the same output. The subtle difference is one you’re trying to find an input that will yield a given output, while the other just says find two inputs that have the same output.

The role of MD5 in helping show preservation of evidence integrity in digital forensics is the fact that it is weakly collision-free. Courtesy of the pigeon-hole principle, it has been known that there ARE collisions, just finding them was rather time consuming. The attacks against MD5 were focused on 3.2, showing that MD5 was not strongly collision-free.

Here is a deductive argument:

Premise 1:
If you have a hash algorithm suitable for digital forensics [A], then it is weakly collision-free [B] (don’t worry about strongly collision-free, we won’t use it).
A -> B

Premise 2:
If a hash algorithm is weakly collision-free [B] and given an input [C] then it is computationally infeasible to find another unique input with the same hash sum [D] .
(B ^ C) -> D

Premise 3:
The MD5 algorithm is suitable for digital forensics [A]
A

Premise 4:
A bit-stream image of evidence is an input [C]
C

Argument:
The MD5 algorithm is weakly collision-free.

Proof:
1. A (by Premise 3)
2. A -> B (by Premise 1)
3. ((A -> B) ^ A) (by Conjunction of Premise 3 and 1)
4. B (by Modus Ponens and 3)

Argument:
Given a bit-stream image of evidence, it is computationally infeasible to find another input that yields the same MD5 hash sum.

Proof:
5. C (by the argument)
6. B ^ C (by Conjunction of 4 and 5)
7. (B ^ C) -> D (by Premise 2)
8. ((B ^ C) -> D) ^ (B ^ C) (by Conjuction of 6 and 7)
9. D (by Modus Ponens and 8)

Of course, the underlying premise that is up for attack is Premise 3, which states that MD5 is suitable for digital forensics. Notice the recent attacks don’t come into play, since they focus on the strongly collision-free aspect of MD5. The folks at NSRL also stated:

“none of the attacks are pre-image attacks. In a pre-image attack, an attacker
must manufacture a file with a previously known hash. All of the current attacks
are only able to produce two files with the same hash, but it is a random hash. “
(http://www.nsrl.nist.gov/documents/analysis/draft-060530.pdf)

There are however projects underway to undermine the weakly collision-free aspect of the MD5 algorithm. Notably, the folks over at the Digital Forensics Reasearch Workshop a.k.a. DFRWS have issued a challenge (http://www.dfrws.org/hashchallenge/index.html).

So in summary, MD5 is still suitable for digital forensic use, although it might be advisable to use an additional algorithm such as SHA-1, SHA-256,