Self replicating software – Part 1 – The Recursion Theorem

This is the first in a multi part post about computing theory and self replicating software. This post assumes you have knowledge and understanding of a Turing Machine (abbreviated TM). If you aren’t familiar with Turing Machines (TMs) then you may want to take a look at the Wikipedia entry on the topic at http://en.wikipedia.org/wiki/Turing_machine before continuing. Much of the material for this post comes from Michael Sipser’s “An Introduction to the Theory of Computation”.

Let’s assume there is a computable function t which takes two inputs and yields one output. This function can be stated as:

t: E* x E* -> E*

Call the TM that implements this function TM_T. Now if one of the inputs that TM_T takes is the description of a TM (perhaps a description of TM_T or the description of another TM) we can call TM_T a “Meta TM”, meaning it operates on TMs (more accurately, descriptions of TMs). TM_T could do a number of things with the description, such as emulate it (making TM_T a universal TM), or print the description to the TM’s tape (making TM_T a “quine”). For the purpose of the recursion theorem, it’s irrelevant what TM_T does with the description. So TM_T can be written as taking two inputs:
TM_T(<TM>, w)

Where <TM> is a description of a TM. The other argument (w) could be input to the TM described by the <TM> argument.

Now, assume there is another computable function r which takes one parameter, w, the same w passed as the second argument to function t. Function r can be stated as:

r: E* -> E*

Call the TM that implements this function TM_R. So TM_R can be written as taking one input:
TM_R(w)

The TM_R is described by <TM_R>.

The recursion theorem states (in terms of functions r and t):

r(w) = t(<TM_R>, w)

alternatively

TM_R(w) = TM_T(<TM_R>, w)

This means that the ouput from TM_T given a description of TM_R and input w yields the same output as TM_R with input w. Think about the implications of this for a moment. If TM_T prints the description of the TM passed to it (which in this case is the description of TM_R), then TM_R can do the same thing without having the description passed in explicitly, instead TM_R can calculate <TM_R>. In essence TM_R could print out a description of itself, without having the description given to it.

I’ll post the proof and examples in Python in later entries. The recursion theorem can allow a virus to obtain a description of itself and “print” the description into a file. Although, there are other methods that virii can use to obtain descriptions of themselves, which will also be covered in later posts.

Naming structure of recycle bin files

Was doing some research on the structure of the Windows Recycle Bin, and found an interesting article over at Microsoft. It talks about the naming structure of the files in the Recycle Bin directories. In essence, the structure is as follows:

D<drive letter from original path><order #>.<original extension>

The field is a number signifying when the order the file was deleted since the recycle bin was last emptied.

For example, lets say I empty the recycle bin and then delete the following files (in the order shown):

c:somefile.txt
e:document.doc
d:picture.jpg
f:suspect_file.txt

Then, in the recycle bin directory for my user, I would find the following file names:

Dc1.txt
De2.doc
Dd3.jpg
Df4.txt

This is probably already well known, thought it would be interesting to share.

Argument for MD5

So, there has been a lot of talk over the past few years about using MD5 hash sums in digital forensics, due to the fact that some collisions have been found for MD5.

First, a hash algorithm/function has the following properties:
1) The algorithm takes in a variable sized input data and transforms the data into a fixed size output (usually smaller).

2) The algorithm is one-way, meaning to calculate a hash sum of the input data is relatively easy, while to reverse the process, to calculate the original input data given the hash sum is hard (a.k.a. computationally infeasible)

3) The algorithm is collision-free, meaning that different inputs to the algorithm will yield different outputs.

A nice write-up from RSA labs about hash functions can be found at http://www.rsasecurity.com/rsalabs/node.asp?id=2176.

Now the third property is somewhat of an issue. Due to the pigeon hole principle (http://en.wikipedia.org/wiki/Pigeonhole_principle) a hash function with a fixed output size is never truly collision-free. Here is the rationale:
– Assume a hash function with a fixed output size N. Call the set of possible values that the hash function maps to (the range) B

– Take the set of unique inputs to the hash function, where there are at least N+1 elements. Call this set A.

– Since a hash function maps A to B, there is guaranteed to be a collision, since there are more inputs than there are unique outputs (|A| > |B|).

So, since we have collisions with MD5 does this mean that all of the criminals convicted with digital evidence will run free? Have they all been wrongly convicted? I would guess the answer is no.

First there is the inherent trust in the data collection, handling, processing, analysis, and presentation. This is where things such as chain of custody, and trust in an examiner come into play. While nothing will create a deductive chain of logic that we can rely on for the integrity of evidence, current practices seem not to fail either.

Second, the attacks against MD5 have been against the aspect of being collision free. Nested in the concept of being collision free are two additional concepts:

3.1) A hash function is weakly collision-free if given an input X, then it is computationally infeasible to find another input Y, where X != Y, yet both produce the same hash sum. This is also called a “pre-image attack” by the folks at NSRL.

3.2) A hash function is strongly collision-free if it is computationally infeasible to find to different inputs X and Y that produce the same hash sum.

The key difference between 3.1 and 3.2 is that 3.1 has a pre-specified initial input X, and consequently a pre-specified hash sum to find a match while 3.2 prescribes that is is computationally infeasible to find two different inputs which yield the same output. The subtle difference is one you’re trying to find an input that will yield a given output, while the other just says find two inputs that have the same output.

The role of MD5 in helping show preservation of evidence integrity in digital forensics is the fact that it is weakly collision-free. Courtesy of the pigeon-hole principle, it has been known that there ARE collisions, just finding them was rather time consuming. The attacks against MD5 were focused on 3.2, showing that MD5 was not strongly collision-free.

Here is a deductive argument:

Premise 1:
If you have a hash algorithm suitable for digital forensics [A], then it is weakly collision-free [B] (don’t worry about strongly collision-free, we won’t use it).
A -> B

Premise 2:
If a hash algorithm is weakly collision-free [B] and given an input [C] then it is computationally infeasible to find another unique input with the same hash sum [D] .
(B ^ C) -> D

Premise 3:
The MD5 algorithm is suitable for digital forensics [A]
A

Premise 4:
A bit-stream image of evidence is an input [C]
C

Argument:
The MD5 algorithm is weakly collision-free.

Proof:
1. A (by Premise 3)
2. A -> B (by Premise 1)
3. ((A -> B) ^ A) (by Conjunction of Premise 3 and 1)
4. B (by Modus Ponens and 3)

Argument:
Given a bit-stream image of evidence, it is computationally infeasible to find another input that yields the same MD5 hash sum.

Proof:
5. C (by the argument)
6. B ^ C (by Conjunction of 4 and 5)
7. (B ^ C) -> D (by Premise 2)
8. ((B ^ C) -> D) ^ (B ^ C) (by Conjuction of 6 and 7)
9. D (by Modus Ponens and 8)

Of course, the underlying premise that is up for attack is Premise 3, which states that MD5 is suitable for digital forensics. Notice the recent attacks don’t come into play, since they focus on the strongly collision-free aspect of MD5. The folks at NSRL also stated:

“none of the attacks are pre-image attacks. In a pre-image attack, an attacker
must manufacture a file with a previously known hash. All of the current attacks
are only able to produce two files with the same hash, but it is a random hash. “
(http://www.nsrl.nist.gov/documents/analysis/draft-060530.pdf)

There are however projects underway to undermine the weakly collision-free aspect of the MD5 algorithm. Notably, the folks over at the Digital Forensics Reasearch Workshop a.k.a. DFRWS have issued a challenge (http://www.dfrws.org/hashchallenge/index.html).

So in summary, MD5 is still suitable for digital forensic use, although it might be advisable to use an additional algorithm such as SHA-1, SHA-256,