The five phases of recovering digital evidence

This is the second post in a series about the five phases of recovering data structures from a stream of bytes (a form of digital evidence recovery). In the last post we discussed what data structures are, how they relate to digital forensics, and gave a high-level overview of the five phases of recovery. In this post we’ll examine each of the five phases in finer-grained detail.

In the previous post, we defined five phases a tool (or human if they’re that unlucky) goes through to recover data structures. They are:

  1. Location
  2. Extraction
  3. Decoding
  4. Interpretation
  5. Reconstruction

We’ll now examine each phase in more detail…

Location

The first step in recovering a data structure from a stream of bytes is to locate the data structure (or at least the fields of the data structure we’re interested in.) There are currently three commonly used methods for location:

  1. Fixed offset
  2. Calculation
  3. Iteration

The first method is useful when the data structure is at a fixed location relative to a defined starting point. This is the case for a FAT boot sector, which is located in the first 512 bytes of a partition. The second method (calculation) uses values from one or more other fields (possibly in other data structures) to calculate the location of a data structure (or field). The last method (iteration) examines “chunks” of data, and attempts to identify if the chunks are “valid”, meaning the (eventual) interpretation of the chunk fits predetermined rules for a given data structure.

These three methods aren’t mutually exclusive, meaning they can be combined and intermixed. It might be the case that locating a data structure requires all three methods. For example, when run against a FAT file system, the ils tool from Sleuthkit first recovers the boot sector, then calculates the start of the data region, and finally iterates over chunks of data in the data region, attempting to validate the chunks as directory entries.
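Here is a rough Python sketch of all three methods (not taken from any particular tool: the boot sector field offsets come from the published FAT specification, and the validation rule passed to the iteration function is just a placeholder):

```python
import struct

def locate_boot_sector():
    # Fixed offset: the FAT boot sector occupies the first 512 bytes
    # of the partition, so its location is always offset 0
    return 0

def locate_fat_region(boot_sector):
    # Calculation: combine fields extracted from the boot sector to find
    # the byte offset of the FAT region (it begins after the reserved sectors)
    bytes_per_sector = struct.unpack_from("<H", boot_sector, 11)[0]
    reserved_sectors = struct.unpack_from("<H", boot_sector, 14)[0]
    return bytes_per_sector * reserved_sectors

def locate_by_iteration(stream, chunk_size, is_valid):
    # Iteration: yield the offset of every chunk that passes validation
    for offset in range(0, len(stream) - chunk_size + 1, chunk_size):
        if is_valid(stream[offset:offset + chunk_size]):
            yield offset
```

Note that iteration returns a set of candidate offsets rather than a single answer, which matches the discussion below.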

While all three methods require some form of a priori knowledge, the third method (iteration) isn’t necessarily dependent on knowing the fixed offset of a data structure. From a purist perspective, only the iterating itself is really location: iteration yields a set of possible locations, as opposed to the first two methods, which yield a single location. The validation aspect of iteration is really a combination of the remaining phases (extraction, decoding, interpretation, and reconstruction) combined with post-recovery analysis.

A fourth, less common method is location by outside knowledge from some source. That source could be a person who has already performed location, or it could be whatever created the data structure (e.g. the operating system). Due to the flexible and dynamic nature of computing devices, this method isn’t commonly used, but it is possible.

Extraction

Once a data structure (or the relevant fields) has been located, the next step is to extract the fields of the data structure out of the stream of bytes. Realize that the “extracting” is really the application of type information. The information from the stream is the same, but we’re using more information about how to access (and possibly manipulate) the value of the field(s). For example, the pair of bytes 0x64 and 0x53 can be extracted as an ASCII string composed of the characters “dS”, or as the (big endian) value 0x6453 (25683 decimal). The information remains the same, but how we access and manipulate the values (e.g. concatenation vs. addition) differs. Knowledge of the type of a field provides the context for how to access and manipulate its value, which is used during the later phases of decoding, interpretation, and reconstruction.
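The example can be reproduced in Python, where the type we apply determines how the same two bytes are accessed:

```python
data = bytes([0x64, 0x53])

# Applied as an ASCII string: the characters 'd' and 'S'
as_text = data.decode("ascii")

# Applied as a big-endian 16-bit integer: 0x6453
as_number = int.from_bytes(data, "big")
```

The bytes never change; only the type information we bring to them does.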

The extraction of a field that is composed of multiple bytes also requires knowledge of the order of the bytes, commonly referred to as the “endianness”, “byte ordering”, or “big vs. little endian”. Take for instance the 16-bit hexadecimal number 0x6453. Since this is a 16-bit number, we need two bytes (assuming 8-bit bytes) to store it. So the value 0x6453 is composed of the (two) bytes 0x64 and 0x53.

It’s logical to think that these two bytes would be adjacent to each other in the stream of bytes, and they typically are. The question now is: in what order do the two bytes appear in the stream?

0x64, 0x53 (big endian)

or

0x53, 0x64 (little endian)

Obviously the order matters.
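Python’s struct module makes the distinction explicit; the format string specifies both the type (H, a 16-bit unsigned integer) and the byte order (> for big endian, < for little endian):

```python
import struct

data = bytes([0x64, 0x53])  # the bytes as they appear in the stream

big = struct.unpack(">H", data)[0]     # read as big endian: 0x6453
little = struct.unpack("<H", data)[0]  # read as little endian: 0x5364
```

The same two bytes produce 25683 or 21348 depending solely on the assumed byte order.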

Decoding

After the relevant fields of a data structure have been located and extracted, it’s still possible that further extraction is necessary, specifically for bit fields (e.g. flags, attributes, etc.) The difference between this phase and the extraction phase is that extraction pulls information out of a stream of bytes, while decoding pulls information out of the output of the extraction phase. In other words, the output from the extraction phase is used as the input to the decoding phase. Both phases, however, focus on extracting information. Computation using extracted information is reserved for the later phases (interpretation and reconstruction).

Another reason to distinguish this phase from extraction is that most (if not all) computing devices can only address whole bytes, not individual bits. While a human with a hex dump could potentially extract a single bit, software needs to read (at least) a whole byte and then extract the various bits within the byte(s).

There isn’t much that happens at this phase, as most of the activity focuses on accessing various bits.
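As a sketch of decoding, consider the attributes byte of a FAT directory entry (the bit masks below come from the FAT specification; the function name is my own):

```python
# Attribute bit masks, per the FAT specification
ATTR_READ_ONLY = 0x01
ATTR_HIDDEN    = 0x02
ATTR_SYSTEM    = 0x04
ATTR_VOLUME_ID = 0x08
ATTR_DIRECTORY = 0x10
ATTR_ARCHIVE   = 0x20

def decode_attributes(attrs):
    # Pull the individual flag bits out of the already-extracted byte
    return {
        "read_only": bool(attrs & ATTR_READ_ONLY),
        "hidden":    bool(attrs & ATTR_HIDDEN),
        "system":    bool(attrs & ATTR_SYSTEM),
        "volume_id": bool(attrs & ATTR_VOLUME_ID),
        "directory": bool(attrs & ATTR_DIRECTORY),
        "archive":   bool(attrs & ATTR_ARCHIVE),
    }
```

Note that no computation happens here; each flag is simply masked out of the byte that the extraction phase produced.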

Interpretation

The interpretation phase takes the output of the decoding phase (or the extraction phase if the decoding phase uses only identity functions) and performs various computations using the information. While extraction and decoding focus on extracting and decoding values, interpretation focuses on computation using the extracted (and decoded) values.

Two examples of interpretation are unit conversion and the calculation of values used during the reconstruction phase. An example of unit conversion would be converting the seconds field of a FAT time stamp from its native format (seconds/2) to a more logical format (seconds). A useful computation for reconstruction might be to calculate the size of the FAT region (in bytes) for a FAT file system (bytes per sector * size of one FAT structure (in sectors) * number of FAT structures.)

Since this phase is used heavily by the reconstruction phase, it’s not uncommon to see this phase embodied in the code for reconstruction. However, this phase is still a logically separate component.
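Both examples above are simple computations over extracted values; a minimal sketch (the function names are mine, not from any particular tool):

```python
def fat_region_size(bytes_per_sector, sectors_per_fat, fat_count):
    # bytes per sector * size of one FAT structure (in sectors) * number of FATs
    return bytes_per_sector * sectors_per_fat * fat_count

def interpret_seconds(raw_seconds):
    # The FAT seconds field natively stores seconds/2, so multiply by two
    return raw_seconds * 2
```

Unlike extraction and decoding, these functions compute new values rather than pulling existing ones out of bytes.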

Reconstruction

This is the last phase of recovering digital evidence. Information from previous phases is used to reconstruct a usable representation of the data structure (or at least the relevant fields.) Possible usable representations include:

  • A language-specific construct or class (e.g. a Python date object or a C integer)
  • Printable text (e.g. output from fsstat)
  • A file (e.g. file carving)

The idea is that the output from this phase can be used for some further analysis (e.g. timeline generation, analyzing email headers, etc.) Some tools might also perform some error checking during reconstruction, failing if the entire data structure is unable to be properly recovered. While this might be useful in some automated scenarios, it has the downside of potentially missing useful information when only part of the structure is available or is valid.
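For example, a FAT date and time pair can be reconstructed into a Python datetime object (the bit layouts below come from the FAT specification):

```python
import datetime

def reconstruct_timestamp(fat_date, fat_time):
    # Bit layouts per the FAT specification:
    #   date: bits 15-9 year since 1980, bits 8-5 month, bits 4-0 day
    #   time: bits 15-11 hours, bits 10-5 minutes, bits 4-0 seconds/2
    return datetime.datetime(
        1980 + (fat_date >> 9),
        (fat_date >> 5) & 0x0F,
        fat_date & 0x1F,
        fat_time >> 11,
        (fat_time >> 5) & 0x3F,
        (fat_time & 0x1F) * 2,
    )
```

The result is a usable language-level object that later analysis (e.g. timeline generation) can work with directly.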

At this point, we’ve gone into more detail of each phase and hopefully explained in enough depth the purpose and types of activity that happen in each. The next (and last) post in this series is an example of applying the five phases to recovering a short name directory entry from a FAT file system.

How forensic tools recover digital evidence (data structures)

In a previous post I covered “The basics of how digital forensics tools work.” In that post, I mentioned that one of the steps an analysis tool has to do is to translate a stream of bytes into usable structures. This is the first in a series of three posts that examines this step (translating from a stream of bytes to usable structures) in more detail. In this post I’ll introduce the different phases that a tool (or human if they’re that unlucky) goes through when recovering digital evidence. The second post will go into more detail about each phase. Finally, the third post will show an example of translating a series of bytes into a usable data structure for a FAT file system directory entry.

Data Structures, Data Organization, and Digital Evidence

Data structures are central to computer science, and consequently bear importance to digital forensics. In The Art of Computer Programming, Volume 1: Fundamental Algorithms (3rd Edition), Donald Knuth provides the following definition for a data structure:

Data Structure: A table of data including structural relationships

In this sense, a “table of data” refers to how a data structure is composed. This definition does not imply that arrays are the only data structure (which would exclude other structures such as linked lists.) The units of information that compose a data structure are often referred to as fields. That is to say, a data structure is composed of one or more fields, where each field contains information, and the fields are adjacent to each other in memory (RAM, hard disk, USB drive, etc.)

The information the fields contain falls into one of two categories: the data a user wishes to represent (e.g. the contents of a file), and the structural relationships (e.g. a pointer to the next item in a linked list.) It’s useful to think of the former as data and the latter as metadata, although the line between the two is not always clear, and depends on the context of interpretation. What may be considered data from one perspective may be considered metadata from another. An example of this would be a Microsoft Word document, which from a file system perspective is data. However, from the perspective of Microsoft Word, the file contains both data (the text) as well as metadata (the formatting, revision history, etc.)

The design of a data structure not only includes the order of the fields, but also the higher level design goals for the programs which access and manipulate the data structures. For instance, efficiency has long been a desirable aspect of many computer programs. With society’s increased dependence on computers, other higher level design goals such as security, multiple access, etc. have also become desirable. As a result, many data structures contain fields to accommodate these goals.

Another important aspect in computing is how to access and manipulate the data structures and their related fields. Knuth defines this under the term “data organization”:

Data Organization: A way to represent information in a data structure, together with algorithms that access and/or modify the structure.

An example of this would be a field that contains the bytes 0x68, 0x65, 0x6C, 0x6C, and 0x6F. One way to interpret these bytes is as the ASCII string “hello”. In another interpretation, these bytes can be the integer number 448378203247 (decimal). Which one is it? Well there are scenarios where either could be correct. To answer the question of correct interpretation requires information beyond just the data structure and field layout, hence the term data organization. Even with self-describing data structures, information about how to access and manipulate the “self-describing” parts (e.g. type “1” means this is a string) is still needed.
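Both interpretations are easy to demonstrate in Python:

```python
data = bytes([0x68, 0x65, 0x6C, 0x6C, 0x6F])

# One interpretation: the ASCII string "hello"
as_text = data.decode("ascii")

# Another interpretation: a big-endian integer, 448378203247 decimal
as_number = int.from_bytes(data, "big")
```

Nothing in the five bytes themselves tells us which interpretation is correct; that knowledge is the data organization.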

So where does all this information for data organization (and data structures) come from? There are a few common sources. Perhaps the first would be a document from the organization that designed the data structures and the software that accesses and manipulates them. This could be either a formal specification, or one or more informal documents (e.g. entries in a knowledge base.) Another source would be reverse engineering the code that accesses and manipulates the data structures.

If you’ve read through all of this, you might be asking “So how does this relate to digital forensics?” The idea is that data structures are a type of digital evidence. Realize that the term “digital evidence” is somewhat overloaded. In one context, a disk image is digital evidence (i.e. what was collected during evidence acquisition), and in another context, an email extracted from a disk image is digital evidence. This series focuses on the latter, digital evidence extracted from a stream of bytes. Typically this would occur during the analysis phase, although (especially with activities such as verification) it can occur prior to the analysis phase.

The 5 Phases

Now that we’ve talked about what data structures are and how they relate to digital forensics, let’s see how to put this to use with our forensic tools. What we’re about to do is describe five abstract phases, meaning not all tools implement them directly, and some tools don’t focus on all five phases. These phases can also serve as a methodology for recovering data structures, should you happen to be in the business of writing digital forensic tools.

  1. Location
  2. Extraction
  3. Decoding
  4. Interpretation
  5. Reconstruction

The results of each phase are used as input for the next phase, in a linear fashion.

An example will help clarify each phase. Consider the recovery of a FAT directory entry from a disk image. The first task would be to locate the desired directory entry, which could be accomplished through different mechanisms such as calculation or iteration. The next task is to extract out the various fields of the data structure, such as the name, the date and time stamps, the attributes, etc. After the fields have been extracted, fields where individual bits represent sub fields can be decoded. In the example of the directory entry, this would be the attributes field, which denotes if a file is considered hidden, to be archived, a directory, etc. Once all of the fields have been extracted and decoded, they can be interpreted. For instance, the seconds field of a FAT time stamp is really the seconds divided by two, so the value must be multiplied by two. Finally, the data structure can be reconstructed using the facilities of the language of your choice, such as the time class in Python.
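As a sketch of the extraction step in that example, the 32-byte directory entry can be pulled apart with Python’s struct module (the field layout follows the published FAT specification; the field names are my own):

```python
import struct

# A FAT short-name directory entry is 32 bytes; '<' means little-endian
DIR_ENTRY = struct.Struct("<11s3B7HI")

FIELD_NAMES = (
    "name", "attributes", "nt_reserved", "creation_tenths",
    "creation_time", "creation_date", "access_date", "first_cluster_hi",
    "write_time", "write_date", "first_cluster_lo", "size",
)

def extract_dir_entry(chunk):
    # Apply type information (sizes and byte order) to a 32-byte chunk
    return dict(zip(FIELD_NAMES, DIR_ENTRY.unpack(chunk)))
```

The attributes byte would then go on to the decoding phase, and the date and time fields to interpretation and reconstruction.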

There are a few interesting points to note with recovery of data structures using the above methodology. First, not all tools go through all phases, at least not directly. For instance, file carving doesn’t directly care about data structures. Depending on how you look at it, file carving really does go through all five phases, it just uses an identity function for some of them. In addition, file carving does care about (parts of) data structures: it cares about the fields of the data structures that contain “user information”, not about the rest of the fields. In fact, much file carving is done with a built-in assumption about the data structure: that the fields that contain “user information” are stored in contiguous locations.

Another interesting point is the distinction between extraction, decoding, and interpretation. Briefly, extraction and decoding focus on extracting information (from stream of bytes and already extracted bytes respectively), whereas interpretation focuses on computation using extracted and decoded information. The next post will go into these distinctions in more depth.

A third and subtler point comes from the transition of data structures between different types of memory, notably from RAM to a secondary storage device such as a hard disk or USB thumb drive. Not all structural information may make the transition from RAM, and what doesn’t is lost. For instance, a linked list data structure, which typically contains a pointer field to the next element in the list, may not record the pointer field when being written to disk. More often than not, such information isn’t necessary to read consistent data structures from disk; otherwise the data organization mechanism wouldn’t really be consistent and reliable. However, if an analysis scenario does require such information (it’s theoretically possible), the data structures would have to come directly from RAM, as opposed to after they’ve been written to disk. This problem doesn’t stem from the five phases, but instead stems from a loss of information during the transition from RAM to disk.

In the next post, we’ll cover each phase in more depth, and examine some of the different activities that can occur at each phase.

How digital forensics relates to computing

A lot of people are aware that there is some inherent connection between digital forensics and computing. I’m going to attempt to explain my understanding of how the two relate. However before we dive into digital forensics, we should clear up some misconceptions about what computing is (and perhaps what it is not).

Ask yourself the question “What is computing?” When I ask this question (or some related variant) of folks, I get different responses, ranging from programming to abstract definitions about Turing machines. Well, it turns out that in 1989 some folks over at the ACM came up with a definition that works out quite well. The final report titled “Computing as a Discipline” provides a short definition:

The discipline of computing is the systematic study of algorithmic processes that describe and transform information … The fundamental question underlying all of computing is, “What can be (efficiently) automated?”

This definition can be a bit abstract for those who aren’t already familiar with computing. In essence, the term “algorithmic processes” basically implies algorithms. That’s right, algorithms are at the heart of computing. A (well defined) algorithm is essentially a finite set of clearly articulated steps to accomplish some task. The task the algorithm is trying to accomplish can vary. So computing is about algorithms whose tasks are to describe and transform information.

When we implement an algorithm on a computing device, we have a computer program. So a computer program is really just the implementation of some algorithm for a specific computing device. The computing device could be a physical machine (e.g. an Intel architecture) or an abstract model (e.g. a Turing machine). When we implement an algorithm for a specific computing device, we’re really just translating the algorithm into a form the computing device can understand. To help make this more concrete, take for example Microsoft Word. Microsoft Word is a computer program: a slew of computer instructions encoded in a specific format. The computer instructions tell a computing device (e.g. the processor) what to do with information (the contents of the Word document). The computer instructions are a finite set of clearly articulated steps (described in a format the processor understands) to accomplish some task (editing the Word document).

There is one other concept to deal with before focusing on digital forensics, and that is how algorithms work with information. In order for an algorithm to transform and describe information, the information has to be encoded in some manner. For example, the letter “A” can be encoded (in ASCII) as the number 0x41 (65). The number 0x41 can then be represented in binary as 01000001. This binary number can then be encoded as the different positions of magnets and stored on a hard disk. Implicit in the structure of the algorithm is how the algorithm decodes the representation of information. This means that given just the raw encoding of information (e.g. a stream of bits) we don’t know what information is represented; we still need to understand (to some degree) how the information is used by the algorithm. I blogged about this a bit in a previous post “Information Context (a.k.a. Code/Data Duality)“.
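The first two steps of that encoding chain can be reproduced in Python:

```python
letter = "A"

# ASCII encodes 'A' as the number 0x41 (65 decimal)
code_point = ord(letter)

# The same number represented as eight binary digits
as_binary = format(code_point, "08b")
```

The final step, turning those bits into magnet positions on a platter, is of course handled by the drive itself.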

So how does this relate to digital forensics? Simple, digital forensics is the application of knowledge of various aspects of computing to answer legal questions. It’s been common (in practice) to extend the definition of digital forensics to answer certain types of non-legal questions (e.g. policy violations in a corporate setting).

Think for a moment about what we do in digital forensics:

  • Collection of digital evidence: We collect the representation of information from computing devices.
  • Preservation of digital evidence: We take steps to minimize the alteration of the information we collect from computing devices.
  • Analysis of digital evidence: We apply our understanding of computer programs (algorithmic processes) to interpret and understand the information we collected. We then use our interpretation and understanding of the information to arrive at conclusions, using deductive and inductive logic.
  • Reporting: We relate our findings of the analysis of information collected from computing devices, to others.

Metadata can also be explained in terms of computing. Looking back at the definition for the discipline of computing, realize there are two general categories of information:

  • information that gets described and transformed by the algorithm
  • auxiliary information used by the algorithm when the steps of the algorithm are carried out

The former (information that gets described and transformed) can be called “content”, while the latter (auxiliary information used by the algorithm when executed) can be called “metadata”. Viewed in this perspective, metadata is particular to a specific algorithm (computer program), and what is content to one algorithm could be metadata to another.

Again, an example can help make this a bit clearer. Let’s go back to our Microsoft Word document. From the perspective of Microsoft Word, the content would be the text the user typed. The metadata would be the font information and attributes, revision history, etc. So, to Microsoft Word, the document contains both content and metadata. However, from the perspective of the file system, the Word document is the content, and things such as the location of the file, security attributes, etc. are all metadata. So what is considered by Microsoft Word to be metadata and content is just content to the file system.

Hopefully this helps explain what computing is, and how digital forensics relates.

The basics of how programs are compiled and executed

Well, the post “The basics of how digital forensics tools work” seemed to be fairly popular, even getting a place on Digg. This post is focused on the basics of how a program gets compiled and loaded into memory when the program is executed. It’s useful for code analysis (reverse engineering), and is aimed at those who aren’t already familiar with compilers. The basic reasoning is that if you understand (at least from a high level) how compilers work, it will help when analyzing compiled programs. So, if you’re familiar with code analysis, this post isn’t really aimed at you. If however, you’re new to the field of reverse engineering, (specifically code analysis) this post is aimed at you.

Compiling is program transformation

From an abstract perspective, compiling a program is really just transforming the program from one language into another. The “original” language that the program is written in is commonly called the source language (e.g. C, C++, Python, Java, Pascal, etc.) The program as it is written in the source language is called the source code. The “destination” language, the language that the program is written to, is commonly called the target language. So compiling a program is essentially translating it from the source language to the target language.

The “typical” notion of compiling a program is transforming the program from a higher level language (e.g. C, C++, Visual Basic, etc.) to an executable file (PE, ELF, etc.). In this case the source language is the higher level language, and the target language is machine code (the byte code representation of assembly). Realize however, that going from source code to executable file is more than just transforming source code into machine code. When you run an executable file, the operating system needs to set up an environment (process space) for the code (contained in the executable file) to run inside of. For instance, the operating system needs to know what external libraries will be used, what parts of memory should be marked executable (i.e. can contain directly executable machine code), as well as where in memory to start executing code. Two additional types of programs, linkers and loaders accomplish these tasks.

Compiler front and back ends

Typically a compiler is composed of two primary components, the front end and the back end. The front end of a compiler typically takes in the source code, analyzes the structure (doing things such as checking for errors) and creates an intermediate representation of the source code (suitable for the back end). The back end takes the output of the front end (the intermediate representation), optionally performs optimization, and translates the intermediate representation of the source code into the target language. In the case of compiling a program, it is common for a compiler’s back end to generate human readable assembly code (mnemonics) and then invoke an assembler to translate the assembly code into its byte code representation, which is suitable for a processor.

Realize, it’s not an “absolute” requirement that a compiler be divided into front and back ends. It is certainly possible to create a compiler that translates directly from source code to target language. There are however benefits to the front/back end division, such as reuse, ease of development, etc.

Linkers and Loaders

The compiler took our source code and translated it to executable machine instructions, but we don’t yet have an executable file, just one or more files that contain executable code and data. These files are typically called object code, and in many instances aren’t suitable to stand on their own. There are (at least) three high-level tasks that still need to be performed:

  1. We still need to handle referencing dependencies, such as variables and functions (possibly in external code libraries.) This is called symbol resolution.
  2. We still need to arrange the object code into a single file, making sure separate pieces do not overlap, and adjusting code (as necessary) to reflect the new locations. This is called relocation.
  3. When we execute the program, we still need to set up an environment for it to run in, as well as load the code from disk into RAM. This is called program loading.

Conceptually, the line between linkers and loaders tends to blur near the middle. Linkers tend to focus on the first item (symbol resolution) while loaders tend to focus on the third item (program loading). The second item (relocation) can be handled by either a linker or loader, or even both. While linkers and loaders are often separate programs, there do exist single linker-loader programs which combine the functionality.

Linkers

The primary job of a linker is symbol resolution, that is to resolve references to entities in the code. For example, a linker might be responsible for replacing references to the variable X with a memory address. The output from a linker (at compile time) typically includes an executable file, a map of the different components in the executable file (which facilitates future linking and loading), and (optionally) debugging information. The different components that a linker generates don’t always have to be separate files, they could all be contained in different parts of the executable file.

Sometimes you’ll hear references to statically or dynamically linked programs. Both of these refer to how different pieces of object code are linked together at compile time (i.e. prior to the program loading.) Statically linked programs contain all of the object code they need in the executable file. Dynamically linked programs don’t contain all of the object code they need, instead they contain enough information so that at a later time, the needed object code (and symbols) can be found and made accessible. Since statically linked programs contain all the object code and information they need, they tend to be larger.

There is another interesting aspect of linkers, in that there are both linkers which work at compile time, and linkers which perform symbol resolution during program loading, or even run time. Linkers which perform their work during compile time are called compile time linkers, while linkers which perform their work at load and run time are called dynamic linkers. Don’t confuse dynamic linkers and dynamically linked executable files. The information in a dynamically linked executable is used by a dynamic linker. One is your code (the dynamically linked executable); the other helps your code run properly (the dynamic linker).

Loaders

There are two primary functions that are typically assigned to loaders, creating and configuring an environment for your program to execute in, and loading your program into that environment (which includes starting your program). During the loading process, the dynamic linker may come into play, to help resolve symbols in dynamically linked programs.

When creating and configuring the environment, the loader needs information that was generated by the compile time linker, such as a map of which sections of memory should be marked as executable (i.e. can contain directly executable code), as well as where various pieces of code and data should reside in memory. When loading a dynamically linked program into memory, the loader also needs to load the libraries into the process environment. This loading can happen when the environment is first created and configured, or at run time.

In the process of creating and configuring the environment, the loader will transfer the code (and initialized data) from the executable file on disk into the environment. After the code and data have been transferred to memory (and any necessary modifications for load time relocation have been made), the code is started. In essence, the way the program is started is to tell the processor to execute code at the first instruction of the code (in memory). The address of the first instruction of code (in memory) is called the entry point, and is typically set by the compile time linker.
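As a small illustration of the entry point, here is a sketch that reads the e_entry field from a 64-bit ELF header (the offset comes from the ELF specification; a real tool would also check the class and byte-order flags in e_ident rather than assume a little-endian 64-bit file as this sketch does):

```python
import struct

def elf64_entry_point(header):
    # Every ELF file starts with the magic bytes 0x7F 'E' 'L' 'F'
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # In a 64-bit ELF header, e_entry is an 8-byte field at offset 0x18;
    # we assume a little-endian file here
    return struct.unpack_from("<Q", header, 0x18)[0]
```

The value returned is the virtual address where the loader tells the processor to begin executing.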

Further reading

Well, that about wraps it up for this introduction. There are some good books out there on compilers, linkers, and loaders. The classic book on compilers is Compilers: Principles, Techniques, and Tools. I have the first edition, and it is a bit heavy on the computer science theory (although I just ordered the second edition, which was published in August 2006). The classic book on linkers and loaders is Linkers and Loaders. It too is a bit more abstract, but considerably lighter on the theory. If you’re examining code in a Windows environment, I’d also recommend reading (at least parts of) Microsoft Windows Internals, Fourth Edition.

Information Context (a.k.a Code/Data Duality)

One concept that pervades digital forensics, reverse engineering, exploit analysis, even computing theory is that in order to fully understand information, you need to know the context the information is used in.

For example, categorize the following four pieces of information as either code or data:

1) push 0x6F6C6C65
2) “hello” (without the quotes)
3) 448378203247
4) 110100001100101011011000110110001101111

Some common answers might be:
1) Code (x86 instruction)
2) Data (string)
3) Data (integer)
4) Code or Data (unknown)

Really, all four are encoded internally (at least on an Intel architecture) the same way, so they’re all code and they’re all data. Inside the metal box, they all have the same binary representation. The key is to know the context in which they’re used. If you thought #1 was code, this is probably based on past experience (e.g. having seen x86 assembly before). If the five ASCII bytes ‘h’ ‘e’ ‘l’ ‘l’ ‘o’ were in a code segment, then it is possible that they were instructions, however it is also possible that they were data. This is because code and data can be interspersed even in a code segment (as is often done by compilers.)
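We can make the duality concrete in Python. The same five bytes decode as the string and as the push instruction, because 0x68 is the x86 opcode for PUSH imm32 and the immediate operand is stored little endian:

```python
data = bytes([0x68, 0x65, 0x6C, 0x6C, 0x6F])

# As data: the ASCII string "hello"
as_text = data.decode("ascii")

# As code: opcode 0x68 (PUSH imm32) followed by a four-byte
# little-endian immediate, 0x6F6C6C65
opcode = data[0]
immediate = int.from_bytes(data[1:], "little")
```

Only the context of use tells us which reading is the intended one.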

Note: Feel free to skip the following paragraph if you’re not really interested in the theory behind why this is possible. The application of information context is in the rest of the post.
The reason code and data can be interspersed is often attributed to the Von Neumann architecture, which stores both code and data in a common memory area. Other computing architectures, such as the Harvard architecture, have separate memory storage areas for code and data. However, this is an implementation issue, as information context (or lack thereof) can also be seen in Turing machines. In a Universal Turing Machine (UTM), you can’t distinguish between code and data if you encode both the instructions to the UTM and the data to the UTM with the same set (or subset) of tape symbols. Both the instructions and the data sit on the tape, and if they’re encoded with the same set (or subset) of tape symbols, then just by looking at any given set of symbols there isn’t enough information about how the symbols are used to determine whether they’re instructions (e.g. move the tape head) or data. This type of UTM would be more along the lines of a Von Neumann machine. A UTM which was hard-coded (via the transition function) to use different symbols for instructions and data would be more along the lines of a Harvard machine.

One area where information context greatly determines understanding of code and data is in reverse engineering, specifically executable packers/compressors/encryption/etc. (Note: executable packers/compressors/encryption/etc. are really examples of self-modifying code, so this concept extends to self-modifying code as well). Here’s an abstract view of how common block-based executable encryption works (e.g. ASPack, etc.):

| Original executable | — Packer —> | Encrypted data |

The encrypted data is then combined with a small decryption stub resulting in an encrypted executable that looks something like:

| Decryption stub | Encrypted data |

When the encrypted executable is run, the decryption stub decrypts the encrypted data, yielding something that can look like:

| Decryption stub | Encrypted data | Original executable |

(Note: this is an abstract view, there are of course implementation details such as import tables, etc. It’s also possible to have the encrypted data decrypted in place).

The encrypted data is really code, just in a different representation. The same concept applies when you zip / gzip / bzip / etc. an executable from disk.
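As a toy sketch (not any real packer’s algorithm, just a single-byte XOR chosen for illustration), the point can be shown in a few lines of Python: the “encrypted” blob is indistinguishable from arbitrary data until the stub restores it.

```python
KEY = 0x5A  # arbitrary single-byte key for this toy example

def pack(payload: bytes) -> bytes:
    """What the packer does: 'encrypt' the original executable's bytes."""
    return bytes(b ^ KEY for b in payload)

def decryption_stub(encrypted: bytes) -> bytes:
    """What the stub does at run time: recover the original code."""
    return bytes(b ^ KEY for b in encrypted)

original = b"\x90\x90\xc3"   # stand-in machine code (NOP, NOP, RET)
encrypted = pack(original)

assert encrypted != original                    # on disk: looks like opaque data
assert decryption_stub(encrypted) == original   # at run time: it was code all along
```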

Another example where the duality between code and data comes into play is with exploits. Take for example SEH overwrites. An SEH overwrite is a mechanism for gaining control of execution (typically during a stack based buffer overflow) without using the return pointer. SEH stands for Structured Exception Handling, an error/exception handling mechanism used on the Windows operating system. One of the structures used to implement SEH sits on the stack, and can be overwritten during a stack-based buffer overflow. The structure consists of two pointers, and looks something like this:

| Pointer to previous SEH structure | Pointer to handler code |

Now in theory, a pointer, regardless of what it points to, is considered data. The value of a pointer is the address of something in memory, hence data. During an exploit that gains control by overwriting the SEH structure, the pointer to the previous SEH structure is typically overwritten with a jump instruction (often a jump to shellcode.) The pointer to handler code is typically overwritten with the address of a set of pop+pop+ret instructions. The SEH structure now looks like:

| Jump (to shellcode) | Pointer to pop+pop+ret |

The next thing that happens is to trigger an exception so that the pop+pop+ret executes. Due to the way the stack is set up when the exception handling code (now pop+pop+ret) is called, the ret transfers control to the jump (to shellcode). The code/data duality comes into play since this was originally a pointer (to the previous SEH structure) but is now code (a jump). In the context of the operating system it is treated as data, but in the context of the pop+pop+ret it is treated as code.
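A byte-level sketch of the overwritten structure helps make this concrete. Everything here is fabricated for illustration (the pop+pop+ret address is made up, and real exploits depend on the target’s exact layout):

```python
import struct

# First field: instead of a pointer to the previous SEH record, a short jump.
# EB 06 is "jmp $+8" (6 bytes past the end of the instruction), which hops
# over the two NOP padding bytes and the 4-byte handler field toward shellcode.
overwritten_next = b"\xEB\x06\x90\x90"

# Second field: instead of a pointer to a legitimate handler, the
# (hypothetical) address of a pop+pop+ret sequence.
overwritten_handler = struct.pack("<I", 0x7C123456)

seh_record = overwritten_next + overwritten_handler
assert len(seh_record) == 8  # two 32-bit fields, matching the diagram above
```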

Information context goes beyond just code/data duality. Take for example string searches (a.k.a. keyword searches), which are commonly used in digital forensic examinations to generate new leads to investigate. Assume a suspect is accused of stealing classified documents. Searching the suspect’s media (hard disk, USB drive, digital camera, etc.) for the string “Classified” (amongst other things) would seem prudent. However, a hit on the word “Classified” does not necessarily suggest guilt, though it does warrant further investigation. For instance, the hit could come from a classified document, or it could come from a cached HTML page for “Classified Ads”. Again, the key is to understand the context in which the hit occurs.
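A trivial sketch of why a raw hit needs context (the file names and contents below are fabricated):

```python
# Two fabricated pieces of "evidence" that both contain the keyword.
evidence = {
    "docs/report.doc": "This document is Classified SECRET.",
    "cache/page1.html": "<title>Classified Ads - Used Cars</title>",
}

keyword = "Classified"
hits = {name for name, text in evidence.items() if keyword in text}

# Both files hit; only the surrounding content tells an examiner which matters.
assert hits == {"docs/report.doc", "cache/page1.html"}
```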

To summarize this post in a single sentence: Information context can be vital in understanding/analyzing what you’re looking at.

Self replicating software – Part 3 – Other methods

Up until now, this thread of posts has been rather theoretical, talking about Turing machines and the like; the only source code so far was a program that prints out a description of itself (its source code).

Well, one problem with the self-replication method for getting a copy of a program’s description is that the program needs to contain a partial copy (and it computes the rest). This can cause some nasty code bloat. The method many viruses use is to copy themselves from another spot in memory.

For instance, if you have an executable file being run in memory (the host), and a portion of the executable code is a “virus” (a.k.a. viral code), then by virtue of the fact that the host program is running in memory, a copy of the virus code is also in memory. All the virus needs to do is locate the start of itself in memory and copy itself out. No need for the recursion theorem, no need to carry a partial copy of itself.

Some simple methods for finding where the virus is in memory are:

  1. Always be located at the same memory address
  2. When the virus is infecting another host file, embed in the host file the location of the virus
  3. Once the virus starts, locate the address of the first instruction
  4. Scan the host file’s memory for a signature of the virus

Accomplishing the first method can require some careful forethought. In essence, the virus has to find a location in memory that is not normally taken by some other portion of code or data. When the “virus” part of the executable code runs, it can copy itself out of memory, since it knows where it starts (assuming it also knows its size).

The second method requires that the virus (rather the virus’s author) understand a bit about the loading process that the executable file undergoes when it is loaded (assuming the virus is infecting a file on disk rather than a process running in memory). The virus can then “hard-code” the address of the viral code in the newly infected host. When the newly infected host file gets executed (e.g. by the user) and control gets transferred to the virus portion of the host, the virus knows where it is in memory since it was recorded by the previous incarnation.

The third method normally requires a bit of hackery. When the virus portion of the host file gets control of execution, it needs to execute a small sequence of instructions to figure out where it is. The classic way of doing this in Intel x86 assembly is:

0x8046453: CALL 0x8046458
0x8046458: POP EAX

The first line calls a function which starts at the second line. The call instruction first pushes the address of the next instruction (which would be 0x8046458) onto the stack, and then transfers control to the operand of the call (which is also 0x8046458). The second line pops the last value that was pushed onto the stack (the address in memory) and saves a copy in the EAX register.

The fourth method simply requires that the virus have a static signature, and that once control of execution gets transferred to the virus portion of the host file, the virus scan memory until it finds the signature. Of course there is the possibility for a false positive, but by carefully choosing the signature, this could be reduced.
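The fourth method can be sketched in a few lines of Python. The signature below is a made-up marker, and a real virus would scan its host process’s memory rather than a bytes object; this only illustrates the search logic:

```python
SIGNATURE = b"\xDE\xAD\xBE\xEF"  # hypothetical static marker inside the viral code

def locate_self(memory: bytes) -> int:
    """Return the offset of the viral code's signature, or -1 if absent."""
    return memory.find(SIGNATURE)

# Simulated host memory: 100 bytes of other code/data, then the "virus".
host_memory = b"\x00" * 100 + SIGNATURE + b"\x90" * 32

assert locate_self(host_memory) == 100
```

Note that `find` returns the first match, which is exactly where a false positive could bite if the signature also occurs in innocent data.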

There is another method that I haven’t described: the virus can obtain a copy of itself from a file on disk. This generally works for script-based viruses, but also has the potential to work for other types of executable files. This was the method used by the biennale.py virus.

This ends this thread of posts (at least for a little while). The original reason for creating this thread was to explain a little bit about how viruses work. This is useful for computer forensic analysts/examiners examining viruses, since understanding the basics goes a long way when examining actual specimens. Of course there is much more than the simplification given in this series of posts, but hopefully this can provide some guidance.

Self replicating software – Part 2 – Recursion theorem proof

In this post I’ll cover the proof of the Recursion theorem (see Self Replicating Software – Part 1 – The Recursion Theorem).

The proof for the Recursion theorem is a constructive proof, meaning that a Turing Machine (TM) that can reference its own description is constructed. This proof was taken from Michael Sipser’s “Introduction to the Theory of Computation”.

Before we get into the proof of the Recursion Theorem, we’ll need a special TM. There is a computable function q which, given an input string w, outputs the description of a TM that prints w.

So, if P_w is a machine that prints out w (where w is input to the machine), then q(w) prints out <P_w>, the description of P_w. The construction of such a machine is relatively simple. Let’s call the machine that computes q(w) Q_w. Here is an informal description:

Q_w = "
  Step 1: Construct a machine P_w = "
    Step 1.1: Print w
  "

  Step 2: Print <P_w>
"

Recall that <P_w> is notation for “The description of the Turing Machine P_w”.
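In Python terms, q is easy to realize. Here is a minimal sketch in which a “description” is just source code and repr stands in for proper quoting:

```python
def q(w: str) -> str:
    """Given w, return the description (source) of P_w, a program that prints w."""
    return "print(" + repr(w) + ")"

# q("hello") does not print "hello" itself; it returns a program that does.
description_of_P_hello = q("hello")
assert description_of_P_hello == "print('hello')"
```

Running the returned description (e.g. with exec) prints the original w, which is exactly the behavior required of P_w.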

Last time, we had a TM called TM_T that took two inputs, one being the description of a TM, the other an input called w, and performed some computation with them. We also had another TM called TM_R which took one input, namely the same w passed to TM_T, and performed the same computation as TM_T but with its own description (<TM_R>) filled in. Since TM_R only takes one input, w, the description of TM_R is internal to TM_R.

TM_R is constructed in a special manner, namely from three components. Let’s call these TM_A, TM_B, and TM_T, where TM_T is the same TM_T referenced before. TM_A is a special TM that generates a description of TM_B and TM_T. TM_B computes the description of TM_A. The tricky part is to avoid creating a cyclic reference. In essence, you can’t have TM_A print out <TM_B><TM_T> and then have TM_B print out <TM_A>, since they would reference each other directly. The trick is to have TM_B calculate <TM_A> by using the output from TM_A.

Think about this abstractly for a moment: if TM_A were to print some string w, then <TM_A> is simply q(w) (a.k.a. the description of a machine that prints w). In the case of the recursion theorem, TM_A is the machine that prints out <TM_B><TM_T>, and TM_B prints the description of a machine that prints out <TM_B><TM_T>.

So, the proof goes as follows:

TM_R(w) = "
TM_A which is P_<TM_B><TM_T>
TM_B which is Q_w
TM_T which is specified ahead of time

Step 1: Run TM_A
Step 2: Run TM_B on the output of TM_A
Step 3: Combine the output of TM_A with the output of TM_B and w
Step 4: Pass control to TM_T
"

Think about this: TM_A is a machine that prints out <TM_B><TM_T>, and TM_B computes the description of a machine that prints out <TM_B><TM_T>. Note the following:

TM_A = P_<TM_B><TM_T>
<TM_A> = <P_<TM_B><TM_T>>
<P_<TM_B><TM_T>> = q(<TM_B><TM_T>)

TM_B = Q_w where w = <TM_B><TM_T>

If TM_T simply prints the description, then TM_R as a whole is a quine. For TM_R to be a virus, TM_T would “print” the description to a file (somewhere on the tape).

We’ve covered a fair amount of theory between this post and the last post so let’s put it into some concrete code. In this example, TM_T will just print the description of the machine. This type of program is known as a quine or selfrep.

[lost@pagan code]$ grep -n ".*" quine.py
1:#!/usr/bin/python
2:
3:def TM_T(M):
4: print M;
5:
6:def TM_B(w):
7: computedA = '# Below is TM_A' + chr(0xA) + 's = ' + '"' + '""' + w + '"' + '"";' + chr(0xA) + 'TM_B(s);';
8: TM_T(w + computedA);
9:
10:# Below is TM_A
11:s = """#!/usr/bin/python
12:
13:def TM_T(M):
14: print M;
15:
16:def TM_B(w):
17: computedA = '# Below is TM_A' + chr(0xA) + 's = ' + '"' + '""' + w + '"' + '"";' + chr(0xA) + 'TM_B(s);';
18: TM_T(w + computedA);
19:
20:""";
21:TM_B(s);

[lost@pagan code]$ ./quine.py | md5sum; md5sum quine.py
dd9958745db4735ceab8101cae0a15a8 –
dd9958745db4735ceab8101cae0a15a8 quine.py
(Note the md5sum of the output and the md5sum of the input are the same)

Let’s analyze this piece by piece. In this program, TM_A is represented by lines 11-20. TM_A is the description of TM_B and TM_T. Notice that TM_A is a copy of (the descriptions of) TM_B and TM_T wrapped in triple quotes (one method in Python for representing strings).

TM_B is on lines 6-8 (the description it computes on line 7 includes the comment “# Below is TM_A”). Recall that TM_B computes the description of TM_A using the output from TM_A. The input to TM_B is the output of TM_A, which is the description of TM_B and TM_T. Finally, the output from TM_B is handed to TM_T, which prints the contents.

Here is the logic flow:

Lines 11-20 calculate P_<TM_B><TM_T>
Line 21 takes the output (s) and passes it to TM_B
Line 7 computes the description of TM_A (computedA) from the output of TM_A (which was passed to TM_B)
Line 8 combines the computed description of TM_A with the output of TM_A and passes the whole string to TM_T.
Line 4 prints the description of the machine

In the case of a virus, instead of printing the description to the screen, a virus would print the description to another python file.
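The listing above uses Python 2 syntax. For reference, here is a minimal Python 3 quine built on the same trick: the program carries a partial description (s), and repr computes the rest, playing q’s role. (It has no comments because any extra text would break the self-reproduction.)

```python
s = 's = %r\nprint(s %% s)'
print(s % s)
```

Piping this two-line program’s output through md5sum should give the same digest as the file itself (assuming the file ends with a single trailing newline), just as with quine.py above.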

What we’ve done is prove (from a theoretical point of view) why viruses can exist. The fact that viruses exist already tells us this, but we’ve now shown it from an abstract view. We’ve also given a basic algorithm for creating viruses, and shown how to implement something simple in Python.

In the next post we’ll cover another method (perhaps more commonly used) a virus can use to get its own description (one which doesn’t depend on the recursion theorem).

Self replicating software – Part 1 – The Recursion Theorem

This is the first in a multi part post about computing theory and self replicating software. This post assumes you have knowledge and understanding of a Turing Machine (abbreviated TM). If you aren’t familiar with Turing Machines (TMs), you may want to take a look at the Wikipedia entry on the topic at http://en.wikipedia.org/wiki/Turing_machine before continuing. Much of the material for this post comes from Michael Sipser’s “Introduction to the Theory of Computation”.

Let’s assume there is a computable function t which takes two inputs and yields one output. This function can be stated as:

t: E* x E* -> E*

Call the TM that implements this function TM_T. Now if one of the inputs that TM_T takes is the description of a TM (perhaps a description of TM_T or the description of another TM) we can call TM_T a “Meta TM”, meaning it operates on TMs (more accurately, descriptions of TMs). TM_T could do a number of things with the description, such as emulate it (making TM_T a universal TM), or print the description to the TM’s tape (making TM_T a “quine”). For the purpose of the recursion theorem, it’s irrelevant what TM_T does with the description. So TM_T can be written as taking two inputs:
TM_T(<TM>, w)

Where <TM> is a description of a TM. The other argument (w) could be input to the TM described by the <TM> argument.

Now, assume there is another computable function r which takes one parameter, w, the same w passed as the second argument to function t. Function r can be stated as:

r: E* -> E*

Call the TM that implements this function TM_R. So TM_R can be written as taking one input:
TM_R(w)

The TM_R is described by <TM_R>.

The recursion theorem states (in terms of functions r and t):

r(w) = t(<TM_R>, w)

alternatively

TM_R(w) = TM_T(<TM_R>, w)

This means that the output from TM_T, given a description of TM_R and input w, is the same as the output of TM_R with input w. Think about the implications of this for a moment. If TM_T prints the description of the TM passed to it (which in this case is the description of TM_R), then TM_R can do the same thing without having the description passed in explicitly; instead, TM_R can calculate <TM_R>. In essence, TM_R can print out a description of itself without having the description given to it.

I’ll post the proof and examples in Python in later entries. The recursion theorem can allow a virus to obtain a description of itself and “print” the description into a file. There are, however, other methods that viruses can use to obtain descriptions of themselves, which will also be covered in later posts.

Base+Offset notation (or why we start counting with zero)

Every now and again, I get the question of why we start counting things such as arrays, offsets, etc. with zero (0) and not one (1). The answer is simple: when specifying a data structure, we normally give the byte (or whatever unit) offset of the start of each field, and an offset measures distance from the start. The first byte is zero bytes from the start (it is the start), so its offset is 0. Think how a computer would get to a specific element of a list. Say we want the third element (offset 2). We begin at the start of the list, which is the first element (offset 0). We move to the next element (offset 1), and then to the next element (offset 2), the third item, and read its contents. Here is what it looks like graphically:

0 1 2 3 4 5 6
A B C D E F G
^     (offset from A = 0)

Let’s say we want to get to element C (the third element, offset 2). If “A” is the start, then it is zero bytes from the start. That means that A’s offset is 0. There is a caret (^) beneath “A” to show the current position. Next we move forward one element (offset = 1) to the second element, which looks like:

0 1 2 3 4 5 6
A B C D E F G
  ^   (offset from A = 1)

Finally, we move forward one more element (offset = 2) to the third element, which looks like:

0 1 2 3 4 5 6
A B C D E F G
    ^ (offset from A = 2)
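The same walk, in Python’s zero-based indexing:

```python
elements = ["A", "B", "C", "D", "E", "F", "G"]

assert elements[0] == "A"  # offset 0: zero steps from the start
assert elements[1] == "B"  # offset 1: one step from the start
assert elements[2] == "C"  # offset 2: the third element
```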

For example, the FAT boot sector starts out like this: (Note: this was taken from http://support.microsoft.com/kb/q140418/)

Field                 Offset     Length
-----                 ------     ------
Bytes Per Sector      11         2
Sectors Per Cluster   13         1
Reserved Sectors      14         2
FATs                  16         1
Root Entries          17         2
Small Sectors         19         2
Media Descriptor      21         1
Sectors Per FAT       22         2
Sectors Per Track     24         2
Heads                 26         2
Hidden Sectors        28         4
Large Sectors         32         4

Notice that the “Bytes Per Sector” field (which tells us the size of a sector, usually 512) is at offset 11. This means the field starts 11 bytes from the start (in other words, at the 12th byte), where the offset of the first byte is 0 (the start).
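Here’s a hedged sketch of reading a couple of these fields with Python’s struct module. The sector below is fabricated with only the fields we care about filled in; FAT’s multi-byte fields are little-endian:

```python
import struct

# Build a fake 512-byte boot sector for illustration.
boot_sector = bytearray(512)
struct.pack_into("<H", boot_sector, 11, 512)  # Bytes Per Sector: offset 11, 2 bytes
boot_sector[13] = 8                           # Sectors Per Cluster: offset 13, 1 byte

# Parse the fields back out using the offsets from the table above.
bytes_per_sector, = struct.unpack_from("<H", boot_sector, 11)
sectors_per_cluster = boot_sector[13]

assert bytes_per_sector == 512
assert sectors_per_cluster == 8
```

This is exactly the “fixed offset” style of location described at the top of this series: the offsets in the table, counted from zero, drive the parsing directly.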