The basics of how programs are compiled and executed

Well, the post “The basics of how digital forensics tools work” seemed to be fairly popular, even getting a place on Digg. This post focuses on the basics of how a program gets compiled and loaded into memory when it is executed. It’s useful background for code analysis (reverse engineering), and is aimed at those who aren’t already familiar with compilers. The basic reasoning is that if you understand (at least at a high level) how compilers work, it will help when analyzing compiled programs. So, if you’re already familiar with code analysis, this post isn’t really aimed at you. If, however, you’re new to the field of reverse engineering (specifically code analysis), this post is for you.

Compiling is program transformation

From an abstract perspective, compiling a program is really just transforming the program from one language into another. The “original” language that the program is written in is commonly called the source language (e.g. C, C++, Python, Java, Pascal, etc.), and the program as written in the source language is called the source code. The “destination” language, the language the program is translated into, is commonly called the target language. So compiling a program is essentially translating it from the source language to the target language.

The “typical” notion of compiling a program is transforming the program from a higher-level language (e.g. C, C++, Visual Basic, etc.) to an executable file (PE, ELF, etc.). In this case the source language is the higher-level language, and the target language is machine code (the binary encoding of instructions that assembly mnemonics represent). Realize, however, that going from source code to executable file involves more than just transforming source code into machine code. When you run an executable file, the operating system needs to set up an environment (process space) for the code contained in the executable file to run inside of. For instance, the operating system needs to know which external libraries will be used, which parts of memory should be marked executable (i.e. can contain directly executable machine code), and where in memory to start executing code. Two additional types of programs, linkers and loaders, accomplish these tasks.

Compiler front and back ends

Typically a compiler is composed of two primary components: the front end and the back end. The front end takes in the source code, analyzes its structure (doing things such as checking for errors), and creates an intermediate representation of the source code suitable for the back end. The back end takes the output of the front end (the intermediate representation), optionally performs optimization, and translates the intermediate representation into the target language. When compiling a program, it is common for a compiler’s back end to generate human-readable assembly code (mnemonics) and then invoke an assembler to translate that assembly code into its byte-code representation, which is suitable for a processor.
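
To make the division concrete, here is a minimal toy compiler (in Python, purely illustrative): the front end tokenizes and parses arithmetic expressions into an intermediate representation (a small syntax tree), and the back end walks that tree and emits mnemonics for an imaginary stack machine. Real compilers are, of course, vastly more involved.

```python
# A toy compiler: front end (tokenize + parse to an IR) and back end
# (translate the IR into stack-machine mnemonics).
import re

def front_end(source):
    tokens = re.findall(r'\d+|[+*]', source)
    def parse_term(pos):                  # term := NUM ('*' NUM)*
        node, pos = ('num', int(tokens[pos])), pos + 1
        while pos < len(tokens) and tokens[pos] == '*':
            node = ('*', node, ('num', int(tokens[pos + 1])))
            pos += 2
        return node, pos
    def parse_expr(pos):                  # expr := term ('+' term)*
        node, pos = parse_term(pos)
        while pos < len(tokens) and tokens[pos] == '+':
            rhs, pos = parse_term(pos + 1)
            node = ('+', node, rhs)
        return node, pos
    ast, _ = parse_expr(0)
    return ast                            # the intermediate representation

def back_end(ir):
    if ir[0] == 'num':
        return ['PUSH %d' % ir[1]]
    return back_end(ir[1]) + back_end(ir[2]) + ['ADD' if ir[0] == '+' else 'MUL']

print('\n'.join(back_end(front_end('1 + 2 * 3'))))
# PUSH 1, PUSH 2, PUSH 3, MUL, ADD
```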

Realize that it’s not an absolute requirement that a compiler be divided into front and back ends; it is certainly possible to create a compiler that translates directly from source code to target language. There are, however, benefits to the front/back end division, such as reuse of components and ease of development.

Linkers and Loaders

The compiler took our source code and translated it into executable machine instructions, but we don’t yet have an executable file, just one or more files that contain executable code and data. These files are typically called object code, and in many instances aren’t suitable to stand on their own. There are (at least) three high-level tasks that still need to be performed:

  1. We still need to handle referencing dependencies, such as variables and functions (possibly in external code libraries). This is called symbol resolution.
  2. We still need to arrange the object code into a single file, making sure separate pieces do not overlap, and adjusting code (as necessary) to reflect the new locations. This is called relocation.
  3. When we execute the program, we still need to set up an environment for it to run in, as well as load the code from disk into RAM. This is called program loading.

Conceptually, the line between linkers and loaders tends to blur near the middle. Linkers tend to focus on the first item (symbol resolution) while loaders tend to focus on the third item (program loading). The second item (relocation) can be handled by either a linker or loader, or even both. While linkers and loaders are often separate programs, there do exist single linker-loader programs which combine the functionality.
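
To make symbol resolution and relocation more tangible, here is a toy model (all names, sizes, and addresses are made up): two “object files” are laid out one after the other so they don’t overlap, each defined symbol gets a final address, and every reference is then patched to point at the right place.

```python
# Toy linking: lay out object code without overlap (relocation) and
# patch references with final symbol addresses (symbol resolution).
objects = [
    {'name': 'main.o', 'size': 0x40, 'defines': {'main': 0x00},
     'references': {0x10: 'helper'}},    # offset 0x10 needs helper's address
    {'name': 'util.o', 'size': 0x20, 'defines': {'helper': 0x04},
     'references': {}},
]

base = 0x1000                            # chosen load address
symbols, layout = {}, {}
for obj in objects:                      # pass 1: place pieces, collect symbols
    layout[obj['name']] = base
    for sym, off in obj['defines'].items():
        symbols[sym] = base + off
    base += obj['size']

for obj in objects:                      # pass 2: patch each reference
    for off, sym in obj['references'].items():
        print('%s: patch 0x%X to point at %s (0x%X)'
              % (obj['name'], layout[obj['name']] + off, sym, symbols[sym]))
# main.o: patch 0x1010 to point at helper (0x1044)
```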

Linkers

The primary job of a linker is symbol resolution; that is, resolving references to entities in the code. For example, a linker might be responsible for replacing references to the variable X with a memory address. The output from a linker (at compile time) typically includes an executable file, a map of the different components in the executable file (which facilitates future linking and loading), and (optionally) debugging information. The different components that a linker generates don’t always have to be separate files; they could all be contained in different parts of the executable file.

Sometimes you’ll hear references to statically or dynamically linked programs. Both of these refer to how different pieces of object code are linked together at compile time (i.e. prior to program loading). Statically linked programs contain all of the object code they need in the executable file. Dynamically linked programs don’t contain all of the object code they need; instead they contain enough information so that, at a later time, the needed object code (and symbols) can be found and made accessible. Since statically linked programs contain all the object code and information they need, they tend to be larger.

There is another interesting aspect of linkers: some perform their work at compile time, while others perform symbol resolution during program loading, or even at run time. Linkers which work at compile time are called compile time linkers, while linkers which work at load or run time are called dynamic linkers. Don’t confuse dynamic linkers with dynamically linked executable files. The information in a dynamically linked executable is used by a dynamic linker. One is your code (the dynamically linked executable); the other helps your code run properly (the dynamic linker).
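
As a quick illustration of run-time symbol resolution, Python’s ctypes module asks the dynamic linker to map a shared library into the current process and look up a symbol by name. (The library name below assumes Linux; on macOS it would be libc.dylib.)

```python
# Ask the dynamic linker to load the C library and resolve "strlen".
import ctypes

libc = ctypes.CDLL('libc.so.6')          # loaded and resolved at run time
strlen = libc.strlen                     # symbol lookup by name
strlen.argtypes = [ctypes.c_char_p]
strlen.restype = ctypes.c_size_t

print(strlen(b'hello'))                  # 5
```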

Loaders

There are two primary functions typically assigned to loaders: creating and configuring an environment for your program to execute in, and loading your program into that environment (which includes starting your program). During the loading process, the dynamic linker may come into play to help resolve symbols in dynamically linked programs.

When creating and configuring the environment, the loader needs information that was generated by the compile time linker, such as a map of which sections of memory should be marked executable (i.e. can contain directly executable code), as well as where the various pieces of code and data should reside in memory. When loading a dynamically linked program, the loader also needs to load the required libraries into the process environment. This loading can happen when the environment is first created and configured, or at run time.

In the process of creating and configuring the environment, the loader transfers the code (and initialized data) from the executable file on disk into the environment. After the code and data have been transferred to memory (and any necessary modifications for load time relocation have been made), the code is started. In essence, the program is started by telling the processor to execute at the first instruction of the code in memory. The address of this first instruction is called the entry point, and is typically set by the compile time linker.
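
As a small illustration, the entry point can be read straight out of an executable’s headers. Per the ELF specification, the e_entry field of a 64-bit ELF file sits at offset 0x18; this sketch assumes a little-endian binary and does no real error handling.

```python
# Read the entry point address from a 64-bit, little-endian ELF file.
import struct

def elf64_entry_point(path):
    with open(path, 'rb') as f:
        header = f.read(0x20)                      # enough to cover e_entry
    if header[:4] != b'\x7fELF':
        raise ValueError('not an ELF file')
    (entry,) = struct.unpack_from('<Q', header, 0x18)  # e_entry: 8 bytes at 0x18
    return entry

print(hex(elf64_entry_point('/bin/ls')))           # value varies by binary
```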

Further reading

Well, that about wraps it up for this introduction. There are some good books out there on compilers, linkers, and loaders. The classic book on compilers is Compilers: Principles, Techniques, and Tools. I have the first edition, and it is a bit heavy on the computer science theory (although I just ordered the second edition, which was published in August 2006). The classic book on linkers and loaders is Linkers and Loaders. It too is a bit abstract, but considerably lighter on the theory. If you’re examining code in a Windows environment, I’d also recommend reading (at least parts of) Microsoft Windows Internals, Fourth Edition.

Digital forensics in a comic

I saw this the other day. Hmmm… sifting through lots of data to find specific pieces of information, I think I see an interesting application for this… 🙂

Regular expression comic
http://xkcd.com/c208.html

Self replicating software – Part 4 – The difference between worms and viruses

This is the fourth installment of the series on self-replicating software. This post deals with worms (a subset of computer viruses).

Briefly, a computer virus is a program that infects other programs with an optionally mutated copy of itself. This is the basic definition that Fred Cohen (the “father” of computer viruses) used in “Computer Viruses – Theory and Experiments.” If you look at the previous posts in this category, we examined how and why viruses can exist by way of the recursion theorem, as well as a few other methods.

A computer worm is a virus that spreads, where each new infection continues spreading without the need for human intervention (e.g. a human loading and executing the newly infected file). In essence, a computer worm is a virus that, after infection, starts the newly created (optionally mutated) copy of itself. Cohen states this (and a more formal definition) in “A Formal Definition of Computer Worms and Some Related Results.”

Since computer worms are a subset of viruses, many of the same theories apply, including applications of the recursion theorem. What is interesting about computer worms is the potential to spread very quickly due to their inherent automation.

Realize that this definition of a computer worm focuses on the spreading behavior of the malicious code, not the method that is used for spreading. This leads us to some interesting anomalies with different definitions of computer worms. I’ve found the following definitions of computer worms used in various places:

  1. Worms are programs that replicate themselves from system to system without the use of a host file. This is in contrast to viruses, which require the spreading of an infected host file. (Symantec, Wikipedia, Cisco, et al.)
  2. A worm is a small piece of software that uses computer networks and security holes to replicate itself. (HowStuffWorks)
  3. A worm self-propagates (Common response I’ve heard from various security folks)

The first definition is used by a number of folks, including some big vendors. If you look at the Cohen definition of viruses, there is no requirement that the victim program (the one that gets infected) exist. If the victim program doesn’t exist, then the mutated version of the virus is a standalone. If the standalone is started by a human it’s a virus; if it is started automatically it’s a worm. Think of a virus that comes in the form of an infected executable (i.e. it has a host file) and then “drops” a copy of itself as a standalone. Another possible scenario would be a standalone program that infects another file. By the first definition, would these be classified as viruses or worms? (Hint: the definition doesn’t cover this type of scenario.)
The second definition basically requires that a worm spread between more than one computer system. Again, per Cohen, this isn’t a requirement. A worm can spread between processes on the same machine. The worm hasn’t crossed into another machine; however, the code is still propagating without the need for human intervention.

The last definition is a bit ambiguous, which is why I tend to avoid it. The ambiguity comes from the fact that “self-propagating” doesn’t necessarily exclude human intervention. Under one interpretation a virus is self-propagating in that the viral code is what copies itself (i.e. a user doesn’t need to copy a file around). Under another interpretation only a worm is self-propagating, since each new copy continues to propagate on its own.

I recommend reading the Cohen papers mentioned earlier. They’re a bit heavy on the math if you aren’t a computer science / math type, although they do explain what is going on.

Two tools to help debug shellcode

Here are two small tools to help debug/analyze shellcode. The goal of both tools is to provide an executable environment for the shellcode, since shellcode is usually intended to run in the context of a running process and by itself lacks the environment an executable file normally provides.

The first tool, make_loader.py, is a Python script which takes the name of a file containing shellcode and outputs a compilable C file with the shellcode embedded. If you compile the output, the resulting executable runs the shellcode.
The second tool, run_shellcode, is a C program (you have to compile it) which, at run time, loads shellcode from disk into memory (and then transfers execution to the shellcode). A neat feature of this tool is that it can be completely controlled by a configuration file, meaning you only need to load the file into a debugger once; you can examine new shellcode by changing the configuration file.
Both tools allow you to specify whether you want to automatically trap to the debugger (typically via an int 3), and to skip over a number of bytes at the start of the file that contains the shellcode. The automatic debugger trap is nice so you don’t always have to explicitly set a breakpoint. The skip is nice if the shellcode doesn’t sit at the start of the file and you don’t want to bother stripping out the unnecessary bytes. Think Wireshark “Follow TCP Stream” with a “Save”.
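
For a flavor of how the first tool works, here is a sketch along the same lines (this is not the actual make_loader.py, just the idea): read the raw shellcode, optionally skip leading bytes, optionally prepend an int 3 (0xCC) so the debugger traps right before the shellcode executes, and emit a compilable C file.

```python
# Sketch of a make_loader-style generator (illustrative, not the tool itself).
import sys

def make_loader(shellcode_path, skip=0, trap=True):
    with open(shellcode_path, 'rb') as f:
        code = f.read()[skip:]              # skip uninteresting leading bytes
    if trap:
        code = b'\xCC' + code               # int 3: break into the debugger first
    array = ', '.join('0x%02X' % b for b in code)
    return ('unsigned char shellcode[] = { %s };\n'
            'int main(void) { ((void (*)(void))shellcode)(); return 0; }\n' % array)

if __name__ == '__main__':
    sys.stdout.write(make_loader(sys.argv[1]))
```

Note that on systems with DEP/NX enabled you’d also have to mark the buffer (or its section) executable for the generated loader to actually run.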

An alternative to these tools is shellcode2exe, although I didn’t feel like installing PHP (and a webserver).

Here are the files….
run_shellcode.c 1.0
make_loader.py 1.0

Site move

Welcome to the new Forensic Computing blog (forensicblog.org). The old site (forensiccomputing.blogspot.com) is no longer active, although I will keep it up for archival purposes. I’m no longer on blogger, instead this is a self-hosted WordPress installation.

The basics of how digital forensics tools work

I’ve noticed there is a fair amount of confusion about how forensics tools work behind the scenes. If you’ve taken a course in digital forensics this will probably be “old hat” for you. If, on the other hand, you’re starting out in the digital forensics field, this post is meant for you.

There are two primary categories of digital forensics tools, those that acquire evidence (data), and those that analyze the evidence. Typically, “presentation” functionality is rolled into analysis tools.

Acquisition tools, well… acquire data. This is actually the easier of the two categories to write, and there are a number of acquisition tools in existence. There are two ways of storing the acquired data: on a physical disk (disk-to-disk imaging) or in a file (disk-to-file imaging). The file that the data is stored in is also referred to as a logical container (or logical evidence container, etc.). There are a variety of logical container formats, with the most popular being DD (a.k.a. raw, as well as split DD) and EWF (Expert Witness, a variant used with EnCase). There are other formats as well, including sgzip (seekable gzip, used by PyFlag) and AFF (Advanced Forensics Format). Many logical containers allow an examiner to include metadata about the evidence, including cryptographic hashes and information about how and where the evidence was collected (e.g. the technician’s name, comments, etc.)
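
At its core, a disk-to-file imager is just a read/write loop that hashes as it goes, so the image can later be verified against the original. A bare-bones sketch (the device path and chunk size are arbitrary; real tools add bad-sector handling, metadata, and more):

```python
# Minimal disk-to-file acquisition: copy the source in chunks and
# compute a cryptographic hash for later verification.
import hashlib

def acquire(source_path, image_path, chunk_size=1024 * 1024):
    md5 = hashlib.md5()
    with open(source_path, 'rb') as src, open(image_path, 'wb') as img:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            img.write(chunk)
            md5.update(chunk)
    return md5.hexdigest()

print(acquire('/dev/sdb', 'evidence.dd'))    # record the hash in the case notes
```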

Analysis tools work in two major phases. In the first phase, the tool reads in the evidence (data) collected by the acquisition tools as a series of bytes and translates the bytes into a usable structure. An example of this would be code that reads in data from a DD file and “breaks out” the different components of a boot sector (or superblock on EXT2/3 file systems). In the second phase, the analysis tool examines the structures that were extracted in the first phase and performs some actual analysis. This could be displaying the data on screen in a more human-friendly format, walking directory structures, extracting unallocated files, etc. An examiner will typically interact with the analysis tool, directing it to analyze and/or produce information about the structures it extracts.
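
As a small example of the first phase, here is a sketch that reads the first sector of a raw (DD) image and breaks out a few of the BIOS Parameter Block fields that FAT and NTFS boot sectors share; the field offsets come from the published boot sector layout, and the image name is arbitrary.

```python
# Phase one: translate raw bytes into a usable structure.
import struct

def parse_boot_sector(image_path):
    with open(image_path, 'rb') as f:
        sector = f.read(512)                 # first sector of the image
    return {
        'oem_name': sector[3:11].decode('ascii', 'replace'),          # offset 3, 8 bytes
        'bytes_per_sector': struct.unpack_from('<H', sector, 11)[0],  # offset 11, 2 bytes
        'sectors_per_cluster': sector[13],                            # offset 13, 1 byte
    }

print(parse_boot_sector('evidence.dd'))
# e.g. {'oem_name': 'NTFS    ', 'bytes_per_sector': 512, 'sectors_per_cluster': 8}
```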

Presentation of digital evidence (and conclusions) is an important part of digital forensics, and is ultimately the role of the examiner, not a tool. Tools, however, can support presentation. EnCase allows an examiner to bookmark items, and ProDiscover allows an examiner to tag “Evidence of Interest”. The items can then be exported as files, to Word documents, etc. Some analysis tools also have built-in functionality to help with creating a report.

Of course, there is a lot more to the implementation of the tools than the simplification presented here, but these are the basics of how digital forensics tools work.

Digital Forensics Documentation

One aspect of digital forensics that is often overlooked is documentation. If you’ve ever taken a class in incident response or digital forensics, undoubtedly you’ve heard about the need to properly document your work. Really, the thousand-foot goal of documentation is to provide an audit trail of what actions you took and (sometimes) why you performed those actions. Ultimately, you may be required to defend the actions you took, and the documentation of your actions is likely to play a role.

A few of the different reasons why documentation of your actions is important are:

  • Your work will be validated by another party
  • Your work will be examined for actions that affirm/deny a statement or fact
  • Your work will be used as the basis for a decision

These aren’t the only reasons, just some of the more common ones. The first bullet point deals with the concept that both sides should get a fair chance to examine the evidence. For instance, if your report supports a theory that a suspect executed a malicious executable on a specific system, the areas that you analyzed (e.g. prefetch directories, access times, etc.) are likely to be of interest. If you document the different aspects of the evidence that you examined, then another examiner can recreate the substantive content of your actions, and examine the same areas.

This leads to another question I frequently get asked: how much documentation is adequate? The answer is that it depends on who your audience is. There is an adage that says “write your report so that it can be used in court” (or some similar wording). The problem is that this adage isn’t really helpful. One good idea that I’ve come across is to document your work so that someone else, with a similar knowledge level, can recreate your steps. There are two implicit notions in this statement that are worth clarifying: the similar knowledge level, and the lack of any mention of tools.

A person with a similar knowledge level would ideally be able to arrive at similar statements as you. The catch is: what is reasonable for a similar knowledge level? If you take a look at the syllabuses for a number of digital forensics courses, you’ll notice that they cover a lot of the same ground: file systems, operating system specifics, etc. It would be fair to assume that someone at a similar knowledge level would have learned the “basics”. On the other hand, if you have come up with some new technique, it might be prudent to explain the technique in a bit more depth (or at least reference it if the technique is published elsewhere). Typically, the more in-depth explanations can go in an appendix, although that’s not a strict requirement, more a matter of style.

The lack of any mention of a specific tool is also noteworthy. It might be a good idea to document what tool you used, but someone else isn’t required to use the same tool to examine the same evidence. The idea behind this is that, assuming a tool works correctly, and the output from the tool was interpreted in the manner that was intended, another person should be able to arrive at similar conclusions using their own methods. This is why it’s important to understand how your tools work behind the scenes. You don’t necessarily need to know that your specific tool stores the various MFT entries in a linked list (although it’s possible you might). Understanding what the tool does, and its limitations, however, is likely more important.

Another common issue is what to put in your final report. The answer is that it really depends on who your audience is, so ask them. For instance, if your work supports corporate security, find out from them what it is that they want. They may say they only want a few specific things, and that’s it. If you’re working for an attorney, they will likely have a specific strategy and be able to provide you with enough guidance as to what they want. It might be that they only want a list of the names of files created within a specific time frame. In this case, a spreadsheet of the file names would likely be sufficient. For matters of style, you might want to format the report to look a little more professional than just a spreadsheet (e.g. letterhead, readable font, etc.). Again, the choice is up to you.

What CSI does right

I was at a training class last year and the instructor made a good point about the TV show CSI (Crime Scene Investigation). While the actual techniques/methods/etc. the show uses may not always be accurate with respect to real life (some are, some aren’t), the characters do perform a lot of experiments. If you don’t know the answer to a question, you can do some research and hope there is an answer somewhere, or you can perform some experimentation yourself.

Deductive and Inductive reasoning

One thing that I see on a fairly regular basis is confusion between deductive and inductive reasoning. Both types of reasoning play different roles in investigations, forensics, science, etc. The difference between the two is sometimes hard to define. Here are two common definitions:

1. With deductive reasoning, the conclusions are contained, whether explicit or implicit, in the premises. With inductive reasoning, the conclusions go beyond what is contained in the premises.

2. The conclusions arrived at using (correct) deductive logic are necessarily true, meaning they must be true. The conclusions arrived at using inductive logic, are not necessarily true, although they may be.

An example might clarify things (taken from a philosophy class I took years ago):

  1. If I study, I will get an A on the exam (premise)
  2. I studied (premise)
  3. Therefore I got an A on the exam (conclusion)

In this case, since I studied, I got an A on the exam. The conclusion (I got an A on the exam) is contained implicitly in 1 and 2. For the geeks in us, here is a proof:

  1. If I study, then I will get an A on the exam [ IF A then B ]
  2. I studied [ A ]
  3. Therefore I got an A on the exam [ B ] (modus ponens on 1 and 2)

With inductive reasoning however:

  1. If I study, then I will get an A on the exam (premise)
  2. I got an A on the exam (premise)
  3. Therefore I studied (conclusion)

Just because I got an A on the exam doesn’t imply I studied; I could have cheated. For the geeks in us, here is an (incorrect) proof:

  1. If I study, then I will get an A on the exam [ IF A then B ]
  2. I got an A on the exam [ B ]
  3. Therefore I studied (no valid logical argument; we never established A)

The key in these examples is in parts 1 and 2. With deductive reasoning we had A and followed the if chain [ (IF A then B) ^ A yields B ]. With the inductive reasoning we have no A, only B. In terms of logic this is confusing an “if” statement with an “if and only if”, where the former requires one direction of truth and the latter requires two.
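
If you want to convince yourself, the two argument forms can be checked by brute force over all truth assignments: modus ponens never yields a false conclusion from true premises, while the inductive form (affirming the consequent) does.

```python
# Enumerate all truth values of A and B. "implies(a, b)" is the truth
# table of "IF A then B" (false only when A is true and B is false).
from itertools import product

def implies(p, q):
    return (not p) or q

for a, b in product([True, False], repeat=2):
    if implies(a, b) and a and not b:
        print('modus ponens broke at A=%s B=%s' % (a, b))        # never happens
    if implies(a, b) and b and not a:
        print('affirming the consequent broke at A=%s B=%s' % (a, b))
# prints: affirming the consequent broke at A=False B=True
```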

So how does this play into investigations, forensics, etc.? The idea is to be careful with the conclusions drawn. For instance (relating back to the blog post about context), if an examiner finds the string “hacker” on a hard disk, the hit doesn’t necessarily mean that a “hacker” was on the system, nor does it necessarily mean that “hacker” tools were used. The data around the string would (hopefully) provide more context. Even the presence of “hacker” tools doesn’t mean that the suspect actually used them, nor does it necessarily mean that the suspect even introduced them to the system. These types of questions are often raised with “The Trojan Defense”.

One (common) misunderstanding of deductive and inductive reasoning involves our legal system, which depends heavily on inductive reasoning (inferences). For instance, take the case of Keith Jones testifying at the UBS trial. Keith Jones testified about what was found on UBS systems, various sources of logs (e.g. WTMP logs, provider logs, etc.), and his analysis of the information. Does this prove with 100% certainty that the suspect (Duronio) actually committed the crime? No, it doesn’t. However, with a substantial amount of evidence, a jury could reach the conclusion that the standard of “beyond a reasonable doubt” has been met.

Another example of deductive vs. inductive reasoning comes up with “courts approving digital forensic tools”. First, courts aren’t in the business of approving digital forensics tools. They may allow a person to testify about the use of, and conclusions drawn using, the tools. This is fundamentally different from saying “tool XYZ is approved by this court”. The reasoning for allowing an examiner to testify using the results obtained from a tool typically involves a trusted third party. Essentially, one (or more) third parties comes to a conclusion about the correctness of a tool. So the decision about allowing the original examiner to testify about the results found using the tool depends on what a third party thinks. This leads to the question: just because a third party thinks so, is it guaranteed to be true? Perhaps yes, perhaps no. [Note: I’m not commenting about any specific digital forensics tool; this could apply to any situation involving any type of tool or even process. This is one type of review used when considering whether or not to allow a scientific tool/process/technique into court.]

Information Context (a.k.a. Code/Data Duality)

One concept that pervades digital forensics, reverse engineering, exploit analysis, even computing theory is that in order to fully understand information, you need to know the context the information is used in.

For example, categorize the following four pieces of information as either code or data:

1) push 0x6F6C6C65
2) “hello” (without the quotes)
3) 448378203247
4) 110100001100101011011000110110001101111

Some common answers might be:
1) Code (x86 instruction)
2) Data (string)
3) Data (integer)
4) Code or Data (unknown)

Really, all four are encoded internally (at least on an Intel architecture) the same way, so they’re all code and they’re all data. Inside the metal box, they all have the same binary representation. The key is to know the context in which they’re used. If you thought #1 was code, this is probably based on past experience (e.g. having seen x86 assembly before). If the five ASCII bytes ‘h’ ‘e’ ‘l’ ‘l’ ‘o’ were in a code segment, then it is possible that they were instructions; however, it is also possible that they were data. This is because code and data can be interspersed even in a code segment (as is often done by compilers).
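
You can verify this yourself: all four items are literally the same five bytes. The x86 opcode 0x68 is push imm32, and 0x68 also happens to be the ASCII code for ‘h’, so a push of the little-endian immediate 0x6F6C6C65 spells out “hello”.

```python
# Four "different" pieces of information, one binary representation.
import struct

string  = b'hello'
push    = b'\x68' + struct.pack('<I', 0x6F6C6C65)  # push 0x6F6C6C65 (opcode 0x68 = 'h')
integer = (448378203247).to_bytes(5, 'big')
binary  = int('110100001100101011011000110110001101111', 2).to_bytes(5, 'big')

print(string == push == integer == binary)         # True
```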

Note: Feel free to skip the following paragraph if you’re not really interested in the theory behind why this is possible. The application of information context is in the rest of the post.
The reason code and data can be interspersed is often attributed to the Von Neumann architecture, which stores both code and data in a common memory area. Other computing architectures, such as the Harvard architecture, have separate memory storage areas for code and data. However, this is an implementation issue, as information context (or the lack thereof) can also be seen in Turing machines. In a Universal Turing Machine (UTM), you can’t distinguish between code and data if you encode both the instructions to the UTM and the data to the UTM with the same set (or subset) of tape symbols. Both the instructions for the UTM and the data for the UTM sit on the tape, and if they’re encoded with the same set (or subset) of tape symbols, then just by looking at any given set of symbols there isn’t enough information about how the symbols are used to determine whether they’re instructions (e.g. move the tape head) or data. This type of UTM would be more along the lines of a Von Neumann machine. A UTM which was hard-coded (via the transition function) to use different symbols for instructions and data would be more along the lines of a Harvard machine.

One area where information context greatly determines understanding of code and data is in reverse engineering, specifically executable packers/compressors/encryption/etc. (Note: executable packers/compressors/encryption/etc. are really examples of self-modifying code, so this concept extends to self-modifying code as well). Here’s an abstract view of how common block-based executable encryption works (e.g. ASPack, etc.):

| Original executable | — Packer —> | Encrypted data |

The encrypted data is then combined with a small decryption stub resulting in an encrypted executable that looks something like:

| Decryption stub | Encrypted data |

When the encrypted executable is run, the decryption stub decrypts the encrypted data, yielding something that can look like:

| Decryption stub | Encrypted data | Original executable |

(Note: this is an abstract view; there are of course implementation details such as import tables, etc. It’s also possible for the encrypted data to be decrypted in place.)

The encrypted data is really code, just in a different representation. The same concept applies when you zip / gzip / bzip / etc. an executable from disk.

Another example where the duality between code and data comes into play is with exploits. Take for example SEH overwrites. An SEH overwrite is a mechanism for gaining control of execution (typically during a stack-based buffer overflow) without using the return pointer. SEH stands for Structured Exception Handling, an error/exception handling mechanism used on the Windows operating system. One of the structures used to implement SEH sits on the stack, and can be overwritten during a stack-based buffer overflow. The structure consists of two pointers, and looks something like this:

| Pointer to previous SEH structure | Pointer to handler code |

Now in theory, a pointer, regardless of what it points to, is considered data. The value of a pointer is the address of something in memory, hence data. During an exploit that gains control by overwriting the SEH structure, the pointer to the previous SEH structure is typically overwritten with a jump instruction (often a jump to shellcode). The pointer to handler code is typically overwritten with the address of a set of pop+pop+ret instructions. The SEH structure now looks like:

| Jump (to shellcode) | Pointer to pop+pop+ret |

The next thing that happens is to trigger an exception so that the pop+pop+ret executes. Due to the way the stack is set up when the exception handling code (now pop+pop+ret) is called, the ret transfers control to the jump (to shellcode). The code/data duality comes into play since this was originally a pointer (to the previous SEH structure) but is now code (a jump). In the context of the operating system it is treated as data, but in the context of the pop+pop+ret it is treated as code.
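
To tie the pieces together, here is a sketch (with a purely illustrative gadget address) of how the overwritten SEH record is commonly laid out in an attack buffer: the old “pointer to previous SEH structure” becomes a short jump over the handler pointer and into the shellcode that follows the record.

```python
# Layout of a typical fake SEH record in an attack buffer. The gadget
# address (0x10011234) is made up for illustration; in a real exploit it
# points at pop+pop+ret instructions somewhere in the process.
import struct

next_seh  = b'\xEB\x06' + b'\x90\x90'        # jmp short +6 (over the handler), nop padding
handler   = struct.pack('<I', 0x10011234)    # "pointer to handler" -> pop+pop+ret gadget
shellcode = b'\xCC' * 32                     # placeholder for the real payload

fake_record = next_seh + handler + shellcode
print(fake_record[:8].hex())                 # eb06909034120110
```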

Information context goes beyond just code/data duality. Take for example string searches (a.k.a. keyword searches), which are commonly used in digital forensic examinations to generate new leads to investigate. Assume a suspect is accused of stealing classified documents. Searching the suspect’s media (hard disk, USB drive, digital camera, etc.) for the string “Classified” (amongst other things) could seem prudent. However, a hit on the word “Classified” does not necessarily suggest guilt, but it does warrant further investigation. For instance, the hit could come from a classified document, or it could come from a cached HTML page for “Classified Ads”. Again, the key is to understand the context the hit comes from.
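
This is one reason examiners look at the bytes around a hit, not just the hit itself. A minimal sketch of a context-preserving keyword search (the image name and window size are arbitrary):

```python
# Find every occurrence of a keyword in a raw image and print the
# surrounding bytes, so the context of each hit can be evaluated.
def search_with_context(image_path, keyword, window=32):
    with open(image_path, 'rb') as f:
        data = f.read()                      # fine for small images
    start = 0
    while True:
        hit = data.find(keyword, start)
        if hit == -1:
            break
        context = data[max(0, hit - window):hit + len(keyword) + window]
        print('offset 0x%X: %r' % (hit, context))
        start = hit + 1

search_with_context('evidence.dd', b'Classified')
```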

To summarize this post in a single sentence: Information context can be vital in understanding/analyzing what you’re looking at.