Well, the post “The basics of how digital forensics tools work” seemed to be fairly popular, even getting a place on Digg. This post focuses on the basics of how a program gets compiled, and how it gets loaded into memory when it is executed. It’s useful background for code analysis (reverse engineering), and is aimed at those who aren’t already familiar with compilers. The reasoning is that if you understand, at least from a high level, how compilers work, analyzing compiled programs gets easier. So if you’re already familiar with code analysis, this post isn’t really aimed at you. If, however, you’re new to reverse engineering (specifically code analysis), it is.
Compiling is program transformation
From an abstract perspective, compiling a program is really just transforming the program from one language into another. The “original” language that the program is written in is commonly called the source language (e.g. C, C++, Python, Java, Pascal, etc.). The program as it is written in the source language is called the source code. The “destination” language, the language that the program is translated into, is commonly called the target language. So compiling a program is essentially translating it from the source language to the target language.
The “typical” notion of compiling a program is transforming the program from a higher level language (e.g. C, C++, Visual Basic, etc.) to an executable file (PE, ELF, etc.). In this case the source language is the higher level language, and the target language is machine code (the binary encoding of assembly instructions). Realize, however, that going from source code to executable file is more than just transforming source code into machine code. When you run an executable file, the operating system needs to set up an environment (process space) for the code contained in the executable file to run in. For instance, the operating system needs to know what external libraries will be used, what parts of memory should be marked executable (i.e. can contain directly executable machine code), and where in memory to start executing code. Two additional types of programs, linkers and loaders, accomplish these tasks.
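As one concrete case of the “where in memory to start executing code” information, the ELF format records the starting address in a header field named e_entry. The following is a minimal sketch, assuming a 64-bit little-endian ELF file; the function name read_entry_point and the sample header values are made up for illustration, and the header is built by hand rather than taken from a real program.

```python
import struct

# Layout of the 64-byte ELF64 file header (little-endian):
# e_ident[16], e_type, e_machine, e_version, e_entry, e_phoff, e_shoff,
# e_flags, e_ehsize, e_phentsize, e_phnum, e_shentsize, e_shnum, e_shstrndx
ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")


def read_entry_point(header_bytes):
    """Return e_entry, the address where execution should begin."""
    fields = ELF64_HEADER.unpack(header_bytes[:ELF64_HEADER.size])
    ident = fields[0]
    if ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if ident[4] != 2 or ident[5] != 1:
        raise ValueError("this sketch only handles 64-bit little-endian ELF")
    return fields[4]  # e_entry


# Hand-built header for illustration only (assumed values, not a real program).
ident = b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8)
sample = ELF64_HEADER.pack(ident, 2, 62, 1, 0x401000, 64, 0, 0, 64, 56, 0, 64, 0, 0)

print(hex(read_entry_point(sample)))
```

To try it on a real file, read the first 64 bytes of the file in binary mode and pass them to the function. PE files carry the same kind of information, but in a different layout that this sketch does not cover.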
Compiler front and back ends
Typically a compiler is composed of two primary components, the front end and the back end. The front end takes in the source code, analyzes its structure (doing things such as checking for errors), and creates an intermediate representation of the source code that is suitable for the back end. The back end takes the output of the front end (the intermediate representation), optionally performs optimization, and translates the intermediate representation into the target language. When the target is machine code, it is common for a compiler’s back end to generate human readable assembly code (mnemonics) and then invoke an assembler to translate the assembly code into its binary machine code representation, which is what the processor executes.
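The front/back end split can be sketched in a few lines. This is a toy under stated assumptions, not how any production compiler is built: the “source language” is integer arithmetic, the intermediate representation is borrowed from Python’s own ast module, and the target is a hypothetical stack machine whose PUSH, ADD, SUB and MUL mnemonics are invented here.

```python
import ast


def front_end(source):
    """Front end: parse and check the source, return an intermediate representation."""
    tree = ast.parse(source, mode="eval")  # raises SyntaxError on malformed input
    return tree.body


# Mapping from IR operator nodes to mnemonics of the hypothetical stack machine.
OPCODES = {ast.Add: "ADD", ast.Sub: "SUB", ast.Mult: "MUL"}


def back_end(node):
    """Back end: translate the IR into stack machine mnemonics."""
    if isinstance(node, ast.Constant) and isinstance(node.value, int):
        return [f"PUSH {node.value}"]
    if isinstance(node, ast.BinOp) and type(node.op) in OPCODES:
        # Emit code for both operands first, then the operation itself.
        return back_end(node.left) + back_end(node.right) + [OPCODES[type(node.op)]]
    raise ValueError("construct not supported by this sketch")


def compile_expr(source):
    return back_end(front_end(source))


print(compile_expr("1 + 2 * 3"))
# ['PUSH 1', 'PUSH 2', 'PUSH 3', 'MUL', 'ADD']
```

Because the back end only ever sees the intermediate representation, a different back end (for another target) could be swapped in without touching the front end.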
Realize that it’s not an “absolute” requirement that a compiler be divided into front and back ends. It is certainly possible to create a compiler that translates directly from source code to target language. There are, however, benefits to the front/back end division, such as reuse (one front end can be paired with several back ends, and vice versa) and ease of development.
Linkers and Loaders
The compiler took our source code and translated it to executable machine instructions, but we don’t yet have an executable file, just one or more files that contain executable code and data. These files are typically called object code, and in many instances aren’t suitable to stand on their own. There are (at least) three high-level tasks that still need to be performed:
- We still need to resolve references to dependencies, such as variables and functions (possibly in external code libraries). This is called symbol resolution.
- We still need to arrange the object code into a single file, making sure separate pieces do not overlap, and adjusting code (as necessary) to reflect the new locations. This is called relocation.
- When we execute the program, we still need to set up an environment for it to run in, as well as load the code from disk into RAM. This is called program loading.
Conceptually, the line between linkers and loaders tends to blur near the middle. Linkers tend to focus on the first item (symbol resolution) while loaders tend to focus on the third item (program loading). The second item (relocation) can be handled by either a linker or loader, or even both. While linkers and loaders are often separate programs, there do exist single linker-loader programs which combine the functionality.
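Symbol resolution and relocation can be sketched with a toy linker. Everything here is an assumption made for illustration: the object file layout (a dict with "code", "defs" and "refs"), the one-word-per-address memory model, and the base address 0x1000 are all invented, and real object formats are far richer.

```python
def link(objects, base=0x1000):
    """Toy compile time linker: lay out objects, then patch symbol references."""
    image, symtab, pending = [], {}, []
    for obj in objects:
        start = base + len(image)  # relocation: each object gets its own range
        for name, offset in obj["defs"].items():
            if name in symtab:
                raise ValueError(f"duplicate symbol: {name}")
            symtab[name] = start + offset
        for offset, name in obj["refs"]:
            # Remember where in the combined image this reference lives.
            pending.append((len(image) + offset, name))
        image.extend(obj["code"])
    for index, name in pending:  # symbol resolution
        if name not in symtab:
            raise ValueError(f"undefined symbol: {name}")
        image[index] = symtab[name]
    return image, symtab


# Two hypothetical object files; None marks a slot waiting for an address.
main_o = {"code": ["CALL", None, "HALT"], "defs": {"main": 0}, "refs": [(1, "helper")]}
util_o = {"code": ["LOAD", None, "RET", 42], "defs": {"helper": 0, "X": 3}, "refs": [(1, "X")]}

image, symtab = link([main_o, util_o])
print({name: hex(address) for name, address in symtab.items()})
print(image)
```

The reference to X in util_o ends up holding 0x1006, the address that X was given once the two objects were placed one after the other.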
Linkers
The primary job of a linker is symbol resolution, that is, resolving references to entities in the code. For example, a linker might be responsible for replacing references to the variable X with a memory address. The output from a linker (at compile time) typically includes an executable file, a map of the different components in the executable file (which facilitates future linking and loading), and (optionally) debugging information. The different components that a linker generates don’t always have to be separate files; they could all be contained in different parts of the executable file.
Sometimes you’ll hear references to statically or dynamically linked programs. Both terms refer to how different pieces of object code are linked together at compile time (i.e. prior to the program loading). Statically linked programs contain all of the object code they need in the executable file. Dynamically linked programs don’t contain all of the object code they need; instead, they contain enough information that the needed object code (and symbols) can be found and made accessible at a later time. Since statically linked programs contain all the object code and information they need, they tend to be larger than their dynamically linked counterparts.
Another interesting aspect of linkers is that some work at compile time, while others perform symbol resolution during program loading, or even at run time. Linkers which perform their work at compile time are called compile time linkers, while linkers which perform their work at load and run time are called dynamic linkers. Don’t confuse dynamic linkers with dynamically linked executable files. The information in a dynamically linked executable is used by a dynamic linker. One is your code (dynamically linked executable), the other helps your code run properly (dynamic linker).
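The difference between a dynamically linked executable and a dynamic linker can be sketched as follows. This is a toy model with invented names (SHARED_LIBRARIES, DynamicLinker, libmath, square); it only illustrates that the executable carries names while the dynamic linker does the lookup, here lazily on first use.

```python
# Hypothetical shared libraries on the system: library name -> exported symbols.
SHARED_LIBRARIES = {"libmath": {"square": lambda n: n * n}}

# A dynamically linked executable records what it needs, not the code itself.
executable = {"imports": [("libmath", "square")]}


class DynamicLinker:
    """Resolves imported symbols when they are first needed."""

    def __init__(self, libraries):
        self.libraries = libraries
        self.bound = {}   # (library, symbol) -> resolved function
        self.lookups = 0  # how many times a real lookup was needed

    def resolve(self, library, symbol):
        key = (library, symbol)
        if key not in self.bound:
            exports = self.libraries.get(library)
            if exports is None or symbol not in exports:
                raise LookupError(f"cannot resolve {symbol} in {library}")
            self.bound[key] = exports[symbol]
            self.lookups += 1
        return self.bound[key]


linker = DynamicLinker(SHARED_LIBRARIES)
library, symbol = executable["imports"][0]
print(linker.resolve(library, symbol)(7))  # first use triggers resolution
print(linker.resolve(library, symbol)(8))  # already bound, no second lookup
print(linker.lookups)
```

A dynamic linker could just as well resolve every import up front at load time; the lazy variant is shown only because it makes the run time case visible.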
Loaders
Two primary functions are typically assigned to loaders: creating and configuring an environment for your program to execute in, and loading your program into that environment (which includes starting your program). During the loading process, the dynamic linker may come into play to help resolve symbols in dynamically linked programs.
When creating and configuring the environment, the loader needs information that was generated by the compile time linker, such as a map of which sections of memory should be marked as executable (i.e. can contain directly executable code), as well as where various pieces of code and data should reside in memory. When loading a dynamically linked program into memory, the loader also needs to load the libraries into the process environment. This loading can happen when the environment is first created and configured, or at run time.
In the process of creating and configuring the environment, the loader will transfer the code (and initialized data) from the executable file on disk into the environment. After the code and data have been transferred to memory (and any necessary modifications for load time relocation have been made), the code is started. In essence, the way the program is started is to tell the processor to execute code at the first instruction of the code (in memory). The address of the first instruction of code (in memory) is called the entry point, and is typically set by the compile time linker.
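The loading steps described above can be sketched with a toy loader. The executable layout (a dict with "entry" and "segments"), the three-instruction machine (LOAD, ADD, HALT) and the addresses are all assumptions invented for this example; a real loader works with the operating system’s memory manager rather than a Python dict.

```python
def load_and_run(executable):
    """Toy loader: build a process space, copy segments in, start at the entry point."""
    memory = {}
    executable_addresses = set()
    for segment in executable["segments"]:
        for i, word in enumerate(segment["content"]):
            address = segment["address"] + i
            memory[address] = word  # transfer code and data from "disk" to memory
            if segment["executable"]:
                executable_addresses.add(address)

    pc = executable["entry"]  # begin at the entry point set by the linker
    acc = 0
    while True:
        if pc not in executable_addresses:
            raise RuntimeError(f"address {pc:#x} is not marked executable")
        op, arg = memory[pc]
        if op == "LOAD":
            acc = memory[arg]
        elif op == "ADD":
            acc += memory[arg]
        elif op == "HALT":
            return acc
        else:
            raise RuntimeError(f"unknown instruction {op!r}")
        pc += 1


program = {
    "entry": 0x1000,
    "segments": [
        {"address": 0x1000, "executable": True,
         "content": [("LOAD", 0x2000), ("ADD", 0x2001), ("HALT", None)]},
        {"address": 0x2000, "executable": False, "content": [40, 2]},
    ],
}

print(load_and_run(program))  # 42
```

Pointing "entry" at the data segment (0x2000) makes the sketch raise an error, which mirrors why the loader needs to know which sections of memory are marked executable.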
Further reading
Well, that about wraps it up for this introduction. There are some good books out there on compilers, linkers, and loaders. The classic book on compilers is Compilers: Principles, Techniques, and Tools. I have the first edition, and it is a bit heavy on the computer science theory (although I just ordered the second edition, which was published in August 2006). The classic book on linkers and loaders is Linkers and Loaders. It too is a bit more abstract, but considerably lighter on the theory. If you’re examining code in a Windows environment, I’d also recommend reading (at least parts of) Microsoft Windows Internals, Fourth Edition.