Compiling, disassembling and re-assembling .NET binaries
Volume Number: 19 (2003)
Issue Number: 12
Column Tag: Programming
Casting your .NET
by Andrew Troelsen
Exploring .NET development on Mac OS X
Diving Deeper into the .NET Assembly Format
The previous issue got you up and running with the SSCLI, and introduced you to the basics of the C# compiler (csc) and CLI execution launcher (clix) utilities. In this installment, you will not only learn some additional options of csc, but also examine two key development tools (ildasm and ilasm) of which any .NET enthusiast must be aware. Along the way, you will gain a much deeper understanding of how a .NET assembly is composed under the hood.
A Review of Assembly Basics
During the course of these first few installments, you have come to learn that the .NET platform supports a high degree of binary reuse, which is made possible using a unit of deployment termed an assembly. Recall that .NET assemblies (which may take an *.exe, *.dll or *.netmodule file extension) are composed of three key elements: CIL code, type metadata and the assembly manifest:
- Common Intermediate Language (CIL): The true language of the .NET platform. All .NET-aware compilers transform their respective tokens into the syntax of CIL. As noted earlier, CIL code is conceptually similar to Java bytecode in that CIL is platform agnostic. Unlike Java bytecode however, CIL is compiled (not interpreted) on demand by the .NET runtime.
- Type Metadata: Used to fully describe each and every aspect of a .NET type (class, interface, structure, enum or delegate) referenced by, or defined within, a given assembly. As you get to know the .NET platform, you will quickly discover that type metadata is the glue for every major technology.
- Manifest: Metadata that describes the assembly itself (version, required external assemblies, copyright information and so forth). As described at a later time, the assembly manifest is a key aspect of the .NET versioning policy.
The good news is that you will never have to author a lick of CIL code, type metadata or manifest information by hand, as it is the job of a .NET compiler to translate your source code into the correct binary format. Given this, you will never need to describe your types manually (as is the case in CORBA IDL) or build C-based wrapper classes to communicate with an underlying protocol. While the .NET platform supports multiple languages and corresponding compilers, few would dispute that C# is the language of choice used to build .NET assemblies.
In the last issue, you created a .NET code library (MyLib.dll) and a corresponding .NET client executable (MyClient.exe). While you are free to reuse that source code over the course of this issue, Listings 1 and 2 offer a few simple types to manipulate over the pages that follow.
Listing 1: simpleCodeLibrary.cs
// A simple code library containing one class type.
using System;
namespace ShamerLib
{
    public class Shamer
    {
        public static void ShameChild(string kidName,
            int intensity)
        {
            for(int i = 0; i < intensity; i++)
                Console.WriteLine("Be quiet {0}!!", kidName);
        }
    }
}
Listing 2: simpleClientApplication.cs
// Our client application makes use
// of shamer.dll (seen in Listing 1).
using System;
using ShamerLib;
namespace ShamerClient
{
    public class TheApp
    {
        public static void Main(string[] args)
        {
            string kidName;
            int frustrationLevel;
            // All .NET arrays derive from System.Array.
            // This class type has a member named Length.
            if(args.Length != 0)
            {
                kidName = args[0];
                frustrationLevel = int.Parse(args[1]);
            }
            else // No command line args, prompt user.
            {
                Console.Write("Please enter child's name: ");
                kidName = Console.ReadLine();
                Console.Write("Please enter annoyance level: ");
                frustrationLevel = int.Parse(Console.ReadLine());
            }
            // Pass user input to Shamer type.
            Shamer.ShameChild(kidName, frustrationLevel);
            Console.WriteLine("Thanks for playing...");
        }
    }
}
The only real point of interest in Listing 2 is the fact that we have provided the ability to process command line arguments via the incoming array of strings. If the caller does supply such arguments, we will use them to assign our local variables (kidName and frustrationLevel). On the other hand, if the user does not supply command line arguments, s/he will be prompted at the Terminal. In either case, once the local variables have been assigned, they are passed into the static ShamerLib.Shamer.ShameChild() method located in the shamer.dll assembly.
Build Details of the C# Compiler
Now that we have some C# source code at our disposal, let's examine select details of the SSCLI C# compiler, csc. The C# compiler (csc) is a command line tool which supports numerous arguments, the full set of which can be viewed by typing the command shown in Listing 3 from an SSCLI-enabled Terminal (see the previous issue for details regarding sourcing the SSCLI environment variables):
Listing 3. Viewing each option of csc.
csc -help
Now, assume you wish to compile simpleCodeLibrary.cs into a single file assembly named shamer.dll. To do so, you need to be aware of the following core options of csc.exe:
- /target: Allows you to specify the format of the output file (code library, Terminal application and so on). May be shortened to /t:.
- /out: Allows you to specify the name of the output file. If omitted, the name of the output file is based on the name of the source file which contains the Main() method, or in the case of a class library, the name of the initial input file.
- /reference: Allows you to specify any external assemblies to reference during the current compilation (may be shortened to /r:).
At its core, csc processes C# source code files (which by convention take a *.cs file extension) to produce a binary file termed an assembly. Under the SSCLI, csc provides three valid flags that control the target output (Table 1).
Option of /target flag   Short Form    Meaning in Life
/target:exe              /t:exe        Generates an executable Terminal assembly with the
                                       *.exe file extension.
/target:library          /t:library    Generates a binary code library (*.dll) that may not
                                       be directly executed by the .NET runtime. Rather,
                                       *.dll binaries are loaded by an executing *.exe.
/target:module           /t:module     Generates a *.netmodule file, the building block of a
                                       'multi-file' assembly. In a nutshell, multi-file
                                       assemblies are a collection of code library files that
                                       are intended to be versioned as a single logical unit.
                                       This particular aspect of .NET will be examined at a
                                       later time.
Table 1. Variations of the /target flag
In addition to the /t:exe, /t:library and /t:module flags, the csc compiler of the SSCLI also supplies the /target:winexe flag, which effectively does nothing under the Macintosh operating system (under GUI-aware implementations of the C# compiler, /t:winexe is used to build a GUI-based Windows Forms application).
Of course you will also need to specify the set of input files to be processed. These can be listed individually as discrete arguments, or via the wildcard syntax (e.g., csc *.cs) to instruct csc.exe to compile all C# files in the current directory. However, to build shamer.dll, the following command set will do nicely (Listing 4):
Listing 4. Compiling the Shamer.dll code library.
csc /t:library /out:shamer.dll simpleCodeLibrary.cs
To compile simpleClientApplication.cs into a single file executable assembly that makes use of the types contained within shamer.dll, enter the command shown in Listing 5 (be sure that shamer.dll is in the same location as the input *.cs file):
Listing 5. Compiling the client.exe executable.
csc /t:exe /out:client.exe /r:shamer.dll simpleClientApplication.cs
At this point you are able to run the resulting client.exe application (either with or without command line arguments) using the clix utility (Listing 6).
Listing 6. Executing your .NET example application.
clix client.exe Sally 9
Here we are passing two arguments to the client.exe application. Recall that we have programmed Main() to parse these values out of the incoming array of strings, and therefore the phrase "Be quiet Sally!!" will print out nine times. If you run client.exe without specifying any command line arguments, you will be prompted at the Terminal for further input. Do note that this version of Main() is not smart enough to receive these arguments out of order (just to keep focused on the tasks at hand). Given this, if you specify the child's name after the intensity level, you will receive a runtime exception, as int.Parse() cannot convert a name such as "Sally" into an integer.
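If you would like Main() to be a bit more forgiving, one possible approach is sketched below. Be aware that the ArgReader class and its TryRead() method are hypothetical names invented for this illustration (they are not part of Listing 2); the sketch simply tries the documented order first and falls back to the swapped order when only the first argument is numeric.

```csharp
using System;

// Hypothetical helper (not part of the original listings) that accepts
// the child's name and the intensity level in either order.
public class ArgReader
{
    // Tries the documented order first (name, then level). If the second
    // argument is not an integer but the first one is, the two are swapped.
    // Returns false when fewer than two arguments are supplied or when
    // neither argument is an integer.
    public static bool TryRead(string[] args, out string kidName, out int intensity)
    {
        kidName = null;
        intensity = 0;
        if (args == null || args.Length < 2)
            return false;

        int value;
        if (IsInt(args[1], out value))
        {
            kidName = args[0];
            intensity = value;
            return true;
        }
        if (IsInt(args[0], out value))
        {
            kidName = args[1];
            intensity = value;
            return true;
        }
        return false;
    }

    // int.Parse() throws FormatException for text such as "Sally" and
    // OverflowException for values too large for an int.
    private static bool IsInt(string text, out int value)
    {
        value = 0;
        try
        {
            value = int.Parse(text);
            return true;
        }
        catch (FormatException) { return false; }
        catch (OverflowException) { return false; }
    }

    public static void Main(string[] args)
    {
        string kidName;
        int intensity;
        if (TryRead(args, out kidName, out intensity))
            Console.WriteLine("Name: {0}, intensity: {1}", kidName, intensity);
        else
            Console.WriteLine("Usage: <name> <level> (in either order)");
    }
}
```

With this in place, both "Sally 9" and "9 Sally" would yield the same name and intensity values.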
Working with C# Response Files
When you work with the C# command line compiler, you are likely to issue the same options during each compilation cycle. For example, assume you have a directory which contains ten *.cs files, three of which you wish to compile into a single file .NET class library named MyLib.dll. Furthermore, assume you need to reference various external assemblies for the build (Listing 7):
Listing 7. A non-trivial csc command set.
csc /out:MyLib.dll /t:library /r:System.Xml.dll
/r:MyCoolLib.dll /r:MyOtherLib.dll firstClass.cs
otherClass.cs thirdClass.cs
Clearly, this can become a pain. To help make your Terminal compilations more enjoyable, the C# compiler supports the use of 'response files'. These text files (which by convention take an *.rsp file extension) contain all of the options you wish to pass into csc. Assume you have created a response file (Listing 8) named compilerArgs.rsp (note that comments are denoted with the '#' symbol):
Listing 8. CompilerArgs.rsp.
# compilerArgs.rsp.
# Name of output file.
/out:MyLib.dll
# Type of output file.
/t:library
# External assemblies to reference.
/r:System.Xml.dll
/r:MyCoolLib.dll
/r:MyOtherLib.dll
# Input files.
firstClass.cs otherClass.cs thirdClass.cs
Given this, you can now simply pass in the compilerArgs.rsp file as the sole argument to csc.exe via the '@' flag (Listing 9):
Listing 9. Specifying a response file at the command line.
csc @compilerArgs.rsp
If you wish, you may feed in multiple response files (Listing 10):
Listing 10. csc may be fed in multiple response files.
csc @compilerArgs.rsp @moreArgs.rsp @evenMoreArgs.rsp
Keep in mind, however, that response files are processed in the order they are encountered; therefore settings in a previous file can be overridden by settings in a later file. Finally, be aware that there is a 'default' response file, csc.rsp, which is automatically processed by csc.exe during each compilation. If you examine the contents of this file (which is located by default under /sscli/build/v1.ppcfstchk.rotor) you will find little more than a set of common assembly references (such as System.dll, System.Xml.dll and so on). On the rare occasion that you wish to disable the inclusion of csc.rsp you may specify the /noconfig flag.
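To make the 'later file wins' idea concrete, here is a minimal sketch of how a tool could read response file text and resolve a repeated option. This is an illustration only and makes no claim about how csc is implemented internally: the ResponseFileSketch class, its method names and the moreArgs.rsp contents are all hypothetical, and the sketch assumes that options never contain quoted spaces.

```csharp
using System;
using System.Collections;

// Hypothetical illustration of response file handling (not csc's own code).
public class ResponseFileSketch
{
    // Splits response file text into individual options, skipping blank
    // lines and lines that begin with the '#' comment symbol.
    public static ArrayList ParseResponseText(string text)
    {
        ArrayList options = new ArrayList();
        foreach (string rawLine in text.Split('\n'))
        {
            string line = rawLine.Trim();
            if (line.Length == 0 || line[0] == '#')
                continue;
            // Assumption: options are separated by spaces or tabs.
            foreach (string token in line.Split(' ', '\t'))
            {
                if (token.Length != 0)
                    options.Add(token);
            }
        }
        return options;
    }

    // Returns the value of the LAST option that starts with the given
    // prefix, mimicking 'a later setting overrides an earlier one'.
    public static string LastValueOf(ArrayList options, string prefix)
    {
        string result = null;
        foreach (string option in options)
        {
            if (option.StartsWith(prefix))
                result = option.Substring(prefix.Length);
        }
        return result;
    }

    public static void Main()
    {
        string first = "# compilerArgs.rsp\n/out:MyLib.dll\n/t:library\n"
            + "firstClass.cs otherClass.cs thirdClass.cs\n";
        string second = "# moreArgs.rsp (made-up contents)\n/out:Renamed.dll\n";

        // Files are processed in the order they are encountered.
        ArrayList options = ParseResponseText(first);
        options.AddRange(ParseResponseText(second));

        // Prints "Output file: Renamed.dll".
        Console.WriteLine("Output file: " + LastValueOf(options, "/out:"));
    }
}
```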
Of course, the C# compiler defines many other options beyond the set we have just examined. Over the life of the series you'll see additional options at work; in the meantime, check out /sscli/docs/compilers/csharp.html for all the gory details of the csc utility which ships with the SSCLI.
The Role of the CIL Disassembler (ildasm)
At this point you should be confident in your ability to build and debug .NET assemblies using csc (of course, C# itself may still be a bit of a mystery, but we'll deal with that soon enough). The next topic to address is the process of getting under the hood of a binary .NET blob and checking out the key ingredients (CIL, type metadata and manifest information). The SSCLI ships with numerous programmer tools to do this very thing, the first of which is the Intermediate Language Disassembler (ildasm). Table 2 lists some of the key options of this (very insightful) utility (see /sscli/docs/tools/ildasm.html for full details).
Option of ildasm      Meaning in Life
/classlist            Displays a list of all types within a given .NET assembly.
/item                 This option allows you to view the disassembly of a specific
                      type or type member, rather than the entire set of type members.
/metadata             Use this flag to view the type metadata within the .NET assembly.
/output:<filename>    Instructs ildasm to dump the contents to a specified text file,
                      rather than to the Terminal.
/visibility           This flag instructs ildasm to display types of a certain
                      'visibility' level. In a nutshell, C# supports the creation of
                      public types (which may be used by other assemblies) and internal
                      types (which may only be used by the defining assembly). The CIL
                      programming language defines a number of additional visibility
                      modifiers.
Table 2. Some (but not all) options of ildasm.
To take this tool out for a test drive, let's view the internal structure of the entire client.exe assembly (Listing 11).
Listing 11. Disassembling client.exe via ildasm
ildasm client.exe
Viewing the Assembly Manifest
The first point of interest is the manifest data, which as you recall is metadata that defines the assembly itself (Listing 12).
Listing 12. The manifest data of client.exe
.assembly extern mscorlib
{
.publickeytoken = (B7 7A 5C 56 19 34 E0 89 )
.ver 1:0:3300:0
}
.assembly extern shamer
{
.ver 0:0:0:0
}
.assembly client
{
...
.hash algorithm 0x00008004
.ver 0:0:0:0
}
.module client.exe
.imagebase 0x00400000
.subsystem 0x00000003
.file alignment 512
.corflags 0x00000001
The critical point here is the use of the .assembly extern directives. These CIL tokens are used to document the set of external assemblies the current assembly must have to function correctly. Notice that client.exe makes use of the standard mscorlib.dll assembly as well as (surprise, surprise) shamer.dll. Next, notice that the .assembly directive (without the extern attribute) is used to describe basic characteristics of the current assembly (client.exe in this case) such as its version (.ver), while the .module directive records the name of the binary file itself. As mentioned, the .NET runtime reads this information during the process of locating and loading external assemblies for use (exactly how is the topic of a later article).
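The same manifest information can also be read from code. The following is a minimal sketch (the ManifestPeek class and its Describe() helper are names made up for this example) which uses the System.Reflection namespace to print the name and version of the running assembly, followed by each external assembly recorded in its manifest.

```csharp
using System;
using System.Reflection;

// Hypothetical example: prints the manifest-level identity of the current
// assembly and of each external assembly it references.
public class ManifestPeek
{
    // Formats an assembly identity as "name version".
    public static string Describe(AssemblyName name)
    {
        return name.Name + " " + name.Version;
    }

    public static void Main()
    {
        Assembly current = Assembly.GetExecutingAssembly();
        Console.WriteLine("This assembly: " + Describe(current.GetName()));

        // Each entry corresponds to an '.assembly extern' block in the manifest.
        foreach (AssemblyName reference in current.GetReferencedAssemblies())
            Console.WriteLine("  requires: " + Describe(reference));
    }
}
```

The exact names and version numbers printed depend on the runtime and on the assemblies you referenced at compile time.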
Viewing the CIL behind Main()
Next, let's examine the CIL code produced by csc for the client's static Main() method (Listing 13). Be aware that each of the IL_XXXX: tokens is a code label inserted by ildasm.
Listing 13. The CIL of Main().
.method public hidebysig static void
Main(string[] args) cil managed
{
.entrypoint
.maxstack 2
.locals init (string V_0, int32 V_1)
IL_0000: ldarg.0
IL_0001: ldlen
IL_0002: conv.i4
IL_0003: brfalse.s IL_0014
IL_0005: ldarg.0
IL_0006: ldc.i4.0
IL_0007: ldelem.ref
IL_0008: stloc.0
IL_0009: ldarg.0
IL_000a: ldc.i4.1
IL_000b: ldelem.ref
IL_000c: call int32 [mscorlib]
System.Int32::Parse(string)
IL_0011: stloc.1
IL_0012: br.s IL_0039
IL_0014: ldstr "Please enter child's name: "
IL_0019: call void [mscorlib]
System.Console::Write(string)
IL_001e: call string [mscorlib]
System.Console::ReadLine()
IL_0023: stloc.0
IL_0024: ldstr "Please enter annoyance level: "
IL_0029: call void [mscorlib]
System.Console::Write(string)
IL_002e: call string [mscorlib]
System.Console::ReadLine()
IL_0033: call int32 [mscorlib]
System.Int32::Parse(string)
IL_0038: stloc.1
IL_0039: ldloc.0
IL_003a: ldloc.1
IL_003b: call void [shamer]
ShamerLib.Shamer::ShameChild(string, int32)
IL_0040: ldstr "Thanks for playing..."
IL_0045: call void [mscorlib]
System.Console::WriteLine(string)
IL_004a: ret
} // end of method TheApp::Main
Now, before your eyes pop out of your head, let me reiterate that this is not the place to dive into all the details of the syntax of CIL. However, here is a brief explanation of the highlights.
First of all, CIL is an entirely stack-based language: values are pushed onto the stack using various CIL operational codes (opcodes) and popped off the stack using others. Notice that the Main() method is adorned with the .entrypoint directive, which as you may guess is how the .NET runtime is able to identify the entry point to the executable.
Next, note that the .maxstack directive marks the maximum number of values that may sit on the evaluation stack at any one time during the execution of this method. Given that our Main() method never needs more than two values on the stack at once (the two arguments pushed before the call to ShameChild(), for example), it should be no surprise that the value assigned to .maxstack is 2.
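If you want to convince yourself of the .maxstack value, you can tally the net stack effect of each instruction in Listing 13 and track the high-water mark. The sketch below does exactly that; the MaxStackSketch class is a hypothetical name and the per-instruction deltas were worked out by hand from the listing. Note the simplifying assumption: a straight top-to-bottom scan is only valid here because the stack happens to be empty at every branch and branch target in this method.

```csharp
using System;

// Hypothetical illustration: derives a stack high-water mark from a list
// of per-instruction net stack effects (+1 = one push, -1 = one pop).
public class MaxStackSketch
{
    public static int MaxDepth(int[] stackDeltas)
    {
        int depth = 0;
        int max = 0;
        foreach (int delta in stackDeltas)
        {
            depth += delta;
            if (depth > max)
                max = depth;
        }
        return max;
    }

    public static void Main()
    {
        // Net effect of each instruction of Main() in Listing 13, in order.
        int[] mainDeltas = {
            +1,  0,  0, -1,      // ldarg.0, ldlen, conv.i4, brfalse.s
            +1, +1, -1, -1,      // ldarg.0, ldc.i4.0, ldelem.ref, stloc.0
            +1, +1, -1,  0, -1,  // ldarg.0, ldc.i4.1, ldelem.ref, call Parse, stloc.1
             0,                  // br.s
            +1, -1, +1, -1,      // ldstr, call Write, call ReadLine, stloc.0
            +1, -1, +1,  0, -1,  // ldstr, call Write, call ReadLine, call Parse, stloc.1
            +1, +1, -2,          // ldloc.0, ldloc.1, call ShameChild
            +1, -1,  0           // ldstr, call WriteLine, ret
        };
        Console.WriteLine(MaxDepth(mainDeltas)); // prints 2
    }
}
```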
Now, notice the syntax used to invoke a type method (Listing 14).
Listing 14. Invoking Console.Write() and Shamer.ShameChild().
IL_0019: call void [mscorlib]
System.Console::Write(string)
...
IL_003b: call void [shamer]
ShamerLib.Shamer::ShameChild(string, int32)
As you can see, the call opcode marks the act of invoking a method. However, notice that the friendly name of the assembly (in square brackets) is infixed between the return type and the fully qualified method name. Given this, we can boil down a CIL method invocation to the simple template shown in Listing 15:
Listing 15. The CIL of method invocations.
ReturnType[NameOfAssembly]
Namespace.Type::MethodName(anyArguments)
Again, don't sweat the details of CIL at this point in the game (in fact, you can live the life of a happy and healthy C# developer without thinking about CIL whatsoever). Simply understand that the ildasm tool allows you to view the raw CIL syntax emitted by a given .NET compiler.
Viewing the Type Metadata
Recall that an assembly is composed of three key elements (CIL, type metadata and manifest information). If you wish to view the type metadata within a given assembly, you must specify the /metadata option (Listing 16).
Listing 16. Listing type metadata with ildasm
ildasm /metadata client.exe
Once you execute this command, you will find that the initial output of ildasm is indeed a listing of type metadata. Simply put, .NET metadata can be lumped into two categories: TypeDefs (types you defined yourself) and TypeRefs (types within an external assembly you referenced). Both of these categories will list in vivid detail the composition of each item (e.g. base class, number of methods, method parameters, etc). For example, Listing 17 shows the metadata for the ShamerClient.TheApp class type defined within client.exe (again, don't sweat the details).
Listing 17. Type metadata for TheApp
// TypeDef #1
// ----------------------------------------------
// TypDefName: ShamerClient.TheApp (02000002)
// Flags : [Public] [AutoLayout]
// [Class] [AnsiClass] (00100001)
// Extends : 01000001 [TypeRef] System.Object
// Method #1 [ENTRYPOINT]
// ----------------------------------------------
// MethodName: Main (06000001)
// Flags : [Public] [Static]
// [HideBySig] [ReuseSlot] (00000096)
// RVA : 0x00002050
// ImplFlags : [IL] [Managed] (00000000)
// CallCnvntn: [DEFAULT]
// ReturnType: Void
// 1 Arguments
// Argument #1: SZArray String
// 1 Parameters
// (1) ParamToken : (08000001) Name :
// args flags: [none] (00000000)
//
// Method #2
// ----------------------------------------------
// MethodName: .ctor (06000002)
// Flags : [Public] [HideBySig] [ReuseSlot]
// [SpecialName] [RTSpecialName] [.ctor] (00001886)
// RVA : 0x000020a8
// ImplFlags : [IL] [Managed] (00000000)
// CallCnvntn: [DEFAULT]
// hasThis
// ReturnType: Void
// No arguments.
Now, you may be wondering what use type metadata serves in the .NET platform. To be honest, just about everything in .NET revolves around metadata in one form or another. Object serialization, XML Web services, .NET remoting, late binding, dynamic type creation and heap allocations all demand full type descriptions. Later in this series, you will learn how to leverage this information programmatically using the friendly object model provided by the System.Reflection namespace (can anyone say custom object browser?)
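As a small taste of what is to come, the following sketch uses System.Reflection to walk the types defined in the running assembly and print each method's signature, much like the TypeDef section of the ildasm output. The TinyBrowser class and its DescribeMethod() helper are hypothetical names chosen for this example; treat it as a starting point rather than a finished object browser.

```csharp
using System;
using System.Reflection;

// Hypothetical 'object browser' sketch built on type metadata.
public class TinyBrowser
{
    // Builds a readable signature such as "Void Main(String[] args)".
    public static string DescribeMethod(MethodInfo method)
    {
        string text = method.ReturnType.Name + " " + method.Name + "(";
        ParameterInfo[] parameters = method.GetParameters();
        for (int i = 0; i < parameters.Length; i++)
        {
            if (i > 0)
                text += ", ";
            text += parameters[i].ParameterType.Name + " " + parameters[i].Name;
        }
        return text + ")";
    }

    public static void Main()
    {
        // Only members declared by the type itself (not inherited ones).
        BindingFlags flags = BindingFlags.Public | BindingFlags.Static
            | BindingFlags.Instance | BindingFlags.DeclaredOnly;

        foreach (Type type in Assembly.GetExecutingAssembly().GetTypes())
        {
            Console.WriteLine(type.FullName + " : " + type.BaseType);
            foreach (MethodInfo method in type.GetMethods(flags))
                Console.WriteLine("  " + DescribeMethod(method));
        }
    }
}
```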
The MetaInfo Utility
On a metadata-related note, the SSCLI supplies an additional tool (metainfo) which is used exclusively to view type metadata (see /sscli/docs/tools/metainfo.html for full details). For example, if you wish to see the TypeDefs and TypeRefs within shamer.dll (but don't care to see the CIL code or manifest data), you could enter the following command (Listing 18):
Listing 18. Working with the metainfo utility.
metainfo shamer.dll
The Role of the CIL Assembler (ilasm)
Now that you can disassemble a .NET assembly using ildasm and metainfo, it is worth pointing out that the SSCLI ships with a CIL assembler utility named (not surprisingly) ilasm. Although it is not terribly likely, it is entirely possible to build a complete .NET application using raw CIL code and bypass higher level languages such as C#, VB.NET and so forth (remember, as far as the .NET runtime is concerned, it's all CIL). As suggested by Table 3, working with ilasm is quite straightforward.
Options of ilasm    Meaning in Life
/clock              Tells ilasm to display compilation diagnostics for the current build.
/dll or /exe        Builds a code library or executable assembly (respectively).
/output             Specifies the name of the output file.
Table 3. Some (but not all) options of ilasm.
As a fellow programmer, I'm sure you'd love to build a .NET assembly using CIL and the CIL assembler at least once (just to say you did it). Again, building anything but a trivial CIL source code file would require a solid understanding of the syntax and semantics of the Common Intermediate Language, however if you are up for the task, create a brand new source code file named simpleCILCode.il (by convention, CIL code files take an *.il extension). Within your new file, define the following .NET type (Listing 19):
Listing 19. An example using raw CIL
// mscorlib.dll is automatically
// listed in the manifest by ilasm,
// so we don't need to bother specifying
// this external assembly.
// Now define our assembly.
// If unspecified, the version
// of an assembly is automatically
// 0.0.0.0.
.assembly SimpleCILCode{}
.module SimpleCILCode.dll
// Our only class type: MyCILExample.MyCILApp
.namespace MyCILExample
{
    .class public auto ansi beforefieldinit MyCILApp
        extends [mscorlib]System.Object
    {
        // The single method, Speak().
        .method public hidebysig static void
            Speak() cil managed
        {
            .maxstack 1
            ldstr "Yo!!"
            call void [mscorlib]
                System.Console::WriteLine(string)
            ret
        }
    }
}
Basically, this CIL code file defines a single namespace (MyCILExample) which contains a single class (MyCILApp) which supports a single method named Speak(). The implementation of Speak() loads a string literal onto the stack (via the ldstr opcode) which is used for the invocation of System.Console.WriteLine(). The ret opcode, obviously, returns from the method. Now, to compile this CIL source code file into a binary *.dll, supply the following command set to ilasm (Listing 20):
Listing 20. Compiling *.il files using ilasm
ilasm /output:Simple.dll /dll simpleCILCode.il
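For comparison, the CIL of Listing 19 corresponds roughly to the C# class library shown below. This is a sketch for orientation only: a C# compiler would also emit a default constructor for MyCILApp, which the hand-written CIL omits, so the two binaries would not be identical.

```csharp
using System;

// Approximate C# counterpart of the hand-written CIL in Listing 19.
// Compile as a code library (/t:library), as there is no Main() method.
namespace MyCILExample
{
    public class MyCILApp
    {
        // Maps to: ldstr "Yo!!" / call Console::WriteLine(string) / ret
        public static void Speak()
        {
            Console.WriteLine("Yo!!");
        }
    }
}
```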
At this point, you can make use of Simple.dll just as you would any other .NET code library. To prove the point, let's update our existing simpleClientApplication.cs file to invoke the Speak() method (Listing 21).
Listing 21. Our updated client application
using System;
using ShamerLib;
// Need this!
using MyCILExample;
namespace ShamerClient
{
    public class TheApp
    {
        public static void Main(string[] args)
        {
            ...
            MyCILApp.Speak();
        }
    }
}
Now recompile client.exe while referencing simple.dll (Listing 22).
Listing 22. Recompiling client.exe
csc /r:simple.dll /r:shamer.dll /t:exe /out:client.exe simpleClientApplication.cs
Sure enough, if you run the updated client.exe through ildasm, you will find the CIL-built library listed in the assembly manifest as an additional .assembly extern entry. Likewise, if you run the updated application, you will find that the message ("Yo!!") is emitted to the Terminal.
Round Trippin' (Assembly to CIL, CIL to Assembly)
Given the functionality of ildasm and ilasm, the SSCLI (as well as other .NET platform distributions) intrinsically supports the notion of a 'round trip'. Simply put, this software idiom is used to describe the process of compiling -> decompiling -> editing -> recompiling a software blob into a new modified unit. As you would guess, this can prove extremely helpful when you need to modify the contents of a .NET assembly for which you do not have the original source code files. To solidify the information presented in this issue, try the following round-trip exercise:
- Disassemble your client application using ildasm, outputting the CIL to a new file named simpleClientApplication.il (don't forget the /output flag of ildasm).
- Using your text editor of choice, modify each string literal (e.g. "Please enter child's name") to a new string ("Yo dude! Enter the name of the kid!").
- Save your changes and recompile the *.il file into a new executable assembly named ModifiedApp.exe using ilasm.
- Run your ModifiedApp.exe assembly using clix.
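If you would rather script the editing step of the exercise than use a text editor, a small C# program can do the job. The sketch below is one possible approach under a few assumptions: the LiteralPatcher class is a hypothetical name, the literal is assumed to appear in the *.il file as plain quoted text, and the input and output file names are whatever you pass on the command line.

```csharp
using System;
using System.IO;

// Hypothetical helper for the 'edit' step of the round trip: swaps one
// quoted string literal in a CIL source file for another.
public class LiteralPatcher
{
    // Replaces every occurrence of "oldLiteral" (including the quotes)
    // so that unquoted text which merely resembles the literal is untouched.
    public static string Patch(string ilText, string oldLiteral, string newLiteral)
    {
        return ilText.Replace("\"" + oldLiteral + "\"", "\"" + newLiteral + "\"");
    }

    public static void Main(string[] args)
    {
        if (args.Length != 2)
        {
            Console.WriteLine("Usage: <input.il> <output.il>");
            return;
        }

        // Read the disassembled CIL produced by ildasm.
        StreamReader reader = new StreamReader(args[0]);
        string ilText = reader.ReadToEnd();
        reader.Close();

        ilText = Patch(ilText, "Please enter child's name: ",
            "Yo dude! Enter the name of the kid!");

        // Write the modified CIL, ready to be fed to ilasm.
        StreamWriter writer = new StreamWriter(args[1]);
        writer.Write(ilText);
        writer.Close();
    }
}
```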
A Brief Note Regarding Obfuscation
Now obviously, if a .NET assembly can be so easily disassembled, modified and recompiled, you are no doubt already imagining numerous doomsday scenarios (your proprietary 'bubble sort' algorithm has been modified to wipe a user's hard drive of all data) and copyright infringements ("But that is our CIL code, you can't change it!"). Yes, it is true: given that a .NET binary can always be viewed in terms of its CIL code using tools such as ildasm, it is possible that prying eyes could take your intellectual property as a basis for their own and build a software monster. This problem is not unique to the .NET platform however, as numerous Java, C(++) and BASIC decompilers have existed for years.
However, if you wish to lessen the chances of bad people using your resulting CIL code for evil purposes, rest assured that numerous .NET obfuscators exist. As you may know, the basic role of an obfuscator is to apply a set of algorithms that translate valid syntax (in this case CIL code) into what reads as total nonsense, while leaving its behavior intact. As well, the .NET platform supports the notion of a 'strong name' which can be used to digitally sign (in effect) your assemblies using public / private key cryptography. By doing so, it is next to impossible for an evil individual to modify a binary assembly and pretend to 'be you'. More on strong names at a later time.
Wrap Up
Sweet! So here ends another installment of Casting your .NET where you spent your time digging deeper into the format of a .NET assembly. We began by revisiting the key elements of a .NET binary (CIL, metadata and manifest information) and learned about some additional functionality provided by csc. Next up, you came to understand the role of the ildasm utility and learned how this tool allows .NET developers to peek inside the assembly itself to view the underlying goo. Finally, you took ilasm out for a test drive and performed a simple 'round-trip'.
In the next issue, you will finalize your initial look at the SSCLI and examine a number of interesting sample applications, alternative .NET programming languages and a few additional development utilities. After this point, the next several installments will dive headlong into the details of the C# programming language. Until then, as always, happy hacking.
Andrew Troelsen is a seasoned .NET developer who has authored numerous books on the topic, including the award winning C# and the .NET Platform. He is employed as a full-time .NET trainer and consultant for Intertech Training (www.intertechtraining.com), and is a well-known Timberwolves, Wild and Vikings rube (not necessarily in that order). You can contact Andrew at atroelsen@mac.com.