CSI: Crash Scene Investigation

Volume Number: 25
Issue Number: 12
Column Tag: Debugging

Examining crashes to catch the culprit

by David Garcea

The 911 Call

You've written an application. Great! However, every application will crash eventually. Every crash is a mystery waiting to be solved, and the deduction and experimentation required to solve it can make you feel like Gil Grissom. First, though, you must know that it happened. When an application crashes, the Crash Reporter will gather information about it and send it to Apple. This works great for Apple's products, but it doesn't help third-party developers. You will have to do some coding to redirect this information to you. There are several packages that you can use to accomplish this, or you can do it all yourself. Here are a few of the free, ready-made options:

Smart Crash Reports by Unsanity is an InputManager-based enhancement for Apple's Crash Reporter application, which causes the crash report to be posted to a CGI on either your web server, or on Unsanity's server, as well as sending it to Apple. http://www.smartcrashreports.com
HDCrashReporter resides solely inside your application, but requires the user to relaunch your application after a crash, in order to email the report to you. The source code is available under the GNU Lesser General Public License, so you can customize it. http://www.profcast.com/developers/HDCrashReporter.php
ILCrashReporter is a framework that contains a custom CrashReporter application. When you start your application, you launch the CrashReporter and it watches your process for unexpected termination. When a crash occurs, the crash report and console log are emailed to you. The source code for both the framework and application are available. http://www.infinite-loop.dk/developer

If you choose one of the above options, you may still want to expand on the information they gather. To avoid delays caused by going back to ask for more information, get everything you might need all at once. In addition to the crash report, you should acquire a system profile to determine which environments are susceptible to the crash, and therefore how many users are likely to be affected by it. You should also get the console log, which may hold only a single line that pertains to your program, but that line often pinpoints the problem. These files can be obtained automatically, so that all the user must do is permit the information to be sent to you. The less they have to do, the more likely they are to report the crash. Consider providing a way for the user to describe the incident as well. They witnessed it, so they may know something crucial to reproducing it. You will also want to note the name and contact information of the reporter, so that you can ask for more information if necessary, and get confirmation once you have fixed it.

The Witness Interview

Eyewitness statements are even more unreliable in the technology industry than they are in criminal investigations. While a witness statement could provide the clues that you need to solve the problem, it could also contain misused terminology, specious assertions, and misleading statements. Always examine what the witness said, and stay open to it as a possibility, but do not assume any of it is accurate.

The problem with witness statements is the difficulty inherent in describing in words what is seen on the screen. To circumvent this, ask the reporter to take a screenshot of your application the moment before they reproduce the crash. You will notice the details that the reporter did not think to mention. At Telestream, we take this further and ask our Quality Assurance department to use ScreenFlow to record a video of the steps leading up to a crash or bug, which is better than a single picture, as it shows you the state of the program at each step. You could use the free demo of ScreenFlow (http://www.telestream.net/screen-flow), or Jing (http://www.jingproject.com) to do the same.

The Victim's Wallet

Once you have all of the documentation, examine it, starting with the crash report, which contains sections describing the process, the report, the crash, the threads, the registers, and the binaries.

The Process section contains identifying information about the process that crashed, including the name, process identification number, how the process was launched, and the executable path for the process.

Ensure that the identifier matches one of yours. If it does not, then the crash is out of your jurisdiction and you cannot fix it. Inform the reporter to send it to the appropriate party.

After the process name is a number in brackets. This is the process identification (PID) number that was assigned to the process when it started. Every process is given a number, starting with zero, and incrementing for each new process that is launched. If this number is high, you know that a lot of processes have been run since the last time the computer booted up, which suggests that the computer has been running for a while without a restart.

Next is the path to the executable for this process. If this is a location that you did not expect, investigate how your program behaves when run from this location. The user may have been running from a locked disk image, or in a directory where they did not have proper permissions, both of which could cause problems if your code is not designed to handle these situations.

The version number of your product is next, and it is essential in correlating the crash to a specific version of your code. The standard Mac version number scheme contains a major version, a minor version, and a bug fix number. This scheme lacks one essential feature. It does not provide a unique identifier for each and every build. Consider using a scheme that consists of the major version, minor version, bug fix number, and build number, thereby assigning a single unique identifier to each and every build. You can then correlate this number to the date and time that a build was made, and then retrieve the exact version of every source file used to make that build from your source control management system. This will save time that might otherwise be lost by trying to reproduce a crash with the wrong source code.
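
As a rough illustration of such a scheme, the sketch below parses a four-part version string and orders two builds. The `BuildVersion` type and both function names are hypothetical, and the "major.minor.bugfix.build" text layout is an assumption made for the example rather than a format that the crash report requires.

```c
#include <stdio.h>

/* Hypothetical four-part version: major.minor.bugfix.build */
typedef struct {
    int major;
    int minor;
    int bugfix;
    int build;
} BuildVersion;

/* Parses text of the form "2.1.3.1047".
 * Returns 1 when exactly four numeric fields were read, 0 otherwise. */
int parse_build_version(const char *text, BuildVersion *out)
{
    char trailing;
    /* The final %c only matches if unexpected characters follow the
     * build number, which pushes the field count to 5. */
    int fields = sscanf(text, "%d.%d.%d.%d%c",
                        &out->major, &out->minor,
                        &out->bugfix, &out->build, &trailing);
    return fields == 4;
}

/* Returns a negative value if a is older than b, zero if they are the
 * same build, and a positive value if a is newer than b. */
int compare_build_versions(const BuildVersion *a, const BuildVersion *b)
{
    if (a->major != b->major)   return a->major < b->major ? -1 : 1;
    if (a->minor != b->minor)   return a->minor < b->minor ? -1 : 1;
    if (a->bugfix != b->bugfix) return a->bugfix < b->bugfix ? -1 : 1;
    if (a->build != b->build)   return a->build < b->build ? -1 : 1;
    return 0;
}
```

With something like this in place, the version string from a crash report can be matched against your own list of builds before you check out the corresponding source.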

The code type specifies whether the PowerPC or Intel code inside your universal binary was the one that crashed. If you have code written for a specific architecture, such as AltiVec or SSE, this will tell you which was executed.

The parent process tells you how your application or plug-in was launched. For applications, this is typically launchd. If your product was launched by another process, it may have been in an environment or workflow that you had not anticipated.

The Crime Scene

The report section includes the date and time that the crash occurred, which can be used to correlate the crash with the console log, as most of its entries are time-stamped. You can then focus on the log entries immediately prior to the time of the crash.

The version of the operating system is important for reproducing issues, as they may be specific to a certain version of Mac OS X. If it is a new version of Mac OS X that was released after this version of your software, or if it is a very old version of Mac OS X, this could signal an incompatibility.

Lastly, the report version describes the format of the crash report, for use by automated analysis programs.

The Cause of Death

Crashes are caused by exceptions. The crash section describes the exception that caused the crash using two identifiers: the exception type, which is the category for the exception; and the exception code, which is the specific identifier. The most common exception types are EXC_ARITHMETIC, EXC_BAD_INSTRUCTION, and EXC_BAD_ACCESS. The line for the exception code may also include the offending address or value that caused the exception. The last item is the number of the thread that was executing when the exception was encountered.

The EXC_ARITHMETIC exception type covers any arithmetic that is considered illegal, such as dividing by zero (EXC_I386_DIV). Mathematically, the result of a division by zero is undefined. Intel processors are strict when it comes to dividing by zero, and they will not allow it. PowerPC processors were more forgiving, albeit mathematically incorrect. Instead of causing a crash, they returned zero as the result.

Listing 1: ExceptionController.m
Divide By Zero

The following demonstrates causing an EXC_ARITHMETIC/EXC_I386_DIV (divide by zero) exception. Note that the compiler will warn you if it sees that you are trying to do a divide operation with a literal constant of zero as the divisor. However, it will not catch situations where a variable with a value of zero is used as the divisor.

int divisor = 0;
// This line will cause the exception on an Intel processor.
int result = 128 / divisor;
// Modulus operations use division, so they can
// also cause this exception.
result = 128 % divisor;
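
One way to guard against the crash shown in the listing above is to test the divisor before dividing. The following is a minimal sketch in plain C; `checked_divide` is a hypothetical helper name, not something taken from ExceptionController.m.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: stores dividend / divisor in *quotient and
 * returns true, or returns false without touching *quotient when the
 * division would raise an arithmetic exception. */
bool checked_divide(int dividend, int divisor, int *quotient)
{
    if (divisor == 0) {
        return false;   /* divide by zero */
    }
    if (dividend == INT_MIN && divisor == -1) {
        return false;   /* result does not fit in an int; x86 raises
                           the same divide error for this case */
    }
    *quotient = dividend / divisor;
    return true;
}
```

The caller then decides what a failed division means for the program, instead of letting the processor decide for it.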

The EXC_BAD_INSTRUCTION exception type means that the processor was given an instruction that it does not understand. This usually means that your code has corrupted the instruction pointer, which is a register that points to the memory location that holds the next instruction to execute. When that pointer is corrupted, it points to some other part of memory, and the processor tries to interpret that memory as an instruction when it was intended to be something else.

In order to prevent a problem in one program from crashing other programs, or even the entire system, Mac OS X uses protected memory. Every process is given a virtual address space, which is divided into segments. Each segment has permissions that specify whether you can read from it, write to it, or execute it. When you allocate memory, it is mapped from the physical address that it resides on to the virtual address that is given to your program. The EXC_BAD_ACCESS exception type means that your program attempted to access memory that either was not mapped (KERN_INVALID_ADDRESS), or that it was not allowed to access (KERN_PROTECTION_FAILURE) because of the permissions on that segment. To examine the virtual memory maps for your application, pass the PID of your application to the vmmap command line tool.

Listing 2: ExceptionController.m
Kernel Invalid Address

The following demonstrates causing an EXC_BAD_ACCESS/KERN_INVALID_ADDRESS exception.

// On 32-bit systems, each process can have up to 4GB of
// memory. Here, we try to write to the very last byte,
// which is neither likely to be mapped already, nor
// mapped by us via allocation. While you aren't likely
// to explicitly do this in an application, if you try to
// write to a pointer that has been corrupted, you may
// end up doing just this.
memcpy( (void *)0xFFFFFFFF, "d", 1);

Listing 3: ExceptionController.m
Kernel Protection Failure

The following demonstrates causing an EXC_BAD_ACCESS/KERN_PROTECTION_FAILURE exception.

// Trying to write to a NULL pointer will cause this 
// exception, as memory address zero resides in a 
// virtual memory segment called "__PAGEZERO", which
// does not allow write access.
long *badPointer = NULL;
// This line will cause the crash.
*badPointer = 0xFEEDFACE;

Now that you know how to cause these bugs, you will be better prepared to find them and fix them.

The Corpse

The body of the crash report is the threads section. This shows the call stack for every thread in your program when your process crashed. Each entry in the call stack contains a number defining its position in the stack, the identifier of the binary that contains the function, the address of the function, the function name, and the offset of the current instruction within that function. The first line is the function that the thread was in at the time of the crash. If the identifier in the first line of the call stack for the crashed thread is not the identifier for your application, you can check the Binary Images Description portion of the crash log for more information on that binary. We will cover more on that section later.

Threads can either be actively executing, or blocked, waiting to execute. Determining which threads were active at the time of the crash, and which were blocked, will allow you to ferret out the potential culprits. Consider any thread that was blocked to have an alibi. Any thread whose current function name contains one of the following words was most likely blocked: wait, delay, semaphore, mutex, or sleep.
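
If you process many reports, the keyword check described above is easy to automate. Here is a small sketch, assuming you have already extracted the function name from the top frame of each thread; `frame_looks_blocked` is a hypothetical name, and the case-insensitive match is an assumption added so that names such as WaitNextEvent are caught as well.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if text contains word, ignoring the case of text.
 * word is expected to be lowercase already. */
static bool contains_ignoring_case(const char *text, const char *word)
{
    for (; *text != '\0'; text++) {
        size_t i = 0;
        while (word[i] != '\0' && text[i] != '\0' &&
               tolower((unsigned char)text[i]) == word[i]) {
            i++;
        }
        if (word[i] == '\0') {
            return true;    /* matched every character of word */
        }
    }
    return false;
}

/* Heuristic only: a thread whose current function name contains one
 * of these words was most likely blocked at the time of the crash. */
bool frame_looks_blocked(const char *function_name)
{
    static const char *const keywords[] = {
        "wait", "delay", "semaphore", "mutex", "sleep"
    };
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++) {
        if (contains_ignoring_case(function_name, keywords[i])) {
            return true;
        }
    }
    return false;
}
```

Treat a true result as an alibi, not an acquittal: it narrows the list of suspects, but it is still only a guess based on naming.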

If you notice a thread with one or more function names repeated, particularly if the call stack is very deep, the exception might be caused by runaway recursion. Recursion is when a function calls itself, either directly or indirectly. This can be quite a useful technique, particularly when dealing with hierarchical data, but if left unchecked, the recursion could keep going until it exhausts the thread's stack space, which will cause a crash. Recursion can also happen unintentionally, for instance, if you were to call [self display] in the drawing routine of a custom view.
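
One defensive pattern for recursive code that walks hierarchical data is to carry a depth counter and give up once it passes a limit. The sketch below is only an illustration: the `Node` type, `count_nodes`, and the limit of 64 levels are all assumptions chosen for the example.

```c
#include <stddef.h>

/* Hypothetical tree node for the example. */
typedef struct Node {
    struct Node *first_child;
    struct Node *next_sibling;
} Node;

/* Assumed limit; pick one that suits your own data. */
enum { MAX_DEPTH = 64 };

/* Counts the nodes in the tree rooted at node.
 * Returns -1 if the tree is deeper than MAX_DEPTH, which usually
 * means the data is damaged or contains a cycle. */
long count_nodes(const Node *node, int depth)
{
    if (node == NULL) {
        return 0;
    }
    if (depth > MAX_DEPTH) {
        return -1;
    }
    long total = 1;
    for (const Node *child = node->first_child; child != NULL;
         child = child->next_sibling) {
        long below = count_nodes(child, depth + 1);
        if (below < 0) {
            return -1;      /* pass the failure up to the caller */
        }
        total += below;
    }
    return total;
}
```

Call it with a depth of zero for the root. A node that lists itself as its own child now produces an error code rather than a call stack thousands of frames deep.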

If you see question marks in place of the binary identifiers, you could have a stack corruption. These can be difficult to solve because the application will continue to run after the memory has been corrupted, crashing instead in code that is executed much later. If you suspect you are dealing with a stack corruption, try turning on stack canaries in Xcode by adding the -fstack-protector (or -fstack-protector-all) flag to the "Other C Flags" setting for your project. Stack canaries work like a canary in a coal mine, as an early warning system. When stack canaries are on, the integrity of the stack is checked when you return from a function. If the stack has been corrupted, an error message is printed to the console to help you find the problem.
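
A common source of stack corruption is copying more data into a local buffer than it can hold. As a hedged illustration of the fix, not of the canary mechanism itself, the sketch below uses a bounded copy; `copy_label` is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stdio.h>

/* Copies source into dest without ever writing past dest_size bytes.
 * Returns true if the whole string fit, false if it was truncated.
 * dest_size must be at least 1 so the terminator always fits. */
bool copy_label(char *dest, size_t dest_size, const char *source)
{
    /* snprintf writes at most dest_size bytes, including the
     * terminator, and returns the length it wanted to write. */
    int needed = snprintf(dest, dest_size, "%s", source);
    return needed >= 0 && (size_t)needed < dest_size;
}
```

Checking the return value tells you when the input was longer than expected, which is often the first clue that something upstream is feeding your code bad data.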

Multithreading problems are also difficult to track down because the crashes may only happen a small percentage of the time, and the offending code might not be in the thread that crashed. Your best resource for these types of issues is collecting multiple crash logs and comparing them. If you find that the same two threads are always in similar locations when the crash occurs, check for unsynchronized access to shared resources, which is usually the culprit. Check your semaphores and mutexes to see if there is a case you might have left vulnerable to simultaneous access.
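
For reference, this is roughly what synchronized access to a shared resource looks like with POSIX threads. It is a minimal sketch: the `SharedCounter` type, the function names, and the thread and iteration counts are all made up for the example.

```c
#include <pthread.h>
#include <stddef.h>

enum { THREAD_COUNT = 4, INCREMENTS_PER_THREAD = 100000 };

/* The shared resource and the mutex that guards it travel together. */
typedef struct {
    pthread_mutex_t lock;
    long value;
} SharedCounter;

static void *increment_many(void *context)
{
    SharedCounter *counter = context;
    for (int i = 0; i < INCREMENTS_PER_THREAD; i++) {
        /* Without the lock, two threads can read the same old value
         * and one of the increments is lost. */
        pthread_mutex_lock(&counter->lock);
        counter->value++;
        pthread_mutex_unlock(&counter->lock);
    }
    return NULL;
}

/* Runs the worker threads and returns the final count, or -1 if the
 * mutex or any of the threads could not be created. */
long run_counter_demo(void)
{
    SharedCounter counter;
    counter.value = 0;
    if (pthread_mutex_init(&counter.lock, NULL) != 0) {
        return -1;
    }
    pthread_t threads[THREAD_COUNT];
    int started = 0;
    for (; started < THREAD_COUNT; started++) {
        if (pthread_create(&threads[started], NULL,
                           increment_many, &counter) != 0) {
            break;
        }
    }
    for (int i = 0; i < started; i++) {
        pthread_join(threads[i], NULL);
    }
    pthread_mutex_destroy(&counter.lock);
    return started == THREAD_COUNT ? counter.value : -1;
}
```

When reviewing your own code, look for any path that touches the shared value without taking the lock first; a single unguarded path is enough to produce an intermittent crash.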

The Brain

After the threads section, you will find a table listing the registers and their values at the time of the crash. The x86 processor architecture designates eight registers for general purposes, six segment registers for memory management, a flags register to describe or control the results of operations, and the instruction pointer, which holds the address of the next instruction to execute. Not all x86 registers are listed in the crash report, but the ones that are can provide clues as to the cause of the crash.

x86 General Purpose Registers Listed In Crash Reports

Register (32 Bit / 64 Bit)    Purpose
EAX / RAX                     Accumulator
EBX / RBX                     Base
ECX / RCX                     Counter
EDX / RDX                     Data
EDI / RDI                     Destination Index
ESI / RSI                     Source Index
EBP / RBP                     Base Pointer
ESP / RSP                     Stack Pointer

While the general-purpose registers can be used for anything, most have certain tasks that they are optimized for. The accumulator is where most arithmetic calculations are performed. The base register has no specialized purpose. The counter register is designed for use as the index in loops. The data register is for storing data used in the calculations occurring in the accumulator. The destination index is for use as a pointer to the current location in a write operation. Similarly, the source index is for use as a pointer in a read operation. The base pointer points to the base of the current function's stack frame, and the stack pointer points to the top of the stack.

x86 Other Registers Listed In Crash Reports

Register      Purpose
SS            Stack Segment
EFL / RFL     Flags
EIP / RIP     Instruction Pointer
CS            Code Segment
DS            Data Segment
ES            Extra Segment
FS            F (Extra) Segment
GS            G (Extra) Segment
CR2           Control Register 2

The remaining registers have dedicated purposes. The segment registers are for supporting memory protection via segmentation. However, paging is now the preferred method of memory protection, so most of these registers are set to the same value. The F and G segments may store data specific to a thread. The flags register is used to control the results of operations, and to store information about those results, such as if the result overflowed the register. CR2 contains the offending address when a page fault occurs.
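
To make the flags register less opaque, the sketch below picks four of the status bits out of an EFL value copied from a crash report. The bit positions (carry is bit 0, zero is bit 6, sign is bit 7, overflow is bit 11) are those defined by the x86 architecture; the `DecodedFlags` type and `decode_flags` function are hypothetical names for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Status bits of the x86 flags register used in this example. */
enum {
    FLAG_CARRY    = 1u << 0,
    FLAG_ZERO     = 1u << 6,
    FLAG_SIGN     = 1u << 7,
    FLAG_OVERFLOW = 1u << 11
};

typedef struct {
    bool carry;     /* unsigned result did not fit */
    bool zero;      /* result was zero */
    bool sign;      /* result was negative */
    bool overflow;  /* signed result did not fit */
} DecodedFlags;

/* Extracts the four status bits from a raw flags value. */
DecodedFlags decode_flags(uint32_t efl)
{
    DecodedFlags decoded;
    decoded.carry    = (efl & FLAG_CARRY) != 0;
    decoded.zero     = (efl & FLAG_ZERO) != 0;
    decoded.sign     = (efl & FLAG_SIGN) != 0;
    decoded.overflow = (efl & FLAG_OVERFLOW) != 0;
    return decoded;
}
```

For instance, a value of 0x00010246 decodes to the zero flag being set and the other three being clear, which tells you that the last flag-setting instruction produced a result of zero.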

The Known Associates

The Binary Images Description section of the crash report has a list of all of the binaries involved in running your application, including the frameworks, plug-ins, and dynamically-linked libraries. There is one line per binary, and each entry contains the memory address span, the identifier, the version, and the file path it was loaded from. This list is usually long, even for the most trivial applications. If your application uses plug-ins, look for them here to see what versions were present.
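
The address spans in this section let you work out which binary owns any address you find elsewhere in the report, such as a value in a register. A minimal sketch follows, assuming you have already parsed the list into an array and that each span's end address is inclusive; the `BinaryImage` type and `find_image` function are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry from the binary images list, reduced to what we need. */
typedef struct {
    uint64_t start;             /* first address of the span */
    uint64_t end;               /* last address, assumed inclusive */
    const char *identifier;     /* e.g. a bundle identifier */
} BinaryImage;

/* Returns the image whose span contains address, or NULL if the
 * address does not fall inside any of the listed binaries. */
const BinaryImage *find_image(const BinaryImage *images, size_t count,
                              uint64_t address)
{
    for (size_t i = 0; i < count; i++) {
        if (address >= images[i].start && address <= images[i].end) {
            return &images[i];
        }
    }
    return NULL;
}
```

An address that matches no image at all is itself a clue, since it suggests a pointer that has wandered off into unmapped memory.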

If you are having trouble finding the cause of your crash, it is worth taking a few minutes to review this list. Look for anything that is unusual, meaning any entry whose identifier is neither yours nor Apple's (i.e. com.apple.whatever). When you find one that you do not recognize, look it up online. If it seems like it could interfere, try uninstalling it and see if the problem disappears.

The Modus Operandi

Some bugs cause immediate crashes, such as the EXC_I386_DIV exception; others start causing problems that will lead to a crash later. If the crash happens at the same line of code every time it is executed, it is probably going to be easy to fix. If, however, the crash only happens occasionally, or at different lines of code, then it is a delayed crash, and will be tougher. To fix these issues, you will have to backtrack and find the initial problem.

To tackle delayed crashes, there are several techniques you can use to narrow down the problem. Law enforcement has a better chance of catching a serial killer each time he commits a new murder because each incident provides the investigators with more information. Similarly, the more instances of the crash you have to examine, the easier it will be to solve. Collect documentation for multiple instances of the crash and compare them, using your favorite diff tool. The information that is the same may be the conditions that are required to cause the crash, which are hints to the cause. You can also use this technique to exclude unrelated crash reports, if they are dramatically different than all of the others that you have collected on the issue.

The Reenactment

Your goal now is to be able to reliably reproduce the crash. If you cannot, you will never be able to verify that it was fixed, so you might as well drop it in the cold case drawer.

The next step is reducing the time it takes to reproduce the crash to under a few minutes. If the crash takes an hour to occur, it will take an eternity to investigate it.

If appropriate, try stressing the program by reducing the resources it has available, such as RAM, virtual memory, disk space, network bandwidth, etc. The crash might require one of those resources reaching a critically low point. By reducing those resources from the beginning, such as by launching a lot of applications, filling up disk space with large files, or starting extremely large file transfers, you can induce the required conditions without the usual wait.

Try examining what you think the program is doing around the time of the crash. If it is working on a certain part of a large file, then try making that part of the file into the beginning of the file, either by moving it, or by cutting out everything preceding it. If it is in the last stage of a multistage process, then try disabling the prior stages.

Once your crash is easily reproducible, you will need to narrow down the problem. Try taking out easily removable items such as plug-ins and frameworks. Next try commenting out half the suspected code, in a way that leaves the other half still compiling and usable. If the problem persists, then you know the bug is in the uncommented half. Then try commenting out half of that code, and continue with this technique until the bug becomes evident.

Another useful technique is regression, which requires that you use a source control management (SCM) product, such as CVS, Subversion, or Perforce. It will also be useful to have unique version numbers for every build of your product, as discussed earlier. Try going back through previous builds of your product until you find one where the problem did not occur. Then, using your SCM tool, find the changes that were made between the unaffected build and the affected build. Those changes are likely to contain the bug.
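
When there are many builds between the last good one and the first known bad one, halving the range each time finds the culprit in far fewer tries than stepping back one build at a time. The sketch below assumes that builds are numbered sequentially and that the bug, once introduced, is present in every later build. All names are hypothetical, and `crashes_at_or_after` is only a stand-in for actually installing a build and trying to reproduce the crash.

```c
#include <stdbool.h>

/* Callback that reports whether a given build reproduces the crash. */
typedef bool (*BuildCrashes)(int build_number, void *context);

/* Given a build known to be good and a later build known to be bad,
 * returns the first build that crashes. Assumes good < bad and that
 * every build after the first bad one also crashes. */
int first_bad_build(int good, int bad, BuildCrashes crashes, void *context)
{
    while (bad - good > 1) {
        int middle = good + (bad - good) / 2;
        if (crashes(middle, context)) {
            bad = middle;       /* bug is at or before middle */
        } else {
            good = middle;      /* bug came after middle */
        }
    }
    return bad;
}

/* Stand-in for a real test run: context points to the build number
 * in which the bug was introduced. */
bool crashes_at_or_after(int build_number, void *context)
{
    const int *introduced_in = context;
    return build_number >= *introduced_in;
}
```

With one hundred builds between the good and bad ends, this takes about seven test runs rather than up to a hundred.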

The Suspect Lineup

You should now be able to turn the problem on or off at will, making the crash occur or not occur.

Now that you have found the fix, you might be thinking that you are done, but there are often many possible fixes for any problem. Do you really want to use whichever one happened to be the first that you found? Take the time to think of a few other possible ways to fix the problem. Then consider the benefits and drawbacks of each. Consider the time it takes to implement, the maintainability of the code, the scope of the changes, and the likelihood that the fix will cause more problems. Now you can pick the best fix and implement it.

The Conviction

You are almost done. Document the bug and the fix in your code, so that neither you nor the other members of your team inadvertently reintroduce the problem. Document it in your source code management system as well, so that you know when you fixed it, both with regard to time and to versioning. And finally, document it in your release notes so that your users know that this is the update that will fix the problem they are encountering. Do not be so ashamed of the crash that you omit it from the release notes. All applications crash, even Apple's. The fact that you found it, fixed it fast, and made the fix available to your users quickly and honestly, is something to be proud of. Keeping excellent records like this will help prevent the problem from reappearing, and will also provide you with valuable resources for tracking down your next crash.
