TweetFollow Us on Twitter

CSI: Crash Scene Investigation

Volume Number: 25
Issue Number: 12
Column Tag: Debugging

CSI: Crash Scene Investigation

Examining crashes to catch the culprit

by David Garcea

The 911 Call

You've written an application. Great! However, every application will crash eventually. Every crash is a mystery waiting to be solved, and the deduction and experimentation required to solve it can make you feel like Gil Grissom, but first you must know it happened. When an application crashes, the Crash Reporter will gather information about it and send it to Apple. This works great for Apple's products, but it doesn't help third party developers. You will have to do some coding to redirect this information to you. There are several packages that you can use to accomplish this, or you can do it all yourself. Here are a few of the free, ready-made options:

  • Smart Crash Reports by Unsanity is an InputManager-based enhancement for Apple's Crash Reporter application, which causes the crash report to be posted to a CGI on either your web server, or on Unsanity's server, as well as sending it to Apple. http://www.smartcrashreports.com
  • HDCrashReporter resides solely inside your application, but requires the user to relaunch your application after a crash, in order to email the report to you. The source code is available under the GNU Lesser General Public License, so you can customize it. http://www.profcast.com/developers/HDCrashReporter.php
  • ILCrashReporter is a framework that contains a custom CrashReporter application. When you start your application, you launch the CrashReporter and it watches your process for unexpected termination. When a crash occurs, the crash report and console log are emailed to you. The source code for both the framework and application are available. http://www.infinite-loop.dk/developer

If you choose one of the above options, you may still want to expand on the information they gather. To avoid delays caused by going back to ask for more information, get everything you might need all at once. In addition to the crash report, you should acquire a system profile to determine which environments are susceptible to the crash, and therefore how many users are likely to be affected by it. You should also get the console log, which may hold only a single line that pertains to your program, but that line often pinpoints the problem. These files can be obtained automatically, so that all the user must do is permit the information to be sent to you. The less they have to do, the more likely they are to report the crash. Consider providing a way for the user to describe the incident as well. They witnessed it, so they may know something crucial to reproducing it. You will also want to note the name and contact information of the reporter, so that you can ask for more information if necessary, and get confirmation once you have fixed it.

The Witness Interview

Eyewitness statements are even more unreliable in the technology industry than they are in criminal investigations. While the witness statement could provide the clues that you need to solve the problem, they could also contain misused terminology, specious assertions, and misleading statements. Always examine what the witness said, and stay open to it as a possibility, but do not to assume any of it is accurate.

The problem with witness statements is the difficulty inherent in describing in words what is seen on the screen. To circumvent this, ask the reporter to take a screenshot of your application the moment before they reproduce the crash. You will notice the details that the reporter did not think to mention. At Telestream, we take this further and ask our Quality Assurance department to use ScreenFlow to record a video of the steps leading up to a crash or bug, which is better than a single picture, as it shows you the state of the program at each step. You could use the free demo of ScreenFlow (http://www.telestream.net/screen-flow), or Jing (http://www.jingproject.com) to do the same.

The Victim's Wallet

Once you have all of the documentation, examine it, starting with the crash report, which contains sections describing the process, the report, the crash, the threads, the registers, and the binaries.

The Process section contains identifying information about the process that crashed, including the name, process identification number, how the process was launched, and the executable path for the process.

Ensure that the identifier matches one of yours. If it does not, then the crash is out of your jurisdiction and you cannot fix it. Inform the reporter to send it to the appropriate party.

After the process name is a number in brackets. This is the process identification (PID) number that was assigned to the process when it started. Every process is given a number, starting with zero, and incrementing for each new process that is launched. If this number is high, you know that a lot of processes have been run since the last time the computer booted up, which suggests that the computer has been running for a while without a restart.

Next is the path to the executable for this process. If this is a location that you did not expect, investigate how your program behaves when run from this location. The user may have been running from a locked disk image, or in a directory where they did not have proper permissions, both of which could cause problems if your code is not designed to handle these situations.

The version number of your product is next, and it is essential in correlating the crash to a specific version of your code. The standard Mac version number scheme contains a major version, a minor version, and a bug fix number. This scheme lacks one essential feature. It does not provide a unique identifier for each and every build. Consider using a scheme that consists of the major version, minor version, bug fix number, and build number, thereby assigning a single unique identifier to each and every build. You can then correlate this number to the date and time that a build was made, and then retrieve the exact version of every source file used to make that build from your source control management system. This will save time that might otherwise be lost by trying to reproduce a crash with the wrong source code.

The code type specifies whether the PowerPC or Intel code inside your universal binary was the one that crashed. If you have code written for a specific architecture, such as AltiVec or SSE, this will tell you which was executed.

The parent process tells you how your application or plug-in was launched. For applications, this is typically launchd. If your product was launched by another process, it may have been in an environment or workflow that you had not anticipated.

The Crime Scene

The report section includes the date and time that the crash occurred, which can be used to correlate the crash with the console log, as most of its entries are time-stamped. You can then focus on the log entries immediately prior to the time of the crash.

The version of the operating system is important for reproducing issues, as they may be specific to a certain version of Mac OS X. If it is a new version of Mac OS X that was released after this version of your software, or if it is a very old version of MacOS X, this could signal an incompatibility.

Lastly, the report version describes the format of the crash report, for use by automated analysis programs.

The Cause of Death

Crashes are caused by exceptions. The crash section describes the exception that caused the crash using two identifiers: the exception type, which is the category for the exception; and the exception code, which is the specific identifier. The most common exception types are EXC_ARITHMETIC, EXC_BAD_INSTRUCTION, and EXC_BAD_ACCESS. The line for the exception code may also include the offending address or value that caused the exception. The last item is the number of the thread that was executing when the exception was encountered.

The EXC_ARITHMETIC exception type covers any arithmetic that is considered illegal, such as dividing by zero (EXC_I386_DIV). Mathematically, the result of a division by zero is undefined. Intel processors are strict when it comes to dividing by zero, and they will not allow it. PowerPC processors were more forgiving, albeit mathematically incorrect. Instead of causing a crash, they returned zero as the result.

Listing 1: ExceptionController.m

Divide By Zero

The following demonstrates causing an EXC_ARITHMETIC/EXC_I386_DIV (divide by zero) exception. Note that the compiler will warn you if it sees that you are trying to do a divide operation with a literal constant of zero as the divisor. However, it will not catch situations where a variable with a value of zero is used as the divisor.

int divisor = 0;
   
// This line will cause the exception on an Intel processor.
int result = 128 / divisor;
// Modulus operations use division, so they can
// also cause this exception.
result = 128 % divisor;

The EXC_BAD_INSTRUCTION exception type means that the processor was given an instruction that it does not understand. This means that your code has corrupted the instruction pointer, which is a register that points to the memory location that holds the next instruction to execute. When that pointer is corrupted it points to some other part of memory and the processor tries to interpret that memory as an instruction, when it was intended to be something else.

In order to prevent a problem in one program from crashing other programs, or even the entire system, Mac OS X uses protected memory. Every process is given a virtual address space, which is divided into segments. Each segment has permissions that specify whether you can read from it, write to it, or execute it. When you allocate memory, it is mapped from the physical address that it resides on to the virtual address that is given to your program. The EXC_BAD_ACCESS exception type means that your program attempted to access memory that either was not mapped (KERN_INVALID_ADDRESS), or was not allowed to access (KERN_PROTECTION_FAILURE) because of the permissions on that segment. To examine the virtual memory maps for your application, pass the PID of your application to the vmmap command line tool.

Listing 2: ExceptionController.m

Kernel Invalid Address

The following demonstrates causing an EXC_BAD_ACCESS/ KERN_INVALID_ADDRESS exception.

   
// On 32-Bit systems, each process can have up to 4GB of 
// memory. Here, we try to write to the very last byte,
// which is neither likely to be mapped already, nor 
// mapped by us via allocation. While you aren't likely
// to explicitly do this in application, if you try to 
// write to a pointer that has been corrupted, you may 
// end up doing just this.
memcpy( (void *)0xFFFFFFFF, "d", 1);

Listing 3: ExceptionController.m

Kernel Protection Failure

The following demonstrates causing an EXC_BAD_ACCESS/ KERN_PROTECTION_FAILURE exception.

// Trying to write to a NULL pointer will cause this 
// exception, as memory address zero resides in a 
// virtual memory segment called "__PAGEZERO", which
// does not allow write access.
long *badPointer = NULL;
// This line will cause the crash.
*badPointer = 0xFEEDFACE;

Now that you know how to cause these bugs, you will be better prepared to find them and fix them.

The Corpse

The body of the crash report is the threads section. This shows the call stack for every thread in your program when your process crashed. Each entry in the call stack contains a number defining its position in the stack, a universal type identifier, the address of the function, the function name, and the offset to the instruction that caused the crash. The first line is the function that the thread was in at the time of the crash. The identifier tells you what binary contains that function. If the identifier in the first line of the call stack for the crashed thread is not the identifier for your application, you can check the Binary Images Description portion of the crash log for more information on that binary. We will cover more on that section later.

Threads can either be actively executing, or blocked, waiting to execute. Determining which threads were active at the time of the crash and which were blocked, will allow you ferret out the potential culprits. Consider any thread that was blocked to have an alibi. Any thread, whose current function name contains one of the following words, was most likely blocked: wait, delay, semaphore, mutex, and sleep.

If you notice a thread with one or more function names repeated, particularly if the call stack is very deep, the exception might be caused by runaway recursion. Recursion is when a function calls itself, either directly or indirectly. This can be quite useful technique, particularly when dealing with hierarchical data, but if left unchecked, the recursion could keep going until it uses up all of the available memory, which will cause a crash. Recursion can also happen unintentionally, for instance, if you call were to call [self display] in the drawing routine of a custom view.

If you see question marks in place of the binary identifiers, you could have a stack corruption. These can be difficult to solve because the application will continue to run after the memory has been corrupted, crashing instead in code that is executed much later. If you suspect you are dealing with a stack corruption, try turning on stack canaries in Xcode by adding the –fstack-protector (or –fstack-protector-all) flag to the "Other C Flags" setting for your project. Stack canaries work like a canary in a coal mine, as an early warning system. When stack canaries are on, the integrity of the stack is checked when you return from a function. If the stack has been corrupted, an error message is printed to the console to help you find the problem.

Multithreading problems are also difficult to track down because the crashes may only happen a small percentage of the time, and the offending code might not be the in the thread that crashed. Your best resource for these types of issues is collecting multiple crash logs and comparing them together. If you find the same two threads are always in similar locations when the crash occurs, try checking for unsynchronized access to shared resources, which is usually the culprit. Check your semaphores, and mutexes, to see if there is a case you might have left vulnerable to simultaneous access.

The Brain

After the threads section, you will find a table listing the registers and their values at the time of the crash. The x86 processor architecture designates eight registers for general purposes, six segment registers for memory management, a flags register to describe or control the results of operations, and the instruction pointer, which holds the address of the next instruction to execute. Not all x86 registers are listed in the crash report, but the ones that are can provide clues as to the cause of the crash.

x86 General Purpose Registers Listed In Crash Reports

Register 
(32 Bit / 64 Bit)      Purpose   
EAX   /RAX      Accumulator    
EBX   /RBX      Base   
ECX   /RCX      Counter   
EDX   /RDX      Data   
EDI   /RDI      Destination Index   
ESI   /RSI      Source Index   
EBP   /RBP      Base Pointer   
ESP   /RSP      Stack Pointer   

While the general-purpose registers can be used for anything, most have certain tasks that they are optimized for. The accumulator is where most arithmetic calculations are performed. The base register has no specialized purpose. The counter register is designed for use as the index in loops. The data register is for storing data used in the calculations occurring in the accumulator. The destination index is for use as a pointer to the current location in a write operation. Similarly, the source index is for use as a pointer in a read operation. The base pointer points to the bottom of the stack, and the stack pointer points to the top of the stack.

x86 Other Registers Listed In Crash Reports

Register       Purpose   
SS      Stack Segment   
EFL/RFL      Flags   
EIP/RIP      Instruction Pointer   
CS      Code Segment   
DS      Data Segment   
ES      Extra Segment   
FS      F (Extra) Segment   
GS      G (Extra) Segment   
CR2      Control Register 2   

The remaining registers have dedicated purposes. The segment registers are for supporting memory protection via segmentation. However, paging is now the preferred method of memory protection, so most of these registers are set to the same value. The F and G segments may store data specific to a thread. The flags register is used to control the results of operations, and to store information about those results, such as if the result overflowed the register. CR2 contains the offending address when a page fault occurs.

The Known Associates

The Binary Images Description section of the crash report has a list of all of the binaries involved in running your application, including the frameworks, plug-ins, and dynamically-linked libraries. There is one line per binary, and each entry contains the memory address span, the identifier, the version, and the file path it was loaded from. This list is usually long, even for the most trivial applications. If your application uses plug-ins, look for them here to see what versions were present.

If you are having trouble finding the cause of your crash, it is worth taking a few minutes to review this list. Look for anything that is unusual, meaning any entry whose identifier is neither yours nor Apple's (i.e. com.apple.whatever). When you find one that you do not recognize, look it up online. If it seems like it could interfere, try uninstalling it and see if the problem disappears.

The Modus Operandi

Some bugs cause immediate crashes, such as the EXC_I386_DIV exception, others start causing problems that will lead to a crash. If the crash happens at the same line of code every time it is executed, it is probably going to be easy to fix. If, however, the crash only happens occasionally, or at different lines of code, then it is a delayed crash, and will be tougher. To fix these issues, you will have to backtrack and find the initial problem.

To tackle delayed crashes, there are several techniques you can use to narrow down the problem. Law enforcement has a better chance of catching a serial killer each time he commits a new murder because each incident provides the investigators with more information. Similarly, the more instances of the crash you have to examine, the easier it will be to solve. Collect documentation for multiple instances of the crash and compare them, using your favorite diff tool. The information that is the same may be the conditions that are required to cause the crash, which are hints to the cause. You can also use this technique to exclude unrelated crash reports, if they are dramatically different than all of the others that you have collected on the issue.

The Reenactment

Your goal now is to be able to reliably reproduce the crash. If you cannot, you will never be able to verify that it was fixed, so you might as well drop it in the cold case drawer.

The next step is reducing the time it takes to reproduce the crash to under a few minutes. If the crash takes an hour to occur, it will take an eternity to investigate it.

If appropriate, try stressing the program by reducing the resources it has available, such as RAM, virtual memory, disk space, network bandwidth, etc. The crash might require one of those resources reaching a critically low point. By reducing those resources from the beginning, such as by launching a lot of applications, filling up disk space with large files, or starting extremely large file transfers, you can induce the required conditions without the usual wait.

Try examining what you think it is doing around the time of the crash. If it is working on a certain part of a large file, then try making that part of the file into the beginning of the file, either by moving it, or by cutting out everything preceding it. If it is in last stage of a multistage process, then try disabling the prior stages.

Once your crash is easily reproducible, you will need to narrow down the problem. Try taking out easily removable items such as plug-ins and frameworks. Next try commenting out half the suspected code, in a way that leaves the other half still compiling and usable. If the problem persists, then you know the bug is in the uncommented half. Then try commenting out half of that code, and continue with this technique until the bug becomes evident.

Another useful technique is regression, which requires that you use a source control management (SCM) product, such as CVS, Subversion, or Perforce. It will also be useful to have unique version numbers for every build of your product like we discussed earlier. Try going back through previous builds of your product until you find one where the problem did not occur. Then, using your SCM tool, find the changes that were made between the unaffected build and the affected build. Those changes are likely to contain the bug.

The Suspect Lineup

You should now be able to turn the problem on or off at will, making the crash occur or not occur.

Now that you have found the fix, you might be thinking that you are done, but there are often many possible fixes for any problem. Do you really want to use whichever one happened to be the first that you found? Take the time to think of a few other possible ways to fix the problem. Then consider the benefits and drawbacks of each. Consider the time it takes to implement, the maintainability of the code, the scope of the changes, and the likelihood that the fix will cause more problems. Now you can pick the best fix and implement it.

The Conviction

You are almost done. Document the bug and the fix in your code, so that neither you, nor the other members of your team inadvertently reintroduce the problem. Document it in your source code management system as well, so that you know when you fixed it, both in regards to time and to versioning. And finally, document it in your release notes so that your users know that this is the update that will fix the problem they are encountering. Do not be so ashamed of the crash that you omit it from the release notes. All applications crash, even Apple's. The fact that you found it, fixed it fast, and made the fix available to your users quickly and honestly, is something to be proud of. Keeping excellent records like this will help prevent the problem from reappearing, and will also provide you with valuable resources for tracking down your next crash.

Suggested Reading

Apple. "Technical Note TN2123: CrashReporter". http://developer.apple.com/technotes/tn2004/tn2123.html

Wikibooks. "X86 Assembly/X86 Architecture". http://en.wikibooks.org/wiki/X86_Assembly/X86_Architecture

William Swanson. "The Art of Picking Intel Registers". http://www.swansontec.com/sregisters.html


David Garcea is the Engineering Program Manager for Macintosh Desktop Products at the U.S. headquarters of Telestream, Inc., makers of Flip4Mac WMV, Episode, Drive-in, Pipeline, and ScreenFlow. Drawing on over twelve years of experience making Mac applications and plug-ins, he leads the team of engineers that makes it possible for you to make and watch WMV content on your Mac, load all of your DVDs onto your laptop, or capture, transcode, and edit a multi-camera concert in Times Square. He has a bachelor's degree in computer science from the State University of New York Empire State College, and lives in Northern California. You can reach him at dgarcea@gmail.com.

 

Community Search:
MacTech Search:

Software Updates via MacUpdate

Latest Forum Discussions

See All

Tokkun Studio unveils alpha trailer for...
We are back on the MMORPG news train, and this time it comes from the sort of international developers Tokkun Studio. They are based in France and Japan, so it counts. Anyway, semantics aside, they have released an alpha trailer for the upcoming... | Read more »
Win a host of exclusive in-game Honor of...
To celebrate its latest Jujutsu Kaisen crossover event, Honor of Kings is offering a bounty of login and achievement rewards kicking off the holiday season early. [Read more] | Read more »
Miraibo GO comes out swinging hard as it...
Having just launched what feels like yesterday, Dreamcube Studio is wasting no time adding events to their open-world survival Miraibo GO. Abyssal Souls arrives relatively in time for the spooky season and brings with it horrifying new partners to... | Read more »
Ditch the heavy binders and high price t...
As fun as the real-world equivalent and the very old Game Boy version are, the Pokemon Trading Card games have historically been received poorly on mobile. It is a very strange and confusing trend, but one that The Pokemon Company is determined to... | Read more »
Peace amongst mobile gamers is now shatt...
Some of the crazy folk tales from gaming have undoubtedly come from the EVE universe. Stories of spying, betrayal, and epic battles have entered history, and now the franchise expands as CCP Games launches EVE Galaxy Conquest, a free-to-play 4x... | Read more »
Lord of Nazarick, the turn-based RPG bas...
Crunchyroll and A PLUS JAPAN have just confirmed that Lord of Nazarick, their turn-based RPG based on the popular OVERLORD anime, is now available for iOS and Android. Starting today at 2PM CET, fans can download the game from Google Play and the... | Read more »
Digital Extremes' recent Devstream...
If you are anything like me you are impatiently waiting for Warframe: 1999 whilst simultaneously cursing the fact Excalibur Prime is permanently Vault locked. To keep us fed during our wait, Digital Extremes hosted a Double Devstream to dish out a... | Read more »
The Frozen Canvas adds a splash of colou...
It is time to grab your gloves and layer up, as Torchlight: Infinite is diving into the frozen tundra in its sixth season. The Frozen Canvas is a colourful new update that brings a stylish flair to the Netherrealm and puts creativity in the... | Read more »
Back When AOL WAS the Internet – The Tou...
In Episode 606 of The TouchArcade Show we kick things off talking about my plans for this weekend, which has resulted in this week’s show being a bit shorter than normal. We also go over some more updates on our Patreon situation, which has been... | Read more »
Creative Assembly's latest mobile p...
The Total War series has been slowly trickling onto mobile, which is a fantastic thing because most, if not all, of them are incredibly great fun. Creative Assembly's latest to get the Feral Interactive treatment into portable form is Total War:... | Read more »

Price Scanner via MacPrices.net

Early Black Friday Deal: Apple’s newly upgrad...
Amazon has Apple 13″ MacBook Airs with M2 CPUs and 16GB of RAM on early Black Friday sale for $200 off MSRP, only $799. Their prices are the lowest currently available for these newly upgraded 13″ M2... Read more
13-inch 8GB M2 MacBook Airs for $749, $250 of...
Best Buy has Apple 13″ MacBook Airs with M2 CPUs and 8GB of RAM in stock and on sale on their online store for $250 off MSRP. Prices start at $749. Their prices are the lowest currently available for... Read more
Amazon is offering an early Black Friday $100...
Amazon is offering early Black Friday discounts on Apple’s new 2024 WiFi iPad minis ranging up to $100 off MSRP, each with free shipping. These are the lowest prices available for new minis anywhere... Read more
Price Drop! Clearance 14-inch M3 MacBook Pros...
Best Buy is offering a $500 discount on clearance 14″ M3 MacBook Pros on their online store this week with prices available starting at only $1099. Prices valid for online orders only, in-store... Read more
Apple AirPods Pro with USB-C on early Black F...
A couple of Apple retailers are offering $70 (28%) discounts on Apple’s AirPods Pro with USB-C (and hearing aid capabilities) this weekend. These are early AirPods Black Friday discounts if you’re... Read more
Price drop! 13-inch M3 MacBook Airs now avail...
With yesterday’s across-the-board MacBook Air upgrade to 16GB of RAM standard, Apple has dropped prices on clearance 13″ 8GB M3 MacBook Airs, Certified Refurbished, to a new low starting at only $829... Read more
Price drop! Apple 15-inch M3 MacBook Airs now...
With yesterday’s release of 15-inch M3 MacBook Airs with 16GB of RAM standard, Apple has dropped prices on clearance Certified Refurbished 15″ 8GB M3 MacBook Airs to a new low starting at only $999.... Read more
Apple has clearance 15-inch M2 MacBook Airs a...
Apple has clearance, Certified Refurbished, 15″ M2 MacBook Airs now available starting at $929 and ranging up to $410 off original MSRP. These are the cheapest 15″ MacBook Airs for sale today at... Read more
Apple drops prices on 13-inch M2 MacBook Airs...
Apple has dropped prices on 13″ M2 MacBook Airs to a new low of only $749 in their Certified Refurbished store. These are the cheapest M2-powered MacBooks for sale at Apple. Apple’s one-year warranty... Read more
Clearance 13-inch M1 MacBook Airs available a...
Apple has clearance 13″ M1 MacBook Airs, Certified Refurbished, now available for $679 for 8-Core CPU/7-Core GPU/256GB models. Apple’s one-year warranty is included, shipping is free, and each... Read more

Jobs Board

Seasonal Cashier - *Apple* Blossom Mall - J...
Seasonal Cashier - Apple Blossom Mall Location:Winchester, VA, United States (https://jobs.jcp.com/jobs/location/191170/winchester-va-united-states) - Apple Read more
Seasonal Fine Jewelry Commission Associate -...
…Fine Jewelry Commission Associate - Apple Blossom Mall Location:Winchester, VA, United States (https://jobs.jcp.com/jobs/location/191170/winchester-va-united-states) Read more
Seasonal Operations Associate - *Apple* Blo...
Seasonal Operations Associate - Apple Blossom Mall Location:Winchester, VA, United States (https://jobs.jcp.com/jobs/location/191170/winchester-va-united-states) - Read more
Hair Stylist - *Apple* Blossom Mall - JCPen...
Hair Stylist - Apple Blossom Mall Location:Winchester, VA, United States (https://jobs.jcp.com/jobs/location/191170/winchester-va-united-states) - Apple Blossom Read more
Cashier - *Apple* Blossom Mall - JCPenney (...
Cashier - Apple Blossom Mall Location:Winchester, VA, United States (https://jobs.jcp.com/jobs/location/191170/winchester-va-united-states) - Apple Blossom Mall Read more
All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.