June 96 - Balance of Power
Balance of Power:
Sleuthing Through Your Code
DAVE EVANS
The night was well advanced, but the bright glow of fluorescent lamps
misrepresented time. As I sat back in my comfortable chair, rubbing tired eyes,
I wondered what the venerable but fictional Mr. Sherlock Holmes would offer me
as advice. Perhaps because I was so weary from the long hours of debugging, I
easily imagined Mr. Holmes sitting near me in a tweed suit smoking his pipe.
Certainly he would address me as he once addressed his compatriot Dr. Watson,
with a slightly condescending tone, and he would tell me that in my debugging I
was missing the key iota of information.
At that moment, a solitary number seemed brighter on my monitor. Perhaps I have
an overactive imagination, but it seemed as if MacsBug were magically
illuminating that crucial, overlooked information. My computer was at interrupt
level 2, yet it was waiting for a driver request to complete. How could I have
missed the interrupt level earlier? It was no wonder that the computer froze.
My software had most likely called the driver synchronously at exactly the
wrong time. The voice of Mr. Holmes rang again in my ears. This time he quoted
from that unfortunate story "A Case of Identity" when he said, "It has long
been an axiom of mine that the little things are infinitely the most
important."
Sir Arthur's famous detective was unsurpassed as an observer of detail. He
believed that keen attention to all things -- even the mundane -- was the key
to good detective work. In debugging software, I've found this advice is also
true. Although many software bugs can be solved quite easily, the most
challenging problems demand more attention. This is especially true of crashes
or freezes in your software. To find the detail we need for those, we often
have to go below source-level tools and get comfortable with lower-level
aids.
In this column I'll take you through some low-level debugging techniques. I'll
start with basic strategy and then discuss particular methods and examples.
Although many details will be PowerPC-specific, much of the information here is
useful on all Macintosh computers.
The experienced engineer starts with a basic strategy when faced with a
troublesome software crash or freeze. The strategy is similar to Mr. Holmes's
approach to solving difficult crimes. Using the scientific method, he starts by
collecting key information and details. When he has finished researching, he
begins to analyze the information and eliminates hypothesis after hypothesis.
Once close to a solution, he seeks out more detail to narrow his suspects to a
single culprit. Similarly, your strategy for debugging software should start
with careful observation and research. Then you should hypothesize, test your
theories, and collect more detail. This narrowing approach will draw you closer
to the pernicious coding error in your software.
It's tempting when faced with a difficult crash to experiment instead of
researching it first. But beware! Don't just reimplement your code with new
approaches until it stops crashing. Though some may cynically suggest that
that's the Macintosh way to program, don't be lulled into this strategy. I've
found that it usually produces unstable code and ultimately takes longer than
researching the original problem.
In researching a crash or a freeze, the private bug detective should first ask
these few basic questions:
- What kind of crash or freeze is this?
- What code did the computer stop in?
- How did I get to that code?
For these, you'll need a low-level debugger (such as MacsBug). Let's look at
each one in turn.
The first step is to determine the kind of problem you've got. For crashes
there are a number of possible problems, including the all-too-familiar illegal
instruction and bus errors. Note that PowerPC exception handlers don't
currently distinguish between these or other types. In MacsBug the correct type
will be reported, but your debugger may instead describe all crashes as general
spurious interrupts or type 11 errors.
If your crash is from an illegal instruction error, it's possible that the
processor jumped to an invalid address or the intended code moved in memory. In
this case you'll notice (in a disassembly where execution stopped) that most
instructions are invalid or nonsense. This can also occur if the 680x0
emulator tries to interpret PowerPC code as 680x0 instructions, or if the
processor tries to execute 680x0 code as PowerPC code. Try disassembling memory
as both PowerPC code (using ipp pc) and 680x0 code (using ip pc).
If your crash is from a bus error, the most likely cause is an invalid address
in some register. Disassemble memory where execution stopped and examine the
instructions. If there are instructions that dereference registers, inspect
those registers for addresses that aren't in a valid range. If you're debugging
680x0 code on a Power Macintosh, you'll need to look at all the instructions
near the crash, because the 680x0 emulator won't tell you exactly which
instruction caused the error.
Researching a freeze requires a different approach. If the freeze prevents you
from using any debugging tools, you must isolate the offending code by watching
the computer execute up to the freeze. Setting breakpoints, tracing, and
stopping execution at known locations will bring you closer. This approach is
slow but will lead you to the code that caused the error or to the state that
prompted it. If the computer is frozen but you can still use debugging tools,
it's very possible that you're in an infinite loop.
Sherlock Holmes sometimes astonished readers by deducing crimes just from
hearing second-hand details. He was also known, however, to walk the back
alleys of London and gumshoe the scene of a crime when necessary. Learning the
layout of the crime scene was crucial for a number of his deductions. When
staring at your newly crashed software, do you recognize the code that your
debugger is displaying? Disassemble memory near the location of the crash and
snoop around for clues. Check for the following to determine how your computer
came to this final resting place:
- If you're using MacsBug, use the wh pc command to check where the code is.
- Display memory and disassemble from the beginning of the code's block
of memory.
- Does the code nearby reference strings or Gestalt selectors?
- Look for text symbols and strings in the code.
If you've crashed in PowerPC code, most low-level debuggers will give great
information about where you are. This is because most PowerPC code is
registered and linked using the Code Fragment Manager, which these debuggers
can access for hints. For example, if you use the wh pc command in MacsBug
after crashing in PowerPC code, you'll see something like this:
Address 000BAE18 is in the System heap
at 00002800 at NQDColor2Index+00018
The address is in a CFM fragment "NQD"
It is 0001AD28 bytes into this heap block:
Start Length Tag Mstr Ptr Lock
* 000A00F0 0003DB00+04 R 00002AC4 L
Here we see that the computer crashed at a location 24 bytes from the
beginning of the NQDColor2Index routine. This routine is in the NQD (or Native
QuickDraw) code fragment. Since this address is close to the beginning of the
routine, we can disassemble from its start and examine the six instructions
that executed before the crash for more clues:
Disassembling PowerPC code from bae00
NQDColor2Index
+00000 000BAE00 li r5,0x0000
+00004 000BAE04 lwz r4,TheGDevice(r0)
+00008 000BAE08 sth r5,QDErr(r0)
+0000C 000BAE0C stw r31,-0x0004(SP)
+00010 000BAE10 lwz r5,0x0000(r4)
+00014 000BAE14 addi r31,r3,0x0000
+00018 000BAE18 *lwz r3,0x000C(r5)
A bus error at NQDColor2Index+00018 would occur if register R5 contained an
invalid address. Look at the register display to validate that hypothesis.
Notice in the code that R5 is a dereference of R4, which comes from the
low-memory global TheGDevice. Here we crashed because TheGDevice had become
invalid, so now your investigation turns toward that global.
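The arithmetic behind that reading is easy to check by hand. Here's a minimal sketch in Python, using the addresses from the disassembly and the heap block display above. It isn't a MacsBug feature, and the function names are hypothetical.

```python
PPC_INSTRUCTION_SIZE = 4  # every PowerPC instruction is 4 bytes long

def offset_in_routine(crash_addr, routine_start):
    """Byte offset of the crash address from the start of its routine."""
    return crash_addr - routine_start

def instructions_before(crash_addr, routine_start):
    """Number of whole instructions between the routine start and the crash."""
    return offset_in_routine(crash_addr, routine_start) // PPC_INSTRUCTION_SIZE

crash = 0x000BAE18       # address of the starred lwz in the disassembly
routine = 0x000BAE00     # start of NQDColor2Index
heap_block = 0x000A00F0  # Start column of the heap block display

print(hex(offset_in_routine(crash, routine)))  # offset within the routine
print(instructions_before(crash, routine))     # instructions already executed
print(hex(crash - heap_block))                 # offset into the heap block
```

If the offsets you compute don't agree with what the debugger reports, that mismatch is itself a clue worth chasing.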
A freeze will typically occur because of a double page fault or exception or
because of an infinite loop. Synchronous driver calls will also freeze if
called when the interrupt level is above 0. A double fault or exception is
common only if you're writing driver software. Your computer can handle only
one page fault or exception at a time. A double fault or exception occurs when
software that services a fault subsequently causes a second fault. For example,
disk drivers are sometimes called by the Virtual Memory Manager to help service
page faults; therefore, if you develop a disk driver you must take care not to
cause page faults since you may be asked to service one as well.
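The interrupt-level rule can be made concrete. On 680x0 processors the interrupt priority mask occupies bits 8 through 10 of the status register, so a small helper can extract it from an SR value read off your debugger's register display. This is only an illustrative Python sketch; the helper names are hypothetical.

```python
def interrupt_level(status_register):
    """Extract the 3-bit interrupt priority mask (bits 8-10) from a 680x0 SR value."""
    return (status_register >> 8) & 0x7

def sync_driver_call_is_safe(status_register):
    """Per the rule above, a synchronous driver call is safe only at level 0."""
    return interrupt_level(status_register) == 0

# Two sample SR values: supervisor mode with mask 0, and with mask 2.
for sr in (0x2000, 0x2200):
    print(hex(sr), interrupt_level(sr), sync_driver_call_is_safe(sr))
```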
A good way to detect infinite loops is to trace for a few instructions using
your debugger. If you notice the same set of instructions being repetitively
executed, you could be in an infinite loop. Look at branch instructions for
clues to why the loop isn't completing. A special case of these loops is the
vSyncWait routine. It looks like this:
MOVE.W $0010(A0),D0
BGT.S *-6
This tight loop is waiting for the two-byte value located 16 bytes from
register A0 to become 0 or negative. This is a standard sequence to wait for a
driver request to complete. The driver request is described in an IOParam
record pointed to by register A0. When the driver is done servicing the
request, it will interrupt the loop and modify the ioResult field 16 bytes into
that record. It will then return from the interrupt, and the loop will complete
normally. A freeze in this loop means the driver isn't servicing the request.
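Spotting a repeated instruction sequence in a trace is the kind of check you can also do mechanically. The following Python sketch assumes you've copied the program counter values from a trace into a list; the function name and the sample addresses are hypothetical, and no debugger I know of exposes its trace this way.

```python
def repeating_cycle(trace, min_repeats=3):
    """Return the shortest sequence of PCs that repeats at the end of the trace
    at least min_repeats times in a row, or None if there isn't one."""
    n = len(trace)
    for period in range(1, n // min_repeats + 1):
        tail = trace[n - period * min_repeats:]
        cycle = tail[:period]
        if all(tail[i] == cycle[i % period] for i in range(len(tail))):
            return cycle
    return None

# Two instructions of setup followed by a two-instruction wait loop.
trace = [0x1000, 0x1004, 0x2000, 0x2004, 0x2000, 0x2004, 0x2000, 0x2004]
cycle = repeating_cycle(trace)
print([hex(pc) for pc in cycle] if cycle else "no loop seen")
```

A repeating tail doesn't prove the loop is infinite, only that it's worth looking at the branch condition that keeps it going.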
If you typed dm a0 iopb in MacsBug, you might see something like this:
Displaying IOParamBlockRec at 000003A4
000003A4 qLink NIL
000003A8 qType 0002
000003AA ioTrap A003
000003AC ioCmdAddr NIL
000003B0 ioCompletion NIL
000003B4 ioResult 0001
000003B6 ioNamePtr NIL
000003BA ioVRefNum 0008
000003BC ioRefNum FFDF
000003BE ioVersNum #0
000003BF ioPermssn #23
000003C0 ioMisc NIL
000003C4 ioBuffer 01C7E2B0
000003C8 ioReqCount 00010000
000003CC ioActCount 00010000
000003D0 ioPosMode 0001
000003D2 ioPosOffset 1B84AA00
Take note of the ioTrap and ioRefNum fields. In this case, ioTrap is $A003,
which is the synchronous Read trap. Using the drvr dcmd in MacsBug, you'll find
that the driver with refNum $FFDF is .ASYC00, which is the SCSI driver. This
hang, then, occurs during a synchronous Read call to the SCSI driver. Perhaps I
should next check the current interrupt level.
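To see how those fields fall out of raw memory, here's a hedged Python sketch that decodes the first 26 bytes of the parameter block, transcribed from the display above. The field offsets follow the addresses in that display; the function name is hypothetical. It also relies on the Device Manager convention that a driver reference number is the one's complement of the driver's unit number.

```python
import struct

# First 26 bytes of the parameter block, transcribed from the dm display
# (680x0 data is big-endian).
PARAM_BLOCK = bytes.fromhex(
    "00000000"  # qLink
    "0002"      # qType
    "A003"      # ioTrap
    "00000000"  # ioCmdAddr
    "00000000"  # ioCompletion
    "0001"      # ioResult
    "00000000"  # ioNamePtr
    "0008"      # ioVRefNum
    "FFDF"      # ioRefNum
)

def decode_param_block(block):
    """Pull out the fields that matter when a synchronous call hangs."""
    (io_trap,) = struct.unpack_from(">H", block, 6)      # unsigned trap word
    (io_result,) = struct.unpack_from(">h", block, 16)   # signed result code
    (io_ref_num,) = struct.unpack_from(">h", block, 24)  # signed refNum
    return {
        "ioTrap": io_trap,
        "ioResult": io_result,
        "still_pending": io_result > 0,  # the wait loop spins while this is true
        "ioRefNum": io_ref_num,
        "unit_number": ~io_ref_num,      # refNum is the complement of the unit number
    }

print(decode_param_block(PARAM_BLOCK))
```

The positive ioResult is what keeps the wait loop spinning, and the unit number tells you which driver table entry to look up.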
After a long, ponderous silence, while sharply focused on the current enigma,
Holmes might startle you by saying, "Let us reconstruct, Watson." Then he would
describe the probable series of events that preceded that particular criminal
act. If the reconstruction wasn't adequate to identify a perpetrator, at least
it would review the crucial discoveries so far. It would show Holmes's
appreciable progress toward a solution. Similarly, while in the midst of a
difficult debugging task, you should reconstruct the turn of events to gain
extremely helpful information.
Figuring out what happened, once the computer is stopped cold in a crash or a
freeze, isn't easy. In effect, you're looking for footsteps in the sand that
are often obscured or covered with other false marks. For this task, the
technique we most often use is the stack crawl.
Procedural programming on the Macintosh uses a stack. For each procedure call,
the stack is added to, and vital clues such as return addresses and stack frame
pointers are left for us to find. In PowerPC code, the link register adds to
our clues and often points back to the penultimate procedure of
interest. Your low-level debugger will certainly have a stack crawl tool to use
as well.
In MacsBug, the sc and sc7 commands are your basic stack-crawling aids. Start
your search with the sc command, which looks for stack frames. Frames are
structures found on the stack containing both the return address and a pointer
to the previous frame. In PowerPC code the frames also contain a standard area
to preserve basic registers. Fortunately, frames are required in PowerPC code
and follow a standard format. Most 680x0 compilers will generate stack frames
as well, although much of the 680x0 system software was written in assembly
language without frames. If during your crash you have a valid stack frame
address in register A6 or R1, the sc command will show you a history of which
code execution preceded your software's demise. Listing 1 shows a basic sc
command's result.
Listing 1. Display from the sc command
Calling chain using A6/R1 links
Back chain ISA Caller
01C8A0AC 68K 01C139CA 'CODE 0001 0F6E Main'+03A1A
01C8A0A0 68K 01C132EA 'CODE 0001 0F6E Main'+0333A
01C89F4A 68K 00058748 'scod BFB1 011C'+01A38
01C89E6A 68K 00064090 'scod BFB1 011C'+0D380
01C89E40 68K 408787FC CHECKUPDATESEARCH+0003E
01C89E16 68K 40878426 __GETSUBWINDOWS+000D6
In this example the first two links are in a CODE resource from file number $0F6E.
Use the MacsBug file command to determine which file they were loaded from.
It's likely that they're from the current application, and the return addresses
displayed in the Caller column (01C139CA and 01C132EA) are most likely in the
application's binary. The return addresses listed are crucial to your
sleuthing. They not only point out where execution would have returned to but,
more important, they show which instructions were recently executed: the ones
just before the return address. Those addresses are your footprints in the
sand. They are clues in your reconstruction, and they hint at the turn of
events that led to the crash or freeze.
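A frame-based crawl is really just a linked-list walk. The Python sketch below models the 680x0 convention, where the long word at the frame pointer holds the previous frame pointer and the long word 4 bytes above it holds the return address. Memory is modeled as a dictionary of address to 32-bit value, and all the addresses are invented for illustration. This is not how MacsBug implements sc.

```python
def walk_frames(memory, frame_ptr, max_frames=32):
    """Follow A6-style links and return a list of (frame, return address) pairs,
    most recent frame first."""
    chain = []
    while len(chain) < max_frames:
        back_chain = memory.get(frame_ptr)
        return_addr = memory.get(frame_ptr + 4)
        if back_chain is None or return_addr is None:
            break  # ran off the memory we know about
        chain.append((frame_ptr, return_addr))
        if back_chain <= frame_ptr:
            break  # the stack grows down, so older frames must be higher
        frame_ptr = back_chain
    return chain

# Three hypothetical frames linked together; the last back chain is 0.
memory = {
    0x1000: 0x1020, 0x1004: 0xA000,
    0x1020: 0x1040, 0x1024: 0xB000,
    0x1040: 0x0000, 0x1044: 0xC000,
}
for frame, caller in walk_frames(memory, 0x1000):
    print(hex(frame), hex(caller))
```

The check that each back chain points higher in memory is a cheap way to stop when a frame pointer has been trashed, which is often exactly the situation you're debugging.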
Note the third and fourth lines in Listing 1, which show return addresses in an
'scod' resource. Those 'scod' resources implement the Process Manager. It's
possible that the application binary, probably at the instruction just before
address 1C132EA, made a call to the Process Manager.
The fifth and sixth lines of the display show return addresses in the Macintosh
ROM. The symbols are shown because I've installed a ROM map file in my MacsBug
Preferences folder. You should use the provided ROM map file for your computer,
because it will often give you better stack crawl information. You can also
deduce that these return addresses are in the ROM from the addresses
themselves. Most Macintosh ROMs begin at memory address $40800000. PCI-based
Macintosh ROMs currently begin at $FFC00000, and PowerPC processor-based
PowerBook ROMs at $40000000. You can determine the beginning address of your
ROM by looking at the ROMBase low-memory global. In MacsBug, for example, type
dl ROMBase to display the beginning ROM address.
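Once you know ROMBase, sorting return addresses into ROM and non-ROM is a simple range test. In this Python sketch the 4 MB ROM size is my assumption for illustration and doesn't come from any particular machine's documentation, so substitute the real size for your computer. The function name is hypothetical.

```python
ROM_SIZE_ASSUMED = 4 * 1024 * 1024  # assumption for illustration only

def is_in_rom(address, rom_base, rom_size=ROM_SIZE_ASSUMED):
    """True if the address lies within rom_size bytes of the ROM base."""
    return rom_base <= address < rom_base + rom_size

rom_base = 0x40800000  # the value you'd read with dl ROMBase on many machines
for caller in (0x01C139CA, 0x00058748, 0x408787FC, 0x40878426):
    print(hex(caller), "ROM" if is_in_rom(caller, rom_base) else "not ROM")
```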
The sc7 command in MacsBug gives you less precise information. In cases when
you don't have stack frames, you can ask your debugger to display all possible
return addresses on the stack. Your debugger will intelligently guess which
values on the stack are possible return addresses, but most of the information
displayed will be extraneous. You must pick through this information for clues
-- an arduous task. The stack frame-based crawl is neat and tidy, whereas the
same situation would produce the sc7 display shown in Listing 2. I've added an
asterisk (*) on each line that's also in the sc command's display.
Listing 2. Display from the sc7 command
Return addresses on the stack
Stack Addr Frame Addr ISA Caller
01C8A0B0 68K 01C16D62 'CODE 0001 0F6E Main'+06DB2
01C8A0A4 01C8A0A0 68K 01C139CA 'CODE 0001 0F6E Main'+03A1A *
01C8A094 68K 40849116 UNLOADSEG+00046
01C8A06A 01C8A066 68K 409CFFFC DISPTABLE+8D0BC
01C8A018 68K 4087EAF0 GETRESOURCE+000B2
01C8A00E 68K 408806F6
01C8A008 PPC 00094BE8 EmToNatEndMoveParams+00014
01C89FF8 68K 0011ACDA
01C89FE0 68K 4087ECFE VRMGRSTDENTRY+000B0
01C89FDC 68K 4087ECFE VRMGRSTDENTRY+000B0
01C89FD8 68K 0011A5B4
01C89F4E 01C89F4A 68K 01C132EA 'CODE 0001 0F6E Main'+0333A *
01C89F4A 68K 01C8A09E
01C89F22 01C89F1E 68K 00058748 'scod BFB1 011C'+01A38 *
01C89F1E 68K 01C89F48
01C89EDE 01C89EDA 68K 00163E30
01C89EDA 68K 01C89F1C
01C89E62 68K 01C8AFBE
01C89E44 01C89E40 68K 00064090 'scod BFB1 011C'+0D380 *
01C89E1A 01C89E16 68K 408787FC CHECKUPDATESEARCH+0003E *
01C89DF4 01C89DF0 68K 40878426 __GETSUBWINDOWS+000D6 *
01C89DE2 68K 4087876E CALCANCESTORRGNS+0002A
01C89DDE 68K 001191E6
In this example, there were a number of values on the stack that might have
been valid return addresses. The six we saw in the sc command's display are
there.
Many of the other lines will not be relevant return addresses, because many
procedures reserve space on the stack but don't always use it or initialize it.
There will often be old return addresses in that unused part of the stack.
These old return addresses are like very faint footprints in the sand -- from
some previous execution -- and they may tell you what occurred much earlier in
time. More often, though, they'll just be distracting and irrelevant to your
search.
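To get a feel for why a frameless crawl is so noisy, consider this Python sketch of one possible guessing heuristic. It reads a long word at every even offset in a chunk of stack memory and keeps the values that land inside a known code range. The range test is my own simplification and should not be taken as MacsBug's actual algorithm; the function name and the sample data are hypothetical.

```python
import struct

def candidate_return_addresses(stack_bytes, start_addr, code_ranges):
    """Scan big-endian stack memory for even long words that fall inside any
    (low, high) code range. Returns (stack address, value) pairs."""
    hits = []
    for offset in range(0, len(stack_bytes) - 3, 2):
        (value,) = struct.unpack_from(">I", stack_bytes, offset)
        if value % 2 == 0 and any(lo <= value < hi for lo, hi in code_ranges):
            hits.append((start_addr + offset, value))
    return hits

# Ten bytes of made-up stack: one plausible return address, then leftovers.
stack = bytes.fromhex("01C139CA" "0000" "12345678")
code_ranges = [(0x01C0FFB0, 0x01C20000)]  # hypothetical bounds of a CODE segment
for where, value in candidate_return_addresses(stack, 0x01C8A0A4, code_ranges):
    print(hex(where), hex(value))
```

Any stale value that happens to fall inside a code range passes this test just as easily as a live return address, which is why the display needs so much picking through.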
Be very wary of an sc7 command when tracing through PowerPC code. PowerPC code
typically has large stack frames, at least 56 bytes for each procedure, and the
code often doesn't use all those bytes. This will cause many old return
addresses to stay in the unused parts of the stack frame, and those old
addresses will appear in your sc7 command's display.
Sometimes you'll notice that the sc and sc7 commands fail to work. In MacsBug,
you may see the error
Bad stack: stack pointer must be even and
<= stack base
There's more than one stack that the system uses, but the stack base that
MacsBug refers to in this error is the application stack's base or top address.
The sc and sc7 commands first check to see if the A6, A7, and R1 registers
point to locations below the application stack's base. If they don't, MacsBug
returns this error. The executing code may be using a different stack, however.
Many parts of the Mac OS system software use separate stacks. To force MacsBug
to execute a stack crawl anyway, specify the register to use and the amount of
memory to search through. For example, the MacsBug commands sc7 a7 4000 and
sc a6 4000 will execute a stack crawl even if the A6 and A7 registers point
above the application stack's base.
System stacks vary in size from about 8000 bytes up to 48000 bytes. There's no
easy way to determine the base of a system stack that's in use. If you don't
get interesting clues from 16384 bytes ($4000 in hex), vary the number of bytes
you specify and compare your results.
Don't be pacified by source-level debuggers. Lower-level tools give you a much
better understanding of the Mac OS and your code. These tools also give you the
ability to research the most complicated problems. Strive to be a software
sleuth, and you'll gain some truly useful expertise.
DAVE EVANS still works at Apple in the Mac OS System Software group. He always
enjoyed Sherlock Holmes stories while he was growing up, and he was excited to
learn that most of the stories are no longer protected under copyright and are
easily accessible on the Internet (see the 221B Baker Street Web page at
http://www.contrib.andrew.cmu.edu/u/mset/holmes.html).
Thanks to Geoff Chatterton, Doug Clarke, Michael Dautermann, and Tim Maroney
for reviewing this column.