QuickTrap
Volume Number: 4
Issue Number: 3
Column Tag: The Mac Hacker
QuickTrap Routines Bypass Trap Dispatcher
By Mike Morton, University of Hawaii
Bypassing the ROM trap dispatcher
In an article a while back, I covered the basics of bypassing the Macintosh trap dispatcher to call ROM routines directly, to speed up calls to the Toolbox and OS. In this article, I'll present a set of subroutines which implement this technique in a practical way.
The package is written in MPW assembler, and should be easily callable from any of the MPW languages. It's short and should be portable to other development systems. It also includes a fail-soft feature, in case it turns out not to work on some future Macintosh.
A quick review
Programs call the Macintosh Toolbox and Operating System routines by executing illegal instructions, which are handed to the trap dispatching code in the ROM. In addition to the time it takes for the 680x0 processor to recover from the emotional trauma of this illegal instruction, the dispatcher must fetch the offending instruction, decode it, and call the routine it specifies. This is very general, since it hides ROM locations from the application, but it's also slow.
With the GetTrapAddress routine, you can calculate the address of a ROM routine just once each time your application runs. Calling that address directly can save you a lot of time, with very little cost in generality.
What does the dispatcher do?
Here's the code for the dispatcher in my MacPlus ROM. Your Mac may have something a little different, but all existing Macs seem to be similar in principle. The dispatcher, at address $401F52 in my ROM, disassembles to:
disp:
        SUBQ.L  #2, SP              ; add 2 bytes above CCR
        MOVEM.L D1-D2/A2, -(SP)     ; save 12 bytes of regs
        MOVE.L  12+4(SP), A2        ; get PC of trap word
        MOVE.W  (A2)+, D2           ; get A-trap word
        MOVE.L  A2, 12+4(SP)        ; restore updated PC
        MOVE.W  D2, D1              ; copy trap word to D1
        ANDI.W  #$01FF, D2          ; get just trap number
        CMPI.W  #$A800, D1          ; Toolbox or OS?
        BLO.S   doOS                ; jump if OS
        LEA     $0C00, A2           ; point->Toolbox dispatch
        LSL.W   #2, D2              ; scale number->longwords
        MOVE.L  (A2,D2.W), 12(SP)   ; copy address to stack
        CMPI.W  #$AC00, D1          ; auto-pop bit set?
        MOVEM.L (SP)+, D1-D2/A2     ; restore regs; leave CCR
        BLO.S   tBox                ; skip if auto-pop off
        MOVE.L  (SP)+, (SP)         ; RTS to caller, not glue
tBox:   RTS                         ; call Toolbox routine
doOS:
        LEA     $0400, A2           ; point to OS dispatch
        BCLR    #8, D2              ; clear & test keep-A0 bit
        BNE.S   OSa0                ; skip to allow A0 returned
        LSL.W   #2, D2              ; scale number->longwords
        MOVE.L  (A2,D2.W), A2       ; fetch OS routine address
        MOVEM.L A0-A1, -(SP)        ; save regs (incl A0)
        JSR     (A2)                ; call OS routine
        MOVEM.L (SP)+, A0-A1        ; and restore OS regs
OSrt:
        MOVEM.L (SP)+, D1-D2/A2     ; restore OUR regs
        ADDQ.W  #4, SP              ; ignore stacked CCR
        TST.W   D0                  ; preset CCR on result
        RTS                         ; and return
OSa0:
        LSL.W   #2, D2              ; scale number->longwords
        MOVE.L  (A2,D2.W), A2       ; fetch OS routine address
        MOVE.L  A1, -(SP)           ; preserve A1, *not* A0
        JSR     (A2)                ; call OS routine
        MOVE.L  (SP)+, A1           ; and restore A1
        BRA.S   OSrt                ; clean up with common code
[An aside: This is the first piece of ROM code I ever read, and I still think it's a great example of tight 68000 coding. It's tighter on the Mac II, with indirect addressing available. I can't see any way to make it faster; can anyone spot a way to save a few bytes, though?]
Besides figuring out which routine to call (using the Toolbox dispatch table at $0C00 or OS table at $0400), the dispatcher also does some other important things. For Toolbox traps, it discards the return address if the auto-pop bit is set -- this is useful for glue. And for OS traps, it preserves D1, D2, A1 and A2, and sometimes A0. For OS traps, it also passes the trap word to the routine in D1.
Our task is to make a trap dispatcher which does all this, but is much faster. Note, for instance, that the new code must still pass the trap word in D1.w -- I believe this is how some routines test for flag bits set in the word. (For instance, CmpString has a bit to specify if the comparison is case-sensitive.)
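For readers who would rather see the decoding steps in a high-level notation, here is a small Python model of the tests the listing above makes on a trap word. It is only a sketch of that one listing, not of any other ROM; the function name and the dictionary keys are made up for illustration, and the sample trap words in the usage note are assumptions you should check against your own trap equates.

```python
def decode_trap_word(word):
    """Classify a 16-bit A-line trap word using the same tests as the listing."""
    if word & 0xF000 != 0xA000:
        raise ValueError("not an A-line trap word: %#06x" % word)
    if word >= 0xA800:                       # CMPI.W #$A800, D1 / BLO.S doOS
        return {
            "kind": "toolbox",
            "table": 0x0C00,                 # Toolbox dispatch table
            "offset": (word & 0x01FF) * 4,   # ANDI.W #$01FF, then LSL.W #2
            "auto_pop": word >= 0xAC00,      # CMPI.W #$AC00, D1
        }
    return {
        "kind": "os",
        "table": 0x0400,                     # OS dispatch table
        "offset": (word & 0x00FF) * 4,       # BCLR #8 drops the flag bit first
        "a0_returned": bool(word & 0x0100),  # bit 8 set: A0 is NOT preserved
    }
```

For example, decode_trap_word(0xA893) reports a Toolbox trap at offset $24C with auto-pop off, and the same word with the auto-pop bit set ($AC93) gives the same offset with auto_pop true.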
Hey, wait a minute! Isn't it a bad idea to know how one ROM routine (the dispatcher) communicates with all the others? Isn't code which depends on this interface likely to fall apart when the Mac III hits the streets? Well, first of all, it'd be awfully hard for Apple to change hundreds of routines. But more importantly, there's a way to back out gracefully. Trust me; we'll get to it.
An application's view of the QuickTrap routines
The fundamental speedup is to get rid of the dispatcher, and have one quick trap routine for every real routine you'd like fast access to. For instance, if your program does a lot of SetPort calls, you can easily create qtSetPort, which has exactly the same interface and does the same thing, only faster. As you might guess, each qtxxx routine caches the address for its routine.
Once, at the beginning of your application, you must call qtEval, which evaluates each address and stores it. If you don't call it, everything will still work -- this is related to the fail-soft scheme.
Other than this, everything works the same as old-style trap routines.
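The behavior described above can be modelled in a few lines of Python. This is a toy, not the assembler package: TrapTable, QuickTrap and qt_eval are hypothetical names, and the dispatches counter merely stands in for the cost of going through the real dispatcher.

```python
class TrapTable:
    """Stand-in for the ROM dispatch tables: trap word -> routine."""
    def __init__(self):
        self.routines = {}
        self.dispatches = 0          # how many times the slow path ran

    def set_trap_address(self, word, routine):
        self.routines[word] = routine

    def get_trap_address(self, word):
        return self.routines[word]

    def dispatch(self, word, *args):
        self.dispatches += 1         # models the cost of the A-line trap
        return self.routines[word](*args)


class QuickTrap:
    """Model of one qtxxx routine: trap until evaluated, then call direct."""
    def __init__(self, table, word):
        self.table, self.word, self.cached = table, word, None

    def __call__(self, *args):
        if self.cached is None:      # never evaluated, or caching disabled
            return self.table.dispatch(self.word, *args)
        return self.cached(*args)


def qt_eval(quick_traps, enabled=True):
    """Refresh every cached address; do nothing if caching is disabled."""
    if not enabled:
        return
    for qt in quick_traps:
        qt.cached = qt.table.get_trap_address(qt.word)
```

A QuickTrap gives the same result before and after qt_eval; the only difference is that the table's dispatches counter stops climbing once the address is cached. That is the fail-soft property in miniature.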
Caching problems
Imagine that you spend a lot of time doing FrameOval calls to draw circles on the screen, and would like to speed this up. (Actually, I'm sure the trap time is insignificant compared to the drawing time; this is just an example.) You install qtFrameOval and call it instead, and everything works great.
Now your friend gives you this neat, public-domain desk accessory which causes all ovals to be drawn on your screen with smile-faces in them. [Any takers to write this, by the way? You could call it "The Smiling Moose".] It does this by altering the FrameOval trap to call it. But since your application never executes that trap, its ovals are drawn unmolested. How can you make sure your ovals are happy?
The answer is to call qtEval at the right times -- not just at initialization but whenever you suspect someone has installed a replacement trap routine. Since the qtxxx routines are supposed to cache the real addresses, they must track new addresses when they're installed, or the cache becomes stale.
One way to do this is to call qtEval every time you regain control from a desk accessory, each time you regain control from Switcher or MultiFinder, and each time you invoke an FKEY. Perhaps you'd also have to call it for every SystemTask call. And of course you must call it if your application does any SetTrapAddress calls for the relevant traps. In short, whenever anyone could have changed trap addresses, refresh the cache.
A simpler approach is to change the SetTrapAddress trap by installing a prefix routine which sets a flag in your globals that re-evaluation is needed. If DAs, FKEYs, etc., play by the rules and use SetTrapAddress calls, nobody can make the trap tables get out of sync with your cached addresses.
It's tempting to call qtEval in your idle-loop as a heavy-handed way to make sure it's done often enough. I suspect this is a bad idea -- it can cause seemingly random bugs.
One other way: if you use, for instance, qtFrameOval only in some code which doesn't relinquish control, call qtEval once before each time you enter that code. Remember that qtEval isn't all that speedy -- it must call GetTrapAddress for every qtxxx routine.
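The SetTrapAddress prefix idea from this section can be sketched as a Python model. All the names here are invented for illustration, and the stale attribute plays the part of the flag in your globals; the real thing would be a patch on the SetTrapAddress trap itself.

```python
class PatchableTraps:
    """Toy trap table whose SetTrapAddress has a prefix routine installed."""
    def __init__(self):
        self.routines = {}
        self.stale = False               # the "re-evaluation needed" flag

    def set_trap_address(self, word, routine):
        self.stale = True                # prefix routine: note the change...
        self.routines[word] = routine    # ...then do the real work


class CachedCall:
    """Model of a qtxxx routine which always calls its cached target."""
    def __init__(self, traps, word):
        self.traps, self.word, self.target = traps, word, None

    def refresh(self):
        self.target = self.traps.routines[self.word]

    def __call__(self, *args):
        return self.target(*args)


def refresh_if_stale(traps, cached_calls):
    """Call before entering hot code: re-evaluate only when something changed."""
    if traps.stale:
        for call in cached_calls:
            call.refresh()
        traps.stale = False
```

In this model, a desk accessory which replaces a routine through set_trap_address raises the flag, and the next refresh_if_stale picks up the new address. A patch which pokes the table directly would go unnoticed, which is why the scheme depends on everyone playing by the rules.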
Reasons not to use these routines
Because the routines are JSR'd to, they take up four bytes instead of two. This is no big problem for most applications, but don't change all your calls.
When you're debugging, commands to break on traps don't work, since your application is not executing trap instructions. You can force these traps to occur by disabling the caching; see below for details.
The routines use impure code. You must make sure you put them in a segment which is locked in memory.
Which traps should you replace?
Remember that many traps take so much time that the dispatch isn't worth improving. Others do next to nothing, and speed up a lot. In early use of these routines at Lotus, we estimated about thirty routines were worth replacing. In the OS world, things like BlockMove and UprString were included. Routines which just twiddled handles are also important, like HLock, HUnlock, HPurge, HNoPurge, and GetHandleSize. Among the Toolbox routines, things like MoveTo and SetPort seemed to help.
Even if a routine is slow, it may be worth tweaking if it's called a lot. We got measurable improvements substituting for CharWidth, DrawString, StringWidth, and SystemTask.
You can also replace package calls, which is kind of a pain. If you want to change all the FP68K traps to qtFP68K, you have to change Apple's include files, since each of the SANE macros invokes the trap. Another solution is to just redefine FP68K to be a macro to JSR to the qtxxx routine. But then you have to define a trap like myFP68K which still expands to the A-line trap -- this is because the qtxxx routine must have a copy of the trap word.
How much does it help?
As the TV diet ads say, results vary directly with how closely you stick to the plan. Average performance in a large Macintosh product at Lotus was improved by about 5%. A couple of heavily CPU-bound loops were improved by 15%. These aren't huge gains, but considering that they took only a day or so of work to install in a very large program, they're pretty good.
When does the warranty run out?
OK, it's time to face the music. If these routines dive directly into the ROM, they may someday dive into ROM routines in a new machine which expect different parameters. (For more on this topic, see Macintosh Technical Note #110.) Or even if the ROM doesn't change, some caching problem may come up if your application's users use some odd way of altering trap addresses and making your cache stale.
The initialization routine qtEval can be easily disabled by modifying resources. For instance, when a user calls to complain that some FKEY or DA doesn't work with your application, you can quickly change a copy of the application to disable address caching and test if that's the problem. If it is the problem, you can either distribute the altered application or tell power users how to edit the resources to alter the copy they already have.
The resource used to control caching is QTRP 257. The format is simple: if the resource is present and the first word is zero, caching is enabled. To turn off caching, just remove this resource under ResEdit (or renumber it, to easily restore caching). Remember that programmers may want to disable caching for certain types of debugging when they want to see traps under a debugger.
In a future format, a non-zero first word could signal that the resource contains a list of specific traps to be enabled/disabled.
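The rule for the QTRP resource is small enough to state as a function. The sketch below is a Python model of that rule only; caching_enabled is a made-up name, None stands for "resource not found", and treating a resource shorter than one word as "disabled" is my assumption, since the format above doesn't cover that case.

```python
import struct


def caching_enabled(resource_data):
    """Return True only if the resource exists and its first word is zero."""
    if resource_data is None:            # no such resource: caching is off
        return False
    if len(resource_data) < 2:           # assumption: too short means off
        return False
    (first_word,) = struct.unpack(">H", resource_data[:2])  # big-endian word
    return first_word == 0               # non-zero is reserved for the future
```

So caching_enabled(b"\x00\x00") is true, while a missing resource or one starting with any non-zero word leaves the qtxxx routines executing ordinary traps.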
In short, it's easy to experimentally turn off this hackery to check if it's causing problems, and easy to turn it off permanently if it is. In tests at Lotus, an application with caching disabled ran less than 1% slower than one which executed traps in the first place. This is the cost of calling a qtxxx routine, which in turn must do the xxx trap anyway because caching is off.
Notes on the code
The routines are intended to be pretty simple; I'll walk through them and point out a few things.
Code caching: While this stuff works fine on a Mac II, I believe it ought to flush the 68020's code cache after patching itself. Any recommendations from 68020 gurus out there?
Layout: The Toolbox and OS routines are laid out with symbols defining offsets in them. This is so they can be patched; the symbols must stay in sync with the layout.
Toolbox routines: These start out looking a lot like glue routines for a higher-level language -- they use the auto-pop bit, so their return is ignored and the trap returns straight to the application routine which called qtxxx. This trap word, plus four bytes of slop, is patched to be JMP <trap address>. Also, before the entry point there's another copy of the trap word, in case qtEval is called more than once.
OS routines: These are more complicated. In their simple form, they execute the trap and return, because there is no auto-pop bit for OS calls. After being patched, they do just what the dispatcher does: save registers, set up D1 and D2 with the trap number, JSR to the routine, restore registers, test D0.w, and return. Note that the registers saved and restored depend on the trap word -- if bit 8 is clear, then A0 is included in the registers saved; if bit 8 is set, the routine returns something in A0, so A0 is left alone.
Bit-coding: The OS routines hard-wire the trap number passed in D1 and D2. If you want to call, for example, NewHandle with the clear bit set, you must define two routines: qtNewHandle and qtNewHandleClear (or whatever you want to call them). This is necessary because your JSR qtNewHandle can't communicate whether it wants the clear bit set -- that's something normally encoded in the trap word.
Adding routines: The qtTool and qtOS macros do all the work for you. For each one, supply the name of the qtxxx routine you want to define and the _xxx name of the trap it's going to handle.
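To make the Toolbox and OS notes above concrete, here is a Python model of the bytes involved. It is a sketch under stated assumptions: the function names are invented, the opcode constants are the standard 68000 encodings for JMP xxx.L and for MOVEM.L to and from the stack, and the trap words and ROM address in the usage note are arbitrary sample values.

```python
import struct

JMP_ABS_LONG = 0x4EF9        # JMP xxx.L, the listing's jmpInst
AUTO_POP_BIT = 0x0400        # turns $A8xx into $ACxx
MOVEM_L_PUSH = 0x48E7        # MOVEM.L <list>, -(SP)
MOVEM_L_POP = 0x4CDF         # MOVEM.L (SP)+, <list>


def new_toolbox_stub(trap_word):
    """8-byte unevaluated stub: pure trap word, auto-pop trap, 4 spare bytes."""
    return bytearray(struct.pack(">HHI", trap_word, trap_word | AUTO_POP_BIT, 0))


def patch_toolbox_stub(stub, address):
    """Store the address first, then the JMP opcode, in the order qtEval uses."""
    struct.pack_into(">I", stub, 4, address)
    struct.pack_into(">H", stub, 2, JMP_ABS_LONG)


def movem_masks(registers):
    """Return (push_mask, pop_mask) for register names like 'D1' or 'A2'."""
    push = pop = 0
    for name in registers:
        index = int(name[1]) + (8 if name[0] == "A" else 0)  # D0..D7, A0..A7
        pop |= 1 << index            # (SP)+ form: bit 0 is D0
        push |= 1 << (15 - index)    # -(SP) form: bit order is reversed
    return push, pop


def os_register_instructions(trap_word):
    """Pick the save/restore MOVEMs; A0 is included only when bit 8 is clear."""
    regs = ["D1", "D2", "A1", "A2"]
    if not trap_word & 0x0100:
        regs.append("A0")
    push, pop = movem_masks(regs)
    return (MOVEM_L_PUSH << 16) | push, (MOVEM_L_POP << 16) | pop
```

Patching the stub for trap word $A893 with address $00401234 leaves the pure copy of the trap word alone and turns the rest into 4EF9 0040 1234. Writing the address before the opcode means the stub is a valid routine at every moment, either a trap or a complete JMP.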
Using the routines
Pick some segment which won't leave memory and change the SEG directive to specify it.
Assemble the routines and link them into your program as normal. If you're using a higher-level language, declare the qtxxx routines to have exactly the same calling interface as the _xxx trap, except that they're defined externally instead of invoking in-line trap words.
Remember to call qtEval once at startup. And if you want to avoid the cache getting stale, use one of the strategies described above to decide when to call qtEval again.
If you're using a language which uses glue, you may not be able to easily do this. Write to your language developer and pester them to do it for you.
Comments? Improvements? Letter bombs?
I'd be interested to know how these routines work, how easy they are to install in various development environments, and what kind of performance improvements you see from using them. Drop me a line at P.O. Box 11378, Honolulu, HI 96828.
Since this stuff is stretching the ROM in ways it wasn't meant to be stretched, I'd also appreciate hearing about the technique in general. Do you think it's safe? Can you suggest a better way? And if you have improvements, send them in to MacTutor.
; Macintosh Toolbox and OS-trap bypass routines.
; Copyright © 1987 Michael S. Morton
;
; History: 10-May-87 - MM - Initial version.
;          22-Oct-87 - MM - Neatened for publication.
BLANKS ON
STRING ASIS
PRINT OFF
LOAD tlAsmSyms.sym
PRINT ON
; Impure code! Should be in a segment which is locked in memory.
SEG LOCKDSEG ; *** change to a locked segment ***
;------------------------------------------------------------------
; Each Toolbox routine starts out life as:
;2 <pure copy of trap word>
;entry: 2 <trap word with auto-pop bit>
;4 <four bytes unused>
;
; The evaluator changes this to:
;2 <pure copy of trap word>
;entry: 6 JMP <actual address>
;
; Either form is callable with a JSR because the former
; includes the auto-pop bit, so the Toolbox routine returns
; to its caller's caller. Offsets are:
tTrap: EQU 0 ; pure copy of trap
tJump: EQU 2 ; JMP xxx.L instruction
tAddress: EQU 4 ; address to jump to
tLength:EQU 8 ; length of one block
; The qtTool macro generates code for one qt Toolbox routine.
;
; args: routine -Name for routine.
;Typically qt plus trap name.
; trap -Name of the trap, eg, _MoveTo.
MACRO
qtTool
EXPORT &Syslst[1] ; define the qtXXX routine globally
&Syslst[2] ; first, a pure copy of the trap word
&Syslst[1] &Syslst[2] autoPop
; entry: if not overwritten, just trap
ds.b 4 ; reserve room for overwriting
ENDM
;----------------------------------------------------------------
; Each OS trap bypass routine is 28 bytes long.
; The unevaluated routine is:
;2 <pure copy of trap word>
;entry: 2 <trap word>; only this part
;2 RTS ; gets executed before eval
;4 MOVE.W #<xxx>, D1; trap word patched here
;4 MOVE.W #<xxx>, D2; trap number patched here
;6 JSR xxx.L ; routine address patched here
;4 MOVEM.L (SP)+, D1-D2/A1-A2; register list patched here
;2 TST.W D0 ; set condition codes
;2 RTS ; and return
; After the evaluator is done, the routine becomes:
;2 <pure copy of trap word>
;entry: 4 MOVEM.L D1-D2/A1-A2, -(SP) ; (saves A0 too,
; if bit 8 clear)
;4 MOVE.W #<trapword>, D1 ; get trap word in D1
;4 MOVE.W #<trapword & $01FF>, D2 ; and trap number
;6 JSR xxx.L ; call the routine
;4 MOVEM.L (SP)+, D1-D2/A1-A2 ; (gets A0 too, if bit 8 clear)
;2 TST.W D0 ; set condition codes
;2 RTS ; and return
oTrap: EQU 0 ; original trap word
oSave: EQU 2 ; for MOVEM.L xxx, -(SP) to go
oTrapWord:EQU 8 ; for trap word in MOVE.W #xxx, D1
oTrapNum: EQU 12; for trap number in MOVE.W #xxx, D2
oAddress: EQU 16; address to jump to
oRestore: EQU 22; for second copy of MOVEM regs list
oLength:EQU 28 ; length of one block
; The qtOS macro generates code for one qt OS routine.
;
; args: routine -Name for trap routine.
; Typically qt plus trap name.
; trap - Name of the trap, eg, _GetHandleSize.
MACRO
qtOS
EXPORT &Syslst[1] ; define the qtXXX routine globally
&Syslst[2] ; pure copy of trap word
&Syslst[1] &Syslst[2]
; entry: if not overwritten, just trap
RTS ; and return
MOVE.W #$5555, D1 ; get trap word in D1, for OS routine
MOVE.W #$5555, D2 ; and trap number
JSR $55555555 ; leave space for a longword address
MOVEM.L (SP)+, D1-D2/A1-A2
; assume A0 not in the register list
TST.W D0; set condition codes
RTS ; and return
ENDM
;----------------------------------------------------------------
; Resource type and ID for the flag used to disable the trickery.
qtrpType: EQU 'QTRP' ; resource type for flag
qtrpId: EQU 257 ; resource ID for flag
; qtEval
;
; description:
;Update the routines so they jump directly into
; the ROM, or wherever. This routine should be called
; at startup, and each time the application
;thinks anyone has (or might have) called SetTrapAddress.
;
; uses: (no registers)
qtEval: PROC EXPORT
; Stuff used in patching together routines:
jmpInst:EQU $4EF9; opcode word of JMP xxx.L
OSregs: REG D1-D2/A1-A2 ; registers saved by OS dispatcher
OSregs2:REG D1-D2/A0-A2 ; registers saved when bit 8 is zero
MOVEM.L D0-D2/A0-A2, -(SP) ; save caller's registers
; First, see if we've already been told not to do our thing:
LEA qtEnabled, A2; point to the flag
TST.B (A2); have we snuffed it already?
BEQ evalEnd ; yes: nothing to do
; Second, decide if the resource flag allows us to map/cache.
SUBQ #4, SP; make room for function result
MOVE.L #qtrpType, -(SP) ; pass the type
MOVE.W #qtrpId, -(SP) ; and ID
_GetResource ; try to find our flag
MOVE.L (SP)+, A0; pop result
MOVE.L A0, D0 ; and test for NIL
BEQ.S evalTurnOff; no such thing? go flag this and exit
MOVE.L (A0), A1 ; deref. handle; point to rsrc with A1
MOVE.W (A1), D2 ; pick up first word, to check later
MOVE.L A0, -(SP); pass handle
_ReleaseResource; and get rid of it
TST.W D2; now check -- did rsrc start with zero?
BNE.S evalTurnOff ; no: we don't yet do selective disable
; Nothing forbids hackery. Evaluate all the
; toolbox bypass routines.
LEA qtToolStart, A1 ; point to first routine
LEA qtToolEnd, A2; point to just after last routine
MOVE.W #jmpInst, D1 ; get a JMP xxx.L instruction
BRA.S toolEnd ; check for no routines
toolLp: MOVE.W tTrap(A1), D0; pick up the trap number in D0.w
_GetTrapAddress newTool ; ask where this routine lives
MOVE.L A0, tAddress(A1) ; store address first, THEN
MOVE.W D1, tJump(A1); the JMP, so routines always OK
ADDQ #tLength, A1 ; advance to the next routine
toolEnd:CMP.L A2, A1; at (or past) end of toolbox routines?
BLO.S toolLp ; nope: go evaluate another one
; Evaluate all the OS bypass routines.
LEA qtOSStart, A1; point to first
LEA qtOSEnd, A2; and to just after last
BRA.S osEnd ; handle degenerate case
osLoop: MOVE.W oTrap(A1), D0; pick up trap number
MOVE.W D0, D2 ; copy it for later use (BTST, etc.)
_GetTrapAddress newOS ; find where the routine lives
MOVE.L A0, oAddress(A1) ; save routine address in JSR xxx.L
MOVE.W D2, oTrapWord(A1); fill in MOVE.W #trapword, D1
AND.W #$01FF, D2 ; get just the trap number
MOVE.W D2, oTrapNum(A1) ; and store in MOVE.W #trapnum, D2
; Decide whether the saved registers include A0.
MOVE.L OSent, D0; assume we want usual registers saved
MOVE.L OSexit, D1 ; and restored
BTST #8, D2; but should we save A0, too?
BNE.S osLp1 ; nope (bit 8 set): OSent's registers are fine
MOVE.L OSent2, D0 ; yep (bit 8 clear): use reg list which includes A0
MOVE.L OSexit2, D1 ; and ditto for one which restores A0
osLp1: MOVE.W D1, oRestore(A1) ; store register list for restore
MOVE.L D0, oSave(A1); lastly, get rid of 1st trap
ADD #oLength, A1 ; advance to the next routine
osEnd: CMP.L A2, A1; at the end?
BLO.S osLoop ; nope: go do another
evalEnd: MOVEM.L (SP)+, D0-D2/A0-A2 ; restore caller's registers
RTS
; Here when the resource forbids caching.
; A2 points to the flag.
evalTurnOff:
SF (A2) ; disable it for faster call next time
BRA.S evalEnd ; clean up and exit
; *** Impure *** flag: 0 means mapping disabled; non-zero means enabled.
qtEnabled:
DC.B $FF,00; initially enabled; 2nd byte to align
; Instructions and register lists to stick into OS routines.
; Each is 2 words.
OSent: MOVEM.L OSregs, -(SP)
OSent2: MOVEM.L OSregs2, -(SP)
OSexit: MOVEM.L (SP)+, OSregs
OSexit2:MOVEM.L (SP)+, OSregs2
; Toolbox trap replacement routines. To be re-evaluated,
; these must be between qtToolStart and qtToolEnd.
; Nothing else must be in here -- the evaluator
; walks through this as an array.
qtToolStart: ; Beginning of Toolbox trap replacement routines.
qtTool qtMoveTo,_MoveTo
qtTool qtSetPort,_SetPort
; add your own here
qtToolEnd: ; End of Toolbox trap replacement routines.
; OS trap replacement routines. As with toolbox,
; keep only these in here.
qtOSStart: ; Beginning of OS trap replacement routines.
qtOS qtHLock,_HLock
qtOS qtHUnlock,_HUnlock
; add your own here
qtOSEnd: ; End of OS trap replacement routines.
END