Efficient 68030
Volume Number: | | 8
|
Issue Number: | | 5
|
Column Tag: | | Efficient coding
|
Efficient 68030 Programming
Optimizing your code to run faster on the 68030
By Mike Scanlin, MacTutor Regular Contributing Author
So you want to optimize your code to run faster on the 68030? Well, good for you. This article will help. Most of these optimizations apply to running on the 68020/040, too, and in no case will any of these optimizations make your code run slower on a non-030 machine (like the 68000). These optimizations are not language specific, although most of the examples given here are in Think C and assembly.
The two most important things you need to know in order to make your program run fast on a 68030 are: (1) How to intelligently align your code and data and (2) How to use the data and instruction caches effectively. Once you understand a few basic attributes of the 030 you can do a few things to your code without much work to ensure that you have optimal cache use and alignment.
How much difference will it make, you ask? Using these techniques you can save about 15% on your subroutine calling overhead, 25% to 50% on some of your 32-bit memory accesses (reading or writing longs, Ptrs, Handles, etc), 60% on your NewHandleClear calls and more. Taken in sum these things are likely to make a noticeable difference in your codes performance when running on an 030.
The single biggest gain you can realize in your programs when running on an 030 is to make sure your 4-byte variables (longs, Ptrs, Handles, etc) are on 4-byte memory boundaries (i.e. their address is evenly divisible by 4). To understand why this makes a difference you need to know a little bit about how the data cache works.
THE DATA CACHE
The 68030 has a 256 byte data cache where it stores copies of the most recently accessed memory locations and values. If your code reads the same memory location twice within a short period of time, it is very likely that the 68030 can get the value at that location from its cache (which is in the processor itself) rather than from actual memory. This reduces the time it takes to execute the instruction accessing the memory location because it eliminates the external bus cycle that the processor would normally use to go out into memory and fetch the value that the instruction needs.
Of course, its not quite as simple as that. There are a couple of other details to know. The 256 byte data cache is actually broken up into 16 separate caches of 16 bytes each. When the processor is trying to fetch the value at a given memory location it first checks if it is in one of its 16 data caches. If it is, then it uses the value and everything is cool. If its not, then it goes out into memory and fetches the value along with the 16 or 32 bytes surrounding the value. If x is the memory location requested, then the processor fetches all the bytes from location ((x >> 4) << 4) to ( (x + sizeof(fetch)) >> 4) << 4) + 15), where sizeof(fetch) is 1, 2 or 4 depending on if this is a byte, word or long access. Basically, it fetches enough 16 byte buckets to completely cover the request (and each bucket begins at an address divisible by 16). The reason why more than one 16-byte bucket might be necessary is because the first part of the requested value could be at the end of one bucket and the second part could be at the beginning of the next bucket. An example is the low memory global GrayRgn, whose address is $9EE - the first half (hi-word) exists in the bucket from $9E0 to $9EF and the second half (lo-word) exists in the bucket from $9F0 to $9FF. So, assuming GrayRgn isnt in the data cache to begin with, the instruction:
Move.L GrayRgn,A0
will cause 32 bytes to be read in from memory (4 of which the processor actually uses to process the Move.L instruction). At the end of the instruction, two of the 16 data caches are filled with the values from $9E0 to $9FF. The reason why the processor does this is because it is predicting that the code is going to want the bytes near the most recently requested memory location. Although the first access to a 16-byte bucket is expensive, subsequent accesses to the same bucket are free (in terms of time to fetch the data). This actually helps routines like BlockMove quite a bit (its 46% faster for non-overlapping moves because of it). And its not quite as sick as you might expect to read in 16 or 32 bytes from memory because the processor uses a special data burst mode to do it (which runs somewhat parallel with other stuff the processor is doing).
The other detail to know about the 030 data cache is that its a write-through cache. That means that if you change the value at a given location and that location is already in the data cache, then both the cached value and the real memory value are updated. This way you can never have stale data in the cache (i.e., the cache always agrees with actual memory and vice versa). (Note: Thats somewhat of a lie. It is actually possible to get stale data in an 030 data cache if you have a DMA card writing to main memory at interrupt time, but I assume that if youre doing that then you already know whats going on. Most people wont ever have to worry about this detail. Check out TN #261: Cache As Cache Can for more info on this.)
The last little thing to know about alignment is that independent of the 16-byte buckets the data cache uses, it is almost always faster to access a 4-byte quantity on a 4-byte boundary than on a 2-byte boundary (the only time its not is if the entire 4 bytes are already in the data cache). How much difference does it make? In the case of reading a 4-byte value it is 34% faster if the value is long-aligned and in the case of writing a 4-byte value it is 25% faster if the value is long-aligned (and if the non-aligned value crosses a 16-byte bucket boundary then its 52% faster for reads and 25% faster for writes; notice there is no penalty, other than the normal non-aligned penalty, for writing to a 4-byte value that crosses a 16-byte boundary because write operations dont cause 32-byte reads to happen).
So, the take home message is that you should always put your 4 byte quantities (longs, Ptrs, Handles, etc) at addresses that are divisible by 4 (or, to be more general, put your n-byte quantities on n-byte boundaries). At the time the Macintosh was originally conceived this was not a design goal because several of the low memory globals are 4-byte quantities at odd-word addresses. In fact, a few live at the worst possible location for a 4-byte quantity: those addresses ending in $E that will cause 32-byte reads when they are accessed: FCBSPtr ($34E), GrayRgn ($9EE), MinStack ($31E), ROMBase ($2AE) and WMgrPort ($9DE). Theres nothing we can do about those, but there is something we can do about your code.
ALIGNING YOUR VARIABLES AND DATA
The basic idea of aligning your variables is to keep all of your 4-byte variables/structs together and then have all of your 2-byte and 1-byte variables/structs follow. If you know that the first variable you allocate is long-word aligned and all the ones that follow it are multiples of 4-bytes in length, then you know that they are all going to be long-word aligned (likewise, if the first one is off then they are all going to be off).
Here is an example of some variables, some of which are guaranteed to not be aligned:
inti;
long x;
char c[2];
Ptrp;
Because i and c are 2-byte quantities, you know that one of x or p is going to be long-word aligned and the other one is not going to be long-word aligned, no matter where the compiler allocates them (assuming that they are all allocated next to each other in the order theyre declared).
Heres a regrouping that is better:
/* 1 */
long x;
Ptrp;
inti;
char c[2];
Now, as long as you know that x is long-word aligned, you know that p is long-word aligned, too. Forcing long-word alignment of variables comes in three parts: globals, locals and data structures.
Before we get to the specific details, I should mention that the C language doesnt specify the location or order of declared variables; its implementation dependent. Therefore, its more important that you understand what is desirable and why it is desirable. Then, you can consult your compiler manual or, more likely, write some test programs to see what its doing with your variables. In a perfect world you wouldnt have to be concerned with alignment - it would be a compiler option. Until that time, read on.
ALIGNING GLOBAL VARIABLES
Because of the fact that the Mac memory manager always allocates memory on 4-byte boundaries (Ptrs and dereferenced handles always end in 0, 4, 8 or C) you know that the heap is going to start and end on a long-word boundary. Further, because you know that the jump table is a multiple of 8 bytes you can assume that register A5 will point to an address divisible by 4. This means that as long as you allocate your globals in chunks of 4 bytes you will have long-word alignment for them. The best way to do this is to allocate your biggest variables first, then your next biggest, and so on, on down to your smallest. Just make sure that you have an amount of each one that divides by four.
Good global allocation:
/* 2 */
/* First, the 4*n byte quantities */
Rect r;
/* Now list the 4 byte quantities */
Handle h;
long x;
char *p;
/* Now list the 2 bytes quantities */
int i, j, k;
int dummyInt1; /* need this for padding */
/* And finally the 1 byte quantities */
char ch;
Booleanb;
int dummyInt2; /* need this for padding */
Note that each section is a multiple of 4 bytes long. This is important because the Think C 5.0 compiler/linker allocates globals in backwards order from how theyre declared. If these are your only global variables:
long x;
inti;
then x is at $FFFA(A5) and i is at $FFFE(A5). But if you put a dummyInt in there like this:
/* 3 */
long x;
inti;
intdummyInt;
then x becomes long-word aligned at $FFF8(A5) (which is the goal).
The only gotcha with globals has to do with the QuickDraw (QD for short) globals which arent a multiple of 4 bytes long (theyre 206 bytes long). You need to know if your compiler allocates them before or after your own globals and maybe add a dummyInt either before or after all of your globals to compensate for the unaligned QD globals. Think C users dont really have to worry about this problem because the Think compiler allocates the QD globals after the application globals (so the QD globals might be unaligned but theres generally nothing you can do about that; at least they wont affect your application globals alignment).
ALIGNING YOUR LOCAL VARIABLES
Aligning local variables is trickier because they are allocated on the stack. It requires that the stack be long-word aligned on entry to your routine, which you cant always control (especially if youre a callback proc). However, there is a little we can do to make this a more likely event.
First, we need to make sure the stack is aligned as soon as our code begins to execute. The easiest way to do this is by not having any local variables in main() and by adding a little code that makes sure the stack is aligned after launching by adjusting it by 2 if necessary. The following inline functions are used for this purpose:
/* 4 */
void AlignStack(void) = {
0x200F,//Move.L SP,D0
0x0240,0x0003, //And #0x0003,D0
0x6604,//Bne.S @1
0x7004,//Moveq #4,D0
0x554F,//Subq #2,SP
0x3F00 // @1 Move D0,-(SP)
};
void UnAlignStack(void) = {
0xDED7,//Add (SP),SP
};
If you need to, make main a shell proc that calls your real main, MyMain, so that you dont have any local variables in main():
/* 5 */
void main() {
AlignStack();
MyMain();
UnAlignStack();
}
If you do this, then you know for sure that MyMain will be called with an aligned stack. Now the goal is to make MyMains stack frame (as well as every other stack frame in the entire program) aligned - make sure you declare variables from largest to smallest and make sure that the total size of the stack frame (including parameters) is a multiple of 4 bytes (in addition, if youre using Pascal then you need to include the size of the function result since its allocated on the stack). The following inline function is useful to put at the front of each function to make sure the stack is aligned on entry. It will cause a hang (branch-to-self instruction) if it detects an unaligned stack (in which case you know the caller of that function needs work):
/* 6 */
void CheckStackAlignment(void) = {
0x2F00,//Move.L D0,-(SP)
0x200F,//Move.L SP,D0
0x0240,0x0003, //And #0x0003,D0
0x66FE,// @1 Bne.S @1
0x201F,//Move.L (SP)+,D0
};
Heres an example program with all of these macros in use:
/* 7 */
void main() {
AlignStack();
MyMain();
UnAlignStack();
}
void MyMain() {
long x;
int i;
int d; // needed for alignment
#if DEBUG
CheckStackAlignment();
#endif
x = 5;
i = MyProc(x);
}
int MyProc(long x) {
#if DEBUG
CheckStackAlignment();
#endif
return(x);
}
Not only are you ensuring that x is long-word aligned in this example but, because the stack is aligned the whole time you are getting another benefit: faster subroutine calls. The reason is because the processor can push and pop the return address (a 4-byte value) to an aligned location on the stack. The overall gain of an aligned stack on a Jsr/Bsr and Rts pair is 14% to 16% faster than the same Jsr/Bsr and Rts on a non-long-word aligned stack. Dont forget, too, that if the routine does a Movem.L to save/restore the registers it uses on the stack that the Movem.L instructions will be much faster on a long-aligned stack. Note: This assumes that your compiler is not doing delayed stack clean up. If you are coding for speed and using these alignment techniques, then you need to turn off that compiler option. It is called Defer & combine stack adjusts in Think C (make sure its unchecked).
Note that you dont want to put CheckStackAlignment in callback procs you pass to the ROMs (like filterProcs or ioCompletion routines) or in VBL tasks because you cant control what the ROM or interrupt put on the stack before you are called. Youll just have to live with whatever alignment you get at the time youre called. You can, however, do something about the alignment of the routines called by your callback proc if you put an AlignStack/UnAlignStack in your callback proc. That way, only the callback procs parameters and local variables will be potentially unaligned - all procs that it calls will be aligned.
There is one more detail relative to functions: You should design your functions so that, in C, the parameters should add up to a multiple of 4 bytes and, in Pascal, the parameters plus the size of the function result should add up to a multiple of 4 bytes (because in Pascal the function result is on the stack and in C its in a register). Otherwise your carefully crafted stack frames wont start at an aligned address. It is worth it to increase the size of an int parameter to long or even to add a dummy int parameter in order to maintain this multiple-of-4-bytes rule. If this is your original function:
int MyProc(int i) { return(i); }
then changing it to:
/* 8 */
int MyProc(long i) { return(i); }
will cause it to execute 17% faster when called from an aligned-stack caller. Changing
it to:
/* 9 */
int MyProc(int i, int dummy) { return(i); }
with a calling sequence of:
/* 10 */
i = MyProc(i, 0);
will cause it to execute 9% faster when called from an aligned-stack caller.
There are three things to keep in mind when adding up the sizes of parameters: (1)
One-byte parameters (Booleans, chars, bytes, etc) count as 2 bytes each because they are
pushed on the stack as 16-bits each in order to maintain word-alignment on the stack (or
else youd get an address error on the 68000), (2) pointers to any size parameter (like
char *) are 4 bytes each and (3) declare your parameters in the correct order. If youre
working in C, then declare them in order from smallest to largest (the opposite of what
you do with global and local variables) - that way the 4-byte values (which are pushed
first) will be long-aligned (assuming the stack is aligned at the time of the subroutine
call). Usually, a C compiler will push parameters in order from right to left (because of
functions like printf that take a variable number of parameters, the first parameter needs
to be pushed last) so you want youre biggest (4 byte) parameters at the end of the parameter
list. Here are some examples of bad (unaligned) C prototypes:
void MyProc1(long x, int i);
int MyProc2(int i, long x, char c);
Handle MyProc3(Handle h, long x, int i);
Better (aligned) versions of these would be:
/* 11 */
void MyProc1(int i, long x);
int MyProc2(char c, int i, long x);
Handle MyProc3( int i, long x, Handle h);
Note that the size of the return result doesnt matter in C. Also, you could do even better than these by making the sum of the sizes of the parameters a multiple of 4 bytes, as was mentioned above (make ints into longs or add dummy ints for filler; MyProc1 and MyProc3 need this). If youre unsure which way your C compiler pushes the parameters or if you want to write portable code that compiles on two different C compilers that do it differently, then just make sure all of your non-4-byte parameters are in one part of the parameter list and that you have a multiple of 4 bytes worth of them. Then it wont matter which way the compiler pushes them - the 4-byte ones (which are what we really care about) are guaranteed to be aligned (as is the stack when the routine that is called starts to execute).
In Pascal, its a little different. Parameters are pushed from left to right and only after space has been allocated for the function result (if there is one). So, you want to declare your parameters from largest to smallest in this case (the opposite of what youd do in C). Here are the same prototypes in aligned Pascal:
{ 12 }
PROCEDURE MyProc1(x: LongInt; i: INTEGER);
FUNCTION MyProc2(i: INTEGER; x: LongInt; c: Char): INTEGER;
FUNCTION MyProc3(h: Handle; x: LongInt; i: INTEGER): Handle;
Notice that in Pascal the size of the return result matters. In MyProc2 we make the first parameter an int to make up for the function result INTEGER; that way x (the LongInt) will be aligned when it is pushed on the stack. If the return result is a 4-byte quantity (like the Handle in MyProc3), then you can forget about it and just list the parameters from biggest to smallest. Dont forget that in Pascal the goal is to make the sum of all the parameters plus the function result be a multiple of 4 bytes (which is not true for any of the above). As in C, make INTEGERs into LongInts or add dummy INTEGERs if you need to.
ALIGNING DATA STRUCTURES
What about custom data structures? Where do they fit in? Well, the answer is simple: in order to have the least maintenance possible with all of this long-word alignment stuff, make sure all of your structs are multiples of 4 bytes long. Add a dummy field or two to the end if you need to, just make sure all structs are multiples of 4 bytes long. That way, if you declare them as a global or as a local, you will know they are multiples of 4 bytes long, and you can forget about them and put them in any section without ruining your alignment.
Bad (unaligned) data structure design:
typedef struct {
int i;
long j;
} MyStruct;
Good (aligned) data structure design:
/* 13 */
typedef struct {
long j;
int i;
int dummyInt; /* force the length to a multiple of 4 */
} MyStruct;
Some might say that the dummyInt is an unnecessary waste of space. I would argue that it is a good thing because it ensures that MyStruct is a multiple of 4 bytes, and it might come in useful some day when you need a couple extra bytes in MyStruct - think of dummyInt as unused and reserved for future expansion.
But what about system data structures? Yes, some of them are a problem. Take BitMap, for example:
struct BitMap {
Ptr baseAddr;
short rowBytes;
Rectbounds;
};
There are two problems with it: (1) Bounds starts at an odd-word address (even though its really an array of 4 ints, someone might reference the first half as topLeft and itll be an unaligned 4-byte reference) and (2) The size is not a multiple of 4 bytes. Unfortunately, we cant do anything about (1) but, we can do something about (2). Whenever you have a system data structure in your globals or locals and youre not sure if it is aligned or no,t then you should add this filler variable to take up the extra space:
/* 14 */
BitMap b;
char filler1[4-(sizeof(BitMap)%4)];
The worst case is where the struct in question is already aligned. In that case, the filler will be 4 bytes long (when what youd really like is to have it be zero bytes long). Still, if youre coding for speed, then a few extra bytes are worth it (better yet, you could manually check to see if the size of the struct is aligned and then not put in the filler in that case).
INSTRUCTION CACHE
The 68030 has a 256 byte instruction cache where it stores instructions (code). It works much like the data cache in that there are 16 caches of 16 bytes each. When an instruction is fetched, the 16 or 32 bytes surrounding it are fetched into the instruction cache, too. Like the other caches, you can turn off the instruction cache if you want to. I havent mentioned this fact before because it is hardly ever a good idea. (If you are so motivated, though, you would use the Movec D0,CACR instruction to enable/disable/clear the caches; this is a priveledged instruction, though, so dont ship any products that use it.).
The goal when writing small loops is to have everything contained within 242 bytes. You cant assume all 256 bytes of the instruction cache unless your routine begins at an address divisible by 16; the worst case is that your code begins at an address that ends in $E, thus the 242 byte number (2 bytes from the first bucket plus 15 other complete buckets). The time difference between a fully cached function vs. the same function that isnt fully cached is 24%. So, you can reduce the time of the entire loop by 24% if you can make it fit in 242 bytes or less (including the size of any routines called from within the loop, of course). Note that, in general, unrolling loops is a good way to speed them up (since a short branch not taken is faster than any size branch taken) but, make sure you dont unroll them to be larger than 242 bytes.
If you have a IIci with a cache card or a IIfx, then you wont notice this 24% performance gain; it only applies to the standard 030 machines: SE/30, Classic II, IIx, IIcx, IIsi, IIci (no card), PB 140 and PB 170. The IIcis cache card and IIfxs built in cache consist of 2048 16-byte caches (32K total) that can be used for data or instructions.
The 68020, by the way, only has a 64-byte instruction cache, which is 16 caches of 4 bytes each. If you want to make sure that your tight loops run well on that processor, then make sure they are 62 bytes or smaller. The 020 does not have a data cache.
The 68040 has 256 instruction caches of 16 bytes each, or 4K worth of instruction caches. It also has a similar 4K data cache.
INTERLEAVED WRITE OPERATIONS
Because memory write operations are write-through, the second of two successive write operations must wait until the first has finished before it can execute. Knowing this, you can use the delay
to your advantage to streamline the pipelining process. The idea is
to avoid two memory writes in a row in cases where you can.
Consider the code you might have at the top of a routine, where you save some registers and then set up some initial values:
Movem.L D0-D3,-(SP)
Moveq #x,D0
Moveq #y,D1
Moveq #w,D2
Moveq #z,D3
You can make this code execute 15% faster on a IIfx (its 19% faster on a IIci) if you interleave the memory writes with non-memory reads and writes:
;15
Move.L D3,-(SP)
Moveq #z,D3
Move.L D2,-(SP)
Moveq #w,D3
Move.L D1,-(SP)
Moveq #y,D1
Move.L D0,-(SP)
Moveq #x,D0
Of course, there is a tradeoff. This faster version takes 4 extra bytes. But if youre
coding for speed, its worth it.
SPECIAL INSTRUCTIONS
The 68030 supports some addressing modes and instructions that are not implemented
on the 68000 (they do work on the 020/040, though). If you use these, then your program
will not run on a 68000. If you have a tight loop that youd like to run faster on 020s,
030s and 040s, then consider modifying it to use these instructions. Be sure to do a run-time
check for which CPU youre running on (using Gestalt with the gestaltProcessorType selector)
and only call routines that use these instructions if youre running on the appropriate CPU
(using a set of procPtrs that you install when you start up your app is a good idea). Rather
than give you a dry listing of the new instructions, Ill give a couple of examples of where
the common ones might come into play when programming the Mac. If you really want to
know more, then go get a book on the 030 (if you call Motorola theyll probably send you
a Programmers Guide for free: 408/749-0510).
Double dereferencing of handles can now be done in one instruction with a new addressing
mode. If you want to put the first byte of a handle in A0 into D0, you would do this on the
68000:
Move.L (A0),A0
Move.B (A0),D0
Now you can use this instruction instead (which is 35% faster):
;16
Move.B ([A0]),D0
Of course, if youre going to need the dereferenced handle for more than this one instruction, then youre better off keeping the dereferenced handle in a register rather than using this addressing mode to repeatedly redereference it.
You can add an offset after the first dereference, too. If you want the 10th byte of the handle in A0, you used to do this:
Move.L (A0),A0
Move.B 9(A0),D0
On the 68020 (and 030/040) you can do this:
;17
Move.B ([A0],9),D0
Or, suppose your handle points to a list of 8 byte elements, Rects, lets say, and suppose you have a zero-based index of the Rect you want in D0 and suppose you want the botRight of that Rect in D1. On the 68000 you would do something like this:
Move.L (A0),A0
Asl #3,D0
Move.L botRight(A0,D0),D1
But now you can do this in a much more glamourous instruction:
;18
Move.L ([A0],D0*8,botRight),D1
which is 22% faster and 2 bytes smaller than the old way (and, oh, so much more glamourous). There is another new addressing mode, too, which is nearly as glamourous, and has the general form of ([bd,An,Xn.SIZE*SCALE],od). Geez, I can see why people are moving towards RISC these days - these instructions are getting a bit complex
A new instruction, Extb.L, was added that allows you to sign extend a byte to a long. On the 68000 you have to use two instructions, Ext.W and Ext.L, to get this effect.
Bit fields were added, too. Theyre really cool if youre working with bitmapped graphics. It will be worth your time to go read about them if you do a lot of graphics work.
DONT USE CLEAR
The NewPtrClear and NewHandleClear traps are a lot slower than they need to be. This is for two reasons: (1) because they use the Clr instruction (which does a memory read before setting the memory to zero; thus filling the data cache with useless data) and (2) because they only clear one byte at a time. The inner loop that they both use is something like this:
@1 Clr.B(A0)+
Dbra D0,@1
By substituting this loop instead:
;19
Moveq #0,D1
@1 Move.B D1,(A0)+
Dbra D0,@1
you get a 19% speed improvement. Furthermore, by clearing longs at a time (and dealing with the end as a special case) you can clear memory 65% faster than the Clr.B loop method. If you do a lot of NewPtrClears or NewHandleClears you probably want to write your own MyNewPtrClear and MyNewHandleClear that do this.
ASSEMBLY TRIVIA
Who knows what the slowest instruction on the 68030 is (other than Reset)? If you guessed something to do with division youre wrong. If, on the other hand, you guessed:
;20
Movem.L ([d32,An],Xn,d32),A0-A7/D0-D7
then youre right. Weighing in at about 200 cycles you should avoid this instruction wherever possible (Quick! Go change your source!).
I would like to thank Chris B, Dolf S and John A for help in preparing this article. Its always nice to have people around who can tell you precisely how confused you are. I would also like to thank Timo B for the use of his computer. Its always nice to have someone around who has more hardware than you do.