Efficient C
Volume Number: 9
Issue Number: 8
Column Tag: C Workshop
Efficient C Programming
High-level optimizations
By Mike Scanlin, MacTech Magazine Regular Contributing Author
This article explains and gives examples of how to get better code generation for common operations and constructs in C on the Macintosh. These higher-level optimizations will, on average, produce faster and smaller code in other languages as well (C++, Pascal). Some of them will make your code harder to read, more difficult to port, and possibly have a negative performance impact on non-680x0 CPUs. However, for those cases where you're optimizing your bottlenecks for the 680x0 CPUs, these tricks will help you.
There are several things you can do to your code, independent of which language or compiler you're using, in order to improve performance. Let's start with those.
FILE INPUT/OUTPUT
There are three things to know in order to produce optimal file I/O: (1) always read/write in sector-aligned amounts, (2) read/write in the largest chunks possible, and (3) be sure to disable Apple's disk cache.
From the Mac file system's point of view, files are made up of blocks. From a disk driver's point of view, blocks are made up of sectors. A sector is usually 512 bytes. On large disks a block can be 10K or larger (files are a minimum of 1 block in size). You can get the exact block size by calling GetVInfo and checking the volumeParam.ioVAlBlkSiz field. Your buffers should be multiples of this amount when reading and writing to that particular volume (and if not, they should at least be multiples of the sector size) and should begin on a block boundary (or, at a minimum, a sector boundary) within the file. Your reads/writes will be 2x faster if you read/write aligned sectors than if you don't.
Recently while implementing a virtual memory scheme I had to determine the optimal VM page size for maximum disk throughput (measured in bytes/second). After testing a variety of page sizes on a variety of CPUs and hard disks, I determined that the optimal size was 64K. If you read and write less than 64K at a time you will not be minimizing disk I/O time (and for very small reads and writes you will be paying a significant throughput penalty). Here's an experiment for the unbelievers: write a little program that writes an 8MB file 512 bytes at a time and then writes another 8MB file 64K at a time. You should find that the 64K at a time case is 8x to 40x faster than the 512 byte at a time case. Then try reading in that 8MB file 512 bytes at a time and then 64K at a time. It should be about 35x to 40x faster for the 64K case (your actual times will depend on your CPU and your hard drive).
Lastly, if you are using large aligned I/O buffers you should turn off Apple's disk cache for your reads and writes. IM Files pg. 2-95 says that you can do this by setting bit 5 of ioPosMode before a Read or Write call. In those cases where the cache is kicking in, you'll be 13x faster by forcing it off. For a complete explanation of Apple's disk cache, see "Apple's Lame Disk Cache" on pg. 75 of April 1993 MacTech.
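As a rough sketch (not production code), an uncached, chunked read at the Parameter Block level might look like this. It assumes refNum is an already-open file and buffer points to a 64K block; if your interface files don't define noCacheMask, it's just bit 5 of ioPosMode (0x0020):
#define kChunkSize (64L * 1024)
ParamBlockRec pb;
OSErr err;
pb.ioParam.ioCompletion = nil;
pb.ioParam.ioRefNum = refNum;
pb.ioParam.ioBuffer = buffer;
pb.ioParam.ioReqCount = kChunkSize;
pb.ioParam.ioPosMode = fsFromStart + noCacheMask; /* bit 5 set = bypass the cache */
pb.ioParam.ioPosOffset = 0;
do {
	err = PBRead(&pb, false); /* synchronous read */
	pb.ioParam.ioPosMode = fsAtMark + noCacheMask;
} while (err == noErr); /* eofErr ends the loop */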
OTHER LANGUAGE-INDEPENDENT THINGS
In a previous MacTech article (Sept 1992) I droned on and on about aligning your data structures and stack usage. That's true in any language on the Mac (because it's a result of the 680x0 architecture). Do it.
One thing you can do to reduce the calling overhead of your functions is to use fewer parameters. Sometimes you'll find that a general-purpose routine that takes several parameters is being called in a loop where most of the parameters aren't changing between calls within the loop. In cases like this you should make a parameter block and pass a pointer to the parameter block rather than passing all of the parameters each time. Not only does this make aligned-stack calling easier to implement and maintain, but it greatly reduces the pushing and popping of invariant stack data during your loop. For instance, you could change this prototype:
void MyWizzyFunction(short wizFactor, long numWizzes, Boolean doWiz,
	short fudgeFactor, short wizHorz, short wizVert);
to this:
void MyWizzyFunction(WizParamBlockPtr wizPBPtr);
with this parameter block:
typedef struct WizParamBlock {
long numWizzes;
short wizFactor;
short fudgeFactor;
short wizHorz;
short wizVert;
Boolean doWiz;
} WizParamBlock, * WizParamBlockPtr;
You'll save a lot of time from reduced stack operations on each call to MyWizzyFunction.
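As an illustration (a hypothetical call site; the field values are made up), the caller fills in the invariant fields once and only touches the one that changes inside the loop:
WizParamBlock wizPB;
register short h;
wizPB.numWizzes = 10;
wizPB.wizFactor = 5;
wizPB.fudgeFactor = 2;
wizPB.wizVert = 0;
wizPB.doWiz = true;
h = 100;
do {
	wizPB.wizHorz = h; /* the only field that varies */
	MyWizzyFunction(&wizPB);
} while (--h);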
TABLE LOOKUPS
I once met someone who told me that every computer science problem could be reduced to a table lookup. I guess that if your table was unlimited in size this might be true (but then the table initialization time might kill you). Nonetheless, there are many cases where code can be sped up with a relatively small table. The idea is to precompute some data and look it up rather than recalculate it each time through a loop. For example, this code:
register Byte n, *xPtr;
register short i, *yPtr;
Byte x[1000];
short y[1000];
yPtr = y;
xPtr = x;
i = 1000;
do {
n = *xPtr++;
*yPtr++ = n*n + n/5 - 7;
} while (--i);
is much slower than this code:
/* 1 */
register Byte *tablePtr;
register short tableOffset;
short table[256];
/* first pre-compute all possible
* 256 values and store in table
*/
yPtr = table;
i = 0;
do {
*yPtr++ = i*i + i/5 - 7;
} while (++i < 256);
tablePtr = (Byte *) table;
yPtr = y;
xPtr = x;
i = 1000;
do {
/* we do manual scaling for speed */
tableOffset = *xPtr++;
tableOffset *= sizeof(short);
/* generates Move (Ax,Dx),(Ay)+ */
*yPtr++ = *(short *)
(tablePtr + tableOffset);
} while (--i);
This second version, which only requires a 256-entry table, contains no multiplies or divides in its main loop. The tableOffset *= sizeof(short) statement compiles down to an Add instruction since sizeof(short) evaluates to 2. The *yPtr++ = ... statement compiles down to Move (Ax,Dx),(Ay)+, which is as optimal as you can get (and what you would have written if you had been writing assembly).
One thing that's really important to know when using lookup tables is that your table element size needs to be a power of 2 in order to have fast pointer calculations (which can be done with a shift of the index). If you only need 5 bytes per table element, it would be better to pad each element to 8 bytes so that you can shift the index by 3 (times 8) rather than multiplying it by 5 when trying to get a pointer to a given element.
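For instance (a hypothetical 5-byte record padded to 8 bytes; the shift by 3 replaces a multiply by 5):
typedef struct {
	long data; /* 4 bytes of real data */
	Byte flags; /* 1 more byte of real data */
	Byte pad[3]; /* padding to reach 8 bytes */
} PaddedEntry;
PaddedEntry table[100], *entryPtr;
register short index;
index = 42; /* some element we want */
entryPtr = (PaddedEntry *)
	((Byte *) table + ((long) index << 3));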
Also, depending on the amount of data involved, you may want to declare the table as a static with precomputed initial values so that no run-time initialization is needed.
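For instance, the first few entries of the table from the example above, precomputed by hand (a full 256-entry version works the same way):
static short table[8] = {
	-7, -6, -3, 2, 9, 19, 30, 43 /* i*i + i/5 - 7 for i = 0..7 */
};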
USE SHIFTS WHEN YOU CAN
This one is obvious enough that most programmers assume that the compiler always does it for them. If you want to divide a value by 8, you might think that this would generate an efficient shift right by 3:
x /= 8;
It's true that if x is unsigned then MPW and Think do the right thing, but if x is signed they generate a divide. The reason is that you can't shift a negative number to the right to divide by 8 (if the original value is -1 you'll get -1 as the result, too, because of sign extension). To solve this problem, you should add 7 to x (when it's negative) before shifting. Use this instead of the above for signed right-shifts by 3:
/* 2 */
if (x < 0)
x += 7;
x >>= 3;
and use a shift left instead of a multiply when multiplying by a power of 2. Also, there may be brain-dead compilers out there that your code will be ported to some day, so you should use the shift operator even when working with unsigned values. It's a good habit to get into.
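For example, multiplying by 16:
y = x << 4; /* same result as y = x * 16, without the multiply */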
USE & INSTEAD OF % WHEN YOU CAN
When taking a value mod a power of 2, you should AND it with (value - 1) instead. Don't do this:
x = y % 8;
do this (to save a division):
x = y & (8 - 1);
As before, this may not match what % returns if y is signed, but if the goal is just to get the last 3 bits, it works fine. And if you want the same result % would give you for a negative y (a negative remainder, with these compilers), you can do this to save the divide:
/* 3 */
x = y & (8 - 1);
if ((y < 0) && (x != 0))
	x -= 8;
DON'T USE MULTIPLY
As you know, multiply instructions are expensive on the 680x0 and you should avoid them wherever possible. What you may not know, though, is the extent to which you should avoid them. For instance, some would say that this code:
x *= 20;
is acceptable. However, in a tight loop it would be much better to use:
/* 4 */
temp = x;
temp += x;
temp += temp;
x <<= 4;
x += temp;
It's not necessarily intuitive that five instructions are better than one but, assuming temp and x are register variables, the times for the above are:
68000: 70 cycles for first one, 30 cycles for second
68030: 28 cycles for first one, 14 cycles for second
68040: 15 cycles for first one, 6 cycles for second
This type of C programming, which I call "writing assembly language with C syntax," requires a detailed knowledge of your compiler and its register-variable allocation. It also requires a little knowledge of assembly language which, if you don't have it, would be a good thing to start learning (use Think's Disassemble command and MPW's DumpObj to see what the compiler is doing with your C code).
DON'T USE FOR STATEMENTS
Many people resist this optimization, but it falls into the category of convenient syntax vs. efficient syntax. The basic point is that you can always do at least as well as a for loop by using a while loop (for 0 or more iterations) or a do-while loop (for 1 or more iterations), and in most cases you can do better by not using a for loop. (In fact, Wirth removed the FOR keyword from his latest language, Oberon, because he considered it unnecessary.)
Here's an example. This code:
for (i = 0; i < NUM_LOOPS; i++) {
	/* body of loop */
}
is better as:
/* 5 */
i = NUM_LOOPS;
do {
	/* body of loop */
} while (--i);
because the first one generates:
Moveq #0,D7
Bra.S @2
@1 <body of loop>
@2 Addq #1,D7
Cmpi #NUM_LOOPS,D7
Blt.S @1
and the second one generates:
Moveq #NUM_LOOPS,D7
@1 <body of loop>
Subq #1,D7
Bne.S @1
Now, it's true that I'm comparing apples and oranges a bit here because the first loop counts up and the second loop counts down, but the first loop is representative of how I see a lot of inexperienced programmers write their for loops. Even if they were to make the optimization of counting down to zero, the do-while loop is still more efficient because of the extra branch instruction at the top of the for loop.
As an experiment, try writing your code without for loops for a while. I think you'll find that it often becomes clearer and in many cases it will become more efficient, too.
USE REASONABLE REGISTER VARIABLES
While register variables are certainly a good tool for making your code faster, if you don't use them right you might be hurting yourself.
When writing an application on the 680x0, you have 3 address registers (pointers) and 5 data registers to play with. Do NOT declare more register variables than that. And if something doesn't really need to be in a register (because it's only read from once or twice, for instance), then don't put it in a register. The time to save, initialize, and restore the register will cause a performance hit rather than a gain.
The most important thing is to write your functions so that they have a reasonable number of local variables (no more than 3 pointers and 5 non-pointers, ideally). If you just can't split the function up or use fewer variables, then try to use register variables with restricted scope (some subset of the function) so that you can reuse them later in the function for other things.
Even if you don't use register variables, big functions with lots of locals make it extremely difficult for any compiler to allocate registers efficiently. This applies to many different machines and compilers.
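For instance (a made-up function), braces can limit each register variable's scope so the compiler can reuse the register for the second phase:
void CopyThenSum(Ptr srcPtr, Ptr dstPtr)
{
	{
		register short i; /* register used only by the copy loop */
		i = 100;
		do {
			*dstPtr++ = *srcPtr++;
		} while (--i);
	}
	{
		register long checkSum; /* can reuse the same register here */
		checkSum = 0;
		/* ... second phase of the function ... */
	}
}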
TWO STATEMENTS CAN BE BETTER THAN ONE
Similar to the above trick, there are times when even the simplest statements, such as:
x = 131;
can be improved:
x = 127;
x += 4;
The reason is that the first generates one of the instructions that you should never use when programming on a non-68040:
Move.L #131,x
That's a 6-byte instruction which is better replaced with this 4-byte version:
/* 6 */
Moveq #127,x
Addq #4,x
On the 68040 you won't notice any improvement from this optimization because 32-bit immediate operands are one of the optimized addressing modes. But on 680x0s less than the 68040 you will get a size and speed benefit from using the two-instruction version (which must be written as two statements; if you write x = 127 + 4 the compiler will combine the compile-time constants for you).
SOME CONSTANTS GENERATE CODE
It was hard for me to believe it when I first saw it but this code:
#define COUNT (600 * 60)
register long x;
x = COUNT;
actually generates a run-time multiply instruction in Think C. The problem is that the result of the 600*60 multiplication is larger than the maximum positive 16-bit integer. So the assignment at run time is x = -29536 (36000 truncated to 16 bits is 0x8CA0, which sign-extends to -29536), which is probably not what you want. To get what you probably want, and to eliminate the run-time multiply instruction, add an L after the 600 in the #define. That way the compiler treats it as a 32-bit constant and will do the multiply at compile time.
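The fixed version:
#define COUNT (600L * 60) /* the L forces a 32-bit compile-time multiply */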
USE POINTERS WITHOUT ARRAY INDEXES
Square brackets [] are usually a sign of inefficiency in C programs. The reason is all the index calculations the compiler is going to generate to evaluate them. There are some exceptions to this rule, but not many. For instance, this code:
for (i = 0; i < 100; i++)
x[i] = blah;
is much better as:
/* 7 */
p = x;
for (i = 0; i < 100; i++)
*p++ = blah;
because the compiler doesn't have to calculate the effective address of x[i] each time through the loop.
Likewise, the following code (which is notationally convenient) to append ".cp" to the end of a Pascal string:
char *name;
name[++*name] = '.';
name[++*name] = 'c';
name[++*name] = 'p';
is much less efficient (and many more bytes) than this code:
/* 8 */
char *namePtr;
namePtr = name + *name;
*namePtr++ = '.';
*namePtr++ = 'c';
*namePtr++ = 'p';
*name += 3;
USE 16-BIT SIGNED INDEXES
If you find that you must use array addressing with square brackets, you can improve the efficiency by using a signed 16-bit index rather than an unsigned one (of 16 or 32 bits). The reason is that something like this:
x = p[i];
can then be coded as (assuming everything is a register variable and p points to an array of bytes):
Move (p,i),x
whereas, if i were unsigned you'd get:
Moveq #0,D0
Move i,D0
Move (p,D0.L),x
If you're generating 68020 instructions or higher, this same trick improves efficiency even if p points to an array of 16-bit, 32-bit, or 64-bit quantities because most compilers will use the auto-scaling addressing mode:
Move (p,i*8),x
for example, if p points to a table of 8-byte entries.
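In C, this just means declaring your index appropriately (a sketch; p is assumed to point at a byte array):
register short i; /* signed 16-bit index: one Move (p,i),x */
register unsigned short u; /* unsigned index: the zero-extend sequence above */
register Byte *p;
Byte a, b;
a = p[i]; /* one instruction */
b = p[u]; /* three instructions */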
DON'T USE PRE-DECREMENTED POINTERS
This one is really only a shortcoming of Think C's code generation and nothing else. I hope they fix it soon because it drives me nuts. If you do this in Think C:
i = *--p;
you'll get this code generated:
Subq.L #2,A4
Move (A4),D7
instead of the obviously more efficient:
Move -(A4),D7
If you have a large buffer that you're walking through backwards, the time penalty for pre-decremented pointers can be significant (and would be a good place to drop in a little in-line asm). The funny thing is that Think is smart about the post-incrementing case; i.e., this code:
/* 9 */
i = *p++;
generates the optimal:
Move (A4)+,D7
I'm not sure why they have this asymmetry in their code generator. Probably a function of shipping deadlines...
EVIL ELSES
In some cases, it's better to do what appears to be more work. This code:
x = (expr) ? y : z;
or its equivalent:
if (expr)
x = y;
else
x = z;
can be made to execute faster and take fewer bytes like this:
/* 10 */
x = z;
if (expr)
x = y;
The reason is that the unconditional branch instruction generated by the compiler before the else statement is slower than the extra assignment instruction.
TEMPNEWHANDLE IS SLOW
Not too long ago I was asked to investigate why a certain application was running slowly. The developers had made several recent changes, one of which was to use temporary memory, and noticed a slowdown. I traced one of the problems down to the TempNewHandle call itself. It turns out that it's 5x slower than NewHandle. Try allocating 80 handles of 64K each with NewHandle and then the same thing with TempNewHandle. The results are a strong argument against using TempNewHandle in places where you do lots of allocations and deallocations (in those cases where you have a choice).
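Here's a rough sketch of that timing experiment (TickCount returns 60ths of a second; error checking omitted):
#define kNumHandles 80
#define kHandleSize (64L * 1024)
Handle h[kNumHandles];
long startTicks, elapsedTicks;
short i;
OSErr err;
startTicks = TickCount();
i = kNumHandles;
do {
	h[i - 1] = NewHandle(kHandleSize);
} while (--i);
elapsedTicks = TickCount() - startTicks;
/* DisposHandle each handle, then repeat the loop with
 * TempNewHandle(kHandleSize, &err) and compare the two times */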
BOOLEAN FLAGS
If you pack several boolean flags into a byte, put your most commonly tested flag in the highest bit position because the compiler will usually generate a Tst.B instruction for you rather than a less efficient Btst #7,<flags> instruction.
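One way to write the test (the flag name is made up; this assumes plain char is signed, which is the default in Think C and MPW):
#define kDirtyMask 0x80 /* most frequently tested flag, in bit 7 */
Byte flags;
if ((char) flags < 0) { /* sign test compiles to Tst.B/Bmi */
	/* handle the common (dirty) case */
}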
USE ELSE WHEN CLIPPING
When clipping a value to a certain range of values, be sure to use an else statement. I've seen this code several times:
if (x < MIN_VAL)
x = MIN_VAL;
if (x > MAX_VAL)
x = MAX_VAL;
The insertion of a simple else keyword before the second if will improve performance quite a bit for those cases where x is less than MIN_VAL (because it avoids the second comparison in those cases where you know it's false):
/* 11 */
if (x < MIN_VAL)
x = MIN_VAL;
else if (x > MAX_VAL)
x = MAX_VAL;
USE +=, NO, WAIT, DON'T USE +=
You might think that these two statements were the same:
Byte x;
x += x;
x <<= 1;
Or, if not, you might think that one of them would be consistently better than the other. Well, while they are the same functionally, depending on whether or not x is a register variable you can get optimal code with one or the other, but not both.
Let's look at the code. If x is not a register variable then you get this for the first one (in Think C):
Move.B nnnn(A6),D0
Add.B D0,nnnn(A6)
and you get this for the second one:
Move.B nnnn(A6),D0
Add.B D0,D0
Move.B D0,nnnn(A6)
So, as you can see the first one, x += x, is better. However, if x is a register variable then the first one generates:
Move.B D7,D0
Add.B D0,D0
Move.B D0,D7
and the second one generates:
Add.B D7,D7
And now the second one, x <<= 1, is clearly better. Don't ask me why ('cause I don't know) but if it bothers you like it does me, then write a letter to the Think implementors.
ONE FINAL EXAMPLE
Now that I've covered several C optimization tricks, let's look at an example I encountered last week. Listing 6 of the Principia Off-Screen tech note builds a color table from scratch:
/* 12*/
#define kNumColors 256
CTabHandle newColors;
short index;
/* Allocate memory for the color table */
newColors = (CTabHandle)
NewHandleClear(sizeof(ColorTable) +
sizeof(ColorSpec) * (kNumColors - 1));
if (newColors != nil) {
(**newColors).ctSeed = GetCTSeed();
(**newColors).ctFlags = 0;
(**newColors).ctSize = kNumColors - 1;
/* Initialize the table of colors */
for (index = 0; index < kNumColors; index++) {
	(**newColors).ctTable[index].value = index;
	(**newColors).ctTable[index].rgb.red = someRedValue;
	(**newColors).ctTable[index].rgb.green = someGreenValue;
	(**newColors).ctTable[index].rgb.blue = someBlueValue;
}
}
What's inefficient about it? For starters, it's a little wasteful to clear all of the bytes with NewHandleClear since the code then proceeds to set every byte in the structure to some known value. Second, it's wasteful to dereference the newColors handle every time a field of the color table is referenced. Nothing in that code except for the NewHandleClear call is going to move memory so, at a minimum, we should dereference the handle once and use a pointer to the block. Third, the evil square-brackets array indexing is used in a place where a post-incrementing pointer would do. Fourth, a for loop is used where a do-while will suffice.
Here's a more efficient version of the same code that fixes all of these problems:
/* 13 */
#define kNumColors 256
CTabHandle newColorsHndl;
CTabPtr newColorsPtr;
short index, *p;
/* Allocate memory for the color table */
newColorsHndl = (CTabHandle) NewHandle(sizeof(ColorTable) +
sizeof(ColorSpec) * (kNumColors - 1));
if (newColorsHndl != nil) {
newColorsPtr = *newColorsHndl;
newColorsPtr->ctSeed = GetCTSeed();
newColorsPtr->ctFlags = 0;
newColorsPtr->ctSize = kNumColors - 1;
/* Initialize the table of colors */
p = (short *) newColorsPtr->ctTable;
index = 0;
do {
*p++ = index; /* value */
*p++ = someRedValue;
*p++ = someGreenValue;
*p++ = someBlueValue;
} while (++index < kNumColors);
}
Now, to be fair, I'm sure the authors of that tech note wrote the code the way they did so that it would be clear to as many people as possible. After all, it is for instructional purposes. So please don't flame me for picking on them; with the exception of the inefficiencies in the example code, I happen to like that tech note a lot.
WRAPPING IT UP
Many Mac programmers I've met have the impression that if you're programming in a high-level language like C, many of the known assembly-language peephole optimizations don't apply or can't be achieved because the compiler's code generation is out of your control. While that's true for some of the low-level tricks, it's certainly not true for all of them, as we have seen. It's just a matter of getting to know your compiler better so that you can coerce it into generating the optimal set of instructions. But if you're writing portable code, these types of CPU-dependent and compiler-dependent optimizations should probably not be used except in the 5% of your code that occupies 80% of the execution time (and even then you're probably going to want a per-CPU and per-compiler #ifdef so that you get optimal results on all CPUs and with all compilers).
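Such a guard might look like this (a sketch; the predefined macro names vary by compiler - THINK_C and applec are the ones Think C and MPW C define):
#if defined(THINK_C) || defined(applec)
	/* hand-tuned 680x0 version of the bottleneck goes here */
#else
	/* portable version of the same code goes here */
#endif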