Code Optimization
Volume Number: 15 (1999)
Issue Number: 4
Column Tag: CodeWarrior Workshop
Code Optimization
by Andrew Downs
Code Optimization Options in CodeWarrior
Introduction
Code optimization involves the application of rules and algorithms to program code with the goal of making it faster, smaller, more efficient, and so on. Often these types of optimizations conflict with each other: for instance, faster code usually ends up larger, not smaller. Optimizations can be performed at several levels (e.g. source code, intermediate representations), and by various parties, such as the developer or the compiler/optimizer.
This article discusses and demonstrates the code optimization options available in CodeWarrior Pro. In spite of its ever-growing sophistication, CodeWarrior still makes it easy to select the right level of optimization for your program. Understanding how CodeWarrior applies optimizations will help when it is time to debug your code, especially when using a low-level debugger or when the source code is not available.
Optimization Terms
In order to understand the process, you need to speak the language. Here are definitions for the terms discussed in this article, plus a few extras. Keep in mind that the words replace, substitute, etc. do not imply physical changes to your original source code, but rather are reflected in the generated machine code, after your source code has been parsed and optimized.
Each term includes a brief description, including why it is important. Code examples illustrating some of these terms will be presented later in the article.
- Common subexpression elimination
- If the value resulting from the calculation of a subexpression is used multiple times, perform the calculation once and substitute the result for each individual calculation.
- Constant propagation
- Replace variables that rely on an unchanging value with the value itself.
- Copy propagation
- Replace multiple variables that use the same calculated value with one variable.
- Dead code elimination
- Code that never gets executed can be removed from the object file to reduce stored size and runtime footprint.
- Dead store elimination
- Variables and related store operations that are never used can be removed from the object file to reduce stored size and runtime footprint.
- Global register allocation
- Variables that do not overlap in scope may be placed in registers, rather than remaining in RAM. Accessing values stored in registers is faster than accessing values in RAM.
- Global vs. peephole scope
- Global optimization allows the compiler/optimizer to look at the overall program and determine how best to apply the desired optimization level. Peephole provides local optimizations, which do not account for patterns or conditions in the program as a whole. Local optimizations may include instruction substitutions.
- Inline calls
- A function that is fairly small can have its machine instructions substituted at the point of each call to the function, instead of generating an actual call. This trades space (the size of the function code) for speed (no function call overhead).
- Instruction scheduling
- Instructions for a specific processor may be generated, resulting in more efficient code for that processor but possible compatibility or efficiency problems on other processors. This optimization may be better applied to embedded systems, where the CPU type is known at build time. Most desktop software should use the Generic PowerPC option in CodeWarrior.
- Lifetime analysis
- A register can be reused for multiple variables, as long as those variables do not overlap in scope.
- Loop counting using CTR
- The count register (CTR) on the PowerPC is intended to be used for counting, such as in loop iterations. There are some branch instructions that specifically use the value in the CTR.
- Loop invariant expressions (code motion)
- Values that do not change during execution of a loop can be moved out of the loop, speeding up loop execution.
- Loop transformations
- Some loop constructs can be rewritten to execute more quickly. For example, a for{} loop can be rewritten as a while{} loop, saving one or more instructions.
- Loop unrolling
- Statements within a loop that rely on sequential indices or accesses can be repeated more than once in the body of the loop. This results in checking the loop conditional less often.
- Strength reduction
- Certain operations and their corresponding machine code instructions require more time to execute than simpler, possibly less efficient counterparts.
CodeWarrior Optimization Dialog
Figure 1 shows the dialog box pane used to specify instruction scheduling and peephole optimization in CodeWarrior.
Figure 1. PPC processor pane.
Global optimizations are specified in the optimization pane, shown in Figure 2. Global optimization can be performed with the intent of making the resulting code faster or smaller, but not both.
Figure 2. Global optimization pane.
The code examples presented in this article use either peephole optimization or the faster execution speed settings. The smaller code size setting is briefly discussed later in the article. Profiled results for smaller code size are provided in the article archive, but are not illustrated in code examples.
Disassembling Code
The Disassemble menu item under the Project menu in CodeWarrior allows you to view the compiler's output resulting from application of any optimization choices. This is extremely useful in that it gives you a chance to save an assembly listing of your code for use during debugging sessions. That menu item was used to produce the listings presented in this article.
Just Enough Assembly Language
Listed below are the PowerPC instructions used in this article. Following each is the expanded mnemonic, and a short description of what the instruction does (using the specified operands). The instructions are presented in mnemonic alphabetical order, not the order in which they are encountered in the code. For a thorough treatment of Power PC assembly language, refer to the references listed at the end of this article.
add r30,r30,r0
- Add: store the sum of the values stored in r0 and r30 in r30.
addi r30,r30,1
- Add Immediate: add 1 to the value stored in r30, and store the result in r30.
b *+12
- Branch: continue execution at the location 12 bytes (3 instructions) after this instruction.
bdnz *-80
- Branch if Decremented Count Register is Not Zero: decrement the Count Register, and if its value is not zero, continue execution 80 bytes (20 instructions) before this instruction.
bl.whileTest__Fv
- Branch and Link: branch to the location specified and continue execution, saving the address of the next instruction to be executed in this routine in the Link Register.
blt *-12
- Branch if Less Than: if the result of the previous comparison resulted in the Condition Register LT bit being set, continue execution at the instruction 12 bytes (3 instructions) prior to this instruction.
bne *-28
- Branch if Not Equal: if the result of the previous compare operation is not zero, as recorded in the Condition Register EQ bit, continue execution at the instruction 28 bytes (7 instructions) before this instruction.
cmplw r31,r29
- Compare Logical Word: perform an unsigned comparison between r29 and r31, and update the Condition Register as appropriate.
cmplwir31,$0000
- Compare Logical Word Immediate: compare the value in r31 against 0, and set the appropriate flags in the Condition Register.
li r30,0
- Load Immediate: move the value 0 into register r30.
lwzr3,gLimit(RTOC)
- Load Word and Zero: move the value of the global variable gLimit into register r3. Global variables are referenced as offsets from the Table of Contents register (RTOC).
mr r31,r29
- Move Register: copy the contents of r29 into r31.
mtctr r0
- Move To Count Register: copy the value in r0 to the Count Register (typically used for loop iteration).
mullw r0,r29,r28
- Multiply integer word, return Low Word: store the product of the values stored in r28 and r29 in r0.
stmw r26,-24(SP)
- Store Multiple Word: store registers r26-r31 beginning at the location -24 bytes off the current stack pointer location.
stw r31,-4(SP)
- Store Word: store register r31 at the location -4 bytes off the current stack pointer location.
subir31,r31,1
- Subtract Immediate: subtract 1 from the value in r31, and store the result in r31.
subic.r31,r31,1
- Subtract Immediate Carrying: similar to subi, but set the Exception Register CA bit with the result of the carry (whether it occurs or not).
Example Code
The archive for this article includes the .c and .dump files. Listing 1 contains the source code used to test the optimization options. For illustration purposes, some of the functions contain redundant or unused variables, unoptimized loops, and so on.
The profiler calls and header include statement are commented out. The sample project file in the archive includes the profiler library, so if you would like to use the profiler, simply uncomment the four lines in the source code, compile, and run.
Listing 1: test.c
test.c
Sample code with loops and variable usage.
#include <stdio.h>
#include <stdlib.h>
//#include <profiler.h>
unsigned long cseTest( void );
unsigned long doWhileTest( void );
unsigned long whileTest( void );
unsigned long forTest( void );
unsigned long gLimit = 1000L;
int main( void ) {
//ProfilerInit( collectSummary, bestTimeBase, 10, 3 );
//ProfilerSetStatus( 1 );
unsigned long theResult;
unsigned long j = 0L;
j++;
theResult = whileTest();
theResult = doWhileTest();
theResult = cseTest();
theResult = forTest();
j-;
//ProfilerDump( "\ptest.c.prof");
//ProfilerTerm();
return 0;
}
unsigned long whileTest( void ) {
unsigned long theResult = 0L;
unsigned long i, limit = gLimit;
i = 0;
while ( i < limit )
i++;
return 0;
}
unsigned long doWhileTest( void ) {
unsigned long theResult = 0L;
unsigned long i, limit = 1000L;
unsigned long dest[ 1000 ], index = 0;
i = limit;
do {
dest[ index++ ] = i;
i-;
} while ( i > 0 );
return 0;
}
unsigned long cseTest( void ) {
unsigned long theResult = 0L;
unsigned long i, limit = 1000L;
unsigned long product = 0L;
unsigned long x = 4;
unsigned long y = 3;
theResult = limit;
i = limit;
do {
product += x * y;
} while ( i- > 0 );
return 0;
}
unsigned long forTest( void ) {
unsigned long i, limit = 1000, sum = 0;
for ( i= 0; i < limit; i++ )
sum += i;
return sum;
}
Peephole Optimization
The peephole optimizer applies a set of pattern-matching rules to the code, with the goal of replacing inefficient sequences of instructions with more efficient sequences. A second goal is to perform some register usage optimizations based on variable lifetimes. Both benefits are shown in the following examples.
Note that function prolog and epilog code is only occasionally presented in the listings in this article. The full listings are included in the article archive.
In the first example, the non-optimized version of main() contains the following assembly code, where register r30 is used to hold the value of j:
int main( void ) {
//...
// Initialize j.
// unsigned long j = 0L;
li r30,0
// Increment j.
// j++;
addi r30,r30,1
// theResult = whileTest();
bl .whileTest__Fv
// ...
Enabling the peephole optimizer results in no action being taken on j, since its value is never used:
int main( void ) {
//...
// Initialize j.
// unsigned long j = 0L;
// Increment j.
// j++;
// No instructions generated!
// theResult = whileTest();
bl .whileTest__Fv
// ...
The second example uses the function doWhileTest(). The variable limit is accessed twice, but its value never changes. In addition, there is a subtract/compare instruction sequence. First, the unoptimized version:
unsigned long doWhileTest( void ) {
// ...
// unsigned long i, limit = 1000L;
// Initialize limit in r29.
li r29,1000
// i = limit;
// Copy limit into r31.
mr r31,r29
// i-;
// Subtract 1 from i (in r31).
subi r31,r31,1
// } while ( i > 0 );
// Compare to 0 and branch if not equal.
cmplwi r31,$0000
bne *-28
The peephole optimizer detects the redundant register usage for the variable limit, and the fact that the value of limit never changes. The peephole optimizer also finds the inefficient subtract/compare sequence. It transforms the function into:
unsigned long doWhileTest( void ) {
// ...
// Initialize limit in r31.
// This is the same register used for i, so no copy/move operation is required.
li r31,1000
// i-;
// Subtract with carry and set condition flags.
subic. r31,r31,1
// Branch based on result of subtraction operation.
// Comparison operation is implicit.
bne *-20
The profiler says:
The CodeWarrior profiler, which can be used to measure (among other things) the amount of time spent in various functions, showed the following execution times for doWhileTest():
Optimization Level | Function | Time (ms) | % of Total Time |
None | doWhileTest() | 0.131 | 39.3 |
Peephole | doWhileTest() | 0.105 | 34.4 |
These results were obtained using the bestTimeBase setting. At runtime, this translated into the PowerPC timebase setting, which uses the Real-Time Clock (RTC) register on PPC 601-based machines, and the Time Base (TB) register on later implementations, for accurate, low-overhead timing.
The fourth column is the percentage of total program running time spent in this function. This is very useful in determining where to begin optimizing. To use it properly, you should compare the results for all of the functions at a given optimization level. (The profiler summary for all functions and all optimization levels is included in the article archive.) You need to be careful, because the time and % of total time columns do not necessarily mirror each other. You will see in one of the examples (Global Optimization - Level 4) that sometimes an inverse relationship occurs: less time spent in a function may be accompanied by a greater percentage of total time!
In summary, even though its scope is relatively local, the peephole optimizer may be able to provide some performance benefits, depending on patterns it detects in the code. In this case, it decreased execution time slightly (by approximately 20%).
Global Optimization - Level 1
This level of optimization provides global register allocation and dead code elimination.
Here is the non-optimized version of whileTest():
unsigned long whileTest( void ) {
stw r31,-4(SP)
stw r30,-8(SP)
// unsigned long theResult = 0L;
li r0,0
stw r0,-16(SP)
// unsigned long i, limit = gLimit;
lwz r3,gLimit(RTOC)
lwz r30,0(r3)
// i = 0;
li r31,0
// while ( i < limit )
b *+8
i++;
addi r31,r31,1
cmplw r31,r30
blt *-8
The level 1 optimized version eliminates the non-volatile register saves, and uses volatile registers for local variables. Also, the assignment of theResult is eliminated, since its value is never used.
unsigned long whileTest( void ) {
// unsigned long theResult = 0L;
// unsigned long i, limit = gLimit;
lwz r3,gLimit(RTOC)
lwz r0,0(r3)
// i = 0;
li r3,0
// while ( i < limit )
b *+8
// i++;
addi r3,r3,1
cmplw r3,r0
blt *-8
The profiler says:
Optimization Level | Function | Time (ms) | % of Total Time |
None | whileTest() | 0.025 | 7.5 |
Global - Faster 1 | whileTest() | 0.025 | 7.5 |
In this case, the result is not faster, but it is more efficient in terms of number of instructions used (code size).
Global Optimization - Level 2
This level of optimization adds common subexpression elimination and copy propagation to those provided by level 1.
Here is the non-optimized version of cseTest():
unsigned long cseTest( void ) {
// Preserve the non-volatile registers (r26-r31).
stmw r26,-24(SP)
// unsigned long i, limit = 1000L;
li r27,1000
// unsigned long product = 0L;
li r30,0
// unsigned long x = 4;
li r29,4
// unsigned long y = 3;
li r28,3
// theResult = limit;
// i = limit;
mr r31,r27
// do {
// product += x * y;
// A multiply is an expensive (slow) operation.
mullw r0,r29,r28
add r30,r30,r0
// } while ( i- > 0 );
cmplwi r31,$0000
subi r31,r31,1
bne *-16
Enabling level 2 optimization changes the multiply instruction to an add (using a constant value of 12). As with level 1, the use of volatile registers (r0, r3-r12) replaces usage of non-volatile registers (r13-r31). As a result, the store multiple instruction goes away.
unsigned long cseTest( void ) {
// No store multiple register instruction.
// unsigned long theResult = 0L;
// ...
// theResult = limit;
li r4,0
// i = limit;
// do {
li r3,1000
// product += x * y;
// Add the constant value 12 to product.
addi r4,r4,12
// } while ( i- > 0 );
cmplwi r3,$0000
subi r3,r3,1
bne *-12
The profiler says:
Optimization Level | Function | Time (ms) | % of Total Time |
None | cseTest() | 0.131 | 39.3 |
Global - Faster 2 | cseTest() | 0.047 | 18.7 |
The optimized version is approximately 64% faster than the original!
Global Optimization - Level 3
This level of optimization adds loop transformations, strength reduction, and loop-invariant code motion to those provided by level 2.
Here is the non-optimized version of forTest():
unsigned long forTest( void ) {
stw r31,-4(SP)
stw r30,-8(SP)
stw r29,-12(SP)
// unsigned long i, limit = 1000, sum = 0;
li r29,1000
li r30,0
// for ( i= 0; i < limit; i++ )
li r31,0
b *+12
// sum += i;
add r30,r30,r31
addi r31,r31,1
cmplw r31,r29
blt *-12
Applying the optimizations results in an unrolling of the loop. This code now contains the equivalent of ten loop iterations in the body of the loop. Notice that the iterator (limit) has been reduced to the value of 100 to compensate.
unsigned long forTest( void ) {
// unsigned long i, limit = 1000, sum = 0;
li r3,0
li r4,0
b *+4
// for ( i= 0; i < limit; i++ )
// Iterate over the loop 1000 times (total).
// Use the count register to hold the adjusted value of limit (100).
li r0,100
mtctr r0
b *+4
// sum += i;
add r3,r3,r4
addi r4,r4,1
add r3,r3,r4
addi r4,r4,1
add r3,r3,r4
addi r4,r4,1
add r3,r3,r4
addi r4,r4,1
add r3,r3,r4
addi r4,r4,1
add r3,r3,r4
addi r4,r4,1
add r3,r3,r4
addi r4,r4,1
add r3,r3,r4
addi r4,r4,1
add r3,r3,r4
addi r4,r4,1
add r3,r3,r4
addi r4,r4,1
bdnz *-80
The profiler says:
Optimization Level | Function | Time (ms) | % of Total Time |
None | forTest() | 0.046 | 13.9 |
Global - Faster 3 | forTest() | 0.031 | 14.9 |
The optimized version is approximately 33% faster. Notice the increased percentage of total time: the time savings obtained in this loop must have been outweighed by more significant savings elsewhere in the optimized program.
Global Optimization - Level 4
This level of optimization adds repeated loop-invariant code motion and common subexpression elimination. The optimized code becomes a candidate for further optimizations, until no more opportunities are found.
For this example, we will compare the level 3 and 4 versions of the same function. Here is the level 3 optimized version of cseTest():
unsigned long cseTest( void ) {
// unsigned long theResult = 0L;
// unsigned long i, limit = 1000L;
// unsigned long product = 0L;
// unsigned long x = 4;
// unsigned long y = 3;
// theResult = limit;
// i = limit;
// do {
li r4,0
li r3,1000
b *+4
// product += x * y;
// } while ( i- > 0 );
addi r4,r4,12
cmplwi r3,$0000
subi r3,r3,1
bne *-12
Here is the level 4 version. Notice that the value of product is never calculated, probably because it is never used.
unsigned long cseTest( void ) {
// unsigned long theResult = 0L;
// ...
// do {
li r3,1000
b *+4
b *+4
// product += x * y;
// } while ( i- > 0 );
cmplwi r3,$0000
subi r3,r3,1
bne *-8
The profiler says:
Optimization Level | Function | Time (ms) | % of Total Time |
Global - Faster 3 | cseTest() | 0.046 | 22.4 |
Global - Faster 4 | cseTest() | 0.028 | 14.8 |
Again, much (39%) faster the second time around!
Faster vs. Smaller
Although I am not presenting examples using global optimizations for smaller code, the profiler results for those settings are included in the archive. For reference, here are the generated code sizes for faster vs. smaller, for each optimization level:
Faster 0 360 48 (constant)
Faster 1 260
Faster 2 220
Faster 3 308
Faster 4 316
Smaller 0 312
Smaller 1 260
Smaller 2 220
Smaller 3 236
Smaller 4 244
Faster/Smaller | Level | Code Size (bytes) | Data Size (bytes) |
Faster | 0 | 360 | 48 (constant) |
Faster | 1 | 260 | |
Faster | 2 | 220 | |
Faster | 3 | 308 | |
Faster | 4 | 316 | |
Smaller | 0 | 312 | |
Smaller | 1 | 260 | |
Smaller | 2 | 220 | |
Smaller | 3 | 236 | |
Smaller | 4 | 244 | |
The smaller option always generated code that was no larger, and usually smaller, than its faster counterpart at each optimization level. Applying levels 3 and 4 always resulted in increased code size (compared to level 2). However, the percentage increase from level 2 to level 4 for the smaller setting (10.9%) is much lower than for the faster setting (43.6%).
Be careful in trying to apply these results to another program. Each program has its own unique characteristics, which may respond favorably to some types of optimization but not to others. Plus, your needs may dictate which type of optimization to perform. For example, code destined for embedded systems may require the smaller option.
Doing it Yourself
Hand-optimized programs may still present the best performance. For example, you can optimize loops yourself, or include assembly functions mixed with high-level code. On the other hand, that approach is time-consuming, costly, and possibly less portable. Plus, not all developers have the necessary skills to hand-optimize. Relying on the development environment to perform optimizations is a good thing.
Writing a straightforward, correct program is key to generating an optimized program. That version becomes the starting point in the optimization process. Every subsequent adjustment or optimization, whether applied by the developer or the development environment, should only make the code faster or tighter.
Debugging efforts should focus on the unoptimized version, since the generated machine code most closely mirrors the original source. Once the program is bug-free (!), the desired type and level of optimization should be applied.
Conclusion
Optimization techniques can improve code size or speed. CodeWarrior provides several categories of optimization, including peephole, global, and instruction scheduling. You can apply any of these in combination via the project preferences dialog. It is worth reviewing your disassembled code to assess the impact the optimizations will have on subsequent debugging sessions. Debugging will be easier if performed on the unoptimized version of your code, since the optimized version of the code may not closely mirror the original source. Once the bugs are out, add the appropriate type of optimization.
References
- Compiler Construction, Niklaus Wirth, Addison-Wesley, 1996.
- Optimizing PowerPC Code, Gary Kacmarcik, Addison-Wesley, 1995.
- PowerPC Microprocessor Developer's Guide, John Bunda et al, Sams Publishing, 1995.
- Writing Compilers and Interpreters, Ronald Mak, Wiley Computer Publishing, 1996.
Credits
Thanks to Tom Thompson of Metrowerks for reviewing this article and providing insight into CodeWarrior's optimization process.
Archive URL
The article archive, containing a CodeWarrior project file, source code, and disassembler and profiler dump files, is available at <www.mactech.com/>.
Andrew Downs is a Technical Lead for Template Software in New Orleans, LA, designing and building enterprise apps. He also teaches Java programming at Tulane University College. Andrew wrote the Macintosh freeware program Recent Additions, and the Java application UDPing. You can reach him at andrew@downs.net.