Extreme Advantage
Volume Number: 11
Issue Number: 4
Column Tag: Screaming Performance
Taking Extreme Advantage of PowerPC
Performance analysis and tuning tips to help your programs scream!
By Eric Traut, Cupertino, CA, etraut@apple.com
Note: Source code files accompanying article are located on MacTech CD-ROM or source code disks.
The Importance of Performance
By now, most people in the computer industry have received the message: PowerPC offers impressive performance gains over older CISC architectures. This is especially obvious to Macintosh developers who have already ported their software to the new architecture. Many have seen performance gains in the 200-400% range. But is this enough?
Although it is true that personal computers consistently double in speed every 18 months or so, computer users continue to demand faster and faster machines. In fact, the general public is probably more aware of computer performance now than ever before. Showdowns between such rivals as PowerPC and Intel's Pentium are frequently covered in the media, increasing public awareness further. Developers continue to create more sophisticated and CPU-intensive applications. This all boils down to the fact that software performance tuning is more important than ever.
Customers demand low-cost, high-performance machines, and Apple has staked its future on PowerPC because it believes that RISC can deliver on these promises. However, the maximum benefit of RISC can be realized only with the support and cooperation of third-party software vendors. Many developers have already taken the first step and ported their products to run native on PowerPC-based Macs. The next step involves performance analysis, tuning and, most importantly, innovation. By taking advantage of new-found speed in imaginative ways, developers can deliver dazzling software that is easier and more enjoyable for users.
What Is Performance?
Although this article concentrates on CPU execution time, it should be noted that performance can (and should) be considered at many different levels. For example, perceived performance often has little or no relation to actual performance. Perceived performance of an operation can be increased simply by providing user feedback. This may include spinning beach-ball cursors, status bars, or other visual progress indicators. Along with perceived performance comes the concept of responsiveness. How quickly does a program react to a user event in some perceivable way? Both perceived performance and responsiveness are important considerations when designing and implementing Macintosh software.
Unlike perceived performance, actual performance can be objectively measured. For many types of software, actual performance can be measured with a stopwatch. Other types of software may be optimized for greater bandwidth, lower latency, or a greater number of events per unit time. For example, QuickTime performance is often measured in frames per second, whereas bandwidth and latency are more important for networking software. Before tuning can begin, it is important to understand which performance measurement is most appropriate for the piece of software being analyzed. Without this understanding, there is no way to determine whether performance tuning efforts have been successful.
Figure 1: Performance analysis and tuning process
Performance Tuning
Once the objectives for analyzing a piece of software are established, and a means for measuring performance has been identified, tuning can begin. This usually requires an iterative process involving a few simple steps. Tuning efforts should proceed until performance objectives are met or further tuning would result in little or no improvement.
The first step of the process typically requires the use of performance tools to identify hot spots within the code. These hot spots represent portions of the program which are critical for performance. Developers often assume they know where the hot spots are in their programs. Though they are often correct, a bad assumption can result in lost time and energy. In most cases, profiling and tracing tools will quickly and accurately identify hot spots.
How should you decide whether a piece of code is worth tuning? Ask yourself how much overall performance would increase if the tuning produced a best-case result. For example, if a particular subroutine accounts for 4% of the total execution time, the best that can be done by tuning this subroutine is an overall speedup of 4%. If this were the case, the subroutine would probably not be worth tuning unless all other efforts had failed to achieve the performance goal.
Once a hot spot is identified, efforts should turn to detecting and removing bottlenecks, or ways in which the hardware is being used inefficiently. Bottlenecks come in many different forms, and one of the biggest challenges to performance tuning is sniffing them out and removing them. Much of the remainder of this article discusses various types of bottlenecks and ways to remove them.
Performance Analysis Tools
Keep in mind that removing a bottleneck which is not within a hot spot will have little or no effect on overall performance. Identifying hot spots can be challenging, but tools exist to help make the job easier. There are two main classes of analysis tools: tracers and profilers.
Tracers are used to record a series of events, often time-stamping each event as it is logged. Tracers are most useful when event ordering is important or when a distribution of timings is needed. The disadvantage of tracers is resource consumption. Because events are continually logged, traces can take up large amounts of RAM and disk space. This volume of data is not only difficult to store, but also difficult to analyze. However, tracers still have their uses. For example, a tracer was used by Apple engineers to identify common system call chains. This information was then used to determine which portions of the system to port to PowerPC first.
Profilers also record information throughout the execution of the program, but no ordering information is maintained. Rather, information is totaled or averaged throughout the profiling session. This means that profiler output does not consume as much space as that of tracers, and the resulting data is often easier to understand.
A number of profiling tools are available to Mac developers. Most compilers provide options to generate profiling code which collects function-level statistics at runtime. These statistics usually include an execution count and an average execution time for each profiled routine.
Another very useful profiling tool is built into Apple's Macintosh Debugger for PowerPC. This debugger, which can be used in a one- or two-machine environment, makes use of a mechanism called adaptive sampling profiling, or ASP. When enabled, ASP periodically samples the program counter. These sampled addresses are mapped to buckets which represent ranges of addresses. Each time the program counter falls within a bucket, the bucket's counter is incremented. When a counter reaches a certain level, ASP splits the bucket into two smaller buckets. This allows the tool to adapt to the code being profiled, providing higher resolution for hot spot areas. At the end of an ASP session, the tool attempts to map all address ranges to code symbols. It also produces a report which details how much time is spent within the system software as well as the percentage of time spent in emulated code. All of this information is useful in locating hot spots and removing bottlenecks.
Removing Hot Spots
Once a hot spot is found, an attempt should be made to remove the hot spot completely. This can be done by analyzing program flow. How is the hot spot being entered? Is the routine which contains the hot spot being called more often than necessary? Is there an algorithmic change which will alleviate some of these calls (e.g., a caching mechanism)?
Some of these types of problems are easy to remedy. Take, for example, the following code snippet:
// this code draws a column of strings within the window
while (stringIndex < stringCount) {
    SetPort(theWindow);
    TextFont(geneva);
    TextSize(12);
    ForeColor(blackColor);
    MoveTo(h, v);
    v += 14;
    DrawText(stringArray[stringIndex++]);
}
This loop has a number of performance problems. First, it is clearly unnecessary to set the port, change the text characteristics, and set the drawing color each time through this loop. Second, if the number of strings is sufficiently large, it is unlikely that all of the strings will be visible within the window at one time. DrawText is an expensive routine, and it should be avoided if it can be determined that the text will fall outside of the window's clip region.
A profile of the above code would probably flag various QuickDraw routines as hot spots. When Mac toolbox routines show up high on the list of hot spots, the only recourse is to attempt to call these functions less frequently.
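To make these problems concrete, here is one way the loop could be restructured. This is only a sketch: it assumes the strings are drawn top to bottom and that a hypothetical windowBottom variable holds the bottom coordinate of the window's visible area.

// set up the port and drawing state once, outside the loop
SetPort(theWindow);
TextFont(geneva);
TextSize(12);
ForeColor(blackColor);

while (stringIndex < stringCount) {
    // stop before calling DrawText for rows that fall below the
    // visible area (windowBottom is a hypothetical variable)
    if (v > windowBottom)
        break;
    MoveTo(h, v);
    v += 14;
    DrawText(stringArray[stringIndex++]);
}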
Toolbox hot spots sometimes occur when code makes implicit assumptions about processor speed - assumptions which may no longer be true on PowerPC processors. For example, take a look at the following loop:
// calculate vectors until complete or user presses command-period
while (!done && !canceled) {
    done = CalculateVectors(vectorArray, kVectorsPerCall);
    canceled = DetectCancelEvent();
}
This code contains a constant kVectorsPerCall which is used to control the number of vectors processed before checking for a user event. Although this constant may have been appropriate on a Mac Plus, it is very likely too small for a Power Mac. Checking for user events is only necessary every three to six ticks. This code may be spending a large percentage of its time in the Event Manager checking for a command-period. The code should probably be changed to something like the following:
// calculate vectors until complete or user presses command-period
prevTicks = TickCount();
while (!done && !canceled) {
    done = CalculateVectors(vectorArray, kVectorsPerCall);
    if (TickCount() - prevTicks >= kUserInteractionTicks) {
        canceled = DetectCancelEvent();
        prevTicks = TickCount();
    }
}
This code will dynamically adjust to the speed of the processor. A similar technique should be used for the main event loop of an application. Many native PowerPC programs spend more time than necessary within WaitNextEvent because of implicit assumptions about processor speed. There is no need for the Event Manager to become a hot spot within your program.
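As an illustration, a main event loop can use the same tick-based throttle. The sketch below assumes hypothetical application routines DoBackgroundWork and HandleEvent, and a constant kEventLoopTicks in the three-to-six-tick range suggested above.

Boolean quitting = false;
unsigned long lastEventTicks = TickCount();

while (!quitting) {
    DoBackgroundWork();
    if (TickCount() - lastEventTicks >= kEventLoopTicks) {
        EventRecord event;
        // check for events only every few ticks, no matter how quickly
        // the processor gets through DoBackgroundWork
        if (WaitNextEvent(everyEvent, &event, 0, NULL))
            quitting = HandleEvent(&event);
        lastEventTicks = TickCount();
    }
}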
Within some types of code, it is difficult to determine why a particular piece of code is a hot spot. This is especially true of C++, where a simple variable declaration can result in the execution of a relatively complex and costly constructor routine. Operations which are normally quite simple (e.g., addition) can be overloaded with a costly replacement. These features of C++ offer incredible flexibility, but they can easily create hot spots which are difficult to track down. Use these features carefully.
Identifying and Removing Bottlenecks
Unlike Mac toolbox routines, most functions within your program are entirely under your control. If there is a good reason for a particular function to contain a hot spot, then the next step is to remove as many bottlenecks as possible.
Below are some of the most common bottlenecks with discussions of how to remove them. This is by no means an exhaustive list of all types of bottlenecks. Hopefully, I have touched upon most of the common performance hits.
Note that all examples are given in C. On RISC platforms, well-written C code compiled with a good optimizing compiler usually performs as well as (and sometimes better than) hand-written assembly code. Does this mean it is not important to learn the basics of the PowerPC instruction set? Unfortunately, no - not if you wish to become good at performance tuning. One of the most effective ways to track down bottlenecks is to use a disassembler. The disassembled code listings will provide clues as to how the compiler interpreted the original high-level code and often suggest ways in which the original code can be improved.
Picking a Better Algorithm
The first thing to consider when examining a hot spot is whether the chosen algorithm is appropriate. Choice of algorithms should take into consideration both the expected data size and the precision of the operation. Computer science courses teach the importance of selecting algorithms which typically require less work, using big-O notation, O(n), to describe the expected amount of work in terms of the number of elements being worked on. These courses teach that it's better to choose O(log n) algorithms over O(n), and O(n) over O(n²). Keep in mind, however, that an algorithm with a worse asymptotic bound may actually be cheaper if n is constrained to a small enough value.
Also remember that operations which result in more precision usually, but not always, take more time. For example, calling a library routine to compute a cosine may return 24 bits of precision when your program only really needs 8 bits. In this case, it may be best to write a cosine approximation function and avoid the library call.
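As an illustration, here is a minimal sketch of a table-driven cosine that trades the library routine's precision for a simple lookup. The table size and the angle representation (a fraction of a full circle) are assumptions made for the example; roughly 8 bits of angular resolution is enough for many drawing tasks.

#include <math.h>

#define kCosTableSize 256

static float gCosTable[kCosTableSize];

// build the table once at startup (about 1K of storage)
static void InitCosTable(void)
{
    long i;
    for (i = 0; i < kCosTableSize; i++)
        gCosTable[i] = (float) cos(2.0 * 3.14159265358979 * i / kCosTableSize);
}

// return an approximate cosine for an angle expressed as a
// fraction of a full circle (0.0 up to, but not including, 1.0)
static float FastCos(float turns)
{
    long index = (long) (turns * kCosTableSize) & (kCosTableSize - 1);
    return gCosTable[index];
}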
Time critical routines should avoid calling extremely general functions. Take, for example, CopyBits. This routine has been optimized for certain common cases, but still needs to take into account alignment, transforms, colorizing, clipping, pixel resizing, and various other GrafPort state. Code which copies between two off-screen buffers can often make certain assumptions about those buffers (e.g., they are both eight-byte aligned, have the same background colors, and are the same pixel depth). A custom routine for copying between these buffers would probably be more efficient than calling the generic CopyBits routine.
Care should be taken when replacing library or toolbox calls - only do this when performance demands it and the benefits are obvious. For example, QuickDraw routines are handled in hardware on some graphics accelerator boards. Avoiding a QuickDraw routine may actually be slower in this case.
Register Usage
Most computers today make use of a simple memory hierarchy to compensate for the inherent speed limitations of larger storage media. On today's Power Macs, the memory hierarchy consists of secondary storage (i.e., virtual memory backing store), main memory (DRAM), second-level cache (SRAM), on-chip first-level cache(s), and processor registers. This hierarchy has been optimized to take advantage of typical memory access patterns. It is important to tune software so that it can take maximum advantage of the memory hierarchy and avoid costly access delays.
The PowerPC architecture, like many RISC architectures, defines 32 general purpose (integer) registers and 32 floating point registers. Performance is very dependent on the efficient use of these registers. Compilers play the most important role in optimizing the use of registers, but there are a number of techniques which can be employed by programmers to improve register usage.
Large, complex routines often contain many local variables. When compilers cannot fit all variables within processor registers, they are forced to spill them to memory. Routines should be kept simple enough to avoid this occurrence wherever possible.
Most RISC compilers perform live variable analysis to determine which variables contain live information at each point within the function and register coloring to efficiently assign these variables to machine registers. These techniques often allow a single register to be used for multiple local variables. Live variable analysis and register assignment are complex operations, and some compilers do a better job than others. Programmers can help improve register usage by defining local variables only within the scope in which they are needed. Take, for example, the following piece of code:
void UnoptimizedRegUsage(long *a, long *b)
{
    long temp;

    if (*a > *b) {
        // swap values if necessary
        temp = *a; *a = *b; *b = temp;
    }
    *a = MungeNumbers1(*a, *b);
    *b = MungeNumbers2(*a, *b);
    if (*a > *b) {
        // swap values if necessary
        temp = *a; *a = *b; *b = temp;
    }
    *a = MungeNumbers1(*a, *b);
    *b = MungeNumbers2(*a, *b);
}
This code makes use of a temporary variable to swap the values pointed to by a and b. Notice that although the variable temp is used twice, it is not live between the two uses, nor is it needed after its final use. The function also has a second performance problem. All references to the input values are made by dereferencing a and b. If we know it is not important to update *a and *b each time, we can store these values in temporary variables. It is often difficult or impossible for compilers to make this assumption. The new code would then look like this:
void OptimizedRegUsage(long *a, long *b)
{
    long localA = *a, localB = *b;

    if (localA > localB) {
        // swap values if necessary
        long temp;
        temp = localA; localA = localB; localB = temp;
    }
    localA = MungeNumbers1(localA, localB);
    localB = MungeNumbers2(localA, localB);
    if (localA > localB) {
        // swap values if necessary
        long temp;
        temp = localA; localA = localB; localB = temp;
    }
    *a = localA = MungeNumbers1(localA, localB);
    *b = MungeNumbers2(localA, localB);
}
It appears we have made the code longer, but the resulting machine code will be shorter and very likely faster. Most important, the resulting code contains fewer loads and stores. Even if cache misses are unlikely (as in this case where we are accessing the same addresses over and over again), high-end processors like the 604 will be able to execute multiple arithmetic operations in the same cycle, whereas loads and stores are limited to one per cycle.
Due to the PowerPC runtime architecture, global and local static variables require special attention when used within a performance-critical piece of code. These classes of variables are stored within a program's data section and are referenced through the TOC (table of contents). Reading a global variable involves accessing its address by looking it up in the TOC, then dereferencing this address to get to the data. If a global or static local is used multiple times throughout a function, most compilers will generate code which performs the TOC lookup only once. However, because the compiler does not know if the global will be required or changed by other portions of the program (which could be running asynchronously to the current code), it must always perform the second dereference. If you know that the global is not needed or changed by other functions throughout the execution of the current procedure, it is often faster to assign the global to a local variable. Take, for example, the following code snippet in which a single global variable is used three times:
for (i = 0; i < kLoopCount; i++) {
    SetVectorOrigin(gCurrentH, v);
    CalcVector(gCurrentH, v + gCurrentH);
    gCurrentH += 10;
}
We can tighten up the inner loop by changing the code to this:
short localH = gCurrentH;

for (i = 0; i < kLoopCount; i++) {
    SetVectorOrigin(localH, v);
    CalcVector(localH, v + localH);
    localH += 10;
}
gCurrentH = localH;
Cache and Virtual Memory Usage
Poor register usage can slow down a program by a factor of two to four. Poor cache usage is much worse and can slow down a program by five to ten times. Misuse of virtual memory can cause code to run thousands of times slower yet. It is therefore very important to be aware of cache and VM usage in time critical pieces of code.
Both caching and virtual memory rely on spatial and temporal locality. Spatial locality means that if a memory address has just been accessed, it is likely that other memory addresses nearby will be accessed. Temporal locality means that if a memory address is accessed, it is likely that the same address will be accessed again within a small period of time. The key to tuning for caches and VM is to increase both types of locality within your code. Keep in mind that a single virtual memory page on today's Power Macs is 4K, and most PowerPC processors handle caching on a 32-byte basis.
There are three types of memory accesses which occur during the execution of most code: data loads, data stores, and instruction fetches. All of these accesses require the processor to perform address translation to turn the effective address into a physical address. To speed up this operation, processors make use of a translation look-aside buffer (or TLB for short). TLB entries are allocated on a per-page basis. This means code which erratically accesses many different pages (i.e., code with poor locality) will pay the penalty for missing in the TLB. Each TLB miss results in a reload which requires on the order of 10 to 50 cycles depending on the processor. This type of code is also likely to cause more page faults, each of which can take hundreds of thousands of cycles to resolve.
So, if locality is important, how can it be improved? Here is a piece of code which has poor data locality:
char myArray[kMaxY][kMaxX];

// initialize array
for (x = 0; x < kMaxX; x++)
    for (y = 0; y < kMaxY; y++)
        myArray[y][x] = x;
In this example, the doubly-nested loop traverses a two-dimensional array. Note that the inner loop increments y, the left-most array index. Unfortunately for this piece of code, C lays out arrays such that elements indexed by the right-most index are contiguous in memory. Each time we go through the inner-most loop, we are moving kMaxX bytes ahead in memory. By simply switching the two for statements, the code will access contiguous bytes each time through the loop.
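The repaired loop looks like this; nothing else changes:

// initialize array, walking each row's contiguous bytes in order
for (y = 0; y < kMaxY; y++)
    for (x = 0; x < kMaxX; x++)
        myArray[y][x] = x;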
This example was fairly trivial, but it provides a taste of locality optimizations. In general, it is best to access consecutive addresses of memory in ascending order.
Within the PowerPC architecture, only a few instructions (loads, stores and a few cache-related operations) can result in a data memory access. However, the processor must perform an instruction fetch for every instruction. Like data, instructions are cached to speed up fetches, so code locality is also important for performance.
One of the most obvious ways to increase code locality is to make sure that functions which call each other are located near each other. This increases the likelihood that they lie on the same page. This can be done at the source code level, since most compilers and linkers preserve the order in which functions are implemented within the source.
More subtle locality issues come into play within functions. Here, page locality is not so much of an issue, but cache locality can be. Take a look at the following piece of code:
myHandle = NewHandle(sizeof(DataStructure));
if (myHandle == NULL) {
    // complex tear-down operation
} else {
    // initialize data structure
}
This piece of code is testing for an exceptional case (i.e., not enough memory to allocate the handle) and handling it appropriately. However, exceptional cases, by definition, do not occur most of the time. This means it would be better to place the exception handling code someplace else - where it is not taking up local cache space. In the above example, this can be easily accomplished by negating the conditional and swapping the two branches of the if-else statement.
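After the swap, the common path falls straight through and the rarely executed code sits out of the way:

myHandle = NewHandle(sizeof(DataStructure));
if (myHandle != NULL) {
    // initialize data structure (the common case falls through)
} else {
    // complex tear-down operation (the exceptional case branches away)
}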
Most real functions are more complex and contain a number of exception-handling sections. For dealing with these efficiently, I suggest using the exception macros presented by Sean Parent in the Aug. 1992 issue of develop. These macros are defined in Exceptions.h which is provided by Apple on the E.T.O. disk. They make use of the C goto statement so all exception handling code can be placed at the end of the function.
In any case, attempts should be made to make the normal code paths as contiguous and linear as possible.
Video Buffer Accesses
Although not recommended by Apple for most applications, direct access to the video buffer gives certain programs the added video performance they need for smooth animation. Accesses to Power Mac video buffers are different from normal memory accesses in one very important way. Most of the Power Mac RAM is mapped copy-back-cacheable. This means that a store operation writes its data to the first-level cache, but not immediately to memory. Main memory is updated only when the dirty cache block is replaced within the cache.
If the video buffer were mapped copy-back-cacheable, pixels would appear sporadically on the screen. For this reason, video buffers are marked either write-through-cacheable or non-cacheable. Either way, stores to video are immediately written to the buffer. Many such stores in a row will fill up the on-chip store queues and stall the processor until pending data transfers are complete. A single store to video can take from 6 to 18 cycles, depending on the processor clock rate and the speed of the video subsystem. Ideally, programs could take advantage of these stalls to execute arithmetic operations which do not involve loads or stores. Unfortunately, interleaving blitting and calculations is extremely difficult.
If code is stuck with these multi-cycle stalls, the best we can do is to make optimal use of each store. With integer registers, it is possible to store four bytes at a time. By storing data from the floating point registers, the code can store eight bytes at a time. Take, for example, the following blitting loop which performs pixel smearing, effectively doubling the width of the image as it draws it to the video device:
// Copy scan line from off-screen buffer to screen, doubling the width
// of the image in the process. Assumes 8-bit video.
void DoubleCopyPixels1(unsigned short *src, unsigned long *dest)
{
    long column;

    for (column = 0; column < kRowBytes/2; column++) {
        unsigned long pixels;
        pixels = *src++;
        *dest++ = TWO_TO_FOUR(pixels);
    }
}
If we can assume the destination is eight-byte aligned, and kRowBytes is divisible by four, then we can rewrite the code:
// Copy scan line from off-screen buffer to screen, doubling the width
// of the image in the process. Assumes 8-bit video.
void DoubleCopyPixels2(unsigned short *src, double *dest)
{
    long column;

    for (column = 0; column < kRowBytes/4; column++) {
        double eightPixels;
        unsigned long pixels;
        pixels = *src++;
        ((unsigned long *) &eightPixels)[0] = TWO_TO_FOUR(pixels);
        pixels = *src++;
        ((unsigned long *) &eightPixels)[1] = TWO_TO_FOUR(pixels);
        *dest++ = eightPixels;
    }
}
Once again, it appears we have made the code longer and more complex. In fact, we are doing three stores each time through the loop now instead of one. Luckily, two of these stores are to a cacheable area of RAM, so they are not very expensive. We more than make up for the added complexity by doubling our bandwidth to the screen. Both QuickDraw and QuickTime make use of this technique to accelerate common blitting operations.
Data Alignment
All PowerPC instructions fall on four-byte boundaries, but data is byte-addressable. Most processors are optimized for accesses which are aligned to the data length, i.e., doubles should be aligned to eight bytes, longs to four, etc. PowerPC C compilers typically default to this type of alignment for structures. For backward compatibility, however, it is often necessary to use 68K-style alignment which requires a maximum of two-byte alignment within structures.
The standard universal headers released by Apple include compiler directives which force 68K alignment for old toolbox data structures. All new system data structures defined by Apple will use PowerPC alignment. Developers should also attempt to migrate toward natural PowerPC alignment wherever possible. For compatibility reasons, this may be impossible for structures which are stored to disk or sent over the network. However, most internal data structures can be changed from 68K to PowerPC alignment with no undesirable side effects (except for, perhaps, a slight increase in memory usage).
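For compilers that accept Apple's alignment pragmas (as the universal headers use), the split can look something like the following sketch; the structure itself is made up for illustration:

// on-disk structures keep 68K packing for file-format compatibility
#pragma options align=mac68k
typedef struct {
    short   version;
    long    fileOffset;
    double  timeStamp;
} DiskRecord;
#pragma options align=reset

// purely internal structures use the compiler's natural PowerPC
// alignment (the default), trading a little memory for faster access
typedef struct {
    short   version;
    long    fileOffset;
    double  timeStamp;
} InternalRecord;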
What is the cost of using misaligned data accesses? This depends on the instruction stream and the processor. Most PowerPC processors have a load latency of one cycle, i.e., the data being loaded cannot be used for one cycle after the load executes. This latency typically doubles or triples for misaligned accesses. Accesses are even more expensive when they cross cache-block boundaries. Note that this will never occur for properly aligned data. On the 601, accesses which cross page boundaries are even more expensive; they force an alignment exception which is handled by a low-level software exception handler.
Extra special care should be used when accessing floating point values (either doubles or floats). Some processors, including the 603 and 604, force an alignment exception when floating point loads and stores access a misaligned value. This means, for example, that the optimized blitting loop shown in the previous section would be 30 to 50 times slower if care were not taken to align the accesses.
Data Types
The choice of simple integer data types can affect performance. On early 68K processors, 16-bit values and operations were often faster than 32-bit. Just the opposite is true on PowerPC. The use of signed short and signed char operations can force the compiler to emit extra, often unnecessary, sign-extension instructions. Take, for example, the following piece of code:
long SumIntegers(short a, short b)
{
    long tempSum = 0;

    for (; a > 0; a--, b--)
        tempSum += a + b;
    return tempSum;
}
Simply changing the two input parameters from shorts to longs increases the speed of this loop by 30% on a PowerPC 601. (A 604, with its multiple integer execution units, can execute the extra sign-extensions for free, but these extra instructions still consume valuable space within the instruction cache.)
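The faster variant just widens the parameters; nothing else about the routine changes (it is renamed here only to keep the two versions distinct):

// with long parameters, the compiler no longer needs to emit
// sign-extension instructions inside the loop
long SumLongs(long a, long b)
{
    long tempSum = 0;

    for (; a > 0; a--, b--)
        tempSum += a + b;
    return tempSum;
}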
In general, 32-bit integer quantities are more efficient than 16 or 8-bit, and unsigned are more efficient than signed.
Complex Arithmetic Operations
Most arithmetic and logical operations on PowerPC processors require a single cycle and have no added latency, i.e., the results of the operation can be used in the next cycle. There are a few complex arithmetic instructions, however, which take multiple cycles. Multiplies, for example, typically take between two and five cycles, and divides between 20 and 40 cycles. The actual cycle counts depend on the processor and the significant digits within the operands.
Multiplication and division are both very useful operations, so how can they be avoided in a time-critical loop? Most compilers attempt to optimize multiplication by a power of two by replacing the expensive operation with a single-cycle shift. There are occasions where multiplication by other constant values can be optimized in a similar manner. Take for example, the following calculation:
a = b * 40;
If we notice that 40 equals 32 plus 8, we can rewrite this as:
a = (b * 32) + (b * 8);
The compiler will now use shifts to optimize the code. The cycle count goes from five to three on a 601 and from four to two on a 604. (Unfortunately, it increases from two to three cycles on a 603, which has a relatively fast multiplier.) Note that we've increased the generated code size (one multiply becomes two shifts and an add).
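Written out by hand, the shift form the compiler arrives at looks like this (rarely worth doing yourself, but it shows where the cycle counts come from):

// b * 40 decomposed into two single-cycle shifts and an add
a = (b << 5) + (b << 3);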
Multiplies are not much more expensive than a series of shifts and adds, so we cannot win much with this optimization, but division is so expensive that there is much more room for improvement. Division by a power of two can also be changed into a shift as long as the dividend is unsigned. However, other constant values are more difficult. It is often necessary to give up some precision to construct a cheap replacement. Take, for example, the following calculations:
a = b / 13;
c = d / 40;
This piece of code can be replaced by:
// divide b by 12.800
a = (b / 16) + (b / 64);
// divide d by 39.385
c = (d / 64) + (d / 128) + (d / 512);
In this example, assuming b and d are unsigned, the compiler can use right shifts rather than expensive divisions, and the quotients will be accurate to within 1.5% (practically close enough for a Pentium!). How much speed has the change bought us? On a 601, the cycle count goes from 74 to 8; on a 603 from 76 to 8; on a 604 from 41 to 4. In short, this optimization is a huge win!
The large difference in performance between multiplication and division makes it tempting to try to replace one with the other. This is possible in some scenarios, as demonstrated in the following piece of code:
for (i = 0; i < kMaxI; i++)
    dest[i] = src[i] * a / b;
Notice that values of a and b do not change throughout the loop. It would be much faster to calculate the fraction once and then use a single multiply within the loop. Unfortunately, calculation of the fraction a/b will result in a loss of significant digits. Luckily, if we can assume a and b are both 16 bits in length, we can use a shift operation to maintain more significant digits. The above code can be changed to:
unsigned long fract = (a << 16) / b;

for (i = 0; i < kMaxI; i++)
    dest[i] = (src[i] * fract) >> 16;
The cycle count for the inner loop goes from 49 to 26 on a 601 and from 28 to 9 on a 604 - another huge performance win!
Floating Point Operations
One of the strengths of PowerPC processors is floating point performance. Unlike older architectures, PowerPC defines double precision floating point as mandatory for architectural compliance. This allows developers to use floating point without worrying about terrible performance on low-end platforms. If floating point makes sense in your program, by all means use it. Do not attempt to avoid it by using a complex fixed-point scheme as would have been suggested on older processors.
Floating point calculations are fast for two primary reasons. First, most PowerPC processors dedicate significant silicon area to floating point calculations. In fact, floating point multiplication and division are faster on most processors than the corresponding integer operations. Second, floating point is handled by a separate functional unit which operates, to a great extent, independently of the other functional units. While the other functional units are busy handling loads, stores, branches, and integer calculations, floating point operations can be interleaved with little extra cost.
Keep in mind that single-precision floating point (32-bit) operations are faster on most processors than double-precision (64-bit). Long double (128-bit) operations are not supported in hardware and must be emulated in software, often with a large performance hit. Be careful when selecting the level of precision your program requires.
Because the floating point unit is relatively independent of the other functional units, the PowerPC architecture includes no integer-to-floating-point conversion instructions. This means that frequent use of operations which intermix integer and floating point values can result in suboptimal performance. Both explicit and implicit type conversions should be kept to a minimum.
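As an illustration, a loop that fills a table of function values can hoist the integer-to-floating-point work out of the loop entirely by keeping a running floating point value. The routine and its names are hypothetical; sin comes from the standard math library.

#include <math.h>

// The naive form, results[i] = sin(i * step), converts the integer i
// to floating point on every iteration. Advancing a double instead
// keeps the loop body free of mixed integer/floating point operands.
static void FillSineTable(double *results, long count, double step)
{
    long i;
    double angle = 0.0;

    for (i = 0; i < count; i++) {
        results[i] = sin(angle);
        angle += step;
    }
}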
Branch Optimizations
Branches play a critical role in most programs, but they unfortunately don't perform any useful computation. For this reason, PowerPC processors attempt to fold branches out of the instruction stream to keep the other functional units fed without skipping a cycle. The majority of branches within a typical program are conditional, and good performance is dependent on the processor being able to predict whether or not each of these branches is going to be taken.
The 601 and 603 use simple rules to predict the direction of a branch. If a conditional branch is forward (i.e., has a positive offset), it is predicted to be not taken; backward branches are predicted taken. The architecture defines a special prediction override bit which can be set within the branch instruction to indicate that the normal prediction is probably not optimal. Unfortunately, compilers do not usually have enough information to accurately set these bits, and there is no way to tell the compiler which conditionals will typically be true. Most compilers simply leave these bits clear. This means that it is important to structure conditional statements in a way which will result in the best branch prediction. Since forward branches are predicted not taken, it is best to put the most frequently executed code in the if clause of an if-else statement.
The 604 uses a more sophisticated mechanism for branch prediction. It dynamically tracks branches under the assumption that a branch which was recently taken will probably be taken the next time it is encountered. There is very little that can be done from a programming perspective to increase the branch prediction success rate in this case.
Typical code often makes use of counters which are decremented each time through a loop. For this reason, many processor architectures provide instructions which perform a combined decrement/branch operation. This is true of both the 68K and PowerPC architectures. However, due to minor differences, high-level loops should be coded differently. Take the following piece of code written for the 68K:
// this loop initializes a block of memory to zero
for (x = blockLength - 1; x >= 0; x--)
    *blockPtr++ = 0;
The 68K dbra (decrement and branch always) instruction can be used at the end of this loop to decrement x and terminate the loop when the count becomes negative. The PowerPC bdnz (decrement and branch if not zero) instruction performs a similar function, but terminates the loop when the loop count reaches zero. Therefore, most compilers will generate better PowerPC code if the above example is changed to the following:
// this loop initializes a block of memory to zero
for (x = blockLength; x > 0; x--)
    *blockPtr++ = 0;
Subroutine Calls
Because of the PowerPC's ability to fold branches, subroutine inlining and loop unrolling are not as important as on other RISC architectures. In fact, unless a subroutine or loop is extremely simple, it is often best to avoid inlining and loop unrolling, as these will only serve to increase code size (along with instruction cache misses, page faults, etc.).
Not all subroutine calls are this cheap, however. When the callee is implemented in a different shared library, an additional cost is incurred. These types of calls are called cross-TOC because the compiler-generated code performs a double indirection through the TOC to get to the subroutine. The code must first load a pointer to a transition vector which lives in the callee's data section. The transition vector contains a pointer to the actual subroutine code as well as a new TOC pointer for the callee's library.
At compile time, the compiler cannot usually determine whether a subroutine will be located within the same library as the caller, so it generates the following code sequence:
        bl      subroutine
        nop
When all object files are linked to form a single library, the linker fills in the correct branch offset, and the nop instruction is left as is. If the linker does not find the subroutine within the library, it assumes it is an import symbol and generates a small glue stub. This glue is responsible for saving the current TOC pointer, loading a transition vector pointer from the current TOC, loading a new TOC pointer and code address from the transition vector, and finally jumping to the subroutine. The nop instruction is overwritten with an instruction which restores the caller's TOC pointer after the return from the subroutine. The typical code sequence looks like the following:
        bl      subroutine_glue              ; call the cross-TOC glue
        lwz     rTOC,20(SP)                  ; restore caller's TOC (replaces the nop)

subroutine_glue:
        lwz     r12,subroutineTVOffset(rTOC) ; load the transition vector pointer
        stw     rTOC,20(SP)                  ; save the caller's TOC on the stack
        lwz     r0,0(r12)                    ; load the callee's code address
        lwz     rTOC,4(r12)                  ; load the callee's TOC pointer
        mtctr   r0
        bctr                                 ; jump to the callee
The entire cross-TOC call sequence adds approximately ten cycles to the subroutine call, assuming there are no cache misses. Care should be taken to minimize the use of cross-TOC calls within time-critical code if possible.
When the linker determines that the subroutine is in the same library as the caller, the extra nop instruction is left unmodified. Luckily, the 603 and 604 fold nops out of the instruction stream with a zero-cycle penalty, but they still take up extra space. One way to help reduce the number of these unneeded instructions is to define functions as static where appropriate. When the keyword static appears before a function prototype, the compiler can assume the routine will be implemented within the current file. The compiler can also assume a subroutine is local if its implementation appears before it is called. Take, for example, the following code sequence:
OSErr caller(void) { return callee(); }
OSErr callee(void) { return noErr; }
It is better to either reverse the implementations or declare the callee as static (if appropriate):
static OSErr callee(void);
OSErr callee(void) { return noErr; }
OSErr caller(void) { return callee(); }
Parameter Passing
The PowerPC runtime architecture makes use of the large on-chip register files wherever possible to reduce parameter passing overhead. Most CISC architectures, including the 68K, use stack-based calling conventions for most high-level languages. This means that passing small structures by reference is less costly than passing them directly, in which case the entire structure must be copied. This is not necessarily the case on the PowerPC. Take, for example, the following routine:
long DotProduct1(Vector *v1, Vector *v2)
{
    return v1->x * v2->x + v1->y * v2->y;
}
Assuming the caller typically has the vector coordinates in registers, the subroutine can be optimized by passing the x and y coordinates of the two vectors by value.
long DotProduct2(long v1x, long v1y, long v2x, long v2y)
{
    return v1x * v2x + v1y * v2y;
}
Register-based parameter passing does have its limitations. Up to eight integer and thirteen floating point parameters can be passed in registers. Additional parameters must be spilled to the stack. Integer and floating point parameters overlap (e.g., if a 64-bit double is passed as the first parameter, it is assigned a floating point register, and the first two 32-bit general purpose registers are left unused). This means that long parameter lists containing both integer and floating point parameters will produce more optimal code if the floating point parameters are passed at the end of the parameter list.
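For example, given a routine that takes both kinds of arguments, the second prototype below makes better use of the parameter registers than the first. The routine and its parameters are made up for illustration.

// less optimal: the leading double takes a floating point register but
// also leaves the first two general purpose parameter registers unused
void PlotSamples1(double scale, long count, long *samples, short channel);

// better: integer parameters first, floating point parameters last
void PlotSamples2(long count, long *samples, short channel, double scale);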
Variable-length argument lists pose an interesting problem for register-based parameter passing conventions. The PowerPC runtime architecture calls for all variable argument lists to be spilled to the stack. Use of these routines should therefore be kept to a minimum in time-critical code.
Mixed Mode
Subroutine calls which involve a Mixed Mode switch are much more expensive than simple or cross-TOC calls. Although Apple engineers have worked very hard to tune both Mixed Mode and the 68K emulator for optimal performance, the emulator should be avoided within time-critical code. Simply switching from native to emulated modes requires hundreds of cycles. Profilers and debuggers can help determine which system routines are implemented with native PowerPC code, which are fat, and which are still emulated.
Mixed Mode calls which do not involve a mode switch (e.g., calls to toolbox routines which have already been ported) are much less costly but are still expensive relative to simple cross-TOC calls. Care should be taken to move these types of calls out of time-critical loops whenever possible.
Case Study: Fractal Contours
In an attempt to demonstrate some of the techniques discussed in this article, I went on a hunt for a piece of code which could benefit from tuning. My search ended on the Internet when I found the source for an old public domain Mac program called Fractal Contours. This program was written in 1985 by Jim Cathey for the Mac Plus. The application generates impressive wire-framed mountainous landscapes. The source code for the program, along with the changes I made each step of the way, can be downloaded from any of MacTech's usual on-line sites.
Fractal 1
The version I found on the Internet no longer ran under System 7.5, so my first task was to modernize the code somewhat. I removed assumptions about the screen size, added a little bit of color and a menu option which toggles a continuous redraw mode. When this mode is enabled, fractal pictures are continuously calculated and displayed, and a frame rate (fractals per second) is computed. On a Power Mac 8100/80 using 8-bit video, I measured an initial frame rate of 0.59 fractals per second in emulation mode.
Fractal 2
Now that the program worked and I had defined a measure for performance, I could start tuning. The first step was to simply recompile the program for PowerPC. I chose Metrowerks CodeWarrior C/C++ 1.2 for development. CodeWarrior's speedy turn-around times allowed me to try out many different performance optimizations and quickly measure their effects. With all of CodeWarrior's optimizations enabled, I immediately saw a 4.80x speed-up over the emulated version. I was up to 2.83 fractals per second.
Fractal 3
My first step in tuning was to find the hot spots within the program. I chose to use the adaptive sampling profiler built into the Macintosh Debugger. The first profile indicated that I was spending over 80% of the time in QuickDraw's StdLine function drawing lines to the screen. Because the screen is non-cacheable, QuickDraw is more efficient when working with RAM-based off-screen buffers. I changed the program to draw the fractal image to an off-screen buffer and then use CopyBits to copy it to the screen. The resulting frame rate was now 2.97 fractals per second, a rather disappointing 1.05x speedup over the previous version.
Fractal 4
A second profile showed that the program was still spending the majority of its time within the StdLine function. I noticed that all of the lines the program draws are typically very short (15 to 25 pixels), are always one pixel wide, never need clipping, and are always either black or blue. I decided that a custom line-drawing routine would help to remove the QuickDraw bottleneck. The result was a small increase in frame rate - now 3.47 fractals per second, a 1.17x speedup over the previous version.
Fractal 5
I was a little disappointed and confused as to why the custom line drawing routine did not improve performance further. I profiled the program once again. Indeed, I was no longer spending much time in StdLine, but I was spending a large amount of time in the routine GetPixBaseAddr and in the Mixed Mode Manager. GetPixBaseAddr is simply an accessor function which dereferences a PixMap to return its base address. A quick check revealed that the routine had unfortunately not been ported to native PowerPC; I was making a Mixed Mode switch each time I called this function. By removing the call from an inner loop and placing it in a less time-critical piece of code, I was able to increase performance substantially. I was now up to 6.69 fractals per second, an impressive 1.93x increase over the previous version. My custom line drawing routine was successful after all.
Fractal 6
I ran the profiler once again to find the remaining hot spots. My custom line drawing routine now came up on top. Using various techniques described above, I removed division and multiplication operations wherever possible and changed signed shorts to longs. I also added special cases for drawing horizontal and vertical lines, added static keywords to function prototypes, and passed points and vectors by their individual values rather than by reference. All of this work once again paid off. I was now seeing 10.92 fractals per second, a 1.63x increase over the previous version.
Fractal 7
One last profile revealed that I was still spending a large amount of time in the line drawing routines. Confident that I had already tuned these as much as I could, I concentrated on other hot spots within the code. I noticed quite a bit of time was being spent scaling and transforming the points so the two-dimensional array of xy coordinates could be drawn in a three-dimensional view. The original program used a CORDIC algorithm to iteratively approximate sines and cosines. I had added another scaling operation to magnify the image from its original size - the size of a Mac Plus screen - to the size of the window. I found that an algorithmic change was needed, so I devised a simple linear algebra technique to scale, rotate, and skew the image in one step. The resulting image looks slightly different from the original, but the effects are still impressive - and so was the speed improvement. The final frame rate was now 27.00 fractals per second. This is 9.54x faster than the first native version and 45.76x faster than the emulated version! The tuning efforts definitely paid off.
Fractal 8
Now that I could achieve almost thirty frames per second, I decided to add some more interesting features to the program. I first incorporated a morphing feature which draws intermediate fractals so it appears that each one morphs into the next. Next, a friend contributed some triangle blitting code he had written, and we added a surface mapping feature. The result is quite entertaining. Fully shaded mountains rise and fall in an ever changing pattern. None of this animation would have been possible if I had been stuck with the 2.8 frames per second I achieved by simply porting the program native.
Appeal for Innovation
As the fractal example shows, performance tuning doesn't just mean a faster application. It can enable a program to do things that wouldn't otherwise be possible.
Applications which are noticeably lethargic are no fun to use. Really fast programs which simply do their job faster are less annoying, but still often boring. Programs which make use of performance in innovative ways, on the other hand, will always be in demand. PowerPC offers Mac developers an excellent opportunity to deliver these types of applications. When you're planning the next version of your application, keep in mind the performance tuning tips discussed in this article - and use your imagination to add the features that will make your program really stand out.
For More Information, Refer To:
Inside Macintosh: PowerPC System Software, 1994. Addison-Wesley Publishing Co.
Inside Macintosh: PowerPC Numerics, 1994. Addison-Wesley Publishing Co.
Macintosh Debugger Reference, 1994. Apple Computer, Inc.
Parent, Sean. "Living in an Exceptional World," develop, Issue 11, August 1992. Apple Computer, Inc.
PowerPC 601 User's Manual, 1993. Motorola, Inc.
PowerPC 603 User's Manual, 1994. Motorola, Inc.
PowerPC 604 User's Manual, 1994. Motorola, Inc.