
June 94 - BALANCE OF POWER

Enhancing PowerPC Native Speed

DAVE EVANS


When you convert your applications to native PowerPC code, they run lightning fast. To get the most out of RISC processors, however, you need to pay close attention to your code structure and execution. Fast code is no longer measured solely by an instruction timing table. The PowerPC 601 processor includes pipelining, multi-issue and speculative execution, branch prediction, and a set-associative cache. All these things make it hard to know what code will run fastest on a Power Macintosh.

Writing tight code for the PowerPC processor isn't hard, especially with a good optimizing compiler to help you. In this column I'll pass on some of what I've learned about tuning PowerPC code. There are gotchas and coding habits to avoid, and there are techniques for squeezing the most from your speed-critical native code. For a good introduction to RISC pipelining and related concepts that appear in this column, see "Making the Leap to PowerPC" in Issue 16.

MEASURING YOUR SPEED
The power of RISC lies in the ability to execute one or more instructions every machine clock cycle, but RISC processors can do this only in the best of circumstances. At their worst they're as slow as CISC processors. The following loop, for example, averages only one calculation every 2.8 cycles:

float a[], b[], c[], d, e;
for (i=0; i < gArraySize; i++) {
  e = b[i] + c[i] / d;
  a[i] = MySubroutine(b[i], e);
}

By restructuring the code and using other techniques from this column, you can make significant improvements. This next loop generates the same result (it handles two elements per pass, so it assumes gArraySize is even), yet averages one calculation every 1.9 cycles -- about 50% faster.

reciprocalD = 1 / d;
for (i=0; i < gArraySize; i+=2) {
  float result, localB, localC, localE;
  float result2, localB2, localC2, localE2;

  localB = b[i];
  localC = c[i];
  localB2 = b[i+1];
  localC2 = c[i+1];

  localE = localB + (localC * reciprocalD);
  localE2 = localB2 + (localC2 * reciprocalD);
  InlineSubroutine(&result, localB, localE);
  InlineSubroutine(&result2, localB2, localE2);

  a[i] = result;
  a[i+1] = result2;
}

The rest of this column explains the techniques I just used for that speed gain. They include expanding loops, scoping local variables, using inline routines, and using faster math operations.

UNDERSTANDING YOUR COMPILER
Your compiler is your best friend, and you should try your hardest to understand its point of view. You should understand how it looks at your code and what assumptions and optimizations it's allowed to make. The more you empathize with your compiler, the more you'll recognize opportunities for optimization.

An optimizing compiler reorders instructions to improve speed. Executing your code line by line usually isn't optimal, because the processor stalls to wait for dependent instructions. The compiler tries to move instructions that are independent into the stall points. For example, consider this code:

first = input * numerator;
second = first / denominator;
output = second + adjustment;

Each line depends on the previous line's result, and the compiler will be hard pressed to keep the pipeline full of useful work. This simple example could cause 46 stalled cycles on the PowerPC 601, so the compiler will look at other nearby code for independent instructions to move into the stall points.

EXPANDING YOUR LOOPS
Loops are often your most speed-critical code, and you can improve their performance in several ways. Loop expanding is one of the simplest methods. The idea is to perform more than one independent operation in a loop, so that the compiler can reorder more work in the pipeline and thus prevent the processor from stalling.

For example, in this loop there's too little work to keep the processor busy:

float a[], b[], c[], d;
for (i=0; i < multipleOfThree; i++) {
  a[i] = b[i] + c[i] * d;
}

If we know the data always occurs in certain sized increments, we can do more steps in each iteration, as in the following:

for (i=0; i < multipleOfThree; i+=3) {
  a[i] = b[i] + c[i] * d;
  a[i+1] = b[i+1] + c[i+1] * d;
  a[i+2] = b[i+2] + c[i+2] * d;
}

On a CISC processor the second loop wouldn't be much faster, but on the PowerPC processor the second loop is twice as fast as the first. This is because the compiler can schedule independent instructions to keep the pipeline constantly moving. (If the data doesn't occur in nice increments, you can still expand the loop; just add a small loop at the end to handle the extra iterations.)

Be careful not to expand a loop too much, though. Very large loops won't fit in the cache, causing cache misses for each iteration. In addition, the larger a loop gets, the less work can be done entirely in registers. Expand too much and the compiler will have to use memory to store intermediate results, outweighing your marginal gains. Besides, you get the biggest gains from the first few expansions.
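
The cleanup loop mentioned above might look something like the following. This is only a sketch: ScaleAndAdd and its parameters are names made up for illustration, and it assumes the three arrays each hold at least count elements.

```c
#include <stddef.h>

/* Hypothetical helper: computes a[i] = b[i] + c[i] * d for any count.
   The main loop is expanded by three; a short cleanup loop handles
   the zero, one, or two elements left over. */
void ScaleAndAdd(float *a, const float *b, const float *c, float d, size_t count)
{
    size_t i;
    size_t expandedCount = count - (count % 3);  /* largest multiple of 3 <= count */

    /* Expanded loop: three independent calculations per pass. */
    for (i = 0; i < expandedCount; i += 3) {
        a[i]     = b[i]     + c[i]     * d;
        a[i + 1] = b[i + 1] + c[i + 1] * d;
        a[i + 2] = b[i + 2] + c[i + 2] * d;
    }

    /* Cleanup loop: picks up where the expanded loop stopped. */
    for (; i < count; i++) {
        a[i] = b[i] + c[i] * d;
    }
}
```

The cleanup loop runs at most twice, so its cost is negligible next to the expanded loop on arrays of any real size.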

SCOPING YOUR VARIABLES
If you're new to RISC, you'll be impressed by the number of registers available on the PowerPC chip -- 32 general registers and 32 floating-point registers. By having so many, the processor can often avoid slow memory operations. Your compiler will take advantage of this when it can, but you can help it by carefully scoping your variables and using lots of local variables.

The "scope" of a variable is the area of code in which it is valid. Your compiler examines the scope of each variable when it schedules registers, and your code can provide valuable information about the usage of each variable. Here's an example:

for (i=0; i < gArraySize; i++) {
  a[i] = MyFirstRoutine(b[i], c[i]);
  b[i] = MySecondRoutine(a[i], c[i]);
} 

In this loop, the global variable gArraySize is scoped for the whole program. Because we call a subroutine in the loop, the compiler can't tell if gArraySize will change during each iteration. Since the subroutine might modify gArraySize, the compiler has to be conservative. It will reload gArraySize from memory on every iteration, and it won't optimize the loop any further. This is wastefully slow.

On the other hand, if we use a local variable, we tell the compiler that gArraySize and c[i] won't be modified and that it's all right to just keep them handy in registers. In addition, we can store data as temporary variables scoped only within the loop. This tells the compiler how we intend to use the data, so that the compiler can use free registers and discard them after the loop. Here's what this would look like:

arraySize = gArraySize;
for (i=0; i < arraySize; i++) {
  float localC;
  localC = c[i];
  a[i] = MyFirstRoutine(b[i], localC);
  b[i] = MySecondRoutine(a[i], localC);
} 

These minor changes give the compiler more information about the data, in this instance accelerating the resulting code by 25%.

STYLING YOUR CODE
Be wary of code that looks complicated. If each line of source code contains complicated dereferences and typecasting, chances are the object code has wasteful memory instructions and inefficient register usage. A great compiler might optimize well anyway, but don't count on it. Judicious use of temporary variables (as mentioned above) will help the compiler understand exactly what you're doing -- plus your code will be easier to read.

Excessive memory dereferencing is a problem exacerbated by the heavy use of handles on the Macintosh. Code often contains double memory dereferences, which is important when memory can move. But when you can guarantee that memory won't move, use a local pointer, so that you only dereference a handle once. This saves load instructions and allows further optimizations.

Casting data types is usually a free operation -- you're just telling the compiler that you know you're copying seemingly incompatible data. But it's not free if the data types have different bit sizes, which adds conversion instructions. Again, avoid this by using local variables for the commonly cast data.
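
To make the handle advice concrete, here's a small sketch that stands in for a Macintosh handle with a plain pointer-to-pointer. The SampleRecord type, its fields, and both function names are hypothetical. On a real Macintosh, the second version is safe only while the block is locked or no call that can move memory is made.

```c
#include <stddef.h>

/* Hypothetical record type; a handle is a pointer to a master pointer. */
typedef struct {
    long  count;
    float values[8];
} SampleRecord;
typedef SampleRecord **SampleHandle;

/* Dereferences the handle twice for every field access. */
float SumThroughHandle(SampleHandle h)
{
    float sum = 0.0f;
    long i;

    for (i = 0; i < (**h).count; i++) {
        sum += (**h).values[i];
    }
    return sum;
}

/* Dereferences the handle once, then works through local variables.
   Assumes the block cannot move while the loop runs. */
float SumThroughPointer(SampleHandle h)
{
    const SampleRecord *record = *h;      /* single handle dereference */
    const float *values = record->values;
    long count = record->count;
    float sum = 0.0f;
    long i;

    for (i = 0; i < count; i++) {
        sum += values[i];
    }
    return sum;
}
```

Both routines return the same sum; the second simply gives the compiler local values it can keep in registers.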

I've heard many times that branches are "free" on the PowerPC processor. It's true that often the pipeline can keep moving even though a branch is encountered, because the branch execution unit will try to resolve branches very early in the pipeline or will predict the direction of the branch. Still, the more subroutines you have, the less your compiler will be able to reorder and intelligently schedule instructions. Keep speed-critical code together, so that more of it can be pipelined and the compiler can schedule your registers better. Use inline routines for short operations, as I did in the improved version of the first example loop in this column.
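
As a rough illustration of the inline advice, here's how a short subroutine might be written so the compiler can fold it into the loop. Treat this as a sketch under some assumptions: it relies on a compiler that supports the C99 inline keyword, the body of InlineSubroutine is a placeholder (this column never shows what the real routine computes), and ProcessArrays is a hypothetical wrapper.

```c
#include <stddef.h>

/* Placeholder body: the real calculation isn't shown in this column.
   Declared static inline so the compiler may expand it at the call site. */
static inline void InlineSubroutine(float *result, float b, float e)
{
    *result = b * e;
}

/* Hypothetical wrapper: applies the calculation to each element. */
void ProcessArrays(float *a, const float *b, const float *c, float d, size_t count)
{
    float reciprocalD = 1.0f / d;   /* divide once, outside the loop */
    size_t i;

    for (i = 0; i < count; i++) {
        float result, localB, localE;

        localB = b[i];
        localE = localB + (c[i] * reciprocalD);
        InlineSubroutine(&result, localB, localE);  /* candidate for in-place expansion */
        a[i] = result;
    }
}
```

The inline keyword is only a request, so it's worth checking the generated code to see whether the call really disappeared.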

KNOWING YOUR PROCESSOR
As with all processors, the PowerPC chip has performance tradeoffs you should know about. Some are processor model specific. For example, the PowerPC 601 has 32K of cache, while the 603 has 16K split evenly into an instruction cache and a data cache. But in general you should know about floating-point performance and the virtues of memory alignment.

Floating-point multiplication is wicked fast -- up to nine times the speed of integer multiplication. Use floating-point multiplication if you can. Floating-point division, by contrast, takes 17 times as long as floating-point multiplication, so when possible multiply by a reciprocal instead of dividing.
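
Here's a minimal sketch of the reciprocal trick applied to a whole array. The function names are invented for illustration. One caveat to keep in mind: multiplying by a reciprocal can differ from a true division in the last bit of the result unless the reciprocal is exactly representable (as with a power-of-two divisor), so use it only where that's acceptable.

```c
#include <stddef.h>

/* Straightforward version: one floating-point divide per element. */
void DivideEach(float *out, const float *in, float d, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++) {
        out[i] = in[i] / d;
    }
}

/* Reciprocal version: one divide up front, then one multiply per element. */
void MultiplyByReciprocal(float *out, const float *in, float d, size_t count)
{
    float reciprocalD = 1.0f / d;
    size_t i;

    for (i = 0; i < count; i++) {
        out[i] = in[i] * reciprocalD;
    }
}
```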

Memory accesses go fastest if addressed on 64-bit memory boundaries. Accesses to unaligned data stall while the processor loads different words and then shifts and splices them. For example, be sure to align floating-point data to 64-bit boundaries, or you'll stall for four cycles while the processor loads 32-bit halves with two 64-bit accesses.
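
One way to act on the alignment advice is to order structure fields so that the 64-bit values come first. The sketch below is illustrative only: both record types and the IsEightByteAligned helper are hypothetical, and whether your compiler pads structures this way depends on its alignment settings, so check the offsets it actually produces.

```c
#include <stddef.h>

/* The double follows a one-byte field, so its offset depends entirely on
   how much padding the compiler chooses to insert after flag. */
typedef struct {
    char   flag;
    double value;
    long   count;
} PaddedRecord;

/* The double comes first, so it sits at offset 0 of the structure and is
   64-bit aligned whenever the structure itself is. */
typedef struct {
    double value;
    long   count;
    char   flag;
} OrderedRecord;

/* Returns nonzero if a byte offset falls on a 64-bit boundary. */
int IsEightByteAligned(size_t offset)
{
    return (offset % 8) == 0;
}
```

Passing offsetof(PaddedRecord, value) to the helper shows whether a particular compiler setting leaves the double on a 64-bit boundary.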

MAKING THE DIFFERENCE
Native PowerPC code runs really fast, so in many cases you don't need to worry about tweaking its performance at all. For your speed-critical code, though, these tips I've given you can make the difference between "too slow" and "fast enough."

RECOMMENDED READING

  • High-Performance Computing by Kevin Dowd (O'Reilly & Associates, Inc., 1993).
  • High-Performance Computer Architecture by Harold S. Stone (Addison-Wesley, 1993).
  • PowerPC 601 RISC Microprocessor User's Manual (Motorola, 1993).

DAVE EVANS may be able to tune PowerPC code for Apple, but for the last year he's been repeatedly thwarted when tuning his 1978 Harley-Davidson XLCH motorcycle. Fixing engine stalls, poor timing, and rough starts proved difficult, but he was recently rewarded with the guttural purr of a well-tuned Harley.

Code examples were compiled with the PPCC compiler using the speed optimization option, and then run on a Power Macintosh 6100/66 for profiling. A PowerPC 601 microsecond timing library is provided on this issue's CD.

 
