TweetFollow Us on Twitter

June 94 - BALANCE OF POWER

BALANCE OF POWER

Enhancing PowerPC Native Speed

DAVE EVANS

[IMAGE 055-057_Balance_of_Power1.GIF]

When you convert your applications to native PowerPC code, they run lightning fast. To get the most out of RISC processors, however, you need to pay close attention to your code structure and execution. Fast code is no longer measured solely by an instruction timing table. The Power PC 601 processor includes pipelining, multi-issue and speculative execution, branch prediction, and a set associative cache. All these things make it hard to know what code will run fastest on a Power Macintosh.

Writing tight code for the PowerPC processor isn't hard, especially with a good optimizing compiler to help you. In this column I'll pass on some of what I've learned about tuning Power PC code. There are gotchas and coding habits to avoid, and there are techniques for squeezing the most from your speed-critical native code. For a good introduction to RISC pipelining and related concepts that appear in this column, see "Making the Leap to PowerPC" in Issue 16.

MEASURING YOUR SPEED
The power of RISC lies in the ability to execute one or more instructions every machine clock cycle, but RISC processors can do this only in the best of circumstances. At their worst they're as slow as CISC processors. The following loop, for example, averages only one calculation every 2.8 cycles:

float a[], b[], c[], d, e;
for (i=0; i < gArraySize; i++) {
  e = b[i] + c[i] / d;
  a[i] = MySubroutine(b[i], e);
}

By restructuring the code and using other techniques from this column, you can make significant improvements. This next loop generates the same result, yet averages one calculation every 1.9 cycles -- about 50% faster.

reciprocalD = 1 / d;
for (i=0; i < gArraySize; i+=2) {
  float result, localB, localC, localE;
  float result2, localB2, localC2, localE2;

  localB = b[i];
  localC = c[i];
  localB2 = b[i+1];
  localC2 = c[i+1];

  localE = localB + (localC * reciprocalD);
  localE2 = localB2 + (localC2 * reciprocalD);
  InlineSubroutine(&result, localB, localE);
  InlineSubroutine(&result2, localB2, localE2);

  a[i] = result;
  a[i+1] = result2;
}

The rest of this column explains the techniques I just used for that speed gain. They include expanding loops, scoping local variables, using inline routines, and using faster math operations.

UNDERSTANDING YOUR COMPILER
Your compiler is your best friend, and you should try your hardest to understand its point of view. You should understand how it looks at your code and what assumptions and optimizations it's allowed to make. The more you empathize with your compiler, the more you'll recognize opportunities for optimization.

An optimizing compiler reorders instructions to improve speed. Executing your code line by line usually isn't optimal, because the processor stalls to wait for dependent instructions. The compiler tries to move instr uctions that are independent into the stall points. For example, consider this code:

first = input * numerator;
second = first / denominator;
output = second + adjustment;

Each line depends on the previous line's result, and the compiler will be hard pressed to keep the pipeline full of useful work. This simple example could cause 46 stalled cycles on the PowerPC 601, so the compiler will look at other nearby code for independent instructions to move into the stall points.

EXPANDING YOUR LOOPS
Loops are often your most speed-critical code, and you can improve their performance in several ways. Loop expanding is one of the simplest methods. The idea is to perform more than one independent operation in a loop, so that the compiler can reorder more work in the pipeline and thus prevent the processor from stalling.

For example, in this loop there's too little work to keep the processor busy:

float a[], b[], c[], d;
for (i=0; i < multipleOfThree; i++) {
  a[i] = b[i] + c[i] * d;
}

If we know the data always occurs in certain sized increments, we can do more steps in each iteration, as in the following:

for (i=0; i < multipleOfThree; i+=3) {
  a[i] = b[i] + c[i] * d;
  a[i+1] = b[i+1] + c[i+1] * d;
  a[i+2] = b[i+2] + c[i+2] * d;
}

On a CISC processor the second loop wouldn't be much faster, but on the Power PC processor the second loop is twice as fast as the first. This is because the compiler can schedule independent instructions to keep the pipeline constantly moving. (If the data doesn't occur in nice increments, you can still expand the loop; just add a small loop at the end to handle the extra iterations.)Be careful not to expand a loop too much, though. Very large loops won't fit in the cache, causing cache misses for each iteration. In addition, the larger a loop gets, the less work can be done entirely in registers. Expand too much and the compiler will have to use memory  to store intermediate results, outweighing your marginal gains. Besides, you get the biggest gains from the first few expansions.

SCOPING YOUR VARIABLES
If you're new to RISC, you'll be impressed by the number of registers available on the PowerPC chip -- 32 general registers and 32 floating-point registers. By having so many, the processor can often avoid slow memory operations. Your compiler will take advantage of this when it can, but you can help it by carefully scoping your variables and using lots of local variables.

The "scope" of a variable is the area of code in which it is valid. Your compiler examines the scope of each variable when it schedules registers, and your code can provide valuable information about the usage of each variable. Here's an example:

for (i=0; i < gArraySize; i++) {
  a[i] = MyFirstRoutine(b[i], c[i]);
  b[i] = MySecondRoutine(a[i], c[i]);
} 

In this loop, the global variable gArraySize is scoped for the whole program. Because we call a subroutine in the loop, the compiler can't tell if gArraySize will change during each iteration. Since the subroutine might modify gArraySize, the compiler has to be conservative. It will reload gArraySize from memory on every iteration, and it won't optimize the loop any further. This is wastefully slow.

On the other hand, if we use a local  variable, we tell the compiler that gArraySize and c[i] won't be modified and that it's all right to just keep them handy in registers. In addition, we can store data as temporary variables scoped only within the loop. This tells the compiler how we intend to use the data, so that the compiler can use free registers and discard them after the loop. Here's what this would look like:

arraySize = gArraySize;
for (i=0; i < arraySize; i++) {
  float localC;
  localC = c[i];
  a[i] = MyFirstRoutine(b[i], localC);
  b[i] = MySecondRoutine(a[i], localC);
} 

These minor changes give the compiler more information about the data, in this instance accelerating the resulting code by 25%.

STYLING YOUR CODE
Be wary of code that looks complicated. If each line of source code contains complicated dereferences and typecasting, chances are the object code has wasteful memory instructions and inefficient register usage. A great compiler might optimize well anyway, but don't count on it. Judicious use of temporary variables (as mentioned above) will help the compiler understand exactly what you're doing -- plus your code will be easier to read.

Excessive memory dereferencing is a problem exacerbated by the heavy use of handles on the Macintosh. Code often contains double memory dereferences, which is important when memory can move. But when you can guarantee that memory won't  move, use a local pointer, so that you only dereference a handle once. This saves load instructions and allows fur ther optimizations. Casting data types is usually a free operation -- you're just telling the compiler that you know you're copying seemingly incompatible data. But it's not  free if the data types have different bit sizes, which adds conversion instructions. Again, avoid this by using local variables for the commonly casted data.

I've heard many times that branches are "free" on the PowerPC processor. It's true that often the pipeline can keep moving even though a branch is encountered, because the branch execution unit will try to resolve branches very early in the pipeline or will predict the direction of the branch. Still, the more subroutines you have, the less your compiler will be able to reorder and intelligently schedule instructions. Keep speed-critical code together, so that more of it can be pipelined and the compiler can schedule your registers better. Use inline routines for short operations, as I did in the improved version of the first example loop in this column.

KNOWING YOUR PROCESSOR
As with all processors, the PowerPC chip has performance tradeoffs you should know about. Some are processor model specific. For example, the PowerPC 601 has 32K of cache, while the 603 has 16K split evenly into an instruction cache and a data cache. But in general you should know about floating-point performance and the virtues of memory alignment.

Floating-point multiplication is wicked fast -- up to nine times  the speed of integer multiplication. Use floating-point multiplication if you can. Floating-point division takes 17 times as long, so when possible multiply by a reciprocal instead of dividing.

Memory accesses go fastest if addressed on 64-bit memory boundaries. Accesses to unaligned data stall while the processor loads different words and then shifts and splices them. For example, be sure to align floating-point data to 64-bit boundaries, or you'll stall for four cycles while the processor loads 32-bit halves with two 64-bit accesses.

MAKING THE DIFFERENCE
Native PowerPC code runs really fast, so in many cases you don't need to worry about tweaking its performance at all. For your speed-critical code, though, these tips I've given you can make the difference between "too slow" and "fast enough."

RECOMMENDED READING

  • High-Performance Computing  by Kevin Dowd (O'Reilly & Associates, Inc., 1993).
  • High-Performance Computer Architecture  by Harold S. Stone (Addison-Wesley, 1993).
  • PowerPC 601 RISC Microprocessor User's Manual (Motorola, 1993).

DAVE EVANS may be able to tune PowerPC code for Apple, but for the last year he's been repeatedly thwarted when tuning his 1978 Harley-Davidson XLCH motorcycle. Fixing engine stalls, poor timing, and rough starts proved difficult, but he was recently rewarded with the guttural purr of a well-tuned Harley. *

Code examples were compiled with the PPCC compiler using the speed optimization option, and then run on a Power Macintosh 6100/66 for profiling. A PowerPC 601 microsecond timing library is provided on this issue's CD. *

 

Community Search:
MacTech Search:

Software Updates via MacUpdate

SketchUp 19.1.174 - Create 3D design con...
SketchUp is an easy-to-learn 3D modeling program that enables you to explore the world in 3D. With just a few simple tools, you can create 3D models of houses, sheds, decks, home additions,... Read more
ClamXav 3.0.12 - Virus checker based on...
ClamXav is a popular virus checker for OS X. Time to take control ClamXAV keeps threats at bay and puts you firmly in charge of your Mac’s security. Scan a specific file or your entire hard drive.... Read more
BetterTouchTool 3.151 - Customize multi-...
BetterTouchTool adds many new, fully customizable gestures to the Magic Mouse, Multi-Touch MacBook trackpad, and Magic Trackpad. These gestures are customizable: Magic Mouse: Pinch in / out (zoom)... Read more
FontExplorer X Pro 6.0.9 - Font manageme...
FontExplorer X Pro is optimized for professional use; it's the solution that gives you the power you need to manage all your fonts. Now you can more easily manage, activate and organize your... Read more
Dropbox 77.4.131 - Cloud backup and sync...
Dropbox is an application that creates a special Finder folder that automatically syncs online and between your computers. It allows you to both backup files and keeps them up-to-date between systems... Read more
DiskCatalogMaker 7.5.3 - Catalog your di...
DiskCatalogMaker is a simple disk management tool which catalogs disks. Simple, light-weight, and fast Finder-like intuitive look and feel Super-fast search algorithm Can compress catalog data for... Read more
Notion 1.0.7 - A unified workspace for m...
Notion is the unified workspace for modern teams. Notion Features: Integration with Slack Documents Wikis Tasks Note: This application contains in-app and/or external module purchases. Version 1.0... Read more
Microsoft Office 2016 16.16.12 - Popular...
Microsoft Office 2016 - Unmistakably Office, designed for Mac. The new versions of Word, Excel, PowerPoint, Outlook, and OneNote provide the best of both worlds for Mac users - the familiar Office... Read more
Little Snitch 4.4.2 - Alerts you about o...
Little Snitch gives you control over your private outgoing data. Track background activity As soon as your computer connects to the Internet, applications often have permission to send any... Read more
MainStage 3 3.4.3 - Live performance too...
Apple MainStage makes it easy to bring to the stage all the same instruments and effects that you love in your recording. Everything from the Sound Library and Smart Controls you're familiar with... Read more

Latest Forum Discussions

See All

Upcoming visual novel Arranged shines a...
If you’re in the market for a new type of visual novel designed to inform and make you think deeply about its subject matter, then Arranged by Kabuk Games could be exactly what you’re looking for. It’s a wholly unique take on marital traditions in... | Read more »
TEPPEN guide - The three best decks in T...
TEPPEN’s unique take on the collectible card game genre is exciting. It’s just over a week old, but that isn’t stopping lots of folks from speculating about the long-term viability of the game, as well as changes and additions that will happen over... | Read more »
Intergalactic puzzler Silly Memory serve...
Recently released matching puzzler Silly Memory is helping its fans with their intergalactic journeys this month with some very special offers on in-app purchases. In case you missed it, Silly Memory is the debut title of French based indie... | Read more »
TEPPEN guide - Tips and tricks for new p...
TEPPEN is a wild game that nobody asked for, but I’m sure glad it exists. Who would’ve thought that a CCG featuring Capcom characters could be so cool and weird? In case you’re not completely sure what TEPPEN is, make sure to check out our review... | Read more »
Dr. Mario World guide - Other games that...
We now live in a post-Dr. Mario World world, and I gotta say, things don’t feel too different. Nintendo continues to squirt out bad games on phones, causing all but the most stalwart fans of mobile games to question why they even bother... | Read more »
Strategy RPG Brown Dust introduces its b...
Epic turn-based RPG Brown Dust is set to turn 500 days old next week, and to celebrate, Neowiz has just unveiled its biggest and most exciting update yet, offering a host of new rewards, increased gacha rates, and a brand new feature that will... | Read more »
Dr. Mario World is yet another disappoin...
As soon as I booted up Dr. Mario World, I knew I wasn’t going to have fun with it. Nintendo’s record on phones thus far has been pretty spotty, with things trending downward as of late. [Read more] | Read more »
Retro Space Shooter P.3 is now available...
Shoot-em-ups tend to be a dime a dozen on the App Store, but every so often you come across one gem that aims to shake up the genre in a unique way. Developer Devjgame’s P.3 is the latest game seeking to do so this, working as a love letter to the... | Read more »
Void Tyrant guide - Guildins guide
I’ve still been putting a lot of time into Void Tyrant since it officially released last week, and it’s surprising how much stuff there is to uncover in such a simple-looking game. Just toray, I finished spending my Guildins on all available... | Read more »
Tactical RPG Brown Dust celebrates the s...
Neowiz is set to celebrate the summer by launching a 2-month long festival in its smash-hit RPG Brown Dust. The event kicks off today, and it’s divided into 4 parts, each of which will last two weeks. Brown Dust is all about collecting, upgrading,... | Read more »

Price Scanner via MacPrices.net

Clearance 12″ 1.2GHz MacBook on sale for $899...
Focus Camera has clearance 12″ 1.2GHz Space Gray MacBooks available for $899.99 shipped. That’s $400 off Apple’s original MSRP. Focus charges sales tax for NY & NJ residents only. Read more
Get a new 2019 13″ 2.4GHz 4-Core MacBook Pro...
B&H Photo has new 2019 13″ 2.4GHz MacBook Pros on sale for up to $150 off Apple’s MSRP. Overnight shipping is free to many addresses in the US: – 2019 13″ 2.4GHz/256GB 6-Core MacBook Pro Silver... Read more
AirPods with Wireless Charging Case now on sa...
Amazon has extended their Prime Day savings on Apple AirPods by offering AirPods with the Wireless Charging case for $169.99. That’s $30 off Apple’s MSRP, and it’s the cheapest price available for... Read more
New 2019 15″ MacBook Pros on sale for $200 of...
B&H Photo has the new 2019 15″ 6-Core and 8-Core MacBook Pros on sale for $200 off Apple’s MSRP. Overnight shipping is free to many addresses in the US: – 2019 15″ 2.6GHz 6-Core MacBook Pro Space... Read more
Amazon drops prices, now offers clearance 13″...
Amazon has new dropped prices on clearance 13″ 2.3GHz Dual-Core non-Touch Bar MacBook Pros by $200 off Apple’s original MSRP, with prices now available starting at $1099. Shipping is free. Be sure to... Read more
2018 15″ MacBook Pros now on sale for $500 of...
Amazon has dropped prices on select clearance 2018 15″ 6-Core MacBook Pros to $500 off Apple’s original MSRP. Prices now start at $1899 shipped: – 2018 15″ 2.2GHz Touch Bar MacBook Pro Silver: $1899.... Read more
Price drop! Clearance 12″ 1.2GHz Silver MacBo...
Amazon has dropped their price on the recently-discontinued 12″ 1.2GHz Silver MacBook to $849.99 shipped. That’s $450 off Apple’s original MSRP for this model, and it’s the cheapest price available... Read more
Apple’s 21″ 3.0GHz 4K iMac drops to only $936...
Abt Electronics has dropped their price on clearance, previous-generation 21″ 3.0GHz 4K iMacs to only $936 shipped. That’s $363 off Apple’s original MSRP, and it’s the cheapest price we’ve seen so... Read more
Amazon’s Prime Day savings on Apple 11″ iPad...
Amazon has new 2018 Apple 11″ iPad Pros in stock today and on sale for up to $250 off Apple’s MSRP as part of their Prime Day sale (but Prime membership is NOT required for these savings). These are... Read more
Prime Day Apple iPhone deal: $100 off all iPh...
Boost Mobile is offering Apple’s new 2018 iPhone Xr, iPhone Xs, and Xs Max for $100 off MSRP. Their discount reduces the cost of an Xs to $899 for the 64GB models and $999 for the 64GB Xs Max. Price... Read more

Jobs Board

*Apple* IOS Systems Engineer - Randstad (Uni...
Apple IOS Systems Engineer **job details:** + location:Irvine, CA + salary:$45 - $55 per hour + date posted:Tuesday, July 16, 2019 + job type:Temp to Perm + Read more
Business Development Manager, *Apple* Globa...
Business Development Manager, Apple Global Tampa, FL, US Requisition Number:73805 As a Global Apple Business Development Manager at Insight, you proactively Read more
*Apple* Systems Architect/Engineer, Vice Pre...
…its vision to be the world's most trusted financial group. **Summary:** Apple Systems Architect/Engineer with strong knowledge of products and services related to Read more
*Apple* Graders/Inspectors (Seasonal/Hourly/...
…requirements. #COVAentryleveljobs ## Minimum Qualifications Some knowledge of agricultural and/or the apple industry is helpful as well as the ability to comprehend, Read more
Best Buy *Apple* Computing Master - Best Bu...
**710003BR** **Job Title:** Best Buy Apple Computing Master **Job Category:** Store Associates **Location Number:** 000171-Winchester Road-Store **Job Description:** Read more
All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.