December 94 - Balance of Power: PowerPC Branch Prediction
Dave Evans
The PowerPC processors try to predict which way your code will execute. This
sounds surprisingly astrological for a digital machine, but it becomes very
useful for a pipelined processor and will often speed up your code. In this
column I'll go over why and how this works, focusing especially on the new
PowerPC 604 processor prediction techniques, and I'll answer the question "Can
a Power Macintosh really tell the future?"
PSYCHIC DECISIONS
Typically about one-seventh of the instructions in your code are branches,
either to call subroutines or to make logical decisions in your program. The
PowerPC processor would ordinarily tend to stall at branches, since it tries to
work on more than one instruction at a time and it's not always sure which code
it should execute after a branch. It could either take the branch or fall
through, and often the processor won't know which until a couple of cycles
later.
So the PowerPC processors allow for speculative execution, meaning they'll
guess at the most probable direction the branch will go and then will issue
those instructions. But the processor doesn't let the instructions commit until
it's sure the guess was correct. Usually it guesses right, and a few
instructions are already completed when the branch is decided. If the guess was
wrong, it throws out those results and starts over with the correct code.
This predictive skill helps keep the processor executing successfully without
stalls, and better prediction techniques will yield better overall performance.
The new PowerPC 604 processor improves on earlier prediction techniques; I'll
discuss all of them in detail below.
But first, a relevant astrological note: The "birthday" of the 601 makes it a
Taurus, whereas the 603 is a Libra. The 604 chip had a birthday in April, so
it's an Aries.
TAURUS AND LIBRA ARE COMPATIBLE
The PowerPC 601 and 603 processors use basically the same techniques to predict
branches. For simple unconditional branches, for example, they both process and
remove the branch early in the instruction issue stage. This operation, called
branch folding, keeps the instruction stream moving without having to
wait for the branch to be processed. The branch is handled early, and the new
instructions are fetched from the cache immediately.
For conditional branches, both processors first try to handle the branch early
in the instruction issue stage. If the condition being tested has already been
evaluated, the branch is folded out of the instruction stream. But if the
condition being tested is still in the pipeline, the processor must guess at
the branch direction.
Prediction of guessed branches is based on two things: the direction of the
branch and a software "hint" bit. If the direction is negative -- backward in
your code -- the branch is taken (because loops often iterate a few times
backward before falling through, and this heuristic is more often true). All
other branches fall through by default. The hint bit is a way for the compiler
to reverse this heuristic: if the bit is set, the prediction will be
reversed.
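To make the heuristic concrete, here's a minimal sketch in C of the decision just described. The function name predict_static and its two parameters are hypothetical names chosen for illustration; the real decision is made in hardware, and this only mirrors the rule as stated above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true if the branch is predicted taken.
   displacement: signed distance from the branch to its target, in bytes.
   hint_bit:     the software hint bit carried by the branch instruction. */
bool predict_static(int32_t displacement, bool hint_bit)
{
    /* Default heuristic: backward branches are predicted taken,
       all other branches are predicted to fall through. */
    bool predicted_taken = (displacement < 0);

    /* A set hint bit reverses the default prediction. */
    if (hint_bit)
        predicted_taken = !predicted_taken;

    return predicted_taken;
}
```

With these assumptions, a loop-closing branch with a displacement of -16 and a clear hint bit is predicted taken, and the same branch with the hint bit set is predicted to fall through.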
As far as I know there are no compilers that allow you to specify the hint bit
in your code, although this could be a valuable feature. Also, profilers or
similar tools could take statistics on your code flow and then set the bits for
you from trial runs of your software.
THE TEMPERAMENT OF ARIES
The PowerPC 604 has much better branch prediction, which means better
performance. Because branch statements most often repeat themselves, it
remembers recent branch results to make its predictions:
- It has a cache of the last 64 branches that it has taken, and any time it
sees one of these branches again it will immediately predict to the same branch
destination. This technique, called dynamic branch prediction, is used
on the Pentium and other processors with great results.
- It keeps a history of all other branches and predicts based on the recent
directions that branch took.
The cache technique has the advantage of being very fast. When the 604 fetches
an instruction, it also sends the instruction's
address to the branch cache. If the instruction is a recently executed branch,
the cache will return the address of where the branch last went. This is
immediately used to fetch the next instruction. Because this all occurs during
the fetch of the branch instruction itself, there's no delay in fetching the
first predicted instruction.
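The lookup can be modeled in software along the following lines. This is only a sketch: the type and function names are hypothetical, and the direct-mapped indexing on the instruction's word address is an assumption made to keep the example short, not a description of how the 604 actually organizes its 64 entries.

```c
#include <stdbool.h>
#include <stdint.h>

#define BTC_ENTRIES 64  /* the 604 remembers the last 64 taken branches */

typedef struct {
    bool     valid;
    uint32_t branch_addr;  /* address of the branch instruction */
    uint32_t target_addr;  /* where that branch last went */
} BranchCacheEntry;

typedef struct {
    BranchCacheEntry entries[BTC_ENTRIES];
} BranchCache;

/* Assumption: one slot per word address, modulo the table size. */
unsigned btc_index(uint32_t addr)
{
    return (addr >> 2) % BTC_ENTRIES;
}

/* Called with every fetch address. On a hit, *next_fetch receives the
   remembered target; on a miss, fetching simply continues with the next
   sequential instruction (PowerPC instructions are 4 bytes long). */
bool btc_lookup(const BranchCache *cache, uint32_t fetch_addr,
                uint32_t *next_fetch)
{
    const BranchCacheEntry *e = &cache->entries[btc_index(fetch_addr)];

    if (e->valid && e->branch_addr == fetch_addr) {
        *next_fetch = e->target_addr;
        return true;
    }
    *next_fetch = fetch_addr + 4;
    return false;
}

/* Remember a branch that was just taken, replacing whatever shared its slot. */
void btc_record_taken(BranchCache *cache, uint32_t branch_addr,
                      uint32_t target_addr)
{
    BranchCacheEntry *e = &cache->entries[btc_index(branch_addr)];

    e->valid = true;
    e->branch_addr = branch_addr;
    e->target_addr = target_addr;
}
```

The point of the model is that the lookup needs nothing but the fetch address, which is why the hardware can do it while the branch itself is still being fetched.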
For conditional branches that aren't in the branch cache, the 604 keeps a
history of recent times it saw that instruction. It keeps 512 such histories,
each two bits wide, to remember whether the branch was taken during the last
few executions. The processor hashes the instruction address to keep the branch
histories distinct, and hash collisions are very rare.
Each history is set to one of four states: strongly taken, taken, not taken,
and strongly not taken. The current state determines the branch prediction as
taken or not taken. After the branch commits, the state is updated. Each update
adjusts the state one step toward strongly taken or strongly not taken. The two
intermediate steps are a hedge so that it will usually take two mistakes before
a prediction changes. Because branches tend to repeat, this algorithm generally
results in the following prediction:
- If the branch was taken during the last two executions, the 604 predicts
it will again be taken.
- If the branch wasn't taken during either of the last two executions, the
604 predicts it again won't be taken.
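The four-state history described above behaves like a two-bit saturating counter, and a small C model makes the update rule easy to follow. Everything named here is hypothetical: the identifiers, the simple modulo hash on the word address, and the choice to start every history at "strongly not taken" are assumptions for illustration, since the column doesn't specify them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HISTORY_ENTRIES 512  /* the 604 keeps 512 two-bit histories */

typedef enum {
    STRONGLY_NOT_TAKEN = 0,
    NOT_TAKEN          = 1,
    TAKEN              = 2,
    STRONGLY_TAKEN     = 3
} BranchState;

typedef struct {
    uint8_t state[HISTORY_ENTRIES];  /* zero-initialized: strongly not taken */
} BranchHistoryTable;

/* Assumption: hash the word address by taking it modulo the table size. */
unsigned bht_index(uint32_t branch_addr)
{
    return (branch_addr >> 2) % HISTORY_ENTRIES;
}

/* The upper two states predict taken, the lower two predict not taken. */
bool bht_predict(const BranchHistoryTable *table, uint32_t branch_addr)
{
    return table->state[bht_index(branch_addr)] >= TAKEN;
}

/* After the branch commits, move one step toward the observed direction,
   stopping at the "strongly" states. */
void bht_update(BranchHistoryTable *table, uint32_t branch_addr,
                bool was_taken)
{
    uint8_t *s = &table->state[bht_index(branch_addr)];

    assert(*s <= STRONGLY_TAKEN);
    if (was_taken && *s < STRONGLY_TAKEN)
        (*s)++;
    else if (!was_taken && *s > STRONGLY_NOT_TAKEN)
        (*s)--;
}
```

In this model, two consecutive taken outcomes always leave the history in one of the two taken states, whatever state it started in, and a single surprise from a "strongly" state doesn't flip the prediction.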
Also with the 604, branches on the count register base their prediction on the
current count value. This will usually
predict loops correctly and yield good performance, since loops count down for
a number of iterations before the final iteration causes an incorrect
prediction.
But these techniques also come with a tradeoff: the 604 has an extra pipeline
stage to dispatch instructions. This means instructions take longer to get
through the pipe, and mispredicted branches are more expensive.
ARIES RISING
The 604 is the fastest PowerPC processor yet, and I can't talk about it here
without also going into why it's such a fast engine. Besides its advanced
branch prediction hardware, it has significantly more integer and
floating-point hardware, which yields improved overall performance. Given that
it's produced with a more advanced silicon process than
the original 601, it's clocked above 80 MHz and offers blazingly fast
computation for your code.
As a backbone for the chip, the instruction issuing and control logic allow the
604 to issue up to four instructions per clock, compared to the 601's and 603's
effective three. As mentioned above, however, its pipeline has one extra decode
stage and branches are issued and handled in their own branch unit. To help it
speculatively execute more instructions than the other chips, it also comes
with twice as many "rename" registers as the 603. Twelve extra
general-purpose and eight extra floating-point registers are available to hold
speculatively produced results until a branch commits. The 604 is also the
first PowerPC processor that can speculatively execute two branches at once.
This, combined with advanced branch prediction, should keep the processor
screaming even through complex code flow.
What most people will notice, however, is the additional integer math
performance on the 604. At any one time, the 604 can have two add-subtract
instructions and one multiply-divide instruction completing in a cycle. IBM
says that it therefore has three integer units, but the multiply-divide
hardware is also used for logical and bit manipulation operations. The bottom
line is much better integer performance than the Power Macintosh 8100/80. As an
example of this, the following code should execute nearly twice as fast on the
604 as on the 601:
do {
    unsigned long datapoint;

    datapoint = *(dataarray + datasize);
    if (datapoint > kThreshold) {
        /* Make sure adding datapoint won't overflow accumulate. */
        if (datapoint > kMaxLong - accumulate)
            MyOverflowError();
        accumulate += datapoint;
        samplecount += 1;
    }
} while (datasize--);    /* walk the array backward, down to index 0 */
Looking at this code, we see a few integer operations that will be dual-issued
on the 604. As long as the datapoint values aren't too erratic, the 604 will
better predict the first if statement's branch: it will assume that the current
datapoint is on the same side of the threshold as on the previous iteration,
which in fact is where it will tend to be. And the second if statement, which
checks for an overflow, will (barring an exception) get predicted correctly out
of the loop. The 601 or 603 may predict it incorrectly. So even though one
integer unit will be busy doing the math, the overflow checking will
effectively occur without stalling the pipeline.
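To get a feel for the difference on the first if statement, here's a rough C model that counts mispredictions for a stream of datapoint values under both schemes. It rests on several assumptions that aren't in the column: the function and type names are hypothetical, the compiler is assumed to turn the test into a forward branch that skips the body when the datapoint is at or below the threshold, the 601/603 guess is assumed to be made with the hint bit clear, and the two-bit history is assumed to start at "strongly not taken". It also walks the data forward rather than backward, which doesn't change the counting.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    unsigned static_misses;   /* fixed guess: the forward branch falls through */
    unsigned dynamic_misses;  /* guess taken from a two-bit history */
} MissCounts;

/* Hypothetical model of the branch for `if (datapoint > kThreshold)`. */
MissCounts count_threshold_misses(const unsigned long *data, size_t count,
                                  unsigned long threshold)
{
    MissCounts misses = {0, 0};
    unsigned state = 0;  /* 0..3; states 2 and 3 predict taken */

    for (size_t i = 0; i < count; i++) {
        /* Assumed code layout: the branch is taken to skip the body. */
        bool taken = !(data[i] > threshold);

        /* Static scheme: a forward branch is always predicted not taken. */
        if (taken)
            misses.static_misses++;

        /* Dynamic scheme: predict from the history, then update it. */
        if ((state >= 2) != taken)
            misses.dynamic_misses++;
        if (taken && state < 3)
            state++;
        else if (!taken && state > 0)
            state--;
    }
    return misses;
}
```

Under these assumptions, data that stays on one side of the threshold for long runs costs the dynamic scheme a couple of misses at each change of direction, while the static scheme misses every time the branch goes against its fixed guess.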
The floating-point hardware was also supercharged. On the 601 and 603
processors, a single-precision floating-point instruction can issue and
complete each cycle, but double-precision numbers take twice as long. The 604
allows one full double-precision multiply-add instruction to be issued and one
to complete each cycle. The chip is twice as fast as the 601 and 603 for these
double-precision calculations.
THE FUTURE IS IN THE STARS
So can Power Macintosh tell your future? It certainly tries to with the
prediction techniques described above, and in doing so yields better
performance. With the simple methods of the 601 and 603, or the dynamic
prediction of the 604, your Power Macintosh will speculatively execute your
code with seemingly psychic results.
What about the future of the Power Macintosh? The PowerPC architecture allows
excellent growth. When I saw the specifications for the first processor, the
601, I was very impressed. It's an excellent design and it has proven to be a
potent engine for the Macintosh. When I saw the specifications for the follow-on chips, however, I
was really blown away. The 603 and 604 offer incredible performance for the
price, and prove that the PowerPC architecture scales well both into
low-cost/low-energy solutions and to the cutting edge in performance. And the
technology applied to the 604 can be expanded in future chips, adding more
execution units and advanced caches at higher clock speeds. The latest IBM
POWER2 processors can issue two load/store, two logic/branch, two
floating-point, and two integer instructions per cycle. These processors point
to the future of PowerPC performance.
So without any additional tuning on your part, PowerPC will continue to improve
your performance in the future. I also feel compelled to reiterate this advice
from my previous columns: tune your critical code. Tuning often sacrifices
code readability and maintainability for performance, so carefully choose which
code to tune and use code profilers (and the stars?) to guide your way.
DAVE EVANS (Aquarius, January 20-February 18) Look for opportunities to
communicate. You are bound to have fun. Love
is in the air; don't work too much or you'll miss it. Apple continues to hold
promise for you. Compatible with Sagittarius.
Thanks to Phil Sohn, Peter Steinauer, and Eric Traut for reviewing this
column.
This page was last modified on Sunday, April 06 1997 04:24