AltiVec Revealed

Volume Number: 15 (1999)
Issue Number: 7
Column Tag: Into The Hardware

By Tom Thompson

This extension to the PowerPC instruction set promises high performance for multimedia, communication, and 3D graphics applications

Introduction

The Macintosh now handles richer and more diverse types of information than ever. A partial list would include: displaying 3D graphics for scientific applications and games, capturing data from a digital video camera, decoding and displaying MPEG-2 video from a training DVD, arranging and maintaining a streaming video session, and using VoIP (voice over IP) to implement a conference call.

For these tasks, a Mac must perform a significant amount of real-time data processing. Aggravating the situation, the interfaces that supply this time-critical information are a lot faster than they were just a year ago: the Mac's standard Ethernet interface now operates at 100 Mbps and there's FireWire, which blasts data about at 100- to 400-Mbps rates.

Thus far, the PowerPC processors in these computers have delivered the goods through sheer computational brawn. It also helps that the PowerPC instruction set provides a number of Digital Signal Processing (DSP)-style operations and a non-precise floating-point mode that speeds arithmetic computations. Such capabilities allowed a first-generation, 80 MHz PowerPC 601 to implement a V.32 modem and a speech recognition engine entirely in software.

Today's third-generation PowerPC 750 (a.k.a. G3) has more execution units, larger on-chip caches, support for a high-speed backside L2 cache, and operates at higher clock speeds. (The clock speed of today's systems currently hovers around 450 MHz.) These features endow the Mac with the processing muscle to handle many of the multimedia and communications chores just described. As this type of work becomes the norm rather than the exception, however, such demanding jobs will tax even the capabilities of this processor.

To handle this growing category of applications, in 1997 Motorola announced the first major extension to the PowerPC since the architecture was conceived in 1991. Termed AltiVec, it is a technology that performs high-speed hardware-based data manipulation of vectors. (A vector is a contiguous list of data elements that, from the programmer's point of view, can be considered a one-dimensional array.) AltiVec works with vectors that are a fixed 128 bits in length, a size that's sufficient to store sixteen 8-bit numbers or four 32-bit numbers.

The data arrays manipulated by communications and graphics algorithms are -- at the machine code level -- represented as vectors. Since AltiVec manipulates vectors in hardware, the technology can significantly accelerate such algorithms. Put another way, AltiVec enhances the PowerPC's ability to handle vector operations, similar to how the Floating-Point Unit (FPU) boosts the speed of floating-point operations. AltiVec has over 160 PowerPC-compatible instructions that perform various arithmetic and logical operations on a vector's contents. In addition, AltiVec instructions can manipulate both integer and floating-point data, unlike Intel's MMX technology that is restricted to integer data.

The AltiVec technology will first appear in a fourth-generation PowerPC processor, the G4. The G4 has been sampling in quantity since late last year, and should appear in Macs early next year. Given the lead times for becoming familiar with a new technology and revising your software to take advantage of it, now's the time for MacTech readers to take a serious look at AltiVec.

Something Old, Something New

To fully appreciate how the AltiVec technology fits into PowerPC architecture and operates, it's necessary to take a tour of the G4 processor itself. The G4 is a 32-bit PowerPC processor that's basically a souped-up G3 (which itself is a souped-up PowerPC 603e). Although the G4 borrows heavily from its G3 roots, significant changes in the design will allow the G4 to deliver performance that will be vastly superior to its predecessor.

As Figure 1 shows, the G4 design starts with the G3's six concurrent execution units and adds a new one. This seventh unit implements the AltiVec technology and is thus called the vector unit. I'll describe the capabilities of the vector unit shortly. The processor reuses the G3's proven integer core, which consists of two integer Arithmetic Logic Units (ALUs), and the System Unit. (The System Unit is considered an execution unit because it participates in certain integer calculations.) The Branch unit, which uses the G3's prediction logic to manage branch and jump operations, remains unchanged.

Figure 1. The G4's microarchitecture.

However, here the similarities end. The internal buses between the processor's L1 caches and many of the execution units have been expanded from 64 to 128 bits. The wider buses were necessary to support AltiVec's vector operations, but they also boost data transfers throughout the processor. The G4's FPU has been beefed up so that double-precision floating-point instructions -- not just single-precision instructions, as was the case with the G3 -- are fully pipelined. The resulting lower latency in the FPU's processing of these instructions, in combination with the wider internal buses, should enable the G4 to accelerate many scientific applications that rely heavily on double-precision floating-point calculations. The processor's backside L2 cache interface is now 128 bits wide and supports 128-bit transfers. The maximum size of the L2 cache is doubled to 2 MB.

To effectively process real-time or multimedia data, the G4 must obtain it from main memory at a steady rate. To ensure that this occurs, the processor has four software-controlled prefetch engines, called Data Stream (DS) channels. Each engine operates independently of the others, and uses the processor's empty (idle) bus cycles to transfer data into the L1 and L2 caches. You use a set of Data Stream Touch (DST) instructions to initiate a transfer, and it proceeds automatically without further program intervention.

Finally, the G4 brings back the multiprocessor (MP) support that's absent in the G3. To simplify the G3's bus design, a fourth "shared" state was removed from its cache coherency protocols. This shared state is necessary to implement the shared data regions that MP systems rely on for coordinating activities and exchanging information. Building a multiprocessor system out of more than two G3s required the addition of glue logic, which complicates an MP system design and increases its cost.

The G4's bus uses a 5-state cache coherency protocol. This includes the standard four states known as MESI (modified/exclusive/shared/invalid), plus a new "recent" state. The recent state implements direct data transfers between processor caches. Up to four G4s and their L2 caches can be assembled into a multiprocessor array without glue logic, which makes it easy to build MP systems.

Motorola fabricates the G4 with its HIP5 0.22-micron, six-metal layer CMOS process that uses copper as the interconnecting material. This process technology allows Motorola to pack the G4's 10.5 million transistors onto an 83 mm² die. (For comparison, Intel's Deschutes version of the Pentium II die occupies 118 mm², and the Katmai Pentium III die weighs in at 128 mm².) With the copper traces and a lower operating voltage of 1.2 Volts, the G4 dissipates less than 8 Watts at 400 MHz. Despite the higher clock speed and additional transistors, the G4's power consumption compares favorably with a G3's, which operates at 3.3 Volts and dissipates 5 Watts at 250 MHz.

Initial versions of the G4 will use the same 360-pin Ball Grid Array (BGA) packaging that houses the G3. This makes the G4 pin-compatible with the G3, and allows it to be dropped into existing G3-based designs. However, this configuration limits the width of the G4's bus and L2 cache interfaces to 64 bits, which crimps the processor's throughput to external memory and peripherals. To realize its full potential, later G4s will sport the 128-bit interfaces. Depending upon when the first G4-based Macs ship, Apple engineers might use the G3 pin-compatible version of the G4 to speed system design and testing. Or, they may opt for better performance by leap-frogging to a G4 equipped with the wider paths, similar to what happened with the PowerPC 740/750 versions for G3-based systems.

To sum up, the G4's microarchitecture provides a slew of new features that will improve the performance of many multimedia applications and 3D graphics applications. The improved FPU should also make a G4-based Mac a valuable tool for complex simulations and data visualization. With its modest power consumption, Mac road warriors can expect to see the G4 at the heart of future PowerBook models. Power users also win, since the G4's multiprocessor support means that high-performance MP systems will be available to tackle heavy-duty computing jobs.

Vector Unit Overview

As Figure 1 indicates, a separate autonomous execution unit implements AltiVec's vector instructions. This vector unit has its own register file, status/control register (VSCR), and a vector save/restore register (VRSAVE). The register file consists of 32 registers that are 128 bits wide. To adequately feed the vector unit, 128-bit wide data paths link it and other execution units to the processor's L1 caches and the load/store unit. The vector unit shares few resources and communications paths with the other execution units. This eliminates situations where the vector unit must be tightly synchronized to another execution unit because it depends on data from that particular unit. Because it works in parallel with the other execution units and has its own register file, you don't have to invoke special processor mode-switching routines before using vector instructions. You can freely intermix integer, floating-point, and vector instructions in the source code without impacting a program's performance.

To further improve the vector unit's throughput, it uses a simplified, streamlined design. There's no support for misaligned data accesses, and the unit generates only a few hardware exceptions. Nor does the vector unit implement complex instructions: many of them execute in a single cycle, although certain instructions can take up to three or four cycles to execute.

Figure 2. Detail of the vector unit. All of the sub-units operate in parallel. The G4 can dispatch two vector instructions at a time to the unit.

The vector unit itself consists of parallel sub-units, as illustrated in Figure 2. Each sub-unit is tailored to handle specific instructions. A vector simple sub-unit and vector complex sub-unit handle integer vector operations. The vector floating-point sub-unit deals with the floating-point operations. The vector permute sub-unit implements a large-scale shift operation and can selectively reorder the contents of vectors. Certain application-specific versions of the G4 might have a vector unit that has different combinations of these sub-units, such as an array of vector simple sub-units.

The vector simple sub-unit executes single-cycle instructions that perform addition, subtraction, comparison, shifts, and logical operations with vectors. The vector complex sub-unit fields the compute-intensive multiply and multiply-add instructions that require several cycles to complete.

The vector floating-point sub-unit is equipped with four multiply-add devices that process four single-precision floating-point numbers simultaneously. It performs floating-point add, subtract, and multiply-add vector instructions in four cycles. Because of the parallel multiply-add devices, properly coded algorithms should execute up to four times faster than their scalar equivalents.

The floating-point sub-unit has two modes of operation: a Java mode and a non-Java mode. The Java mode provides compliance with the Java Language Specification 1. The non-Java mode provides faster results with less numeric accuracy. This latter mode is useful for real-time algorithms where response times are more critical than the data's accuracy.

In the Java mode, the floating-point sub-unit implements the default behavior for exception handling as specified by the IEEE 754 numeric standard. In addition, it supports only the standard's default rounding mode, round-to-nearest. This simplifies the floating-point sub-unit's design in that arithmetic errors generate default results and don't invoke a hardware exception, and rounding control flags aren't required. Nor does the sub-unit set any floating-point status flags in the FPU's status register. As a consequence of these simplifications, technical applications that require full compliance with the IEEE 754 standard must use the G4's FPU. However, algorithms that require only single-precision floating-point arithmetic (such as those for signal processing and 3D graphics) will work fine within these limits.

The vector permute sub-unit implements a sophisticated data-mangling instruction known as permute, which gives the unit its name. With the permute instruction, you can choose individual bytes from two source vector registers and merge them into any position within a destination vector register. In a single cycle, the permute instruction can clean up misaligned data or slip a new destination address into a network packet header.

All sub-units but the permute unit use the Single Instruction Multiple Data (SIMD) technique that enables the hardware to process a vector's data elements in parallel. This capability allows the vector unit to process 16 integer or four floating-point operations at a time. The G4 can dispatch up to two AltiVec instructions (one arithmetic/logic and one permute) to the vector unit per tick of the processor clock. Therefore, for a 400 MHz G4, the vector unit's peak performance can reach 12.8 billion integer operations per second, while floating-point calculations using multiply-add instructions can hit 3.2 GFLOPS.

Vector Instruction Overview

AltiVec instructions, like other PowerPC instructions, are a fixed 32 bits in length. The instructions typically use three operands (two source registers and one destination register), although there are the inevitable exceptions to this format. When an instruction completes, the contents of the source operands are left intact. AltiVec follows the principles of RISC in that the instructions only modify the contents of the vector registers. Vector load and store instructions must be used to transfer data between memory and these registers.

AltiVec instructions can be divided into six categories. These are:

Vector integer arithmetic operations. These instructions implement add, subtract, multiply and shift operations for integer computations, plus Boolean logic and compare operations for bit masking and program control.
Vector floating-point operations. These instructions perform calculations on vectors that contain floating-point numbers. They support add, subtract, and multiply-add operations, plus the obligatory conversions between integer and floating-point values.
Vector permute/format operations. These instructions handle sophisticated data manipulation and data replication functions. Some of them implement data packing and unpacking operations, including two instructions whose specialty is format conversions between 16-bit video pixels and 32-bit graphics pixels.
Vector load/store operations. These instructions retrieve data from memory into a vector register, or write the contents of a vector register to memory. Load and store operations often work with quadwords. However, they also support scalar (non-vector) transfers.
Memory control operations. These instructions manage the inflow of data to the processor caches. Specifically, they are the DST instructions that control the G4's prefetch engines.
Processor control operations. These instructions load and store the contents of the vector unit's control/status register.

Most of the time, you'll work exclusively with instructions in the first four categories. The latter two categories will be used in performance-critical applications where data must be readily available in the processor caches, or to switch the processor between the Java/non-Java mode.

AltiVec instructions fall into two distinct groups as to how they manipulate the vector data: intraelement operations and interelement operations. Intraelement operations take elements from the same location or position in the source registers, process them in parallel, and place the results in the same location in a destination register, as shown in Figure 3. For example, a vector add instruction takes the corresponding elements in two source registers, adds them together, and stores the sums in a destination register. The bulk of AltiVec's arithmetic and logical instructions process elements in this fashion.

Figure 3. AltiVec features vector operations that either retain the order of the elements processed (intraelement), or rearrange them (interelement).

Interelement operations take elements from different locations in the source registers, process them, and place the outcome into different locations in the destination register. The permute instruction, and its variants such as merge and splat, perform interelement operations.

In short, intraelement operations fetch, process, and store data elements without disturbing the order of the data. Interelement operations fetch elements in any order, process them, and store the results in any order. By organizing the instructions this way, Motorola could use separate sub-units to implement the instructions. The separate sub-units improve the parallel processing ability of the vector unit, thus improving its throughput.

AltiVec introduces 162 new instructions and four new data types. These data types consist of packed data elements that occupy a 128-bit quantity called a quadword. Quadwords represent the contents of memory or vector registers. As shown in Figure 4, quadwords can be divided into vectors composed of 16 bytes (8 bits), or 8 half-words (16 bits), or four words (32 bits). The first two formats represent only packed integer data. The last format represents the two other data types: either four packed 32-bit integers or four packed single-precision floating-point numbers. The floating-point data conforms to the IEEE 754 standard.

Figure 4. Data types supported by AltiVec instructions. Quadwords represent the contents of memory or vector registers. Quadwords can be used for the bulk transfer of any type of data.

The byte-ordering of the packed data can be either little-endian (Intel x86 layout) or big-endian (Motorola 68K and PowerPC layout). The default memory addressing mode is big-endian byte ordering. The PowerPC can be programmed to operate in little-endian mode for use in an embedded application (perhaps a network switch) whose design is based on little-endian peripheral devices. In either addressing mode, the AltiVec instructions properly manipulate the data.
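As a rough illustration of the layouts in Figure 4, the sketch below models a quadword as 16 bytes in memory order and reads it back as half-words and words using big-endian ordering, the PowerPC default. The alias Quadword and the two helper functions are invented for this example; they are plain C++ and not part of any AltiVec programming interface.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical model of a 128-bit quadword: 16 bytes in memory order.
using Quadword = std::array<std::uint8_t, 16>;

// Read half-word element i (0..7), assuming big-endian byte ordering.
std::uint16_t halfword_at(const Quadword& q, std::size_t i) {
    return static_cast<std::uint16_t>((q[2 * i] << 8) | q[2 * i + 1]);
}

// Read word element i (0..3), assuming big-endian byte ordering.
std::uint32_t word_at(const Quadword& q, std::size_t i) {
    return (static_cast<std::uint32_t>(q[4 * i]) << 24) |
           (static_cast<std::uint32_t>(q[4 * i + 1]) << 16) |
           (static_cast<std::uint32_t>(q[4 * i + 2]) << 8) |
            static_cast<std::uint32_t>(q[4 * i + 3]);
}
```

The same 16 bytes therefore yield 16, eight, or four elements depending only on how an instruction chooses to interpret them.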

Integer Operations

As the data types indicate, the integer instructions work with vectors whose elements can be bytes, half-words, or words. For situations where an algorithm performs numerous computations with the elements so that arithmetic overflow is an issue, AltiVec provides two modes to deal with it. The first is the saturate mode, where the value of the offending number is held constant (or clamped) to the minimum or maximum value that the integer can represent. When an overflow occurs, this mode also sets a saturation bit in the VSCR. Algorithms can poll this bit to either apply a correction or bail out when saturation is detected. The other method, the modulo mode, uses modulo arithmetic to truncate the result so that it "wraps around" into a representable value. The saturate mode is useful for algorithms that work with pixel data, while the modulo mode is practical for managing the indexes to arrays such as data buffers or look-up tables.
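To make the two overflow modes concrete, here is a scalar C++ sketch of a 16-element unsigned byte add in each mode. It only models the arithmetic; the names Bytes16, add_modulo, and add_saturate are made up for this illustration, and the bool parameter merely stands in for the VSCR saturation bit.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Bytes16 = std::array<std::uint8_t, 16>;

// Modulo mode: each 8-bit sum wraps around (300 becomes 44).
Bytes16 add_modulo(const Bytes16& a, const Bytes16& b) {
    Bytes16 r{};
    for (std::size_t i = 0; i < r.size(); ++i)
        r[i] = static_cast<std::uint8_t>(a[i] + b[i]);
    return r;
}

// Saturate mode: each sum is clamped to 255. The flag is only ever set
// here, never cleared, so the caller decides when to reset it.
Bytes16 add_saturate(const Bytes16& a, const Bytes16& b, bool& saturated) {
    Bytes16 r{};
    for (std::size_t i = 0; i < r.size(); ++i) {
        unsigned sum = static_cast<unsigned>(a[i]) + b[i];
        if (sum > 255u) {
            sum = 255u;
            saturated = true;
        }
        r[i] = static_cast<std::uint8_t>(sum);
    }
    return r;
}
```

Brightening a pixel value of 200 by 100 gives 255 in saturate mode (white stays white), whereas modulo mode would wrap it to 44 and produce a dark speckle.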

To perform a vector integer operation, you specify the type of arithmetic function required, the size of the data element to be manipulated, and how the vector unit deals with overflows. The choice of instruction provides these details to the vector unit. For example, to subtract a vector composed of signed bytes from another one while using the saturation mode, you'd use the vector subtract saturated instruction, vsubsbs. The instruction mnemonic's root, vsub, selects the arithmetic operation, while the trailing characters indicate that the operation works with signed integers (s) whose size is a byte (b), and uses the saturate mode (s) for any overflow. Most of the instructions that process signed integers have counterparts that deal with unsigned integers, so to perform the same vector subtract operation with unsigned integers, you'd use vsububs. To manage the subtraction with unsigned word quantities, you'd choose the vsubuws instruction.

As is becoming apparent, AltiVec actually doesn't implement a large number of unique instructions. Instead, many of them are variations on a smaller number of arithmetic operations. These variations give you fine-grained control in selecting what quantities the SIMD operation works with, and how it deals with an overflow situation.

Besides add, subtract, and multiply instructions, there are a number of specialized instructions whose capabilities are useful for communications and multimedia jobs. For example, a vector average instruction (vavg) that adds the respective elements of two vectors and shifts them right by one, can accelerate video data decoding. A pair of instructions that determine the maximum and minimum values of a vector (vmax and vmin, respectively), can boost the performance of search and filter algorithms.
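The element-wise behavior of these specialized instructions can be sketched in scalar C++ as follows. The function names are hypothetical, and the model covers only unsigned bytes. Note one assumption: this sketch adds a rounding term of 1 before the shift in the average, which is how the instruction is usually documented, rather than the plain add-and-shift described above.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

using Bytes16 = std::array<std::uint8_t, 16>;

// Element-wise average of unsigned bytes, rounding up (assumed detail).
Bytes16 average_bytes(const Bytes16& a, const Bytes16& b) {
    Bytes16 r{};
    for (std::size_t i = 0; i < r.size(); ++i)
        r[i] = static_cast<std::uint8_t>((a[i] + b[i] + 1) >> 1);
    return r;
}

// Element-wise maximum: each result element is the larger of the pair.
Bytes16 max_bytes(const Bytes16& a, const Bytes16& b) {
    Bytes16 r{};
    for (std::size_t i = 0; i < r.size(); ++i)
        r[i] = std::max(a[i], b[i]);
    return r;
}

// Element-wise minimum: each result element is the smaller of the pair.
Bytes16 min_bytes(const Bytes16& a, const Bytes16& b) {
    Bytes16 r{};
    for (std::size_t i = 0; i < r.size(); ++i)
        r[i] = std::min(a[i], b[i]);
    return r;
}
```

Because the intermediate sum is formed at wider precision, the average of 255 and 255 is still 255 rather than an overflowed value.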

Capable Compares

The vector integer instructions also offer a suite of logical and arithmetic comparison instructions. Like their arithmetic siblings, these instructions work with elements that are 8, 16, or 32 bits in size. AltiVec supports equal-to or greater-than tests for integer compares. Other comparison tests can be cobbled out of operations that logically combine and invert the results of these tests.

These instructions compare the elements of two source registers, and place a TRUE value (all 1s) or FALSE value (all 0s) into the corresponding element of the destination register. The destination register's values can be used in subsequent vector operations. For example, a vector compare operation can create a bitmask that controls a vector select instruction (vsel). This instruction copies and merges bits from two source registers into a destination register, as selected by the bits in the control register. A bit value of 0 selects one source register, while a bit value of 1 selects the other.

The vector compare and vector select instructions can be a potent combination for manipulating graphics data. Normally, to combine two data sources using a bitmask, you must write a loop that tests the bitmask and then executes a branch to code that copies data from the proper source. As depicted in Figure 5, a sequence of vector compare and vector select instructions can generate a bitmask and merge two data sources without using branch instructions. This both simplifies and accelerates algorithms that perform 3D clipping or chroma-keying (where video from one source is overlaid on another source).

Figure 5. Vector compare instructions let you generate bitmasks that can be used by a subsequent vector select instruction to combine data. The combination of these two instructions lets you quickly intermix data without resorting to branch instructions that might cause cache misses.
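The compare-then-select idiom can be modeled in scalar C++ to show why no branch is needed. This is only a sketch of the data flow: compare_equal and select_bits are hypothetical names, and the loops stand in for what the vector unit does to all 16 bytes at once.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Bytes16 = std::array<std::uint8_t, 16>;

// Compare: each result byte is all 1s (0xFF) where a == b, else all 0s.
Bytes16 compare_equal(const Bytes16& a, const Bytes16& b) {
    Bytes16 mask{};
    for (std::size_t i = 0; i < mask.size(); ++i)
        mask[i] = (a[i] == b[i]) ? 0xFF : 0x00;
    return mask;
}

// Select: a mask bit of 0 takes the bit from a, a mask bit of 1 from b.
Bytes16 select_bits(const Bytes16& a, const Bytes16& b, const Bytes16& mask) {
    Bytes16 r{};
    for (std::size_t i = 0; i < r.size(); ++i)
        r[i] = static_cast<std::uint8_t>((a[i] & ~mask[i]) | (b[i] & mask[i]));
    return r;
}

// Chroma-key sketch: wherever the foreground equals the key value,
// show the background instead. No per-element decision is branched on.
Bytes16 chroma_key(const Bytes16& fg, const Bytes16& bg, const Bytes16& key) {
    return select_bits(fg, bg, compare_equal(fg, key));
}
```

The merge is pure bitwise arithmetic on the mask, so the same instruction sequence runs regardless of the data.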

In situations where the result of a sequence of comparisons is required, variants of the comparison instruction set bits in the PowerPC's condition code register (CR). Bit 24 in the CR is set if the result of a vector comparison is true (the destination register's elements contain all 1s), and bit 26 is set if the vector comparison is false (the elements contain all 0s). Conventional PowerPC branch instructions can use these bits to implement decision-making functions in 3D lighting effect or 3D clipping accept/reject algorithms.

For Boolean operations, AltiVec supports OR, XOR, AND, NOR, and AND-with-complement operations on vector elements.

Floating-Point Instructions

The floating-point instructions perform SIMD operations with vectors composed of four single-precision numbers. That's no surprise, since the IEEE standard sets the size of a single-precision number at 32 bits, and an AltiVec vector register only holds four values of this size.

You have the usual repertoire of add, subtract, and multiply-add operations for manipulating floating-point vectors. The multiply-add instruction (vmaddfp) performs a vector multiply, followed by a vector add of the multiply's result. This instruction is valuable for writing DSP algorithms where such instruction sequences are common. There are also instructions that handle rounding, compares, estimates, and floating-point/integer conversions.

For floating-point comparisons, AltiVec supports greater than, equal, and greater-than-or-equal operations. The latter instruction was provided because use of the other two instructions alone could not produce the required IEEE result when the order of the operands changed in the expression stating the logical relation.

The vector unit's performance-driven, simplified design introduces some quirks in the floating-point instruction set. There isn't a dedicated vector multiply instruction: instead, you use a multiply-add instruction with the addend register initialized to -0.0. (The negative sign ensures that the instruction performs an IEEE- or Java-compliant computation whose product has the proper zero sign.)
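The reason for the negative zero comes from IEEE 754 addition under round-to-nearest: (-0) + (+0) yields +0, while (-0) + (-0) yields -0. A scalar C++ sketch of a single element shows the effect; madd and mul_via_madd are hypothetical helpers standing in for one lane of the vector multiply-add.

```cpp
#include <cmath>

// Scalar stand-in for one lane of a multiply-add: a * b + c.
float madd(float a, float b, float c) {
    return a * b + c;
}

// A multiply expressed as a multiply-add with a -0.0 addend, so that
// a product of -0.0 keeps its sign instead of turning into +0.0.
float mul_via_madd(float a, float b) {
    return madd(a, b, -0.0f);
}
```

With these definitions, mul_via_madd(-1.0f, 0.0f) keeps the negative sign of the zero product, while madd(-1.0f, 0.0f, 0.0f) loses it. Nonzero products are unaffected by either addend.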

AltiVec doesn't provide divide or square root instructions. Such instructions require considerable hardware to implement and would increase the floating-point unit's latency. AltiVec instead supplies vector reciprocal estimate (vrefp) and vector reciprocal square root estimate (vrsqrtefp) instructions. Performing a vector divide or computing a vector chock full of square roots then becomes a matter of executing a faster vector multiply instruction that uses the reciprocals generated by these instructions. The estimates are accurate to 12 bits, which is adequate for high-performance code.

To perform divisions that require higher precision, you feed the estimates returned by vrefp and vrsqrtefp into a Newton-Raphson algorithm. This algorithm iterates through a sequence of computations that successively refine the value to a higher precision. The Newton-Raphson equations to compute a reciprocal and then carry out the division operation Q = A/B resemble:

y1 = y0 + y0 * (1.0 - B * y0);		// Get reciprocal
Q = A * y1;										// Obtain quotient
R = A - B * Q;									// Compute remainder

where y0 is the initial estimate that primes the process, and y1 is the refined reciprocal, 1/B. You obtain the initial estimate by running B through the vrefp instruction. The resulting vector Q holds quotients that have nearly 24 bits of precision. For situations where the divisor B is a small number, you'll have to add safeguard code that handles vrefp generating an infinity.
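The effect of one refinement step can be checked with a scalar C++ sketch. Since vrefp itself isn't available in portable code, the hypothetical coarse_reciprocal function below fakes a 12-bit estimate by perturbing the exact reciprocal; refine_reciprocal is the first equation above applied to a single value.

```cpp
#include <cmath>

// Hypothetical stand-in for vrefp: the exact reciprocal deliberately
// degraded to a relative error of about 2^-12.
float coarse_reciprocal(float b) {
    const float exact = 1.0f / b;
    return exact * (1.0f + 1.0f / 4096.0f);
}

// One Newton-Raphson step: y1 = y0 + y0 * (1.0 - B * y0).
float refine_reciprocal(float b, float y) {
    return y + y * (1.0f - b * y);
}

// Relative error of an estimate y of 1/b, measured in double precision.
double reciprocal_error(float b, float y) {
    return std::fabs(1.0 - static_cast<double>(b) * static_cast<double>(y));
}
```

Each step roughly squares the relative error, so an estimate good to about 12 bits comes out good to nearly 24 bits after a single pass, which is why so few iterations are needed.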

If the application requires an IEEE-compliant 24-bit accurate quotient, you'll need to execute the Newton-Raphson algorithm a second time, using the statements

// Get reciprocal (1st iteration)
y1 = y0 + y0 * (1.0 - B * y0);
// Get high-precision reciprocal (2nd iteration)
y2 = y1 + y1 * (1.0 - y1 * B);

before computing the quotients and remainders. The AltiVec code for computing the 24-bit IEEE-compliant quotient and remainder thus looks like this:

// The scalar constants 1.0 and -0.0 denote vectors that hold
// that value in every element.

y0 = vec_re(B);                 // approximate 1/B

// y1 = y0*(-(y0*B - 1.0))+y0  i.e. y0+y0*(1.0 - y0*B)

y1 = vec_madd(y0, vec_nmsub(y0, B, 1.0), y0);

// Repeat the Newton-Raphson step to get the required 24 bits:
// y2 = y1*(-(y1*B - 1.0))+y1 i.e. y1+y1*(1.0 - y1*B)
// y2 is now the correctly rounded reciprocal, and the manual considers this OK for use in
// computing the remainder: Q = A*y2, R = A - B*Q

y2 = vec_madd(y1, vec_nmsub(y1, B, 1.0), y1);

// for strict Java/IEEE should use -0.0
Q = vec_madd(A, y2, -0.0);
R = vec_nmsub(B, Q, A);         // -(B*Q-A) == (A-B*Q)

In order to correctly round Q, a last adjustment Q' = Q + R * y2 is performed:

Q = vec_madd(R, y2, Q);

While these operations can be written in assembly language, it's preferable to use C because the compiler handles the construction of the vectors containing the constants -0.0 and 1.0 for you. The compiler can also schedule instructions such that the division function's instructions are load-balanced across all of the G4's execution units, while an assembly language function would execute primarily in the vector unit. This would allow the other execution units to fall idle, possibly impacting code performance.

Note the use of the vector negative multiply-subtract instruction, vnmsubfp, which is executed through the function call, vec_nmsub(). This AltiVec instruction performs the basic operation of C - A * B. It does the heavy lifting in this code sequence, and was crafted specifically to accelerate the Newton-Raphson algorithm.

Listing 1 shows functions, written in C++, that implement high-precision vector divide and remainder operations. It's worth repeating that some applications -- particularly those operating in real-time -- may not require this much precision and the value returned by the reciprocal instructions alone will suffice.

Listing 1: vec_div.cpp and vec_rem.cpp

static inline vector float vec_div(vector float A,
	vector float B)
{
     vector float y0;
     vector float y1;
     vector float y2;
     vector float Q;
     vector float R;

     y0 = vec_re(B);            // approximate 1/B

// y1 = y0*(-(y0*B - 1.0))+y0  i.e. y0+y0*(1.0 - y0*B)
     y1 = vec_madd(y0, vec_nmsub(y0, B, 1.0), y0);

// REPEAT the Newton-Raphson to get the required 24 bits
// y2 = y1*(-(y1*B - 1.0))+y1  i.e. y1+y1*(1.0 - y1*B)
     y2 = vec_madd(y1, vec_nmsub(y1, B, 1.0), y1);

// y2 is now the correctly rounded reciprocal, and the manual considers this
// OK for use in computing the remainder: Q = A*y2, R = A - B*Q

     Q = vec_madd(A, y2, -0.0);  // -0.0 for strict Java/IEEE
     R = vec_nmsub(B, Q, A);     // -(B*Q-A) == (A-B*Q)

// final rounding adjustment: Q' = Q + R*y2
     return(vec_madd(R, y2, Q));
}

static inline vector float vec_rem(vector float A, vector float B)
{
     vector float y0;
     vector float y1;
     vector float y2;
     vector float Q;

     y0 = vec_re(B);             // approximate 1/B

// y1 = y0*(-(y0*B - 1.0))+y0  i.e. y0+y0*(1.0 - y0*B)
     y1 = vec_madd(y0, vec_nmsub(y0, B, 1.0), y0);

// REPEAT the Newton-Raphson to get the required 24 bits
// y2 = y1*(-(y1*B - 1.0))+y1  i.e. y1+y1*(1.0 - y1*B)
     y2 = vec_madd(y1, vec_nmsub(y1, B, 1.0), y1);

// y2 is now the correctly rounded reciprocal, and the manual considers this
// OK for use in computing the remainder: Q = A*y2, R = A - B*Q

     Q = vec_madd(A, y2, -0.0);   // -0.0 for strict Java/IEEE
     return(vec_nmsub(B, Q, A));  // -(B*Q-A) == (A-B*Q)
}

Mangling Bits

Thus far, the AltiVec instructions we've examined perform intraelement operations. That is, they retain the order of the elements that they process. By itself, the ability of the intraelement instructions to execute up to 16 integer and 8 floating-point operations in parallel (counting each of four multiply-adds as two operations) makes AltiVec a powerful tool for writing high-performance programs. However, it is the interelement instructions, the ones that process and reorganize vector elements, that give the AltiVec technology some unique capabilities. A programmer can use these instructions to readily implement matrix arithmetic, write data convolution/deconvolution algorithms, or modify network packets on the fly.

By far the most powerful of these instructions is the one you've heard about already, the permute instruction (vperm). This instruction takes arbitrary bytes from two source registers and stores them into different positions in a third destination register. Bytes from either source register can be replicated and stored in the destination as well. The contents of a fourth operand, the control vector, inform the permute sub-unit where each byte is needed. As shown in Figure 6, the control vector holds 16 bytes. The upper nibble of each byte specifies the source register, while the lower nibble selects a particular byte in that source register. The order of the bytes in the control vector indicates the order that the chosen bytes occupy in the destination register. In just a few cycles, a properly coded vperm can extract a packet's destination address or implement a fast look-up mechanism for small tables. This instruction will often be used to clean up misaligned data, as we'll see later.

Figure 6. The vector permute instruction takes arbitrary bytes from two source registers and places them in any position in a destination register.
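To make the byte selection concrete, here is a minimal scalar sketch of what the permute does. It is plain C, not AltiVec code, and the names perm_model and table_lookup32 are made up for this illustration. It assumes the selection rule described above: within each control byte, one bit picks the source and the low four bits pick the byte.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of a permute (illustrative only, not the real instruction).
   For each destination byte, the matching control byte chooses a source
   byte: bit 4 picks the source (0 = a, 1 = b) and the low four bits pick
   the byte within that source. The upper three bits are ignored. */
void perm_model(const uint8_t a[16], const uint8_t b[16],
                const uint8_t control[16], uint8_t dest[16])
{
    uint8_t result[16];              /* temporary, so dest may alias a source */
    for (int i = 0; i < 16; i++) {
        uint8_t sel = control[i] & 0x1F;
        result[i] = (sel & 0x10) ? b[sel & 0x0F] : a[sel & 0x0F];
    }
    memcpy(dest, result, sizeof result);
}

/* Small-table look-up built on the permute model: the two sources hold a
   32-entry byte table, and each index byte fetches one entry. */
void table_lookup32(const uint8_t table[32], const uint8_t index[16],
                    uint8_t out[16])
{
    perm_model(table, table + 16, index, out);
}
```

On a G4, the whole loop collapses into a single vperm. The model is only meant to show which byte lands where.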

Many of the interelement instructions are variants of vperm that serve specialized purposes. A vector merge instruction (vmrg) takes alternating byte, halfword, or word elements from two source registers and interleaves them into a destination register. A vector splat (vsplt) takes a selected element and replicates its value throughout a destination register. The resulting vector is useful for scaling an array's data. The vector pack and unpack instructions (vpk and vupk) can condense and expand data. The most notable of these are the vpkpx and vupkpx instructions, which handle conversions between 16-bit video pixels (three 5-bit color components, plus an alpha channel bit) and 32-bit graphics pixels (three 8-bit color components, and a byte of alpha channel information).
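As a rough illustration of the pixel pack, the sketch below converts 32-bit pixels to the 16-bit 1/5/5/5 format in plain C. The function names are hypothetical. The bit selection (the low-order bit of the alpha byte, and the five most significant bits of each color byte) is my reading of the pack-pixel description and should be checked against the Programming Environments Manual.

```c
#include <stdint.h>

/* Pack one 32-bit pixel (alpha byte, then 8-bit red, green, blue) into a
   16-bit 1/5/5/5 pixel. Assumption: alpha contributes its low-order bit,
   and each color keeps its five most significant bits. */
uint16_t pack_pixel_1555(uint32_t argb)
{
    uint16_t alpha = (uint16_t)((argb >> 24) & 0x01);
    uint16_t red   = (uint16_t)((argb >> 19) & 0x1F);
    uint16_t green = (uint16_t)((argb >> 11) & 0x1F);
    uint16_t blue  = (uint16_t)((argb >>  3) & 0x1F);
    return (uint16_t)((alpha << 15) | (red << 10) | (green << 5) | blue);
}

/* Model of packing two four-pixel source vectors into one eight-pixel
   result: the first source fills the first half of the destination. */
void pack_pixels_model(const uint32_t a[4], const uint32_t b[4],
                       uint16_t dest[8])
{
    for (int i = 0; i < 4; i++) {
        dest[i]     = pack_pixel_1555(a[i]);
        dest[i + 4] = pack_pixel_1555(b[i]);
    }
}
```

The vector instruction does all eight conversions at once, which is why it suits video work where every pixel in a frame needs the same treatment.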

Other interelement instructions help implement matrix arithmetic. For example, the vector multiply-sum instruction (vmsum) multiplies the elements of two vectors, then sums the outcome of this operation with a third vector's elements. A vector sum (vsum) instruction adds the elements of one vector together, and then adds this sum to a value in a second vector. A vector multiply odd (vmulo) multiplies the odd-addressed elements of two source vectors together, and its counterpart, vector multiply even (vmule), multiplies the even-addressed elements.

Algorithms that perform data compression/decompression, digital filtering, signal/image processing, and 3D graphics transformations do so by heavily manipulating the contents of 2D and 3D matrices. The AltiVec interelement instructions are a perfect fit for these types of operations. As Figure 7 shows, a vector dot product, a matrix operation that's the staple of many scientific and engineering applications, can be written with two instructions, vmsum and vsum. The vector merge instruction can be used to write a function to transpose a matrix. The permute instruction can be used to manage low-layer network protocols. Finally, the vector multiply even/odd instructions can help implement data convolution/deconvolution algorithms for signal processing. They can also be used to implement high-precision matrix arithmetic that produces 64-bit results. (The vmsum instruction works with half-word elements, producing 32-bit results.)

Figure 7. The vector multiply-sum and vector sum instructions can readily implement the vector dot product, an operation often used in scientific and engineering applications.
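The two-step dot product in Figure 7 can be sketched in scalar C as follows. The function names are invented for this model, and it ignores the saturating and modulo overflow behavior of the real instructions, so it assumes the inputs are small enough that the sums fit in 32 bits.

```c
#include <stdint.h>

/* Model of multiply-sum on signed halfwords: each 32-bit result lane gets
   the sum of two adjacent 16 x 16-bit products plus the matching lane of
   the accumulator vector. */
void msum_model(const int16_t a[8], const int16_t b[8],
                const int32_t acc[4], int32_t dest[4])
{
    for (int i = 0; i < 4; i++) {
        dest[i] = acc[i]
                + (int32_t)a[2 * i]     * b[2 * i]
                + (int32_t)a[2 * i + 1] * b[2 * i + 1];
    }
}

/* Model of sum-across: adds the four lanes of v to a starting value. */
int32_t sum_across_model(const int32_t v[4], int32_t start)
{
    return start + v[0] + v[1] + v[2] + v[3];
}

/* Dot product of two eight-element halfword vectors in two steps. */
int32_t dot_product8(const int16_t a[8], const int16_t b[8])
{
    const int32_t zero[4] = {0, 0, 0, 0};
    int32_t partial[4];
    msum_model(a, b, zero, partial);      /* eight multiplies, four sums */
    return sum_across_model(partial, 0);  /* collapse the four lanes */
}
```

For longer arrays you would keep feeding the partial sums back in as the accumulator and perform the sum-across only once, at the end.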

As the vector dot product example shows, AltiVec's capabilities let you write complex code with a minimum of instructions. Better still, because the SIMD instructions process the data in parallel, the code is often fast enough to serve in certain real-time applications.

Load that Quad, Shift that Byte

AltiVec provides a number of load/store instructions to transfer data in and out of the vector registers. The load vector indexed (lvx, lvxl) and store vector indexed (stvx, stvxl) instructions transfer 128-bit quadword quantities between memory and the AltiVec registers. Two source registers specify the effective address of the memory location that's the target of the operation. The first source register is typically an offset value, while the second register holds a base address (a pointer).

The lvx, lvxl, stvx, and stvxl instructions make no assumptions about the quadword's contents, so you can use them for bulk transfers of all types of data, as well as vectors. This differs from Intel's MMX technology, which restricts memory/MMX register transfers to integer data.

The load vector element indexed (lvebx, lvehx, lvewx) and store vector element indexed (stvebx, stvehx, stvewx) instructions transfer byte, half-word, and word elements between memory and a particular location in a vector register. As with the vector integer instructions, you use the appropriate instruction variant to select the element size. To store a word element, for instance, you'd use the stvewx instruction. The least significant bits of the memory location's effective address are used to construct an offset into the quadword. This offset references the desired vector element within the quadword.

Note that when you load a vector element from memory into a vector register, the contents of the other elements in the register are set to undefined values. However, if you store a vector element from a vector register into memory, the contents of the adjacent memory elements in the quadword are left intact. This means that you can use vector element stores to assemble data structures in memory, but not the other way around. In the case of the vector load, you will probably follow it with a vector splat instruction to replicate the value throughout the vector register, either for a matrix scaling operation or to construct a constant.
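The way the effective address picks the element can be modeled in a few lines of plain C. This is a sketch under stated assumptions: memory is a byte array, the effective address is an offset into it, and the function names are hypothetical. The point it shows is that a word-element store changes only four bytes of the quadword.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Which word element a word-element access uses: the address's offset
   within its quadword, divided by the element size (4 bytes). */
unsigned word_element_index(size_t ea)
{
    return (unsigned)((ea >> 2) & 3u);
}

/* Model of a word-element store. The address is first truncated to a word
   boundary; the element it selects is then copied from the register model
   to memory. The other twelve bytes of the quadword are untouched. */
void store_word_element_model(uint8_t *memory, size_t ea,
                              const uint32_t reg[4])
{
    size_t aligned = ea & ~(size_t)3;
    unsigned idx = word_element_index(aligned);
    memcpy(memory + aligned, &reg[idx], sizeof reg[idx]);
}
```

Four such stores to consecutive word addresses would assemble a full quadword in memory, one element at a time.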

I've often mentioned that the vector unit doesn't support misaligned data. In fact, when you present an AltiVec load/store instruction with a misaligned address, the vector unit ignores the low-order bits in the address and instead accesses the data starting at the data type's natural boundary. (A boundary is a memory location whose address is an integral multiple of the data element's size. For example, quadword boundaries are memory locations whose addresses are a multiple of sixteen. That is, the four least significant bits of a quadword boundary's address are zeros.) This was done to simplify the vector unit's design and reduce the time required to perform memory accesses.

In the real world, however, data often winds up in weird memory locations. For example, a window's content area as it resides in the video display's frame buffer may not fall on a memory boundary, or a driver for a data acquisition peripheral may have its own ideas about data alignment. Still, you need AltiVec to perform heavy-duty processing with this data. Fortunately, Motorola's engineers don't leave you in the lurch here: there are AltiVec instructions whose purpose is to adjust the data after it has been read into registers or before it is written to memory.

The best way to describe these instructions' behavior is with an example. Figure 8 should help illustrate the process.

Figure 8. With a few instructions, you can retrieve and clean up misaligned vector data.

Suppose you want to read a vector of misaligned data (that is, the vector straddles a quadword boundary). Since the vector unit only fetches aligned quadwords, first you must read the two quadwords that contain the vector's data. Next, you use the vector permute instruction to extract bytes from each quadword and reconstruct the vector. It doesn't take much thought to realize that the hard part here is to set up the control register so that vperm merges the proper bytes in the destination register. An instruction, Load Vector Shift Left (lvsl), is dedicated to performing this job. You provide it the address of the misaligned quadword, and it generates a control vector for vperm. Vperm then performs what amounts to a "super shift" left of the concatenated quadwords. A similar instruction, Load Vector Shift Right (lvsr), generates a control vector for right shifting the vector data.

Listing 2 shows the code for cleaning up the misaligned quadword.

Listing 2

vector unsigned char highQuad, lowQuad, controlVect, destVect;
unsigned char * vPointer;

// Fetch quadword with most significant bytes of misaligned vector
highQuad = vec_ld(0, (unsigned char *) vPointer);

// Make control vector for permute op
controlVect = vec_lvsl(0, (unsigned char *) vPointer);

// Fetch quadword with vector's least significant bytes
lowQuad = vec_ld(16, (unsigned char *) vPointer);

// Merge the two quadwords into the realigned vector
destVect = vec_perm(highQuad, lowQuad, controlVect);

For situations where the load/permute operations are part of a loop that reads streaming data, the overhead of the permute operation can be amortized over more instructions by unrolling the loop.
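If you don't have AltiVec hardware at hand, the sequence in Listing 2 can be followed with a scalar model. The sketch below is plain C with made-up function names. It treats memory as a byte array and the address as an offset into it, and it assumes the control vector produced for a left shift is simply the misalignment followed by the next fifteen byte numbers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Aligned quadword load: the low four address bits are ignored. */
void load_quad_model(const uint8_t *memory, size_t ea, uint8_t dest[16])
{
    memcpy(dest, memory + (ea & ~(size_t)15), 16);
}

/* Control vector for the left "super shift": sh, sh+1, ..., sh+15,
   where sh is the misalignment (the low four bits of the address). */
void lvsl_model(size_t ea, uint8_t control[16])
{
    uint8_t sh = (uint8_t)(ea & 15);
    for (int i = 0; i < 16; i++)
        control[i] = (uint8_t)(sh + i);
}

/* Read 16 bytes starting at any offset: two aligned loads, one control
   vector, and one permute across the concatenated quadwords. The buffer
   must extend through the second quadword. */
void load_unaligned_model(const uint8_t *memory, size_t ea, uint8_t dest[16])
{
    uint8_t both[32], control[16];
    load_quad_model(memory, ea, both);            /* high quadword */
    load_quad_model(memory, ea + 16, both + 16);  /* low quadword */
    lvsl_model(ea, control);
    for (int i = 0; i < 16; i++)                  /* the permute step */
        dest[i] = both[control[i] & 0x1F];
}
```

In a streaming loop, the low quadword of one iteration becomes the high quadword of the next, so each additional vector costs only one new load and one permute.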

The Data Stream Touch instructions (dst) enable you to speculatively prefetch large blocks of data. These instructions don't guarantee that the desired data will be in the processor cache when needed. However, the prefetch engines will take every opportunity to obtain the data during idle bus cycles.

Each dst instruction lets you specify the starting address and amount of data to be prefetched (up to a maximum of 128 KB), the unit stride used to access successive memory locations, and a channel. A channel represents one of the four prefetch engines. Normally, the prefetched data is treated as persistent information, where it remains in the cache and can be used more than once. One variant of this instruction, the Data Stream Touch Transient (dstt), marks the data as transient information. For a stream of constantly changing data that won't be reused (say, the contents of a video conference window), you'd flag the data as transient so that the G4 can discard it from the cache if needed. Another instruction variant, the Data Stream Touch Store (dsts), prefetches data with the expectation that you will write to it. This instruction enables you to build a buffer in the processor cache that holds frequently used objects. A Data Stream Stop (dss) instruction halts the specified channel, and a Data Stream Stop All (dssall) halts all four channels.
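The amount of data and the stride travel to the instruction in a single 32-bit control word. The helper below is a sketch of how such a word could be assembled in plain C. The function names are hypothetical, and the field layout (block size in 16-byte vectors, then block count, then a 16-bit stride in bytes) is my understanding of the encoding, so verify it against the Programming Environments Manual before relying on it.

```c
#include <stdint.h>

/* Build a data stream control word. Assumed layout: a 5-bit block size
   (in 16-byte vectors, 1 to 32), an 8-bit block count (1 to 256), and a
   signed 16-bit stride in bytes between blocks. A field value of zero
   stands for the maximum, which is why 32 and 256 are masked down to 0. */
uint32_t dst_control_word(unsigned vectors_per_block, unsigned block_count,
                          int16_t stride_bytes)
{
    return ((uint32_t)(vectors_per_block & 0x1Fu) << 24)
         | ((uint32_t)(block_count & 0xFFu) << 16)
         | (uint32_t)(uint16_t)stride_bytes;
}

/* Total bytes one stream asks the prefetch engine to touch. */
uint32_t dst_stream_bytes(unsigned vectors_per_block, unsigned block_count)
{
    return 16u * vectors_per_block * block_count;
}
```

With the largest block size and count, dst_stream_bytes(32, 256) comes to 131,072 bytes, which matches the 128 KB ceiling mentioned above.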

Where the Competition Stacks Up

The comparison to Intel's MMX and Streaming SIMD Extensions (SSE) is inevitable. Table 1 shows how they compare to AltiVec. While SSE does have 128-bit wide registers, internally Katmai, the first member of the Pentium III family, double-cycles the processor's existing 64-bit data paths to achieve the 128-bit transfers. While this avoided a major overhaul of the Pentium III's architecture, the trade-off is that it constricts SSE's throughput. SSE's new integer SIMD instructions use the 64-bit MMX registers, so these instructions can handle at most 8 byte operations at a time, versus AltiVec's 16. Clock for clock, AltiVec simply outguns SSE in integer performance. My score: AltiVec: 1, MMX/SSE: 0.

Table 1. AltiVec, MMX, and SSE compared.

Instruction extension                   AltiVec         MMX                         SSE
Number of registers                     32              8                           8
Vector size (bits)                      128             64                          128
Register file                           separate        aliased onto FPU registers  separate
Internal data path size (bits)          128             64                          64
Integer sizes supported                 8/16/32         8/16/32                     8/16/32
Maximum SIMD elements (integer)         16x8            8x8                         8x8 (uses MMX registers)
Floating-point size (bits)              32              N/A                         32
Maximum SIMD elements (floating-point)  4x32            N/A                         4x32
IEEE-754 compliance                     Java 1 subset   N/A                         full

For floating-point computations, the playing field initially appears more level between the two instruction set extensions: both can process four single-precision floating-point numbers simultaneously. However, Katmai's 64-bit paths have an effect here as well: at 550 MHz, Intel claims SSE delivers a peak performance of 2.2 GFLOPS. AltiVec, by contrast, hits a peak of 3.2 GFLOPS at 400 MHz. SSE does offer full compliance with the IEEE 754 standard, including all of the rounding modes and numeric exceptions. This could give SSE an edge for scientific applications. While AltiVec's floating-point operations comply with only a subset of the standard, one of the technology's planned targets is embedded applications. These real-time applications consider performance, rather than accuracy, the real issue at this level of precision. My score: AltiVec: 1, MMX/SSE: 0. It remains to be seen whether desktop applications will require full IEEE 754 compliance for single-precision floating-point computations.
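The peak figures are easy to reproduce with back-of-the-envelope arithmetic. The sketch below is my own reconstruction, not a vendor formula: it assumes AltiVec's number counts a multiply-add as two operations on each of four elements per clock, and that the SSE number corresponds to four operations per clock.

```c
/* Peak rate in GFLOPS = elements per vector
                         x floating-point operations per element per clock
                         x clock rate in MHz / 1000 */
double peak_gflops(int elements, int flops_per_element_per_clock,
                   double clock_mhz)
{
    return elements * flops_per_element_per_clock * clock_mhz / 1000.0;
}
```

Under those assumptions, peak_gflops(4, 2, 400.0) gives the 3.2 GFLOPS quoted for AltiVec, and peak_gflops(4, 1, 550.0) gives the 2.2 GFLOPS quoted for SSE. Both are theoretical ceilings that real code rarely sustains.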

While we're on the subject of floating-point operations, recall that the MMX registers are aliased onto the FPU's 80-bit floating-point registers. That is, the processor physically maps the MMX registers to the same space as the FPU registers. This means the Pentium III can perform floating-point operations or MMX SIMD integer operations, but not both at the same time. You have to write short routines that switch the processor from the floating-point mode to the MMX mode, and back. The overhead of these switching routines isn't an issue where multimedia applications deal with predominantly integer data. However, for 3D applications that use a mix of integer and floating-point operations, the overhead of the mode-switching can add up. AltiVec has a separate register file, so no mode-switching routines are necessary.

To be fair, Intel used this sleight of hand with the MMX/FPU registers because it could leverage a multitasking OS's context-switching code to preserve the MMX graphics state. Such context-switching code normally saves and restores the floating-point registers, and so it automatically preserves any MMX data. This allowed programs to use MMX instructions without requiring an update to the operating system. To preserve AltiVec's separate register file, an update to the Mac OS is necessary.

SSE now has separate registers to perform floating-point operations. On the plus side, this eliminates the mode-switching problem, but on the negative side it does require that the OS be updated to support the new register file. Score: draw. AltiVec wins on technical merits, but SSE wins on execution (the Pentium III is shipping, while the G4 is not).

From a programmer's perspective, AltiVec makes code writing a lot easier. Its non-destructive use of source operands simplifies algorithm design because the operands are available to other instructions. (Take a look at how the arguments to the lvsl and vperm instructions are reused in Listing 2.) The larger register file allows you to store frequently used vectors, and the interelement instructions can accelerate many algorithms. Intel was careful to introduce SSE with a high-level programming interface, but an SSE instruction destroys one of its two source operands upon completion. This was done to ease pressure on SSE's smaller register file, but it makes writing algorithms more convoluted, which can require even more instructions. Score: AltiVec: 1, MMX/SSE: 0.

In summary, AltiVec wins over SSE in terms of sheer bandwidth, technical features, and programming ease. Tests by Motorola indicate that the inverse Discrete Cosine Transform (iDCT), a key algorithm in decoding MPEG-2 video, executes up to 11.4 times faster with AltiVec than a version using scalar PowerPC instructions. Conversion of RGB data to CCIR601 video executes 9.6 times faster. Anecdotal evidence from the developer community indicates that applications whose code can be vectorized have seen performance boosts of three to ten times.

Programming with AltiVec

One thing that stymied MMX's acceptance by developers was that you could write MMX code in any language that you wanted, as long as it was assembly language. Motorola has taken care to ensure that the proper interface files and libraries are in place so that you can use AltiVec from within C and C++ programs. You've already seen examples of this support in the division and remainder functions. Not only do such high-level interfaces make writing vector-based code easier, but the compiler can also schedule both conventional and AltiVec instructions so that all of the G4's execution units are kept as busy as possible.

AltiVec's high-level programming interface defines a bevy of vector data types, plus the pixel type for the pack/unpack instructions, and a slew of functions. The function names are based on the root AltiVec instruction name, and you pass source registers as arguments into these functions. The function typically returns a vector that contains the result of the operation (that is, the contents of the destination register). For example, the vector permute instruction appears as

 vector unsigned char dest, sourceA, sourceB, control;

 dest = vec_perm(sourceA, sourceB, control);

The AltiVec Technology Programming Interface Manual describes the supported functions, new vector data types, and other high-level language features available to the C/C++ programmer. You'll also want to peruse the AltiVec Technology Programming Environments Manual, which has a good technical description of the technology's capabilities and an exhaustive description of the instruction set. Both manuals are available in electronic form as Acrobat PDF files from Motorola's AltiVec Web site. The specific URLs are provided at the end of this article.

It doesn't do much good to read about programming AltiVec if you don't have a G4 to work with. Fortunately, Apple has a solution. The company has an AltiVec Emulator Extension that endows existing Power Macs running Mac OS 8.5 or later with the ability to execute AltiVec instructions. The Extension patches the Mac OS trap table so that it intercepts any unimplemented instruction exceptions. The Emulator parses the exception, and if it is an AltiVec instruction, the emulator executes conventional PowerPC code that performs the operations necessary to implement the instruction. The emulated AltiVec instructions will run slowly, of course, but at least you can begin writing and testing AltiVec code in preparation for the G4-based Macintoshes. To get up to speed on the programming interface, a cross-reference table that maps the C/C++ function calls to AltiVec machine instructions is available in Excel format.

In terms of compilers, there are Apple's MPW tools with AltiVec extensions, and its MrC C/C++ compiler. The Pro 5 release of Metrowerks' PowerPC tools includes a PowerPC compiler that also generates AltiVec instructions.

Conclusion

The AltiVec technology promises to transform many of the applications described at the start of this article from special purpose programs to everyday utilities. Multimedia and 3D imagery will become commonplace, and enable Mac users to push the capabilities of the machine in new directions. AltiVec will give technical applications the ability to tackle more complex tasks and demanding simulations, and serve up powerful imagery derived from these computations. The information and tools are available to you to experiment with, so feel free to check them out.

Metrowerks engineers Bob Campbell and Rommel Manuel provided valuable information and source code for the division and remainder functions in this article.

AltiVec Information and Tools

Motorola AltiVec Web site:
http://www.mot.com/SPS/PowerPC/AltiVec/
AltiVec Technology Programming Interface Manual
http://www.mot.com/SPS/PowerPC/teksupport/teklibrary/manuals/altivecpim.pdf
AltiVec Technology Programming Environments Manual
http://www.mot.com/SPS/PowerPC/teksupport/teklibrary/manuals/altivec_pem.pdf
AltiVec emulator:
ftp://ftp.apple.com/developer/Development_Kits/altivec/av_emulator.hqx
Instruction Cross-Reference table (Excel 98 format)
ftp://ftp.apple.com/developer/Development_Kits/altivec/instruction_xref_98.hqx

Tom Thompson bought his first Mac in 1984 and wrote the first PowerPC programming book, Power Macintosh Programming Starter Kit (1994, Hayden Books) using Metrowerks CodeWarrior DR1. He is now a Senior Training Specialist at Metrowerks, and can be reached at thompson@metrowerks.com.
