Fixed-Point Math Speed
Volume Number: 13 (1997)
Issue Number: 11
Column Tag: Programming Techniques
Getting a Speed Boost with Fixed-Point Math
by Daniel W. Rickey, Manitoba Cancer Treatment and Research Foundation
A little assembly goes a long way to improving code execution speed
Motivation
The recent releases of the Metrowerks compilers allow Pascal programmers to include assembly language in their programs as functions and procedures. A built-in assembler is a very powerful tool because many of the features of the Pascal environment are available in the assembler. For example, compiler directives enable a routine to contain both 68k and PowerPC assembly code. In addition, the integrated environment provides seamless source-level debugging of the assembly and Pascal code. (C programmers can gain the same benefits, but they must write complete functions to use the built-in assembler.)
One of the biggest advantages of assembler is that it allows the programmer to use processor instructions that are difficult or impossible to access from a high-level language. Because of this ability, the programmer may have the option of using a more efficient algorithm when coding in assembler. For these reasons, the assembler is an excellent tool for optimizing time-critical routines. This article shows how the built-in assembler can be used to optimize some simple fixed-point math routines.
Fixed-Point Math
In many imaging applications, such as texture mapping or ray casting of three-dimensional medical data sets (Herman 1993), approximately 105 to 106 pixel values are mapped to the screen. For interactive viewing, we need to calculate at least 105 position and intensity values each second. These calculations can be performed with floating-point math and the resulting values rounded to integer format for display. Unfortunately, the round function is very slow, particularly on 68k machines lacking a 68882 co-processor, e.g. machines using a 68LC040. In fact, all floating-point operations on machines lacking a co-processor are very slow. Although the PowerPC processors offer excellent floating-point performance they are relatively slow at rounding because this operation requires a move to and from main memory. To overcome these speed limitations, programmers can use fixed-point calculations instead of floating-point operations. On a 68k processor, rounding a fixed-point number can be 50 times faster than rounding a floating-point number. If a co-processor is not present, other operations, such as multiplication, will be much faster using fixed-point math. On PowerPC-based machines the fixed-point round is about 3 times faster than a floating-point round. The speed advantage gives strong motivation to use fixed-point math.
Fixed-point math is surprisingly easy to use. On Macintosh a fixed-point number is 32 bits long, that is Fixed = LongInt;. The low-order 16 bits contain the fractional part of the number and the high-order 16 bits contain the integer portion of the number. Addition and subtraction are handled as they normally would be: by simply adding and subtracting the numbers. Multiplication, division and type conversions must be performed using function calls. The Toolbox functions for performing these operations are:
FUNCTION FixMul(a, b: Fixed): Fixed;
FUNCTION FixDiv(a, b: Fixed): Fixed;
FUNCTION Fix2X(x: Fixed): double_t;
FUNCTION X2Fix(x: double_t): Fixed;
FUNCTION Long2Fix(x: LongInt): Fixed;
FUNCTION Fix2Long(x: Fixed): LongInt;
FUNCTION FixRound(x: Fixed): Integer;
Because the extended variable type is not supported on a PowerPC, the original Fix2X and X2Fix functions were updated to replace the Extended type with double_t, which is equal to a Double on a PowerPC and an Extended on a 68k processor.
Goals
The primary motivation for using fixed-point calculations is speed. Unfortunately, neither the 68k nor PowerPC Toolbox routines are well optimized. In fact, these routines were not PowerPC native until system 7.5.3. In this article we will concentrate on optimizations of the multiply, divide, and round functions as they are used most often. For this work we will use the built-in assembler, available in the Metrowerks Pascal Compiler, to write 68k and PowerPC code.
In addition to being fast, we require our functions to be well behaved. Because fixed-point numbers must lie between 32767.9999 and -32768, there exists a possibility of overflow during multiplication and division. If an overflow occurs, the function returns a large value with the appropriate sign. In the case of multiplication, an overflow condition behaves similar to the Toolbox routine: $7FFE0000 is returned when both arguments have the same sign and -$7FFE0000 is returned when the arguments have different signs. The largest possible fixed-point value is not returned as this might cause an overflow in the calling code. Note that these examples differ from the 68k assembler code of Lebedev (1994), which did not deal with overflow conditions.
Fixed-Point Round
The algorithm for rounding a fixed-point number is very simple: one half is added to the number and then the result is bit-shifted right by 16 bits. Shown in Listing 1 is the function for rounding a fixed-point number. A normal function declaration is used followed by the asm directive. Compiler directives are used to determine whether 680x0 or PowerPC instructions are used. Note that when compiled for a 68000-based machine, the Toolbox routine is used. This is because we make use of instructions that are available only on 68020 and newer processors.
680x0
Those unfortunate souls who program in C should note that the conventions for passing parameters to and from a 68k function differ from the Pascal code shown here. The function begins with the fralloc directive, which creates a stack frame using link a6, #$0. Once x is loaded into register d0, it is added to $7FFF, which, to maintain consistency with the PowerPC code shown below, is a tad less than one half ($8000). An algebraic shift is used to sign extend the result. Note that the "immediate" form of the asr bit-shift instruction can only shift a maximum of 8 bits, thus two instructions are required. The result is placed on the stack in the location allocated by the caller, i.e. $C(a6). Note that (a6) is the previous value of a6; $4(a6) is the return address; $8(a6) is the value of x; and $C(a6) is the space allocated for the returned value. The frfree directive simply undoes what fralloc did: it emits unlk a6. After executing this instruction, a6 will contain the previous value of a6 and a7 will point to the return address. Finally the function executes the rtd instruction to return to the caller by loading the return address into the program counter. This instruction also causes the stack pointer to point to the returned value thereby de-allocating the stack space originally allocated for x.
PowerPC
Compared to the 680x0, assembly language on the PowerPC processor is much more civilized. For example, we don't have to sort out the gruesome details of address modes and stack pointers because parameters are passed in registers and the same conventions are used for Pascal and C. In addition, there are only a few simple addressing modes compared to the fourteen available on the 680x0.
As described by Evans (1996) and Kacmarcik (1995), volatile registers r3 through r10 are used to pass parameters to the function. The PowerPC implementation of the MyFixRound function illustrates this: x is passed to the function in register r3 and the function result is returned in register r3. The PowerPC assembler differs from the 68k assembler because a stack frame is not required as long as the number of local variables is kept small. The first instruction addic (add immediate with carry) adds the sign-extended 16-bit $7FFF to r3 and places the result in r3. Because of the sign extension, we cannot add the exact one-half value of $8000. However, this shortcoming should not make any difference to real-world applications. The srawi (shift right algebraic word immediate) instruction is used to perform an algebraic shift right by 16 bits. The standard blr (branch to link register) instruction ends the function. (The link register was loaded with the return address when the function was called.) It should be noted that in PowerPC assembly language there are many extended mnemonics that the assembler converts into valid instructions. However, the Metrowerks compiler may not recognize all of them.
Listing 1: MyFixRound
Function MyFixRound(x :Fixed): LongInt;
{$IFC NOT POWERPC} {$IFC NOT OPTION(MC68020)}
INLINE $A840; {on a 68000 use the Toolbox routine}
{$ENDC} {$ENDC}
Function MyFixRound(x :Fixed): LongInt;
asm;
Begin
{$IFC OPTION(MC68020)}
fralloc {create an a6 stack frame}
move.l x, d0; {move x into register d0}
add.l #$7FFF, d0; {add 1/2 to x}
asr.l #$08, d0; {shift 32-bit word to right by 8 bits}
asr.l #$08, d0; {shift 32-bit word to right by 8 bits}
move.l d0, $000C(a6); {place the result on the stack}
frfree; {free the a6 stack frame}
rtd #$0004; {return and de-allocate parameter}
{$ENDC}
{$IFC POWERPC}
addic r3, r3, $7FFF; {add 1/2 to x}
srawi r3, r3, 16; {shift word to right by 16 bits}
blr; {return; results are returned in r3}
{$ENDC}
End; {MyFixRound}
Fixed-Point multiply
The algorithm for multiplying two fixed-point values is also simple: multiply the two values together and then bit-shift the result to the right by 16 bits. Shown in Listing 2 is the function for multiplying two fixed-point numbers. The basic structures of the 680x0 and PowerPC functions are similar to those shown in Listing 1.
680x0
This function illustrates the use of a processor instruction (muls.l) that is not available from Pascal. The muls.l instruction multiplies two signed 32-bit numbers together to give a 64-bit result, which is contained in two registers. After performing the multiplication, the routine ensures that the result is smaller than the largest acceptable fixed-point value, i.e. the high word of the result must be smaller than $7FFF. If the result is within range, the bit shift right is performed. Because the result is contained in two registers, the bit-shift operation requires a couple of steps: 1) d1 is shifted right by 16 bits; 2) d2 is algebraically shifted left by 16 bits; and, 3) the high order two bytes of d2 are moved into d1. Pairs of instructions are used to perform the 16-bit shifts. In the case of an overflow or out-of-range result, the function returns a large number whose sign depends on the signs of the two arguments. Note that because two parameters are passed to this function, the result is placed in a different location on the stack than for the MyFixRound function, i.e. $10(a6) versus $0C(a6). In addition, the rtd instruction removes two four-byte parameters from the stack.
PowerPC
Performing a 32-bit by 32-bit multiply on PowerPC processor requires two instructions: mulhw calculates the high-order 32 bits, and mullw the low-order 32 bits. After performing the multiplication, the cmpwi instruction is used to check if the result is smaller than the largest acceptable fixed-point value. Note that cmpwi is an extended form of the cmpi (compare immediate) instruction. If the result is in range, a technique similar to the 680x0 example is used to combine the results stored in registers r5 and r6. However, only two instructions are needed on the PowerPC: the rlwinm (rotate left word immediate then AND with mask) instruction is used to bit shift the low-order 32 bits of r5 and the rlwimi (rotate left word immediate then mask insert) is used to move the lower 16 bits of r6 into the top 16 bits of r3. Out-of-range values are handled as in the 68k code.
Listing 2: MyFixMul
Function MyFixMul(x, y :Fixed):Fixed;
{$IFC NOT POWERPC} {$IFC NOT OPTION(MC68020)}
INLINE $A868; {on a 68000 use the Toolbox routine}
{$ENDC} {$ENDC}
Function MyFixMul(x, y :Fixed):Fixed;
asm;
Begin
{$IFC OPTION(MC68020)}
fralloc {create an a6 stack frame}
move.l x, d0; {move x and y values into registers}
move.l y, d1;
muls.l d0, d2:d1; {do the multiplication d2:d1 := d0*d1}
cmpi.l #$00007FFF, d2; {check if the result is larger than the}
bge Overflow; {maximum allowed for a fixed-point number}
cmpi.l #-$00007FFF, d2; {check if the result is less than the}
ble Overflow; {minimum allowed for a fixed-point number}
lsr.l #08, d1; {want the high-order 2 bytes of d1; shift right 16 bits}
lsr.l #08, d1; {can only bit shift a maximum of 8 bits}
asl.l #08, d2; {want the low-order 2 bytes of d2; shift left 16 bits}
asl.l #08, d2; {use an algebraic shift to maintain the sign bit}
move.w d1, d2; {combine d2 and d1}
bra Finished;
Overflow:
move.l #$7FFE0000, d2; {return a large number}
tst.l x; {compare x to zero to determine if it is negative}
bge CheckY; {if positive jump to CheckY:}
neg.l d2; {multiply the large number by -1}
CheckY:
tst.l y; {compare y to zero to determine if it is negative}
bge Finished; {if positive jump to Finished:}
neg.l d2; {multiply the large number by -1}
Finished:
move.l d2, $0010(a6); {place the result d2 on the stack}
frfree; {free the a6 stack frame}
rtd #$0008; {return and de-allocate parameters}
{$ENDC} {68020}
{$IFC POWERPC}
mulhw r6, r3, r4; {find x * y and place high-order 32 bits into r6}
mullw r5, r3, r4; {find x * y and place low-order 32 bits into r5}
cmpwi r6, $7FFF; {check if high-order value is larger than the}
bge OverFlow; {maximum allowed for a fixed-point number}
cmpwi r6, -$7FFF; {check if high-order value is less than the}}
ble OverFlow; {minimum allowed for a fixed-point number}
rlwinm r3, r5, 16, 0, 31; {shift r5 right by 16 bits}
rlwimi r3, r6, 16, 0, 15; {move low 16 bits of r6 into top 16 bits of r3}
blr; {return; result is returned in r3}
OverFlow: {have an overflow condition}
lis r5, $7FFE; {load r5 with $7FFE0000}
cmpwi r3, $0; {compare to 0 to determine if x is negative}
bge CheckY; {if positive then go on to check sign of y}
neg r5, r5; {multiply r5 by -1}
CheckY:
cmpwi r4, $0; {compare to 0 to determine if y is negative}
bge Finished; {we're done}
neg r5, r5; {multiply r5 by -1}
Finished:
mr r3, r5; {copy value from r5 into r3}
blr; {return; results are returned in r3}
{$ENDC} {POWERPC}
End; {MyFixMul}
Fixed-Point DIVIDE
The algorithm for dividing two fixed-point numbers is to bit shift the numerator to the left by 16 bits and then perform the division. Shown in Listing 3 is the function for dividing two fixed-point numbers. The basic structures of the 680x0 and PowerPC functions are similar to those shown in Listing 2. However, two function interfaces are used, one for PowerPC and the other for 680x0.
680x0
This function also uses a processor instruction (divs.l) that is not available from Pascal. The divs.l instruction divides a 64-bit value by a 32-bit value to give a 32-bit result. Before the division is performed, two things are done: 1) the denominator is checked for a divide-by-zero case; and 2) the numerator is split with the low-order 16 bits placed into the high word of d0 and the high-order 16 bits placed into the low word of d1. An algebraic shift is used to maintain the sign bits in d1. Once the division is complete, the bvc instruction is used to check for an overflow condition, in which case a large positive or negative number is returned. The result is placed on the stack and the function exited in the same manner as for MyFixMul.
PowerPC
Unfortunately, the PowerPC instruction set does not provide an integer 64 by 32-bit divide. Because of this limitation, I chose to use the floating-point unit to perform the calculations. Notice that the function parameters are declared as Real instead of Fixed. Instead of bit-shifting the numerator, we perform the equivalent operation of multiplying by 65536. The Metrowerks assembler allows a real number to be specified directly in the assembler code as is illustrated by loading fp4 with 65536. The division is then performed with the fdivs instruction. Notice that we use single-precision multiplication and division instructions, which are faster than their double-precision counterparts. The function then uses fctiw to convert the real number to an integer word, which is stored in the low-order word of fp3. This instruction also takes care of any out-of-range conditions for us. Unfortunately there is no instruction for transferring data between general-purpose and floating-point registers. Thus we are forced to store the contents of fp3 in a temporary variable in memory. The integer value can then be loaded into register r3 before the function is exited.
Listing 3: MyFixDiv
{$IFC POWERPC}
Function MyFixDiv(x, y: Real): Fixed;
{$ELSEC}
Function MyFixDiv(x, y: Fixed): Fixed;
{$ENDC}
{$IFC NOT POWERPC} {$IFC NOT OPTION(MC68020)}
INLINE $A84D; {on 68000 use Toolbox routine}
{$ENDC} {$ENDC}
{$IFC OPTION(MC68020)}
Function MyFixDiv(x, y: Fixed): Fixed;
asm;
Begin
fralloc; {create an a6 stack frame}
move.l y, d2; {d2 contains denominator}
move.l x, d0; {d0 contains numerator}
tst.l d2; {compare denominator to zero}
beq Overflow; {division by zero gives an overflow}
DoDivision:
move.l d0, d1; {d1 to contain upper 16 bits of numerator}
asr.l #08, d1; {algebraic shift right 16 so the low word of d1}
asr.l #08, d1; {contains the upper 16 bits of the numerator}
lsl.l #08, d0; {bit shift left 16 so the high word of d0 contains}
lsl.l #08, d0; {the lower 16 bits of the numerator}
divs.l d2, d1:d0; {d0 := d1:d0/d2; perform a 64 by 32-bit division}
bvc Finished; {check for overflow; go to Finished: if there was none}
Overflow:
move.l #$7FFE0000, d0; {return a large number}
tst.l x; {compare x to zero to check if it is negative}
bge CheckY; {if positive jump to CheckY:}
neg.l d0; {make the large number negative}
CheckY:
tst.l y; {compare y to zero to determine if it is negative}
bge Finished; {if positive jump to Finished:}
neg.l d0; {multiply the large number by -1}
Finished:
move.l d0, $0010(a6); {place the result d0 on the stack}
frfree; {free the a6 stack frame}
rtd #$0008; {return and de-allocate parameters}
End;
{$ENDC} {68020}
{$IFC POWERPC}
Function MyFixDiv(x, y: Real): Fixed;
asm;
Var
tempStorage :Record
highLong :LongInt;
lowLong :LongInt;
End;
Begin
lfs fp4, 65536.0; {load 2^16 into fp4}
fmuls fp3, fp1, fp4; {multiply numerator by 2^16}
fdivs fp3, fp3, fp2; {do the division}
fctiw fp3, fp3; {convert to an integer}
stfd fp3, tempStorage; {store result in temp memory}
lwz r3, tempStorage.lowLong; {load 4-byte integer into r3}
blr; {return; result is returned in r3}
End;
{$ENDC}
Benchmarks
To test the speed increase given by these functions, operations were performed on 4096-element arrays (see TestFixedPointMath.p for the details). For these measurements, the routines were compared to the Toolbox routines on a PowerPC 7200/90 with an L2 cache and an LC 475 with a 33 MHz 68LC040. Each assembly routine and its Toolbox counterpart was executed 4096000 times and the elapsed time determined using TickCount. The elapsed times ranged from 1.17 seconds and up. From the speed increases given in Table 1, it can be seen that even with the PowerPC-native fixed-point routines built into System 7.5.3, our assembly language routines deliver a respectable increase in performance. As expected, our routines are much faster than the non-native routines in System 7.5.2. In addition, our 68k routines are substantially faster than the Toolbox routines in emulation mode or on the genuine 68LC040 processor.
Speed Increase (%)
Fixed Round Fixed Multiply Fixed Divide
7200/90
System 7.5.3 56 18 49
7200/90
System 7.5.2 49 392 760
7200/90
68k emulation 159 71 130
LC 475
System 7.5.3 178 137 369
Table 1. Percentage speed increase obtained using the assembly language routines. A 100% speed increase means that our routine ran in half the time required for the Toolbox routine.
Discussion
In this article, I have shown that assembly-language routines are faster than the Toolbox routines. However, as described by Kacmarcik (1995), the actual increase in performance on PowerPC machines will depend on how the instructions are scheduled and how the data interact with the processor and L2 caches. The multiply and divide instructions use the majority of time spent in the routine, thus the removal or addition of a few fast instructions should not make much difference. The division function on the PowerPC used floating-point instructions and incorporated a relatively slow floating-point round. Although it might be advantageous to implement an integer algorithm, division is always slow compared to addition and multiplication and should be avoided. In the future, a significant increase in performance should be realized when the 64-bit PowerPC 620 processor is used in the Macintosh and the 64-bit multiply and divide instructions are performed with one integer instruction.
It is hoped that this article illustrates that a built-in assembler is an excellent tool for optimizing time-critical, bottle-neck routines. Although the functions shown here are unlikely to be the fastest implementations, they illustrate the capabilities and advantages of including assembler routines directly into Pascal programs. For example, the functions used some instructions that are not accessible from Pascal. Therefore, the ease with which assembler is included and debugged in a modern programming environment makes us question the philosophy that assembler code should only be used as a last resort.
Bibliography and References
- Apple Computer. "Mathematical and Logical Utilities". Inside Macintosh: Operating System Utilities, pp. 3-11-3-49. Addison Wesley, 1994.
- Evans, Dave. "Balance of Power: Introducing PowerPC Assembly Language." develop, pp. 23-28, Issue 21 March 1995.
- Herman, Gabor T. "3D Display: A Survey From Theory to Applications". Computerized Medical Imaging and Graphics pp. 231-242, Volume 17, Issue 4/5 1993.
- Hoxey, Steve, Faraydon Karim, Bill Hay and Hank Warren (editors) The PowerPC Compiler Writer's Guide. Wartham Associates, 1995 ISBN 0-9649654-0-2.
- Kacmarcik, Gary. Optimizing PowerPC Code: Programming the PowerPC Chip in Assembly Language. Addison Wesley, 1995 ISBN 0-201-40839-2.
- Lebedev, Alexei. "Fixed Point Math For Speed Freaks". MacTech Magazine (formerly MacTutor) 10:3 March 1994.
- Motorola. MOTOROLA M68000 FAMILY Programmer's Reference Manual. 1992. (Available online from Motorola High Performance Embedded Systems Division, http://pirs.aus.sps.mot.com/docs/pubs).
- PowerPC Microprocessor Family: The Programming Environments., MPCFPE/AD (Motorola Order Number) and MPRPPCFPE-01 (IBM Order Number, available online from IBM Microelectronics, http://www.chips.ibm.com).
- PowerPC 601 RISC Microprocessor User's Manual, Rev 1 MPC601UM/AD (Motorola Order Number) and 52G7484/(MPR601UMU-02) (IBM Order Number).
Daniel W. Rickey is a medical imaging physicist with a background in Doppler ultrasound. He started programming a shareware 3-D image display program (MacCubeView) on a Mac Plus and has since decided that a PowerMac is faster. He can be reached at the Department of Medical Physics, 100 Olivia St., Winnipeg, Manitoba, R3E 0V9 Canada. daniel@kaon.mctrf.mb.ca.