Float Point
Volume Number:   2

Issue Number:   7

Column Tag:   Threaded Code

A Custom Floating Point Package
By Jörg Langowski, EMBL, c/o I.L.L., Grenoble, Cedex, France, Editorial Board
"12 Kflops  Forth goes inSANE"
You might have noticed my frequent complaints about the slowness of the Apple 80bit SANE routines and the nonavailability of a reasonably fast 32 bit floating point package in Forth systems (and most other programming languages, for that matter, Fortran being the only exception).
The scaling that is necessary if you want to do rapid calculations in Forth (using integer arithmetic) is a nuisance, and it is almost impossible to implement general purpose mathematical routines without floating point. Since I needed to run this machine as fast as possible, I set out to write a set of single precision real arithmetic routines.
IEEE 32bit real numbers
The number format that we are dealing with is the IEEE standard and defined as follows:
bit 31 (highest bit): sign (0 = positive)
bits 30 23 : exponent (offset = 127)
bits 22 0 : fraction part (23 bits with the 24th bit always =1)
That is, a real number has the form
(+/)1.xxx xxx * 2yyy,
where the exponent yyy is obtained by subtracting 127 from the stored exponent; only the fraction part of the mantissa is stored because the integer part is always 1 by definition (otherwise, the fraction part will have to be shifted left and the exponent decreased until the integer part is =1; this is called normalization).
The highest exponent, 255, is reserved for special cases: with zero mantissa it designates positive/negative infinity, and with nonzero mantissa it is a 'nonumber' (Not a Number or NaN). This latter case is used to mark the results of undefined or illegal operations, such as 0/0, infinity  infinity, square root (x), etc. The type of 'non numberness' is indicated by the value of the mantissa.
Floating point arithmetic is thouroughly treated in D. Knuth's 'The Art of Computer Programming', Vol.2. Taking this excellent book as a guidance, the job to write the single precision routines becomes manageable. We'll first consider addition.
Floating point addition and subtraction
To add two floating point numbers which have the same exponent is trivial, we simply add the fraction parts. Unfortunately, in real life numbers often have different exponents. Adding such two numbers is done by shifting the fraction part of the smaller one right by the difference in exponents. Then the fractions may be added; the exponent of the result is that of the larger of the two input numbers.
This addition may generate a fraction overflow, when the new fraction part overflows into the 25th bit. In that case, the fraction has to be shifted right and the exponent increased by one. This might generate an exponent overflow; in which case we have to set the result to infinity.
When adding two numbers of opposite sign, the fractions will have to be subtracted and the resultant fraction may be significantly smaller than 24 bit precision. We have to normalize the result, shifting the fraction left until the 24th bit is equal to one again, decreasing the exponent as we go. We assume that the input to our routines always consists of normalized floating point numbers; the routines will always return normalized results as well.
Normalizing after subtraction may result in an exponent underflow. We could simply stop normalization and keep an unnormalized fraction part for very small numbers (in fact, the IEEE standard provides for this). I have chosen here to simply set the result equal to zero, since the algorithms become more complicated if one allows unnormalized small numbers.
After normalization, the fraction part is rounded. This is necessary because we actually calculate the result with a higher precision than 24 bits; in general, we will have a 'raw' 32bit result. The rounding algorithm implemented here goes as follows (bit 0 is the least significant bit of the 24 bit mantissa, bits (1) and (2) the bits below it):
 if bit (1) is equal to zero, don't change anything.
 if bits (1) and (2) are equal to one, increment fraction by one.
 if bit (1) = 1, bit (2) = 0, increment fraction if bit 0 = 1 (round to next even number).
Since rounding includes incrementing, we might generate a new fraction overflow, so we have to check for that again after rounding.
Getting the correct sign for the result is not too complicated, either; if both numbers have the same sign, the result will have that sign. If the numbers are of opposite sign, the fractions have to be subtracted instead of added. If because of the subtraction the result fraction becomes negative, the fraction and the sign of the result have to be inverted.
The result of the subtraction then is normalized and rounded as described before.
The algorithm is implemented in Listing 1 in standard 68000 assembly code of the Mach1 variety. My code assumes that registers D0 to D6 may be used freely, which is true for Mach1. If your Forth system (or any other system where you wish to implement this code) uses any of those registers, you will have to push them on the stack before you enter and pull them off again on exit. Or change register assignments.
The Mach1 data stack is maintained by A6, so the operands are pulled off the A6 stack on entry and the sum pushed there on exit. This might also be a place where changes would be necessary going to a different system. Otherwise, you can take the code as it is. In MacForth, you will have to transfer it into reverse polish notation. NEON provides a standard assembler.
Read through the listing and follow the algorithm in machine language; it is essentially the one described in Sec. 4.2.1 of Vol.2 of Donald Knuth's set of books, except that normalization is only done after subtraction.
There is some additional 'glue' which is specific to Mach1. F>S and S>F move floating point numbers from the floating stack (D7) to the data stack (A6) and back, changing between 80 and 32bit precision. S provides for subtraction; it simply changes the sign of the number on top of stack and then adds.
<<< BUG WARNING >>>
 Palo Alto Shipping, please take note 
As you see in the listing, I had to handassemble a BCHG #31,(A6) by doing a DC.L. This is because the Mach 1 assembler still doesn't like some instructions. I would recommend to all of you using this (otherwise excellent) system to disassemble routines after assembly and quickly look whether there are differences. Another instruction with which I had problems is the ANDI.W and ANDI.B type (not the ANDI.L). Further down, in the code for multiplication and division, I had to move around those problems by handcoding with DC.L.
Multiplication
The multiplication algorithm is somewhat simpler than that for addition. Its heart is a double precision multiplication of the fraction parts. A 24 by 24bit multiply yields a 48bit result, only the upper 24 bits of which are needed (actually, some more for rounding  we'll have 32 bits as an intermediate result).
Lets call the two operands u and v, their 8 most significant bits um and vm, the 16 least significant bits ul and vl. The result is then composed as follows (Fig.1):
um * vm > upper 16 bits of result;
ul * vl > lower 32 bits of (48bit) result;
um*vl + ul*vm > 24 lower bits of the upper 32bit part of result.
The final 24bit result is obtained by taking the upper rounded 24 bits of the 48bit result. This result might be unnormalized by one position (the high bit might be =0); if that's the case, it will have to be shifted left by one and the exponent adjusted accordingly. After that adjustment, the result is rounded as shown above.
The exponents are simply added in multiplication, after subtracting the excess (the offset of 127). At this point, over and underflows are detected. The correct sign is automatically attached to the product by the following trick:
Fig. 1
Before the two exponents are added, the signs are rotated into bit 0 of the registers so that the exponents are in the upper 8 bits of the word. After addition, bit 0 will be one only when the operands had opposite signs. When the exponent is then rotated back, the sign bit will be set (=negative) only in that case. Of course, the low order bits have to be masked out before exponent and fraction are put back together again.
Again, this algorithm is best understood by looking at the listing.
Division
Division is tough, since we can only divide 32 by 16 bits at a time, but we want a result that is precise to 24 bits! Ideally, we'd need a 48bit register whose upper half we fill with the fraction part of the dividend, then do a 48by 24 division to get a 24 bit result. If we have less precision, things get much more complicated. Knuth is the answer again.
Imagine we have two numbers u and v with upper and lower 16bit halves um, vm and ul, vl, which we want to divide. This can be written as
wm + 216wl = (um + 216ul)/(vm + 216vl)
= (um + 216ul)/vm * 1/(1 + 216(vl/ vm)) .
Now is is convenient to know that the second term of the product in the second line may be approximated:
1/(1 + 216(vl/ vm)) 1  216(vl/ vm) .
The way to go is now the following:
 divide u (32bit) by vm, getting a 16bit quotient w'm and 16 bits of remainder.
 shift w'm into the upper half of the result. Divide the 16 bit remainder (shifted into the upper half of a 32bit register) by vm. This gives the 16bit quotient w'l. Combine w'm and w'l to give w'.
 compute w'/vm (32bit/16 bit > 16 bit). Multiply this result by vl to get a 32bit product r', shift this product down by 16 bits to get r.
 The result of the division is w = w'  r.
The exponents have just to be subtracted in division, and correct sign is obtained by the same rotation 'trick' that was used in multiplication. Again, at this stage we have to watch for over and underflows and do the correct rounding. Note that when adjusting the fraction parts to 32 bits at the beginning, the dividend is shifted one bit less than the divisor. This makes sure that the divisor is always larger than the dividend, otherwise another check for overflows would be necessary. No accuracy is sacrificed by this procedure.
Testing the package accuracy and speed benchmarks
The end of the listing contains some benchmarks that I wrote to check out the routines. The accuracy routines are used to check the precision of the package; with accuracy3 you will see that the result deviates only in the lowest fraction bit from the 'exact' result obtained with the SANE routines. (I have not tested all cases, there may me some exceptional ones where the deviation is 2 bits.)
The speed increase is considerable (Table 1): running the speed.test and fspeed.test for comparison, you'll find that you get a speed of 12,000 floating point additions per seconds (having subtracted loop overhead), and orderofmagnitude the same speed for the other operations. This is an increase by a factor of 5 to 10 over the SANE package. Just what we wanted.
Another bug report (mine!)
As I mentioned in my last article, there might have been a problem with the CRT saver the way it was written in that column. It turns out that the program actually can crash because of collision of the local stacks and therefore the two parts of the routine have to have separate local stacks. Listing 2 shows the change that has to be made in the program. The MacTutor source disk contains the corrected version, which is to my knowledge safe to use (I'm using it now...).
See you next time!
Table 1: single and extended precision benchmarks
10000 operations (pi•e)
single add: 0.83 sec > 12.0 Kflops
single sub: 1.0 sec > 10.0 Kflops
single mul: 0.95 sec > 10.5 Kflops
single div: 1.3 sec > 7.7 Kflops
extended add: 4.47 sec > 2.2 Kflops
extended sub: 4.73 sec > 2.1 Kflops
extended mul: 7.0 sec > 1.4 Kflops
extended div: 13.35 sec > 0.75 Kflops
speedup add: 5.36
speedup sub: 4.8
speedup mul: 7.37
speedup div: 10.3
Listing 1: single precision floating point math primitives, for installation
into Forth systems
( 32 bit floating point routines )
( © 27.4.1986 J. Langowski /MacTutor )
( This code may be used freely for noncommercial
purposes to speed up the inSANEly precise and slow
Macintosh floating point package. Any inclusion in
software that is distributed commercially requires the
author's permission. )
only forth also assembler also sane
CODE S>F
PEA (A6)
SUBI.L #$A,D7
ADDQ.L #$4,A6
MOVE.L D7,(A7)
MOVE.W #$100E,(A7)
_Pack4
RTS
ENDCODE
CODE F>S
MOVE.L D7,(A7)
ADDI.L #$A,D7
SUBQ.L #$4,A6
PEA (A6)
MOVE.W #$1010,(A7)
_Pack4
RTS
ENDCODE
CODE S+ ( 32 bit FP add )
MOVE.L (A6)+,D1
BEQ @1
MOVE.L (A6)+,D0
BNE @add.main
MOVE.L D1,(A6)
@1 RTS
@add.main
MOVE.L D0,D2
( number in D0 will be called u )
MOVE.L D1,D3
( and in D1 v )
LSL.L #8,D0
LSL.L #8,D1
BSET #31,D0
BSET #31,D1
( now D0 and D1 contain the fraction parts,
shifted into the upper part of the long word )
SWAP.W D2
SWAP.W D3
LSR.W #7,D2
LSR.W #7,D3
( D2 and D3 contain exponent plus sign )
MOVE.B D2,D4
SUB.B D3,D4
( D4 contains the difference in exponent )
BEQ @expo.same
BHI @check.expo
( if vexponent > uexponent, exchange u and v )
NEG.B D4
EXG D0,D1
EXG D2,D3
@check.expo
CMPI.B #25,D4
( if difference in exponents > precision, just return u )
BHI @backtogether
LSR.L D4,D1 ( shift smaller fraction )
@expo.same
EOR.W D2,D3 ( test if signs are equal )
BTST #8,D3
BNE @subtract ( if not, have to subtract )
ADD.L D1,D0
BCC @round
( fraction overflow )
ROXR.L #1,D0
ADDQ.B #1,D2
BEQ @ovfl
@round BTST #7,D0
BEQ @backtogether
BTST #6,D0
BNE @incr
BTST #8,D0
BEQ @backtogether
@incr ADDI.L #$80,D0
BCC @backtogether
( rounding overflow )
ROXR.L #1,D0
ADDQ.B #1,D2
BEQ @ovfl
@backtogether
( reconstruct number from mantissa and exponent )
LSR.L #8,D0
BCLR #23,D0
LSL.W #7,D2
SWAP.W D2
CLR.W D2
OR.L D2,D0
@end MOVE.L D0,(A6)
RTS
@subtract
SUB.L D1,D0
BEQ @end
BCC @normal
( fraction v was larger than fraction u, change sign )
BCHG #8,D2
NEG.L D0 ( make mantissa positive )
@normal BMI @round
SUBQ.B #1,D2
BEQ @small ( expo underflow )
ADD.L D0,D0 ( faster than LSL.L #1,D0 )
BRA @normal
@small MOVEQ.L #0,D0
BRA @end
@ovfl MOVE.L #$7F800000,(A6)
RTS
ENDCODE
CODE S ( 32 bit FP subtract )
( BCHG #$1F,[A6] )
DC.L $0856001F
S+
RTS
ENDCODE
CODE S* ( 32 bit single precision multiply)
MOVE.L (A6)+,D1
BEQ @makezero
MOVE.L (A6)+,D0
BEQ @end
( D0,D1 will be used for lower 16 bits of mantissa.
D2,D3 for exponent )
MOVE.L D0,D2
MOVE.L D1,D3
SWAP.W D2
SWAP.W D3
( get rid of junk in D4,D5 )
CLR.W D4
CLR.W D5
( move most significant 7 mantissa bits to D4,D5
and set implied highest bit = 1 )
MOVE.B D2,D4
MOVE.B D3,D5
BSET #7,D4
BSET #7,D5
( isolate exponent + sign in D2,D3 )
( ANDI.W #$FF80,D2 )
DC.L $0242FF80
( ANDI.W #$FF80,D3 )
DC.L $0243FF80
( rotate sign into lowest bit D2,D3 )
ROL.W #1,D2
ROL.W #1,D3
( subtract exponent offset )
SUBI.W #$7F00,D2
SUBI.W #$7F00,D3
( sum exponents, check for over or underflow )
ADD.W D2,D3
BVS @ovflchk
( now do 24*24 bit multiplication of mantissa )
MOVE.W D4,D2 ( uhi > D2 )
MULU.W D1,D2 ( uhi * vlo > D2 )
MULU.W D0,D1 ( ulo * vlo > D1 )
MULU.W D5,D0 ( ulo * vhi > D0 )
MULU.W D4,D5 ( uhi * vhi > D5 )
ADD.L D2,D0 ( uhi*vlo + ulo*vhi > D0 )
MOVE.W D5,D1
( uhi*vhi > LSW[D1],
MSW unchanged , contains MSW of ulo*vlo )
SWAP.W D1
( put LSW and MSW in correct order )
ADD.L D1,D0 ( put it all together )
( highest mantissa bit might have changed to one )
BPL @nohibit
ADDI.W #$100,D3
BVC @round
BRA @ovflchk
@nohibit ADD.L D0,D0 ( if hi bit =0, make it =1)
@round BTST #7,D0
BEQ @blk.exp
BTST #6,D0
BNE @incr
BTST #8,D0
BEQ @blk.exp
@incr ADDI.L #$80,D0
BCC @blk.exp
ADDI.W #$100,D3
BVC @blk.exp
@ovflchk BPL @makezero
MOVE.L #$7F800000,(A6)
RTS
@makezero MOVEQ.L #0,D0
BRA @end
( readjust exponent )
@blk.exp ADDI.W #$7F00,D3
BLE @makezero
ROR.W #1,D3
( ANDI.W #$FF80,D3 )
DC.L $0243FF80
LSR.L #8,D0
BCLR #23,D0
SWAP.W D3
CLR.W D3
OR.L D3,D0
@end MOVE.L D0,(A6)
RTS
ENDCODE
CODE S/ ( 32 bit FP divide )
MOVE.L (A6)+,D1
BEQ @byzero ( divisor = 0)
MOVE.L (A6)+,D0
BEQ @end ( dividend = 0)
MOVE.L D0,D3
SWAP.W D0
MOVE.W D0,D5
MOVE.L D1,D6
SWAP.W D6
( exponents in D5,D6 )
MOVE.W #$100,D4 ( exponent adjust )
( isolate mantissa in D3,D1 and shift up )
ANDI.L #$007FFFFF,D3
ANDI.L #$007FFFFF,D1
BSET #$17,D3
BSET #$17,D1
LSL.L #7,D3
LSL.L #8,D1
( divide 32 by 32 bits )
MOVE.L D1,D2
SWAP.W D2 ( vhi )
DIVU.W D2,D3 ( [u/vhi]hi)
MOVE.L D3,D0
CLR.W D0 ( remainder )
DIVU.W D2,D0 ( [u/vhi]lo)
SWAP.W D3
MOVE.W D0,D3
MOVE.L D3,D0 ( u/vhi = w)
DIVU.W D2,D3 ( w/vhi )
MULU.W D3,D1 ( w*vlo/vhi)
CLR.W D1
SWAP.W D1 ( lo 16 bits )
SUB.L D1,D0 ( final 32 bit result )
LSR.L #6,D0 ( adjust )
( round result, as before)
BTST #1,D0
BEQ @rounded
BTST #0,D0
BNE @incr
BTST #2,D0
BEQ @rounded
@incr ADDQ.L #4,D0
@rounded BTST #25,D0
( account for zero hi order bit )
BNE @hifrac
CLR.W D4
LSR.L #1,D0
BRA @expo
@hifrac LSR.L #2,D0
( subtract exponents )
@expo ROL.W #1,D5
ROL.W #1,D6
BSET #1,D5
( ANDI.B #1,D6 )
DC.L $02060001
SUBI.W #$7F00,D5
SUBI.W #$7F00,D6
SUB.W D6,D5
BVS @exp.ovfl
ADDI.W #$7E00,D5
ADD.W D4,D5
ROR.W #1,D5
( ANDI.W #$FF80,D5 )
DC.L $0245FF80
BCLR #$17,D0
SWAP.W D5
CLR.W D5
OR.L D5,D0
@end MOVE.L D0,(A6)
RTS
@byzero MOVE.L (A6)+,D0
BNE @ovfl
MOVE.L #$7F800400,(A6)
RTS
@ovfl MOVE.L #$7F800000,(A6)
RTS
@exp.ovfl BPL @makezero
MOVE.L #$7F800000,(A6)
RTS
@makezero MOVEQ.L #0,D0
BRA @end
ENDCODE
( accuracy and benchmark tests)
decimal
fp 9 float
: s. s>f f. ;
1.0 f>s constant one
10. f>s constant ten
pi f>s constant pi.s
2.718281828 f>s constant eu
1.0 fconstant fone
10. fconstant ften
: accuracy
one dup
20 0 do
ten s/ 2dup s+
s. cr
loop
drop drop
;
: accuracy2
one dup
20 0 do
ten dup s. s* 2dup s. s. 2dup s+
s. cr
loop
drop drop
;
: accuracy3
one fone
1000 0 do
dup fdup
i .
i i>f ften f/ fdup f>s
f/ s/ dup u. dup s. f>s dup u. dup s.
s>f s>f f f.
cr
loop
;
: bmark1
counter 10000 0 do pi.s eu drop drop loop timer ;
: bmark2
counter 10000 0 do pi.s eu s+ drop loop timer ;
: bmark3
counter 10000 0 do pi.s eu s drop loop timer ;
: bmark4
counter 10000 0 do pi.s eu s* drop loop timer ;
: bmark5
counter 10000 0 do pi.s eu s/ drop loop timer ;
: speed.test cr
bmark1 cr
bmark2 cr
bmark3 cr
bmark4 cr
bmark5 cr
;
: fmark1 pi 2.718281828
counter 10000 0 do fover fover fdrop fdrop loop
timer fdrop fdrop ;
: fmark2 pi 2.718281828
counter 10000 0 do fover fover f+ fdrop loop
timer fdrop fdrop ;
: fmark3 pi 2.718281828
counter 10000 0 do fover fover f fdrop loop
timer fdrop fdrop ;
: fmark4 pi 2.718281828
counter 10000 0 do fover fover f* fdrop loop
timer fdrop fdrop ;
: fmark5 pi 2.718281828
counter 10000 0 do fover fover f/ fdrop loop
timer fdrop fdrop ;
: fspeed.test cr
fmark1 cr
fmark2 cr
fmark3 cr
fmark4 cr
fmark5 cr
;
Listing 2: Correction of CRT saver routine, MT V2#6
( This code has to be inserted after the START header. References to
setup.local.stack in the two routines have to be changed to setup.GNE.stack
and setup.VBL.stack, respectively )
decimal
header START
( *** we need two local stacks after the relocation *** )
header GNE.stack 100 allot
CODE setup.GNE.stack
LEA 8(PC),A6
( GNE stack grows downward from here )
RTS
ENDCODE
header VBL.stack 100 allot
CODE setup.VBL.stack
LEA 8(PC),A6
( VBL stack grows downward from here )
RTS
ENDCODE