TweetFollow Us on Twitter

Fast Square Root Calc

Volume Number: 14 (1998)
Issue Number: 1
Column Tag: Assembler Workshop

Fast Square Root Calculation

by Guillaume Bédard, Frédéric Leblanc, Yohan Plourde and Pierre Marchand, Québec, Canada

Optimizing PowerPC Assembler Code to beat the Toolbox

Introduction

The calculation of the square root of a floating-point number is a frequently encountered task. However, the PowerPC processors don't have a square root instruction. The implementation presented here performs the square root of a double-precision number over the full range of representation of the IEEE 754 standard for normalized numbers (from 2.22507385851E-308 to 1.79769313486E308) with an accuracy of 15 or more decimal digits. It is very fast, at least six times faster than the Toolbox ROM call.

Theory of Operation

A floating point number has three components: a sign, a mantissa and an exponential part. For example, the number +3.5 x 10^4 (35 000) has a plus sign, a mantissa of 3.5 and an exponential part of 104. The mantissa consists of an integer part and a fractional part f.

A double precision number in IEEE 754 format has the same components: a sign bit s, an 11-bit exponent e and a 52 bit fraction f. The exponential part is expressed in powers of 2 and the exponent is biased by adding 1023 to the value of e. The mantissa is normalized to be of the form 1.f. Since the integer part of the normalized mantissa is always 1, it doesn't have to be included in the representation. The number is thus represented as follows: (-1)s x 1.f x 2^e+1023.

For example, the number 5.0 can be expressed in binary as 101.0, which means 101.0 x 2^0, which in turn is equal to 1.010 x 2^2, obtained by dividing the mantissa by 4 and multiplying 2^0 by 4. Therefore, the normalized mantissa is 1.010 and the exponent 2. Fraction f is then .01000000.... The biased exponent is obtained by adding 1023 to e and is 1025, or 10000000001 in binary. The double precision IEEE representation of 5.0 is finally:

-----------------------------------------------------------------------------
|  0 | 10000000001 | 01000000000000000000000000000000000000000000000000000  |
-----------------------------------------------------------------------------

or 4014000000000000 in hexadecimal notation for short.

First Approximation

Given this representation, a first approximation to the square root of a number is obtained by dividing the exponent by 2. If the number is an even power of 2 such as 16 or 64, the exact root is obtained. If the number is an odd power of 2 such as 8 or 32, 1/SQRT(2) times the square root is obtained. In general, the result will be within a factor SQRT(2) of the true value.

Refining the Approximation

The Newton-Raphson method is often used to obtain a more accurate value for the root x of a function f(x) once an initial approximation x0 is given:

[1]

This becomes, in the case of the square root of n, = x2 - n:where f(x)

[2]

An excellent approximation to the square root starting with the initial approximation given above is obtained within 5 iterations using equation [2]. This algorithm is already pretty fast, but its speed is limited by the fact that each iteration requires a double-precision division which is the slowest PowerPC floating-point instruction with 32 cycles on the MPC601 (Motorola, 1993).

Eliminating Divisions

Another approach is to use equation [1] with the function.

In this case, equation [1] becomes:

[3]

There is still a division by n, but since n is constant (it's the original number whose root we want to find), it can be replaced by multiplying by 1/n, which can be calculated once before the beginning of the iteration process. The five 32-cycle divisions are thus replaced by this single division followed by 5 much faster multiplications (5 cycles each). This approach is approximately three times faster than the preceding one. However, care must be taken for large numbers since the term in x02 can cause the operation to overflow.

Use of a table

Finally, an approach that is even faster consists in using a table to obtain a more accurate first approximation. In order to do so, the range of possible values of fraction f (0 to ~1) is divided into 16 sub-ranges by using the first 4 bits of f as an index into a table which contains the first two coefficients of the Taylor expansion of the square root of the mantissa (1.0 to ~2) over that sub-range.

The Taylor expansion is given in general by:

[4]

the first two terms of which yield, in the case where f(x) = SQRT(x):

[5]

The square root of x is thus approximated by 16 straight-line segments. The table therefore contains the values of

A =

and B =

for each of the 16 sub-ranges as shown in Figure 1. This first approximation gives an accuracy of about 1.5 %.

Figure 1. Approximation by straight line segment.

To reach the desired accuracy of 15 digits, equation [2] is applied twice to the result of equation [5]. To avoid having to perform two divisions by repeating the iteration, the two iterations are folded together as follows, which contains only one division:

and

[6]

In order to perform these calculation, the exponent of x and n is reduced to -1 (1022 biased), so that floating-point operations apply only to the values of the mantissa and don't overflow if the exponent is very large. The value of these numbers will therefore be in the range 0.5 to 1.0 since the mantissa is in the range 1.0 to 2.0. If the original exponent was odd, the mantissa is multiplied by SQRT(2) before applying equation [6].

Finally, the original exponent divided by two is restored at the end.

The Code

The SQRoot function shown in Listing 1 has been implemented in CodeWarrior C/C++ version 10.

Listing 1: SQRoot.c

// On entry, fp1 contains a positive number between 2.22507385851E-308
// and 1.79769313486E308. On exit, the result is in fp1.

asm long double SQRoot(long double num);   // prototype

float Table[35] = {
0.353553390593, 0.707106781187, 0.364434493428, 0.685994340570,
0.375000000000, 0.666666666667, 0.385275875186, 0.648885684523,
0.395284707521, 0.632455532034, 0.405046293650, 0.617213399848,
0.414578098794, 0.603022689156, 0.423895623945, 0.589767824620,
0.433012701892, 0.577350269190, 0.441941738242, 0.565685424949,
0.450693909433, 0.554700196225, 0.459279326772, 0.544331053952,
0.467707173347, 0.534522483825, 0.475985819116, 0.525225731439,
0.484122918276, 0.516397779494, 0.492125492126, 0.508000508001,
1.414213562373, 0.000000000000, 0.000000000000 };

asm long double Sqrt(long double num) {

   lwz   r3,Table(rtoc)      // address of Table[]
   lhz   r4,24(sp)           // load
                             // Sign(1)+Exponent(11)+Mantissa(4)
   andi.   r5,r4,0xF         // keep only Mantissa(4)
   ori   r5,r5,0x3FE0        // exponent = -1+BIAS = 1022
   sth   r5,24(sp)           // save reduced number

   rlwinm   r5,r5,3,25,28    // take 8*Mantissa(4) as index
   lfd   fp1,24(sp)          // load reduced number
   lfsux   fp4,r5,r3         // load coefficient A
   lfs   fp5,4(r5)           // load coefficient B
   lfs   fp3,128(r3)         // load SQRT(2)
   fmr   fp2,fp1             // copy reduced number
   rlwinm.   r5,r4,31,18,28  // divide exponent by 2
   beq   @@2                 // if (exponent == 0) then done

   fmadd   fp2,fp2,fp5,fp4   // approximation SQRT(x) = A + B*x
   andi.   r4,r4,0x10        // check if exponent even
   beq   @@1                 // if (exponent even) do iteration
   fmul   fp2,fp2,fp3        // multiply reduced number by SQRT(2)
   fadd   fp1,fp1,fp1        // adjust exponent of original number

@@1:   fadd   fp3,fp2,fp2    // 2*x
   fmul   fp5,fp2,fp1        // x*n
   fadd   fp3,fp3,fp3        // 4*x
   fmadd   fp4,fp2,fp2,fp1   // x*x + n
   fmul   fp5,fp3,fp5        // 4*x*x*n
   fmul   fp6,fp2,fp4        // denominator = x*(x*x + n)
   fmadd   fp5,fp4,fp4,fp5   // numerator = (x*x + n)*(x*x + n) +
                             // 4*x*x*n
   fdiv   fp1,fp5,fp6        // double precision division
   andi.   r5,r5,0x7FF0      // mask exponent 
   addi   r5,r5,0x1FE0       // rectify new exponent

@@2:   sth   r5,132(r3)      // save constant C (power of 2) 
   lfd   fp2,132(r3)         // load constant C
   fmul   fp1,fp1,fp2        // multiply by C to replace exponent
   blr                       // done, the result is in fp1
}

Performance

The code presented above runs in less than 100 cycles, which means less than 1 microsecond on a 7200/75 Power Macintosh and is more than six times faster than the ROM code. The code could be modified to make use of the floating reciprocal square root estimate instruction (frsqrte) that is available on the MPC603 and MPC604 processors, and which has an accuracy of 5 bits. It is not available on the MPC601, however. The method used here could also be used to evaluate other transcendental functions.

Performance was measured by running the code a thousand times and calling a simple timing routine found in (Motorola, 1993), that we called myGetTime(). It uses the real-time clock of the MPC 601 processor (RTCU and RTCL registers) and is shown in Listing 2. The routine would have to be modified to run on MPC603 or MPC604 processors, since they don't have the same real-time clock mechanism.

The code doesn't support denormalized numbers (below 2.22507385851E-308). This could easily be implemented albeit at the cost of a slight reduction in performance.

Listing 2: myGetTime.c

asm long myGetTime()
   {
lp:    mfspr   r4,4           // RTCU
   mfspr   r3,5               // RTCL
   mfspr   r5,4               // RTCU again
    cmpw      r4,r5           // if RTCU has changed, try again
    bne      lp
    rlwinm   r3,r3,25,7,31    // shift right since bits 25-31 are
                              // not used
    blr                       // the result is in r3. 1 unit is
                              // worth 128 ns.
   }

To run the code, a very simple interface using the SIOUX library is provided in Listing 3.

Listing 3: main.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fp.h>

void main()
   {
   long double   num, num2;
   long startTime, endTime, time;
   short i;


   do {
   printf("%2s","> ");           // caret
   scanf("%Lf",&num);           // read long double
   if (num < 0.0) num = 0.0;     // replace by 0.0 if negative
   startTime = myGetTime();
   for (i = 0; i < 1000; i++)    // repeat 1000 times
   num2 = SQRoot(num);              // call our function
   endTime = myGetTime();
   time = endTime - startTime;
   if (num > 1e-6 && num < 1e7)
      printf("%7s%Lf\n","root = ",num2);   // show result
   else
      printf("%7s%Le\n","root = ",num2);
   printf("%7s%d\n","time = ", time);      // show elapsed time
   }
   while (1);                              // repeat until Quit
   }

References

PowerPC 601 RISC Microprocessor User's Manual, Motorola MPC601UM/AD Rev 1, 1993.


The first three authors are undergraduate students in Computer Science at Université Laval in Québec, Canada. This work was done as an assignment in a course on Computer Architecture given by the fourth author.

 

Community Search:
MacTech Search:

Software Updates via MacUpdate

PDFpenPro 12.2.1 - Advanced PDF toolkit...
PDFpenPro allows users to edit PDF's easily. Add text, images and signatures. Fill out PDF forms. Merge or split PDF documents. Reorder and delete pages. Create fillable forms and tables of content... Read more
PDFpen 12.2.1 - Edit and annotate PDFs w...
PDFpen allows users to easily edit PDF's. Add text, images and signatures. Fill out PDF forms. Merge or split PDF documents. Reorder and delete pages. Even correct text and edit graphics! Features... Read more
Civilization VI 1.3.7 - Next iteration o...
Civilization® VI is the award-winning experience. Expand your empire across the map, advance your culture, and compete against history’s greatest leaders to build a civilization that will stand the... Read more
Time Out 2.7 - Break reminder tool with...
Time Out helps remind you to take work breaks throughout the day. It has two kinds of breaks: a "Normal" break, typically for 10 minutes after 50 minutes of work, so you can move about and relax,... Read more
OsiriX Lite 12.0 - 3D medical image proc...
OsiriX Lite is an image processing software dedicated to DICOM images (".dcm" / ".DCM" extension) produced by medical equipment (MRI, CT, PET, PET-CT, ...) and confocal microscopy (LSM and BioRAD-PIC... Read more
Hazel 5.0.1 - Create rules for organizin...
Hazel is your personal housekeeper, organizing and cleaning folders based on rules you define. Hazel can also manage your trash and uninstall your applications. Organize your files using a familiar... Read more
SketchUp 21.0.338 - Create 3D design con...
SketchUp is an easy-to-learn 3D modeling program that enables you to explore the world in 3D. With just a few simple tools, you can create 3D models of houses, sheds, decks, home additions,... Read more
1Password 7.7 - Powerful password manage...
1Password is a password manager that uniquely brings you both security and convenience. It is the only program that provides anti-phishing protection and goes beyond password management by adding Web... Read more
Tidy Up 5.4.0 - Find duplicate files and...
Tidy Up is a full-featured duplicate finder and disk-tidiness utility. Features: Supports Lightroom: it is now possible to search and collect duplicates directly in the Lightroom library. Multiple... Read more
DEVONthink Pro 3.6 - Knowledge base, inf...
DEVONthink is DEVONtechnologies' document and information management solution. It supports a large variety of file formats and stores them in a database enhanced by artificial intelligence (AI). Many... Read more

Latest Forum Discussions

See All

Transport City: Truck Tycoon is a logist...
Transport City: Truck Tycoon is a brand new simulation game from developer Cupgum that's available now for iOS and Android. As you can probably guess from the name of the game it'll see you working on building a supply chain and logistics setup... | Read more »
Bridge Constructor: The Walking Dead is...
Headup Games' team up with their popular bridge-building franchise and AMC's zombie TV is now available for iOS and Android. Bridge Constructor: The Walking Dead is the latest game in the series and the second to make use of another IP after... | Read more »
Genshin Impact - How to Beat Tartaglia (...
In the 1.1 update to Genshin Impact, the final chapter to the story arc in Liyue reaches its climax with a fight against Tartaglia (AKA Childe), one of the Fatui Harbingers. We won't go into detail as to why this is or the circumstances of the... | Read more »
Genshin Impact - Unreconciled Stars Even...
Genshin Impact's 1.1 update has brought a ton of new content to the game, and one of these things is a brand new event: Unreconciled Stars. This event kicked off yesterday and is running through the end of the month. There are sweet rewards and... | Read more »
The next expansion for Gwent, “Way of th...
CD Projekt Red has announced Way of the Witcher, the next free expansion for its card game, Gwent: The Witcher Card Game. You won’t have long to wait either, as it’s coming to the game on 8th December. [Read more] | Read more »
Distract Yourself With These Great Mobil...
There’s a lot going on right now, and I don’t really feel like trying to write some kind of pithy intro for it. All I’ll say is lots of people have been coming together and helping each other in small ways, and I’m choosing to focus on that as I... | Read more »
FMV thriller The Complex coming to iOS a...
Wales Interactive has announced its interactive FMV sci-fi thriller, The Complex, is coming to iOS and Android on 3rd December. [Read more] | Read more »
Frogger in Toy Town has received a new u...
Frogger in Toy Town was one of the initial bunch of Apple Arcade titles that launched alongside the service. As such it recently celebrated its first anniversary. To commemorate the occasion and to celebrate the game's first birthday it has... | Read more »
Genshin Impact Guide - Bounties Explaine...
Bounties are new to Genshin Impact in its 1.1 update. As mentioned in our write-up of the new content, players can now grind out reputation for each region available in the game, and one of the ways you can do this is through hunting down powerful... | Read more »
Marvel Realm of Champions will launch wo...
Following a hefty pre-registration period, Kabam has now revealed that its 3v3 MOBA Marvel Realm of Champions will release on iOS and Android devices on 16th December and there will be a big update made to the game to celebrate it. [Read more] | Read more »

Price Scanner via MacPrices.net

B&H offers Apple’s new M1 Mac minis for u...
B&H Photo has Apple’s new 2020 Mac minis with M1 Silicon on sale for up to $40 off MSRP, starting at only $669. Expedited shipping is free to many addresses in the US: – 2020 Mac mini M1 CPU/... Read more
B&H offers early Black Friday $50 discoun...
Apple’s new 2020 13″ MacBook Pros with M1 Apple Silicon CPUs are now on sale at B&H Photo for $50 off MSRP, starting at $1249, as part of their early Black Friday/Cyber Monday 2020 sales event.... Read more
Apple’s hot new 10.9″ iPad Air is on sale tod...
Amazon has Apple’s new 10.9″ 256GB WiFi iPad Air, in Space Gray, on sale and available for order today for $699 shipped. Their price is $50 off Apple’s MSRP for this model, and it’s currently the... Read more
Here’s how to get a 2020 13″ Intel-based MacB...
Looking for a cheap MacBook? With the introduction of Apple’s new 13″ M1 MacBook Airs, Apple and its resellers have dropped prices on clearance Intel-based models to new low prices ranging from $809... Read more
New at Verizon: Buy one Apple iPhone SE, get...
Buy one 64GB Apple iPhone SE at Verizon, and get a second 64GB iPhone SE for free, or $400 off 128GB and 256GB models. Offer is valid for customers opening a new line of service with a Verizon... Read more
New Verizon Black Friday 2020 promo: Free App...
Verizon is offering the new 2020 64GB iPhone 12 for free with new lines of service, and up to $800 off on 128GB and 256GB models. Qualified trade-in required, and purchase price will be credited to... Read more
Apple’s new M1 Mac mini is on sale for $669,...
Amazon has Apple’s new 2020 Mac mini with an Apple Silicon M1 processor and 256GB SSD on sale and available for order today for only $669.99 shipped. Their price is $30 off MSRP, and it’s the... Read more
Apple restocks full line of clearance 2019 iM...
Apple has restocked Certified Refurbished 2019 21″ & 27″ iMacs starting at $889 and up to $520 off original MSRP. Apple’s one-year warranty is standard, shipping is free, and each iMac features a... Read more
Apple’s new 2020 13″ MacBook Pros with M1 App...
Amazon has Apple’s new 2020 13″ MacBook Pros with M1 Apple Silicon CPUs now on sale for $50 off MSRP, starting at $1249. They are among the first Apple resellers to offer a discount on any Apple... Read more
Save up to $120 on an LG UltraFine Display fo...
In the market for one of LG’s UltraFine Display for the Mac? These displays are designed and engineered, with Apple’s help, to work specifically with your newer MacBook Pro, MacBook Air, iMac, iMac... Read more

Jobs Board

Surgical Technologist - *Apple* Hill Surgic...
Surgical Technlogist Apple Hill Surgical Center Full Time - Day Shift - Possible Saturdays *$3,000 Sign On Bonus* *New Grads considered* Job Description: Under Read more
Geek Squad Advanced Repair *Apple* Professi...
**788646BR** **Job Title:** Geek Squad Advanced Repair Apple Professional **Job Category:** Store Associates **Store Number or Department:** 000498-Bellevue-Store Read more
Senior Software Developer - *Apple* - The W...
**SUMMARY** Hulu's Apple team is seeking a Senior Software Developer who will be an outstanding addition to our team. As a developer at Hulu, you will contribute to Read more
Senior Data Engineer - *Apple* - Theorem, L...
Job Summary Apple is seeking an experienced, detail-minded data engineeringconsultant to join our worldwide business development and strategy team. If you are Read more
Cub Foods - *Apple* Valley - Now Hiring Par...
Cub Foods - Apple Valley - Now Hiring Part Time! United States of America, Minnesota, Apple Valley Retail Post Date 9 hours ago Requisition # 124800 Sign Up for Read more
All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.