TweetFollow Us on Twitter

Fast Square Root Calc

Volume Number: 14 (1998)
Issue Number: 1
Column Tag: Assembler Workshop

Fast Square Root Calculation

by Guillaume Bédard, Frédéric Leblanc, Yohan Plourde and Pierre Marchand, Québec, Canada

Optimizing PowerPC Assembler Code to beat the Toolbox

Introduction

The calculation of the square root of a floating-point number is a frequently encountered task. However, the PowerPC processors don't have a square root instruction. The implementation presented here performs the square root of a double-precision number over the full range of representation of the IEEE 754 standard for normalized numbers (from 2.22507385851E-308 to 1.79769313486E308) with an accuracy of 15 or more decimal digits. It is very fast, at least six times faster than the Toolbox ROM call.

Theory of Operation

A floating point number has three components: a sign, a mantissa and an exponential part. For example, the number +3.5 x 10^4 (35 000) has a plus sign, a mantissa of 3.5 and an exponential part of 104. The mantissa consists of an integer part and a fractional part f.

A double precision number in IEEE 754 format has the same components: a sign bit s, an 11-bit exponent e and a 52 bit fraction f. The exponential part is expressed in powers of 2 and the exponent is biased by adding 1023 to the value of e. The mantissa is normalized to be of the form 1.f. Since the integer part of the normalized mantissa is always 1, it doesn't have to be included in the representation. The number is thus represented as follows: (-1)s x 1.f x 2^e+1023.

For example, the number 5.0 can be expressed in binary as 101.0, which means 101.0 x 2^0, which in turn is equal to 1.010 x 2^2, obtained by dividing the mantissa by 4 and multiplying 2^0 by 4. Therefore, the normalized mantissa is 1.010 and the exponent 2. Fraction f is then .01000000.... The biased exponent is obtained by adding 1023 to e and is 1025, or 10000000001 in binary. The double precision IEEE representation of 5.0 is finally:

-----------------------------------------------------------------------------
|  0 | 10000000001 | 01000000000000000000000000000000000000000000000000000  |
-----------------------------------------------------------------------------

or 4014000000000000 in hexadecimal notation for short.

First Approximation

Given this representation, a first approximation to the square root of a number is obtained by dividing the exponent by 2. If the number is an even power of 2 such as 16 or 64, the exact root is obtained. If the number is an odd power of 2 such as 8 or 32, 1/SQRT(2) times the square root is obtained. In general, the result will be within a factor SQRT(2) of the true value.

Refining the Approximation

The Newton-Raphson method is often used to obtain a more accurate value for the root x of a function f(x) once an initial approximation x0 is given:

[1]

This becomes, in the case of the square root of n, = x2 - n:where f(x)

[2]

An excellent approximation to the square root starting with the initial approximation given above is obtained within 5 iterations using equation [2]. This algorithm is already pretty fast, but its speed is limited by the fact that each iteration requires a double-precision division which is the slowest PowerPC floating-point instruction with 32 cycles on the MPC601 (Motorola, 1993).

Eliminating Divisions

Another approach is to use equation [1] with the function.

In this case, equation [1] becomes:

[3]

There is still a division by n, but since n is constant (it's the original number whose root we want to find), it can be replaced by multiplying by 1/n, which can be calculated once before the beginning of the iteration process. The five 32-cycle divisions are thus replaced by this single division followed by 5 much faster multiplications (5 cycles each). This approach is approximately three times faster than the preceding one. However, care must be taken for large numbers since the term in x02 can cause the operation to overflow.

Use of a table

Finally, an approach that is even faster consists in using a table to obtain a more accurate first approximation. In order to do so, the range of possible values of fraction f (0 to ~1) is divided into 16 sub-ranges by using the first 4 bits of f as an index into a table which contains the first two coefficients of the Taylor expansion of the square root of the mantissa (1.0 to ~2) over that sub-range.

The Taylor expansion is given in general by:

[4]

the first two terms of which yield, in the case where f(x) = SQRT(x):

[5]

The square root of x is thus approximated by 16 straight-line segments. The table therefore contains the values of

A =

and B =

for each of the 16 sub-ranges as shown in Figure 1. This first approximation gives an accuracy of about 1.5 %.

Figure 1. Approximation by straight line segment.

To reach the desired accuracy of 15 digits, equation [2] is applied twice to the result of equation [5]. To avoid having to perform two divisions by repeating the iteration, the two iterations are folded together as follows, which contains only one division:

and

[6]

In order to perform these calculation, the exponent of x and n is reduced to -1 (1022 biased), so that floating-point operations apply only to the values of the mantissa and don't overflow if the exponent is very large. The value of these numbers will therefore be in the range 0.5 to 1.0 since the mantissa is in the range 1.0 to 2.0. If the original exponent was odd, the mantissa is multiplied by SQRT(2) before applying equation [6].

Finally, the original exponent divided by two is restored at the end.

The Code

The SQRoot function shown in Listing 1 has been implemented in CodeWarrior C/C++ version 10.

Listing 1: SQRoot.c

// On entry, fp1 contains a positive number between 2.22507385851E-308
// and 1.79769313486E308. On exit, the result is in fp1.

asm long double SQRoot(long double num);   // prototype

float Table[35] = {
0.353553390593, 0.707106781187, 0.364434493428, 0.685994340570,
0.375000000000, 0.666666666667, 0.385275875186, 0.648885684523,
0.395284707521, 0.632455532034, 0.405046293650, 0.617213399848,
0.414578098794, 0.603022689156, 0.423895623945, 0.589767824620,
0.433012701892, 0.577350269190, 0.441941738242, 0.565685424949,
0.450693909433, 0.554700196225, 0.459279326772, 0.544331053952,
0.467707173347, 0.534522483825, 0.475985819116, 0.525225731439,
0.484122918276, 0.516397779494, 0.492125492126, 0.508000508001,
1.414213562373, 0.000000000000, 0.000000000000 };

asm long double Sqrt(long double num) {

   lwz   r3,Table(rtoc)      // address of Table[]
   lhz   r4,24(sp)           // load
                             // Sign(1)+Exponent(11)+Mantissa(4)
   andi.   r5,r4,0xF         // keep only Mantissa(4)
   ori   r5,r5,0x3FE0        // exponent = -1+BIAS = 1022
   sth   r5,24(sp)           // save reduced number

   rlwinm   r5,r5,3,25,28    // take 8*Mantissa(4) as index
   lfd   fp1,24(sp)          // load reduced number
   lfsux   fp4,r5,r3         // load coefficient A
   lfs   fp5,4(r5)           // load coefficient B
   lfs   fp3,128(r3)         // load SQRT(2)
   fmr   fp2,fp1             // copy reduced number
   rlwinm.   r5,r4,31,18,28  // divide exponent by 2
   beq   @@2                 // if (exponent == 0) then done

   fmadd   fp2,fp2,fp5,fp4   // approximation SQRT(x) = A + B*x
   andi.   r4,r4,0x10        // check if exponent even
   beq   @@1                 // if (exponent even) do iteration
   fmul   fp2,fp2,fp3        // multiply reduced number by SQRT(2)
   fadd   fp1,fp1,fp1        // adjust exponent of original number

@@1:   fadd   fp3,fp2,fp2    // 2*x
   fmul   fp5,fp2,fp1        // x*n
   fadd   fp3,fp3,fp3        // 4*x
   fmadd   fp4,fp2,fp2,fp1   // x*x + n
   fmul   fp5,fp3,fp5        // 4*x*x*n
   fmul   fp6,fp2,fp4        // denominator = x*(x*x + n)
   fmadd   fp5,fp4,fp4,fp5   // numerator = (x*x + n)*(x*x + n) +
                             // 4*x*x*n
   fdiv   fp1,fp5,fp6        // double precision division
   andi.   r5,r5,0x7FF0      // mask exponent 
   addi   r5,r5,0x1FE0       // rectify new exponent

@@2:   sth   r5,132(r3)      // save constant C (power of 2) 
   lfd   fp2,132(r3)         // load constant C
   fmul   fp1,fp1,fp2        // multiply by C to replace exponent
   blr                       // done, the result is in fp1
}

Performance

The code presented above runs in less than 100 cycles, which means less than 1 microsecond on a 7200/75 Power Macintosh and is more than six times faster than the ROM code. The code could be modified to make use of the floating reciprocal square root estimate instruction (frsqrte) that is available on the MPC603 and MPC604 processors, and which has an accuracy of 5 bits. It is not available on the MPC601, however. The method used here could also be used to evaluate other transcendental functions.

Performance was measured by running the code a thousand times and calling a simple timing routine found in (Motorola, 1993), that we called myGetTime(). It uses the real-time clock of the MPC 601 processor (RTCU and RTCL registers) and is shown in Listing 2. The routine would have to be modified to run on MPC603 or MPC604 processors, since they don't have the same real-time clock mechanism.

The code doesn't support denormalized numbers (below 2.22507385851E-308). This could easily be implemented albeit at the cost of a slight reduction in performance.

Listing 2: myGetTime.c

asm long myGetTime()
   {
lp:    mfspr   r4,4           // RTCU
   mfspr   r3,5               // RTCL
   mfspr   r5,4               // RTCU again
    cmpw      r4,r5           // if RTCU has changed, try again
    bne      lp
    rlwinm   r3,r3,25,7,31    // shift right since bits 25-31 are
                              // not used
    blr                       // the result is in r3. 1 unit is
                              // worth 128 ns.
   }

To run the code, a very simple interface using the SIOUX library is provided in Listing 3.

Listing 3: main.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fp.h>

void main()
   {
   long double   num, num2;
   long startTime, endTime, time;
   short i;


   do {
   printf("%2s","> ");           // caret
   scanf("%Lf",&num);           // read long double
   if (num < 0.0) num = 0.0;     // replace by 0.0 if negative
   startTime = myGetTime();
   for (i = 0; i < 1000; i++)    // repeat 1000 times
   num2 = SQRoot(num);              // call our function
   endTime = myGetTime();
   time = endTime - startTime;
   if (num > 1e-6 && num < 1e7)
      printf("%7s%Lf\n","root = ",num2);   // show result
   else
      printf("%7s%Le\n","root = ",num2);
   printf("%7s%d\n","time = ", time);      // show elapsed time
   }
   while (1);                              // repeat until Quit
   }

References

PowerPC 601 RISC Microprocessor User's Manual, Motorola MPC601UM/AD Rev 1, 1993.


The first three authors are undergraduate students in Computer Science at Université Laval in Québec, Canada. This work was done as an assignment in a course on Computer Architecture given by the fourth author.

 

Community Search:
MacTech Search:

Software Updates via MacUpdate

Hopper Disassembler 5.6.1 - Binary disas...
Hopper Disassembler is a binary disassembler, decompiler, and debugger for 32- and 64-bit executables. It will let you disassemble any binary you want, and provide you all the information about its... Read more
Skim 1.6.11 - PDF reader and note-taker...
Skim is a PDF reader and note-taker for OS X. It is designed to help you read and annotate scientific papers in PDF, but is also great for viewing any PDF file. Skim includes many features and has a... Read more
Alfred 4.6.7 - Quick launcher for apps a...
Alfred is an award-winning productivity application for OS X. Alfred saves you time when you search for files online or on your Mac. Be more productive with hotkeys, keywords, and file actions at... Read more
Transmit 5.8.7 - Excellent FTP/SFTP clie...
Transmit is an excellent FTP (file transfer protocol), SFTP, S3 (Amazon.com file hosting) and iDisk/WebDAV client that allows you to upload, download, and delete files over the internet. With the... Read more
Adobe Lightroom Classic 11.4.1 - Import,...
You can download Lightroom for Mac as a part of Creative Cloud for only $9.99/month with Photoshop, included as part of the photography package. The latest version of Lightroom gives you all of the... Read more
MarsEdit 4.5.9 - Quick and convenient bl...
MarsEdit is a blog editor for OS X that makes editing your blog like writing email, with spell-checking, drafts, multiple windows, and even AppleScript support. It works with with most blog services... Read more
Thunderbird 91.11.0 - Email client from...
As of July 2012, Thunderbird has transitioned to a new governance model, with new features being developed by the broader free software and open source community, and security fixes and improvements... Read more
A Better Finder Rename 11.50 - File, pho...
A Better Finder Rename is the most complete renaming solution available on the market today. That's why, since 1996, tens of thousands of hobbyists, professionals and businesses depend on A Better... Read more
DaisyDisk 4.23 - $9.99
DaisyDisk allows you to visualize your disk usage and free up disk space by quickly finding and deleting big unused files. The program scans your disk and displays its content as a sector diagram... Read more
BBEdit 14.5 - Powerful text and HTML edi...
BBEdit is the leading professional HTML and text editor for the Mac. Specifically crafted in response to the needs of Web authors and software developers, this award-winning product provides a... Read more

Latest Forum Discussions

See All

Downhill Mountain Biking Game ‘Descender...
Just over three years ago in May of 2019 developer RageSquid and publisher No More Robots released a quirky downhill mountain biking game called Descenders on PC and Xbox One. Bemoaning a lack of “extreme sports" titles in recent years led RageSquid... | Read more »
SwitchArcade Round-Up: ‘Monster Hunter R...
Hello gentle readers, and welcome to the SwitchArcade Round-Up for June 30th, 2022. Thursday is once more upon us, and that means a bunch of new releases to look at. We start things off with DLC for some very big games, Monster Hunter Rise and... | Read more »
‘HOOK 2’ Review – A Sharp Left Hook From...
The original HOOK ($1.99) had a very simple idea behind it. You were presented with a tangled mess of hooks and loops, and you needed to remove each one without snagging any others. Extremely simple at first, but as the puzzles rolled along,... | Read more »
‘Dicey Dungeons’ Mobile Version Launchin...
After a very long wait, Terry Cavanagh’s dungeon crawling roguelite deckbuiler hybrid experience Dicey Dungeons is coming to mobile platforms next week alongside a huge free DLC pack on all platforms. This DLC will be included in the mobile... | Read more »
Distract Yourself With These Great Mobil...
Every day, we pick out a curated list of the best mobile discounts on the App Store and post them here. This list won't be comprehensive, but it every game on it is recommended. Feel free to check out the coverage we did on them in the links below... | Read more »
‘Danganronpa S: Ultimate Summer Camp’ is...
If you’ve been following Danganronp over the last few years, Spike Chunsoft celebrated its anniversary by bringing the series to mobile in the form of anniversary editions. After the first two released, there was a long delay for V3, but it finally... | Read more »
Out Now: ‘HOOK 2’, ‘Incoherence’, ‘Juras...
Each and every day new mobile games are hitting the App Store, and so each week we put together a big old list of all the best new releases of the past seven days. Back in the day the App Store would showcase the same games for a week, and then... | Read more »
Upcoming Mobile MMO RPG Shooter ‘Avatar:...
This past January a contingent of developers made up of Archosaur Games, Tencent, Lightstorm Entertainment, and Disney announced a new mobile game set in James Cameron’s Avatar universe titled Avatar: Reckoning. | Read more »
Culinary Platformer ‘Chefy-Chef’ Coming...
If your name is Chefy, it’s pretty much a given that you should be a chef. Such is the case with Chefy-Chef, a game from Bug Studio about a chef named Chefy who must travel to all sorts of exotic locations using a magical refrigerator in an effort... | Read more »
SwitchArcade Round-Up: Nintendo Direct H...
Hello gentle readers, and welcome to the SwitchArcade Round-Up for June 29th, 2022. As the month winds to a close, we’ve got a bigger-than-usual Wednesday for both news and new games to check out. I’ve got some crib notes for yesterday’s Nintendo... | Read more »

Price Scanner via MacPrices.net

2nd generation 4K Apple TVs with Siri remote...
Apple has restocked a full line of Certified Refurbished 2nd generation 32GB and 64GB 4K Apple TVs with Siri remotes for $30 off the cost of new models. Apple’s standard one-year warranty is included... Read more
Back in stock: Apple Watch Series 7 models fo...
Apple has restocked Certified Refurbished Apple Watch Series 7 WiFi-only models in their online store for $60-$70 off MSRP, starting at $339. Each Watch includes Apple’s standard one-year warranty, a... Read more
July 4th Sale at Expercom: $200 off any 16″ M...
Apple reseller Expercom has 16″ M1 Pro and M1 Max MacBook Pros available for $200 off MSRP as part of their July 4th sale. In addition to their MacBook Pro sale prices, take $50 off AppleCare+ when... Read more
10.2″ Apple iPads (WiFi models) are on sale f...
Amazon has Apple’s 9th generation 10.2″ WiFi iPads on sale for up to $20-$50 off MSRP for a limited time. Their prices are the lowest price currently available for one of these iPads. All models are... Read more
10-Core M1 Pro 14″ MacBook Pros on sale for $...
B&H Photo is offering $200 discounts on Apple’s new 14″ M1 Pro MacBook Pros with 10-Core CPUs (16GB RAM/1TB SSDs). Free 1-2 day shipping is available to most US addresses, and both models are in... Read more
B&H has 16-inch M1 Pro MacBook Pros in st...
New Space Gray 16″ MacBook Pros with Apple’s M1 Pro CPUs are in stock and on sale today at B&H Photo for $200 off Apple’s MSRP. Sale prices are for M1 Pro models with 512GB or 1TB of SSD storage... Read more
Price drop! 13″ M1 MacBook Pro with 512GB SSD...
Amazon has dropped prices on recently-discontinued 13″ M1 MacBook Pros with a 512GB SSD by $200 off Apple’s original MSRP. Shipping is free: – 2020 13″ MacBook Pro (Space Gray or Silver) M1 CPU/512GB... Read more
Deal Alert! 14″ Apple MacBook Pros with M1 Pr...
Amazon has 14″ MacBook Pros with 8-Core M1 Pro CPUs back on sale for $200 off MSRP, only $1799. Shipping is free. Be sure to make your purchase from Amazon rather than a third-party seller: – 14″ M1... Read more
16″ MacBook Pros with Apple M1 Pro CPUs are b...
Amazon is discounting new 16″ MacBook Pros with 10-Core Apple M1 Pro CPUs by $200 off MSRP again today. Be sure to select Amazon as the seller rather than a third-party seller, and note that Amazon’s... Read more
13″ M1 MacBook Airs with 16GB of RAM availabl...
Apple has 13″ M1 MacBook Airs (8-Core CPU/7-Core GPU) in stock today with 16GB of RAM for $190 off MSRP, Certified Refurbished. Apple includes a standard one-year warranty with these models, each... Read more

Jobs Board

I/S Senior Engineer - *Apple* Systems Engin...
**19647BR** **Position Title:** I/S Senior Engineer - Apple Systems Engineering - Remote **Department:** Information Systems **Location:** Lakeland, FL between Read more
*Apple* IT Support Analyst - 2nd Shift - Zon...
Apple IT Support Analyst - 2nd Shift Professional Services Albany, New York Malta, New York Clifton Park, New York Menands, New York Syracuse, New York Watertown, Read more
Infotainment Certification Test Engineer (XC)...
…integration - CarPlay, android auto, MirrorLink, Baidu Carlife, MFi/iPod certification testing; Apple PPID preparation, Google HUCD and GTM preparation + 3 years of Read more
Workplace Services *Apple* Device Managemen...
…3350 Riverwood Parkway Suite 900, Atlanta, GA, 30339 USA **Workplace Services Apple Device Management** **Role Overview** Carrier is seeking an experienced and Read more
Physician Assistant - Certified, Primary Care...
Physician Assistant - Certified, Primary Care, Apple Valley (1.07FTE) + Job ID: 65766 + Department: AV Primary Care + City: Apple Valley, MN + Location: HP - Read more
All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.