TweetFollow Us on Twitter

Fast Square Root Calc

Volume Number: 14 (1998)
Issue Number: 1
Column Tag: Assembler Workshop

Fast Square Root Calculation

by Guillaume Bédard, Frédéric Leblanc, Yohan Plourde and Pierre Marchand, Québec, Canada

Optimizing PowerPC Assembler Code to beat the Toolbox

Introduction

The calculation of the square root of a floating-point number is a frequently encountered task. However, the PowerPC processors don't have a square root instruction. The implementation presented here performs the square root of a double-precision number over the full range of representation of the IEEE 754 standard for normalized numbers (from 2.22507385851E-308 to 1.79769313486E308) with an accuracy of 15 or more decimal digits. It is very fast, at least six times faster than the Toolbox ROM call.

Theory of Operation

A floating point number has three components: a sign, a mantissa and an exponential part. For example, the number +3.5 x 10^4 (35 000) has a plus sign, a mantissa of 3.5 and an exponential part of 104. The mantissa consists of an integer part and a fractional part f.

A double precision number in IEEE 754 format has the same components: a sign bit s, an 11-bit exponent e and a 52 bit fraction f. The exponential part is expressed in powers of 2 and the exponent is biased by adding 1023 to the value of e. The mantissa is normalized to be of the form 1.f. Since the integer part of the normalized mantissa is always 1, it doesn't have to be included in the representation. The number is thus represented as follows: (-1)s x 1.f x 2^e+1023.

For example, the number 5.0 can be expressed in binary as 101.0, which means 101.0 x 2^0, which in turn is equal to 1.010 x 2^2, obtained by dividing the mantissa by 4 and multiplying 2^0 by 4. Therefore, the normalized mantissa is 1.010 and the exponent 2. Fraction f is then .01000000.... The biased exponent is obtained by adding 1023 to e and is 1025, or 10000000001 in binary. The double precision IEEE representation of 5.0 is finally:

-----------------------------------------------------------------------------
|  0 | 10000000001 | 01000000000000000000000000000000000000000000000000000  |
-----------------------------------------------------------------------------

or 4014000000000000 in hexadecimal notation for short.

First Approximation

Given this representation, a first approximation to the square root of a number is obtained by dividing the exponent by 2. If the number is an even power of 2 such as 16 or 64, the exact root is obtained. If the number is an odd power of 2 such as 8 or 32, 1/SQRT(2) times the square root is obtained. In general, the result will be within a factor SQRT(2) of the true value.

Refining the Approximation

The Newton-Raphson method is often used to obtain a more accurate value for the root x of a function f(x) once an initial approximation x0 is given:

[1]

This becomes, in the case of the square root of n, = x2 - n:where f(x)

[2]

An excellent approximation to the square root starting with the initial approximation given above is obtained within 5 iterations using equation [2]. This algorithm is already pretty fast, but its speed is limited by the fact that each iteration requires a double-precision division which is the slowest PowerPC floating-point instruction with 32 cycles on the MPC601 (Motorola, 1993).

Eliminating Divisions

Another approach is to use equation [1] with the function.

In this case, equation [1] becomes:

[3]

There is still a division by n, but since n is constant (it's the original number whose root we want to find), it can be replaced by multiplying by 1/n, which can be calculated once before the beginning of the iteration process. The five 32-cycle divisions are thus replaced by this single division followed by 5 much faster multiplications (5 cycles each). This approach is approximately three times faster than the preceding one. However, care must be taken for large numbers since the term in x02 can cause the operation to overflow.

Use of a table

Finally, an approach that is even faster consists in using a table to obtain a more accurate first approximation. In order to do so, the range of possible values of fraction f (0 to ~1) is divided into 16 sub-ranges by using the first 4 bits of f as an index into a table which contains the first two coefficients of the Taylor expansion of the square root of the mantissa (1.0 to ~2) over that sub-range.

The Taylor expansion is given in general by:

[4]

the first two terms of which yield, in the case where f(x) = SQRT(x):

[5]

The square root of x is thus approximated by 16 straight-line segments. The table therefore contains the values of

A =

and B =

for each of the 16 sub-ranges as shown in Figure 1. This first approximation gives an accuracy of about 1.5 %.

Figure 1. Approximation by straight line segment.

To reach the desired accuracy of 15 digits, equation [2] is applied twice to the result of equation [5]. To avoid having to perform two divisions by repeating the iteration, the two iterations are folded together as follows, which contains only one division:

and

[6]

In order to perform these calculation, the exponent of x and n is reduced to -1 (1022 biased), so that floating-point operations apply only to the values of the mantissa and don't overflow if the exponent is very large. The value of these numbers will therefore be in the range 0.5 to 1.0 since the mantissa is in the range 1.0 to 2.0. If the original exponent was odd, the mantissa is multiplied by SQRT(2) before applying equation [6].

Finally, the original exponent divided by two is restored at the end.

The Code

The SQRoot function shown in Listing 1 has been implemented in CodeWarrior C/C++ version 10.

Listing 1: SQRoot.c

// On entry, fp1 contains a positive number between 2.22507385851E-308
// and 1.79769313486E308. On exit, the result is in fp1.

asm long double SQRoot(long double num);   // prototype

float Table[35] = {
0.353553390593, 0.707106781187, 0.364434493428, 0.685994340570,
0.375000000000, 0.666666666667, 0.385275875186, 0.648885684523,
0.395284707521, 0.632455532034, 0.405046293650, 0.617213399848,
0.414578098794, 0.603022689156, 0.423895623945, 0.589767824620,
0.433012701892, 0.577350269190, 0.441941738242, 0.565685424949,
0.450693909433, 0.554700196225, 0.459279326772, 0.544331053952,
0.467707173347, 0.534522483825, 0.475985819116, 0.525225731439,
0.484122918276, 0.516397779494, 0.492125492126, 0.508000508001,
1.414213562373, 0.000000000000, 0.000000000000 };

asm long double Sqrt(long double num) {

   lwz   r3,Table(rtoc)      // address of Table[]
   lhz   r4,24(sp)           // load
                             // Sign(1)+Exponent(11)+Mantissa(4)
   andi.   r5,r4,0xF         // keep only Mantissa(4)
   ori   r5,r5,0x3FE0        // exponent = -1+BIAS = 1022
   sth   r5,24(sp)           // save reduced number

   rlwinm   r5,r5,3,25,28    // take 8*Mantissa(4) as index
   lfd   fp1,24(sp)          // load reduced number
   lfsux   fp4,r5,r3         // load coefficient A
   lfs   fp5,4(r5)           // load coefficient B
   lfs   fp3,128(r3)         // load SQRT(2)
   fmr   fp2,fp1             // copy reduced number
   rlwinm.   r5,r4,31,18,28  // divide exponent by 2
   beq   @@2                 // if (exponent == 0) then done

   fmadd   fp2,fp2,fp5,fp4   // approximation SQRT(x) = A + B*x
   andi.   r4,r4,0x10        // check if exponent even
   beq   @@1                 // if (exponent even) do iteration
   fmul   fp2,fp2,fp3        // multiply reduced number by SQRT(2)
   fadd   fp1,fp1,fp1        // adjust exponent of original number

@@1:   fadd   fp3,fp2,fp2    // 2*x
   fmul   fp5,fp2,fp1        // x*n
   fadd   fp3,fp3,fp3        // 4*x
   fmadd   fp4,fp2,fp2,fp1   // x*x + n
   fmul   fp5,fp3,fp5        // 4*x*x*n
   fmul   fp6,fp2,fp4        // denominator = x*(x*x + n)
   fmadd   fp5,fp4,fp4,fp5   // numerator = (x*x + n)*(x*x + n) +
                             // 4*x*x*n
   fdiv   fp1,fp5,fp6        // double precision division
   andi.   r5,r5,0x7FF0      // mask exponent 
   addi   r5,r5,0x1FE0       // rectify new exponent

@@2:   sth   r5,132(r3)      // save constant C (power of 2) 
   lfd   fp2,132(r3)         // load constant C
   fmul   fp1,fp1,fp2        // multiply by C to replace exponent
   blr                       // done, the result is in fp1
}

Performance

The code presented above runs in less than 100 cycles, which means less than 1 microsecond on a 7200/75 Power Macintosh and is more than six times faster than the ROM code. The code could be modified to make use of the floating reciprocal square root estimate instruction (frsqrte) that is available on the MPC603 and MPC604 processors, and which has an accuracy of 5 bits. It is not available on the MPC601, however. The method used here could also be used to evaluate other transcendental functions.

Performance was measured by running the code a thousand times and calling a simple timing routine found in (Motorola, 1993), that we called myGetTime(). It uses the real-time clock of the MPC 601 processor (RTCU and RTCL registers) and is shown in Listing 2. The routine would have to be modified to run on MPC603 or MPC604 processors, since they don't have the same real-time clock mechanism.

The code doesn't support denormalized numbers (below 2.22507385851E-308). This could easily be implemented albeit at the cost of a slight reduction in performance.

Listing 2: myGetTime.c

asm long myGetTime()
   {
lp:    mfspr   r4,4           // RTCU
   mfspr   r3,5               // RTCL
   mfspr   r5,4               // RTCU again
    cmpw      r4,r5           // if RTCU has changed, try again
    bne      lp
    rlwinm   r3,r3,25,7,31    // shift right since bits 25-31 are
                              // not used
    blr                       // the result is in r3. 1 unit is
                              // worth 128 ns.
   }

To run the code, a very simple interface using the SIOUX library is provided in Listing 3.

Listing 3: main.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fp.h>

void main()
   {
   long double   num, num2;
   long startTime, endTime, time;
   short i;


   do {
   printf("%2s","> ");           // caret
   scanf("%Lf",&num);           // read long double
   if (num < 0.0) num = 0.0;     // replace by 0.0 if negative
   startTime = myGetTime();
   for (i = 0; i < 1000; i++)    // repeat 1000 times
   num2 = SQRoot(num);              // call our function
   endTime = myGetTime();
   time = endTime - startTime;
   if (num > 1e-6 && num < 1e7)
      printf("%7s%Lf\n","root = ",num2);   // show result
   else
      printf("%7s%Le\n","root = ",num2);
   printf("%7s%d\n","time = ", time);      // show elapsed time
   }
   while (1);                              // repeat until Quit
   }

References

PowerPC 601 RISC Microprocessor User's Manual, Motorola MPC601UM/AD Rev 1, 1993.


The first three authors are undergraduate students in Computer Science at Université Laval in Québec, Canada. This work was done as an assignment in a course on Computer Architecture given by the fourth author.

 

Community Search:
MacTech Search:

Software Updates via MacUpdate

Dropbox 193.4.5594 - Cloud backup and sy...
Dropbox is a file hosting service that provides cloud storage, file synchronization, personal cloud, and client software. It is a modern workspace that allows you to get to all of your files, manage... Read more
Google Chrome 122.0.6261.57 - Modern and...
Google Chrome is a Web browser by Google, created to be a modern platform for Web pages and applications. It utilizes very fast loading of Web pages and has a V8 engine, which is a custom built... Read more
Skype 8.113.0.210 - Voice-over-internet...
Skype is a telecommunications app that provides HD video calls, instant messaging, calling to any phone number or landline, and Skype for Business for productive cooperation on the projects. This... Read more
Tor Browser 13.0.10 - Anonymize Web brow...
Using Tor Browser you can protect yourself against tracking, surveillance, and censorship. Tor was originally designed, implemented, and deployed as a third-generation onion-routing project of the U.... Read more
Deeper 3.0.4 - Enable hidden features in...
Deeper is a personalization utility for macOS which allows you to enable and disable the hidden functions of the Finder, Dock, QuickTime, Safari, iTunes, login window, Spotlight, and many of Apple's... Read more
OnyX 4.5.5 - Maintenance and optimizatio...
OnyX is a multifunction utility that you can use to verify the startup disk and the structure of its system files, to run miscellaneous maintenance and cleaning tasks, to configure parameters in the... Read more
Hopper Disassembler 5.14.1 - Binary disa...
Hopper Disassembler is a binary disassembler, decompiler, and debugger for 32- and 64-bit executables. It will let you disassemble any binary you want, and provide you all the information about its... Read more
WhatsApp 24.3.78 - Desktop client for Wh...
WhatsApp is the desktop client for WhatsApp Messenger, a cross-platform mobile messaging app which allows you to exchange messages without having to pay for SMS. WhatsApp Messenger is available for... Read more
War Thunder 2.33.0.135 - Multiplayer war...
In War Thunder, aircraft, attack helicopters, ground forces and naval ships collaborate in realistic competitive battles. You can choose from over 1,500 vehicles and an extensive variety of combat... Read more
Iridient Developer 4.2 - Powerful image-...
Iridient Developer (was RAW Developer) is a powerful image-conversion application designed specifically for OS X. Iridient Developer gives advanced photographers total control over every aspect of... Read more

Latest Forum Discussions

See All

A Legitimate Massage Parlor, I Swear – T...
In this week’s Episode of The TouchArcade Show we talk about some of the major new releases on mobile this week including Warframe, we go over all the major news that came out from the Nintendo Direct Partner Showcase on Wednesday, we read our one... | Read more »
TouchArcade Game of the Week: ‘Rainbow S...
I feel like I am in a fever dream right now. What is this game that I’m playing? It’s a Rainbow Six game? But it’s all cutesy, and cartoony, and also kind of psychedelic? How is this a real thing? Well, it’s no fever dream, it is indeed a real thing... | Read more »
SwitchArcade Round-Up: ‘Promenade’, ‘Cho...
Hello gentle readers, and welcome to the SwitchArcade Round-Up for February 23rd, 2024. It’s Friday, so we have to check out the remaining releases of the week. Not so many big ones today, but a healthy crop nonetheless. After summarizing all the... | Read more »
Steam Deck Weekly: Gundam Breaker 4 and...
Welcome to this week’s slightly shorter edition of the Steam Deck Weekly. I was a bit unwell this week so no reviews in this edition, but there is a lot of news and new Steam Deck Verified and Playable games to catch up on. I have something special... | Read more »
The 10 Best Run-And-Gun Games for Ninten...
The year 2024 is a rare one, because it is a year with a brand-new Contra game. Contra: Operation Galuga might be the freshest face on the block when it comes to Nintendo Switch run-and-gun action games, but it’s hardly fighting that war alone.... | Read more »
Version 1.4 of Reverse: 1999 will be lan...
Free up your diary for February 29th, as Bluepoch has announced the impending release of the award-winning Reverse: 1999s Version 1.4 update. The Prisoner in the Cave is an Ancient Greece-themed update with new recruits, gameplay modes, and plenty... | Read more »
Premium Mobile RPG ‘Ex Astris’ From ‘Ark...
Arknights developer Hypergryph’s premium RPG Ex Astris () recently had its release date confirmed, and we finally have an extended gameplay showcase. | Read more »
Apple Arcade Weekly Round-Up: Updates fo...
Following yesterday’s big Hello Kitty Island Adventure update, a few more notable game updates and events have gone live on Apple Arcade. Cypher 007 () has gotten its first content update in a few months taking you to Egypt for five new levels... | Read more »
‘Thunder Ray’ and ‘Hime’s Quest’ Are Now...
Crunchyroll has pushed two new games into the Crunchyroll Game Vault including Purple Tree Studio’s Thunder Ray which was already on iOS before as a premium release. Shaun even reviewed it last year. Read his review here. The second game is Poppy... | Read more »
Adorable Kitty-Collector Sequel ‘Neko At...
Ya’ll. This October will mark the ten-year anniversary of Hit Point launching Neko Atsume, the adorable kitty collecting sim that has become a runaway success and essentially created a sub-genre of cozy pet-collecting life sim games. Sure, the... | Read more »

Price Scanner via MacPrices.net

16-inch M3 Max MacBook Pro on sale for $300 o...
Amazon is offering a $300 instant discount on one of Apple’s 16″ M3 Max MacBook Pros today. Shipping is free: – 16″ M3 Max MacBook Pros (36GB/1TB/Space Black): $3199, $300 off MSRP Their price is the... Read more
Apple M2 Mac minis on sale for $100 off MSRP...
B&H Photo has Apple’s M2-powered Mac minis in stock and on sale for $100 off MSRP this weekend with prices available starting at $499. Free 1-2 day shipping is available to most US addresses: –... Read more
Apple Watch SE on sale for $50 off MSRP this...
Best Buy has all Apple Watch SE models on sale this weekend for $50 off MSRP on their online store. Sale prices available for online orders only, in-store prices may vary. Order online, and choose... Read more
Deal Alert! Apple 15-inch M2 MacBook Airs on...
Looking for the lowest sale price on a new 15″ M2 MacBook Air? Best Buy has Apple 15″ MacBook Airs with M2 CPUs in stock and on sale today for $300 off MSRP on their online store. Prices valid for... Read more
Amazon discounts iPad mini 6 models up to $12...
Amazon is offering Apple’s 8.3″ WiFi iPad minis for $100-$120 off MSRP, including free shipping, for a limited time. Prices start at $399. Amazon’s prices are among the lowest currently available for... Read more
Apple AirPods Pro with USB-C discounted to $1...
Walmart has Apple’s 2023 AirPods Pro with USB-C in stock and on sale for $199.99 on their online store. Their price is $50 off MSRP, and it’s currently one the lowest prices available for new AirPods... Read more
Apple has 14-inch M3 MacBook Pro with 16GB of...
Apple has 14″ M3 MacBook Pros with 16GB of RAM, Certified Refurbished, available for $270-$300 off MSRP. Each model features a new outer case, shipping is free, and an Apple 1-year warranty is... Read more
Save $318-$432 on 16-inch M3 Max MacBook Pros...
Apple retailer Expercom has 16″ M3 Max MacBook Pros on sale for $318-$432 off MSRP when bundled with a 3-year AppleCare+ Protection Plan. Discounts are available for Silver models as well a Space... Read more
New today at Apple: 16-inch M3 Pro/M3 Max Mac...
Apple is now offering 16″ M3 Pro and M3 Max MacBook Pros, Certified Refurbished, starting at $2119 and ranging up to $530 off MSRP. Each model features a new outer case, shipping is free, and an... Read more
Apple is now offering $300-$480 discounts on...
Apple is now offering 14″ M3 Pro and M3 Max MacBook Pros, Certified Refurbished, starting at $1699 and ranging up to $480 off MSRP. Each model features a new outer case, shipping is free, and an... Read more

Jobs Board

Part-time *Apple* and Peach Research Assist...
…and minimum qualifications: + Assist with planting, pruning, and harvesting of apple and peach trees + Conduct regular maintenance tasks to ensure optimal Read more
Housekeeper, *Apple* Valley Villa - Cassia...
Apple Valley Villa, part of a senior living community, is hiring entry-level Full-Time Housekeepers to join our team! We will train you for this position and offer a Read more
Sublease Associate Optometrist- *Apple* Val...
Sublease Associate Optometrist- Apple Valley, CA- Target Optical Date: Feb 22, 2024 Brand: Target Optical Location: Apple Valley, CA, US, 92307 **Requisition Read more
Sublease Associate Optometrist- *Apple* Val...
Sublease Associate Optometrist- Apple Valley, CA- Target Optical Date: Jan 24, 2024 Brand: Target Optical Location: Apple Valley, CA, US, 92307 **Requisition Read more
Housekeeper, *Apple* Valley Village - Cassi...
Apple Valley Village Health Care Center, a senior care campus, is hiring a Part-Time Housekeeper to join our team! We will train you for this position! In this role, Read more
All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.