TweetFollow Us on Twitter

Fast Square Root Calc

Volume Number: 14 (1998)
Issue Number: 1
Column Tag: Assembler Workshop

Fast Square Root Calculation

by Guillaume Bédard, Frédéric Leblanc, Yohan Plourde and Pierre Marchand, Québec, Canada

Optimizing PowerPC Assembler Code to beat the Toolbox

Introduction

The calculation of the square root of a floating-point number is a frequently encountered task. However, the PowerPC processors don't have a square root instruction. The implementation presented here performs the square root of a double-precision number over the full range of representation of the IEEE 754 standard for normalized numbers (from 2.22507385851E-308 to 1.79769313486E308) with an accuracy of 15 or more decimal digits. It is very fast, at least six times faster than the Toolbox ROM call.

Theory of Operation

A floating point number has three components: a sign, a mantissa and an exponential part. For example, the number +3.5 x 10^4 (35 000) has a plus sign, a mantissa of 3.5 and an exponential part of 104. The mantissa consists of an integer part and a fractional part f.

A double precision number in IEEE 754 format has the same components: a sign bit s, an 11-bit exponent e and a 52 bit fraction f. The exponential part is expressed in powers of 2 and the exponent is biased by adding 1023 to the value of e. The mantissa is normalized to be of the form 1.f. Since the integer part of the normalized mantissa is always 1, it doesn't have to be included in the representation. The number is thus represented as follows: (-1)s x 1.f x 2^e+1023.

For example, the number 5.0 can be expressed in binary as 101.0, which means 101.0 x 2^0, which in turn is equal to 1.010 x 2^2, obtained by dividing the mantissa by 4 and multiplying 2^0 by 4. Therefore, the normalized mantissa is 1.010 and the exponent 2. Fraction f is then .01000000.... The biased exponent is obtained by adding 1023 to e and is 1025, or 10000000001 in binary. The double precision IEEE representation of 5.0 is finally:

-----------------------------------------------------------------------------
|  0 | 10000000001 | 01000000000000000000000000000000000000000000000000000  |
-----------------------------------------------------------------------------

or 4014000000000000 in hexadecimal notation for short.

First Approximation

Given this representation, a first approximation to the square root of a number is obtained by dividing the exponent by 2. If the number is an even power of 2 such as 16 or 64, the exact root is obtained. If the number is an odd power of 2 such as 8 or 32, 1/SQRT(2) times the square root is obtained. In general, the result will be within a factor SQRT(2) of the true value.

Refining the Approximation

The Newton-Raphson method is often used to obtain a more accurate value for the root x of a function f(x) once an initial approximation x0 is given:

[1]

This becomes, in the case of the square root of n, = x2 - n:where f(x)

[2]

An excellent approximation to the square root starting with the initial approximation given above is obtained within 5 iterations using equation [2]. This algorithm is already pretty fast, but its speed is limited by the fact that each iteration requires a double-precision division which is the slowest PowerPC floating-point instruction with 32 cycles on the MPC601 (Motorola, 1993).

Eliminating Divisions

Another approach is to use equation [1] with the function.

In this case, equation [1] becomes:

[3]

There is still a division by n, but since n is constant (it's the original number whose root we want to find), it can be replaced by multiplying by 1/n, which can be calculated once before the beginning of the iteration process. The five 32-cycle divisions are thus replaced by this single division followed by 5 much faster multiplications (5 cycles each). This approach is approximately three times faster than the preceding one. However, care must be taken for large numbers since the term in x02 can cause the operation to overflow.

Use of a table

Finally, an approach that is even faster consists in using a table to obtain a more accurate first approximation. In order to do so, the range of possible values of fraction f (0 to ~1) is divided into 16 sub-ranges by using the first 4 bits of f as an index into a table which contains the first two coefficients of the Taylor expansion of the square root of the mantissa (1.0 to ~2) over that sub-range.

The Taylor expansion is given in general by:

[4]

the first two terms of which yield, in the case where f(x) = SQRT(x):

[5]

The square root of x is thus approximated by 16 straight-line segments. The table therefore contains the values of

A =

and B =

for each of the 16 sub-ranges as shown in Figure 1. This first approximation gives an accuracy of about 1.5 %.

Figure 1. Approximation by straight line segment.

To reach the desired accuracy of 15 digits, equation [2] is applied twice to the result of equation [5]. To avoid having to perform two divisions by repeating the iteration, the two iterations are folded together as follows, which contains only one division:

and

[6]

In order to perform these calculation, the exponent of x and n is reduced to -1 (1022 biased), so that floating-point operations apply only to the values of the mantissa and don't overflow if the exponent is very large. The value of these numbers will therefore be in the range 0.5 to 1.0 since the mantissa is in the range 1.0 to 2.0. If the original exponent was odd, the mantissa is multiplied by SQRT(2) before applying equation [6].

Finally, the original exponent divided by two is restored at the end.

The Code

The SQRoot function shown in Listing 1 has been implemented in CodeWarrior C/C++ version 10.

Listing 1: SQRoot.c

// On entry, fp1 contains a positive number between 2.22507385851E-308
// and 1.79769313486E308. On exit, the result is in fp1.

asm long double SQRoot(long double num);   // prototype

float Table[35] = {
0.353553390593, 0.707106781187, 0.364434493428, 0.685994340570,
0.375000000000, 0.666666666667, 0.385275875186, 0.648885684523,
0.395284707521, 0.632455532034, 0.405046293650, 0.617213399848,
0.414578098794, 0.603022689156, 0.423895623945, 0.589767824620,
0.433012701892, 0.577350269190, 0.441941738242, 0.565685424949,
0.450693909433, 0.554700196225, 0.459279326772, 0.544331053952,
0.467707173347, 0.534522483825, 0.475985819116, 0.525225731439,
0.484122918276, 0.516397779494, 0.492125492126, 0.508000508001,
1.414213562373, 0.000000000000, 0.000000000000 };

asm long double Sqrt(long double num) {

   lwz   r3,Table(rtoc)      // address of Table[]
   lhz   r4,24(sp)           // load
                             // Sign(1)+Exponent(11)+Mantissa(4)
   andi.   r5,r4,0xF         // keep only Mantissa(4)
   ori   r5,r5,0x3FE0        // exponent = -1+BIAS = 1022
   sth   r5,24(sp)           // save reduced number

   rlwinm   r5,r5,3,25,28    // take 8*Mantissa(4) as index
   lfd   fp1,24(sp)          // load reduced number
   lfsux   fp4,r5,r3         // load coefficient A
   lfs   fp5,4(r5)           // load coefficient B
   lfs   fp3,128(r3)         // load SQRT(2)
   fmr   fp2,fp1             // copy reduced number
   rlwinm.   r5,r4,31,18,28  // divide exponent by 2
   beq   @@2                 // if (exponent == 0) then done

   fmadd   fp2,fp2,fp5,fp4   // approximation SQRT(x) = A + B*x
   andi.   r4,r4,0x10        // check if exponent even
   beq   @@1                 // if (exponent even) do iteration
   fmul   fp2,fp2,fp3        // multiply reduced number by SQRT(2)
   fadd   fp1,fp1,fp1        // adjust exponent of original number

@@1:   fadd   fp3,fp2,fp2    // 2*x
   fmul   fp5,fp2,fp1        // x*n
   fadd   fp3,fp3,fp3        // 4*x
   fmadd   fp4,fp2,fp2,fp1   // x*x + n
   fmul   fp5,fp3,fp5        // 4*x*x*n
   fmul   fp6,fp2,fp4        // denominator = x*(x*x + n)
   fmadd   fp5,fp4,fp4,fp5   // numerator = (x*x + n)*(x*x + n) +
                             // 4*x*x*n
   fdiv   fp1,fp5,fp6        // double precision division
   andi.   r5,r5,0x7FF0      // mask exponent 
   addi   r5,r5,0x1FE0       // rectify new exponent

@@2:   sth   r5,132(r3)      // save constant C (power of 2) 
   lfd   fp2,132(r3)         // load constant C
   fmul   fp1,fp1,fp2        // multiply by C to replace exponent
   blr                       // done, the result is in fp1
}

Performance

The code presented above runs in less than 100 cycles, which means less than 1 microsecond on a 7200/75 Power Macintosh and is more than six times faster than the ROM code. The code could be modified to make use of the floating reciprocal square root estimate instruction (frsqrte) that is available on the MPC603 and MPC604 processors, and which has an accuracy of 5 bits. It is not available on the MPC601, however. The method used here could also be used to evaluate other transcendental functions.

Performance was measured by running the code a thousand times and calling a simple timing routine found in (Motorola, 1993), that we called myGetTime(). It uses the real-time clock of the MPC 601 processor (RTCU and RTCL registers) and is shown in Listing 2. The routine would have to be modified to run on MPC603 or MPC604 processors, since they don't have the same real-time clock mechanism.

The code doesn't support denormalized numbers (below 2.22507385851E-308). This could easily be implemented albeit at the cost of a slight reduction in performance.

Listing 2: myGetTime.c

asm long myGetTime()
   {
lp:    mfspr   r4,4           // RTCU
   mfspr   r3,5               // RTCL
   mfspr   r5,4               // RTCU again
    cmpw      r4,r5           // if RTCU has changed, try again
    bne      lp
    rlwinm   r3,r3,25,7,31    // shift right since bits 25-31 are
                              // not used
    blr                       // the result is in r3. 1 unit is
                              // worth 128 ns.
   }

To run the code, a very simple interface using the SIOUX library is provided in Listing 3.

Listing 3: main.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fp.h>

void main()
   {
   long double   num, num2;
   long startTime, endTime, time;
   short i;


   do {
   printf("%2s","> ");           // caret
   scanf("%Lf",&num);           // read long double
   if (num < 0.0) num = 0.0;     // replace by 0.0 if negative
   startTime = myGetTime();
   for (i = 0; i < 1000; i++)    // repeat 1000 times
   num2 = SQRoot(num);              // call our function
   endTime = myGetTime();
   time = endTime - startTime;
   if (num > 1e-6 && num < 1e7)
      printf("%7s%Lf\n","root = ",num2);   // show result
   else
      printf("%7s%Le\n","root = ",num2);
   printf("%7s%d\n","time = ", time);      // show elapsed time
   }
   while (1);                              // repeat until Quit
   }

References

PowerPC 601 RISC Microprocessor User's Manual, Motorola MPC601UM/AD Rev 1, 1993.


The first three authors are undergraduate students in Computer Science at Université Laval in Québec, Canada. This work was done as an assignment in a course on Computer Architecture given by the fourth author.

 

Community Search:
MacTech Search:

Software Updates via MacUpdate

Minecraft 1.19.3 - Popular sandbox build...
Minecraft allows players to build constructions out of textured cubes in a 3D procedurally generated world. Other activities in the game include exploration, gathering resources, crafting, and combat... Read more
Boom 3D 1.4.0 - $19.99
Boom 3D is a revolutionary app with 3D Surround Sound and phenomenally rich and intense audio that is realistic and works on any headphones. Features 3D surround sound Built-in audio player... Read more
Visual Studio Code 1.74.0 - Cross-platfo...
Visual Studio Code provides developers with a new choice of developer tool that combines the simplicity and streamlined experience of a code editor with the best of what developers need for their... Read more
Notion 2.1.9 - A unified workspace for m...
Notion is the unified workspace for modern teams. Features: Integration with Slack Documents Wikis Tasks More guests: invite up to 10 collaborators, friends & family to your pages Page... Read more
Spotify 1.2.0.1165 - Stream music, creat...
Spotify is a streaming music service that gives you on-demand access to millions of songs. Whether you like driving rock, silky R&B, or grandiose classical music, Spotify's massive catalogue puts... Read more
Thunderbird 102.5.1 - Email client from...
As of July 2012, Thunderbird has transitioned to a new governance model, with new features being developed by the broader free software and open source community, and security fixes and improvements... Read more
Pinegrow 7.03 - Mockup and design web pa...
Pinegrow (was Pinegrow Web Designer) is desktop app that lets you mockup and design webpages faster with multi-page editing, CSS and LESS styling, and smart components for Bootstrap, Foundation,... Read more
Adobe After Effects 2022 23.1 - Create p...
The new, more connected Adobe After Effects can make the impossible possible. Get powerful new features like a Live 3D Pipeline that brings CINEMA 4D scenes in as layers - without intermediate... Read more
SteerMouse 5.6.7 - Powerful third-party...
SteerMouse is an advanced driver for USB and Bluetooth mice. SteerMouse can assign various functions to buttons that Apple's software does not allow, including double-clicks, modifier clicks,... Read more
Wireshark 4.0.2 - Network protocol analy...
Wireshark is one of the world's foremost network protocol analyzers, and is the standard in many parts of the industry. It is the continuation of a project that started in 1998. Hundreds of... Read more

Latest Forum Discussions

See All

Level Infinite announce December events...
The hit MMORPG Tower of Fantasy will receive a new update in December, along with a powerful new Simulacrum. Players will be able to explore new areas of Mirroria with fun new games and take part in a big new event. [Read more] | Read more »
Dead Cells: Return to Castlevania DLC Re...
Dead Cells on iOS from Playdigious recently released on Apple Arcade in the form of Dead Cells+ (). Dead Cells+ is an App Store Great featuring all prior DLC and updates. At The Game Awards 2022, Motion Twin revealed the next major DLC for Dead... | Read more »
‘Among Us’ Hide and Seek Mode Update Com...
At The Game Awards 2022, Innersloth announced the next major update for Among Us (Free). This update is arriving today for all platforms and it will bring in the new Hide n Seek game mode alongside new cosmetics and pets. This update is the game’s... | Read more »
Massively Popular ‘Vampire Survivors’ Is...
Ahead of The Game Awards 2022, the massively popular Vampire Survivors (Free) has released on iOS and Android for free. Vampire Survivors is one of the most popular PC games right now, and it is a regular on every Steam Deck most-played list from... | Read more »
SwitchArcade Round-Up: ‘Chained Echoes’,...
Hello gentle readers, and welcome to the SwitchArcade Round-Up for December 8th, 2022. Today is Thursday, and that usually means an absolute deluge of new releases on the eShop. But the year is winding down, so we’ve only got ten or so to look at... | Read more »
‘Awaken Legends: Idle RPG’ Celebrates th...
Awaken Legends: Idle RPG is adding its first update since the game was soft-launched in November, letting players get their hands on a new hero “Hera Valen". Players can also look forward to the Covenant of the Dark Knight event and the Wishing Well... | Read more »
‘Horizon Chase 2’ Japan World Tour Expan...
Horizon Chase 2 () from Aquiris is getting a major expansion today on Apple Arcade. The Japan World Tour expansion brings in 11 new races across 9 cities and it should be rolling out now as of this writing. I expect it to be available worldwide... | Read more »
Dark Fantasy Visual Novel ‘The 13th Mont...
Originally announced for release in August, The 13th Month from Japanese developer Kobayashimaru and publisher Kodansha released on PC via Steam worldwide this month. The dark fantasy visual novel that reimagines the classic Sleeping Beauty tale, is... | Read more »
Tom Clancey’s The Divison Resurgence ann...
Ubisoft has announced the latest Live Test dates for Tom Clancy’s The Division Resurgence, the hotly anticipated mobile entry in the Divison series. Starting December 8th and ending on the 22nd, the test will offer a huge amount of content for the... | Read more »
‘Easy Come Easy Golf’ New Update Adds St...
Easy Come Easy Golf () from Clap Hanz is one of my favorite games on Apple Arcade. It has been updated quite a bit since launch bringing in new modes and improvements. It recently launched on Nintendo Switch as well. | Read more »

Price Scanner via MacPrices.net

Lowest price of the season: 2022 Apple AirPod...
Verizon has new 2022 AirPods Pro in stock and on sale for $199.99 shipped as part of their Holiday sale. Their price is $50 off Apple’s MSRP, and it’s the lowest price currently available for 2022... Read more
New! Details on Verizon’s Christmas/Holiday p...
Verizon is offering discounts on iPhones, Apple Watch models, and iPads with specific promo codes as part of their Christmas/Holiday 2022 offerings. Codes are valid when adding a new line of service... Read more
Apple MagSafe accessories are back on Holiday...
Amazon has Apple MagSafe Chargers and Apple’s MagSafe Battery on sale for up to 24% off MSRP again as part of their Christmas/Holiday sale. Shipping is free, and all models are in stock: – MagSafe... Read more
13″ M2 MacBook Airs on sale again for the low...
Amazon has 13″ MacBook Airs with M2 CPUs in stock today and on sale for $150 off MSRP as part of their Christmas/Holiday Sale, prices start at $1049. Shipping is free. They are the lowest prices... Read more
Get an Apple 16″ MacBook Pro for $400 off MSR...
16″ MacBook Pros with Apple’s M1 Pro CPUs are in stock and on sale today at B&H Photo for $300-$400 off Apple’s MSRP for a limited time. Prices start at $2099 for M1 Pro models with 512GB or 1TB... Read more
Holiday clearance sale! Previous-generation A...
Amazon has 2nd generation 32GB and 64GB 4K Apple TVs with Siri remotes and 32GB Apple TV HDs on clearance sale for $80-$90 off original MSRP. Shipping is free, and delivery is available in time for... Read more
Christmas sale at Verizon: Apple AirPods Pro...
Verizon has first-generation Apple AirPods Pro on sale for $159.99 on their online store as part of their continuing Christmas/Holiday sale. Their price is $90 off Apple’s original MSRP, and it’s the... Read more
New Christmas/New Years promo at Xfinity Mobi...
Switch to Xfinity Mobile and open a new line of service, and take $400 off the price of a new iPhone, no trade-in required, through January 10, 2023. The $400 is applied to your account as credits... Read more
Apple iPad Smart Keyboard Folio prices drop u...
Apple iPad Smart Keyboard Folio prices have dropped up to $60 off MSRP at Amazon and Walmart as part of their Christmas/Holiday sales. These are the cheapest prices currently available for these iPad... Read more
Today is the final day for Xfinity Mobile’s $...
If you switch to Xfinity Mobile and open a new line of service, they will take $500 off the price of a new iPhone, no trade-in required. This is the best no trade-in Cyber Monday Apple iPhone 14 deal... Read more

Jobs Board

*Apple* Systems Administrator - JAMF - Activ...
…Administration **Duties and Responsibilities** + Configure and maintain the client's Apple Device Management (ADM) solution. The current solution is JAMF supporting Read more
Cashier - *Apple* Blossom Mall - JCPenney (...
Cashier - Apple Blossom Mall Location:Winchester, VA, United States (https://jobs.jcp.com/jobs/location/191170/winchester-va-united-states) - Apple Blossom Mall Read more
Omnichannel Associate - *Apple* Blossom Mal...
Omnichannel Associate - Apple Blossom Mall Location:Winchester, VA, United States (https://jobs.jcp.com/jobs/location/191170/winchester-va-united-states) - Apple Read more
Sephora Beauty Advisor - *Apple* Blossom Ma...
Sephora Beauty Advisor - Apple Blossom Mall Location:Winchester, VA, United States (https://jobs.jcp.com/jobs/location/191170/winchester-va-united-states) - Apple Read more
Operations Associate - *Apple* Blossom Mall...
Operations Associate - Apple Blossom Mall Location:Winchester, VA, United States (https://jobs.jcp.com/jobs/location/191170/winchester-va-united-states) - Apple Read more
All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.