From: chughes@maths.tcd.ie (Conrad Hughes)
Subject: Re: 32/64 bit square root.
Date: 25 Sep 91 16:30:11 GMT
Here's the best code I've managed yet; it fits into four
instructions per bit. If anyone has done better, please let me
know. If you do use the code, it'd be nice to receive a mention
:-)
DEFFNsqrt(N,R,T,S):LOCALA%:IFS<>N [OPTpass:MOV S,N:]
[OPTpass:MOV R,#0:]:FORA%=7TO0STEP-1:[OPTpass:ADD T,R,#1<
From: torbenm@diku.dk (Torben AEgidius Mogensen)
>The algorithm I'd use would be a Newton-Raphson iteration scheme designed
>to solve x^2=z. (Whether this is the fastest method I don't know)
>...
>so you can see 4 or maybe 5 iterations at most is enough. Each iteration
>(if you arrange things *carefully*) will need only a 16x16 bit multiply
>and a 32x16 bit divide.
There is a *much* faster way. If your number is of the form a*2^n,
where 0.5 <= a < 1.0 then do the following:
sqrt(a*2^(2n)) = sqrt(a)*2^n
sqrt(a*2^(2n+1)) = sqrt(a/2)*2^(n+1)
This reduces the problem to finding a square root of a number a,
0.25 <= a < 1.0. This can be done by this method, which should have
a cost comparable to 2 or 3 divisions.
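For reference, the exponent juggling can be sketched in Python: math.frexp splits off the power of two, and core_sqrt stands for whatever routine handles the reduced argument (reduce_and_sqrt and core_sqrt are illustrative names, not from the original post):

```python
import math

def reduce_and_sqrt(z, core_sqrt):
    """Range-reduce z > 0 to a in [0.25, 1.0), take the root of a,
    and scale the result back up by a power of two."""
    a, e = math.frexp(z)          # z = a * 2^e with 0.5 <= a < 1.0
    if e % 2:                     # odd exponent: z = (a/2) * 2^(e+1)
        a /= 2.0
        e += 1
    # now z = a * 2^e with e even and 0.25 <= a < 1.0
    return core_sqrt(a) * 2.0 ** (e // 2)

# usage, with math.sqrt standing in for the core routine:
print(reduce_and_sqrt(2.0, math.sqrt))   # 1.4142135623730951
```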
{ 0.25 <= a < 1.0 }
x := 4*(a-0.25);
r := 1;
n := 1;
while (x<>0) do begin
{ invariant: r^2+x = 4^n*a, x<2*r+1 }
n := n+1;
t := 4*r+1;
x := 4*x;
r := 2*r;
if x >= t then begin
x := x-t;
r := r+1
end
end;
r := r/2^n;
{ sqrt(a)-r < 2^-n }
Note that only addition, subtractions and multiplications by 2 or 4
are used in each part of the loop. At the end a single n-bit shift is
done. Note that r and t are integers, while x is a real.
The (x<>0) test allows you to exit the loop if an exact root is found,
but it can be omitted without causing error (so the loop can be
unwound).
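A direct Python transcription of the loop above, keeping x exact with Fraction and adding an iteration cap so the loop also terminates for irrational roots (sqrt_v1 is an illustrative name):

```python
from fractions import Fraction

def sqrt_v1(a, bits=32):
    """Bit-by-bit square root of a in [0.25, 1.0).
    r and t are integers; x is kept exact as a Fraction."""
    x = 4 * (Fraction(a) - Fraction(1, 4))
    r = 1
    n = 1
    # x == 0 means an exact root has been found; the extra n < bits
    # cap makes the loop terminate for irrational roots as well
    while x != 0 and n < bits:
        n += 1
        t = 4 * r + 1
        x *= 4
        r *= 2
        if x >= t:                   # next bit of the root is 1
            x -= t
            r += 1
    return Fraction(r, 2 ** n)       # sqrt(a) - result < 2^-n
```

For a = 9/16 the loop exits early with the exact root 3/4.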
We can reformulate the algorithm to keep x < 1.0 throughout:
{ 0.25 <= a < 1.0 }
x := a-0.25;
r := 1;
for n := 2 to number-of-bits do begin
{ invariant: r^2+2^n*x = 4^(n-1)*a, x<(2*r+1)*2^-n }
t := 4*r+1;
x := 2*x;
r := 2*r;
if x >= t/2^(n+1) then begin
x := x-t/2^(n+1);
r := r+1
end
end;
r := r/2^number-of-bits;
{ sqrt(a)-r < 2^-number-of-bits }
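A Python sketch of this reformulation (sqrt_v2 is an illustrative name). Note that the thresholds t/2^(n+1) used in the loop match an initial x of a-0.25; if x started at (a-0.25)/2, the thresholds would have to be t/2^(n+2) instead:

```python
from fractions import Fraction

def sqrt_v2(a, bits=32):
    """As before, but the remainder x is kept small by scaling the
    threshold down rather than scaling x up by 4 each iteration."""
    x = Fraction(a) - Fraction(1, 4)
    r = 1
    for n in range(2, bits + 1):
        t = 4 * r + 1
        x *= 2
        r *= 2
        if x >= Fraction(t, 2 ** (n + 1)):
            x -= Fraction(t, 2 ** (n + 1))
            r += 1
    return Fraction(r, 2 ** bits)
```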
Assuming we have a 32 bit mantissa which is stored with the most
significant bit as 0.5, we would get the following ARM code, which
uses 212 cycles to get a 32 bit result. A test of x=0 could be added
at every iteration. This would make the minimum time shorter, but
assuming the input does not contain a large number of exact squares,
it would make the average time longer.
\ on entry Ra holds a
\ on exit Rr holds sqrt(a)
\ uses Rx and Rt (either can be equal to Ra)
SUB Rx,Ra,#2^30 \ x := a-0.25;
MOV Rr,#1 \ r := 1;
MOV Rt,Rr ASL #2
ADD Rt,Rt,#1 \ t := 4*r+1;
MOV Rx,Rx ASL #1 \ x := 2*x;
MOV Rr,Rr, ASL #1 \ r := 2*r;
SUBS Rt,Rx,Rt ASL #29 \ if x >= t/2^(n+1) then begin
MOVPL Rx,Rt \ x := x-t/2^(n+1);
ADDPL Rr,Rr,#1 \ r := r+1
...
MOV Rt,Rr ASL #2
ADD Rt,Rt,#1 \ t := 4*r+1;
MOV Rx,Rx ASL #1 \ x := 2*x;
MOV Rr,Rr, ASL #1 \ r := 2*r;
SUBS Rt,Rx,Rt ASL #0 \ if x >= t/2^(n+1) then begin
MOVPL Rx,Rt \ x := x-t/2^(n+1);
ADDPL Rr,Rr,#1 \ r := r+1
\ at exit: 0.5 <= Rr < 1.0 when considered as a mantissa
We can actually do better by avoiding the doubling of r in
each iteration. This way r and t are no longer integers, but the code
is better anyway:
{ 0.25 <= a < 1.0 }
x := a-0.25;
r := 0.5;
for n := 2 to number-of-bits do begin
{ invariant: r^2+x/2^(n-1) = a, x < r+2^-(n+1) }
t := r+2^-(n+2);
x := 2*x;
if x >= t then begin
x := x-t;
r := r+2^-(n+1)
end
end;
{ sqrt(a)-r < 2^-number-of-bits }
In ARM this uses 149 cycles to get a 32 bit result. Try to better that!
\ on entry Ra holds a
\ on exit Rr holds sqrt(a)
\ uses Rx and Rt (either can be equal to Ra)
SUB Rx,Ra,#2^30 \ x := a-0.25;
MOV Rr,#2^31 \ r := 0.5;
ADD Rt,Rr,#2^28 \ t := r+2^-(n+2);
MOV Rx,Rx ASL #1 \ x := 2*x;
SUBS Rt,Rx,Rt \ if x >= t then begin
MOVPL Rx,Rt \ x := x-t;
ADDPL Rr,Rr,#2^29 \ r := r+2^-(n+1)
...
ADD Rt,Rr,#2^0 \ t := r+2^-(n+2);
MOV Rx,Rx ASL #1 \ x := 2*x;
SUBS Rt,Rx,Rt \ if x >= t then begin
MOVPL Rx,Rt \ x := x-t;
ADDPL Rr,Rr,#2^1 \ r := r+2^-(n+1)
\ t := r+2^-(n+2);
\ { 2^-(n+2) is less than precision }
\ x := 2*x;
RSBS Rt,Rr,Rx ASL #1 \ if x >= t then begin
\ x := x-t; { x not needed }
ADDPL Rr,Rr,#2^0 \ r := r+2^-(n+1)
\ at exit: 0.5 <= Rr < 1.0 when considered as a mantissa
Note: I haven't tried this code on my Archimedes, as it is at home and
I am at work. I have, however, tried the pseudo-Pascal code.
From: torbenm@diku.dk (Torben AEgidius Mogensen)
Date: 22 Nov 91 10:46:26 GMT
I made a small error in this. The correct code is:
{ 0.25 <= a < 1.0 }
x := a-0.25;
r := 0.5;
for n := 2 to number-of-bits do begin
{ invariant: r^2+x/2^(n-2) = a, x < r+2^-n }
t := r+2^-(n+1);
x := 2*x;
if x >= t then begin
x := x-t;
r := r+2^-n
end
end;
{ sqrt(a)-r < 2^-number-of-bits }
There is, however, still a small problem: x might be greater than 1
after the doubling. We can solve this by delaying the doubling:
{ 0.25 <= a < 1.0 }
x := a-0.25;
r := 0.5;
for n := 2 to number-of-bits do begin
{ invariant: r^2+x/2^(n-2) = a, x < r+2^-n }
t := r+2^-(n+1);
if x >= t/2 then begin
x := x-t/2;
r := r+2^-n
end;
x := 2*x
end;
{ sqrt(a)-r < 2^-number-of-bits }
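This final version transcribes almost line for line into Python, again using exact rational arithmetic (sqrt_final is an illustrative name):

```python
from fractions import Fraction

def sqrt_final(a, bits=32):
    """Torben's corrected loop with the doubling of x delayed to the
    end of the body, so that x stays below 1 throughout."""
    x = Fraction(a) - Fraction(1, 4)
    r = Fraction(1, 2)
    for n in range(2, bits + 1):
        t = r + Fraction(1, 2 ** (n + 1))
        if x >= t / 2:
            x -= t / 2
            r += Fraction(1, 2 ** n)
        x *= 2
    return r                         # sqrt(a) - r < 2^-bits
```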
Also, my ARM code used unsigned numbers but compared them as if they
were signed (using PL). Comparison of unsigned numbers requires HS.
The code is still untested, so there might be further bugs. Note also
that you can't write the constant 2^31 in BASIC assembler. Use 1<<31
instead. The number of cycles has increased to 154, as an extra pass
through the loop is required to get 32 bit precision:
\ on entry Ra holds a
\ on exit Rr holds sqrt(a)
\ uses Rx and Rt (either can be equal to Ra)
SUB Rx,Ra,#2^30 \ x := a-0.25;
MOV Rr,#2^31 \ r := 0.5;
ADD Rt,Rr,#2^29 \ t := r+2^-(n+1);
SUBS Rt,Rx,Rt,LSR #1 \ if x >= t/2 then begin
MOVHS Rx,Rt \ x := x-t/2;
ADDHS Rr,Rr,#2^30 \ r := r+2^-n
MOV Rx,Rx,ASL #1 \ x := 2*x;
...
ADD Rt,Rr,#2^0 \ t := r+2^-(n+1);
SUBS Rt,Rx,Rt,LSR #1 \ if x >= t/2 then begin
MOVHS Rx,Rt \ x := x-t/2;
ADDHS Rr,Rr,#2^1 \ r := r+2^-n
MOV Rx,Rx,ASL #1 \ x := 2*x;
\ t := r+2^-(n+1);
\ { 2^-(n+1) is less than precision }
SUBS Rt,Rx,Rr,LSR #1 \ if x >= t/2 then begin
\ x := x-t/2; { x not needed }
ADDHS Rr,Rr,#2^0 \ r := r+2^-n
\ x := 2*x; { x not needed }
\ at exit: 0.5 <= Rr < 1.0 when considered as a mantissa
Torben Mogensen (torbenm@diku.dk)
From: dseal@armltd.co.uk (David Seal)
Subject: Square roots (was Re: The (non-charity) ray tracer)
Date: 21 Nov 91 15:52:14 GMT
The Newton-Raphson technique is a good one when you've got a fast division -
e.g. if you've got division in hardware (and haven't got square root in
hardware :-). However, if you're going to do division by a software routine
(as you must on an ARM), there is a "long square root" algorithm which is
very similar to the "long division" algorithms used to do division in
software. It takes only a bit more time than one division *in total*,
compared with a division, an addition and a shift *for each of several
iterations* in Newton-Raphson. I would therefore recommend it for ARM square
root routines when speed is important.
I won't supply code, since (a) I don't know what type of numbers you want to
take square roots of, how accurate a result you want, etc. - these details
will obviously affect the code; (b) I don't have time to write code, whereas
I have a written description of the algorithm on file which I can simply
include here!
I'll illustrate the "long square root" algorithm in base 10 - suppose we
want to take the square roots of 2 and 20:
(A) First write down the number you want to take the square root of, split
into groups of two digits, with one split at the decimal point. In these
particular cases, this gives us:
 -----------------               -----------------
V 02. 00 00 00 ...              V 20. 00 00 00 ...
(Note that the two numbers chosen have the same pattern of digits, but
different splits because of the different positions of the decimal
point. This is the only reason they produce different answers.)
(B) Initialise by finding the largest square which is smaller than the first
pair of digits: in the first case the first pair of digits is 02 = 2, so
this square is 1^2 = 1, and in the second the first pair of digits is
20, so this square is 4^2 = 16. Subtract the square from the first pair
of digits, and put its square root in as the first digit of the result:
  1.                              4.
 -----------------               -----------------
V 02. 00 00 00 ...              V 20. 00 00 00 ...
- 1                             -16
 --                              --
  1                               4
Then bring the next pair of digits down:
  1.                              4.
 -----------------               -----------------
V 02. 00 00 00 ...              V 20. 00 00 00 ...
- 1                             -16
 ------                          ------
  100                            400
(C) To find the next digit of the result: take twice the answer so far. Look
at numbers of the form N * (twice answer so far concatenated with N) for
N a digit. The largest one of these that is still less than or equal to
the remainder we've got gives the next digit of the result, and the
value of N * (twice answer so far concatenated with N) is what we should
subtract from the remainder before bringing down the next pair of
digits:
Twice answer so far is 2*1=2.   Twice answer so far is 2*4=8.
0*20 = 0, less than 100         0*80 = 0, less than 400
1*21 = 21, less than 100        1*81 = 81, less than 400
2*22 = 44, less than 100        2*82 = 164, less than 400
3*23 = 69, less than 100        3*83 = 249, less than 400
4*24 = 96, less than 100        4*84 = 336, less than 400
5*25 = 125, more than 100       5*85 = 425, more than 400
So the next digit of the answer So the next digit of the answer
is 4, and we must subtract 96   is 4, and we must subtract 336
  1. 4                            4. 4
 -----------------               -----------------
V 02. 00 00 00 ...              V 20. 00 00 00 ...
- 1                             -16
 ------                          ------
  100                            400
-  96                           -336
 ------                          ------
   400                           6400
(D) Repeat step (C) as many times as you need result digits - I'll do it
once more for the examples:
Twice answer so far is 2*14=28. Twice answer so far is 2*44=88.
0*280 = 0, less than 400        0*880 = 0, less than 6400
1*281 = 281, less than 400      1*881 = 881, less than 6400
2*282 = 564, more than 400      2*882 = 1764, less than 6400
So the next digit of the answer :
is 1, and we must subtract 281  :
                                7*887 = 6209, less than 6400
                                8*888 = 7104, more than 6400
                                So the next digit of the answer
                                is 7, and we must subtract 6209
  1. 4 1                          4. 4 7
 -----------------               -----------------
V 02. 00 00 00 ...              V 20. 00 00 00 ...
- 1                             -16
 ------                          ------
  100                            400
-  96                           -336
 ------                          ------
   400                           6400
-  281                          -6209
 -------                         -------
  11900                          19100
As to why the method works, I'll give some hints and leave you to work out
the details yourself. To give the rough reason for it, note that the
"remainder" in the above calculations is actually the difference between the
number you are trying to take the square root of and the square of the
answer so far:
  2                               20
= 1^2 + 1                       = 4^2 + 4
= 1.4^2 + 0.04                  = 4.4^2 + 0.64
= 1.41^2 + 0.0119               = 4.47^2 + 0.0191
= ...                           = ...
[ Aside: this fact yields another useful property of the algorithm, which is
that (as in long division) your answer is exactly right if the remainder
becomes 0 and there are no further non-zeros to bring down. This can be
very helpful e.g. for calculating square roots in floating point systems
that conform to IEEE standard 754. ]
Then prove this - it depends on the identity:
(10x+y)^2 - (10x)^2 = 20xy + y^2
= (20x + y) * y
The quantities (20x + y) * y are what the digit calculation in part (C) is
actually determining.
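Steps (A)-(D) can be sketched in Python; the trial value 20*answer+N is exactly "twice the answer so far concatenated with N" (long_sqrt_digits is an illustrative name):

```python
def long_sqrt_digits(pairs, ndigits):
    """Base-10 "long square root": pairs is the argument split into
    two-digit groups at the decimal point, e.g. [2] for 2 and [20]
    for 20.  Returns the first ndigits digits of the root."""
    remainder = 0
    answer = 0                               # answer so far, as an integer
    digits = []
    for i in range(ndigits):
        pair = pairs[i] if i < len(pairs) else 0
        remainder = remainder * 100 + pair   # bring down two digits
        # largest digit N with N * (20*answer + N) <= remainder
        n = 9
        while n * (20 * answer + n) > remainder:
            n -= 1
        remainder -= n * (20 * answer + n)
        answer = 10 * answer + n
        digits.append(n)
    return digits

print(long_sqrt_digits([2], 4))    # [1, 4, 1, 4]  (sqrt(2) = 1.414...)
print(long_sqrt_digits([20], 4))   # [4, 4, 7, 2]  (sqrt(20) = 4.472...)
```

The subtrahends it computes (96, 281 and 336, 6209) match the worked examples above.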
One important observation: this algorithm simplifies immensely in binary!
I'll run through it, illustrating it for the case of the square root of two
again:
(A) Write down the argument, split into pairs aligned with the binary point:
--------------------
V 10. 00 00 00 00 ...
(B) If the first pair is 00, the first digit of the answer is 0; otherwise
it is 1. Subtract the square of this first digit from the first pair and
bring down the next pair:
1.
--------------------
V 10. 00 00 00 00 ...
- 1
------
100
(C) There are only two possible next digits, namely 0 and 1. But 0 * (twice
the answer so far concatenated with 0) is always 0, which must be less
than or equal to the remainder. So all that matters is whether 1 *
(twice the answer so far concatenated with 1) is less than or equal to
the remainder: if it is, the next result digit is 1; if it is greater
than the remainder, the next result digit is 0. One more thing: twice
the answer so far concatenated with 1 is the answer so far concatenated
with 01.
So the result digit selection procedure becomes: concatenate the answer
so far with 01, and compare with the remainder. If greater, leave the
remainder unchanged and the result digit is 0; if less than or equal,
subtract it from the remainder and the result digit is 1.
To continue the example:
Result so far concatenated with 01 is 101, which is greater than 100,
so result digit is 0 and subtract 0:
1. 0
--------------------
V 10. 00 00 00 00 ...
- 1
------
100
- 0
------
10000
Result so far concatenated with 01 is 1001, which is less than 10000,
so result digit is 1 and subtract 1001:
1. 0 1
--------------------
V 10. 00 00 00 00 ...
- 1
-------
100
- 0
-------
10000
- 1001
--------
11100
Result so far concatenated with 01 is 10101, which is less than 11100,
so result digit is 1 and subtract 10101:
1. 0 1 1
--------------------
V 10. 00 00 00 00 ...
- 1
-------
100
- 0
-------
10000
- 1001
--------
11100
- 10101
---------
11100
Result so far concatenated with 01 is 101101, which is greater than
11100, so result digit is 0 and subtract 0:
1. 0 1 1 0
--------------------
V 10. 00 00 00 00 ...
- 1
-------
100
- 0
-------
10000
- 1001
--------
11100
- 10101
---------
11100
- 0
----------
1110000
One further point: Step (B) is just a special case of step (C) if we
initialise the "square root so far" to 0 - i.e. you don't need special
case hardware for the first iteration.
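The binary procedure can be sketched in Python for an integer argument m (bisqrt is an illustrative name); it feeds in the bit pairs of m and then pairs of zeros for the fractional bits, initialising the root to 0 as just described:

```python
def bisqrt(m, frac_bits):
    """Binary long square root of the integer m, producing frac_bits
    further bits after the integer part, using the simplification
    above: compare the remainder against the answer so far
    concatenated with the bits 01."""
    root = 0
    remainder = 0
    # round the number of argument bits up to a whole number of pairs
    total_bits = m.bit_length() + (m.bit_length() & 1)
    for shift in range(total_bits - 2, -2 * frac_bits - 2, -2):
        # bring down the next bit pair (zeros past the binary point)
        remainder = (remainder << 2) | ((m >> shift) & 3 if shift >= 0 else 0)
        trial = (root << 2) | 1      # answer so far concatenated with 01
        root <<= 1
        if trial <= remainder:
            remainder -= trial
            root |= 1
    return root                      # floor(sqrt(m) * 2^frac_bits)

print(bin(bisqrt(2, 8)))   # 0b101101010, i.e. 1.01101010 binary
```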
Hope this helps.
David Seal
dseal@armltd.co.uk