From: chughes@maths.tcd.ie (Conrad Hughes)
Subject: Re: fast divide
Date: 31 Oct 91 13:15:24 GMT
DEFFNdiv(A,B,C,R):LOCALA%:[OPTpass:MOVS C,A,LSR#31:RSBNE A,A,#0
:ADDS C,C,A,LSL#16:MOV A,A,LSR#16:ADDS A,B,A:SUBCC A,A,B:]
FORA%=1TO31:[OPTpass:ADCS C,C,C:ADCS A,B,A,LSL#1:SUBCC A,A,B:]NEXT
[OPTpass:ADCS R,C,C:RSBCC R,R,#0:]=0
This is the BASIC assembler code to do the division.. Use it this way:
FNdiv(reg_top,reg_bot,reg_misc,reg_res)
while in the BASIC assembler.. It inlines the code. Register reg_res
contains 32768*(reg_top/reg_bot); reg_bot _must_ be negative (if you
don't know what sign it'll be, tack the following in at the beginning:
CMP B,#0:RSBPL B,B,#0:RSBPL A,A,#0
...) , and reg_top can be either sign.. If you need slightly higher
accuracy (I think I can increase the accuracy by half a bit) and
don't worry about registers, tell me. reg_res can be any register,
including reg_top, reg_bot or reg_misc.
If you want to turn it into a loop, do so.. But remember that the
fundamental unit of of the routine takes 3 clock counts, a branch-loop
takes five! Also, it'll be straightforward enough to turn into a
subroutine..
If you need any clarification of it, ask me...
Someone posted another version which is faster for a logarithmic
distribution of numbers, with 0 and 1 occuring most frequently; I
haven't got it to hand, but could dig it up if nobody else posts
it.
Conrad
From: torbenm@diku.dk (Torben AEgidius Mogensen)
Date: 31 Oct 91 15:34:21 GMT
Subject: Re: fast divide
Sender: torbenm@freke.diku.dk
Here is one I cooked up. It is from memory, so there might be slight
errors. The comparisons before the first SUB ensure that no overflow
happens on the ASL. Worst case time is 5 cycles per bit.
Arguments in p,q.
on exit, r = p div q, p = p mod q
.div
MOV r,#0
CMP p,q
BLT nodiv
CMP p,q ASL #1
BLT div0
CMP p,q ASL #2
BLT div1
... \ repeat for 3..30
CMP p,q ASL #31
SUBGE p,q ASL #31
ADDGE r,#2^31
CMP p,q ASL #30
.div30
SUBGE p,q ASL #30
ADDGE r,#2^30
CMP p,q ASL #29
.div29
... \ repeat for 29..1
.div0
SUBGE p,q ASL #0 \ equiv. to SUBGE p,q
ADDGE r,#2^0
.nodiv
MOV PC,Link
From: gcwillia@daisy.waterloo.edu (Graeme Williams)
Date: 2 Nov 91 20:13:28 GMT
Subject: Re: fast divide
Organization: University of Waterloo
Ok, here is the fastest unsigned divide routine I know of. The 3 instruction
loop (unrolled) wasn't my idea - but the piece of code that figures out
whereabouts to jump into the unrolled loop is.
On entry d and n should be both +ve.
CMP d,#0 : BEQ end% ; Remove zero divisors
CMP d,n : MOVGT m,n : MOVGT n,#0 : BGT end% ; Exit if denominator > numerator
; This next bit of code figures out roughly how many bits the quotient is
; going to have and hence how many times you need to run through the
; divide loop proper - its only worth finding out to the nearest 4 times
.start%
MOV cnt,#28 : MOV m,n,LSR#4
CMP d,m,LSR#12 : SUBLE cnt,cnt,#16 : MOVLE m,m,LSR#16
CMP d,m,LSR#4 : SUBLE cnt,cnt,#8 : MOVLE m,m,LSR#8
CMP d,m : SUBLE cnt,cnt,#4 : MOVLE m,m,LSR#4
MOV n,n,LSL cnt
ADDS n,n,n : RSB d,d,#0 ; Have to flip the sign of d for the divide loop
ADD cnt,cnt,cnt,LSL#1 ; Multiply cnt by 3
ADD pc,pc,cnt,LSL#2 ; Jump cnt instructions
MOV R0,R0 ; Dummy instruction to take care of pipelining
32 copies of { ADCS m,d,m,LSL#1 : SUBCC m,m,d : ADCS n,n,n }
.end%
The best times occur when the numerator and denominator of your division
are of similar size. The worst times when the numerator is very large
and the denominator small. (The case where the denominator is larger than
the numerator is killed off very early.)
Suppose one has a distribution of numbers wherein the highest bit of
the numbers is equally distributed between bit 0 and bit 31, that is,
there is the same number of numbers between 2^n and 2^(n+1) for every n.
(I think this is actually the case for most purposes)
Now if a numerator and denominator are selected at random from this
distribution subject only to the requirement that denom <= numer
then the above routine will execute on average in 58.25 cycles
(from the point start% in the code).
The worst case is 116 cycles (top bit of quotient is bit 28 or higher)
the best case is 32 cycles (top bit of quotient is bit 0 to 3)
The inbetween cases get 12 cycles worse for every 4 extra bits in the
quotient.
Note that the average case isn't the simple average of the various cases
but has to be weighted toward the better cases - this is because there
are more ways (27) for numer and denom to differ by say 5 bits than there
are (7) for them to differ by say 25 bits.
Have fun.
Graeme Williams
From: zrzm0370@rusmv1.rus.uni-stuttgart.de (Joerg Scheurich)
Subject: integer division
Date: 10 Jun 91 11:25:36 GMT
Hi
So i know, there is no Assembler-command for integer-division in the
ARM. Therefor a fast division-routine is a thing of common interest.
In a earlier Posting i wrote about the bad quick hacked division-routine
for the demo of my formula-tool (math. formula ==> assemblermacros).
Now i wrote a better routine, but i don't know, if there is a better
algorithm.
The shorter and slower 32-Bit/32-Bit-division-routine requires
382 Cycles in 100 Bytes, with usage of 4 32-Bit-registers and 4 Byte memory.
With unrolling the loop,
the faster and longer 32-Bit/32-Bit-division-routine requires
182 Cycles in 692 Bytes, with usage of 4 32-Bit-registers.
This is very faster then the old routine (which requires 2**33 Cycles in
worst case and requires all time of the universe, if there is a division
by zero ... )
Cause the INTERNAL-assembler-command of the INTEL 8086
for a 32-Bit/16-Bit-division requires
165-184 Cycles in 2-4 Bytes, with usage of 3 16-Bit-registers,
i think, the result is ok.
The result is nothing against the INTERNAL-assembler-command of the ARM
for a 32-Bit*32-Bit-multiplication which requires
17 Cycles in 4 Bytes, with usage of 1-3 32-Bit-registers,
but my Professor in digital electronics used to say:
"If a program contain much divisions, the programmer is a idiot" ...
Know someone a better (faster or shorter) algorithm ?
so long
MUFTI
( internet: zrzm0111@helpdesk.rus.uni-stuttgart.de
( from janet: zrzm0111%de.uni-stuttgart.rus.helpdesk@NFS-RELAY )
bitnet: ZRZM AT DS0RUS1I )
div-fast:
---------
; division routine
; 1. calculate flag if result is negate and convert operands to positiv
; 2. second "divide" with unsigned suczessive Approximation
; 3. fix the sign of result
.MACRO INTdivide
LDR R2,%2
LDR R3,%3
MOV R0,#0
CMP R2,#0
RSBLT R2,R2,#0
SUBLT R0,R0,#1
CMP R3,#0
RSBLT R3,R3,#0
MVNLT R0,R0
MOV R1,#0
ADDS R2,R2,R2
ADCS R1,R1,R1
CMP R1,R3
SUBGE R1,R1,R3
ADDGE R2,R2,#1
\l2
ADDS R2,R2,R2
ADCS R1,R1,R1
CMP R1,R3
SUBGE R1,R1,R3
ADDGE R2,R2,#1
\l3
ADDS R2,R2,R2
ADCS R1,R1,R1
CMP R1,R3
SUBGE R1,R1,R3
ADDGE R2,R2,#1
\l4
ADDS R2,R2,R2
ADCS R1,R1,R1
CMP R1,R3
SUBGE R1,R1,R3
ADDGE R2,R2,#1
\l5
ADDS R2,R2,R2
ADCS R1,R1,R1
CMP R1,R3
SUBGE R1,R1,R3
ADDGE R2,R2,#1
\l6
ADDS R2,R2,R2
ADCS R1,R1,R1
CMP R1,R3
SUBGE R1,R1,R3
ADDGE R2,R2,#1
\l7
ADDS R2,R2,R2
ADCS R1,R1,R1
CMP R1,R3
SUBGE R1,R1,R3
ADDGE R2,R2,#1
\l8
ADDS R2,R2,R2
ADCS R1,R1,R1
CMP R1,R3
SUBGE R1,R1,R3
ADDGE R2,R2,#1
\l9
ADDS R2,R2,R2
ADCS R1,R1,R1
CMP R1,R3
SUBGE R1,R1,R3
ADDGE R2,R2,#1
\l10
ADDS R2,R2,R2
ADCS R1,R1,R1
CMP R1,R3
SUBGE R1,R1,R3
ADDGE R2,R2,#1
\l11
ADDS R2,R2,R2
ADCS R1,R1,R1
CMP R1,R3
SUBGE R1,R1,R3
ADDGE R2,R2,#1
\l12
ADDS R2,R2,R2
ADCS R1,R1,R1
CMP R1,R3
SUBGE R1,R1,R3
ADDGE R2,R2,#1
\l13
ADDS R2,R2,R2
ADCS R1,R1,R1
CMP R1,R3
SUBGE R1,R1,R3
ADDGE R2,R2,#1
\l14
ADDS R2,R2,R2
ADCS R1,R1,R1
CMP R1,R3
SUBGE R1,R1,R3
ADDGE R2,R2,#1
\l15
ADDS R2,R2,R2
ADCS R1,R1,R1
CMP R1,R3
SUBGE R1,R1,R3
ADDGE R2,R2,#1
\l16
ADDS R2,R2,R2
ADCS R1,R1,R1
CMP R1,R3
SUBGE R1,R1,R3
ADDGE R2,R2,#1
\l17
ADDS R2,R2,R2
ADCS R1,R1,R1
CMP R1,R3
SUBGE R1,R1,R3
ADDGE R2,R2,#1
\l18
ADDS R2,R2,R2
ADCS R1,R1,R1
CMP R1,R3
SUBGE R1,R1,R3
ADDGE R2,R2,#1
\l19
ADDS R2,R2,R2
ADCS R1,R1,R1
CMP R1,R3
SUBGE R1,R1,R3
ADDGE R2,R2,#1
\l20
ADDS R2,R2,R2
ADCS R1,R1,R1
CMP R1,R3
SUBGE R1,R1,R3
ADDGE R2,R2,#1
\l21
ADDS R2,R2,R2
ADCS R1,R1,R1
CMP R1,R3
SUBGE R1,R1,R3
ADDGE R2,R2,#1
\l22
ADDS R2,R2,R2
ADCS R1,R1,R1
CMP R1,R3
SUBGE R1,R1,R3
ADDGE R2,R2,#1
\l23
ADDS R2,R2,R2
ADCS R1,R1,R1
CMP R1,R3
SUBGE R1,R1,R3
ADDGE R2,R2,#1
\l24
ADDS R2,R2,R2
ADCS R1,R1,R1
CMP R1,R3
SUBGE R1,R1,R3
ADDGE R2,R2,#1
\l25
ADDS R2,R2,R2
ADCS R1,R1,R1
CMP R1,R3
SUBGE R1,R1,R3
ADDGE R2,R2,#1
\l26
ADDS R2,R2,R2
ADCS R1,R1,R1
CMP R1,R3
SUBGE R1,R1,R3
ADDGE R2,R2,#1
\l27
ADDS R2,R2,R2
ADCS R1,R1,R1
CMP R1,R3
SUBGE R1,R1,R3
ADDGE R2,R2,#1
\l28
ADDS R2,R2,R2
ADCS R1,R1,R1
CMP R1,R3
SUBGE R1,R1,R3
ADDGE R2,R2,#1
\l29
ADDS R2,R2,R2
ADCS R1,R1,R1
CMP R1,R3
SUBGE R1,R1,R3
ADDGE R2,R2,#1
\l30
ADDS R2,R2,R2
ADCS R1,R1,R1
CMP R1,R3
SUBGE R1,R1,R3
ADDGE R2,R2,#1
\l31
ADDS R2,R2,R2
ADCS R1,R1,R1
CMP R1,R3
SUBGE R1,R1,R3
ADDGE R2,R2,#1
\l32
ADDS R2,R2,R2
ADCS R1,R1,R1
CMP R1,R3
SUBGE R1,R1,R3
ADDGE R2,R2,#1
CMP R0,#0
RSBNE R2,R2,#0
STR R2,%1
.ENDM
div-short:
----------
; division routine
; 1. calculate flag if result is negate and convert operands to positiv
; 2. second "divide" with unsigned suczessive Approximation
; 3. fix the sign of result
.MACRO INTdivide
LDR R2,%2
LDR R3,%3
MOV R0,#0
CMP R2,#0
RSBLT R2,R2,#0
SUBLT R0,R0,#1
CMP R3,#0
RSBLT R3,R3,#0
MVNLT R0,R0
STR R0,minusflag
MOV R0,#32.
MOV R1,#0
\loop ADDS R2,R2,R2
ADCS R1,R1,R1
CMP R1,R3
SUBGE R1,R1,R3
ADDGE R2,R2,#1
SUB R0,R0,#1
CMP R0,#0
BNE \loop
LDR R0,minusflag
CMP R0,#0
RSBNE R2,R2,#0
STR R2,%1
.ENDM
From: chughes@maths.tcd.ie (Conrad Hughes)
Subject: Re: integer division
Date: 11 Jun 91 14:10:31 GMT
zrzm0370@rusmv1.rus.uni-stuttgart.de (Joerg Scheurich) writes:
>382 Cycles in 100 Bytes, with usage of 4 32-Bit-registers
>and 4 Byte memory.
>With unrolling the loop,
>the faster and longer 32-Bit/32-Bit-division-routine requires
>182 Cycles in 692 Bytes, with usage of 4 32-Bit-registers.
This divides a 64bit number by a 32bit one, with 32bit result, in 97
instructions/388 bytes/97 clock cycles..
Entry: A=MSW of 64bit no., B=LSW of 64bit no., C=32bit no.
C _must_ be negative (if it's positive, RSB C,C,#0),
A _must_ be positive (if not, RSBS B,B,#0:RSC A,A,#0)
ADDS B,B,B
ADCS A,C,A,LSL#1:SUBCC A,A,C:ADCS B,B,B [repeat these three 32 times]
Exit: A=Remainder of division, B=result, C unchanged.
..if you don't need the remainder, get rid of the last
SUBCC A,A,C.
The result will be unsigned, and flags will (obviously)
reflect the last addition.
Adjust as you will to get 32/32 division (Initialise A to 0, or AB
= numerator shifted some way) or any other variety you like..
Rolling up the loop it could get slightly complicated since every
instruction requires flag information from the previous one ;-)
Something like SUB count,count,#1:TEQ count,#0:BNE loop with count
initialised to #32 might do the trick, but it'll slow it down to at
least 288 cycles (so what if it only uses 32 bytes?? ;-}), which is
pretty unacceptable unless you're on an ARM3..
This is off the top of my head and I'm nowhere near the Arc, so if
it doesn't work (it should - I've typed it in enough times now..)
flame me..
Conrad
From: gcwillia@daisy.waterloo.edu (Graeme Williams)
Subject: Re: really FAST integer division
Summary: Avg. exec. time < 60 cycles - the fastest div in the West?
Date: 28 Jun 91 19:07:58 GMT
Organization: University of Waterloo
Following several posts regarding division routines, in particular one
with a lovely 3 instr. central loop I modified my old division routine
into the following (supersonic, lightning fast, ZR rated, :-) )
32bit routine:
;Essentially what's happening here is we are figuring out how far left
;we can shift the numerator before any actual dividing begins and thus save
;a significant number of useless trips through the (unrolled) 3 instr. loop.
; - Note we only go to the nearest 4 bits, trying to save another 2 bits
; results in 3 extra instrs. for a saving of 6 instrs. half the time
; and 0 instrs. the other half - i.e. no nett benefit.
MOV cnt,#28 : MOV mod,num,LSR#4
CMP den,mod,LSR#12 : SUBLE cnt,cnt,#16 : MOVLE mod,mod,LSR#16
CMP den,mod,LSR#4 : SUBLE cnt,cnt,#8 : MOVLE mod,mod,LSR#8
CMP den,mod : SUBLE cnt,cnt,#4 : MOVLE mod,mod,LSR#4
MOV num,num,LSL cnt : RSB den,den,#0
ADDS num,num,num
;Now skip over cnt copies of the 3 instr. loop.
ADD cnt,cnt,cnt,LSL#1 : ADD pc,pc,cnt,LSL#2 : MOV R0,R0
{ADCS mod,den,mod,LSL#1 : SUBLO mod,mod,den : ADCS num,num,num} x 32 copies
The routine behaves as follows:
On Entry: num - is a signed numerator A (but +ve)
den - is a signed denominator B (but non-zero and +ve)
On Exit: mod - contains A MOD B
num - contains A DIV B
The routine requires 4 registers and 452 bytes of memory.
Execution time:
Given numbers A and B (both +ve) chosen at random from a distribution in
which the most significant bit of the numbers is uniformly distributed
between bits 0 and 31 (i.e. the density of numbers goes as roughly log n)
then the average execution time is:
For A < B : 32 cycles (no division actually required)
For A >= B: 58.25 cycles
The best and worst cases are 32 cycles and 116 cycles respectively.
Have fun. (Stuff like this ought to be worth good money!)
Graeme Williams
gcwillia@daisy.waterloo.edu
P.S. With any luck there's only a couple of typos :-).
From: dseal@armltd.uucp (David Seal)
Subject: Re: Dividing
Date: 5 Mar 91 19:47:12 GMT
In article <8828@castle.ed.ac.uk> ecwu61@castle.ed.ac.uk (R Renwick) writes:
> Can anyone help me with this problem:
>In my super-duper, wire-frame model animating 'C' program, I represent
>floating point numbers as 4-byte integers. This I do by shifting the number
>being represented by 15 bits (multiply by 32768) in order that I can keep a
>track of the fractional part. The reason for doing this is that I want
>to make sure my program runs as fast as possible by not using floating
>point arithmetic...
>
> What I want to do is do a divide a number by another number
>and not lose alot of the precision.
> The way I do this at the moment is by
> result=((a/b)<<15)
>But if a and b are close together then (a/b) loses alot of precison
>(since the result is an int).
>
>Does anyone know how I could do an integer division without losing
>precision. I cannot use result=(((a<<8)/b)<<7) or any other tricks
>because a may be close to 32768<<15 already...
>Perhaps if I can find the reciprocal of b in the above representation
>and multiply it by a then I'll not lose the as much precision, but I
>don't know how to do this (find the reciprocal) :-(
Two quick suggestions:
(1) The basic problem is that integer division stops when it determines the
bit in the units place, whereas you want it to go on and calculate
another 15 fractional binary places. So rather than using C's integer
divide operator, write a long division routine using subtractions and
shifts and force it to go on 15 iterations longer than normal. (This
won't be much slower than C's integer divide, which essentially uses a
long division routine anyway).
If speed is very important, you'll probably want to either (a) write a
well-optimised assembler routine; or (b) write it as a C macro, so that
the compiler includes it inline. In the latter case, you may want to
look carefully at the assembler produced by the compiler for an instance
of your macro and experiment a bit - careful choice of the details of
the long division algorithm can make a significant difference to the
speed of the division.
Incidentally, you may want to watch out for significant bits being lost
during the last 15 iterations: this is perfectly possible and means that
your result has overflowed.
(2) Use a Newton-Raphson iteration to find the reciprocal of your divisor
and then multiply - this presupposes that you've got an efficient way to
do multiplications, both for the Newton-Raphson iteration and for the
final multiplication. The iteration to find the reciprocal of a number A
is (in pseudo-code - you'll have to translate to C):
newx = initial approximation to reciprocal;
repeat
x = newx;
newx = x * (2 - A*x);
until x and newx sufficiently close together;
The main thing to beware of in this is that your initial approximation
must be reasonably close to the reciprocal - it must be greater than 0
and less than twice the reciprocal. Such an approximation can be found
by finding the most significant 1 in A and choosing a value based on
this - e.g. if A lies in the range 1.0 <= A < 2.0, its most significant
1 is at bit position 16, and a suitable starting point for the iteration
would be 0.75. This gives us:
Most sign. bit position Range for A Suitable starting point
-----------------------------------------------------------------------
:
:
9 2^(-7) <= A < 2^(-6) 1.5*2^6
10 2^(-6) <= A < 2^(-5) 1.5*2^5
11 2^(-5) <= A < 2^(-4) 1.5*2^4
12 2^(-4) <= A < 2^(-3) 1.5*2^3
13 2^(-3) <= A < 2^(-2) 1.5*2^2
14 2^(-2) <= A < 2^(-1) 1.5*2^1
15 2^(-1) <= A < 2^0 1.5*2^0
16 2^0 <= A < 2^1 1.5*2^(-1)
17 2^1 <= A < 2^2 1.5*2^(-2)
18 2^2 <= A < 2^3 1.5*2^(-3)
19 2^3 <= A < 2^4 1.5*2^(-4)
20 2^4 <= A < 2^5 1.5*2^(-5)
21 2^5 <= A < 2^6 1.5*2^(-6)
22 2^6 <= A < 2^7 1.5*2^(-7)
:
:
Watch out for very big values - e.g. 2^16 is representable in this
number system, but 2^(-16) isn't!
Hope these suggestions help.
David Seal
dseal@acorn.co.uk or dseal@armltd.uucp