
### Extended-Precision Division


```I am ultimately looking for a way to perform a 96-bit / 64-bit divide. This
is pretty simple to implement via a pair of 64-bit / 64-bit divides, so this
is the problem that I am focusing on at the moment. I know the basic
shifting implementation, but that is excruciatingly slow. I am estimating
that it would be upwards of 300 cycles for the 64-bit / 64-bit divide, so
the full 96-bit / 64-bit divide would be around 600 cycles.

Isn't there a faster way?

-Matt

```

```Matt wrote:

> I am ultimately looking for a way to perform a 96-bit / 64-bit divide. This
> is pretty simple to implement via a pair of 64-bit / 64-bit divides, so this
> is the problem that I am focusing on at the moment. I know the basic
> shifting implementation, but that is excruciatingly slow. I am estimating
> that it would be upwards of 300 cycles for the 64-bit / 64-bit divide, so
> the full 96-bit / 64-bit divide would be around 600 cycles.

Maybe you could ask in a newsgroup like comp.programming? You would have
to phrase it in a machine-independent manner, but they might be able to
provide some insight.

```

```Matt wrote:

> I am ultimately looking for a way to perform a 96-bit / 64-bit divide. This
> is pretty simple to implement via a pair of 64-bit / 64-bit divides, so this
> is the problem that I am focusing on at the moment. I know the basic
> shifting implementation, but that is excruciatingly slow. I am estimating
> that it would be upwards of 300 cycles for the 64-bit / 64-bit divide, so
> the full 96-bit / 64-bit divide would be around 600 cycles.
>
> Isn't there a faster way?

'There is always a faster way'

(Probably written by Mike Abrash)

First, do you have a 64-bit cpu, like a recent AMD or Intel?

If so, the problem is trivial, since the x86 DIV opcode and register
assignments are designed for exactly this sort of stuff:

(I don't have the 64-bit instruction manuals here, so I'll show the
solution for a 48 / 32 divide using a 32-bit cpu)

; Divide a [48 bits] by b [32 bits], returning the result in EAX
; and the remainder in EDX. This will fail if the actual result
; doesn't fit in 32 bits!

movzx eax, word ptr a[4]
xor edx,edx
mov ebx,[b]
div ebx
; At this point in time EDX contains the remainder, and EAX
; should hopefully be zero:

;  mov [overflow],eax	; Keep this if you want an overflow indication
mov eax, dword ptr a[0]
div ebx

That's it!

Doing 64-bit divides on a 32-bit machine is a _lot_ harder, so several
workarounds might be employed:

1) Solve it by integer DIVs using successive approximation:
   a) Shift the divisor up so that it has no leading zero bits, and
      remember the shift count. (Worst case this doesn't save any
      bits at all.)
   b) Divide the top 64 (or 63, to avoid overflow) bits by this
      32-bit number, then back-multiply using the full 64-bit
      divisor, and subtract the result from the original
      96-bit number.
   c) Repeat (b) one more time, and you should be done.

2) Do the same as (1), but start by calculating a reciprocal divisor
that you can use to multiply by instead.

3) Load both the divisor and dividend into fp registers, calculate an
approximate answer, then convert this back to a 64-bit integer value.

Do a single back-multiply and subtract to decide if you need to repeat
the process, or if you can get away with a single-bit adjustment.

Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

```
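The remainder chaining that Terje's two-DIV sequence relies on can be modelled in portable C. This is a sketch with illustrative names, not the x86 code itself: the remainder of the first division becomes the high half of the next dividend, which is exactly the role EDX plays between the two DIV instructions.

```c
#include <stdint.h>

/* Model of the two-DIV sequence: divide a 48-bit value (hi16:lo32)
 * by a 32-bit divisor by chaining the first remainder into the
 * second dividend, as EDX:EAX does on x86. Names are illustrative. */
uint32_t div48by32_chain(uint16_t hi, uint32_t lo, uint32_t d,
                         uint32_t *overflow)
{
    uint32_t q_hi = hi / d;        /* first DIV: quotient should be 0   */
    uint32_t r    = hi % d;        /* remainder, "left in EDX"          */
    *overflow = q_hi;              /* nonzero => result needs > 32 bits */

    uint64_t next = ((uint64_t)r << 32) | lo;  /* EDX:EAX reloaded */
    return (uint32_t)(next / d);   /* second DIV: r < d, so this fits   */
}
```

Since r < d, the second quotient is guaranteed to fit in 32 bits; the only overflow hazard is the first division, which the assembly above flags the same way.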

```"Matt" wrote:

| I am ultimately looking for a way to perform a 96-bit / 64-bit divide.
| This is pretty simple to implement via a pair of 64-bit / 64-bit divides,
| so this is the problem that I am focusing on at the moment.
| I know the basic shifting implementation, but that is excruciatingly slow.
| I am estimating that it would be upwards of 300 cycles for the 64-bit /
| 64-bit divide, so  the full 96-bit / 64-bit divide would be around 600
| cycles.
|
| Isn't there a faster way?

I assume you're talking about integer division; the 'binary 1/x
approximation' won't help on speed here.

My calculator routines work up to 512 bits and I've tried many ways to
speed up division; the best is to avoid it altogether.

At the moment I use a mix of partial logarithmic table look-up and
byte/word/dword ordered shifts (not really shifts, just altered offsets
in double-sized buffers). But this needs to forget about decimal
thinking for some time ;)

This solution is slower than the FPU for 32..63-bit operations, but is
much faster in comparison for larger figures.
__
wolfgang

```

```wolfgang kern wrote:
> "Matt" wrote:
[...]
> My calculator routines work up to 512 bits and I've tried many
> ways to speed up divide, best is to avoid it at all.
[...]

Yes, specifically I am taking a timestamp from rdtsc and converting it into
microseconds. I have the 64-bit CPU frequency and 64-bit timestamp. First I
have to multiply by 1,000,000 to get a temporary 96-bit product, and then I
can divide by the CPU frequency. Most of the data I am measuring only
needs a precision of about 1 second. However, I also use the timer to measure
ping times, and for that I need to go down to at least 1 millisecond. It
does not hurt to have too much precision, so I am using microseconds for my
counter.

I would strongly prefer to avoid the FPU here as the rest of my application
does not use it, and it seems wasteful to me to add additional overhead to a
context switch simply to keep a high-precision timer. The timer is not a
critical feature of my application; it is simply something that is nice to
have.

The division isn't really avoidable, but it isn't part of a critical loop
either. I am mostly interested because I know how to extend addition,
subtraction, and multiplication out to arbitrary precisions. Division is the
only one that I do not know about. It is easy to extend the numerator out to
arbitrary precision, but the denominator does not easily follow.

-Matt

```
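The 64-bit times 1,000,000 step that produces the 96-bit intermediate is the easy half of the job. As a sketch (the function name is illustrative), the schoolbook widening multiply Matt alludes to looks like this in C, using only 32-bit halves:

```c
#include <stdint.h>

/* Widening multiply: 64-bit a times 32-bit b -> 96-bit product,
 * returned as a 64-bit low part and a 32-bit high part. */
void mul64by32(uint64_t a, uint32_t b, uint64_t *lo, uint32_t *hi)
{
    uint64_t p0 = (uint64_t)(uint32_t)a * b;            /* low 32 x 32  */
    uint64_t p1 = (uint64_t)(a >> 32) * b + (p0 >> 32); /* high + carry */

    *lo = (p1 << 32) | (uint32_t)p0;  /* low 32 bits of p1 (shift wraps
                                         intentionally), then p0's low  */
    *hi = (uint32_t)(p1 >> 32);       /* bits 64..95 of the product     */
}
```

With b = 1000000 the high part fits in 20 bits, so the whole product stays within 84 bits, as Matt observes later in the thread.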

```Matt wrote:

> wolfgang kern wrote:
>
>>"Matt" wrote:
>
> [...]
>
>>My calculator routines work up to 512 bits and I've tried many
>>ways to speed up divide, best is to avoid it at all.
>
> [...]
>
> Yes, specifically I am taking a timestamp from rdtsc and converting it into
> microseconds. I have the 64-bit CPU frequency and 64-bit timestamp. First I
> have to multiply by 1,000,000 to get a temporary 96-bit product, and then I
> can divide by the CPU frequency. Most data that I am measuring only need a
> precision of at least 1 second. However, I also use the timer to measure
> ping times, and for that I need to go down to at least 1 millisecond. It
> does not hurt to have too much precision, so I am using microseconds for my
> counter.

Ouch! You _really_ need to consider your algorithm!
>
> I would strongly prefer to avoid the FPU here as the rest of my application
> does not use it, and it seems wasteful to me to add additional overhead to a
> context switch simply to keep a high-precision timer. The timer is not a
> critical feature of my application; it is simply something that is nice to
> have.
>
> The division isn't really avoidable, but it isn't part of a critical loop
> either. I am mostly interested because I know how to extend addition,
> subtraction, and multiplication out to arbitrary precisions. Division is the
> only one that I do not know about. It is easy to extend the numerator out to
> arbitrary precision, but the denominator does not easily follow.

Not at all:

The easy way to solve your assumed problem _exactly_ is to calculate a
65-bit accurate reciprocal value, and use this to multiply the timestamp
difference. Do a final shift, and you have the same answer as you'd get
from a full division.

However, there is a _much_ better way, as long as you remember that the
CPU frequency given is _at best_ accurate to 6 decimal digits:

On startup you should calculate a fp double precision value which is the
reciprocal of the frequency:

double init_reciprocal(uint64_t frequency)
{
    return 1.0e9 / (double)frequency;
}

Save this value, and you're mostly done:

At runtime you calculate all deltas as exact 64-bit values, then convert
to double and multiply by the reciprocal. The result is a time delta
measured in nanoseconds.

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

```
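Terje's scheme boils down to one FP divide at startup and a single multiply per conversion. A minimal sketch (names are illustrative; the 1.0e9 scale gives nanoseconds, as in his post):

```c
#include <stdint.h>

static double ns_per_tick;   /* set once at startup */

void init_reciprocal(uint64_t frequency_hz)
{
    ns_per_tick = 1.0e9 / (double)frequency_hz;
}

/* Per-measurement cost: one integer subtract, one int->double
 * conversion, one multiply. No divide, no 96-bit arithmetic. */
double ticks_to_ns(uint64_t t0, uint64_t t1)
{
    return (double)(t1 - t0) * ns_per_tick;
}
```

The double mantissa (53 bits) comfortably exceeds the roughly 6 significant digits the reported CPU frequency is good for, which is Terje's point.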

```Matt wrote:
> I am ultimately looking for a way to perform a 96-bit / 64-bit divide. This
> is pretty simple to implement via a pair of 64-bit / 64-bit divides, so this
> is the problem that I am focusing on at the moment. I know the basic
> shifting implementation, but that is excruciatingly slow. I am estimating
> that it would be upwards of 300 cycles for the 64-bit / 64-bit divide, so
> the full 96-bit / 64-bit divide would be around 600 cycles.
>
> Isn't there a faster way?

Yes, there is - you can see the sort of techniques that can be used in
the following paper on fast recursive division:

http://domino.mpi-sb.mpg.de/internet/reports.nsf/NumberView/1998-1-022

```

```Terje Mathisen wrote:
> Matt wrote:
[...]
>> The division isn't really avoidable, but it isn't part of a critical
>> loop either. I am mostly interested because I know how to extend
>> addition, subtraction, and multiplication out to arbitrary
>> precisions. Division is the only one that I do not know about. It is
>> easy to extend the numerator out to arbitrary precision, but the
>> denominator does not easily follow.
>
> Not at all:
>
> The easy way to solve your assumed problem _exactly_ is to calculate a
> 65-bit accurate reciprocal value, and use this to multiply the
> timestamp difference. Do a final shift, and you have the same answer
> as you'd get from a full division.
[...]

Now that I think about it, this seems very reasonable. I can use a crappy
shift-and-subtract algorithm, and I only pay that cost once. The multiplies
may well be as fast as a single divide.

I am still fairly interested in a more mathematical answer. At Grumble's
recommendation, I have posted the same question in comp.programming, and
already a couple of interesting replies have come up.

-Matt

```

```BRG wrote:
> Matt wrote:
>> I am ultimately looking for a way to perform a 96-bit / 64-bit
>> divide. This is pretty simple to implement via a pair of 64-bit /
>> 64-bit divides, so this is the problem that I am focusing on at the
>> moment. I know the basic shifting implementation, but that is
>> excruciatingly slow. I am estimating that it would be upwards of 300
>> cycles for the 64-bit / 64-bit divide, so the full 96-bit / 64-bit
>> divide would be around 600 cycles.
>>
>> Isn't there a faster way?
>
> Yes, there is - you can see the sort of techniques that can be used in
> the following paper on fast recursive division:
>
> http://domino.mpi-sb.mpg.de/internet/reports.nsf/NumberView/1998-1-022

Apologies for the cross-posting, but I had posted here already and only
posted to comp.programming after it was recommended that I do so.

-Matt

```

```Matt wrote:
> BRG wrote:
>
>>Matt wrote:
>>
>>>I am ultimately looking for a way to perform a 96-bit / 64-bit
>>>divide. This is pretty simple to implement via a pair of 64-bit /
>>>64-bit divides, so this is the problem that I am focusing on at the
>>>moment. I know the basic shifting implementation, but that is
>>>excruciatingly slow. I am estimating that it would be upwards of 300
>>>cycles for the 64-bit / 64-bit divide, so the full 96-bit / 64-bit
>>>divide would be around 600 cycles.
>>>
>>>Isn't there a faster way?
>>
>>Yes, there is - you can see the sort of techniques that can be used in
>>the following paper on fast recursive division:
>>
>>http://domino.mpi-sb.mpg.de/internet/reports.nsf/NumberView/1998-1-022
>
> cross-posting, but I had posted here already and only posted to
> comp.programming after it was recommended that I do so.

Hi,

I posted a second time because my first reply failed to show up (it did
eventually turn up, after a delay of several hours).

The recursive division algorithm is worth looking at if you want a
general solution for large multiple-precision division tasks.
```

```Matt wrote:
> Terje Mathisen wrote:
>
>>Matt wrote:
>
> [...]
>
>>>The division isn't really avoidable, but it isn't part of a critical
>>>loop either. I am mostly interested because I know how to extend
>>>addition, subtraction, and multiplication out to arbitrary
>>>precisions. Division is the only one that I do not know about. It is
>>>easy to extend the numerator out to arbitrary precision, but the
>>>denominator does not easily follow.
>>
>>Not at all:
>>
>>The easy way to solve your assumed problem _exactly_ is to calculate a
>>65-bit accurate reciprocal value, and use this to multiply the
>>timestamp difference. Do a final shift, and you have the same answer
>>as you'd get from a full division.
>
> [...]
>
> Now that I think about it, this seems very reasonable. I can use a crappy
> shift-and-subtract algorithm, and I only pay that cost once. The multiplies
> may well be as fast as a single divide.
>
> I am still fairly interested in a more mathematical answer. At Grumble's
> recommendation, I have posted the same question in comp.programming, and
> already a couple of interesting replies have come up.

The 65-bit reciprocal _will_ give mathematically exact answers.

For a proof, search for the article Agner Fog wrote after he & I
reinvented the algorithm and proved that it works for all possible
combinations of divisor and dividend.

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"


```

```Terje Mathisen wrote:
> Matt wrote:
>> Terje Mathisen wrote:
>>> Matt wrote:
>>> [...]
>>>> The division isn't really avoidable, but it isn't part of a
>>>> critical loop either. I am mostly interested because I know how
>>>> to extend addition, subtraction, and multiplication out to
>>>> arbitrary precisions. Division is the only one that I do not know
>>>> about. It is easy to extend the numerator out to arbitrary
>>>> precision, but the denominator does not easily follow.
>>>
>>> Not at all:
>>>
>>> The easy way to solve your assumed problem _exactly_ is to
>>> calculate a 65-bit accurate reciprocal value, and use this to
>>> multiply the timestamp difference. Do a final shift, and you have
>>> the same answer as you'd get from a full division.
>>
>> [...]
>>
>> Now that I think about it, this seems very reasonable. I can use a
>> crappy shift-and-subtract algorithm, and I only pay that cost once.
>> The multiplies may well be as fast as a single divide.
>>
>> I am still fairly interested in a more mathematical answer. At
>> Grumble's recommendation, I have posted the same question in
>> comp.programming, and already a couple of interesting replies have
>> come up.
>
> The 65-bit reciprocal _will_ give mathematically exact answers.
>
> For a proof, search for the article Agner Fog wrote after he & I
> reinvented the algorithm and proved that it works for all possible
> combinations of divisor and dividend.

I know... I meant that I'd like to see the mathematics of how to perform
long division. Every grade-school child taking algebra can tell you that
X/(Y + Z) != X/Y + X/Z. A reciprocal multiply is a good solution, but it
also requires an existing reciprocal or division algorithm. IIRC the
reciprocal can be computed iteratively with only multiplies, but...

I suppose for the reciprocal that means that for Q = X / Y I compute R =
2^65 / Y and Q = (X * R) >> 65?

How exactly do you come by 65-bit reciprocal? Rather, what is the
significance of 65 bits?

-Matt

```

```Matt wrote:

> Terje Mathisen wrote:
>>>The 65-bit reciprocal _will_ give mathematically exact answers.
>>>
>>>For a proof, search for the article Agner Fog wrote after he & I
>>>reinvented the algorithm and proved that it works for all possible
>>>combinations of divisor and dividend.
>
> I know...I meant that I'd like to see the mathematics of how to perform long
> division. Every grade school child taking Algebra can tell you that X/(Y +
> Z) != X/Y * (Y + Z). A reciprocal multiply is a good solution, but it also
> requires an existing reciprocal or division algorithm. IIRC reciprocal can
> be computed iteratively with only multiplies, but...

>
> I suppose for the reciprocal that means that for Q = X / Y I compute R =
> 2^65 / Y and Q = (X * R) >> 65?

Not exactly, check out the Agner Fog paper.
>
> How exactly do you come by 65-bit reciprocal? Rather, what is the
> significance of 65 bits?

A very simple search with my and Agner's names found two links, the
second of which is the one you want.
Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

```

```Terje Mathisen wrote:
> Matt wrote:
[...]
>> How exactly do you come by 65-bit reciprocal? Rather, what is the
>> significance of 65 bits?
>
[...]

According to the paper I should be using *147* bits of precision, yes? I'm
dividing an 84-bit number (64-bits * 1000000 fits in 84-bits) by a 64-bit
number, so I need r = 84 + 64 - 1 = 147. The paper also mentions rounding,
but it seems to me that I could simply add an extra bit of precision to
accomplish the same, yes? In my case, adding the extra bit of precision

-Matt

```

```Matt wrote:

> Terje Mathisen wrote:
>
>>Matt wrote:
>
> [...]
>
>>>How exactly do you come by 65-bit reciprocal? Rather, what is the
>>>significance of 65 bits?
>>
>
> [...]
>
> According to the paper I should be using *147* bits of precision, yes? I'm
> dividing an 84-bit number (64-bits * 1000000 fits in 84-bits) by a 64-bit
> number, so I need r = 84 + 64 - 1 = 147. The paper also mentions rounding,
> but it seems to me that I could simply add an extra bit of precision to
> accomplish the same, yes? In my case, adding the extra bit of precision

Actually, you might need more than 65 bits if you want to do everything
with a single (wide) multiplication, but nowhere near 147:

The div replacement we came up with needs N+1 bits for an N/N -> N
division, so you can use a 65-bit reciprocal (really 64 bits plus a
single increment) to divide the top 64 bits of the dividend by this value.

This gives you the top 64 bits of the result, right?

You then back-multiply by the original divisor, subtract, shift left 32
bits and add the bottom 32 bits of the original dividend.

One more reciprocal multiplication and you're done.

Total work: 4 MULs + some ADD/ADC for the first reciprocal, 4 MULs for
the back-multiply, and another 4 MULs for the second reciprocal.

I have a strong suspicion that a 97-bit reciprocal will work for all
possible 96/96 -> 96 divisions, including those where the divisor only
has 64 bits.

Doing such a reciprocal multiplication with 32-bit MULs would require 9
MUL operations, plus a set of ADD/ADC operations to merge the partial
results. On a PIII, it seems like it should be doable in less than 40
cycles, which happens to be the same time as would be required by a
single 32-bit DIV opcode. :-)

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

```
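The "N+1 bits for an N/N -> N division" claim is easiest to check at toy scale. The sketch below is an illustration, not the production constant from Agner Fog's write-up: it uses an 8-bit word with a 2N-bit scale, built the same way (floor(2^s/d) plus one extra increment), and is small enough to verify exhaustively.

```c
#include <stdint.h>

/* Exact unsigned division by reciprocal multiplication, 8-bit scale:
 * with R = floor(2^16 / d) + 1, (x * R) >> 16 equals x / d for every
 * x in [0,255] and d in [1,255]. The "+1" is the extra reciprocal
 * bit the thread keeps coming back to. */
uint32_t reciprocal8(uint32_t d)             /* d in [1, 255] */
{
    return (1u << 16) / d + 1;
}

uint32_t div_by_mul8(uint32_t x, uint32_t R) /* x in [0, 255] */
{
    return (x * R) >> 16;
}
```

The same construction scaled up gives the 65-bit reciprocal for a 64/64 -> 64 divide: the constant needs one more bit than the operands, plus a final shift.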

```> Matt wrote:
>
> > I am ultimately looking for a way to perform a 96-bit / 64-bit divide. This
> > is pretty simple to implement via a pair of 64-bit / 64-bit divides, so this
> > is the problem that I am focusing on at the moment. I know the basic
> > shifting implementation, but that is excruciatingly slow. I am estimating
> > that it would be upwards of 300 cycles for the 64-bit / 64-bit divide, so
> > the full 96-bit / 64-bit divide would be around 600 cycles.

I, for one, would be interested in seeing this done with a pair of 64/64
divides. Most algorithms I've seen that do extended-precision arithmetic
tend to be O(n**2), with n being the number of bits. There are hacks to
speed up special cases (i.e., lots of zero bits), but for the most part
the worst case turns out to be O(n**2).

Yep, the shifting implementation is excruciatingly slow. But it is
generic. There are faster divides around for special cases, but I've not
seen any algorithms that are significantly faster and completely
general. Of course, I don't claim to have seen everything, so you never
know, but if there was an amazingly good algorithm, I'd probably have
seen it in the past 20 years or so, as I have to do this type of
division every five or ten years myself.

Cheers,
Randy Hyde

```

```Hi Matt,

| Yes, specifically I am taking a timestamp from rdtsc and converting it
| into microseconds. I have the 64-bit CPU frequency and 64-bit timestamp.
| First I have to multiply by 1,000,000 to get a temporary 96-bit product,
| and then I can divide by the CPU frequency.

After reading this one more time, I think you could set up your factor
in a way that it already includes the CPU frequency and a 2^n alignment.

i.e.: 1.5 GHz becomes 666,666 E-12 seconds (just divide once);
multiply this value by a 2^n figure (SHL) to get an integer factor
with the desired precision. Save the used 2^n value as your divisor,
which then allows you to replace the divide with an SHR after you
multiply the timestamp by this factor.

And if you can align the figures to get a shift count which is a
multiple of 8/16/(best) 32, you can also forget about the right shift by
just using the desired upper bytes as the result (the lower bytes then
contain fractions); for that I'd use a larger (128-bit) result buffer.

| .... I am using microseconds for my counter.

Yes, I've got a 1 ms base for my system too, but I can just use the
IRQ0 timer for it, while my IRQ8 RTC fires every full second and
synchronises IRQ0 (which can't be exactly 1 ms on its own).

| I would strongly prefer to avoid the FPU here as the rest of my
| application does not use it, and it seems wasteful to me to add
| additional overhead to a context switch simply to keep a high-precision
| timer. The timer is not a critical feature of my application;
| it is simply something that is nice to have.

Sure.

| The division isn't really avoidable, but it isn't part of a critical loop
| either. I am mostly interested because I know how to extend addition,
| subtraction, and multiplication out to arbitrary precisions.
| Division is the only one that I do not know about.
| It is easy to extend the numerator out to arbitrary precision,
| but the denominator does not easily follow.

Right, divisions are time-consuming. I remember an article on "Dr. Math",
but I found it's not _that_ fast either. So whenever many divisions are
required, like sin/cos table creation, I load my log-LUT and take the
logarithmic detour :)

__
wolfgang

```
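wolfgang's "factor plus SHR" idea can be sketched as fixed-point arithmetic (illustrative names; assumes a compiler with unsigned __int128, e.g. GCC or Clang): fold 10^6/frequency into one pre-scaled constant, then each conversion is a widening multiply and a right shift.

```c
#include <stdint.h>

/* F = floor(2^32 * 1e6 / freq): microseconds per tick in 32.32
 * fixed point. Computed once, so the slow divide happens only here. */
uint64_t make_factor(uint64_t freq_hz)
{
    return (uint64_t)(((unsigned __int128)1000000 << 32) / freq_hz);
}

/* ticks -> microseconds: one widening MUL and an SHR, no divide.
 * Approximate: the factor is truncated to 32 fractional bits. */
uint64_t ticks_to_us(uint64_t delta, uint64_t factor)
{
    return (uint64_t)(((unsigned __int128)delta * factor) >> 32);
}
```

Choosing a shift that is a multiple of 8 or 32, as wolfgang suggests, lets you take the result by byte or word addressing instead of shifting at all.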

```In article <Nz24e.27300\$Pc.11044@tornado.tampabay.rr.com>,
Matt <spamtrap@crayne.org> wrote:
>
> I am ultimately looking for a way to perform a 96-bit / 64-bit divide. This
> is pretty simple to implement via a pair of 64-bit / 64-bit divides, so this
> is the problem that I am focusing on at the moment. I know the basic
> shifting implementation, but that is excruciatingly slow. I am estimating
> that it would be upwards of 300 cycles for the 64-bit / 64-bit divide, so
> the full 96-bit / 64-bit divide would be around 600 cycles.
>
> Isn't there a faster way?
>
> -Matt

Yes.  See Knuth, section 4.3.1, Algorithm D. It shows how to perform a
3n/2n -> n + 2n unsigned division using a single 2n/n -> n + n unsigned
division, an n x n -> 2n multiply, some shifting, and a few additions
and subtractions.
For x86, let n = 32.

To divide (n2,n1,n0) by (d1,d0), producing quotient q and (r1,r0), do:

1) Normalize.  Shift (n2,n1,n0) and (d1,d0) left until the msbit of d1 is set.
All following math is done with the normalized values unless stated
otherwise.
2) Estimate quotient.  Divide (n2,n1) / d1 to get quotient estimate q
and associated remainder r1.
3) Form the product (p1,p0) = q * d0
4) Let r0 = n0.
5) While (r1,r0) < (p1,p0), do:
* q = q-1
* (r1,r0) = (r1,r0) + (d1,d0)
(This loop will execute (see Knuth's Theorem B) at most twice.)
6) (r1,r0) = (r1,r0) - (p1,p0)
7) Unnormalize: Shift (r1,r0) right as many bits as you shifted left
in step 1.
8) You now have the final quotient q and remainder (r1,r0).

Basically, it turns out that if you shift the denominator up as much as
possible, you get an amazingly good estimate of the quotient, which
is further guaranteed to never be an underestimate.

If normalizing d overflows n, so that you have a 128/64-bit division
problem, then do the above twice.  The first time, divide (n3,n2,n1) by
(d1,d0) to compute q1, and use the remainder from that to perform the
second division, (r1,r0,n0) / (d1,d0).

To divide by denominators longer than 64 bits, use the most significant
64 bits in the above algorithm to compute a quotient digit estimate q,
then subtract q times the denominator from the numerator.  Very rarely,
the estimate q will be too high by 1, in which case the subtraction will
underflow.  Add back the denominator and subtract 1 from q.

Repeat for each quotient digit until the remaining numerator, now the
remainder, is less than d.  Then unnormalize and you have your answer.

It's all in Knuth, _The Art Of Computer Programming_, Volume 2
(Seminumerical Algorithms).

```
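Steps 1-8 above can be sketched in C at n = 16 (a 48/32 -> 16 + 32 divide), small enough to check against native 64-bit division. This is an illustrative rendering of Algorithm D with assumed names; it requires d1 != 0 and a quotient that fits in one 16-bit digit.

```c
#include <stdint.h>

/* Knuth Algorithm D, base b = 2^16: divide (n2,n1,n0) by (d1,d0).
 * Preconditions: d1 != 0, and (n2,n1) < (d1,d0) so the quotient
 * fits in one digit. */
uint16_t div48by32(uint16_t n2, uint16_t n1, uint16_t n0,
                   uint16_t d1, uint16_t d0, uint32_t *rem)
{
    /* 1) Normalize: shift left until the divisor's msbit is set
     *    (at most 15 places, since d1 != 0). */
    uint32_t d = ((uint32_t)d1 << 16) | d0;
    int s = 0;
    while (!(d & 0x80000000u)) { d <<= 1; s++; }
    uint64_t n = (((uint64_t)n2 << 32) | ((uint64_t)n1 << 16) | n0) << s;
    uint16_t N2 = (uint16_t)(n >> 32), N1 = (uint16_t)(n >> 16),
             N0 = (uint16_t)n;
    uint16_t D1 = (uint16_t)(d >> 16), D0 = (uint16_t)d;

    /* 2) Estimate the quotient from the two top digits. */
    uint32_t top = ((uint32_t)N2 << 16) | N1;
    uint32_t q  = top / D1;
    uint32_t r1 = top % D1;
    if (q > 0xFFFF) { r1 += (q - 0xFFFF) * D1; q = 0xFFFF; } /* clamp */

    /* 3)-6) Correct: add back the divisor while the trial remainder
     * would go negative (at most twice, per Knuth's Theorem B). */
    uint64_t p = (uint64_t)q * D0;            /* (p1,p0) = q * d0 */
    uint64_t r = ((uint64_t)r1 << 16) | N0;   /* (r1,r0)          */
    while (r < p) { q--; r += d; }
    r -= p;

    /* 7) Unnormalize the remainder (its low s bits are zero). */
    *rem = (uint32_t)(r >> s);
    return (uint16_t)q;                       /* 8) final quotient */
}
```

At n = 32 the same structure becomes the 96/64 divide from the original question, with the hardware 64/32 DIV supplying the estimate in step 2.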