Using MUL & SHRD instead of DIV

  • Follow


Hi all,

I read somewhere someone saying that these days
it is common to use multiply & shift in lieu of
division because doing so is faster.

I did a quick test and found that indeed it is
faster with a Core 2 DUo. What took 102 seconds
for DIV required only 10.6 seconds using MUL with SHRD.

; Using DIV
mov eax, ecx
mov esi, 123
xor edx, edx
div esi

; Using MUL & SHRD
mov eax, ecx
add eax, 32
mov esi, (1<<25)/123
mul esi
shrd eax, edx, 25

However there is a problem: The multiply & shift
approach is giving me results that are not
precise. They are usually off by 1.

Can anyone suggest a way to improve
the precision of this much-faster approach?

Thanks.
0
Reply Noop 6/30/2010 1:48:19 AM

On Jun 29, 8:48=A0pm, Noop <questione...@nospicedham.yahoo.com> wrote:
> Hi all,
>
> I read somewhere someone saying that these days
> it is common to use multiply & shift in lieu of
> division because doing so is faster.
>
> I did a quick test and found that indeed it is
> faster with a Core 2 DUo. What took 102 seconds
> for DIV required only 10.6 seconds using MUL with SHRD.
>
> ; Using DIV
> mov eax, ecx
> mov esi, 123
> xor edx, edx
> div esi
>
> ; Using MUL & SHRD
> mov eax, ecx
> add eax, 32
> mov esi, (1<<25)/123
> mul esi
> shrd eax, edx, 25
>
> However there is a problem: The multiply & shift
> approach is giving me results that are not
> precise. They are usually off by 1.
>
> Can anyone suggest a way to improve
> the precision of this much-faster approach?


Picking the right constant to multiply by is a bit more complex than
what you're doing, plus many divisions require an adjustment step
after the multiplication by the reciprocal.  For an excellent
discussion (and programs to generate the sequences needed), see the
"Software Optimization Guide for AMD64 Processors," particularly 8.1
and 8.8:

http://support.amd.com/us/Embedded_TechDocs/25112.PDF

For example, an unsigned division by 123, generates:


G:\wessel>udiv

Unsigned division by constant
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D

enter divisor: 123

; dividend: register other than EAX or memory location

MOV EAX, 085340853h
MUL dividend
ADD EAX, 085340853h
ADC EDX, 0
SHR EDX, 6

; quotient now in EDX


While a division by 1000 generates the rather simpler:


G:\wessel>udiv

Unsigned division by constant
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D

enter divisor: 1000

; dividend: register other than EAX or memory location

MOV EAX, 010624DD3h
MUL dividend
SHR EDX, 6

; quotient now in EDX
0
Reply robertwessel2 6/30/2010 3:36:58 AM


"Noop" wrote:

> I read somewhere someone saying that these days
> it is common to use multiply & shift in lieu of
> division because doing so is faster.
>
> I did a quick test and found that indeed it is
> faster with a Core 2 DUo. What took 102 seconds
> for DIV required only 10.6 seconds using MUL with SHRD.
>
> ; Using DIV
> mov eax, ecx
> mov esi, 123
> xor edx, edx
> div esi
>
> ; Using MUL & SHRD
> mov eax, ecx
> add eax, 32
> mov esi, (1<<25)/123
> mul esi
> shrd eax, edx, 25
>
> However there is a problem: The multiply & shift
> approach is giving me results that are not
> precise. They are usually off by 1.
>
> Can anyone suggest a way to improve
> the precision of this much-faster approach?

quick and dirty:
_____
2**32/123 = 34918433.3
lets now truncute the fraction and show it in hex: 0214D021h

MOV ecx,00214D021h  ;reciprocal divisor
MOV eax,dividend    ;
MUL ecx
;no shift required  ;edx holds the result,

precise enough ?
If not then you need to imply some fraction-bits ie:
_____
2**34/123 = 139673733.2
this become 08534085h but needs an additional shr edx 2
and the valid input range is limited.

still not enough precise ?
then you may end up with Robert's example from AMD:
_____
2**38/123 = 2234779731 (fraction disappeared now!) == 85340853h

__
wolfgang


0
Reply wolfgang 6/30/2010 8:20:49 AM

Noop wrote:
> Hi all,
>
> I read somewhere someone saying that these days
> it is common to use multiply&  shift in lieu of
> division because doing so is faster.
[snip]
> However there is a problem: The multiply&  shift
> approach is giving me results that are not
> precise. They are usually off by 1.
>
> Can anyone suggest a way to improve
> the precision of this much-faster approach?

I reinvented this approach many years ago, and so did Agner Fog. He has 
written a very nice guide in one of his optimization guides (google is 
your friend).

The key is that if you want N-bit accurate results, you need an N+1 bit 
accurate scaled reciprocal value. Said reciprocal must also be rounded 
UP, so that after the truncation which happens after the multiplication, 
the slightly too large result becomes exact.

The easiest method is to first calculate the largest power of two, such 
that when you divide by the divisor the result is still less than 2^32.

If the fraction is greater than 0.5, you can simply round up the result 
and you have a 32-bit reciprocal which is 33-bit accurate.

Another poster used x / 1000 as an example:

512 is the largest power of two that is less than 1000, so we calculate

  512 * 65536 * 65536 / 1000 = 2199023255.552

Rounding up we get 2199023256, so to divide by 1000 we can do:

   mov eax,2199023256
   mul ebx		; Number to be divided
   shr edx,9		; 512 = 2^9

and we get the result in about 5 cycles.

If otoh the fraction is less than 0.5, you need one more bit in the 
reciprocal, or you must accept that the reciprocal only works for a 
smaller range of input numbers.

Let's try x / 123:

C:\>perl -e "print(64*65536*65536/123)"
2234779731.25203

Since the fraction is less than 0.5 we need to multiply by 2234779731.5 
and shift right 6 bits:

   mov eax,2234779731
   mul ebx
   shr ebx,1
   add eax,ebx
   adc edx,0
   shr edx,6

We can also multiply by 4469559462.50407 (i.e. 4469559463 when rounded 
up) and shift down 7 bits.

4469559463 is 2^32 + 174592167, i.e. we multiply the original value by 
174592167 and then add the same value to EDX:

   mov eax,174592167
   mul ebx
   add edx,ebx
   shr edx,7

It is ~15 years since I originally worked out this, afair there might by 
some corner cases for that last simplified approach.

I do remember that you can calculate this during runtime when you know 
that you are going to make a lot of divisions by the same value and use 
constant code to do the divide:

  mov eax,[reciprocal]
  mul ebx
  mov eax,[mask]
  and eax,ebx
  mov ecx,[shift]
  add edx,eax
  shr edx,cl

I.e. the code above takes 6-8 cycles, vs 40+ for a DIV.

Terje
-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
0
Reply Terje 6/30/2010 9:54:59 AM

On Jun 29, 11:36=A0pm, "robertwess...@yahoo.com"
<robertwess...@nospicedham.yahoo.com> wrote:

> MOV EAX, 085340853h
> MUL dividend
> ADD EAX, 085340853h
> ADC EDX, 0
> SHR EDX, 6

Interesting, thanks.

I decided to put this ADD/ADC extra step to the test, and
wrote assembly code that tries to comprehensively test every
possible value -- a brute force approach.

The extra ADD/ADC step is indeed vital, but
I have found that very large dividends, e.g.
0x80000000, don't work. Smaller ones like 0x10000000
down to 1 are fine. But in any program where
dividends could be any value, this faster approach won't work.

For example, my code shows that dividends starting
at 0x4000000 and decreasing, I get:

Success with divisor 4 using const=3D0x40000000
Success with divisor 6 using const=3D0x2aaaaaaa
Success with divisor 7 using const=3D0x24924924
Success with divisor 10 using const=3D0x19999999
Success with divisor 20 using const=3D0x0ccccccc
Success with divisor 26 using const=3D0x09d89d89
....

If I start with dividends of 0xffffffff and decreasing,
I get far fewer successes:

Success with divisor 641 using const=3D0x00663d80


0
Reply Noop 6/30/2010 2:48:00 PM

Terje Mathisen wrote:

> I reinvented this approach many years ago, and so did Agner Fog. He has
> written a very nice guide in one of his optimization guides (google is
> your friend).

And so am I :-)

http://www.agner.org/optimize/
http://www.agner.org/optimize/optimizing_assembly.pdf

16.9 Division (all processors)
Integer division by a constant (all processors)
Dividing by a constant can be done by multiplying with the reciprocal.
A useful algorithm for integer division by a constant is as follows:
[snip]
0
Reply Noob 7/1/2010 9:14:53 AM

On Jun 30, 6:36=A0am, "robertwess...@yahoo.com"
<robertwess...@nospicedham.yahoo.com> > > I read somewhere someone
saying that these days
> > it is common to use multiply & shift in lieu of
> > division because doing so is faster.
>
> > I did a quick test and found that indeed it is
> > faster with a Core 2 DUo. What took 102 seconds
> > for DIV required only 10.6 seconds using MUL with SHRD.


If your benchmark is large amount of divisions and not much else then
results might be accurate, but not useful. It's hard to find use-case
where the workload is a lot of divisions and not much else. ;)

It would be interesting to see how much of the processor you are
testing with is idle. You only found a bottleneck, but it won't be one
when you don't need the result immediately. The (i)div isn't a
blocking instruction. Of course, if this is used in a really tight
loop and the result is needed in the current iteration then it might
again stall the pipe and slow your code down.

Just saying that things aren't as simple as counting clock cycles for
individual instructions.
0
Reply hanukas 7/2/2010 11:30:49 AM

hanukas wrote:

> On Jun 30, 6:36 am, "robertwess...@yahoo.com"
> <robertwess...@nospicedham.yahoo.com>  >  >  I read somewhere someone
> saying that these days
>>> it is common to use multiply&  shift in lieu of
>>> division because doing so is faster.
>>
>>> I did a quick test and found that indeed it is
>>> faster with a Core 2 DUo. What took 102 seconds
>>> for DIV required only 10.6 seconds using MUL with SHRD.
>
>
> If your benchmark is large amount of divisions and not much else then
> results might be accurate, but not useful. It's hard to find use-case
> where the workload is a lot of divisions and not much else. ;)
>
> It would be interesting to see how much of the processor you are
> testing with is idle. You only found a bottleneck, but it won't be one
> when you don't need the result immediately. The (i)div isn't a
> blocking instruction. Of course, if this is used in a really tight
> loop and the result is needed in the current iteration then it might
> again stall the pipe and slow your code down.
>
> Just saying that things aren't as simple as counting clock cycles for
> individual instructions.

For AMD:

DIV  reg/mem _____________ VectorPath
IDIV reg/mem _____________ VectorPath

The fastest DIV_ is 16 VP, the slowest 72 VP
The fastest IDIV is 18 VP, the slowest 79 VP

IMUL reg8 ________________ DirectPath Single 3
IMUL reg16 _______________ VectorPath ______ 4
IMUL reg16, imm16 ________ VectorPath ______ 4
IMUL reg16, mem16 ________ DirectPath Single 6
IMUL reg16, mem16, imm ___ VectorPath ______ 7
IMUL reg16, reg16 ________ DirectPath Single 3
IMUL reg16, reg16, imm ___ VectorPath ______ 4
IMUL reg32 DirectPath ____ Double __________ 3
IMUL reg32, imm32 ________ DirectPath Single 3
IMUL reg32, mem32 ________ DirectPath Single 6
IMUL reg32, mem32, imm ___ VectorPath ______ 7
IMUL reg32, reg32 ________ DirectPath Single 3
IMUL reg32, reg32, imm ___ DirectPath Single 3
IMUL reg64 _______________ DirectPath Double 5
IMUL reg64, imm32 ________ DirectPath Single 4
IMUL reg64, mem64 ________ DirectPath Single 7
IMUL reg64, mem64, imm ___ VectorPath ______ 8
IMUL reg64, reg64 ________ DirectPath Single 4
IMUL reg64, reg64, imm32 _ DirectPath Single 4
IMUL mem8 ________________ DirectPath Single 6
IMUL mem16 _______________ VectorPath ______ 7
IMUL mem32 _______________ DirectPath Double 6
IMUL mem64 _______________ DirectPath Double 8
MUL_ reg8 ________________ DirectPath Single 3
MUL_ reg16 _______________ VectorPath ______ 4
MUL_ reg32 _______________ DirectPath Double 3
MUL_ reg64 _______________ DirectPath Double 5
MUL_ mem8 ________________ DirectPath Single 6
MUL_ mem16 _______________ VectorPath ______ 7
MUL_ mem32 _______________ DirectPath Double 6
MUL_ mem64 _______________ DirectPath Double 8

VP blocks all three pipes for the given amount
of cycles. MUL and IMUL execute in pipe 0 (ex-
clusive). Subsequential MULs/IMULs are limited
to one operation per given amount of cycles.

AMD surely did not suggest to replace division
by multiplication if both were on par - 64 bit
IMUL reg or MUL reg need 5 cycles, the corres-
ponding DIV 72 and IDIV 79 cycles. Calculating
the speed gain: 1440 % for MUL vs. DIV, 1580 %
for IMUL vs. IDIV. Isn't that amazing? ;)


Greetings from Augsburg

Bernhard Schornak
0
Reply Bernhard 7/2/2010 1:55:10 PM

hanukas wrote:
> On Jun 30, 6:36 am, "robertwess...@yahoo.com"
> <robertwess...@nospicedham.yahoo.com>  >  >  I read somewhere someone
> saying that these days
>>> it is common to use multiply&  shift in lieu of
>>> division because doing so is faster.
>>
>>> I did a quick test and found that indeed it is
>>> faster with a Core 2 DUo. What took 102 seconds
>>> for DIV required only 10.6 seconds using MUL with SHRD.
>
>
> If your benchmark is large amount of divisions and not much else then
> results might be accurate, but not useful. It's hard to find use-case
> where the workload is a lot of divisions and not much else. ;)

Please take a look at the Ogg Vorbis audio codec:

A decoder needs to do a bit-perfect recreation of an encoded frequency 
envelope, this basically requires a DIV for every sample (think pixel) 
on the curve.

When I wrote the world's fastest decoder a few years ago, replacing 
those divisions with on-the-fly (during header decoding) calculated 
reciprocals was one of the optimizations that worked well.

(At the end I had to try about 10 different optimizations before I found 
one that actually helped at least for most of my sample sound files.)

AFAIR, I also found one other location where reciprocals made sense, but 
I'd have to re-read my code to verify this.

Terje

-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
0
Reply Terje 7/2/2010 7:38:33 PM

8 Replies
318 Views

(page loaded in 0.176 seconds)

Similiar Articles:













7/24/2012 7:02:00 PM


Reply: