slow MUL & DIV

  • Follow


Hi,

It is said that MUL and DIV are very slow instructions as compared to
others, in 8086.

Is it true for 486 and Pentium technology?

0
Reply spamtrap 12/17/2004 7:46:58 AM

Advait Raut wrote:

> It is said that MUL and DIV are very slow instructions as compared
> to others, in 8086. Is it true for 486 and Pentium technology?

Assume two's complement representation.

Addition is never slower than multiplication.
Multiplication is never slower than division.

Specifically, on the K7, addition takes 1 cycle, multiplication
takes 4-6 cycles, and division takes ~40 cycles.

-- 
Regards, Grumble

0
Reply Grumble 12/17/2004 10:32:01 AM


"Grumble" <devnull@kma.eu.org> wrote in message
news:cpucc2$8vo$1@news-rocq.inria.fr...

> Assume two's complement representation.

> Addition is never slower than multiplication.
> Multiplication is never slower than division.

Perhaps there are no cases in existence for integer units,
but there have been processors where multiplication is
faster than addition on the floating point unit.  Any
processor with a fully pipelined floating point
multiplier which has the floating point adder (or any
other resource) saturated and a work-starved FP
multiplier will benefit from a code transformation that
converts floating point additions to FP multiplications.
Alphas and Athlons come to mind here.  The Alpha 21164
was so extreme in this regard that I could make code go
faster by converting integer additions to floating point
multiplications!  I have lots of code that's written to
minimize floating point additions, even at the cost of
creating more floating point multiplications, see:

http://home.comcast.net/~kmbtib/

where most of the code is written in this style.

-- 
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end

0
Reply James 12/17/2004 7:01:44 PM

"James Van Buskirk" <spamtrap@crayne.org> wrote in message 
news:1vEwd.276650$R05.101670@attbi_s53...
> "Grumble" <devnull@kma.eu.org> wrote in message
> news:cpucc2$8vo$1@news-rocq.inria.fr...
>
>> Assume two's complement representation.
>
>> Addition is never slower than multiplication.
>> Multiplication is never slower than division.
>
> Perhaps there are no cases in existence for integer units,
> but there have been processors where multiplication is
> faster than addition on the floating point unit.  Any
> processor with a fully pipelined floating point
> multiplier which has the floating point adder (or any
> other resource) saturated and a work-starved FP
> multiplier will benefit from a code transformation that
> converts floating point additions to FP multiplications.
> Alphas and Athlons come to mind here.  The Alpha 21164
> was so extreme in this regard that I could make code go
> faster by converting integer additions to floating point
> multiplications!  I have lots of code that's written to
> minimize floating point additions, even at the cost of
> creating more floating point multiplications, see:

How exactly would you convert an FP add to a FP multiply? The only case I 
can think of is 2^m*x + 2^n*x where m and n are pre-computed constants and 
abs(m - n) < precision. That doesn't strike me as a very common case.

I can't speak for Alpha, but on K7 fadd latency == fmul latency. I have 
never seen anyone convert between the two. Usually it is more effective to 
use 3DNow! or SSE. The FP add and multiply units can both perform 2 
single-precision ops per cycle or 1 double-precision op per cycle, so for 
single-precision floating-point you can get twice the throughput using 
3DNow! or SSE.

-Matt 

0
Reply Matt 12/18/2004 8:29:28 PM

"Matt" <spamtrap@crayne.org> wrote in message
news:2m0xd.176295$6w6.70314@tornado.tampabay.rr.com...

> How exactly would you convert an FP add to a FP multiply? The only case I
> can think of is 2^m*x + 2^n*x where m and n are pre-computed constants and
> abs(m - n) < precision. That doesn't strike me as a very common case.

Well, the following example does it.  You might think that I am
emulating an increment operation rather than a full-blown integer
addition, but Alpha, of course, has no dedicated increment opcode.
If you can't see what the code is doing I can answer questions
about it here or in comp.lang.fortran.

C:\LF9556\James\clf\fortran_count>type fortran_count.f90
program fortran_count
   implicit none
   integer, parameter :: ik8 = selected_int_kind(18)
   integer, parameter :: dp = selected_real_kind(15,300)
   integer(ik8) magic1, magic2
   real(dp) counter2, fortran
   integer i
   integer(ik8) decoder
   data magic1, magic2/Z'4230000000000000', Z'3ff0000000000001'/

   counter2 = transfer(magic1,fortran)
   fortran = transfer(magic2,fortran)
   do i = 1, 17
      counter2 = counter2+fortran  ! Increment high counter
   end do
   do i = 1, 257
      counter2 = counter2*fortran  ! Increment low counter
   end do
   decoder = transfer(counter2,decoder)
   write(*,'(a,i0)') ' High counter value = ', ibits(decoder,16,36)
   write(*,'(a,i0)') '  Low counter value = ', ibits(decoder,0,16)
end program fortran_count

C:\LF9556\James\clf\fortran_count>fortran_count
High counter value = 17
 Low counter value = 257

> I can't speak for Alpha, but on K7 fadd latency == fmul latency. I have
> never seen anyone convert between the two. Usually it is more effective to
> use 3DNow! or SSE. The FP add and multiply units can both perform 2
> single-precision ops per cycle or 1 double-precision op per cycle, so for
> single-precision floating-point you can get twice the throughput using
> 3DNow! or SSE.

One paper to consider is C. Temperton, "A new set of minimum-add
small-n rotated DFT modules," J. of Comput. Phys., vol. 75,
pp. 190-198, 1988.  DFT (i.e. FFT) computations leave the
programmer with lots of freedom to choose between addition and
multiplication.  As a simple example, consider the matrix
multiplication:

a*x+b*y = a*(x+y)+(b-a)*y
c*x+a*y = a*(x+y)+(c-a)*x

Assuming the matrix, hence a, b, and c are known in advance so
that (b-a) and (c-a) can be precomputed and inserted into the
code, we can choose here between computing this multiplication
by a 2X2 Toeplitz matrix via 2 additions and 4 multiplications
or 3 additions and 3 multiplications.  A complex multiplication
can be considered a multiplication by a 2X2 Toeplitz matrix.
For more complicated instances, see my web site.

The freedom to choose between additions and multiplications
means there is no single best FFT algorithm; it all depends on
the processor -- some processors (ARM, perhaps?) don't have a
multiplier, so there is impetus towards minimizing multiplications.
An Alpha 21164 can do an FP multiply every clock cycle along
with 3 other operations, one of which is an FP add.  I you let
a tick go by on that processor without carrying out an FP
multiply, you have simply wasted the opportunity; you can't
use it for anything else, except copy sign.  On such a
processor you want to make the DFT algorithmic choices that
lead to the fewest FP adds, even at the cost or more total
operations provided the FP multiplies don't actually outnumber
the FP adds, and they don't for DFT algorithms.  On a 21264 it's
perhaps less clear because you could use that FP multiply slot
to issue an integer operation, but we might assume that FP
operations are more frequent than integer operations in FFT
code.  Athlons are ever murkier than this because they have
such a narrow decoder bandwidth -- how are you supposed to
write assembly code when you've only got 3 instructions per
clock cycle?  However, they still have a fully pipelined
floating point multiplier so they can in prinpicle benefit
from code with fewer FP additions and more FP multiplications.
Careful examination of the algorithms on my web page will
reveal that they are written with at least 2-way parallelism
in mind so that you could issue 2 additions on each clock
cycle and all multiplications could be carried out in
parallel with additions so that 4 FP operations could be
issued on many clock cycles if the processor is capable of
issuing a 2-way SIMD add + a 2-way SIMD multiply per clock
cycle.  At least for the bigger FFT algorithms, latency is
meaningless, only throughput counts.  Well, throughput and
the little matter or memory usage.  For really big FFTs,
the whole algorithm has to be rebuilt around memory access
patterns, as FFTs naturally want to touch memory in the
worst way possible.  I have some code like that, too, but
it's not in good enough shape to be on my web site.

-- 
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end

0
Reply James 12/19/2004 7:14:32 PM

--B=_H8082f536P5af3Hydro+iBKAWmOq017400
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit

Grumble wrote:

> Advait Raut wrote:
> 
>> It is said that MUL and DIV are very slow instructions as compared
>> to others, in 8086. Is it true for 486 and Pentium technology?
> 
> 
> Assume two's complement representation.
> 
> Addition is never slower than multiplication.
> Multiplication is never slower than division.
> 
> Specifically, on the K7, addition takes 1 cycle, multiplication
> takes 4-6 cycles, and division takes ~40 cycles.

That's very comparable to the PentiumPro (i.e. PentiumII/III/M) numbers.

P4 otoh has a much slower MUL, but still about four times faster than DIV.

Terje
-- 
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

--B=_H8082f536P5af3Hydro+iBKAWmOq017400
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline



***********************************************************************
NOTICE: This e-mail transmission, and any documents, files or previous
e-mail messages attached to it, may contain confidential or privileged
information. If you are not the intended recipient, or a person
responsible for delivering it to the intended recipient, you are
hereby notified that any disclosure, copying, distribution or use of
any of the information contained in or attached to this message is
STRICTLY PROHIBITED. If you have received this transmission in error,
please immediately notify the sender and delete the e-mail and attached
documents. Thank you.
***********************************************************************


--B=_H8082f536P5af3Hydro+iBKAWmOq017400--

0
Reply Terje 12/20/2004 10:42:57 AM

5 Replies
300 Views

(page loaded in 0.06 seconds)

Similiar Articles:













7/26/2012 2:53:15 PM


Reply: