Cortex-M3 vs PIC32 divide instruction

I've finally been considering a project to use either a
Cortex-M3 or a PIC32 processor and I've a technical question
unrelated to any "business issues" between these options --
the divide instruction operation.  Both of these cores
include one but I'm interested in any remarkable technical
details between them, including cycle counts but not limited
to that (load-store time is fair game.)

From what I've been able to garner from skimming the docs,
the Cortex-M3's MDU executes an SDIV or UDIV in anywhere from
2 to 12 clock cycles, but with a comment suggesting that it
takes less time when the operand sizes are similar.  Which
doesn't tell me what the typical time may be.  Also, it's
been a bit of a pain searching for good assembler docs on the
Cortex-M3.  But I've only been at it for about an hour or so,
so it's likely I am just slow and ignorant -- not that there
aren't good caches out there I should have found.

On the PIC32, the docs are clearer.  It's "one bit per clock"
and it includes an "early detection" of sign/zero bits in the
upper bytes to help goose that along where 7, 15, or 23 bits
worth might be skipped.  Worst case, it says, is 35 clocks.
It also stalls the 5-stage pipe if another division is issued
before the earlier one completes.
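
As a rough illustration of the "one bit per clock" scheme with early detection described above, here is a minimal C sketch.  It is only a model under stated assumptions: the function names are made up for this example, the 7/15/23-bit skip rule is taken directly from the wording above, and the fixed setup overhead that would account for the 35-clock worst case is not modelled.

```c
#include <stdint.h>

/* Hypothetical model: how many leading bits of a signed 32-bit dividend
 * could be skipped because the upper bytes are pure sign extension.
 * The 7/15/23 steps follow the post's description, not a datasheet. */
static int skipped_bits(int32_t dividend)
{
    /* Fold negative values so that sign bits look like leading zeros. */
    uint32_t mag = (dividend < 0) ? ~(uint32_t)dividend : (uint32_t)dividend;

    if ((mag >> 8) == 0)  return 23;  /* upper three bytes are sign/zero */
    if ((mag >> 16) == 0) return 15;  /* upper two bytes are sign/zero   */
    if ((mag >> 24) == 0) return 7;   /* upper byte is sign/zero         */
    return 0;
}

/* Restoring division producing one quotient bit per loop pass.
 * Only the low `bits` bits of n are examined, so the caller must make
 * sure n fits in `bits` bits (1..32) and that d is non-zero. */
static uint32_t div_bit_serial(uint32_t n, uint32_t d, int bits, int *passes)
{
    uint32_t q = 0;
    uint64_t r = 0;                    /* 64-bit so the shift cannot overflow */

    *passes = 0;
    for (int i = bits - 1; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1u);   /* bring down the next dividend bit */
        if (r >= d) {                     /* trial subtraction succeeds       */
            r -= d;
            q |= (1u << i);
        }
        (*passes)++;
    }
    return q;
}
```

Calling `div_bit_serial(200, 7, 32 - skipped_bits(200), &passes)` walks only the low 9 bits of the dividend, which is the kind of saving the early detection is meant to give for small operands.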

I am wondering if anyone has had direct experience playing
with either of these in the area of writing floating point
libraries and has had a chance to compare their relative
utility for that purpose and might comment on any relatively
significant details related to that effort -- speed being the
main question here.

At first blush, I'd say <=12 clocks is better than <=35.  But
there may be other issues.  And while I already know how the
PIC32 approach must work internally, I'm curious about exactly
what method the Cortex-M3 uses for its division operation --
it's not clear to me.
(VHDL or Verilog code would make that very clear to me, if
anyone has it or a pseudo version of it.)

Jon
Reply jonk (565) 9/6/2011 7:39:57 AM

On 06/09/2011 09:39, Jon Kirwan wrote:
> <snip>

There are many tricks that can be employed with hardware division to 
make it faster in all or some cases - there is no good way to guess how 
they are implemented in these two cpu's.  But there will not be any 
"hidden issues" - the division instructions on both architectures work, 
they are both slow, and the time varies depending on the operands in a 
way that is difficult to predict and virtually impossible to utilise. 
And in both cases, the timing of the divide instruction will be only a 
small part of a software floating point division routine - the
variations between different toolchains' floating point routines will be
much higher than the variation between run-times for divide on either 
processor.

I don't know what more you are looking for.  If you want to divide 
unknown integers, use the cpu's divide instruction.  If you want to
divide by a known constant integer, let the compiler handle it - either 
it will use the hardware divide instruction, or it will do something 
fancier like multiplying by the reciprocal scaled by a power of two. 
Knowing the nasty details of the hardware division implementation will 
not change that.
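
To make the "multiplying by the reciprocal scaled by a power of two" remark concrete, here is a minimal C sketch of the kind of code a compiler may emit for an unsigned divide by the constant 10.  The function name `div10` is made up for this example, and whether a given compiler actually picks this sequence over the hardware divide depends on the target and optimisation settings.

```c
#include <stdint.h>

/* x / 10 for any 32-bit unsigned x, without a divide instruction.
 * 0xCCCCCCCD is ceil(2^35 / 10); taking the 64-bit product and shifting
 * right by 35 yields the exact quotient for every 32-bit input. */
static uint32_t div10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}
```

The cost is one 32x32->64 multiply and a shift, so the run time no longer depends on the operand values the way the divide instruction's does.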

If you want to do very fast floating point, get a processor that has 
hardware floating point (Cortex-M4 will be available soon, there are 
real MIPS cpu's available instead of PIC32, there are plenty of 
PPC-based microcontrollers with hardware floating point, etc.).

Reply david2384 (1912) 9/6/2011 7:54:00 AM


On Tue, 06 Sep 2011 09:54:00 +0200, David Brown
<david@westcontrol.removethisbit.com> wrote:

>On 06/09/2011 09:39, Jon Kirwan wrote:
>> <snip>
>
>There are many tricks that can be employed with hardware division to 
>make it faster in all or some cases - there is no good way to guess how 
>they are implemented in these two cpu's.  But there will not be any 
>"hidden issues" - the division instructions on both architectures work, 
>they are both slow, and the time varies depending on the operands in a 
>way that is difficult to predict and virtually impossible to utilise. 
>And in both cases, the timing of the divide instruction will be only a 
>small part of a software floating point division routine - the
>variations between different toolchains' floating point routines will be
>much higher than the variation between run-times for divide on either 
>processor.
>
>I don't know what more you are looking for.  If you want to divide 
>unknown integers, use the cpu's divide instruction.  If you want to
>divide by a known constant integer, let the compiler handle it - either 
>it will use the hardware divide instruction, or it will do something 
>fancier like multiplying by the reciprocal scaled by a power of two. 
>Knowing the nasty details of the hardware division implementation will 
>not change that.
>
>If you want to do very fast floating point, get a processor that has 
>hardware floating point (Cortex-M4 will be available soon, there are 
>real MIPS cpu's available instead of PIC32, there are plenty of 
>PPC-based microcontrollers with hardware floating point, etc.).

I have other reasons that factor into this decision that
preclude any other choice, right now.  I'm not looking for
the fastest FP, anyway.  So that's not the primary goal here.
I am curious about the details.  That's all.  And I'd like to
make my _own_ judgment, not simply compare other peoples' FP
packages that already exist.  I'm looking at gaining a deep
understanding of these two processors' approaches in the
NARROW case of these particular instructions.

I do not need an education about "time varies" and "let the
compiler handle it."  You should know me well enough by now
for that.  I'm already prepared to examine flash, sram, and
cache issues.  I need to know the specific details here. Part
of where I may be going is into things you may not think to
consider, such as interrupt latency, for example, or simply
for self-education about how the Cortex-M3 does it (I already
_know_ how the PIC32 does it internally.)  Don't presume too
much about my purposes -- they are not run of the mill at the
very least.

I simply need very detailed information.  I've been having a
little difficulty laying hands on it in the Cortex-M3 case.
I'm hoping someone can point me well.

But thanks for the time.  It is appreciated.

Jon
Reply jonk (565) 9/6/2011 9:45:13 AM

On 09/06/2011 09:39 AM, Jon Kirwan wrote:
> <snip>
>
> On the PIC32, the docs are clearer.  It's "one bit per clock"
> and it includes an "early detection" of sign/zero bits in the
> upper bytes to help goose that along where 7, 15, or 23 bits
> worth might be skipped.  Worst case, it says, is 35 clocks.
> It also stalls the 5-stage pipe if another division is issued
> before the earlier one completes.
>

In the ARM reference there's the following comment: "Division operations 
use early termination to minimize the number of cycles required based on 
the number of leading ones and zeroes in the input operands."

That looks similar to what the PIC32 does, but with more bits/cycle.
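
One plausible reading of that comment, and of the remark that the divide is quicker when the operand sizes are similar, is that the number of quotient bits that can possibly be non-zero is bounded by the difference in operand lengths.  The following C sketch only demonstrates that bound; the function names are invented here, and it says nothing about how many cycles the Cortex-M3 actually spends per quotient bit.

```c
#include <stdint.h>

/* Number of significant bits in v (0 for v == 0). */
static int bit_length(uint32_t v)
{
    int n = 0;
    while (v) {
        n++;
        v >>= 1;
    }
    return n;
}

/* Upper bound on the number of significant quotient bits of n / d, d != 0.
 * Since n < 2^bits(n) and d >= 2^(bits(d) - 1), it follows that
 * n / d < 2^(bits(n) - bits(d) + 1). */
static int max_quotient_bits(uint32_t n, uint32_t d)
{
    int diff = bit_length(n) - bit_length(d) + 1;
    return diff > 0 ? diff : 0;
}
```

For operands of nearly equal magnitude the bound is a single bit, while 0xFFFFFFFF divided by 1 needs all 32, which matches the general shape of a 2 to 12 cycle range.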
Reply Arlet 9/6/2011 9:53:26 AM

On 06/09/2011 11:45, Jon Kirwan wrote:
> On Tue, 06 Sep 2011 09:54:00 +0200, David Brown
> <david@westcontrol.removethisbit.com>  wrote:
>
>> <snip>
>
> I have other reasons that factor into this decision that
> preclude any other choice, right now.  I'm not looking for
> the fastest FP, anyway.  So that's not the primary goal here.
> I am curious about the details.  That's all.  And I'd like to
> make my _own_ judgment, not simply compare other peoples' FP
> packages that already exist.  I'm looking at gaining a deep
> understanding of these two processors' approaches in the
> NARROW case of these particular instructions.
>
> I do not need an education about "time varies" and "let the
> compiler handle it."  You should know me well enough by now
> for that.

Yes, I know that - that made it a particularly odd question from you.

>  I'm already prepared to examine flash, sram, and
> cache issues.  I need to know the specific details here. Part
> of where I may be going is into things you may not think to
> consider, such as interrupt latency, for example, or simply
> for self-education about how the Cortex-M3 does it (I already
> _know_ how the PIC32 does it internally.)  Don't presume too
> much about my purposes -- they are not run of the mill at the
> very least.
>

When you ask for unusual information like this, the real purpose is 
important - otherwise I can only guess that it is /pure/ curiosity (and 
I can understand that as a reason, and wish I could help you there).

> I simply need very detailed information.  I've been having a
> little difficulty laying hands on it in the Cortex-M3 case.
> I'm hoping someone can point me well.

I would be surprised if you can get the detailed information you would 
like - such implementation details tend to be well hidden from mere mortals.

One thing you might be able to find out about is how the division 
affects pipelining - but on an M3, with its short pipeline, that won't 
make a big difference.

Regarding interrupts, AFAIK instructions on the M3 (and MIPS) are not 
interruptible (unlike some m68k cpus, for example), so maximum interrupt
latency will be affected by division instructions.

>
> But thanks for the time.  It is appreciated.
>
> Jon

Reply david2384 (1912) 9/6/2011 10:21:08 AM

Jon Kirwan <jonk@infinitefactors.org> wrote:
> Also, it's been a bit of a pain searching for good assembler docs on the
> Cortex-M3.  But I've only been at it for about an hour or so,
> so it's likely I am just slow and ignorant -- not that there
> aren't good caches out there I should have found.

You want the ARMv7-M Architecture Reference Manual off of ARM's website.

-a
Reply Anders.Montonen (86) 9/6/2011 10:32:53 AM

On Tue, 6 Sep 2011 10:32:53 +0000 (UTC),
Anders.Montonen@kapsi.spam.stop.fi.invalid wrote:

>Jon Kirwan <jonk@infinitefactors.org> wrote:
>> Also, it's been a bit of a pain searching for good assembler docs on the
>> Cortex-M3.  But I've only been at it for about an hour or so,
>> so it's likely I am just slow and ignorant -- not that there
>> aren't good caches out there I should have found.
>
>You want the ARMv7-M Architecture Reference Manual off of ARM's website.

I think I have that for the assembly part of things.  If you
are referring to the end, where the Appendices are, then I'm
already aware of those sections (B, C, F, G, H.)  I
did also look at the timing information in Chapter 18-1, for
example, of DDI0337 on the Cortex-M3 for r1p1, r2p0, and
r2p1.  Though perhaps I haven't read it well enough.

I think I have been there.  But I may have missed something,
too, and I appreciate the suggestion.

Jon
Reply jonk (565) 9/6/2011 11:17:01 AM

On Tue, 06 Sep 2011 12:21:08 +0200, David Brown
<david@westcontrol.removethisbit.com> wrote:

>On 06/09/2011 11:45, Jon Kirwan wrote:
>> On Tue, 06 Sep 2011 09:54:00 +0200, David Brown
>> <david@westcontrol.removethisbit.com>  wrote:
>>
>>> <snip>
>>
>> I have other reasons that factor into this decision that
>> preclude any other choice, right now.  I'm not looking for
>> the fastest FP, anyway.  So that's not the primary goal here.
>> I am curious about the details.  That's all.  And I'd like to
>> make my _own_ judgment, not simply compare other peoples' FP
>> packages that already exist.  I'm looking at gaining a deep
>> understanding of these two processors' approaches in the
>> NARROW case of these particular instructions.
>>
>> I do not need an education about "time varies" and "let the
>> compiler handle it."  You should know me well enough by now
>> for that.
>
>Yes, I know that - that made it a particularly odd question from you.
>
>>  I'm already prepared to examine flash, sram, and
>> cache issues.  I need to know the specific details here. Part
>> of where I may be going is into things you may not think to
>> consider, such as interrupt latency, for example, or simply
>> for self-education about how the Cortex-M3 does it (I already
>> _know_ how the PIC32 does it internally.)  Don't presume too
>> much about my purposes -- they are not run of the mill at the
>> very least.
>
>When you ask for unusual information like this, the real purpose is 
>important - otherwise I can only guess that it is /pure/ curiosity (and 
>I can understand that as a reason, and wish I could help you there).

The purpose is due diligence and to illuminate speculations I
may yet develop.  It's not a crystal clear process that I can
readily explain.  But I do know _what_ I want to know.

If it helps, imagine that I'd like to develop a cycle-
accurate simulator.

>> I simply need very detailed information.  I've been having a
>> little difficulty laying hands on it in the Cortex-M3 case.
>> I'm hoping someone can point me well.
>
>I would be surprised if you can get the detailed information you would 
>like - such implementation details tend to be well hidden from mere mortals.

Appears to be hidden from me, tonight.  So maybe you are
right.

I _am_ able to garner better information from the M4k.  I
still need to find out if the DIV can be interrupted.

>One thing you might be able to find out about is how the division 
>affects pipelining - but on an M3, with its short pipeline, that won't 
>make a big difference.

Yes, 3 stage vs 5 stage on the M4k.  I also took note that
Microchip licensed the M14k, too.

>Regarding interrupts, AFAIK instructions on the M3 (and MIPS) are not 
>interruptible (unlike some m68k cpus, for example), so maximum interrupt
>latency will be affected by division instructions.

Yes, that is one of several considerations I have in mind.
Only one of them.  But an important one.  I am not yet
certain about the M4k on this point.

Anyway, thanks for the thoughts.  I will see what I can find
out there.  It is an omen that you don't know.  So that
suggests your earlier point about the difficulty here may be
correct.

Jon

>>
>> But thanks for the time.  It is appreciated.
>>
>> Jon
Reply jonk (565) 9/6/2011 11:30:40 AM

On Tue, 06 Sep 2011 12:21:08 +0200, David Brown
<david@westcontrol.removethisbit.com> wrote:

><snip>
>Regarding interrupts, AFAIK instructions on the M3 (and MIPS) are not 
>interruptible
><snip>

So far, I've found the phrase "Autonomous multiply/divide
unit" in the datasheet for the 5xx, 6xx, and 7xx units from
Microchip.  Their dual bus choice also supports transaction
aborts to improve interrupt latency.  I already know that
issuing another MDU instruction before an earlier divide has
completed will result in an "IU pipeline stall."  But this
doesn't make it clear what happens if another MDU instruction
is NOT issued in the interrupt routine, for example.  It may
be possible that the "autonomous" unit works in parallel, so
long as no attempt is made to access the MDU until it is
done.  If so, that would be fine to learn.
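
The behaviour being speculated about here can be written down as a toy C model.  Everything in it is an assumption made for illustration (the type name `mdu_model`, the function `mdu_issue`, and above all the idea that the unit really does run in parallel with the pipeline); whether the hardware works this way is exactly what still needs confirming.

```c
#include <stdint.h>

/* Toy model of an "autonomous" multiply/divide unit.  Assumption: the
 * unit runs in parallel with the integer pipeline, and the pipeline
 * stalls only when a new MDU instruction is issued while it is busy. */
typedef struct {
    uint64_t busy_until;   /* cycle at which the current operation ends */
} mdu_model;

/* Issue an MDU operation at cycle `now` that takes `latency` cycles.
 * Returns the number of stall cycles the pipeline would see. */
static uint64_t mdu_issue(mdu_model *m, uint64_t now, uint64_t latency)
{
    uint64_t stall = (m->busy_until > now) ? (m->busy_until - now) : 0;

    m->busy_until = now + stall + latency;  /* new op starts after the stall */
    return stall;
}
```

Under this model an interrupt routine that never touches the MDU sees no extra latency from a divide in flight, while one that issues its own divide 10 cycles into a 35-cycle operation would stall for the remaining 25.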

I'll write Microchip on this point to get clarification.  You
may be right about all this.  Might as well dot that i, cross
that t.

BTW, I am also considering porting my own O/S to either the
Cortex-M3 or the PIC32.  But again, this is only one facet of
what I'm thinking about.  It is NOT the totality.  But this
question is germane here, too.

Jon
Reply jonk (565) 9/6/2011 12:14:03 PM

Jon Kirwan <jonk@infinitefactors.org> wrote:

> I still need to find out if the DIV can be interrupted.

Footnote e to table 18-1 in the Cortex-M3 r2p0 TRM states that
"DIV is interruptible (abandoned/restarted), with worst case latency of
one cycle."

-a
Reply Anders.Montonen (86) 9/6/2011 12:40:00 PM

On Tue, 6 Sep 2011 12:40:00 +0000 (UTC),
Anders.Montonen@kapsi.spam.stop.fi.invalid wrote:

>Jon Kirwan <jonk@infinitefactors.org> wrote:
>
>> I still need to find out if the DIV can be interrupted.
>
>Footnote e to table 18-1 in the Cortex-M3 r2p0 TRM states that
>"DIV is interruptible (abandoned/restarted), with worst case latency of
>one cycle."
>
>-a

Thanks!

Jon
Reply jonk (565) 9/6/2011 1:01:48 PM

On 06/09/2011 13:30, Jon Kirwan wrote:
> On Tue, 06 Sep 2011 12:21:08 +0200, David Brown
> <david@westcontrol.removethisbit.com>  wrote:
> The purpose is due diligence and to illuminate speculations I
> may yet develop.  It's not a crystal clear process that I can
> readily explain.  But I do know _what_ I want to know.
>
> If it helps, imagine that I'd like to develop a cycle-
> accurate simulator.
>
>>> I simply need very detailed information.  I've been having a
>>> little difficulty laying hands on it in the Cortex-M3 case.
>>> I'm hoping someone can point me well.
>>
>> I would be surprised if you can get the detailed information you would
>> like - such implementation details tend to be well hidden from mere mortals.
>
> Appears to be hidden from me, tonight.  So maybe you are
> right.
>
> I _am_ able to garner better information from the M4k.  I
> still need to find out if the DIV can be interrupted.
>

The M4K is an older architecture (or at least it is closer to the older 
MIPS architectures), with a simpler structure and lots more information 
about it.  You'll have better luck there.

>> One thing you might be able to find out about is how the division
>> affects pipelining - but on an M3, with its short pipeline, that won't
>> make a big difference.
>
> Yes, 3 stage vs 5 stage on the M4k.  I also took note that
> Microchip licensed the M14k, too.
>
>> Regarding interrupts, AFAIK instructions on the M3 (and MIPS) are not
>> interruptible (unlike some m68k cpus, for example), so maximum interrupt
>> latency will be affected by division instructions.
>
> Yes, that is one of several considerations I have in mind.
> Only one of them.  But an important one.  I am not yet
> certain about the M4k on this point.
>

The key thing to look for here is the data that is stored on the stack, 
or in dedicated registers, when an interrupt or other exception hits. 
On the m68k, for example, the processor can generate a rather extensive 
stack frame including the state of internal registers that are not 
otherwise accessible, holding partial results for division, progress 
counters for move-multiple instructions, etc.  On RISC architectures you 
don't get a stack frame for exceptions, but critical context data is put 
into dedicated registers that must be preserved if you are going to 
enable nested interrupts.  You should be able to see from the details of 
these registers where things can be interrupted.

> Anyway, thanks for the thoughts.  I will see what I can find
> out there.  It is an omen that you don't know.  So that
> suggests your earlier point about the difficulty here may be
> correct.
>

While I know many things, I don't know everything!  My knowledge of MIPS 
is based on a book on the "MIPS RISC Microarchitecture" I found in a 
second hand bookstore 20 years ago, and read for fun before I had even 
thought of doing embedded programming as a job.

> Jon
>
>>>
>>> But thanks for the time.  It is appreciated.
>>>
>>> Jon

Reply david2384 (1912) 9/6/2011 1:31:58 PM

On 06/09/2011 14:40, Anders.Montonen@kapsi.spam.stop.fi.invalid wrote:
> Jon Kirwan<jonk@infinitefactors.org>  wrote:
>
>> I still need to find out if the DIV can be interrupted.
>
> Footnote e to table 18-1 in the Cortex-M3 r2p0 TRM states that
> "DIV is interruptible (abandoned/restarted), with worst case latency of
> one cycle."
>

OK, it is interruptible in that way - that's good for avoiding long 
interrupt latency.  Some cpu's (such as some m68k devices) can be 
interrupted in the middle of an instruction like divide, and then 
continue where they left off rather than starting anew.


Reply david2384 (1912) 9/6/2011 1:34:23 PM

On Tue, 06 Sep 2011 15:31:58 +0200, David Brown
<david@westcontrol.removethisbit.com> wrote:

>On 06/09/2011 13:30, Jon Kirwan wrote:
>> On Tue, 06 Sep 2011 12:21:08 +0200, David Brown
>> <david@westcontrol.removethisbit.com>  wrote:
>> The purpose is due diligence and to illuminate speculations I
>> may yet develop.  It's not a crystal clear process that I can
>> readily explain.  But I do know _what_ I want to know.
>>
>> If it helps, imagine that I'd like to develop a cycle-
>> accurate simulator.
>>
>>>> I simply need very detailed information.  I've been having a
>>>> little difficultly laying hands on it in the Cortex-M3 case.
>>>> I'm hoping someone can point me well.
>>>
>>> I would be surprised if you can get the detailed information you would
>>> like - such implementation details tend to be well hidden from mere mortals.
>>
>> Appears to be hidden from me, tonight.  So maybe you are
>> right.
>>
>> I _am_ able to garner better information from the M4k.  I
>> still need to find out if the DIV can be interrupted.
>
>The M4K is an older architecture (or at least it is closer to the older 
>MIPS architectures), with a simpler structure and lots more information 
>about it.  You'll get better luck there.

ARM has been around a LONG time.  But I worked on MIPS R2000
back circa 1986/1987.  Was that before the ARM/Acorn?  I
don't recall when the R4000 came out but it must have been
after the Acorn.  I think trying to decide which is older is
going to be a bunch of quibbling.

>>> One thing you might be able to find out about is how the division
>>> affects pipelining - but on an M3, with its short pipeline, that won't
>>> make a big difference.
>>
>> Yes, 3 stage vs 5 stage on the M4k.  I also took note that
>> Microchip licensed the M14k, too.
>>
>>> Regarding interrupts, AFAIK instructions on the M3 (and MIPS) are not
>>> interruptable (unlike some m68k cpus, for example), so maximum interrupt
>>> latency will be affected by division instructions.
>>
>> Yes, that is one of several considerations I have in mind.
>> Only one of them.  But an important one.  I am not yet
>> certain about the M4k on this point.
>
>The key thing to look for here is the data that is stored on the stack, 
>or in dedicated registers, when an interrupt or other exception hits. 
>On the m68k, for example, the processor can generate a rather extensive 
>stack frame including the state of internal registers that are not 
>otherwise accessible, holding partial results for division, progress 
>counters for move-multiple instructions, etc.  On RISC architectures you 
>don't get a stack frame for exceptions, but critical context data is put 
>into dedicated registers that must be preserved if you are going to 
>enable nested interrupts.  You should be able to see from the details of 
>these registers where things can be interrupted.

There's a point for me to go look up.

>> Anyway, thanks for the thoughts.  I will see what I can find
>> out there.  It is an omen that you don't know.  So that
>> suggests your earlier point about the difficulty here may be
>> correct.
>
>While I know many things, I don't know everything!  My knowledge of MIPS 
>is based on a book on the "MIPS RISC Microarchitecture" I found in a 
>second hand bookstore 20 years ago, and read for fun before I had even 
>thought of doing embedded programming as a job.

Mine all comes from working with the R2000 and a nice, long
lecture for a couple of days from Hennessy when I visited
them back when they first opened up an office near Weitek
(their first office.)  I'm very comfortable with the R2000.

Jon
Reply jonk (565) 9/6/2011 1:51:04 PM

On 09/06/2011 03:31 PM, David Brown wrote:

> The key thing to look for here is the data that is stored on the stack,
> or in dedicated registers, when an interrupt or other exception hits. On
> the m68k, for example, the processor can generate a rather extensive
> stack frame including the state of internal registers that are not
> otherwise accessible, holding partial results for division, progress
> counters for move-multiple instructions, etc. On RISC architectures you
> don't get a stack frame for exceptions, but critical context data is put
> into dedicated registers that must be preserved if you are going to
> enable nested interrupts. You should be able to see from the details of
> these registers where things can be interrupted.

Interestingly, the Cortex isn't very pure RISC anymore, and it does have 
a stack frame for exceptions. It doesn't save partial results, but it 
does save a couple of registers, which allow an interrupt handler to be 
written in pure C, and it allows hardware nesting of interrupts. The 
link register which normally contains the return address is set to a 
magic value, so on function return, the core knows to do a return from 
exception instead.
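
As a rough illustration of the frame described above, here is a minimal C sketch of the eight words an ARMv7-M core without an FPU pushes on exception entry, plus a check for the "magic" link register value. The type and function names (`cm3_exception_frame_t`, `is_exc_return`) are made up for this example and are not taken from any vendor header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Basic exception frame (no FPU), lowest stack address first.
 * The names here are illustrative assumptions, not an official API. */
typedef struct {
    uint32_t r0, r1, r2, r3;  /* caller-saved argument/scratch registers */
    uint32_t r12;             /* intra-procedure scratch register */
    uint32_t lr;              /* link register of the interrupted code */
    uint32_t pc;              /* return address into the interrupted code */
    uint32_t xpsr;            /* program status register */
} cm3_exception_frame_t;

static_assert(sizeof(cm3_exception_frame_t) == 32,
              "the basic frame is eight 32-bit words");

/* On exception entry LR holds an EXC_RETURN value whose top four bits
 * are all ones; branching to such a value in handler mode triggers an
 * exception return instead of an ordinary function return. */
static inline int is_exc_return(uint32_t lr)
{
    return (lr >> 28) == 0xFu;
}
```

Because the hardware saves exactly the registers the C calling convention treats as caller-saved, an ordinary C function can serve as a handler without an assembly wrapper.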
Reply Arlet 9/6/2011 2:31:53 PM

On 06/09/2011 15:51, Jon Kirwan wrote:
> On Tue, 06 Sep 2011 15:31:58 +0200, David Brown
> <david@westcontrol.removethisbit.com>  wrote:
>
>> On 06/09/2011 13:30, Jon Kirwan wrote:
>>> On Tue, 06 Sep 2011 12:21:08 +0200, David Brown
>>> <david@westcontrol.removethisbit.com>   wrote:
>>> The purpose is due diligence and to illuminate speculations I
>>> may yet develop.  It's not a crystal clear process that I can
>>> readily explain.  But I do know _what_ I want to know.
>>>
>>> If it helps, imagine that I'd like to develop a cycle-
>>> accurate simulator.
>>>
>>>>> I simply need very detailed information.  I've been having a
>>>>> little difficultly laying hands on it in the Cortex-M3 case.
>>>>> I'm hoping someone can point me well.
>>>>
>>>> I would be surprised if you can get the detailed information you would
>>>> like - such implementation details tend to be well hidden from mere mortals.
>>>
>>> Appears to be hidden from me, tonight.  So maybe you are
>>> right.
>>>
>>> I _am_ able to garner better information from the M4k.  I
>>> still need to find out if the DIV can be interrupted.
>>
>> The M4K is an older architecture (or at least it is closer to the older
>> MIPS architectures), with a simpler structure and lots more information
>> about it.  You'll get better luck there.
>
> ARM has been around a LONG time.  But I worked on MIPS R2000
> back circa 1986/1987.  Was that before the ARM/Acorn?  I
> don't recall when the R4000 came out but it must have been
> after the Acorn.  I think trying to decide which is older is
> going to be a bunch of quibbling.
>

Yes, ARM has been around for ages - it was probably around 1988 that I 
first used an ARM (Acorn RISC Machine) on an Archimedes.  But the 
architecture has gone through a great many changes since then - the 
Cortex M3 is significantly different both in programming model and in 
implementation.  MIPS has remained a lot more constant.  So the M3 is 
really only a few years old, while the R4000 is /much/ older, and much 
more studied.

>>>> One thing you might be able to find out about is how the division
>>>> affects pipelining - but on an M3, with its short pipeline, that won't
>>>> make a big difference.
>>>
>>> Yes, 3 stage vs 5 stage on the M4k.  I also took note that
>>> Microchip licensed the M14k, too.
>>>
>>>> Regarding interrupts, AFAIK instructions on the M3 (and MIPS) are not
>>>> interruptable (unlike some m68k cpus, for example), so maximum interrupt
>>>> latency will be affected by division instructions.
>>>
>>> Yes, that is one of several considerations I have in mind.
>>> Only one of them.  But an important one.  I am not yet
>>> certain about the M4k on this point.
>>
>> The key thing to look for here is the data that is stored on the stack,
>> or in dedicated registers, when an interrupt or other exception hits.
>> On the m68k, for example, the processor can generate a rather extensive
>> stack frame including the state of internal registers that are not
>> otherwise accessible, holding partial results for division, progress
>> counters for move-multiple instructions, etc.  On RISC architectures you
>> don't get a stack frame for exceptions, but critical context data is put
>> into dedicated registers that must be preserved if you are going to
>> enable nested interrupts.  You should be able to see from the details of
>> these registers where things can be interrupted.
>
> There's a point for me to go look up.
>
>>> Anyway, thanks for the thoughts.  I will see what I can find
>>> out there.  It is an omen that you don't know.  So that
>>> suggests your earlier point about the difficulty here may be
>>> correct.
>>
>> While I know many things, I don't know everything!  My knowledge of MIPS
>> is based on a book on the "MIPS RISC Microarchitecture" I found in a
>> second hand bookstore 20 years ago, and read for fun before I had even
>> thought of doing embedded programming as a job.
>
> Mine all comes from working with the R2000 and a nice, long
> lecture for a couple of days from Hennessey when I visited
> them back when they first opened up an office near Weitek
> (their first office.)  I'm very comfortable with the R2000.
>
> Jon

Reply david2384 (1912) 9/6/2011 2:56:02 PM

On Tue, 06 Sep 2011 09:54:00 +0200, David Brown wrote:

> On 06/09/2011 09:39, Jon Kirwan wrote:

>> snip <<

> If you want to do very fast floating point, get a processor that has
> hardware floating point (Cortex-M4 will be available soon, there are
> real MIPS cpu's available instead of PIC32, there are plenty of
> PPC-based microcontrollers with hardware floating point, etc.).

Sometimes the goal is to write fast-enough floating point in a processor 
that won't otherwise break the system budget, be it power consumption/
dissipation, size, BOM cost, etc.

Jon's asking about _writing_ a floating point library, so I assume he's 
working at a project front-end, counting clock cycles to make sure that 
things will work.

-- 
www.wescottdesign.com
Reply tim177 (4425) 9/6/2011 4:28:09 PM

On Sep 6, 11:17 pm, Jon Kirwan <j...@infinitefactors.org> wrote:
> On Tue, 6 Sep 2011 10:32:53 +0000 (UTC),
>
> Anders.Monto...@kapsi.spam.stop.fi.invalid wrote:
> >Jon Kirwan <j...@infinitefactors.org> wrote:
> >> Also, it's been a bit of a pain searching for good assembler docs on the
> >> Cortex-M3.  But I've only been at it for about an hour or so,
> >> so it's likely I am just slow and ignorant -- not that there
> >> aren't good caches out there I should have found.
>
> >You want the ARMv7-M Architecture Reference Manual off of ARM's website.
>
> I think I have that for the assembly part of things.  If you
> are referring to the near-end where the Appendices are at,
> then I'm already aware of those sections (B, C, F, G, H.)  I
> did also look at the timing information in Chapter 18-1, for
> example, of DDI0337 on the Cortex-M3 for r1p1, r2p0, and
> r2p1.  Though perhaps I haven't read it well enough.
>
> I think I have been there.  But I may have missed something,
> too, and I appreciate the suggestion
>
> Jon

If the speed of this matters a lot, you are best off simply getting a
device and trying it.
'Modern data' tends to be more and more superficial, and that is one
reason there are more cheap Eval/Starter kits.
Note that other devices are not standing still either - I see both
TI and ADI are now boasting of sub-$2 DSPs (though RAM-based).

TI's strangely lacks timer capture (they must want you to buy other
variants there) but does have high-speed USB for a small cost adder.
ADI's has good timers, but no USB.
Both, of course, have very fast maths support, and quite large ROMs
with floating point as well.
 -jg
Reply j.m.granville (126) 9/7/2011 11:42:13 AM

On Sep 7, 2:42 pm, Jim Granville <j.m.granvi...@gmail.com> wrote:
> On Sep 6, 11:17 pm, Jon Kirwan <j...@infinitefactors.org> wrote:
>
> > On Tue, 6 Sep 2011 10:32:53 +0000 (UTC),
>
> > Anders.Monto...@kapsi.spam.stop.fi.invalid wrote:
> > >Jon Kirwan <j...@infinitefactors.org> wrote:
> > >> Also, it's been a bit of a pain searching for good assembler docs on the
> > >> Cortex-M3.  But I've only been at it for about an hour or so,
> > >> so it's likely I am just slow and ignorant -- not that there
> > >> aren't good caches out there I should have found.
>
> > >You want the ARMv7-M Architecture Reference Manual off of ARM's website.
>
> > I think I have that for the assembly part of things.  If you
> > are referring to the near-end where the Appendices are at,
> > then I'm already aware of those sections (B, C, F, G, H.)  I
> > did also look at the timing information in Chapter 18-1, for
> > example, of DDI0337 on the Cortex-M3 for r1p1, r2p0, and
> > r2p1.  Though perhaps I haven't read it well enough.
>
> > I think I have been there.  But I may have missed something,
> > too, and I appreciate the suggestion
>
> > Jon
>
>  If the speed of this matters a lot, you are best to simply get a
> device, and try it.
>  'Modern data' tends to be more and more superficial, and that is one
> reason there are more cheap Eval/Starter kits.
>  Note that other  devices are not standing still either - I see both
> TI and ADI are now boasting of sub $2 DSPs (tho RAM based)
>
>  TI's strangely lacks Timer capture, (they must want you to buy other
> variants there) but does have high speed USB for a small cost adder.
>  ADIs has good timers, but no USB.
>  Both, of course, have very fast maths support, and quite large ROMS
> with Floating point as well.
>  -jg

The last (only... :) ) time I used a TI DSP was approx. 10 years ago,
the 5420. Their divide was straightforward: use "subtract
conditionally" in a repeat (penalty-free) loop.
I have also wondered - just vaguely, though - how they accelerate
division on various architectures. E.g. the Power core I use now
needs only 14 (or was it 16?) cycles for a 32/32, while older
implementations of that core (the original 603e, that is) needed
30+, 37 IIRC.
I have moaned so many times about having to write yet another
division (I think the only architecture which saved me that was the
68k; PPC didn't, as it does not have the 64/32 divide the 68k has, not
on 32-bit machines) that I'll use the chance to ask Jon to share his
findings. I am also really curious about it.

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments

http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/sets/72157600228621276/
Reply dp (745) 9/7/2011 5:27:57 PM

On Wed, 7 Sep 2011 04:42:13 -0700 (PDT), Jim Granville
<j.m.granville@gmail.com> wrote:

>On Sep 6, 11:17 pm, Jon Kirwan <j...@infinitefactors.org> wrote:
>> On Tue, 6 Sep 2011 10:32:53 +0000 (UTC),
>>
>> Anders.Monto...@kapsi.spam.stop.fi.invalid wrote:
>> >Jon Kirwan <j...@infinitefactors.org> wrote:
>> >> Also, it's been a bit of a pain searching for good assembler docs on the
>> >> Cortex-M3.  But I've only been at it for about an hour or so,
>> >> so it's likely I am just slow and ignorant -- not that there
>> >> aren't good caches out there I should have found.
>>
>> >You want the ARMv7-M Architecture Reference Manual off of ARM's website.
>>
>> I think I have that for the assembly part of things.  If you
>> are referring to the near-end where the Appendices are at,
>> then I'm already aware of those sections (B, C, F, G, H.)  I
>> did also look at the timing information in Chapter 18-1, for
>> example, of DDI0337 on the Cortex-M3 for r1p1, r2p0, and
>> r2p1.  Though perhaps I haven't read it well enough.
>>
>> I think I have been there.  But I may have missed something,
>> too, and I appreciate the suggestion
>>
>> Jon
>
> If the speed of this matters a lot, you are best to simply get a
>device, and try it.

Jim, there is a difference between knowing something through
theory and knowing something only through experimental
result.  Although it is _practical_ and often _sufficient_ to
know through result, it is also true that all I'd learn is
the results for the specific cases I'm able to spend time
testing.  Theory informs a volume.  Results inform specific
points within that volume.  I want both.  Just buying a
device only gives me a few data points.  That's not enough.

In the case of the PIC32, I have the theory.  So I am fully
able to predict just about any situation I'm given.  (Except
that I still don't have the theory about what happens in the
presence of an exception -- but I will get that from
Microchip directly.)

Anyway, I know you are being practical.  But I want to go
beyond knowing only what a few tests may tell me.

> 'Modern data' tends to be more and more superficial, and that is one
>reason there are more cheap Eval/Starter kits.

Yes, but the designers _know_ the theory.  So it is available
somewhere.  And I'm not really wanting to poke out
experimental results and try and develop theories of my own
that match what I observe when it might just be nice to get
the low-down from someone who actually knows what is going
on.  Which is why I decided to just ask here.  (The other
option would be to write ARM, I suppose -- and I will do that
if nothing comes of the details here and simply hope they are
moved to respond to me.  I _know_ Microchip will respond,
from past experience with them.)

> Note that other  devices are not standing still either - I see both
>TI and ADI are now boasting of sub $2 DSPs (tho RAM based)
> TI's strangely lacks Timer capture, (they must want you to buy other
>variants there) but does have high speed USB for a small cost adder.
> ADIs has good timers, but no USB.
> Both, of course, have very fast maths support, and quite large ROMS
>with Floating point as well.

I am familiar with older families from both through coding
applications -- the ADSP-21xx from ADI; the TMS320C30 and C40
from TI.  I'm not completely unaware of newer parts, too.

But like most projects, there are a number of boundary
conditions involved and the DIV details I mentioned are only
one of many.  But DSP processing is decidedly NOT the main
focus nor is floating point.  I merely mentioned FP as a
segue, because I felt that anyone writing assembly coded FP
would possibly know the theory I was looking for.  That
doesn't mean that is my focus.  I also mentioned interrupt
latency issues, later.  There are many considerations.

Jon
Reply jonk (565) 9/7/2011 8:38:15 PM

On Wed, 7 Sep 2011 10:27:57 -0700 (PDT), dp <dp@tgi-sci.com>
wrote:

><snip>
>I also have wondered - just vaguely, though - how do they accelerate
>division on various architectures, e.g. the power core I use now
>needs only 14 (or was it 16?) cycles for a 32/32, older
>implementations
>of that core (the original 603e, that is) needed 30+, 37 IIRC.
>I have been moaning so many times of having to write yet another
>division - I think the only architecture which saved me that was the
>68k, ppc didn't, it does not have the 64/32 68k has, not on 32 bit
>machines) that I use the chance to ask Jon to share his findings,
>I am also really curious about it.
>
>Dimiter

I am similarly curious and would like to know the theoretical
details.

If I do uncover the details in the Cortex-M3 case, I'll write
a little something about it here.  It's possible that there
are some university docs I'll find that clue me in.  I might
even get lucky and someone at ARM may respond kindly.  It may
be that someone here knows, too, but just hasn't said as
much, yet.  Chances are this isn't some deep dark secret.
Just that I haven't yet come across it, is all.  I am
remarkably ignorant.

If you are interested in the details regarding the M4K
(PIC32) method, then I can write a lot on that unremarkable
topic.  That one is easy.  I could design the hardware myself
almost in my sleep.

Jon
Reply jonk (565) 9/7/2011 9:15:59 PM

On 07/09/2011 19:27, dp wrote:

> Last (only...:) ) time I used a TI DSP was apr. 10 years ago,
> the 5420. Their divide was straight forward, use "subtract
> conditionally"
> in a repeat (penalty free) loop.
> I also have wondered - just vaguely, though - how do they accelerate
> division on various architectures, e.g. the power core I use now
> needs only 14 (or was it 16?) cycles for a 32/32, older
> implementations
> of that core (the original 603e, that is) needed 30+, 37 IIRC.

The basic division algorithm is a "subtract conditionally" loop, which 
is approximately one cycle per loop.  You can double the speed by simply 
doing two bits at a time.  That takes more hardware - you have to do 
three comparisons in parallel rather than just one, but it's not too 
expensive (multiplying by 0b00, 0b01, and 0b10 are all zero cost, and 
multiply by 0b11 is not hard).  That trick does not scale well, however 
- doing another bit in each cycle means much more hardware, and the 
depth of the combinational logic involved will mean slower clock speeds.
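
To make the two variants above concrete, here is a minimal software sketch in C of a radix-2 "subtract conditionally" loop and a radix-4 version that retires two quotient bits per step. It is only a model of the idea, with made-up function names, and says nothing about how any particular core actually wires its divider.

```c
#include <assert.h>
#include <stdint.h>

/* Radix-2 restoring division: one quotient bit per iteration,
 * 32 iterations for 32-bit operands. */
uint32_t udiv_radix2(uint32_t n, uint32_t d, uint32_t *rem)
{
    assert(d != 0);
    uint64_t r = 0;                      /* partial remainder (needs 33 bits) */
    uint32_t q = 0;
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1u);  /* bring down the next dividend bit */
        q <<= 1;
        if (r >= d) {                    /* the conditional subtract */
            r -= d;
            q |= 1u;
        }
    }
    *rem = (uint32_t)r;
    return q;
}

/* Radix-4 variant: two quotient bits per iteration, 16 iterations.
 * Each step compares against d, 2d and 3d; in hardware those three
 * comparisons would run in parallel. */
uint32_t udiv_radix4(uint32_t n, uint32_t d, uint32_t *rem)
{
    assert(d != 0);
    const uint64_t d1 = d, d2 = 2 * d1, d3 = 3 * d1;
    uint64_t r = 0;
    uint32_t q = 0;
    for (int i = 30; i >= 0; i -= 2) {
        r = (r << 2) | ((n >> i) & 3u);  /* bring down two dividend bits */
        q <<= 2;
        if (r >= d3)      { r -= d3; q |= 3u; }
        else if (r >= d2) { r -= d2; q |= 2u; }
        else if (r >= d1) { r -= d1; q |= 1u; }
    }
    *rem = (uint32_t)r;
    return q;
}
```

The 2d multiple is just a shift and 3d is one extra add done once up front, which is why the two-bit step is cheap; each further bit per step adds more multiples to compare against.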

You can also save a clock cycle or two at the ends of the algorithm by 
careful setup - the cycles are usually still there in the latency, but 
get hidden within the rest of the instruction pipeline.

Early-exit testing can also be done - typically once the numerator part 
has been reduced to 0 (or less than the denominator), you can do a fast 
exit.  Some implementations may also have tricks like barrel-shifting at 
the start to "cancel out" any factors of 2 in the figures.
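
One way to picture the early-exit idea is a shift-subtract loop that skips the quotient bits already known to be zero, so the iteration count follows the difference in operand magnitudes. The sketch below is an assumption-laden software model (the names `clz32` and `udiv_early` are invented here), not a description of the Cortex-M3 or M4K hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Portable count of leading zero bits (returns 32 for x == 0). */
static int clz32(uint32_t x)
{
    int n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) { x <<= 1; n++; }
    return n;
}

/* Shift-subtract division that only iterates over quotient bits that
 * can be non-zero. *steps receives the number of loop iterations. */
uint32_t udiv_early(uint32_t n, uint32_t d, uint32_t *rem, int *steps)
{
    assert(d != 0);
    *steps = 0;
    if (n < d) {                         /* fast exit: quotient is zero */
        *rem = n;
        return 0;
    }
    /* n >= d here, so the quotient has at most this many bits. */
    int bits = clz32(d) - clz32(n) + 1;
    uint64_t r = (uint64_t)n >> bits;    /* top part of n, always < d */
    uint32_t q = 0;
    for (int i = bits - 1; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1u);
        q <<= 1;
        if (r >= d) { r -= d; q |= 1u; }
        (*steps)++;
    }
    *rem = (uint32_t)r;
    return q;
}
```

With similar-sized operands the loop runs only a handful of times, while a full-width dividend over a tiny divisor still takes all 32 steps, so the worst case is not improved.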

Beyond that, faster division is usually done by computing the 
reciprocal, then multiplying.  That is particularly useful for large 
bitwidths and floating point (e.g., hardware 64-bit floating point).

For integer work, it is usually best to leave that to the compiler's 
optimiser - a compiler may turn "x/3" into "(x * (2^n / 3)) >> n" for a 
suitable n.  This generally only makes sense if the cpu can quickly 
multiply numbers of twice the bitlength of x.
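
For the "x/3" case, a worked instance of that transformation for unsigned 32-bit x uses n = 33 and the rounded-up constant ceil(2^33 / 3) = 0xAAAAAAAB. The sketch below is hand-written to show the arithmetic; the names `DIV3_MAGIC` and `div3` are made up, and a real compiler may pick a different but equivalent sequence.

```c
#include <assert.h>
#include <stdint.h>

/* ceil(2^33 / 3); rounding up is what makes the result exact for
 * every 32-bit x. */
#define DIV3_MAGIC 0xAAAAAAABull

static_assert(3ull * DIV3_MAGIC == (1ull << 33) + 1,
              "magic constant must equal ceil(2^33 / 3)");

/* Unsigned divide by 3 using one 32x32->64 multiply and a shift. */
uint32_t div3(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * DIV3_MAGIC) >> 33);
}
```

The 64-bit product is the "twice the bitlength" multiply mentioned above: on a core with a fast 32x32->64 multiply this costs a few cycles regardless of the value of x.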


> I have been moaning so many times of having to write yet another
> division - I think the only architecture which saved me that was the
> 68k, ppc didn't, it does not have the 64/32 68k has, not on 32 bit
> machines) that I use the chance to ask Jon to share his findings,
> I am also really curious about it.
>

Reply david2384 (1912) 9/8/2011 7:31:01 AM

On Sep 8, 10:31 am, David Brown <da...@westcontrol.removethisbit.com>
wrote:
> On 07/09/2011 19:27, dp wrote:
>
> > Last (only...:) ) time I used a TI DSP was apr. 10 years ago,
> > the 5420. Their divide was straight forward, use "subtract
> > conditionally"
> > in a repeat (penalty free) loop.
> > I also have wondered - just vaguely, though - how do they accelerate
> > division on various architectures, e.g. the power core I use now
> > needs only 14 (or was it 16?) cycles for a 32/32, older
> > implementations
> > of that core (the original 603e, that is) needed 30+, 37 IIRC.
>
> The basic division algorithm is a "subtract conditionally" loop, which
> is approximately one cycle per loop.  You can double the speed by simply
> doing two bits at a time.  That takes more hardware - you have to do
> three comparisons in parallel rather than just one, but it's not too
> expensive (multiplying by 0b00, 0b01, and 0b10 are all zero cost, and
> multiply by 0b11 is not hard).  That trick does not scale well, however
> - doing another bit in each cycle means much more hardware, and the
> depth of the combinational logic involved will mean slower clock speeds.
>
> You can also save a clock cycle or two at the ends of the algorithm by
> careful setup - the cycles are usually still there in the latency, but
> get hidden within the rest of the instruction pipeline.
>
> Early-exit testing can also be done - typically once the numerator part
> has been reduced to 0 (or less than the numerator), you can do a fast
> exit.  Some implementations may also have tricks like barrel-shifting at
> the start to "cancel out" any factors of 2 in the figures.
>
> Beyond that, faster division is usually done by computing the
> reciprocal, then multiplying.  That is particularly useful for large
> bitwidths and floating point (i.e., hardware 64-bit floating point).
>
> For integer work, it is usually best to leave that to the compiler's
> optimiser - a compiler may turn "x/3" into "(x * (2^n / 3)) >> n" for a
> suitable n.  This generally only makes sense if the cpu can quickly
> multiply numbers of twice the bitlength of x.
>
> > I have been moaning so many times of having to write yet another
> > division - I think the only architecture which saved me that was the
> > 68k, ppc didn't, it does not have the 64/32 68k has, not on 32 bit
> > machines) that I use the chance to ask Jon to share his findings,
> > I am also really curious about it.
>
>

Oh I know what can be done, Jon apparently does too.
Like I said I have written numerous divisions on various machines.
I did not use the "count leading zeroes" on the power 64/32 division
though (was tempted but my time was a lot more important than the CPUs
back then). Then dropping out on a 0 dividend (and/or shifting in
advance
by the least of leading zeroes) can reduce execution time in some
(many) cases, but it does add cycles to the worst case so I am
not sure I would go for it anyway, would be application specific
I guess if it goes that narrow to the wire.

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments

http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/sets/72157600228621276/

Reply dp (745) 9/8/2011 10:40:42 AM

On Thu, 8 Sep 2011 03:40:42 -0700 (PDT), dp <dp@tgi-sci.com>
wrote:

>On Sep 8, 10:31 am, David Brown <da...@westcontrol.removethisbit.com>
>wrote:
>> On 07/09/2011 19:27, dp wrote:
>>
>> > Last (only...:) ) time I used a TI DSP was apr. 10 years ago,
>> > the 5420. Their divide was straight forward, use "subtract
>> > conditionally"
>> > in a repeat (penalty free) loop.
>> > I also have wondered - just vaguely, though - how do they accelerate
>> > division on various architectures, e.g. the power core I use now
>> > needs only 14 (or was it 16?) cycles for a 32/32, older
>> > implementations
>> > of that core (the original 603e, that is) needed 30+, 37 IIRC.
>>
>> The basic division algorithm is a "subtract conditionally" loop, which
>> is approximately one cycle per loop.  You can double the speed by simply
>> doing two bits at a time.  That takes more hardware - you have to do
>> three comparisons in parallel rather than just one, but it's not too
>> expensive (multiplying by 0b00, 0b01, and 0b10 are all zero cost, and
>> multiply by 0b11 is not hard).  That trick does not scale well, however
>> - doing another bit in each cycle means much more hardware, and the
>> depth of the combinational logic involved will mean slower clock speeds.
>>
>> You can also save a clock cycle or two at the ends of the algorithm by
>> careful setup - the cycles are usually still there in the latency, but
>> get hidden within the rest of the instruction pipeline.
>>
>> Early-exit testing can also be done - typically once the numerator part
>> has been reduced to 0 (or less than the numerator), you can do a fast
>> exit.  Some implementations may also have tricks like barrel-shifting at
>> the start to "cancel out" any factors of 2 in the figures.
>>
>> Beyond that, faster division is usually done by computing the
>> reciprocal, then multiplying.  That is particularly useful for large
>> bitwidths and floating point (i.e., hardware 64-bit floating point).
>>
>> For integer work, it is usually best to leave that to the compiler's
>> optimiser - a compiler may turn "x/3" into "(x * (2^n / 3)) >> n" for a
>> suitable n.  This generally only makes sense if the cpu can quickly
>> multiply numbers of twice the bitlength of x.
>>
>> > I have been moaning so many times of having to write yet another
>> > division - I think the only architecture which saved me that was the
>> > 68k, ppc didn't, it does not have the 64/32 68k has, not on 32 bit
>> > machines) that I use the chance to ask Jon to share his findings,
>> > I am also really curious about it.
>
>Oh I know what can be done, Jon apparently does too.

Yup.

>Like I said I have written numerous divisions on various machines.

And David especially knows I have done so.  We've talked
about it.  And he's aware of both my ignorance and skill.  So
I think he knows my boundaries.

>I did not use the "count leading zeroes" on the power 64/32 division
>though (was tempted but my time was a lot more important than the CPUs
>back then). Then dropping out on a 0 dividend (and/or shifting in
>advance
>by the least of leading zeroes) can reduce execution time in some
>(many) cases, but it does add cycles to the worst case so I am
>not sure I would go for it anyway, would be application specific
>I guess if it goes that narrow to the wire.

One doc I saw wrote "11 to 32" clock cycles for the division.
But more of a marketing piece.  Another, in a datasheet, I
find a maximum of 35 cycles.  That one I believe, as it
accounts for leading byte checks and result posting.  I've
looked a little bit at the 5-stage pipe docs and it's pretty
fancy.  It includes two separate result bypass paths to
accelerate results for following instructions to avoid
waiting for the register posting to take place.  Nice.

Interestingly, although the PIC32 achieves a fair pace (up to
80MHz), it isn't up to the MIPS synthesizable core, claimed
to be over 400MHz capable at 90nm and over 200MHz capable at
130nm.

Anyway, I'd love to delve into the details.  Too bad the .v
RTL modules aren't public.  (Unless some kind soul can point
me to them?  ;)

Per other suggestions, I already plan to spend about $1000 or
so getting various tools set up for the PIC32 (swapping in my
older Pro Mate II tools, buying a REAL ICE and ICD3, and then
some 'stuff' to update my toolset.)  I already have the tools
purchased for the midrange Energy Micro EFM32, which is a
Cortex-M3, and have started testing development there.  It
was some work to get unlimited, free tools up and working --
thanks very much to CodeSourcery for that, by the way, as I
am now in debt to them.  Next week or two, I will start up on
the PIC32, as well.

The PIC32's MDU operates "autonomously."  So it continues on
a division _and_ following instructions so long as an IU
pipeline stall isn't triggered with the use of an MDU op.  I
am curious about how it functions in the presence of
interrupts.  But there is a LOT OF DOC to read, yet.  So I am
behind on that score.  In any case, a cursory glance over the
IEMAW pipeline doesn't seem to require a stall by itself.  So
I remain unsure.

I haven't read up on the Cortex-M3 UDIV and SDIV.  Doc seems
in several different places, too.  But the Cortex-M3 may not
be autonomous.  Looking at DDI0337G, page 1-12, Figure 1-2,
seems to suggest it isn't autonomous and is squarely in the
"Ex" pathway.  So I'd guess a Cortex-M3 must wait for it.
Either way, it's always a two-edged sword so I don't think
one is necessarily better than another.

Noting from that same figure, register write-backs must
complete in Ex -- no need for the PIC32 bypass routes.  But
that is probably more the reason why Cortex-M3 would cycle
slower on the same feature size/process, too.

I am curious about the Cortex-M3 UDIV and SDIV implementation
in hardware.  I'd love to see the details.  But it is looking
as though the CPU waits for it.  So if you have a 12 cycle
DIV in progress, that figure seems to suggest to me that
there will be a series of Ex pipeline stalls while the
division completes and posts its results to registers.

Anyway, this is all on my 'off hours' for hobby work and will
be some joy ahead.

Jon
Reply jonk (565) 9/8/2011 11:49:18 PM

On Sep 9, 2:49=A0am, Jon Kirwan <j...@infinitefactors.org> wrote:
> ...
> Noting from that same figure, register write-backs must
> complete in Ex -- no need for the PIC32 bypass routes.  But
> that is probably more the reason why Cortex-M3 would cycle
> slower on the same feature size/process, too.

Division not stalling the pipeline may be pretty rare; at
least I have not seen it on the Power parts I use (5200B,
a really nice one). There, division just takes up everything.

As a side note I found out the hard way the depth of the
pipeline. Needed MAC - FMADD, as they have it, FP multiply and
accumulate. Naively, I did it in a loop as with a DSP
and expected to get the specified 2 cycles per 64*64 FMADD.
Got > 10. Ouch, this was close to ruining the entire
design effort. So I spent a day or two and eventually
wrestled down the data dependencies which were causing this;
it took using 24 of the 32 FP regs to do so, though.
Well, as a side benefit I saved some loads (once I
have 8 samples and 8 coefficients in the regs I did not
have to waste them and load again). So eventually I got
5.5 ns (2.5 ns per cycle) doing the loop (this includes
everything, perhaps loading from DDRAM to cache at times
etc.).
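
The dependency problem described above can be sketched in C.  This is not
the original code; it is a minimal illustration that assumes a plain dot
product over doubles and uses eight independent accumulators (the count is
an assumption, chosen to match the "8 samples and 8 coefficients" mentioned
above).  The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Naive form: each multiply-add needs the previous value of acc, so a
   pipelined FPU cannot start one until the one before it has finished. */
double dot_single(const double *x, const double *h, size_t n)
{
    assert(x != NULL && h != NULL);
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * h[i];
    return acc;
}

/* Eight independent accumulators: consecutive multiply-adds do not depend
   on each other, so they can overlap in the FPU pipeline. */
double dot_unrolled8(const double *x, const double *h, size_t n)
{
    assert(x != NULL && h != NULL);
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
    double a4 = 0.0, a5 = 0.0, a6 = 0.0, a7 = 0.0;
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        a0 += x[i + 0] * h[i + 0];
        a1 += x[i + 1] * h[i + 1];
        a2 += x[i + 2] * h[i + 2];
        a3 += x[i + 3] * h[i + 3];
        a4 += x[i + 4] * h[i + 4];
        a5 += x[i + 5] * h[i + 5];
        a6 += x[i + 6] * h[i + 6];
        a7 += x[i + 7] * h[i + 7];
    }

    /* Combine the partial sums, then handle any leftover elements. */
    double acc = ((a0 + a1) + (a2 + a3)) + ((a4 + a5) + (a6 + a7));
    for (; i < n; i++)
        acc += x[i] * h[i];
    return acc;
}
```

Whether a compiler turns these into FMADD instructions and keeps everything
in registers depends on the target and optimization settings.  Because the
additions are reordered, the two versions can also differ in the last bits
of the result for general floating-point data.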

Well, this got way aside. I won't delete it though, not
so many chances to talk work :-).

Dimiter
Reply dp (745) 9/9/2011 12:11:15 AM

Jon Kirwan <jonk@infinitefactors.org> wrote:
> The PIC32's MDU operates "autonomously."  So it continues on
> a division _and_ following instructions so long as an IU
> pipeline stall isn't triggered with the use of an MDU op.  I
> am curious about how it functions in the presence of
> interrupts.

I couldn't find any information in either the PIC32 docs or the MIPS
architecture docs, but I'm leaning towards the MDU ignoring interrupts
altogether. If the ISR uses the MDU, then the pipeline will stall until
the previous instruction completes.

> I haven't read up on the Cortex-M3 UDIV and SDIV.  Doc seems
> in several different places, too.  But the Cortex-M3 may not
> be autonomous.  Looking at DDI0337G, page 1-12, Figure 1-2,
> seems to suggest it isn't autonomous and is squarely in the
> "Ex" pathway.  So I'd guess a Cortex-M3 must wait for it.

That is my understanding. You could try asking for information regarding
the division implementation on ARM's tech forum[1]. There's a lot of
noise, but also some interesting posts there.

-a

[1] <http://forums.arm.com/index.php?/forum/3-arm-tech/>
Reply Anders.Montonen (86) 9/9/2011 12:35:36 PM

Anders.Montonen@kapsi.spam.stop.fi.invalid wrote:
> I couldn't find any information in either the PIC32 docs or the MIPS
> architecture docs, but I'm leaning towards the MDU ignoring interrupts
> altogether. If the ISR uses the MDU, then the pipeline will stall until
> the previous instruction completes.

Replying to myself, but this is apparently how the MDU worked in
pre-MIPS32/64 days; nowadays it's a bit smarter. See pp. 108-109 in "See
MIPS Run", 2nd ed. (Seems to be easy enough to find naughty PDF versions if
you don't own a paper copy.)

-a
Reply Anders.Montonen (86) 9/9/2011 12:55:47 PM

On Sep 9, 11:49 am, Jon Kirwan <j...@infinitefactors.org> wrote:
>
> Interestingly, although the PIC32 achieves a fair pace (up to
> 80MHz), it isn't up to the MIPS synthesizable core, claimed
> to be over 400MHz capable at 90nm and over 200MHz capable at
> 130nm.

That's more a flash artifact, than a process one.

Flash based uC seem to be rather stuck, for the last half decade, in
speed at the 80/100/120MHz region, and only the RAM based ones get
into the hundreds of MHz (like the sub $2 DSPs I mentioned above)

Even if the Flash limits the CPU speed, one of my peeves, is very few
parts allow the peripherals to run to the silicon process speed,
instead forcing the peripheral clock to be <= CPU clock.
-jg
Reply j.m.granville (126) 9/10/2011 11:50:32 PM

In article <c831ff5d-0f43-4707-9788-
efc01c5e13ce@n19g2000prh.googlegroups.com>, j.m.granville@gmail.com 
says...
> 
> On Sep 9, 11:49 am, Jon Kirwan <j...@infinitefactors.org> wrote:
> >
> > Interestingly, although the PIC32 achieves a fair pace (up to
> > 80MHz), it isn't up to the MIPS synthesizable core, claimed
> > to be over 400MHz capable at 90nm and over 200MHz capable at
> > 130nm.
> 
> That's more a flash artifact, than a process one.
> 
> Flash based uC seem to be rather stuck, for the last half decade, in
> speed at the 80/100/120MHz region, and only the RAM based ones get
> into the hundreds of MHz (like the sub $2 DSPs I mentioned above)
> 
>  Even if the Flash limits the CPU speed, one of my peeves, is very few
> parts allow the peripherals to run to the silicon process speed,
> instead forcing the peripheral clock to be <= CPU clock.

Peripherals, having to produce and accept signals from the outside
world using traces with larger capacitance and inductance, will 
naturally be more limited in speed.

Mark Borgerson

Reply mborgerson (483) 9/11/2011 3:32:11 AM

On Sep 11, 3:32 pm, Mark Borgerson <mborger...@comcast.net> wrote:
> In article <c831ff5d-0f43-4707-9788-
> efc01c5e1...@n19g2000prh.googlegroups.com>, j.m.granvi...@gmail.com
> says...
>
> > On Sep 9, 11:49 am, Jon Kirwan <j...@infinitefactors.org> wrote:
>
> > > Interestingly, although the PIC32 achieves a fair pace (up to
> > > 80MHz), it isn't up to the MIPS synthesizable core, claimed
> > > to be over 400MHz capable at 90nm and over 200MHz capable at
> > > 130nm.
>
> > That's more a flash artifact, than a process one.
>
> > Flash based uC seem to be rather stuck, for the last half decade, in
> > speed at the 80/100/120MHz region, and only the RAM based ones get
> > into the hundreds of MHz (like the sub $2 DSPs I mentioned above)
>
> > Even if the Flash limits the CPU speed, one of my peeves, is very few
> > parts allow the peripherals to run to the silicon process speed,
> > instead forcing the peripheral clock to be <= CPU clock.
>
> Peripherals, having to produce and accept signals from the outside
> world using traces with larger capacitance and inductance, will
> naturally be more limited in speed.

 Perhaps, but that ceiling is rather above the Flash-Speed limit I was
mentioning, and it clearly is not too much of an actual problem, as
SOME uC vendors can manage Peripheral clocks faster than core speeds.
It is a slow trend, I'd like to see become faster...

-jg
Reply j.m.granville (126) 9/11/2011 5:55:41 AM

29 Replies
31 Views