question about macro?

  • Follow


Hi all,

Actually I want my code to be


for (i=0;i<N;i++)
{
    array2[i*30]=array1[i];
    array2[i*30+1]=array1[i];
    array2[i*30+2]=array1[i];
    array2[i*30+3]=array1[i];
..................
..................
..................
    array2[i*30+28]=array1[i];
    array2[i*30+29]=array1[i];
}

If there anyway to use macros to do it instead for listing out
manually in the code like above?
(the listing out such as the above example will be very bad in the
case I need, say,  array2[i*1000+0]=array1[i];  to ...
array2[i*1000+999]=array1[i];)

And please note that for some reasons of my device, I do not want to
use another loop such as

for (i=0;i<N;i++)
for (j=0;j<30;j++)
array2[i*30+j]=array1[i];


If I write a macro as follows:

#define myMacro (array1, array2)          \
for ( j=0;j<30;j++)                                  \
     array2[i*30+j]=array1[i];                    \



Does it help? I doubt it because it will be the same as the case with
two loops that I want to avoid.

Thanks a lot.



0
Reply mbalover9 (23) 3/9/2010 8:32:51 PM

MBALOVER wrote:
> Hi all,
> 
> Actually I want my code to be
> 
> 
> for (i=0;i<N;i++)
> {
>     array2[i*30]=array1[i];
>     array2[i*30+1]=array1[i];
>     array2[i*30+2]=array1[i];
>     array2[i*30+3]=array1[i];
> ..................
> ..................
> ..................
>     array2[i*30+28]=array1[i];
>     array2[i*30+29]=array1[i];
> }
> 
> If there anyway to use macros to do it instead for listing out
> manually in the code like above?

Forget macros, they won't help here.

How about memset?

(untested)

memset( array2+i*30, array1[i], 30*sizeof(array2[0]) );

-- 
Ian Collins
0
Reply Ian 3/9/2010 8:40:26 PM


On Mar 9, 2:32=A0pm, MBALOVER <mbalov...@gmail.com> wrote:
> Hi all,
>
> Actually I want my code to be
>
> for (i=3D0;i<N;i++)
> {
> =A0 =A0 array2[i*30]=3Darray1[i];
> =A0 =A0 array2[i*30+1]=3Darray1[i];
> =A0 =A0 array2[i*30+2]=3Darray1[i];
> =A0 =A0 array2[i*30+3]=3Darray1[i];
> .................
> .................
> .................
> =A0 =A0 array2[i*30+28]=3Darray1[i];
> =A0 =A0 array2[i*30+29]=3Darray1[i];
>
> }
>
> If there anyway to use macros to do it instead for listing out
> manually in the code like above?
> (the listing out such as the above example will be very bad in the
> case I need, say, =A0array2[i*1000+0]=3Darray1[i]; =A0to ...
> array2[i*1000+999]=3Darray1[i];)
>
> And please note that for some reasons of my device, I do not want to
> use another loop such as
>
> for (i=3D0;i<N;i++)
> for (j=3D0;j<30;j++)
> array2[i*30+j]=3Darray1[i];
>
> If I write a macro as follows:
>
> #define myMacro (array1, array2) =A0 =A0 =A0 =A0 =A0\
> for ( j=3D0;j<30;j++) =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 =A0 =A0 =A0\
> =A0 =A0 =A0array2[i*30+j]=3Darray1[i]; =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =
=A0 =A0\
>
> Does it help? I doubt it because it will be the same as the case with
> two loops that I want to avoid.
>
> Thanks a lot.

I'm curious as to why an inner loop would be that bad.

Frankly, you're stuck; if you absolutely cannot use an inner loop,
then you'll have to write out each initialization manually.  You could
write another program to automate the process for you, such as:

    printf("for (i =3D 0; i < N; i++)\n{\n");
    for (j =3D 0; j < K; j++) /** where K is the number of items */
    {
      printf("  array[i*30+%d] =3D array[i];\n", j);
    }
    printf("}\n");

and then paste that output into your code.

Another alternative would be to look into a variation of Duff's device
(see http://en.wikipedia.org/wiki/Duff's_device), where you use an
inner loop but partially unroll it.

    for (i =3D 0; i < N; i++)
    {
      size_t j =3D (K + 7)/8;
      switch(j % 8)
      {
        case 0 : do {
        case 7 :   array[i*30+j+7] =3D array[i];
        case 6 :   array[i*30+j+6] =3D array[i];
        case 5 :   array[i*30+j+5] =3D array[i];
        case 4 :   array[i*30+j+4] =3D array[i];
        case 3 :   array[i*30+j+3] =3D array[i];
        case 2 :   array[i*30+j+2] =3D array[i];
        case 1 :   array[i*30+j+1] =3D array[i];
        } while (--j > 0);
      }
    }
0
Reply John 3/9/2010 8:53:09 PM

MBALOVER <mbalover9@gmail.com> writes:
> Actually I want my code to be
>
>
> for (i=0;i<N;i++)
> {
>     array2[i*30]=array1[i];
>     array2[i*30+1]=array1[i];
>     array2[i*30+2]=array1[i];
>     array2[i*30+3]=array1[i];
> .................
> .................
> .................
>     array2[i*30+28]=array1[i];
>     array2[i*30+29]=array1[i];
> }

It certainly looks like a job for a nested for loop (but see below).

> If there anyway to use macros to do it instead for listing out
> manually in the code like above?
> (the listing out such as the above example will be very bad in the
> case I need, say,  array2[i*1000+0]=array1[i];  to ...
> array2[i*1000+999]=array1[i];)

There's not really any way for the preprocessor, given a number N, to
generate N statements.

> And please note that for some reasons of my device, I do not want to
> use another loop such as
>
> for (i=0;i<N;i++)
> for (j=0;j<30;j++)
> array2[i*30+j]=array1[i];

So you insist on having, say, 30 distinct assignment statements; an
equivalent loop that iterates 30 times and performs exactly the same
assignments isn't good enough.

It might help if we know *how* your device imposes this requirement.

If you just write the nested for loop, how exactly does that not work?

> If I write a macro as follows:
>
> #define myMacro (array1, array2)          \
> for ( j=0;j<30;j++)                                  \
>      array2[i*30+j]=array1[i];                    \
>
>
>
> Does it help? I doubt it because it will be the same as the case with
> two loops that I want to avoid.

Right, the generated code will almost certainly be the same whether
the loop is written out or generated by a macro.

What you're doing is loop unrolling.  It's an optimization technique
that can result in faster but larger code.  (In some cases, the
code can be slower just because it's larger, due to cache effects.)
Doing loop unrolling at the source level would usually be considered
premature optimization; a good optimizing compiler should be able
to do this for you.

But if you really need to do this, you might consider writing another
tool that generates C code from an input specification.  (m4 is a
popular preprocessing tool; I don't know whether it's powerful enough
for this, but it's worth looking into.)

-- 
Keith Thompson (The_Other_Keith) kst-u@mib.org  <http://www.ghoti.net/~kst>
Nokia
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
0
Reply Keith 3/9/2010 8:53:46 PM

Ian Collins <ian-news@hotmail.com> writes:
> MBALOVER wrote:
>> Actually I want my code to be
>>
>>
>> for (i=0;i<N;i++)
>> {
>>     array2[i*30]=array1[i];
>>     array2[i*30+1]=array1[i];
>>     array2[i*30+2]=array1[i];
>>     array2[i*30+3]=array1[i];
>> ..................
>> ..................
>> ..................
>>     array2[i*30+28]=array1[i];
>>     array2[i*30+29]=array1[i];
>> }
>>
>> If there anyway to use macros to do it instead for listing out
>> manually in the code like above?
>
> Forget macros, they won't help here.
>
> How about memset?
>
> (untested)
>
> memset( array2+i*30, array1[i], 30*sizeof(array2[0]) );

No, memset sets each byte of the target to the same value.  Unless
array1 and array2 are byte arrays, it won't help here.

Something similar to memset that works on elements bigger than bytes
would do the trick.  But the obvious way to implement such a function
is to use a for loop, which the OP wants to avoid for some unknown
reason.

-- 
Keith Thompson (The_Other_Keith) kst-u@mib.org  <http://www.ghoti.net/~kst>
Nokia
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
0
Reply Keith 3/9/2010 8:58:53 PM

Keith Thompson wrote:
> Ian Collins <ian-news@hotmail.com> writes:
>> MBALOVER wrote:
>>> Actually I want my code to be
>>>
>>>
>>> for (i=0;i<N;i++)
>>> {
>>>     array2[i*30]=array1[i];
>>>     array2[i*30+1]=array1[i];
>>>     array2[i*30+2]=array1[i];
>>>     array2[i*30+3]=array1[i];
>>> ..................
>>> ..................
>>> ..................
>>>     array2[i*30+28]=array1[i];
>>>     array2[i*30+29]=array1[i];
>>> }
>>>
>>> If there anyway to use macros to do it instead for listing out
>>> manually in the code like above?
>> Forget macros, they won't help here.
>>
>> How about memset?
>>
>> (untested)
>>
>> memset( array2+i*30, array1[i], 30*sizeof(array2[0]) );
> 
> No, memset sets each byte of the target to the same value.  Unless
> array1 and array2 are byte arrays, it won't help here.

Bugger, I forgot about that...  I agree with your other posting.

-- 
Ian Collins
0
Reply Ian 3/9/2010 9:07:42 PM

Thanks a lot for responses.

Let me explain why I do not want to use a inner loop.

for (i=3D0;i<N;i++)
for (j=3D0;j<M;j++)
array2[i*M+j]=3Darray1[i];

for each for iteration of a for loop, we need a comparison and one
addition. In my system, each comparison takes 3 cycles and one
addition takes 1 cycle.
So each iteration takes 4 cycles.

If I use a inner loop, for each i, I need 4*M additional cycles.  In
my case M , N both are large numbers and the system's time budget is
limited.

That is why I do not want to use this way.






On Mar 9, 3:53=A0pm, Keith Thompson <ks...@mib.org> wrote:
> MBALOVER <mbalov...@gmail.com> writes:
> > Actually I want my code to be
>
> > for (i=3D0;i<N;i++)
> > {
> > =A0 =A0 array2[i*30]=3Darray1[i];
> > =A0 =A0 array2[i*30+1]=3Darray1[i];
> > =A0 =A0 array2[i*30+2]=3Darray1[i];
> > =A0 =A0 array2[i*30+3]=3Darray1[i];
> > .................
> > .................
> > .................
> > =A0 =A0 array2[i*30+28]=3Darray1[i];
> > =A0 =A0 array2[i*30+29]=3Darray1[i];
> > }
>
> It certainly looks like a job for a nested for loop (but see below).
>
> > If there anyway to use macros to do it instead for listing out
> > manually in the code like above?
> > (the listing out such as the above example will be very bad in the
> > case I need, say, =A0array2[i*1000+0]=3Darray1[i]; =A0to ...
> > array2[i*1000+999]=3Darray1[i];)
>
> There's not really any way for the preprocessor, given a number N, to
> generate N statements.
>
> > And please note that for some reasons of my device, I do not want to
> > use another loop such as
>
> > for (i=3D0;i<N;i++)
> > for (j=3D0;j<30;j++)
> > array2[i*30+j]=3Darray1[i];
>
> So you insist on having, say, 30 distinct assignment statements; an
> equivalent loop that iterates 30 times and performs exactly the same
> assignments isn't good enough.
>
> It might help if we know *how* your device imposes this requirement.
>
> If you just write the nested for loop, how exactly does that not work?
>
> > If I write a macro as follows:
>
> > #define myMacro (array1, array2) =A0 =A0 =A0 =A0 =A0\
> > for ( j=3D0;j<30;j++) =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =
=A0 =A0 =A0 =A0 =A0\
> > =A0 =A0 =A0array2[i*30+j]=3Darray1[i]; =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =
=A0 =A0\
>
> > Does it help? I doubt it because it will be the same as the case with
> > two loops that I want to avoid.
>
> Right, the generated code will almost certainly be the same whether
> the loop is written out or generated by a macro.
>
> What you're doing is loop unrolling. =A0It's an optimization technique
> that can result in faster but larger code. =A0(In some cases, the
> code can be slower just because it's larger, due to cache effects.)
> Doing loop unrolling at the source level would usually be considered
> premature optimization; a good optimizing compiler should be able
> to do this for you.
>
> But if you really need to do this, you might consider writing another
> tool that generates C code from an input specification. =A0(m4 is a
> popular preprocessing tool; I don't know whether it's powerful enough
> for this, but it's worth looking into.)
>
> --
> Keith Thompson (The_Other_Keith) ks...@mib.org =A0<http://www.ghoti.net/~=
kst>
> Nokia
> "We must do something. =A0This is something. =A0Therefore, we must do thi=
s."
> =A0 =A0 -- Antony Jay and Jonathan Lynn, "Yes Minister"

0
Reply MBALOVER 3/9/2010 9:09:12 PM

MBALOVER wrote:

[please don't top post]

> Thanks a lot for responses.
> 
> Let me explain why I do not want to use a inner loop.
> 
> for (i=0;i<N;i++)
> for (j=0;j<M;j++)
> array2[i*M+j]=array1[i];
> 
> for each for iteration of a for loop, we need a comparison and one
> addition. In my system, each comparison takes 3 cycles and one
> addition takes 1 cycle.
> So each iteration takes 4 cycles.
> 
> If I use a inner loop, for each i, I need 4*M additional cycles.  In
> my case M , N both are large numbers and the system's time budget is
> limited.
> 
> That is why I do not want to use this way.

Have you looked at the generated code with full optimisation on your 
compiler?  All those I've used will perform some degree of loop 
unrolling, so you will only suffer an extra

(4*M)/L

where L is the number of iterations inlined by the compiler.

-- 
Ian Collins
0
Reply Ian 3/9/2010 9:13:28 PM

MBALOVER wrote:
> Thanks a lot for responses.
>
> Let me explain why I do not want to use a inner loop.
>
> for (i=0;i<N;i++)
> for (j=0;j<M;j++)
> array2[i*M+j]=array1[i];
>
> for each for iteration of a for loop, we need a comparison and one
> addition. In my system, each comparison takes 3 cycles and one
> addition takes 1 cycle.
> So each iteration takes 4 cycles.
>
> If I use a inner loop, for each i, I need 4*M additional cycles.  In
> my case M , N both are large numbers and the system's time budget is
> limited.
>
> That is why I do not want to use this way.

How about a partially unrolled inner loop?

So in the the inner loop it does 10 assignments at a time, and the loop 
executes M/10 times (with any odd assignments added at the end). Then the 
overhead might only be 4*M/10 cycles, and you only ever have to write out 
in, full, 10 assignments (or however many you want to do in one go).

-- 
Bartc 

0
Reply bartc 3/9/2010 9:28:10 PM

On 3/9/2010 3:32 PM, MBALOVER wrote:
> Hi all,
>
> Actually I want my code to be
>
>
> for (i=0;i<N;i++)
> {
>      array2[i*30]=array1[i];
>      array2[i*30+1]=array1[i];
>      array2[i*30+2]=array1[i];
>      array2[i*30+3]=array1[i];
> .................
> .................
> .................
>      array2[i*30+28]=array1[i];
>      array2[i*30+29]=array1[i];
> }
>
> If there anyway to use macros to do it instead for listing out
> manually in the code like above?

     Use a second loop nested inside the first.

> (the listing out such as the above example will be very bad in the
> case I need, say,  array2[i*1000+0]=array1[i];  to ...
> array2[i*1000+999]=array1[i];)
>
> And please note that for some reasons of my device, I do not want to
> use another loop such as
>
> for (i=0;i<N;i++)
> for (j=0;j<30;j++)
> array2[i*30+j]=array1[i];

     Use a second loop nested inside-- oh, wait, sorry.  It's
the right solution, but if you don't want to use it ...  What
are these mysterious "reasons" of yours?  Tell us about them,
and maybe somebody will have a better idea than writing line
after line after repetitive line of code.

> If I write a macro as follows:
>
> #define myMacro (array1, array2)          \
> for ( j=0;j<30;j++)                                  \
>       array2[i*30+j]=array1[i];                    \
>
> Does it help? I doubt it because it will be the same as the case with
> two loops that I want to avoid.

     It does the right thing, which is to use a second loop-- oh,
wait, we've been through that already.

     The preprocessor has no iterative constructs.

-- 
Eric Sosman
esosman@ieee-dot-org.invalid
0
Reply Eric 3/9/2010 9:30:15 PM

On Mar 9, 3:32=A0pm, MBALOVER <mbalov...@gmail.com> wrote:
> Hi all,
>
> Actually I want my code to be
>
> for (i=3D0;i<N;i++)
> {
> =A0 =A0 array2[i*30]=3Darray1[i];
> =A0 =A0 array2[i*30+1]=3Darray1[i];
> =A0 =A0 array2[i*30+2]=3Darray1[i];
> =A0 =A0 array2[i*30+3]=3Darray1[i];
> .................
> .................
> .................
> =A0 =A0 array2[i*30+28]=3Darray1[i];
> =A0 =A0 array2[i*30+29]=3Darray1[i];
>
> }
>
> If there anyway to use macros to do it instead for listing out
> manually in the code like above?
> (the listing out such as the above example will be very bad in the
> case I need, say, =A0array2[i*1000+0]=3Darray1[i]; =A0to ...
> array2[i*1000+999]=3Darray1[i];)
>
> And please note that for some reasons of my device, I do not want to
> use another loop such as
>
> for (i=3D0;i<N;i++)
> for (j=3D0;j<30;j++)
> array2[i*30+j]=3Darray1[i];
>
> If I write a macro as follows:
>
> #define myMacro (array1, array2) =A0 =A0 =A0 =A0 =A0\
> for ( j=3D0;j<30;j++) =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 =A0 =A0 =A0\
> =A0 =A0 =A0array2[i*30+j]=3Darray1[i]; =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =
=A0 =A0\
>
> Does it help? I doubt it because it will be the same as the case with
> two loops that I want to avoid.
>
> Thanks a lot.

Just a question.  Is the performance gain worth the maintenance and
code size (if embedded) costs that you're adding?  I haven't
encountered a scenario where doing something like this is worth the
imposed maintenance cost, so I'm curious what the scenario is.
0
Reply ImpalerCore 3/9/2010 9:48:41 PM

On 3/9/2010 4:09 PM, MBALOVER wrote:
> Thanks a lot for responses.
>
> Let me explain why I do not want to use a inner loop.
>
> for (i=0;i<N;i++)
> for (j=0;j<M;j++)
> array2[i*M+j]=array1[i];
>
> for each for iteration of a for loop, we need a comparison and one
> addition.

     Maybe, but optimizing compilers are already pretty good
at unrolling loops -- especially loops whose iteration count
are compile-time constants.

> addition. In my system, each comparison takes 3 cycles and one
> addition takes 1 cycle.
> So each iteration takes 4 cycles.

     It was reasonable to count cycles as recently as, oh, twenty
years ago.  Since then, CPU's have become far more intricate and
cycle counts have become extremely difficult to compute.  Unless
you're using an unusually simple processor (single-issue, little
or no pipelining, at most one level of memory cache, maybe only
a one- or two-slot store buffer, ...), these cycle counts are
only loosely related to elapsed time.

     Your example showed thirty assignments, and you indicated you
might need as many as a thousand.  How many cycles will you spend
in wait states with the CPU twiddling its thumbs while RAM ever so
slowly retrieves a fresh batch of instructions?

> If I use a inner loop, for each i, I need 4*M additional cycles.  In
> my case M , N both are large numbers and the system's time budget is
> limited.
>
> That is why I do not want to use this way.

     Write it both ways (John Bode's suggestion of a "helper"
program will save some time), and MEASURE the difference.  Cycle
counting is nearly impossible unless you happen to know that most
of the data is already in CPU registers; a couple of cache misses
(even just in L1) can disrupt the whole calculation.  Measure!

-- 
Eric Sosman
esosman@ieee-dot-org.invalid
0
Reply Eric 3/9/2010 10:03:35 PM

"MBALOVER" <mbalover9@gmail.com> wrote in message 
news:4c7dc4cf-a06e-44b8-b201-971e045c38cb@g7g2000yqe.googlegroups.com...
> Hi all,
>
> Actually I want my code to be
>
>
> for (i=0;i<N;i++)
> {
>    array2[i*30]=array1[i];
>    array2[i*30+1]=array1[i];
>    array2[i*30+2]=array1[i];
>    array2[i*30+3]=array1[i];
> .................
> .................
> .................
>    array2[i*30+28]=array1[i];
>    array2[i*30+29]=array1[i];
> }
>
> If there anyway to use macros to do it instead for listing out
> manually in the code like above?
> (the listing out such as the above example will be very bad in the
> case I need, say,  array2[i*1000+0]=array1[i];  to ...
> array2[i*1000+999]=array1[i];)

If you *have* to use macro's, how about:

#define A(x) array2[i*30+x] = array1[i];
#define A_5(x) A(x) A(x+1) A(x+2) A(x+3) A(x+4)
#define A_15(x) A_5(x) A_5(x+5) A_5(x+10)
#define A_30(x) A_15(x) A_15(x+15)

for (i=0; i<N; i++)
{
    A_30(i);
}


>
> And please note that for some reasons of my device, I do not want to
> use another loop such as
>
> for (i=0;i<N;i++)
> for (j=0;j<30;j++)
> array2[i*30+j]=array1[i];
>

Have you tried this?  Depending on your processor, it's possible that the 
read and the write will be overlapped with the loop overhead, and so the 
loop might cost you no time at all (and hence give you more 
readable/maintainable code for free)?

>
> If I write a macro as follows:
>
> #define myMacro (array1, array2)          \
> for ( j=0;j<30;j++)                                  \
>     array2[i*30+j]=array1[i];                    \
>
>
>
> Does it help? I doubt it because it will be the same as the case with
> two loops that I want to avoid.
>
> Thanks a lot.
>
>
> 


0
Reply Scott 3/9/2010 10:19:43 PM

Ian Collins wrote:
> MBALOVER wrote:
> 
> [please don't top post]
> 
>> Thanks a lot for responses.
>>
>> Let me explain why I do not want to use a inner loop.
>>
>> for (i=0;i<N;i++)
>> for (j=0;j<M;j++)
>> array2[i*M+j]=array1[i];
>>
>> for each for iteration of a for loop, we need a comparison and one
>> addition. In my system, each comparison takes 3 cycles and one
>> addition takes 1 cycle.
>> So each iteration takes 4 cycles.
>>
>> If I use a inner loop, for each i, I need 4*M additional cycles.  In
>> my case M , N both are large numbers and the system's time budget is
>> limited.
>>
>> That is why I do not want to use this way.
> 
> Have you looked at the generated code with full optimisation on your 
> compiler?  All those I've used will perform some degree of loop 
> unrolling, so you will only suffer an extra
> 
> (4*M)/L
> 
> where L is the number of iterations inlined by the compiler.

A wacky alternative if you have a C++ compiler for your target is to use 
template meta-programming to generate the code.

-- 
Ian Collins
0
Reply Ian 3/9/2010 10:21:20 PM

Ian Collins <ian-news@hotmail.com> writes:

> MBALOVER wrote:
>> Hi all,
>>
>> Actually I want my code to be
>>
>>
>> for (i=0;i<N;i++)
>> {
>>     array2[i*30]=array1[i];
>>     array2[i*30+1]=array1[i];
>>     array2[i*30+2]=array1[i];
>>     array2[i*30+3]=array1[i];
>> ..................
>> ..................
>> ..................
>>     array2[i*30+28]=array1[i];
>>     array2[i*30+29]=array1[i];
>> }
>>
>> If there anyway to use macros to do it instead for listing out
>> manually in the code like above?
>
> Forget macros, they won't help here.
>
> How about memset?
>
> (untested)
>
> memset( array2+i*30, array1[i], 30*sizeof(array2[0]) );

As Keith has pointed out memset won't do.  memcpy, on the other hand,
can do it in 5 calls:

  array2[i+30] = array1[i];
  memcpy(array2+i*30 +  1, array2[i+30],      sizeof array2[i]);
  memcpy(array2+i*30 +  2, array2[i+30],  2 * sizeof array2[i]);
  memcpy(array2+i*30 +  4, array2[i+30],  4 * sizeof array2[i]);
  memcpy(array2+i*30 +  8, array2[i+30],  8 * sizeof array2[i]);
  memcpy(array2+i*30 + 16, array2[i+30], 14 * sizeof array2[i]);

Note the final multiplier (if I've go it right).  The initial
assignment can be skipped if array2 and array1 have the same type (by
copying from array1 rather than array2) but it is likely that it is
faster to do the first few by assignment anyway before switching to
memcpy.  The only way to be sure is to measure.  I'd also measure the
plain loop because as so many people have pointed out, it is often
hard to predict the costs of loops these days.

The 1000 element copy takes 10 calls.  Of course even these calls can
be put in a loop that where the overhead is likely to be small
compared to the work done.  The result would be for more maintainable
than huge pile of assignments.

-- 
Ben.
0
Reply Ben 3/9/2010 11:13:05 PM

On 2010-03-09, MBALOVER <mbalover9@gmail.com> wrote:
> Let me explain why I do not want to use a inner loop.

Okay.

> for each for iteration of a for loop, we need a comparison and one
> addition. In my system, each comparison takes 3 cycles and one
> addition takes 1 cycle.

Stop right there.

Write it the simple, clear, and correct way.

Now test:  IS IT ACTUALLY TOO SLOW?

If the answer is "no", you're done.

DO NOT TRY TO OPTIMIZE WITHOUT TESTING FIRST.  EVER.

You may actually find out that the for loop is faster.  Why?  Any of
several reasons.  Optimizers are pretty good about inner loops; they
get a lot of practice.  Furthermore, it is not uncommon for a processor
to be such that an inner loop will be faster than a fully-unrolled
loop, because the fully-unrolled loop breaks the cache.  You think
branches are slow, wait until you see the cost of waiting for a single
byte of memory that wasn't in cache...

Seriously, though, you should NOT be trying to optimize stuff like this
at your level of experience.  When you're regularly advising the experts
in the group on performance, then you could possibly benefit from talking
about cycle counts.  At your level of experience, you're just wasting
your time.

-s
-- 
Copyright 2010, all wrongs reversed.  Peter Seebach / usenet-nospam@seebs.net
http://www.seebs.net/log/ <-- lawsuits, religion, and funny pictures
http://en.wikipedia.org/wiki/Fair_Game_(Scientology) <-- get educated!
0
Reply Seebs 3/9/2010 11:22:40 PM

In article <4c7dc4cf-a06e-44b8-b201-971e045c38cb@g7g2000yqe.googlegroups.com>,
MBALOVER  <mbalover9@gmail.com> wrote:
>Hi all,
>
>Actually I want my code to be
>
>
>for (i=0;i<N;i++)
>{
>    array2[i*30]=array1[i];
>    array2[i*30+1]=array1[i];
>    array2[i*30+2]=array1[i];
>    array2[i*30+3]=array1[i];
>.................
>.................
>.................
>    array2[i*30+28]=array1[i];
>    array2[i*30+29]=array1[i];
>}
>
>If there anyway to use macros to do it instead for listing out
>manually in the code like above?
>(the listing out such as the above example will be very bad in the
>case I need, say,  array2[i*1000+0]=array1[i];  to ...
>array2[i*1000+999]=array1[i];)
>
>And please note that for some reasons of my device, I do not want to
>use another loop such as
>
>for (i=0;i<N;i++)
>for (j=0;j<30;j++)
>array2[i*30+j]=array1[i];

You could to this:

  for (j = 0; j < 30 * N; j++)
  {
    array2[j] = array1[j/30];
  }
0
Reply ike 3/10/2010 12:45:52 AM

Eric Sosman wrote:
> On 3/9/2010 4:09 PM, MBALOVER wrote:
>> Thanks a lot for responses.
>>
>> Let me explain why I do not want to use a inner loop.
>>
>> for (i=0;i<N;i++)
>> for (j=0;j<M;j++)
>> array2[i*M+j]=array1[i];
>>
>> for each for iteration of a for loop, we need a comparison and one
>> addition.
> 
>     Maybe, but optimizing compilers are already pretty good
> at unrolling loops -- especially loops whose iteration count
> are compile-time constants.

Yes, but possibly not enough. The OP needs to check this.

Personally, if the OP needs to do the full unroll, I would be inclined 
to write a small C program to run on the ost system to generate the code 
to be compiled for the target.

>> addition. In my system, each comparison takes 3 cycles and one
>> addition takes 1 cycle.
>> So each iteration takes 4 cycles.
> 
>     It was reasonable to count cycles as recently as, oh, twenty
> years ago.  Since then, CPU's have become far more intricate and
> cycle counts have become extremely difficult to compute.  Unless
> you're using an unusually simple processor (single-issue, little
> or no pipelining, at most one level of memory cache, maybe only
> a one- or two-slot store buffer, ...), these cycle counts are
> only loosely related to elapsed time.

It's not *that* long ago that I was programming an embedded processor 
that had a trick three stage pipeline (all instructions when through all 
three stages even if they only did something in one of them), no 
out-of-order execution and no cache. So excluding wait states for memory 
all instructions took in three cycles. However, in practice, due to the 
pipelining, all instructions effectively took one cycle except in the 
presence of a branch. The processor, and it's derivatives, are still 
available today.

>     Your example showed thirty assignments, and you indicated you
> might need as many as a thousand.  How many cycles will you spend
> in wait states with the CPU twiddling its thumbs while RAM ever so
> slowly retrieves a fresh batch of instructions?

On the embedded system I was processing the memory and clock speeds 
where carefully selected so there were no wait states.

>> If I use a inner loop, for each i, I need 4*M additional cycles.  In
>> my case M , N both are large numbers and the system's time budget is
>> limited.
>>
>> That is why I do not want to use this way.
> 
>     Write it both ways (John Bode's suggestion of a "helper"
> program will save some time), and MEASURE the difference.  Cycle
> counting is nearly impossible unless you happen to know that most
> of the data is already in CPU registers; a couple of cache misses
> (even just in L1) can disrupt the whole calculation.  Measure!

I agree with measure, although *if* it is a simple enough processor (and 
they are still out there) you *can* count.
-- 
Flash Gordon
0
Reply Flash 3/11/2010 1:11:32 AM

On Mar 10, 5:11=A0pm, Flash Gordon <s...@spam.causeway.com> wrote:
> Eric Sosman wrote:
> > On 3/9/2010 4:09 PM, MBALOVER wrote:
> >> Thanks a lot for responses.
>
> >> Let me explain why I do not want to use a inner loop.
>
> >> for (i=3D0;i<N;i++)
> >> for (j=3D0;j<M;j++)
> >> array2[i*M+j]=3Darray1[i];
>
> >> for each for iteration of a for loop, we need a comparison and one
> >> addition.
>
> > =A0 =A0 Maybe, but optimizing compilers are already pretty good
> > at unrolling loops -- especially loops whose iteration count
> > are compile-time constants.
>
> Yes, but possibly not enough. The OP needs to check this.
>
> Personally, if the OP needs to do the full unroll, I would be inclined
> to write a small C program to run on the ost system to generate the code
> to be compiled for the target.
>
> >> addition. In my system, each comparison takes 3 cycles and one
> >> addition takes 1 cycle.
> >> So each iteration takes 4 cycles.
>
> > =A0 =A0 It was reasonable to count cycles as recently as, oh, twenty
> > years ago. =A0Since then, CPU's have become far more intricate and
> > cycle counts have become extremely difficult to compute. =A0Unless
> > you're using an unusually simple processor (single-issue, little
> > or no pipelining, at most one level of memory cache, maybe only
> > a one- or two-slot store buffer, ...), these cycle counts are
> > only loosely related to elapsed time.
>
> It's not *that* long ago that I was programming an embedded processor
> that had a trick three stage pipeline (all instructions when through all
> three stages even if they only did something in one of them), no
> out-of-order execution and no cache. So excluding wait states for memory
> all instructions took in three cycles. However, in practice, due to the
> pipelining, all instructions effectively took one cycle except in the
> presence of a branch. The processor, and it's derivatives, are still
> available today.

What processor?  Just curious...
0
Reply Squeamizh 3/11/2010 5:03:25 AM

Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
> Ian Collins <ian-news@hotmail.com> writes:
>> MBALOVER wrote:
>>> Hi all,
>>>
>>> Actually I want my code to be
>>>
>>> for (i=0;i<N;i++)
>>> {
>>>     array2[i*30]=array1[i];
>>> ..................
>>>     array2[i*30+29]=array1[i];
>>> }
>>>
>>> If there anyway to use macros to do it instead for listing out
>>> manually in the code like above?
>>
>> Forget macros, they won't help here.
>>
>> How about memset?

On many installations, you'll find that's a macro! ;-)

>> (untested)
>>
>> memset( array2+i*30, array1[i], 30*sizeof(array2[0]) );
>
> As Keith has pointed out memset won't do.  memcpy, on the other hand,
> can do it in 5 calls:
>
>   array2[i+30] = array1[i];
>   memcpy(array2+i*30 +  1, array2[i+30],      sizeof array2[i]);
>   memcpy(array2+i*30 +  2, array2[i+30],  2 * sizeof array2[i]);
>   memcpy(array2+i*30 +  4, array2[i+30],  4 * sizeof array2[i]);
>   memcpy(array2+i*30 +  8, array2[i+30],  8 * sizeof array2[i]);
>   memcpy(array2+i*30 + 16, array2[i+30], 14 * sizeof array2[i]);
>
> Note the final multiplier (if I've go it right).

14's right. The +30's should be *30's.

I timed this against a naive loop for int, long long, and 
double types. Conclusions were inconclusive. For double, 
the naive loop was faster when compiled for plain i386, but 
slower when 'optimised' for the actual athlon-xp architecture. 
For int and long long, the memcpys were faster than the loop.
The memcpys were in fact simply unrolled completely.

Given that there was a case where the memcpys were measurably 
slower, it's not worth blindly going for the clever technique 
until you've both found that the naive loop is slowing you down
and measured the two to make sure that the optimisation is in 
fact an improvement. Writing a clever version for speed without 
measuring whether it's faster or not is dumb. And frequently 
done, alas.

Phil
-- 
I find the easiest thing to do is to k/f myself and just troll away
-- David Melville on r.a.s.f1
0
Reply Phil 3/11/2010 8:15:27 AM

Phil Carmody <thefatphil_demunged@yahoo.co.uk> writes:

> Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
>> Ian Collins <ian-news@hotmail.com> writes:
>>> MBALOVER wrote:
>>>> Hi all,
>>>>
>>>> Actually I want my code to be
>>>>
>>>> for (i=0;i<N;i++)
>>>> {
>>>>     array2[i*30]=array1[i];
>>>> ..................
>>>>     array2[i*30+29]=array1[i];
>>>> }
>>>>
>>>> If there anyway to use macros to do it instead for listing out
>>>> manually in the code like above?
>>>
>>> Forget macros, they won't help here.
>>>
>>> How about memset?
>
> On many installations, you'll find that's a macro! ;-)
>
>>> (untested)
>>>
>>> memset( array2+i*30, array1[i], 30*sizeof(array2[0]) );
>>
>> As Keith has pointed out memset won't do.  memcpy, on the other hand,
>> can do it in 5 calls:
>>
>>   array2[i+30] = array1[i];
>>   memcpy(array2+i*30 +  1, array2[i+30],      sizeof array2[i]);
>>   memcpy(array2+i*30 +  2, array2[i+30],  2 * sizeof array2[i]);
>>   memcpy(array2+i*30 +  4, array2[i+30],  4 * sizeof array2[i]);
>>   memcpy(array2+i*30 +  8, array2[i+30],  8 * sizeof array2[i]);
>>   memcpy(array2+i*30 + 16, array2[i+30], 14 * sizeof array2[i]);
>>
>> Note the final multiplier (if I've go it right).
>
> 14's right. The +30's should be *30's.

Of course.  Thanks.

> I timed this against a naive loop for int, long long, and 
> double types. Conclusions were inconclusive. For double, 
> the naive loop was faster when compiled for plain i386, but 
> slower when 'optimised' for the actual athlon-xp architecture. 
> For int and long long, the memcpys were faster than the loop.
> The memcpys were in fact simply unrolled completely.

Was this for the *30 example or the latter *1000 example?

> Given that there was a case where the memcpys were measurably 
> slower, it's not worth blindly going for the clever technique 
> until you've both found that the naive loop is slowing you down
> and measured the two to make sure that the optimisation is in 
> fact an improvement. Writing a clever version for speed without 
> measuring whether it's faster or not is dumb. And frequently 
> done, alas.

The measuring is important but you can't find out that it is not
always faster without writing it!  Having written it, I'd be tempted
to keep it around in some sort of conditional compilation so that it
can be switched in when/if measurements show that it is better on some
architecture/implementation.

One does ave to weight the maintenance cost, so there must be at least
the possibility of a benefit.

-- 
Ben.
0
Reply Ben 3/11/2010 2:05:40 PM

Squeamizh wrote:
> On Mar 10, 5:11 pm, Flash Gordon <s...@spam.causeway.com> wrote:

<snip>

>> It's not *that* long ago that I was programming an embedded processor
>> that had a trick three stage pipeline (all instructions when through all
>> three stages even if they only did something in one of them), no
>> out-of-order execution and no cache. So excluding wait states for memory
>> all instructions took in three cycles. However, in practice, due to the
>> pipelining, all instructions effectively took one cycle except in the
>> presence of a branch. The processor, and it's derivatives, are still
>> available today.
> 
> What processor?  Just curious...

Well, it seems the exact range I used is out of production, but the 
family continues with similar attributes... the manual I just looked at 
gives a 6 stage pipeline, but still not cache or out-of-order execution 
or anything like that.

http://focus.ti.com/lit/ug/spru131g/spru131g.pdf

One advantage of such processors is that sometimes you have hard 
real-time requirements where something has to take an *exact* amount of 
time, and doing the output early is just as bad as doing it late. Or 
sometimes, the exact time does not matter as long as every run is a 
consistent amount of time. In such situations *not* having a cache or 
anything else introducing variability is *very* useful, and you can 
count clock cycles.
-- 
Flash Gordon
0
Reply Flash 3/11/2010 9:23:26 PM

Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
> Phil Carmody <thefatphil_demunged@yahoo.co.uk> writes:
>> Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
>>> Ian Collins <ian-news@hotmail.com> writes:
>>>> MBALOVER wrote:
>>>>> Hi all,
>>>>>
>>>>> Actually I want my code to be
>>>>>
>>>>> for (i=0;i<N;i++)
>>>>> {
>>>>>     array2[i*30]=array1[i];
>>>>> ..................
>>>>>     array2[i*30+29]=array1[i];
>>>>> }
>>>>>
>>>>> If there anyway to use macros to do it instead for listing out
>>>>> manually in the code like above?
>>>>
>>>> Forget macros, they won't help here.
>>>>
>>>> How about memset?
>>
>> On many installations, you'll find that's a macro! ;-)
>>
>>>> (untested)
>>>>
>>>> memset( array2+i*30, array1[i], 30*sizeof(array2[0]) );
>>>
>>> As Keith has pointed out memset won't do.  memcpy, on the other hand,
>>> can do it in 5 calls:
>>>
>>>   array2[i+30] = array1[i];
>>>   memcpy(array2+i*30 +  1, array2[i+30],      sizeof array2[i]);
>>>   memcpy(array2+i*30 +  2, array2[i+30],  2 * sizeof array2[i]);
>>>   memcpy(array2+i*30 +  4, array2[i+30],  4 * sizeof array2[i]);
>>>   memcpy(array2+i*30 +  8, array2[i+30],  8 * sizeof array2[i]);
>>>   memcpy(array2+i*30 + 16, array2[i+30], 14 * sizeof array2[i]);
>>>
>>> Note the final multiplier (if I've go it right).
>>
>> 14's right. The +30's should be *30's.
>
> Of course.  Thanks.
>
>> I timed this against a naive loop for int, long long, and 
>> double types. Conclusions were inconclusive. For double, 
>> the naive loop was faster when compiled for plain i386, but 
>> slower when 'optimised' for the actual athlon-xp architecture. 
>> For int and long long, the memcpys were faster than the loop.
>> The memcpys were in fact simply unrolled completely.
>
> Was this for the *30 example or the latter *1000 example?

*30

>> Given that there was a case where the memcpys were measurably 
>> slower, it's not worth blindly going for the clever technique 
>> until you've both found that the naive loop is slowing you down
>> and measured the two to make sure that the optimisation is in 
>> fact an improvement. Writing a clever version for speed without 
>> measuring whether it's faster or not is dumb. And frequently 
>> done, alas.
>
> The measuring is important but you can't find out that it is not
> always faster without writing it!  Having written it, I'd be tempted
> to keep it around in some sort of conditional compilation so that it
> can be switched in when/if measurements show that it is better on some
> architecture/implementation.

Indeed. I hope that my comment wasn't taken as a "don't even 
suggest it" one. Of course, you have to suggest possibilities
before they are demonstrated to be preferable (likewise inferior).

> One does ave to weight the maintenance cost, so there must be at least
> the possibility of a benefit.

A simple comment like "/* we now have 16, only need 14 more */" would
be more than enough in your memcpy version. As someone who's maintained
a lot of old code (not admitting to how much of it was originally mine -
having said that, having about a million lines of crap dumped on me once
probably dwarfs all the crap that I've written (I hope)), I do quite
often very clearly see the maintainter's view.

In fact one mantra I picked up recently was something along the lines of
"you are always writing code to be read by someone else - yourself in the 
future".

Phil
-- 
I find the easiest thing to do is to k/f myself and just troll away
-- David Melville on r.a.s.f1
0
Reply Phil 3/12/2010 12:40:45 AM

Phil Carmody wrote:
 
> In fact one mantra I picked up recently was something
> along the lines of
> "you are always writing code to be read by someone else
> - yourself in the future".

The first time that I saw C code that I wrote,
but hadn't seen in six months,
I was surprised by how hard it was for me to understand it.

-- 
pete
0
Reply pete 3/12/2010 12:50:20 AM

On 2010-03-12, pete <pfiland@mindspring.com> wrote:
> The first time that I saw C code that I wrote,
> but hadn't seen in six months,
> I was surprised by how hard it was for me to understand it.

I recently had occasion to dive into some code I had written roughly a
year and a half ago, and let me tell you, I am glad that there were
comments.

My favorite:

/* Want to mess with this?  You will need a young priest and an old priest. */

That code is gone now.  :)

-s
-- 
Copyright 2010, all wrongs reversed.  Peter Seebach / usenet-nospam@seebs.net
http://www.seebs.net/log/ <-- lawsuits, religion, and funny pictures
http://en.wikipedia.org/wiki/Fair_Game_(Scientology) <-- get educated!
0
Reply Seebs 3/12/2010 1:10:46 AM

On 03/12/10 01:40 PM, Phil Carmody wrote:
>
> In fact one mantra I picked up recently was something along the lines of
> "you are always writing code to be read by someone else - yourself in the
> future".

Which is why all new code should have comprehensive unit tests.  Want to 
see examples of how to use the code? Read the tests.  Want to see if you 
changes break the code?  Run the tests.  Simple.

My tests certainly save me a lot of time and pain when going back to 
update my old code.

-- 
Ian Collins
0
Reply Ian 3/12/2010 1:15:15 AM

25 Replies
129 Views

(page loaded in 1.103 seconds)

Similiar Articles:


















7/30/2012 1:24:34 PM


Reply: