x86-64 performance: 32 bit vs 64 bit

Dear All --

In a question that has a vague link to Fortran, can anyone enlighten me 
whether the new x86-64 architecture exhibits any performance boost in 
running 64-bit code, as opposed to 32-bit code? I've been labouring 
under the notion that, in 64-bit mode, the extensions to the x86 
instruction set give access to a number of extra general-purpose 
registers. Since register starvation is one of the historical 
bottlenecks with the x86 architecture, it follows that 64-bit code will 
experience a significant performance boost over 32-bit code on the 
x86-64 platform.

This I've gathered from reading a number of articles. However, I've 
recently upgraded to an Athlon64 3000+ system (running Gentoo Linux 
1.4-AMD64), and I've found no real performance boost in running 64-bit 
code (as compiled using the Intel EM64T Fortran compiler) versus running 
32-bit code (as compiled using the Intel IA32 Fortran compiler). I can see 
four reasons for the lack of a performance hike in the 64-bit code:

1) The extra GPRs are available to both 32-bit and 64-bit code
2) The extra GPRs are available only to 64-bit code, but the Intel EM64T 
compiler can't take advantage of them
3) My code does not suffer from register starvation, therefore the extra 
GPRs in 64-bit code have little effect
4) Everything I've read about the GPRs is utter baloney

I'd appreciate hearing people's thoughts (esp. Steve Lionel's) on which 
one of these might be the correct answer.

cheers,

Rich

-- 
Dr Richard H D Townsend
Bartol Research Institute
University of Delaware

[ Delete VOID for valid email address ]
rhdt (1081)
12/14/2004 6:25:02 AM
comp.lang.fortran

Rich Townsend wrote:
> Dear All --
>
> ............
> 3) My code does not suffer from register starvation, therefore the extra
> GPRs in 64-bit code have little effect
>

I too am very interested in the subject.
I can say nothing about the other possibilities, but there was one
instance on classic 32-bit Intel where register starvation hit me in
a very strange way. The name of the game is static allocation. From
perusing the generated assembler code, it appears that x86 assembler
can take a 32 bit displacement; it follows that if the total allocated
data span an area less than 4GB, the starting address of a bunch of
vectors can be computed by using just one base register, plus different
displacements, computed at compile time. This sure reduces the impact
of register starvation; the same code, with the same arrays in the same
module, declared allocatable instead of static suffered from more than
a 50% slowdown; the slowdown was somewhat reduced by compiling with
SSE2 instructions, since in that case the address load for each data
access is amortized over 2 double precision operations.
Looking forward to further insight into this.
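
To make the static versus allocatable contrast concrete, here is a minimal sketch in C (not Fortran, and not the code I actually measured). The names axpy_static, axpy_dynamic and compare_static_dynamic are made up for the illustration, and the comments describe what a compiler may do, not what any particular compiler is guaranteed to do.

```c
#include <stdlib.h>

#define N 1000

/* Statically allocated vectors: their addresses are known at link time,
   so each access can be encoded as a constant displacement plus an index. */
static double xs[N], ys[N], zs[N];

/* Kernel on the static arrays: only the loop index needs a register. */
static void axpy_static(double alpha)
{
    for (int i = 0; i < N; i++)
        zs[i] = alpha * xs[i] + ys[i];
}

/* Kernel on heap arrays: x, y and z are run-time values, so each base
   address competes with the loop index for a general-purpose register. */
static void axpy_dynamic(double alpha, const double *x, const double *y,
                         double *z, int n)
{
    for (int i = 0; i < n; i++)
        z[i] = alpha * x[i] + y[i];
}

/* Runs both kernels on identical data. Returns the number of elements
   that differ (0 when they agree), or -1 if allocation fails. */
int compare_static_dynamic(void)
{
    double *x = malloc(N * sizeof *x);
    double *y = malloc(N * sizeof *y);
    double *z = malloc(N * sizeof *z);
    int mismatches = 0;

    if (!x || !y || !z) {
        free(x); free(y); free(z);
        return -1;
    }
    for (int i = 0; i < N; i++) {
        xs[i] = x[i] = (double)i;
        ys[i] = y[i] = 2.0 * i;
    }
    axpy_static(3.0);
    axpy_dynamic(3.0, x, y, z, N);
    for (int i = 0; i < N; i++)
        if (zs[i] != z[i])
            mismatches++;

    free(x); free(y); free(z);
    return mismatches;
}
```

The two kernels compute the same thing; any difference shows up only in the generated addressing code, which is best checked by reading the assembler output of your own compiler.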

Cheers
Salvatore

sfilippone (77)
12/14/2004 8:18:10 AM
Salvatore wrote:

> I too am very interested in the subject.
> I can tell nothing about the other possibilities, but there was one
> instance in Intel classic 32 bits where register starvation hit me in
> a very strange way. Name of the game is static allocation. From
> perusing the generated assembler code, it appears that x86 assembler
> can take a 32 bit displacement; it follows that if the total allocated
> data span an area less than 4GB, the starting address of a bunch of
> vectors can be computed by using just one base register, plus different
> displacements, computed at compile time. 

(snip)

Over the years there have been many processors where static
allocation was faster than dynamic allocation for exactly
this reason.

Likely it slowed the acceptance of compilers and languages
that did a lot of dynamic allocation.

-- glen

gah (12851)
12/14/2004 8:57:22 AM
In article <cpm16d$5nj$1@scrotar.nss.udel.edu>, Rich Townsend wrote:
> Dear All --
> 
> In a question that has a vague link to Fortran, can anyone enlighten me 
> whether the new x86-64 architecture exhibits any performance boost in 
> running 64-bit code, as opposed to 32-bit code? I've been labouring 
> under the notion that, in 64-bit mode, the extensions to the x86 
> instruction set give access to a number of extra general-purpose 
> registers. 

Yes, correct. This link, which has a comparison of x86-64 and plain
old x86, came up on comp.arch recently

http://h200001.www2.hp.com/bc/docs/support/SupportManual/c00238028/c00238028.pdf

In addition to extra GPRs, x86-64 also doubles the number of SSE2
(XMM) registers, from 8 to 16.

> This I've gathered from reading a number of articles. However, I've 
> recently upgraded to an Athlon64 3000+ system (running Gentoo Linux 
> 1.4-AMD64), and I've found no real performance boost in running 32-bit 
> code (as compiled using the Intel IA32 Fortran compiler) versus running 
> 64-bit code (as compiled using the Intel EM64T Fortran compiler). I can see
> four reasons for the lack of a performance hike in the 64-bit code:
> 
> 1) The extra GPRs are available to both 32-bit and 64-bit code

No, they are available only in 64-bit mode.

> 2) The extra GPRs are available only to 64-bit code, but the Intel EM64T
> compiler can't take advantage of them

I'd be _very_ surprised if this is the case.

> 3) My code does not suffer from register starvation, therefore the extra 
> GPRs in 64-bit code have little effect

Probably, to an extent.

> 4) Everything I've read about the GPRs is utter baloney

AFAIK you seem to be on the right track.

> I'd appreciate hearing people's thoughts (esp. Steve Lionel's) on which 
> one of these might be the correct answer.

Well I ain't Steve Lionel, but let me offer my guess.

The main thing you seem to have forgotten is that in 64-bit mode, all
addresses are 64 bits instead of 32. This means that the code size
will be larger, and thus it requires more memory bandwidth. I read
somewhere that on average, code size will be about 20 % bigger with
x86-64 than with x86 (OTOH I guess that for some numerical code which
is mostly about huge dense arrays and not pointer heavy data
structures, the difference might be much smaller). So, whether your
code gains from x86-64 depends on whether the speedup from more
registers outweighs the slowdown due to increased memory bandwidth
usage.
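
To illustrate the parenthetical point about pointer-heavy data structures versus dense arrays, here is a small sketch in C. The struct names node and block and the three helper functions are hypothetical, chosen only for this example. It shows data size, not code size, so it covers only part of the argument.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* A pointer-heavy structure: two links plus a small payload.
   Its size grows when pointers widen from 32 to 64 bits. */
struct node {
    struct node *next;
    struct node *prev;
    int32_t      key;
};

/* A dense numerical payload: it contains no pointers, so its size
   is the same in 32-bit and 64-bit builds. */
struct block {
    double values[1024];
};

/* Width of a data pointer in bits for the current build. */
size_t pointer_bits(void) { return sizeof(void *) * CHAR_BIT; }

/* Size in bytes of one list node, including any padding. */
size_t node_bytes(void) { return sizeof(struct node); }

/* Size in bytes of the dense block. */
size_t block_bytes(void) { return sizeof(struct block); }
```

On a typical Linux toolchain, node_bytes() comes out as 12 in a 32-bit build and 24 in a 64-bit build, while block_bytes() stays at 8192 in both. Those figures depend on the ABI's alignment rules, so treat them as typical and check your own platform.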


-- 
Janne Blomqvist
foo33 (1454)
12/14/2004 9:18:21 AM
Janne Blomqvist wrote:
> In article <cpm16d$5nj$1@scrotar.nss.udel.edu>, Rich Townsend wrote:
>
> (snip)
>
> The main thing you seem to have forgotten is that in 64-bit mode, all
> addresses are 64 bits instead of 32. This means that the code size
> will be larger, and thus it requires more memory bandwidth. I read
> somewhere that on average, code size will be about 20 % bigger with
> x86-64 than with x86 (OTOH I guess that for some numerical code which
> is mostly about huge dense arrays and not pointer heavy data
> structures, the difference might be much smaller). So, whether your
> code gains from x86-64 depends on whether the speedup from more
> registers outweighs the slowdown due to increased memory bandwidth
> usage.
> 
> 
Perhaps Rich's code is constrained by memory bandwidth, but it could 
be data rather than code that is the constraint.

Our benchmarks show, on average, about 4% improvement from using the 
EM64T as opposed to the IA32 compiler.  However, some benchmarks show 
~15% improvement, while others show none.  For example, our result for 
CHANNEL is about the same for both compilers.  CHANNEL is quite often 
anomalous - the time it takes seems to depend more on memory bandwidth 
than anything else, including CPU speed.

The PGI compiler behaves in a similar way, but Absoft and NAG both show 
some speed-up for CHANNEL in 64-bit mode - so it's not clear-cut.

-- 
John Appleyard  - (send email to john!news@.. rather than spamtrap@..)
Polyhedron Software
Programs for Programmers - QA, Compilers, Graphics, Consultancy
********* Visit our Web site on http://www.polyhedron.co.uk/ *********
spamtrap7925 (139)
12/14/2004 10:43:56 AM
"Janne Blomqvist" <foo@bar.invalid> wrote in message
news:slrncrtbut.a53.foo@vipunen.hut.fi...
> In article <cpm16d$5nj$1@scrotar.nss.udel.edu>, Rich Townsend wrote:
> > Dear All --
> >
> > In a question that has a vague link to Fortran, can anyone enlighten me
> > whether the new x86-64 architecture exhibits any performance boost in
> > running 64-bit code, as opposed to 32-bit code? I've been labouring
> > under the notion that, in 64-bit mode, the extensions to the x86
> > instruction set give access to a number of extra general-purpose
> > registers.
>
> Yes, correct. This link, which has a comparison of x86-64 and plain
> old x86, came up on comp.arch recently
>
> http://h200001.www2.hp.com/bc/docs/support/SupportManual/c00238028/c00238028.pdf
>
> In addition to extra GPR:s, x86-64 also doubles the number of SSE2
> registers.
>
> > This I've gathered from reading a number of articles. However, I've
> > recently upgraded to an Athlon64 3000+ system (running Gentoo Linux
> > 1.4-AMD64), and I've found no real performance boost in running 32-bit
> > code (as compiled using the Intel IA32 Fortran compiler) versus running
> > 64-bit code (as compiled using the Intel EM64T Fortran compiler). I can see
> > four reasons for the lack of a performance hike in the 64-bit code:
> >
> > 1) The extra GPRs are available to both 32-bit and 64-bit code
>
> No, they are available only in 64-bit mode.
>
> > 2) The extra GPRs are available only to 64-bit code, but the Intel EM64T
> > compiler can't take advantage of them
>
> I'd be _very_ surprised if this is the case.
When vectorizing, the Intel EM64T compiler "distributes" (splits) loops down
to avoid thrashing Write Combine buffers, aiming to boost HyperThreading
performance.  In loops which store to more than 4 array sections, this tends
to under-utilize the additional registers.  There are options to change
this, most effective if you turn off HT (or run on AMD64).  I don't see why
the emphasis on GPRs, which would be used for scalar integers and pointers,
while floating point or integer vector operations are important in many
Fortran applications.
>
> > 3) My code does not suffer from register starvation, therefore the extra
> > GPRs in 64-bit code have little effect
>
> Probably, to an extent.
In that case, no compiler will give you an advantage from additional
registers.
>
> > 4) Everything I've read about the GPRs is utter baloney
>
> AFAIK you seem to be on the right track.
>
> The main thing you seem to have forgotten is that in 64-bit mode, all
> addresses are 64 bits instead of 32. This means that the code size
> will be larger, and thus it requires more memory bandwidth. I read
> somewhere that on average, code size will be about 20 % bigger with
> x86-64 than with x86 (OTOH I guess that for some numerical code which
> is mostly about huge dense arrays and not pointer heavy data
> structures, the difference might be much smaller). So, whether your
> code gains from x86-64 depends on whether the speedup from more
> registers outweighs the slowdown due to increased memory bandwidth
> usage.
The above-maligned Intel compiler performs certain optimizations within
loops, when it can determine from the source code that nothing will overflow
32-bit pointers.  This would minimize the potential loss in performance of
an application which fits in 32 bits, and might contribute to leaving some
GPRs idle. Certain other brands of compilers require you to throw on a
switch to use full 64-bit addressing.
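
The loop distribution mentioned earlier in this post can be sketched by hand in C. This is only an analogue of the transformation, not the compiler's actual algorithm, and the function names store6_fused, store6_distributed and store6_forms_agree are invented for the illustration. Splitting is valid here because no iteration depends on another and the outputs are assumed not to alias the inputs.

```c
#include <stddef.h>

/* Fused form: a single loop that stores to six arrays, so six output
   streams are written concurrently. */
void store6_fused(size_t n, const double *a, const double *b,
                  double *o1, double *o2, double *o3,
                  double *o4, double *o5, double *o6)
{
    for (size_t i = 0; i < n; i++) {
        o1[i] = a[i] + b[i];
        o2[i] = a[i] - b[i];
        o3[i] = a[i] * b[i];
        o4[i] = 2.0 * a[i];
        o5[i] = 2.0 * b[i];
        o6[i] = a[i] * a[i];
    }
}

/* Distributed form: the same work split into two loops, each storing
   to only three arrays and needing fewer live addresses at once. */
void store6_distributed(size_t n, const double *a, const double *b,
                        double *o1, double *o2, double *o3,
                        double *o4, double *o5, double *o6)
{
    for (size_t i = 0; i < n; i++) {
        o1[i] = a[i] + b[i];
        o2[i] = a[i] - b[i];
        o3[i] = a[i] * b[i];
    }
    for (size_t i = 0; i < n; i++) {
        o4[i] = 2.0 * a[i];
        o5[i] = 2.0 * b[i];
        o6[i] = a[i] * a[i];
    }
}

/* Returns 1 if both forms give identical results on a small input. */
int store6_forms_agree(void)
{
    enum { N = 64 };
    double a[N], b[N], f[6][N], d[6][N];

    for (int i = 0; i < N; i++) {
        a[i] = (double)i;
        b[i] = (double)(N - i);
    }
    store6_fused(N, a, b, f[0], f[1], f[2], f[3], f[4], f[5]);
    store6_distributed(N, a, b, d[0], d[1], d[2], d[3], d[4], d[5]);

    for (int k = 0; k < 6; k++)
        for (int i = 0; i < N; i++)
            if (f[k][i] != d[k][i])
                return 0;
    return 1;
}
```

The distributed form reads a and b twice, which is the price paid for fewer simultaneous store streams. Whether that trade pays off depends on the machine.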


tprince (584)
12/14/2004 2:46:36 PM
Tim Prince wrote:

> When vectorizing, the Intel EM64T compiler "distributes" (splits) loops down
> to avoid thrashing Write Combine buffers, aiming to boost HyperThreading
> performance.  In loops which store to more than 4 array sections, this tends
> to under-utilize the additional registers.  There are options to change
> this, most effective if you turn off HT (or run on AMD64).  I don't see why
> the emphasis on GPRs, which would be used for scalar integers and pointers,
> while floating point or integer vector operations are important in many
> Fortran applications.

Ah, I hadn't realized that the GPRs are for integer and pointers only; I 
thought they could hold floating-point data too. That explains a lot, as 
do all of the other replies that I have received in this thread -- 
thanks! The PDF file that Janne Blomqvist pointed to was particularly 
helpful.

cheers,

Rich

-- 
Dr Richard H D Townsend
Bartol Research Institute
University of Delaware

[ Delete VOID for valid email address ]
rhdt (1081)
12/14/2004 4:52:35 PM
Rich Townsend wrote:
> Tim Prince wrote:
>
>......
> > this, most effective if you turn off HT (or run on AMD64).  I don't see why
> > the emphasis on GPRs, which would be used for scalar integers and pointers,
> > while floating point or integer vector operations are important in many
> > Fortran applications.
>
> Ah, I hadn't realized that the GPRs are for integer and pointers only; I
> thought they could hold floating-point data too. That explains a lot, as
> do all of the other replies that I have received in this thread --
> .....

Suppose you have 8 vectors in one and the same loop (I have a number of
examples of this). The addresses into those vectors have to be
calculated, and in x86-32 this sure puts a lot of stress on the GPRs,
unless they are statically allocated, in which case some of the burden
can be put back onto the linker. 8 GPRs are way too few, considering
that around 2 of them are almost always reserved for linkage purposes.
Plus, the register allocation techniques used in the compilers are
heuristics; the references I have read on the subject mention that
current heuristics start working well with 16 or more registers, so
there is a nonlinear effect in doubling the GPRs.
As usual, YMMV
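
A rough sketch of the eight-vector case, written in C for illustration only. The names sum7 and sum7_check are hypothetical and this is not one of my real kernels. The tally in the comment counts integer values that must stay live in the loop when every vector is reached through its own run-time base address.

```c
#include <stddef.h>

/* Eight vectors in one loop: one result plus seven operands.
   Live integer values inside the loop:
     8 base addresses + 1 index + 1 loop bound = 10.
   IA-32 offers 8 GPRs in total (one of them the stack pointer),
   so some of these values must be spilled to memory.
   x86-64 offers 16 GPRs, so all 10 can stay in registers. */
void sum7(size_t n, double *r,
          const double *v1, const double *v2, const double *v3,
          const double *v4, const double *v5, const double *v6,
          const double *v7)
{
    for (size_t i = 0; i < n; i++)
        r[i] = v1[i] + v2[i] + v3[i] + v4[i] + v5[i] + v6[i] + v7[i];
}

/* Self-check: with vk[i] = k * i the result must be 28 * i.
   Returns 1 on success, 0 on any mismatch. */
int sum7_check(void)
{
    enum { N = 16 };
    double v[7][N], r[N];

    for (int k = 0; k < 7; k++)
        for (int i = 0; i < N; i++)
            v[k][i] = (double)((k + 1) * i);

    sum7(N, r, v[0], v[1], v[2], v[3], v[4], v[5], v[6]);

    for (int i = 0; i < N; i++)
        if (r[i] != 28.0 * i)
            return 0;
    return 1;
}
```

A compiler can sometimes reduce the count, for example by folding the loop bound into the index, so take the tally as an upper-end estimate and not as a measurement.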

Cheers
Salvatore

sfilippone (77)
12/14/2004 5:10:33 PM