Is there anyone out there actually using SSE in their programs?
Is there a forum anywhere for discussing how to do SSE programming?
I think I may have figured out the easiest way to make effective
use of SSE with the gcc toolchain.
Thanks,
Dave Feustel
spamtrap2 (1628) | 10/31/2006 12:03:38 AM

dave <spamtrap@crayne.org> writes:
> Is there anyone out there actually using SSE in their programs?
SSE or SSE/SSE2/SSE3/SSSE3?
> Is there a forum anywhere for discussing how to do SSE programming?
I'd like to see it take place here. I could learn a lot from
several of the regulars here if they were to join in, I'm sure.
> I think I may have figured out the easiest way to make effective
> use of SSE with the gcc toolchain.
*intrinsics.h ?
Phil
--
"Home taping is killing big business profits. We left this side blank
so you can help." -- Dead Kennedys, written upon the B-side of tapes of
/In God We Trust, Inc./.
Phil | 10/31/2006 8:36:33 AM

dave <spamtrap@crayne.org> writes:
> I think I may have figured out the easiest way to make effective
> use of SSE with the gcc toolchain.
Not gcc /per se/, but Dan Bernstein is working on a new version
of QHAsm, which might be of use. I've just downloaded the latest
version and expect to play with it in coming days, and can report
back if it makes SS* programming easier than raw asm.
Phil
--
"Home taping is killing big business profits. We left this side blank
so you can help." -- Dead Kennedys, written upon the B-side of tapes of
/In God We Trust, Inc./.
Phil | 10/31/2006 8:38:16 AM

> Is there anyone out there actually using SSE in their programs?
http://board.flatassembler.net/topic.php?t=6020
For more, search the board.
> Is there a forum anywhere for discussing how to do SSE programming?
you surely can discuss this on
board.flatassembler.net
asmcommunity.net
> I think I may have figured out the easiest way to make effective
> use of SSE with the gcc toolchain.
No idea what you are talking about ;)
Maybe you could explain...?
vid512 | 10/31/2006 11:42:20 AM

Phil Carmody <thefatphil_demunged@yahoo.co.uk> wrote:
> dave <spamtrap@crayne.org> writes:
>> Is there anyone out there actually using SSE in their programs?
>
> SSE or SSE/SSE2/SSE3/SSSE3?
>
>> Is there a forum anywhere for discussing how to do SSE programming?
>
> I'd like to see it take place here. I could learn a lot from
> several of the regulars here if they were to join in, I'm sure.
>
>> I think I may have figured out the easiest way to make effective
>> use of SSE with the gcc toolchain.
>
> *intrinsics.h ?
Using intrinsics works, but the overhead gcc introduces in the way it
implements them makes SSE code run just as slowly as code that does
not use SSE at all.
> Phil
dave | 10/31/2006 10:14:19 PM

vid512@gmail.com <spamtrap@crayne.org> wrote:
>> Is there anyone out there actually using SSE in their programs?
> http://board.flatassembler.net/topic.php?t=6020
> for more search the board
>
>> Is there a forum anywhere for discussing how to do SSE programming?
> you surely can discuss this on
> board.flatassembler.net
> asmcommunity.net
>
>> I think I may have figured out the easiest way to make effective
>> use of SSE with the gcc toolchain.
> no idea what you are talking about ;)
> maybe you could explain... ?
The intrinsic functions for SSE implemented by gcc are actually function
calls to subroutines in which the SSE instruction is embedded. I suspect
the overhead of the call nullifies any speed advantage of doing 2 or 4
arithmetic operations at once. I've decided that the way to use the SSE
instructions efficiently is to write three-address math functions in C,
use gcc to generate an assembly source file from the C code, and then
replace the gcc-generated assembly code with customized SSE assembly code.
One advantage of this approach is that the construct 'double x[4][4];'
makes SSE data initialization and computation easy from C once the SSE
functions (e.g. matrix arithmetic) have been implemented.
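For comparison, the three-address idea can also be sketched directly in C with the SSE2 intrinsics from <emmintrin.h>. This is only an illustration: the name mat4_add is hypothetical, and unaligned loads and stores are used so that a plain 'double x[4][4];' works without alignment attributes.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: __m128d, _mm_add_pd, ... */

/* Hypothetical three-address function: dst = a + b for 4x4 matrices.
 * Each __m128d holds two doubles, so a row takes two packed additions. */
void mat4_add(double dst[4][4], double a[4][4], double b[4][4])
{
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j += 2) {
            __m128d va = _mm_loadu_pd(&a[i][j]);   /* a[i][j], a[i][j+1] */
            __m128d vb = _mm_loadu_pd(&b[i][j]);   /* b[i][j], b[i][j+1] */
            _mm_storeu_pd(&dst[i][j], _mm_add_pd(va, vb));
        }
    }
}
```

A caller declares three 'double m[4][4];' arrays and passes them by name, exactly as it would for a hand-written assembly routine with the same prototype.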
dave | 10/31/2006 10:24:26 PM

Phil Carmody <thefatphil_demunged@yahoo.co.uk> wrote:
> dave <spamtrap@crayne.org> writes:
>> Is there anyone out there actually using SSE in their programs?
>
> SSE or SSE/SSE2/SSE3/SSSE3?
All of the above, plus SSE4 and GPU, although not all of those are
available to me.
>> Is there a forum anywhere for discussing how to do SSE programming?
>
> I'd like to see it take place here. I could learn a lot from
> several of the regulars here if they were to join in, I'm sure.
>
>> I think I may have figured out the easiest way to make effective
>> use of SSE with the gcc toolchain.
>
> *intrinsics.h ?
Intrinsics as implemented in gcc are so inefficient they are not worth
using.
> Phil
dave | 10/31/2006 11:49:49 PM

Phil Carmody <thefatphil_demunged@yahoo.co.uk> wrote:
> dave <spamtrap@crayne.org> writes:
>> I think I may have figured out the easiest way to make effective
>> use of SSE with the gcc toolchain.
>
> Not gcc /per se/, but Dan Bernstein is working on a new version
> of QHAsm, which might be of use. I've just downloaded the latest
> version and expect to play with it in coming days, and can report
> back if it makes SS* programming easier than raw asm.
>
> Phil
Dan Bernstein's web page looks interesting.
dave | 10/31/2006 11:50:38 PM

vid512@gmail.com <spamtrap@crayne.org> wrote:
>> Is there anyone out there actually using SSE in their programs?
> http://board.flatassembler.net/topic.php?t=6020
> for more search the board
>
>> Is there a forum anywhere for discussing how to do SSE programming?
> you surely can discuss this on
> board.flatassembler.net
> asmcommunity.net
>
>> I think I may have figured out the easiest way to make effective
>> use of SSE with the gcc toolchain.
> no idea what you are talking about ;)
> maybe you could explain... ?
>
SSE instructions speed up math by doing operations in parallel.
But the SIMD operations (SSE, SSE2, SSE3, SSSE3, SSE4, 3DNow!, 3DNow!+) are not
really accessible in C or C++; they are only accessible in assembler.
It looks to me like the way to use SSE instructions is to write C
functions with 3 arguments (the addresses of 2 source matrices and 1
destination matrix) and then have gcc generate the assembly version of the function.
Then replace the assembly code in the function body with custom code
using SSE instructions. The code produced makes good use of SSE
capabilities.
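For reference, gcc does ship headers such as <xmmintrin.h> (SSE) and <emmintrin.h> (SSE2) that expose the packed operations to C. A minimal sketch of "operations in parallel" with three address arguments, assuming a hypothetical helper named axpy4 that works on four floats at once:

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_mul_ps, _mm_add_ps */

/* Hypothetical helper: dst[i] = s * x[i] + y[i] for i = 0..3.
 * One packed multiply and one packed add cover all four lanes. */
void axpy4(float *dst, float s, const float *x, const float *y)
{
    __m128 vs = _mm_set1_ps(s);        /* broadcast s into all 4 lanes */
    __m128 vx = _mm_loadu_ps(x);       /* unaligned load of x[0..3]    */
    __m128 vy = _mm_loadu_ps(y);       /* unaligned load of y[0..3]    */
    _mm_storeu_ps(dst, _mm_add_ps(_mm_mul_ps(vs, vx), vy));
}
```

Whether this beats scalar code depends on the surrounding loop and on data layout, so it is worth measuring rather than assuming.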
dave | 10/31/2006 11:56:46 PM

> Then replace the assembly code in the function body with custom code
> using SSE instructions. The code produced makes good use of sse
> capabilities.
Why do it the hard, hackish way? You can write the ASM part in any assembler
that supports the needed object format (ELF, I think), and then link it with
your C program.
vid512 | 11/1/2006 2:34:51 AM

dave wrote:
> The intrinsic functions for sse implemented by gcc actually are
> function calls to subroutines in which the sse instruction is
> embedded.
You've got to be kidding - it can't be *that* lame?
Are you sure you're not missing something to actually make the intrinsics
intrinsic? For instance with VC2003, the Interlocked* functions are actually
function calls until you "do some stuff", after which they translate
directly to stuff like xadd.
f0dder | 11/1/2006 9:59:25 AM

vid512@gmail.com <spamtrap@crayne.org> wrote:
>> Then replace the assembly code in the function body with custom code
>> using SSE instructions. The code produced makes good use of sse
>> capabilities.
> why do it the hard hackish way? you can write ASM part in any assembler
> that supports needed object format (ELF i think), and then link it with
> your C program.
>
True, but gcc generates a lot of extra stuff which seems to be used by
gdb in debugging. Plus the passing of arguments to the assembly code is
done by gcc.
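If the attraction is that gcc handles the argument passing, GCC's extended inline asm is another route worth knowing about: the compiler picks the registers and the hand-written part is a single instruction. A rough sketch, assuming an x86 target with SSE2 and a hypothetical function name add2_asm:

```c
#include <emmintrin.h>

/* Hypothetical example: dst[0..1] = a[0..1] + b[0..1], with the addition
 * itself written by hand. gcc loads the operands into XMM registers
 * ("x" constraint) and handles the calling convention. */
void add2_asm(double *dst, const double *a, const double *b)
{
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    /* AT&T operand order: addpd source, destination. "+x" marks va as
     * both read and written. */
    __asm__("addpd %1, %0" : "+x"(va) : "x"(vb));
    _mm_storeu_pd(dst, va);
}
```

The debugging information and prologue are still generated by gcc, so nothing has to be edited in the compiler's assembly output.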
dave | 11/1/2006 10:16:01 AM

dave <spamtrap@crayne.org> writes:
> The intrinsic functions for sse implemented by gcc actually are function
> calls to subroutines in which the sse instruction is embedded. I suspect
> the overhead of the call nullifies any speed advantage of doing 2 or 4
> arithmetic operations at once.
Have you forgotten to turn on -O?
The intrinsics are inline when I use them.
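A small test case makes this easy to check on any setup. The function below is a hypothetical example (vec_add is not from anyone's code in this thread). In GCC's headers the intrinsics are small inline wrapper functions, so whether a real call survives can depend on the compiler version and on optimization being enabled.

```c
#include <emmintrin.h>
#include <stddef.h>

/* Hypothetical test function: dst[i] = a[i] + b[i] for n doubles.
 * Pairs go through the packed SSE2 add; an odd last element is
 * handled by plain scalar code. */
void vec_add(double *dst, const double *a, const double *b, size_t n)
{
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(dst + i, _mm_add_pd(va, vb));
    }
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Compiling this with 'gcc -O -msse2 -S' and again without -O, then looking for addpd versus call instructions in the two outputs, shows whether the intrinsics were inlined.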
Phil
--
"Home taping is killing big business profits. We left this side blank
so you can help." -- Dead Kennedys, written upon the B-side of tapes of
/In God We Trust, Inc./.
Phil | 11/1/2006 5:01:59 PM

dave <spamtrap@crayne.org> writes:
> Intrinsics as implemented in gcc are so inefficient they are not worth
> using.
bash-3.1$ gcc-4.1 -march=pentium4 -O -o bs.o -g -c bs.c
bash-3.1$ objdump -d -S bs.o
[...]
inline void c4(register complex *a)
{
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: 8b 45 08 mov 0x8(%ebp),%eax
register __m128d p1=((__m128d*)a)[1];
register __m128d p3=((__m128d*)a)[3];
6: 66 0f 28 50 30 movapd 0x30(%eax),%xmm2
register __m128d p0=((__m128d*)a)[0];
b: 66 0f 28 00 movapd (%eax),%xmm0
register __m128d p2=((__m128d*)a)[2];
f: 66 0f 28 48 20 movapd 0x20(%eax),%xmm1
register __m128d p6=p1-p3; //x1-x3,y1-y3
register __m128d p7=p1+p3; //x1+x3,y1+y3
14: 66 0f 28 58 10 movapd 0x10(%eax),%xmm3
19: 66 0f 58 da addpd %xmm2,%xmm3
register __m128d p4=p0-p2; //x0-x2,y0-y2
1d: 66 0f 28 d0 movapd %xmm0,%xmm2
21: 66 0f 5c d1 subpd %xmm1,%xmm2
register __m128d p5=p0+p2; //x0+x2,y0+y2
25: 66 0f 58 c1 addpd %xmm1,%xmm0
register __m128d p9;
_mm_shuffle_pd(p9,p6,_MM_SHUFFLE2(0,1)); //y1-y3,x1-x3
((__m128d*)a)[0]=p5+p7;
29: 66 0f 28 cb movapd %xmm3,%xmm1
2d: 66 0f 58 c8 addpd %xmm0,%xmm1
31: 66 0f 29 08 movapd %xmm1,(%eax)
((__m128d*)a)[2]=p5-p7;
35: 66 0f 5c c3 subpd %xmm3,%xmm0
39: 66 0f 29 40 20 movapd %xmm0,0x20(%eax)
((__m128d*)a)[1]=p4-p9;
3e: 66 0f 28 ca movapd %xmm2,%xmm1
42: 66 0f 5c c8 subpd %xmm0,%xmm1
46: 66 0f 29 48 10 movapd %xmm1,0x10(%eax)
((__m128d*)a)[3]=p4+p9;
4b: 66 0f 58 d1 addpd %xmm1,%xmm2
4f: 66 0f 29 50 30 movapd %xmm2,0x30(%eax)
}
54: 5d pop %ebp
55: c3 ret
Precisely which of those lines did you consider to be calls
to helper functions which contain the actual SSE operation?
(as claimed in a different post in this thread.)
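For anyone who wants to compile something similar, here is a self-contained sketch of a function shaped like the c4 above. The bs.c source and the definition of 'complex' are not shown in the thread, so the struct layout and the name c4_sketch are assumptions. Note also that the listing calls _mm_shuffle_pd without using its return value, and no shufpd appears in the disassembly, so p9 is read uninitialized there. The sketch assigns the result to p9, which is a guess at the intent based on the "//y1-y3,x1-x3" comment.

```c
#include <emmintrin.h>

/* Assumed layout: one complex value is two adjacent doubles. */
typedef struct { double x, y; } cplx;

/* Sketch of a 4-point butterfly on a[0..3], written with named
 * intrinsics instead of gcc's vector operators. Unaligned loads and
 * stores are used; the original listing uses aligned movapd. */
void c4_sketch(cplx *a)
{
    __m128d p0 = _mm_loadu_pd(&a[0].x);
    __m128d p1 = _mm_loadu_pd(&a[1].x);
    __m128d p2 = _mm_loadu_pd(&a[2].x);
    __m128d p3 = _mm_loadu_pd(&a[3].x);

    __m128d p4 = _mm_sub_pd(p0, p2);   /* x0-x2, y0-y2 */
    __m128d p5 = _mm_add_pd(p0, p2);   /* x0+x2, y0+y2 */
    __m128d p6 = _mm_sub_pd(p1, p3);   /* x1-x3, y1-y3 */
    __m128d p7 = _mm_add_pd(p1, p3);   /* x1+x3, y1+y3 */

    /* Swap the two halves of p6: y1-y3, x1-x3 */
    __m128d p9 = _mm_shuffle_pd(p6, p6, _MM_SHUFFLE2(0, 1));

    _mm_storeu_pd(&a[0].x, _mm_add_pd(p5, p7));
    _mm_storeu_pd(&a[2].x, _mm_sub_pd(p5, p7));
    _mm_storeu_pd(&a[1].x, _mm_sub_pd(p4, p9));
    _mm_storeu_pd(&a[3].x, _mm_add_pd(p4, p9));
}
```

With the assignment in place, the disassembly should contain a shufpd (or an equivalent shuffle) that is absent from the listings in this thread.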
Phil
--
"Home taping is killing big business profits. We left this side blank
so you can help." -- Dead Kennedys, written upon the B-side of tapes of
/In God We Trust, Inc./.
Phil | 11/1/2006 5:24:34 PM

Phil Carmody <thefatphil_demunged@yahoo.co.uk> wrote:
> dave <spamtrap@crayne.org> writes:
>> Intrinsics as implemented in gcc are so inefficient they are not worth
>> using.
>
>
> bash-3.1$ gcc-4.1 -march=pentium4 -O -o bs.o -g -c bs.c
> bash-3.1$ objdump -d -S bs.o
> [...]
> inline void c4(register complex *a)
> {
> 0: 55 push %ebp
> 1: 89 e5 mov %esp,%ebp
> 3: 8b 45 08 mov 0x8(%ebp),%eax
> register __m128d p1=((__m128d*)a)[1];
> register __m128d p3=((__m128d*)a)[3];
> 6: 66 0f 28 50 30 movapd 0x30(%eax),%xmm2
> register __m128d p0=((__m128d*)a)[0];
> b: 66 0f 28 00 movapd (%eax),%xmm0
> register __m128d p2=((__m128d*)a)[2];
> f: 66 0f 28 48 20 movapd 0x20(%eax),%xmm1
> register __m128d p6=p1-p3; //x1-x3,y1-y3
> register __m128d p7=p1+p3; //x1+x3,y1+y3
> 14: 66 0f 28 58 10 movapd 0x10(%eax),%xmm3
> 19: 66 0f 58 da addpd %xmm2,%xmm3
> register __m128d p4=p0-p2; //x0-x2,y0-y2
> 1d: 66 0f 28 d0 movapd %xmm0,%xmm2
> 21: 66 0f 5c d1 subpd %xmm1,%xmm2
> register __m128d p5=p0+p2; //x0+x2,y0+y2
> 25: 66 0f 58 c1 addpd %xmm1,%xmm0
> register __m128d p9;
> _mm_shuffle_pd(p9,p6,_MM_SHUFFLE2(0,1)); //y1-y3,x1-x3
> ((__m128d*)a)[0]=p5+p7;
> 29: 66 0f 28 cb movapd %xmm3,%xmm1
> 2d: 66 0f 58 c8 addpd %xmm0,%xmm1
> 31: 66 0f 29 08 movapd %xmm1,(%eax)
> ((__m128d*)a)[2]=p5-p7;
> 35: 66 0f 5c c3 subpd %xmm3,%xmm0
> 39: 66 0f 29 40 20 movapd %xmm0,0x20(%eax)
> ((__m128d*)a)[1]=p4-p9;
> 3e: 66 0f 28 ca movapd %xmm2,%xmm1
> 42: 66 0f 5c c8 subpd %xmm0,%xmm1
> 46: 66 0f 29 48 10 movapd %xmm1,0x10(%eax)
> ((__m128d*)a)[3]=p4+p9;
> 4b: 66 0f 58 d1 addpd %xmm1,%xmm2
> 4f: 66 0f 29 50 30 movapd %xmm2,0x30(%eax)
> }
> 54: 5d pop %ebp
> 55: c3 ret
>
>
> Precisely which of those lines did you consider to be calls
> to helper functions which contain the actual SSE operation?
> (as claimed in a different post in this thread.)
>
>
> Phil
I notice you are using gcc 4.1 on i386.
I am using gcc 3.3.5 on amd64 (on 64-bit OpenBSD 3.9).
I don't immediately grasp what the posted code is doing, but I will
figure it out. Is there some associated C code? That might help me out.
Dave Feustel
dave | 11/1/2006 6:29:59 PM

f0dder <spamtrap@crayne.org> wrote:
> dave wrote:
>
>> The intrinsic functions for sse implemented by gcc actually are
>> function calls to subroutines in which the sse instruction is
>> embedded.
>
> You've got to be kidding - it can't be *that* lame?
It's possible that I misinterpreted what I saw. I'm quite rusty at
assembly and I'm new to gcc assembly.
> Are you sure you're not missing something to actually make the intrinsics
> intrinsic? For instance with VC2003, the Interlocked* functions are actually
> function calls until you "do some stuff", after which they translate
> directly to stuff like xadd.
This is possible, but unlikely IMHO on OpenBSD. (But I might be wrong).
dave | 11/1/2006 6:34:00 PM

Phil Carmody <thefatphil_demunged@yahoo.co.uk> wrote:
> dave <spamtrap@crayne.org> writes:
>> The intrinsic functions for sse implemented by gcc actually are function
>> calls to subroutines in which the sse instruction is embedded. I suspect
>> the overhead of the call nullifies any speed advantage of doing 2 or 4
>> arithmetic operations at once.
>
> Have you forgotten to turn on -O?
Didn't know about -O. I'll try it. Thanks for the tip.
> The intrinsics are inline when I use them.
>
> Phil
dave | 11/1/2006 6:35:17 PM

dave <spamtrap@crayne.org> writes:
> vid512@gmail.com <spamtrap@crayne.org> wrote:
> >> Then replace the assembly code in the function body with custom code
> >> using SSE instructions. The code produced makes good use of sse
> >> capabilities.
> > why do it the hard hackish way? you can write ASM part in any assembler
> > that supports needed object format (ELF i think), and then link it with
> > your C program.
>
> True, but gcc generates a lot of extra stuff which seems to be used by
> gdb in debugging.
Care to give an example? Perhaps you'd like to point out the extra stuff
in the example I posted in my previous post?
In particular compare and contrast the output without the '-g' switch:
<<<
bash-3.1$ gcc-4.1 -march=pentium4 -O -o bs.o -c bs.c
bash-3.1$ objdump -d -S bs.o
bs.o: file format elf32-i386
Disassembly of section .text:
00000000 <c4>:
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: 8b 45 08 mov 0x8(%ebp),%eax
6: 66 0f 28 50 30 movapd 0x30(%eax),%xmm2
b: 66 0f 28 00 movapd (%eax),%xmm0
f: 66 0f 28 48 20 movapd 0x20(%eax),%xmm1
14: 66 0f 28 58 10 movapd 0x10(%eax),%xmm3
19: 66 0f 58 da addpd %xmm2,%xmm3
1d: 66 0f 28 d0 movapd %xmm0,%xmm2
21: 66 0f 5c d1 subpd %xmm1,%xmm2
25: 66 0f 58 c1 addpd %xmm1,%xmm0
29: 66 0f 28 cb movapd %xmm3,%xmm1
2d: 66 0f 58 c8 addpd %xmm0,%xmm1
31: 66 0f 29 08 movapd %xmm1,(%eax)
35: 66 0f 5c c3 subpd %xmm3,%xmm0
39: 66 0f 29 40 20 movapd %xmm0,0x20(%eax)
3e: 66 0f 28 ca movapd %xmm2,%xmm1
42: 66 0f 5c c8 subpd %xmm0,%xmm1
46: 66 0f 29 48 10 movapd %xmm1,0x10(%eax)
4b: 66 0f 58 d1 addpd %xmm1,%xmm2
4f: 66 0f 29 50 30 movapd %xmm2,0x30(%eax)
54: 5d pop %ebp
55: c3 ret
>>>
versus the output with the '-g' switch:
<<<
bash-3.1$ gcc-4.1 -march=pentium4 -O -o bs.o -g -c bs.c
bash-3.1$ objdump -d bs.o
bs.o: file format elf32-i386
Disassembly of section .text:
00000000 <c4>:
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: 8b 45 08 mov 0x8(%ebp),%eax
6: 66 0f 28 50 30 movapd 0x30(%eax),%xmm2
b: 66 0f 28 00 movapd (%eax),%xmm0
f: 66 0f 28 48 20 movapd 0x20(%eax),%xmm1
14: 66 0f 28 58 10 movapd 0x10(%eax),%xmm3
19: 66 0f 58 da addpd %xmm2,%xmm3
1d: 66 0f 28 d0 movapd %xmm0,%xmm2
21: 66 0f 5c d1 subpd %xmm1,%xmm2
25: 66 0f 58 c1 addpd %xmm1,%xmm0
29: 66 0f 28 cb movapd %xmm3,%xmm1
2d: 66 0f 58 c8 addpd %xmm0,%xmm1
31: 66 0f 29 08 movapd %xmm1,(%eax)
35: 66 0f 5c c3 subpd %xmm3,%xmm0
39: 66 0f 29 40 20 movapd %xmm0,0x20(%eax)
3e: 66 0f 28 ca movapd %xmm2,%xmm1
42: 66 0f 5c c8 subpd %xmm0,%xmm1
46: 66 0f 29 48 10 movapd %xmm1,0x10(%eax)
4b: 66 0f 58 d1 addpd %xmm1,%xmm2
4f: 66 0f 29 50 30 movapd %xmm2,0x30(%eax)
54: 5d pop %ebp
55: c3 ret
>>>
> Plus the passing of arguments to the assembly code is
> done by gcc.
What 'passing of arguments' to what 'assembly code'?
Phil
--
"Home taping is killing big business profits. We left this side blank
so you can help." -- Dead Kennedys, written upon the B-side of tapes of
/In God We Trust, Inc./.
Phil | 11/1/2006 7:43:18 PM

dave <spamtrap@crayne.org> writes:
> Phil Carmody <thefatphil_demunged@yahoo.co.uk> wrote:
> > dave <spamtrap@crayne.org> writes:
> >> Intrinsics as implemented in gcc are so inefficient they are not worth
> >> using.
> >
> >
> > bash-3.1$ gcc-4.1 -march=pentium4 -O -o bs.o -g -c bs.c
> > bash-3.1$ objdump -d -S bs.o
> > [...]
>
> I notice you are using gcc 4.1 on i386.
> I am using gcc 3.3.5 on amd64 (on 64-bit OpenBSD 3.9).
You're wiggling. You make claims, yet you post *nothing* that
supports those claims. Let's see what gcc-3.3.5 does, shall we?
<<<
phil@kilospaz:tmp$ gcc -v
Reading specs from /usr/lib/gcc-lib/i486-linux/3.3.5/specs
Configured with: ../src/configure -v --enable-languages=c,c++,java,f77,pascal,objc,ada,treelang --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-gxx-include-dir=/usr/include/c++/3.3 --enable-shared --enable-__cxa_atexit --with-system-zlib --enable-nls --without-included-gettext --enable-clocale=gnu --enable-debug --enable-java-gc=boehm --enable-java-awt=xlib --enable-objc-gc i486-linux
Thread model: posix
gcc version 3.3.5 (Debian 1:3.3.5-13)
phil@kilospaz:tmp$ gcc -march=pentium4 -O -o bs.o -g -c bs.c
phil@kilospaz:tmp$ objdump -d bs.o
bs.o: file format elf32-i386
Disassembly of section .text:
00000000 <c4>:
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: 8b 45 08 mov 0x8(%ebp),%eax
6: 66 0f 28 50 10 movapd 0x10(%eax),%xmm2
b: 66 0f 28 00 movapd (%eax),%xmm0
f: 66 0f 28 48 20 movapd 0x20(%eax),%xmm1
14: 66 0f 58 50 30 addpd 0x30(%eax),%xmm2
19: 66 0f 28 d8 movapd %xmm0,%xmm3
1d: 66 0f 5c d9 subpd %xmm1,%xmm3
21: 66 0f 58 c1 addpd %xmm1,%xmm0
25: 66 0f 28 c8 movapd %xmm0,%xmm1
29: 66 0f 58 ca addpd %xmm2,%xmm1
2d: 66 0f 29 08 movapd %xmm1,(%eax)
31: 66 0f 5c c2 subpd %xmm2,%xmm0
35: 66 0f 29 40 20 movapd %xmm0,0x20(%eax)
3a: 66 0f 28 c3 movapd %xmm3,%xmm0
3e: 66 0f 5c c4 subpd %xmm4,%xmm0
42: 66 0f 29 40 10 movapd %xmm0,0x10(%eax)
47: 66 0f 58 dc addpd %xmm4,%xmm3
4b: 66 0f 29 58 30 movapd %xmm3,0x30(%eax)
50: 5d pop %ebp
51: c3 ret
>>>
> I don't immediately grasp what the posted code is doing, but I will
> figure it out.
Clue - it's *not* calling any functions. You claimed it called functions.
> Is there some associated C code? That might help me out.
Yes. I have posted it. What did you think the C code that was interspersed
with the objdump output in my first objdump example was, COBOL?
Your clue-resistance is really becoming a drag.
Phil
--
"Home taping is killing big business profits. We left this side blank
so you can help." -- Dead Kennedys, written upon the B-side of tapes of
/In God We Trust, Inc./.
Phil | 11/1/2006 9:12:41 PM

dave wrote:
> Is there anyone out there actually using SSE in their programs?
>
> Is there a forum anywhere for discussing how to do SSE programming?
>
> I think I may have figured out the easiest way to make effective
> use of SSE with the gcc toolchain.
>
> Thanks,
> Dave Feustel
I've done a lot of SSE and SSE2 programming over the years. I have
an optimization website that goes over some basic tricks to speed up
code with SSE/SSE2 (along with other tricks).
http://www.mark.masmcode.com/
P4s and up on the Intel side run SSE/SSE2 code very fast, so
I've used that advantage a lot to make code run extremely fast.
Converting a string to a qword using SSE2
http://www.oldboard.assemblercode.com/index.php?topic=4253.msg28940#msg28940
SSE2 quaternion multiply
http://www.oldboard.assemblercode.com/index.php?topic=3469.0
Mersenne Twister Random Number Generator in SSE2
http://www.oldboard.assemblercode.com/index.php?topic=3565.0
My account on masmforum got messed up (all these links are for
masmforum), so some messages will say they are from hutch-- instead of
marklarson. The way to tell it's really me is that it'll say "guest"
under "hutch--".
Counting the number of lines in a file using SSE2
http://www.oldboard.assemblercode.com/index.php?topic=2692.msg18800#msg18800
String copy using SSE2
http://www.oldboard.assemblercode.com/index.php?topic=2632.msg18047#msg18047
Computing MD5 using SSE2
http://www.oldboard.assemblercode.com/index.php?topic=2921.0
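To give a flavour of what something like the line-counting item can look like when written with gcc intrinsics rather than MASM, here is a rough sketch. It is not the code from the linked posts, which are not reproduced here. The function name count_newlines is hypothetical, and __builtin_popcount is a GCC builtin, so this assumes gcc or a compatible compiler.

```c
#include <emmintrin.h>
#include <stddef.h>

/* Hypothetical sketch: count '\n' bytes in buf[0..len-1], 16 at a time. */
size_t count_newlines(const char *buf, size_t len)
{
    const __m128i nl = _mm_set1_epi8('\n');   /* 16 copies of '\n' */
    size_t count = 0;
    size_t i = 0;

    for (; i + 16 <= len; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
        /* Matching bytes become 0xFF; movemask packs each byte's top
         * bit into one bit of an int, so the set bits are the hits. */
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, nl));
        count += (size_t)__builtin_popcount((unsigned)mask);
    }
    for (; i < len; i++)                      /* scalar tail */
        count += (buf[i] == '\n');
    return count;
}
```

The compare-and-movemask pattern is the part that carries over to other byte-scanning jobs, such as finding a terminator for a string copy.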
Mark Larson
Mark_Larson | 11/1/2006 9:19:17 PM

Forgot a few links:
I am working on a raytracer that I haven't finished yet. You can use
scalar SSE code just like FP code (you don't do stuff in parallel;
each operation works on a single floating-point value).
Scalar SSE code is faster on a P4 (not sure about AMD).
http://www.masm32.com/board/index.php?topic=1140.0
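As an illustration of scalar SSE from C, here is a sketch of the kind of small routine a raytracer needs, using the scalar ("_sd") SSE2 intrinsics. It is not taken from the linked raytracer, and the name dot3_scalar is hypothetical. Each intrinsic below works on the low double of the register only.

```c
#include <emmintrin.h>

/* Hypothetical helper: dot product of two 3-vectors using scalar SSE2.
 * _mm_load_sd loads one double into the low lane (high lane zeroed);
 * _mm_mul_sd and _mm_add_sd operate on the low lane only. */
double dot3_scalar(const double *a, const double *b)
{
    __m128d acc = _mm_mul_sd(_mm_load_sd(&a[0]), _mm_load_sd(&b[0]));
    acc = _mm_add_sd(acc, _mm_mul_sd(_mm_load_sd(&a[1]), _mm_load_sd(&b[1])));
    acc = _mm_add_sd(acc, _mm_mul_sd(_mm_load_sd(&a[2]), _mm_load_sd(&b[2])));
    return _mm_cvtsd_f64(acc);   /* extract the low double */
}
```

In practice gcc can generate the same mulsd/addsd instructions from ordinary 'double' arithmetic: that is the default on x86-64, and on 32-bit x86 it can be requested with '-msse2 -mfpmath=sse'.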
Line counting again, but here I actually have 2 different versions using
2 different algorithms. If you scroll down, the second one posted is
done in a non-intuitive manner.
http://www.masm32.com/board/index.php?topic=5434.msg40666#msg40666
Mark Larson
Mark_Larson | 11/1/2006 9:34:03 PM

Thanks for the examples, Mark!
Mark_Larson <spamtrap@crayne.org> wrote:
> dave wrote:
>> Is there anyone out there actually using SSE in their programs?
>>
>> Is there a forum anywhere for discussing how to do SSE programming?
>>
>> I think I may have figured out the easiest way to make effective
>> use of SSE with the gcc toolchain.
>>
>> Thanks,
>> Dave Feustel
>
> I've done a lot of SSE and SSE2 programming over the years. I have
> an optimization website that goes over some basic tricks to speed up
> code with SSE/SSE2 ( along with other tricks).
> http://www.mark.masmcode.com/
>
>
> P4's and up on the Intel side really run SSE/SSE2 code very fast. So
> I've used that advantage a lot to make code run extremely fast.
>
> converting a string to a qword using SSE2
> http://www.oldboard.assemblercode.com/index.php?topic=4253.msg28940#msg28940
>
> SSE2 quaternion multiply
> http://www.oldboard.assemblercode.com/index.php?topic=3469.0
>
> Mersenne Twister Random Number Generator in SSE2
> http://www.oldboard.assemblercode.com/index.php?topic=3565.0
>
> my account on masmforum got messed up ( all these links are for
> masmforum). So some messages will say they are from hutch- instead of
> marklarson. The way you tell it's the real me, is it'll say "guest"
> under "hutch--".
>
> Counting the number of lines in a file using SSE2
> http://www.oldboard.assemblercode.com/index.php?topic=2692.msg18800#msg18800
>
> string copy using SSE2
> http://www.oldboard.assemblercode.com/index.php?topic=2632.msg18047#msg18047
>
> Computing MD5 using SSE2
> http://www.oldboard.assemblercode.com/index.php?topic=2921.0
>
> Mark Larson
dave | 11/1/2006 11:58:49 PM

dave <spamtrap@crayne.org> wrote:
>Phil Carmody <thefatphil_demunged@yahoo.co.uk> wrote:
>>
>> bash-3.1$ gcc-4.1 -march=pentium4 -O -o bs.o -g -c bs.c
>> bash-3.1$ objdump -d -S bs.o
>> [...]
>> inline void c4(register complex *a)
>> {
>> 0: 55 push %ebp
>> 1: 89 e5 mov %esp,%ebp
>> 3: 8b 45 08 mov 0x8(%ebp),%eax
>> register __m128d p1=((__m128d*)a)[1];
>> register __m128d p3=((__m128d*)a)[3];
>> 6: 66 0f 28 50 30 movapd 0x30(%eax),%xmm2
>> register __m128d p0=((__m128d*)a)[0];
>> b: 66 0f 28 00 movapd (%eax),%xmm0
>> register __m128d p2=((__m128d*)a)[2];
>> f: 66 0f 28 48 20 movapd 0x20(%eax),%xmm1
>> register __m128d p6=p1-p3; //x1-x3,y1-y3
>> register __m128d p7=p1+p3; //x1+x3,y1+y3
>> 14: 66 0f 28 58 10 movapd 0x10(%eax),%xmm3
>> 19: 66 0f 58 da addpd %xmm2,%xmm3
>> register __m128d p4=p0-p2; //x0-x2,y0-y2
>> 1d: 66 0f 28 d0 movapd %xmm0,%xmm2
>> 21: 66 0f 5c d1 subpd %xmm1,%xmm2
>> register __m128d p5=p0+p2; //x0+x2,y0+y2
>> 25: 66 0f 58 c1 addpd %xmm1,%xmm0
>> register __m128d p9;
>> _mm_shuffle_pd(p9,p6,_MM_SHUFFLE2(0,1)); //y1-y3,x1-x3
>> ((__m128d*)a)[0]=p5+p7;
>> 29: 66 0f 28 cb movapd %xmm3,%xmm1
>> 2d: 66 0f 58 c8 addpd %xmm0,%xmm1
>> 31: 66 0f 29 08 movapd %xmm1,(%eax)
>> ((__m128d*)a)[2]=p5-p7;
>> 35: 66 0f 5c c3 subpd %xmm3,%xmm0
>> 39: 66 0f 29 40 20 movapd %xmm0,0x20(%eax)
>> ((__m128d*)a)[1]=p4-p9;
>> 3e: 66 0f 28 ca movapd %xmm2,%xmm1
>> 42: 66 0f 5c c8 subpd %xmm0,%xmm1
>> 46: 66 0f 29 48 10 movapd %xmm1,0x10(%eax)
>> ((__m128d*)a)[3]=p4+p9;
>> 4b: 66 0f 58 d1 addpd %xmm1,%xmm2
>> 4f: 66 0f 29 50 30 movapd %xmm2,0x30(%eax)
>> }
>> 54: 5d pop %ebp
>> 55: c3 ret
>...
>I notice you are using gcc 4.1 on i386.
>I am using gcc 3.3.5 on amd64 (on 64-bit OpenBSD 3.9).
...which is 2 years old...
>I don't immediately grasp what the posted code is doing, but I will
>figure it out. Is there some associated C code? That might help me out.
Ummm, the lines that don't have addresses ARE the C code...
--
Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.
Tim | 11/3/2006 4:04:32 AM

Tim Roberts <spamtrap@crayne.org> wrote:
> dave <spamtrap@crayne.org> wrote:
>
>>Phil Carmody <thefatphil_demunged@yahoo.co.uk> wrote:
>>>
>>> bash-3.1$ gcc-4.1 -march=pentium4 -O -o bs.o -g -c bs.c
>>> bash-3.1$ objdump -d -S bs.o
>>> [...]
>>> inline void c4(register complex *a)
>>> {
>>> 0: 55 push %ebp
>>> 1: 89 e5 mov %esp,%ebp
>>> 3: 8b 45 08 mov 0x8(%ebp),%eax
>>> register __m128d p1=((__m128d*)a)[1];
>>> register __m128d p3=((__m128d*)a)[3];
>>> 6: 66 0f 28 50 30 movapd 0x30(%eax),%xmm2
>>> register __m128d p0=((__m128d*)a)[0];
>>> b: 66 0f 28 00 movapd (%eax),%xmm0
>>> register __m128d p2=((__m128d*)a)[2];
>>> f: 66 0f 28 48 20 movapd 0x20(%eax),%xmm1
>>> register __m128d p6=p1-p3; //x1-x3,y1-y3
>>> register __m128d p7=p1+p3; //x1+x3,y1+y3
>>> 14: 66 0f 28 58 10 movapd 0x10(%eax),%xmm3
>>> 19: 66 0f 58 da addpd %xmm2,%xmm3
>>> register __m128d p4=p0-p2; //x0-x2,y0-y2
>>> 1d: 66 0f 28 d0 movapd %xmm0,%xmm2
>>> 21: 66 0f 5c d1 subpd %xmm1,%xmm2
>>> register __m128d p5=p0+p2; //x0+x2,y0+y2
>>> 25: 66 0f 58 c1 addpd %xmm1,%xmm0
>>> register __m128d p9;
>>> _mm_shuffle_pd(p9,p6,_MM_SHUFFLE2(0,1)); //y1-y3,x1-x3
>>> ((__m128d*)a)[0]=p5+p7;
>>> 29: 66 0f 28 cb movapd %xmm3,%xmm1
>>> 2d: 66 0f 58 c8 addpd %xmm0,%xmm1
>>> 31: 66 0f 29 08 movapd %xmm1,(%eax)
>>> ((__m128d*)a)[2]=p5-p7;
>>> 35: 66 0f 5c c3 subpd %xmm3,%xmm0
>>> 39: 66 0f 29 40 20 movapd %xmm0,0x20(%eax)
>>> ((__m128d*)a)[1]=p4-p9;
>>> 3e: 66 0f 28 ca movapd %xmm2,%xmm1
>>> 42: 66 0f 5c c8 subpd %xmm0,%xmm1
>>> 46: 66 0f 29 48 10 movapd %xmm1,0x10(%eax)
>>> ((__m128d*)a)[3]=p4+p9;
>>> 4b: 66 0f 58 d1 addpd %xmm1,%xmm2
>>> 4f: 66 0f 29 50 30 movapd %xmm2,0x30(%eax)
>>> }
>>> 54: 5d pop %ebp
>>> 55: c3 ret
>>...
>>I notice you are using gcc 4.1 on i386.
>>I am using gcc 3.3.5 on amd64 (on 64-bit OpenBSD 3.9).
>
> ...which is 2 years old...
>
>>I don't immediately grasp what the posted code is doing, but I will
>>figure it out. Is there some associated C code? That might help me out.
>
> Ummm, the lines that don't have addresses ARE the C code...
Sorry I missed that. I have vision problems plus my monitor is old and
getting dim. It makes reading some screens hard for me.
dave | 11/3/2006 12:17:00 PM