How to cast using MSVC++ Intrinsics

I'm playing around with intrinsics with MSVC++ and would like
to be able to load a vector of four integers but using movlps
and movhps which are floating point instructions.

The reason is that the floating point instructions are faster
(and are recommended in the optimization guide for AMD).
Maybe for Intel too.

I tried doing a simple cast:

quadp1 = (__m128i) _mm_loadl_pi(quad1, (__m64 *) x);

but this failed with:

mike.c(10) : error C2440: 'type cast' : cannot convert from '__m128' to '__m128i'

michael

Reply Michael 5/8/2005 6:33:11 PM

Michael Moy schrieb:
> I'm playing around with intrinsics with MSVC++ and would like
> to be able to load a vector of four integers but using movlps
> and movhps which are floating point instructions.

Nope, not for integers. Movdqa is the fastest way to load a register
with four 32-bit ints from memory:

void foo ( int vector[] /* should be 16 byte aligned */ )
{
  __m128i *pv128 = (__m128i *) vector;
  __m128i quadp1 = pv128[0];
  ...
}

Alternatively you may use:
__m128i _mm_load_si128 (__m128i *p);
or for data that is not 16-byte aligned:
__m128i _mm_loadu_si128 (__m128i *p);

To load 4 short 16-bit ints you may use
__m128i _mm_move_epi64 (__m128i a).
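
A minimal sketch of the loads mentioned above, wrapped in hypothetical helper functions (the names load4_aligned, load4_unaligned and load4_shorts are illustrative only, not from the thread). For the four 16-bit values this sketch uses _mm_loadl_epi64, which reads 64 bits from memory and zeroes the upper half, since _mm_move_epi64 takes a register operand; that substitution is an assumption about what was intended.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stdint.h>

/* Aligned load (movdqa): p must be 16-byte aligned. */
static __m128i load4_aligned(const int32_t *p)
{
    return _mm_load_si128((const __m128i *)p);
}

/* Unaligned load (movdqu): no alignment requirement on p. */
static __m128i load4_unaligned(const int32_t *p)
{
    return _mm_loadu_si128((const __m128i *)p);
}

/* Load four 16-bit values into the low 64 bits (movq);
   the upper 64 bits of the result are zero. */
static __m128i load4_shorts(const int16_t *p)
{
    return _mm_loadl_epi64((const __m128i *)p);
}
```

Passing a pointer that is not 16-byte aligned to load4_aligned will typically fault, so the unaligned variant is the safe default when alignment is unknown.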


>
> The reason for that is using the floating point instructions
> is that they are faster (and recommended in the optimization
> guide for AMD). Maybe for Intel too.
> I tried doing a simple cast:
>
> quadp1 = (__m128i) _mm_loadl_pi(quad1, (__m64 *) x);
>
> but this failed with:
>
> mike.c(10) : error C2440: 'type cast' : cannot convert from '__m128' to '__m128i'
>
> michael

You cannot cast intrinsic data types, even if their size is the same,
e.g. __m128i to __m128 or __m128d.
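
For reference, a hedged sketch of one way to get the reinterpretation without a C-style cast: the _mm_castps_si128 intrinsic, which changes only the type and generates no instruction. This assumes a compiler whose <emmintrin.h> provides the cast intrinsics; the compiler discussed in this thread may predate them. The helper name load4_via_ps is hypothetical.

```c
#include <emmintrin.h>  /* SSE2; also pulls in the SSE float intrinsics */
#include <stdint.h>

/* Load four 32-bit ints with two 64-bit float loads
   (movlps + movhps), then reinterpret the bits as __m128i. */
static __m128i load4_via_ps(const int32_t *x)
{
    __m128 v = _mm_setzero_ps();                  /* defined starting value */
    v = _mm_loadl_pi(v, (const __m64 *)x);        /* low  64 bits: x[0], x[1] */
    v = _mm_loadh_pi(v, (const __m64 *)(x + 2));  /* high 64 bits: x[2], x[3] */
    return _mm_castps_si128(v);                   /* type change only */
}
```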

Mixing SSE2 types (float, double, int (byte, word, dword, qword)) may
result in performance penalties, see:

Appendix E
Software Optimization Guide for AMD Athlon™ 64 and AMD Opteron™
Processors

Gerd

Reply Gerd 5/9/2005 7:31:30 PM


Gerd Isenberg wrote:
> Michael Moy schrieb:
> 
>>I'm playing around with intrinsics with MSVC++ and would like
>>to be able to load a vector of four integers but using movlps
>>and movhps which are floating point instructions.
> 
> 
> Nope, not for integers. Movdqa is the fastest way to load a register
> with four 32-bit ints from memory:

Perhaps you mean from aligned memory. If the memory is unaligned, you
have a latency of 7 using a VectorPath instruction. This is regarding
AMD K8 processors. From the SOG:

The AMD Athlon 64 and AMD Opteron processors can perform two 64-bit
loads per clock cycle. Two 64-bit MOVLPS loads can be issued in the
same cycle, assuming the data is 8-byte aligned. (Page 202)

....

When data alignment cannot be guaranteed, use MOVLPD/MOVHPD, MOVLPS/MOVHPS
or MOVLPD/MOVHPD pairs in lieu of MOVUPD, MOVUPS or MOVDQU, respectively.

The MOVUPS, MOVUPD and MOVDQU instructions are VectorPath when one of the
operands is a memory location. It is better to use one of the MOVLPx/MOVHPx
or MOVQ/MOVHPD pairs. It is preferable to load or store the 64-bit halves of
an XMM register separately when the memory location cannot be guaranteed to
be aligned. (page 204)
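
A rough sketch of what "load or store the 64-bit halves separately" could look like with intrinsics for integer data, using a MOVQ/MOVHPD style pair. The helper names loadu_halves and storeu_halves are hypothetical, and the cast intrinsics are assumed to be available in the compiler's <emmintrin.h>. Whether the compiler emits exactly these instructions is not guaranteed.

```c
#include <emmintrin.h>  /* SSE2 */

/* Load 16 bytes from a possibly unaligned address as two 64-bit halves. */
static __m128i loadu_halves(const void *p)
{
    const char *b = (const char *)p;
    __m128i lo = _mm_loadl_epi64((const __m128i *)b);    /* movq: low half */
    __m128d v  = _mm_loadh_pd(_mm_castsi128_pd(lo),
                              (const double *)(b + 8));  /* movhpd: high half */
    return _mm_castpd_si128(v);
}

/* Store 16 bytes to a possibly unaligned address as two 64-bit halves. */
static void storeu_halves(void *p, __m128i v)
{
    char *b = (char *)p;
    _mm_storel_epi64((__m128i *)b, v);                      /* low half */
    _mm_storeh_pd((double *)(b + 8), _mm_castsi128_pd(v));  /* high half */
}
```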

------

Even if the data is aligned, there's a bug in Athlon 64 chips before
Stepping E0 such that MOVDQA is more expensive than the advertised
2 cycles. My empirical testing shows me that movlps/movhps
is faster than movdqa. I have a stepping C (either 0 or G) processor
and the E was released pretty recently so I don't think that there are
that many out there. You still have a ton of Socket 754 and Socket
939 Pre E chips out there.

I seem to recall something similar in the Intel SOG but can't find it
right now. I generally spend a lot more time in the AMD SOG anyways.

BTW, I did figure out how to do the casts but ran into some other
problems with intrinsics. Namely that MSVC++ tosses in a movaps
to load the SIMD register from the stack on a quadword floating
point load. I wasn't able to find a way to get rid of this cost.
The other problem is that MSVC++ generates lousy code for intrinsics.

It's similar to the code generated with the -arch:SSE2 switch.
MSVC++ is incredibly stingy with SIMD registers to the point that
it always tries to use as few as possible. Which means that it
doesn't take advantage of possible parallel execution of
instructions.

I had a load, load, store, store sequence which would ideally be
executed as movlps xmm0, a; movlps xmm1, b; movlps c, xmm0;
movlps d, xmm1. MSVC++ serialized the load/store using one register
ignoring a more efficient solution.
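
For illustration, a sketch of how that load, load, store, store sequence might be written with intrinsics, using two separate variables so the compiler is free to pick two registers. The function name copy_two_qwords is hypothetical, and nothing here forces the compiler to keep this order or register assignment; the _mm_setzero_ps initialisation is there only to avoid reading an uninitialised variable.

```c
#include <xmmintrin.h>  /* SSE */

/* Copy 8 bytes from a to c and 8 bytes from b to d,
   written as two loads followed by two stores. */
static void copy_two_qwords(void *c, void *d, const void *a, const void *b)
{
    __m128 r0 = _mm_setzero_ps();
    __m128 r1 = _mm_setzero_ps();

    r0 = _mm_loadl_pi(r0, (const __m64 *)a);  /* movlps xmm0, a */
    r1 = _mm_loadl_pi(r1, (const __m64 *)b);  /* movlps xmm1, b */
    _mm_storel_pi((__m64 *)c, r0);            /* movlps c, xmm0 */
    _mm_storel_pi((__m64 *)d, r1);            /* movlps d, xmm1 */
}
```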

> Mixing SSE2 types (float, double, int (byte, word, dword, qword)) may
> result in performance penalties, see:
> 
> Appendix E
> Software Optimization Guide for AMD Athlon™ 64 and AMD Opteron™
> Processors
> 
> Gerd

See section 9.3 on why it's better to use the quadword instructions.
As far as the movdqa problem goes, I spoke to a guy on the Ihub AMD
forum who had done some performance testing with movdqa, found the
problem, and reported it to AMD.

michael

Reply Michael 5/10/2005 2:45:09 AM

Michael Moy schrieb:
> Gerd Isenberg wrote:
> > Michael Moy schrieb:
> >
> >>I'm playing around with intrinsics with MSVC++ and would like
> >>to be able to load a vector of four integers but using movlps
> >>and movhps which are floating point instructions.
> >
> >
> > Nope, not for integers. Movdqa is the fastest way to load a register
> > with four 32-bit ints from memory:
>
> Perhaps you mean from aligned memory. If the memory is unaligned, you
> have a latency of 7 using a VectorPath instruction.


Yes, 16-byte aligned data.
For 8-byte aligned integer data I previously thought two MOVQs and one
PUNPCKLQDQ instruction were best, but never considered MOVLPD/MOVHPD for
integers.


>This is regarding
> AMD K8 processors. From the SOG:
>
> The AMD Athlon 64 and AMD Opteron processors can perform two 64-bit
> loads per clock cycle. Two 64-bit MOVLPS loads can be issued in the
> same cycle, assuming the data is 8-byte aligned. (Page 202)
>
> ...
>
> When data alignment cannot be guaranteed, use MOVLPD/MOVHPD, MOVLPS/MOVHPS
> or MOVLPD/MOVHPD pairs in lieu of MOVUPD, MOVUPS or MOVDQU, respectively.
>
> The MOVUPS, MOVUPD and MOVDQU instructions are VectorPath when one of the
> operands is a memory location. It is better to use one of the MOVLPx/MOVHPx
> or MOVQ/MOVHPD pairs. It is preferable to load or store the 64-bit halves of
> an XMM register separately when the memory location cannot be guaranteed to
> be aligned. (page 204)
>

Page 206 in my current SOG !? 25112 Rev. 3.05 November 2004
Hmmm... ok i see - MOVQ/MOVHPD pair.


> ------
>
> Even if the data is aligned, there's a bug in Athlon 64 chips before
> Stepping E0 such that MOVDQA is more expensive than the advertised
> 2 cycles. My empirical testing shows me that movlps/movhps
> is faster than movdqa. I have a stepping C (either 0 or G) processor
> and the E was released pretty recently so I don't think that there are
> that many out there. You still have a ton of Socket 754 and Socket
> 939 Pre E chips out there.

Interesting.
Are we talking about SSE2 in 32-bit mode or 64-bit mode?

>
> I seem to recall something similar in the Intel SOG but can't find it
> right now. I generally spend a lot more time in the AMD SOG anyways.
>
> BTW, I did figure out how to do the casts but ran into some other
> problems with intrinsics.

A union of m128 and m128i?

> Namely that MSVC++ tosses in a movaps
> to load the SIMD register from the stack on a quadword floating
> point load. I wasn't able to find a way to get rid of this cost.
> The other problem is that MSVC++ intrinsics generates lousy code.

I had similar problems with msvc++ 6.
But the 2005 version seems much better IMHO (32-bit mode).

>
> It's similar to the code generated with the -arch:SSE2 switch.
> MSVC++ is incredibly stingy with SIMD registers to the point that
> it always tries to use as few as possible. Which means that it
> doesn't take advantage of possible parallel execution of
> instructions.

Hmm.. i had some sse2 intrinsic routines where all eight available xmm
registers are used in a rather optimal way - in 32-bit mode.

>
> I had a load, load, store, store sequence which would ideally be
> executed as movlps xmm0, a; movlps xmm1, b; movlps c, xmm0;
> movlps d, xmm1. MSVC++ serialized the load/store using one register
> ignoring a more efficient solution.
>

In 64-bit mode it might be OK, since some floats/doubles may still
reside in other registers. Have you noticed SOG 5.16, Interleave Loads
and Stores:

Rationale

When using SSE and SSE2 instructions to perform loads and stores, it is
best to interleave them in the following pattern: Load, Store, Load,
Store, Load, Store, etc. This enables the processor to maximize the
load/store bandwidth.

If using MMX loads and stores in 32-bit mode, the loads and stores
should be arranged in the following pattern: Load, Load, Store, Store,
Load, Load, Store, Store, etc.

<snip>
>
> See section 9.3 on why it's better to use the quadword instructions.
> As far as the movdqa problem goes, I spoke to a guy on the Ihub AMD
> forum who had done some performance testing with movdqa and found the
> problem and reported it to AMD.
>
> michael

Thanks for pointing out the movdqa problem; I was not aware of it.

Cheers,
Gerd

Reply Gerd 5/10/2005 6:35:47 PM

Gerd Isenberg wrote:
> Yes, 16-byte aligned data.
> For 8-byte aligned integer data I previously thought two MOVQs and one
> PUNPCKLQDQ instruction were best, but never considered MOVLPD/MOVHPD for
> integers.

movq tends to be a more expensive solution as it has to zero the high
quadword. movsd (SSE2) would be nice but I couldn't get it to work.

> Interesting.
> Are we talking about SSE2 in 32-bit mode or 64-bit mode?

I haven't done a lot of testing in 64-bit mode. I don't think it matters
as the documentation doesn't make a distinction. I have a Windows 64
Beta on a partition on my notebook and need to get a release version
from a friend with MSDN. I plan to get started on some development
stuff with it one of these days. I have a new PowerMac G5 which I'm
also playing with and am getting up to speed on Altivec.

> A union of m128 and m128i?

I think that it was done with pointers. I don't have the code anymore
as I found a few other approaches that are useful.
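
Since the original code is gone, here is only a guess at what the union variant could look like. The type name xmm_view and the function ps_as_si are hypothetical. Going through memory like this may cost a store and a reload, depending on how the compiler optimises it.

```c
#include <emmintrin.h>  /* SSE2: __m128 and __m128i */

/* Both members occupy the same 16 bytes. */
typedef union {
    __m128  ps;
    __m128i si;
} xmm_view;

/* Reinterpret the bits of a float vector as an integer vector. */
static __m128i ps_as_si(__m128 v)
{
    xmm_view u;
    u.ps = v;     /* write as float vector */
    return u.si;  /* read back the same bits as integer vector */
}
```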

>>Namely that MSVC++ tosses in a movaps
>>to load the SIMD register from the stack on a quadword floating
>>point load. I wasn't able to find a way to get rid of this cost.
>>The other problem is that MSVC++ intrinsics generates lousy code.
> 
> I had similar problems with msvc++ 6.
> But the 2005 version seems much better IMHO (32-bit mode).

I was able to get around the extra movaps by loading the high
quadword as well. What would be nice is to have init to 0 or -1
instructions. If you put in a pxor or pcmpeq, you'll get a warning
and a movaps for initialization.
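
A small sketch of the usual intrinsic idioms for all-zero and all-one registers, in case it is useful to compare the generated code. The helper names all_zeros and all_ones are hypothetical. _mm_setzero_si128 normally compiles to a pxor, and comparing a register with itself for equality gives all ones; what a particular MSVC++ version actually emits is not something this sketch can promise.

```c
#include <emmintrin.h>  /* SSE2 */

/* All 128 bits cleared. */
static __m128i all_zeros(void)
{
    return _mm_setzero_si128();
}

/* All 128 bits set: any value compares equal to itself,
   so every 32-bit lane becomes 0xFFFFFFFF. */
static __m128i all_ones(void)
{
    __m128i z = _mm_setzero_si128();  /* initialised, so no warning */
    return _mm_cmpeq_epi32(z, z);
}
```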

VS2005 is much better but I think that GCC does a better job at
autovectorization. And there are more architecture choices.

> Hmm.. i had some sse2 intrinsic routines where all eight available xmm
> registers are used in a rather optimal way - in 32-bit mode.

If you have code that needs them, then it uses them. I'm writing some
very short routines to optimize some open source code so what I'm doing
ranges from a few instructions to a few hundred instructions.

Inline assembler has its issues too in that there are many constructs
that you can't pass to inline.

> In 64-bit mode it might be OK, since some floats/doubles may still
> reside in other registers. Have you noticed SOG 5.16, Interleave Loads
> and Stores:

I'm not doing any computation here. All I want to do is make a copy of
an object instance. So I just read from memory and then write to memory.
So I can use any type that I want to for some of these routines.

> Rationale
> 
> When using SSE and SSE2 instructions to perform loads and stores, it is
> best to interleave them in the following pattern: Load, Store, Load,
> Store, Load, Store, etc. This enables the processor to maximize the
> load/store bandwidth.

I'm partial to load, load, store, store or load, load, load, store, store,
store as the mov instructions can have latencies of one, two, three or four
with quadwords. If you do a load, store, load, store, the first store may
have to wait for one or more cycles before it can start to execute. I
generally try to schedule instructions so that the data is there, if from
L1, by the time I have an instruction that needs to use the data.
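
As a sketch of that ordering for a plain memory copy, written with SSE2 intrinsics: two loads into separate variables, then two stores. The function name copy_blocks32 is hypothetical, it copies only whole 32-byte blocks and leaves any tail alone, and the compiler remains free to reorder the instructions.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>

/* Copy nbytes rounded down to a multiple of 32, using a
   load, load, store, store pattern on unaligned addresses. */
static void copy_blocks32(void *dst, const void *src, size_t nbytes)
{
    const char *s = (const char *)src;
    char *d = (char *)dst;
    size_t i;

    for (i = 0; i + 32 <= nbytes; i += 32) {
        __m128i r0 = _mm_loadu_si128((const __m128i *)(s + i));       /* load  */
        __m128i r1 = _mm_loadu_si128((const __m128i *)(s + i + 16));  /* load  */
        _mm_storeu_si128((__m128i *)(d + i), r0);                     /* store */
        _mm_storeu_si128((__m128i *)(d + i + 16), r1);                /* store */
    }
}
```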

> Thanks for pointing out the movdqa problem; I was not aware of it.

I can't wait to get a dual-core low-power A64 system but I suspect that
I'll have to wait until next year. Already used up this year's computing
budget on the PowerMac.

Reply Michael 5/11/2005 4:40:21 AM
