shuffling bits

  • Follow


Is there a way to shuffle bits on x86 hardware.

I have three integers 

A = a_1 a_2 a_3 ... a_32
B = b_1 b_2 b_3 ... b_32
C = c_1 c_2 c_3 ... c_32

where all a_i's b_i's and c_i's are bits
and my shuffle looks like this

X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32

Is there a way i can generate X using MMX or SIMD instructions 
such that X is 128 bit integer.


Thanks in advance for your comments,
--Elijah

0
Reply geomrock 12/30/2003 1:47:18 AM

"Elijah Bailey" <geomrock@hotmail.com> wrote in message
news:e008fef8.0312291705.63a558dd@posting.google.com...
> Is there a way to shuffle bits on x86 hardware.
>
> I have three integers
>
> A = a_1 a_2 a_3 ... a_32
> B = b_1 b_2 b_3 ... b_32
> C = c_1 c_2 c_3 ... c_32
>
> where all a_i's b_i's and c_i's are bits
> and my shuffle looks like this
>
> X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32
>
> Is there a way i can generate X using MMX or SIMD instructions
> such that X is 128 bit integer.
>
>
> Thanks in advance for your comments,
> --Elijah
>

well.. 32 bits + 32 bits + 32 bits = 96 bits... so... could you define where
you'd like the remaining 32 bits in X to be? (or how they should be
defined?)


0
Reply Bx 12/30/2003 2:35:33 AM


"Elijah Bailey" <geomrock@hotmail.com> wrote in message
news:e008fef8.0312291705.63a558dd@posting.google.com...
> Is there a way to shuffle bits on x86 hardware.
>
> I have three integers
>
> A = a_1 a_2 a_3 ... a_32
> B = b_1 b_2 b_3 ... b_32
> C = c_1 c_2 c_3 ... c_32
>
> where all a_i's b_i's and c_i's are bits
> and my shuffle looks like this
>
> X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32
>
> Is there a way i can generate X using MMX or SIMD instructions
> such that X is 128 bit integer.

3 is an awkward number. If you can, I would suggest interleaving 4 32-bit
variables. There are no convenient x86 instructions for bit swizzling,
although multiplies can sometimes accomplish that.

I can think of 2 basic strategies:
1. Expand 4 32-bit values into 4 separate 128-bit values by inserting 0s in
between bits, then or them together
2. Concatenate all 4 32-bit values together to form a 128-bit value, then
swizzle using shifts & masking.

I can't think of any decent way of interleaving individual bits. Perhaps you
could somehow abuse punpck* instructions. Data swizzling is not one of the
x86's strong points.

-Matt


0
Reply Matt 12/30/2003 5:48:09 AM

"Matt Taylor" <para@tampabay.rr.com> wrote in message
news:qg8Ib.154113$%h4.110602@twister.tampabay.rr.com...
> "Elijah Bailey" <geomrock@hotmail.com> wrote in message
> news:e008fef8.0312291705.63a558dd@posting.google.com...
> > Is there a way to shuffle bits on x86 hardware.
> >
> > I have three integers
> >
> > A = a_1 a_2 a_3 ... a_32
> > B = b_1 b_2 b_3 ... b_32
> > C = c_1 c_2 c_3 ... c_32
> >
> > where all a_i's b_i's and c_i's are bits
> > and my shuffle looks like this
> >
> > X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32
> >
> > Is there a way i can generate X using MMX or SIMD instructions
> > such that X is 128 bit integer.
>
> 3 is an awkward number. If you can, I would suggest interleaving 4 32-bit
> variables. There are no convenient x86 instructions for bit swizzling,
> although multiplies can sometimes accomplish that.
>
> I can think of 2 basic strategies:
> 1. Expand 4 32-bit values into 4 separate 128-bit values by inserting 0s
in
> between bits, then or them together
> 2. Concatenate all 4 32-bit values together to form a 128-bit value, then
> swizzle using shifts & masking.
>
> I can't think of any decent way of interleaving individual bits. Perhaps
you
> could somehow abuse punpck* instructions. Data swizzling is not one of the
> x86's strong points.
>
> -Matt
>
>

yah.. 4 would be a better number... but in either case, i don't know SSE,..
so i don't know any 128-bit media instructions to work with...

the best i'd come up with, would be the double-precision shift instructions,
SHLD and SHRD.... shift a single bit at a time.. combine that with a loop..
taking a bit from each reg, each time through the loop... and you have the
finalized product....


0
Reply Bx 12/30/2003 7:27:43 AM

"Bx. C" <null@the.void> wrote in message
news:0i9Ib.10497$sb4.5001@bignews5.bellsouth.net...
<snip>
> the best i'd come up with, would be the double-precision shift
instructions,
> SHLD and SHRD.... shift a single bit at a time.. combine that with a
loop..
> taking a bit from each reg, each time through the loop... and you have the
> finalized product....

Well at a minimum I would use the shift instructions instead of shld/shrd:

; Swizzle A -> edx inserting 0s
; Trashes eax, ecx
mov eax, A ; eax = A
mov ecx, 8 ; ecx = bits to copy
xor edx, edx ; edx = masked copy of A

...copyloop:
lea edx, [edx*8] ; edx = edx << 3
shr eax, 1 ; CF = next_bit
adc edx, edx ; edx = (edx << 1) | next_bit
dec ecx
jnz .copyloop

Here's how to do it with multiplies:

; Swizzle al -> eax inserting 0s
movzx eax, byte A
imul eax, 0x00000101 ; ah = al
and eax, 0x0000F00F ; ah = <A7, A6, A5, A4>, al = <A3, A2, A1, A0>
imul eax, 0x00001111
and eax, 0x03030303 ; eax = <A7 | A6, A5 | A4, A3 | A2, A1 | A0>
lea eax, [eax+eax*8] ; Swizzle into nibbles
and eax, 0x11111111

Perhaps someone can improve upon either of those. Each produces 8 bits of
the result at a time. The multiply method takes 17 clocks on an Athlon, 8 of
which are spent in the multiply unit. Unrolling could give a throughput of
~8 clocks per 8 bits on the multiply routine.

-Matt


0
Reply Matt 12/30/2003 6:34:23 PM

If you need to do many swizzles the BT and ADC/RCL/RCR instructions are
the way to go in general - use the carry flag.  The AMD processors do
quite well on bit instructions.

	mov	eax, A
	mov	ebx, B
	mov	ecx, C
	mov	edi, pResult ; pointer to three DWORDs

	bt	eax, 31
	adc	edx, edx
	bt	ebx, 31
	adc	edx, edx
	bt	ecx, 31
	adc	edx, edx

	bt	eax, 30
	adc	edx, edx
	bt	ebx, 30
	adc	edx, edx
	bt	ecx, 30
	adc	edx, edx

....{repeat a few times}

	bt	eax, 21
	adc	edx, edx
	bt	ebx, 21
	adc	edx, edx
	mov	[edi][4*2], esi
	bt	ecx, 21
	adc	edx, edx
	dec	edx

....{repeat a few times}

	bt	eax, 10
	adc	edx, edx
	mov	[edi][4*1], edx
	bt	ebx, 10
	adc	edx, edx
	bt	ecx, 10
	adc	edx, edx
	dec	edx

....{repeat a few times}

	bt	eax, 1
	adc	edx, edx
	bt	ebx, 1
	adc	edx, edx
	bt	ecx, 1
	adc	edx, edx

	bt	eax, 0
	adc	edx, edx
	bt	ebx, 0
	adc	edx, edx
	bt	ecx, 0
	adc	edx, edx

	mov	[edi][4*0], edx


bitRAKE


geomrock@hotmail.com (Elijah Bailey) wrote in message news:<e008fef8.0312291705.63a558dd@posting.google.com>...
> Is there a way to shuffle bits on x86 hardware.
> 
> I have three integers 
> 
> A = a_1 a_2 a_3 ... a_32
> B = b_1 b_2 b_3 ... b_32
> C = c_1 c_2 c_3 ... c_32
> 
> where all a_i's b_i's and c_i's are bits
> and my shuffle looks like this
> 
> X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32
> 
> Is there a way i can generate X using MMX or SIMD instructions 
> such that X is 128 bit integer.
> 
> 
> Thanks in advance for your comments,
> --Elijah

0
Reply bit 12/30/2003 7:29:03 PM

I was thinking that there might be a MMX/SSE or some other
instruction that already does this job for me...I've see
shuffling instructions in the INTEL manuals...what are they
for? What does shuffle instruction mean in SSE?

"Matt Taylor" <para@tampabay.rr.com> wrote in message news:<qg8Ib.154113$%h4.110602@twister.tampabay.rr.com>...
> "Elijah Bailey" <geomrock@hotmail.com> wrote in message
> news:e008fef8.0312291705.63a558dd@posting.google.com...
> > Is there a way to shuffle bits on x86 hardware.
> >
> > I have three integers
> >
> > A = a_1 a_2 a_3 ... a_32
> > B = b_1 b_2 b_3 ... b_32
> > C = c_1 c_2 c_3 ... c_32
> >
> > where all a_i's b_i's and c_i's are bits
> > and my shuffle looks like this
> >
> > X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32
> >
> > Is there a way i can generate X using MMX or SIMD instructions
> > such that X is 128 bit integer.
> 
> 3 is an awkward number. If you can, I would suggest interleaving 4 32-bit
> variables. There are no convenient x86 instructions for bit swizzling,
> although multiplies can sometimes accomplish that.
> 
> I can think of 2 basic strategies:
> 1. Expand 4 32-bit values into 4 separate 128-bit values by inserting 0s in
> between bits, then or them together
> 2. Concatenate all 4 32-bit values together to form a 128-bit value, then
> swizzle using shifts & masking.
> 
> I can't think of any decent way of interleaving individual bits. Perhaps you
> could somehow abuse punpck* instructions. Data swizzling is not one of the
> x86's strong points.
> 
> -Matt

0
Reply geomrock 12/30/2003 7:29:43 PM

"Bx. C" <null@the.void> wrote in message news:<TZ4Ib.8827$sb4.2173@bignews5.bellsouth.net>...
> "Elijah Bailey" <geomrock@hotmail.com> wrote in message
> news:e008fef8.0312291705.63a558dd@posting.google.com...
> > Is there a way to shuffle bits on x86 hardware.
> >
> > I have three integers
> >
> > A = a_1 a_2 a_3 ... a_32
> > B = b_1 b_2 b_3 ... b_32
> > C = c_1 c_2 c_3 ... c_32
> >
> > where all a_i's b_i's and c_i's are bits
> > and my shuffle looks like this
> >
> > X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32
> >
> > Is there a way i can generate X using MMX or SIMD instructions
> > such that X is 128 bit integer.
> >
> >
> > Thanks in advance for your comments,
> > --Elijah
> >
> 
> well.. 32 bits + 32 bits + 32 bits = 96 bits... so... could you define where
> you'd like the remaining 32 bits in X to be? (or how they should be
> defined?)


The rest can all be zeroes...

0
Reply geomrock 12/30/2003 7:29:43 PM

"Elijah Bailey" <geomrock@hotmail.com> wrote in message
news:e008fef8.0312301034.60a17c1f@posting.google.com...
> I was thinking that there might be a MMX/SSE or some other
> instruction that already does this job for me...I've see
> shuffling instructions in the INTEL manuals...what are they
> for? What does shuffle instruction mean in SSE?
<snip>

MMX/SSE shuffles give you 4 multiplexers with 4 inputs. The source and
destination MMX/SSE registers are partitioned into fourths (2-bytes for MMX,
4-bytes for SSE). Each destination fourth can select any of the source
fourths. SSE will also let you select 2-byte quantities from the low or high
8-bytes of the register (pshuflw/pshufhw).

The unpck* instructions are more appropriate since they interleave different
quantities of data. However, you can't use them for interleaving bits.

-Matt


0
Reply Matt 12/30/2003 8:39:36 PM

> I was thinking that there might be a MMX/SSE or some other
> instruction that already does this job for me...I've see
> shuffling instructions in the INTEL manuals...what are they
> for? What does shuffle instruction mean in SSE?

They shuffle bytes, not bits. PSHUFW mm1, mm2/mem, imm8 does the following:
mm1[i] = mm2[BITS(imm8, i, 2)], i = 0..3,  where mm1/2 are represented as
arrays of 4 words and BITS(expr, i, j) means bits i to i+j of expr. Again,
SHUFPS xmm1, xmm2/mem, imm8 does the same with XMM registers represented as
4 single-precision floats. There are also SSE2 instructions that operate
with integer
words in XMM regs. MMX\SIMD aren't going to help you in your case.

Ivan





0
Reply Ivan 12/30/2003 10:08:38 PM

"Elijah Bailey" <geomrock@hotmail.com> wrote in message
news:e008fef8.0312291705.63a558dd@posting.google.com...
> Is there a way to shuffle bits on x86 hardware.
>
> I have three integers
>
> A = a_1 a_2 a_3 ... a_32
> B = b_1 b_2 b_3 ... b_32
> C = c_1 c_2 c_3 ... c_32
>
> where all a_i's b_i's and c_i's are bits
> and my shuffle looks like this
>
> X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32
>
> Is there a way i can generate X using MMX or SIMD instructions
> such that X is 128 bit integer.
>
>
> Thanks in advance for your comments,
> --Elijah
>

Simplify to

A = a_1 a_2 a_3 ... a_8
B = b_1 b_2 b_3 ... b_8
C = c_1 c_2 c_3 ... c_8

For the output

a_1 b_1 c_1 ... a_8 b_8 c_8

index lookup a table (256 entries in this case) for each of A, B and C,
where each entry is

0 0 n_1 0 0 n_2 ... 0 0 n_8

where n_x is set if the corresponding a_x, b_x or c_x bit is set. The entry
is 24 bits long (3 bytes) which is a bit awkward for an OR operation, so pad
with a leading 0 byte, so that the entire table is 1024 bytes. In pseudo
code;

  R = TABLE[A 1..8]
  R << 1bit
  R = R OR TABLE[B 1..8]
  R << 1bit
  R = R OR TABLE[C 1..8]
  R << 24bits
  R = TABLE[A 9..16]
  R << 1bit
  R = R OR TABLE[B 9..16]
  R << 1bit

.... and so on for the 4 bytes. If you want to avoid the 1 bit shift, have 3
tables with entries

a_1 0 0 a_2 0 0 ... a_8 0 0
0 b_1 0 0 b_2 0 ... 0 b_8 0
0 0 c_1 0 0 c_2 ... 0 0 c_8

  R = TABLEA[A 1..8] OR TABLEB[B 1..8] OR TABLEC[C 1..8]
  R << 24bits
  R = R OR TABLEA[A 9..16] OR TABLEB[B 9..16] OR TABLEC[C 9..16]
....

Still only 3K of tables, and it'll be about as fast as it gets. The only
awkward fact is that you've got 32 bit registers; I'm not sure whether you
can shift and or with SSE on a 128bit register. If you can't then the
partial bit pattern would need moved and masked to the MMX/SIMD registers
from the general purpose registers. Perhaps someone else can confirm/deny.

-- 
Regards
Alex McDonald



0
Reply Alex 12/30/2003 10:54:10 PM

"Rickey Bowers Jr." <bit@vsavoldi.com> wrote in message news:905a2eba.0312300812.fa4aab9@posting.google.com...
> If you need to do many swizzles the BT and ADC/RCL/RCR instructions are
> the way to go in general - use the carry flag.  The AMD processors do
> quite well on bit instructions.
>
> mov eax, A
> mov ebx, B
> mov ecx, C
> mov edi, pResult ; pointer to three DWORDs
>
> bt eax, 31
> adc edx, edx
> bt ebx, 31
> adc edx, edx
> bt ecx, 31
> adc edx, edx

BT may be a fairly slow instruction on some machines. Instead of
using BT, I would suggest use of ADD to push bits into the carry
flag, and ADC to push them from the carry flag into the destination
operand (the following code is untested):

;; use ESI:EDI:EDX as the destination

MOV    EAX, SRC_A
MOV    EBX, SRC_B
MOV    ECX, SRC_C

;; gather up top 32 bits in ESI

REPT   10
   ADD    ECX, ECX
   ADC    ESI, ESI
   ADD    EBX, EBX
   ADC    ESI, ESI
   ADD    EAX, EAX
   ADC    ESI, ESI
ENDM

ADD    ECX, ECX
ADC    ESI, ESI
ADD    EBX, EBX
ADC    ESI, ESI

;; gather up middle 32 bits in EDI

REPT    10
   ADD    EAX, EAX
   ADC    EDI, EDI
   ADD    ECX, ECX
   ADC    EDI, EDI
   ADD    EBX, EBX
   ADC    EDI, EDI
ENDM

ADD    EAX, EAX
ADC    EDI, EDI
ADD    ECX, ECX
ADC    EDI, EDI

;; gather up bottom 32 bits in EDX

REPT    10
   ADD    EBX, EBX
   ADC    EDX, EDX
   ADD    EAX, EAX
   ADC    EDX, EDX
   ADD    ECX, ECX
   ADC    EDX, EDX
ENDM

ADD    EBX, EBX
ADC    EDX, EDX
ADD    EAX, EAX
ADC    EDX, EDX

;; store result

-- Norbert


0
Reply Norbert 12/31/2003 1:50:15 AM

Elijah Bailey wrote:

> Is there a way to shuffle bits on x86 hardware.
> 
> I have three integers 
> 
> A = a_1 a_2 a_3 ... a_32
> B = b_1 b_2 b_3 ... b_32
> C = c_1 c_2 c_3 ... c_32
> 
> where all a_i's b_i's and c_i's are bits
> and my shuffle looks like this
> 
> X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32
> 
> Is there a way i can generate X using MMX or SIMD instructions 
> such that X is 128 bit integer.

This is actually quite hard, but not as hard as the inverse operation. :-)

I have looked at it a few times over the years, and I still think lookup 
tables are your best bet:

Setup a 256-entry table with the bits expanded for a, then use it 
multiple times:

   x0 = expand[a & 255]|(expand[b & 255]<<1)|(expand[c & 255]<<2);
   x1 = expand[(a>>8) & 255]|(expand[(b>>8) & 255]<<1)|
        (expand[(c>>8) & 255]<<2);
   x2 = expand[(a>>16) & 255]|(expand[(b>>16) & 255]<<1)|
        (expand[(c>>16) & 255]<<2);
   x3 = expand[a>>24]|(expand[b>>24]<<1)|(expand[c>>24]<<2);

   X = x0 | (x1 << 24) | (x2 << 48) | (x3 << 72);  // 96-bit result

The expand[] table will be 1K in size, it might be feasible to use three 
separate tables, one each for a, b and c, preshifted to save runtime 
shift operations.

Terje

-- 
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"


0
Reply Terje 1/1/2004 12:59:32 PM

12 Replies
196 Views

(page loaded in 4.434 seconds)

Similiar Articles:


















7/14/2012 12:55:12 PM


Reply: