COMPGROUPS.NET | Search | Post Question | Groups | Stream | About | Register

### shuffling bits

• Follow

```Is there a way to shuffle bits on x86 hardware.

I have three integers

A = a_1 a_2 a_3 ... a_32
B = b_1 b_2 b_3 ... b_32
C = c_1 c_2 c_3 ... c_32

where all a_i's b_i's and c_i's are bits
and my shuffle looks like this

X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32

Is there a way i can generate X using MMX or SIMD instructions
such that X is 128 bit integer.

--Elijah

```
 0

```"Elijah Bailey" <geomrock@hotmail.com> wrote in message
> Is there a way to shuffle bits on x86 hardware.
>
> I have three integers
>
> A = a_1 a_2 a_3 ... a_32
> B = b_1 b_2 b_3 ... b_32
> C = c_1 c_2 c_3 ... c_32
>
> where all a_i's b_i's and c_i's are bits
> and my shuffle looks like this
>
> X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32
>
> Is there a way i can generate X using MMX or SIMD instructions
> such that X is 128 bit integer.
>
>
> --Elijah
>

well.. 32 bits + 32 bits + 32 bits = 96 bits... so... could you define where
you'd like the remaining 32 bits in X to be? (or how they should be
defined?)

```
 0

```"Elijah Bailey" <geomrock@hotmail.com> wrote in message
> Is there a way to shuffle bits on x86 hardware.
>
> I have three integers
>
> A = a_1 a_2 a_3 ... a_32
> B = b_1 b_2 b_3 ... b_32
> C = c_1 c_2 c_3 ... c_32
>
> where all a_i's b_i's and c_i's are bits
> and my shuffle looks like this
>
> X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32
>
> Is there a way i can generate X using MMX or SIMD instructions
> such that X is 128 bit integer.

3 is an awkward number. If you can, I would suggest interleaving 4 32-bit
variables. There are no convenient x86 instructions for bit swizzling,
although multiplies can sometimes accomplish that.

I can think of 2 basic strategies:
1. Expand 4 32-bit values into 4 separate 128-bit values by inserting 0s in
between bits, then or them together
2. Concatenate all 4 32-bit values together to form a 128-bit value, then

I can't think of any decent way of interleaving individual bits. Perhaps you
could somehow abuse punpck* instructions. Data swizzling is not one of the
x86's strong points.

-Matt

```
 0

```"Matt Taylor" <para@tampabay.rr.com> wrote in message
news:qg8Ib.154113\$%h4.110602@twister.tampabay.rr.com...
> "Elijah Bailey" <geomrock@hotmail.com> wrote in message
> > Is there a way to shuffle bits on x86 hardware.
> >
> > I have three integers
> >
> > A = a_1 a_2 a_3 ... a_32
> > B = b_1 b_2 b_3 ... b_32
> > C = c_1 c_2 c_3 ... c_32
> >
> > where all a_i's b_i's and c_i's are bits
> > and my shuffle looks like this
> >
> > X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32
> >
> > Is there a way i can generate X using MMX or SIMD instructions
> > such that X is 128 bit integer.
>
> 3 is an awkward number. If you can, I would suggest interleaving 4 32-bit
> variables. There are no convenient x86 instructions for bit swizzling,
> although multiplies can sometimes accomplish that.
>
> I can think of 2 basic strategies:
> 1. Expand 4 32-bit values into 4 separate 128-bit values by inserting 0s
in
> between bits, then or them together
> 2. Concatenate all 4 32-bit values together to form a 128-bit value, then
> swizzle using shifts & masking.
>
> I can't think of any decent way of interleaving individual bits. Perhaps
you
> could somehow abuse punpck* instructions. Data swizzling is not one of the
> x86's strong points.
>
> -Matt
>
>

yah.. 4 would be a better number... but in either case, i don't know SSE,..
so i don't know any 128-bit media instructions to work with...

the best i'd come up with, would be the double-precision shift instructions,
SHLD and SHRD.... shift a single bit at a time.. combine that with a loop..
taking a bit from each reg, each time through the loop... and you have the
finalized product....

```
 0

```"Bx. C" <null@the.void> wrote in message
news:0i9Ib.10497\$sb4.5001@bignews5.bellsouth.net...
<snip>
> the best i'd come up with, would be the double-precision shift
instructions,
> SHLD and SHRD.... shift a single bit at a time.. combine that with a
loop..
> taking a bit from each reg, each time through the loop... and you have the
> finalized product....

Well at a minimum I would use the shift instructions instead of shld/shrd:

; Swizzle A -> edx inserting 0s
; Trashes eax, ecx
mov eax, A ; eax = A
mov ecx, 8 ; ecx = bits to copy
xor edx, edx ; edx = masked copy of A

...copyloop:
lea edx, [edx*8] ; edx = edx << 3
shr eax, 1 ; CF = next_bit
adc edx, edx ; edx = (edx << 1) | next_bit
dec ecx
jnz .copyloop

Here's how to do it with multiplies:

; Swizzle al -> eax inserting 0s
movzx eax, byte A
imul eax, 0x00000101 ; ah = al
and eax, 0x0000F00F ; ah = <A7, A6, A5, A4>, al = <A3, A2, A1, A0>
imul eax, 0x00001111
and eax, 0x03030303 ; eax = <A7 | A6, A5 | A4, A3 | A2, A1 | A0>
lea eax, [eax+eax*8] ; Swizzle into nibbles
and eax, 0x11111111

Perhaps someone can improve upon either of those. Each produces 8 bits of
the result at a time. The multiply method takes 17 clocks on an Athlon, 8 of
which are spent in the multiply unit. Unrolling could give a throughput of
~8 clocks per 8 bits on the multiply routine.

-Matt

```
 0

```If you need to do many swizzles the BT and ADC/RCL/RCR instructions are
the way to go in general - use the carry flag.  The AMD processors do
quite well on bit instructions.

mov	eax, A
mov	ebx, B
mov	ecx, C
mov	edi, pResult ; pointer to three DWORDs

bt	eax, 31
bt	ebx, 31
bt	ecx, 31

bt	eax, 30
bt	ebx, 30
bt	ecx, 30

....{repeat a few times}

bt	eax, 21
bt	ebx, 21
mov	[edi][4*2], esi
bt	ecx, 21
dec	edx

....{repeat a few times}

bt	eax, 10
mov	[edi][4*1], edx
bt	ebx, 10
bt	ecx, 10
dec	edx

....{repeat a few times}

bt	eax, 1
bt	ebx, 1
bt	ecx, 1

bt	eax, 0
bt	ebx, 0
bt	ecx, 0

mov	[edi][4*0], edx

bitRAKE

geomrock@hotmail.com (Elijah Bailey) wrote in message news:<e008fef8.0312291705.63a558dd@posting.google.com>...
> Is there a way to shuffle bits on x86 hardware.
>
> I have three integers
>
> A = a_1 a_2 a_3 ... a_32
> B = b_1 b_2 b_3 ... b_32
> C = c_1 c_2 c_3 ... c_32
>
> where all a_i's b_i's and c_i's are bits
> and my shuffle looks like this
>
> X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32
>
> Is there a way i can generate X using MMX or SIMD instructions
> such that X is 128 bit integer.
>
>
> --Elijah

```
 0

```I was thinking that there might be a MMX/SSE or some other
instruction that already does this job for me...I've see
shuffling instructions in the INTEL manuals...what are they
for? What does shuffle instruction mean in SSE?

"Matt Taylor" <para@tampabay.rr.com> wrote in message news:<qg8Ib.154113\$%h4.110602@twister.tampabay.rr.com>...
> "Elijah Bailey" <geomrock@hotmail.com> wrote in message
> > Is there a way to shuffle bits on x86 hardware.
> >
> > I have three integers
> >
> > A = a_1 a_2 a_3 ... a_32
> > B = b_1 b_2 b_3 ... b_32
> > C = c_1 c_2 c_3 ... c_32
> >
> > where all a_i's b_i's and c_i's are bits
> > and my shuffle looks like this
> >
> > X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32
> >
> > Is there a way i can generate X using MMX or SIMD instructions
> > such that X is 128 bit integer.
>
> 3 is an awkward number. If you can, I would suggest interleaving 4 32-bit
> variables. There are no convenient x86 instructions for bit swizzling,
> although multiplies can sometimes accomplish that.
>
> I can think of 2 basic strategies:
> 1. Expand 4 32-bit values into 4 separate 128-bit values by inserting 0s in
> between bits, then or them together
> 2. Concatenate all 4 32-bit values together to form a 128-bit value, then
> swizzle using shifts & masking.
>
> I can't think of any decent way of interleaving individual bits. Perhaps you
> could somehow abuse punpck* instructions. Data swizzling is not one of the
> x86's strong points.
>
> -Matt

```
 0

```"Bx. C" <null@the.void> wrote in message news:<TZ4Ib.8827\$sb4.2173@bignews5.bellsouth.net>...
> "Elijah Bailey" <geomrock@hotmail.com> wrote in message
> > Is there a way to shuffle bits on x86 hardware.
> >
> > I have three integers
> >
> > A = a_1 a_2 a_3 ... a_32
> > B = b_1 b_2 b_3 ... b_32
> > C = c_1 c_2 c_3 ... c_32
> >
> > where all a_i's b_i's and c_i's are bits
> > and my shuffle looks like this
> >
> > X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32
> >
> > Is there a way i can generate X using MMX or SIMD instructions
> > such that X is 128 bit integer.
> >
> >
> > --Elijah
> >
>
> well.. 32 bits + 32 bits + 32 bits = 96 bits... so... could you define where
> you'd like the remaining 32 bits in X to be? (or how they should be
> defined?)

The rest can all be zeroes...

```
 0

```"Elijah Bailey" <geomrock@hotmail.com> wrote in message
> I was thinking that there might be a MMX/SSE or some other
> instruction that already does this job for me...I've see
> shuffling instructions in the INTEL manuals...what are they
> for? What does shuffle instruction mean in SSE?
<snip>

MMX/SSE shuffles give you 4 multiplexers with 4 inputs. The source and
destination MMX/SSE registers are partitioned into fourths (2-bytes for MMX,
4-bytes for SSE). Each destination fourth can select any of the source
fourths. SSE will also let you select 2-byte quantities from the low or high
8-bytes of the register (pshuflw/pshufhw).

The unpck* instructions are more appropriate since they interleave different
quantities of data. However, you can't use them for interleaving bits.

-Matt

```
 0

```> I was thinking that there might be a MMX/SSE or some other
> instruction that already does this job for me...I've see
> shuffling instructions in the INTEL manuals...what are they
> for? What does shuffle instruction mean in SSE?

They shuffle bytes, not bits. PSHUFW mm1, mm2/mem, imm8 does the following:
mm1[i] = mm2[BITS(imm8, i, 2)], i = 0..3,  where mm1/2 are represented as
arrays of 4 words and BITS(expr, i, j) means bits i to i+j of expr. Again,
SHUFPS xmm1, xmm2/mem, imm8 does the same with XMM registers represented as
4 single-precision floats. There are also SSE2 instructions that operate
with integer

Ivan

```
 0

```"Elijah Bailey" <geomrock@hotmail.com> wrote in message
> Is there a way to shuffle bits on x86 hardware.
>
> I have three integers
>
> A = a_1 a_2 a_3 ... a_32
> B = b_1 b_2 b_3 ... b_32
> C = c_1 c_2 c_3 ... c_32
>
> where all a_i's b_i's and c_i's are bits
> and my shuffle looks like this
>
> X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32
>
> Is there a way i can generate X using MMX or SIMD instructions
> such that X is 128 bit integer.
>
>
> --Elijah
>

Simplify to

A = a_1 a_2 a_3 ... a_8
B = b_1 b_2 b_3 ... b_8
C = c_1 c_2 c_3 ... c_8

For the output

a_1 b_1 c_1 ... a_8 b_8 c_8

index lookup a table (256 entries in this case) for each of A, B and C,
where each entry is

0 0 n_1 0 0 n_2 ... 0 0 n_8

where n_x is set if the corresponding a_x, b_x or c_x bit is set. The entry
is 24 bits long (3 bytes) which is a bit awkward for an OR operation, so pad
with a leading 0 byte, so that the entire table is 1024 bytes. In pseudo
code;

R = TABLE[A 1..8]
R << 1bit
R = R OR TABLE[B 1..8]
R << 1bit
R = R OR TABLE[C 1..8]
R << 24bits
R = TABLE[A 9..16]
R << 1bit
R = R OR TABLE[B 9..16]
R << 1bit

.... and so on for the 4 bytes. If you want to avoid the 1 bit shift, have 3
tables with entries

a_1 0 0 a_2 0 0 ... a_8 0 0
0 b_1 0 0 b_2 0 ... 0 b_8 0
0 0 c_1 0 0 c_2 ... 0 0 c_8

R = TABLEA[A 1..8] OR TABLEB[B 1..8] OR TABLEC[C 1..8]
R << 24bits
R = R OR TABLEA[A 9..16] OR TABLEB[B 9..16] OR TABLEC[C 9..16]
....

Still only 3K of tables, and it'll be about as fast as it gets. The only
awkward fact is that you've got 32 bit registers; I'm not sure whether you
can shift and or with SSE on a 128bit register. If you can't then the
partial bit pattern would need moved and masked to the MMX/SIMD registers
from the general purpose registers. Perhaps someone else can confirm/deny.

--
Regards
Alex McDonald

```
 0

```"Rickey Bowers Jr." <bit@vsavoldi.com> wrote in message news:905a2eba.0312300812.fa4aab9@posting.google.com...
> If you need to do many swizzles the BT and ADC/RCL/RCR instructions are
> the way to go in general - use the carry flag.  The AMD processors do
> quite well on bit instructions.
>
> mov eax, A
> mov ebx, B
> mov ecx, C
> mov edi, pResult ; pointer to three DWORDs
>
> bt eax, 31
> bt ebx, 31
> bt ecx, 31

BT may be a fairly slow instruction on some machines. Instead of
using BT, I would suggest use of ADD to push bits into the carry
flag, and ADC to push them from the carry flag into the destination
operand (the following code is untested):

;; use ESI:EDI:EDX as the destination

MOV    EAX, SRC_A
MOV    EBX, SRC_B
MOV    ECX, SRC_C

;; gather up top 32 bits in ESI

REPT   10
ENDM

;; gather up middle 32 bits in EDI

REPT    10
ENDM

;; gather up bottom 32 bits in EDX

REPT    10
ENDM

;; store result

-- Norbert

```
 0

```Elijah Bailey wrote:

> Is there a way to shuffle bits on x86 hardware.
>
> I have three integers
>
> A = a_1 a_2 a_3 ... a_32
> B = b_1 b_2 b_3 ... b_32
> C = c_1 c_2 c_3 ... c_32
>
> where all a_i's b_i's and c_i's are bits
> and my shuffle looks like this
>
> X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32
>
> Is there a way i can generate X using MMX or SIMD instructions
> such that X is 128 bit integer.

This is actually quite hard, but not as hard as the inverse operation. :-)

I have looked at it a few times over the years, and I still think lookup

Setup a 256-entry table with the bits expanded for a, then use it
multiple times:

x0 = expand[a & 255]|(expand[b & 255]<<1)|(expand[c & 255]<<2);
x1 = expand[(a>>8) & 255]|(expand[(b>>8) & 255]<<1)|
(expand[(c>>8) & 255]<<2);
x2 = expand[(a>>16) & 255]|(expand[(b>>16) & 255]<<1)|
(expand[(c>>16) & 255]<<2);
x3 = expand[a>>24]|(expand[b>>24]<<1)|(expand[c>>24]<<2);

X = x0 | (x1 << 24) | (x2 << 48) | (x3 << 72);  // 96-bit result

The expand[] table will be 1K in size, it might be feasible to use three
separate tables, one each for a, b and c, preshifted to save runtime
shift operations.

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

```
 0

12 Replies
196 Views

Similiar Articles:

7/14/2012 12:55:12 PM