COMPGROUPS.NET | Search | Post Question | Groups | Stream | About | Register

### shuffling bits

• Email
• Follow

```Is there a way to shuffle bits on x86 hardware.

I have three integers

A = a_1 a_2 a_3 ... a_32
B = b_1 b_2 b_3 ... b_32
C = c_1 c_2 c_3 ... c_32

where all a_i's b_i's and c_i's are bits
and my shuffle looks like this

X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32

Is there a way i can generate X using MMX or SIMD instructions
such that X is 128 bit integer.

--Elijah

```
 0

See related articles to this posting

```"Elijah Bailey" <geomrock@hotmail.com> wrote in message
> Is there a way to shuffle bits on x86 hardware.
>
> I have three integers
>
> A = a_1 a_2 a_3 ... a_32
> B = b_1 b_2 b_3 ... b_32
> C = c_1 c_2 c_3 ... c_32
>
> where all a_i's b_i's and c_i's are bits
> and my shuffle looks like this
>
> X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32
>
> Is there a way i can generate X using MMX or SIMD instructions
> such that X is 128 bit integer.
>
>
> --Elijah
>

well.. 32 bits + 32 bits + 32 bits = 96 bits... so... could you define where
you'd like the remaining 32 bits in X to be? (or how they should be
defined?)

```
 0

```"Elijah Bailey" <geomrock@hotmail.com> wrote in message
> Is there a way to shuffle bits on x86 hardware.
>
> I have three integers
>
> A = a_1 a_2 a_3 ... a_32
> B = b_1 b_2 b_3 ... b_32
> C = c_1 c_2 c_3 ... c_32
>
> where all a_i's b_i's and c_i's are bits
> and my shuffle looks like this
>
> X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32
>
> Is there a way i can generate X using MMX or SIMD instructions
> such that X is 128 bit integer.

3 is an awkward number. If you can, I would suggest interleaving 4 32-bit
variables. There are no convenient x86 instructions for bit swizzling,
although multiplies can sometimes accomplish that.

I can think of 2 basic strategies:
1. Expand 4 32-bit values into 4 separate 128-bit values by inserting 0s in
between bits, then or them together
2. Concatenate all 4 32-bit values together to form a 128-bit value, then

I can't think of any decent way of interleaving individual bits. Perhaps you
could somehow abuse punpck* instructions. Data swizzling is not one of the
x86's strong points.

-Matt

```
 0

```"Matt Taylor" <para@tampabay.rr.com> wrote in message
news:qg8Ib.154113\$%h4.110602@twister.tampabay.rr.com...
> "Elijah Bailey" <geomrock@hotmail.com> wrote in message
> > Is there a way to shuffle bits on x86 hardware.
> >
> > I have three integers
> >
> > A = a_1 a_2 a_3 ... a_32
> > B = b_1 b_2 b_3 ... b_32
> > C = c_1 c_2 c_3 ... c_32
> >
> > where all a_i's b_i's and c_i's are bits
> > and my shuffle looks like this
> >
> > X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32
> >
> > Is there a way i can generate X using MMX or SIMD instructions
> > such that X is 128 bit integer.
>
> 3 is an awkward number. If you can, I would suggest interleaving 4 32-bit
> variables. There are no convenient x86 instructions for bit swizzling,
> although multiplies can sometimes accomplish that.
>
> I can think of 2 basic strategies:
> 1. Expand 4 32-bit values into 4 separate 128-bit values by inserting 0s
in
> between bits, then or them together
> 2. Concatenate all 4 32-bit values together to form a 128-bit value, then
> swizzle using shifts & masking.
>
> I can't think of any decent way of interleaving individual bits. Perhaps
you
> could somehow abuse punpck* instructions. Data swizzling is not one of the
> x86's strong points.
>
> -Matt
>
>

yah.. 4 would be a better number... but in either case, i don't know SSE,..
so i don't know any 128-bit media instructions to work with...

the best i'd come up with, would be the double-precision shift instructions,
SHLD and SHRD.... shift a single bit at a time.. combine that with a loop..
taking a bit from each reg, each time through the loop... and you have the
finalized product....

```
 0

```"Bx. C" <null@the.void> wrote in message
news:0i9Ib.10497\$sb4.5001@bignews5.bellsouth.net...
<snip>
> the best i'd come up with, would be the double-precision shift
instructions,
> SHLD and SHRD.... shift a single bit at a time.. combine that with a
loop..
> taking a bit from each reg, each time through the loop... and you have the
> finalized product....

Well at a minimum I would use the shift instructions instead of shld/shrd:

; Swizzle A -> edx inserting 0s
; Trashes eax, ecx
mov eax, A ; eax = A
mov ecx, 8 ; ecx = bits to copy
xor edx, edx ; edx = masked copy of A

...copyloop:
lea edx, [edx*8] ; edx = edx << 3
shr eax, 1 ; CF = next_bit
adc edx, edx ; edx = (edx << 1) | next_bit
dec ecx
jnz .copyloop

Here's how to do it with multiplies:

; Swizzle al -> eax inserting 0s
movzx eax, byte A
imul eax, 0x00000101 ; ah = al
and eax, 0x0000F00F ; ah = <A7, A6, A5, A4>, al = <A3, A2, A1, A0>
imul eax, 0x00001111
and eax, 0x03030303 ; eax = <A7 | A6, A5 | A4, A3 | A2, A1 | A0>
lea eax, [eax+eax*8] ; Swizzle into nibbles
and eax, 0x11111111

Perhaps someone can improve upon either of those. Each produces 8 bits of
the result at a time. The multiply method takes 17 clocks on an Athlon, 8 of
which are spent in the multiply unit. Unrolling could give a throughput of
~8 clocks per 8 bits on the multiply routine.

-Matt

```
 0

```If you need to do many swizzles the BT and ADC/RCL/RCR instructions are
the way to go in general - use the carry flag.  The AMD processors do
quite well on bit instructions.

mov	eax, A
mov	ebx, B
mov	ecx, C
mov	edi, pResult ; pointer to three DWORDs

bt	eax, 31
bt	ebx, 31
bt	ecx, 31

bt	eax, 30
bt	ebx, 30
bt	ecx, 30

....{repeat a few times}

bt	eax, 21
bt	ebx, 21
mov	[edi][4*2], esi
bt	ecx, 21
dec	edx

....{repeat a few times}

bt	eax, 10
mov	[edi][4*1], edx
bt	ebx, 10
bt	ecx, 10
dec	edx

....{repeat a few times}

bt	eax, 1
bt	ebx, 1
bt	ecx, 1

bt	eax, 0
bt	ebx, 0
bt	ecx, 0

mov	[edi][4*0], edx

bitRAKE

geomrock@hotmail.com (Elijah Bailey) wrote in message news:<e008fef8.0312291705.63a558dd@posting.google.com>...
> Is there a way to shuffle bits on x86 hardware.
>
> I have three integers
>
> A = a_1 a_2 a_3 ... a_32
> B = b_1 b_2 b_3 ... b_32
> C = c_1 c_2 c_3 ... c_32
>
> where all a_i's b_i's and c_i's are bits
> and my shuffle looks like this
>
> X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32
>
> Is there a way i can generate X using MMX or SIMD instructions
> such that X is 128 bit integer.
>
>
> --Elijah

```
 0

```"Bx. C" <null@the.void> wrote in message news:<TZ4Ib.8827\$sb4.2173@bignews5.bellsouth.net>...
> "Elijah Bailey" <geomrock@hotmail.com> wrote in message
> > Is there a way to shuffle bits on x86 hardware.
> >
> > I have three integers
> >
> > A = a_1 a_2 a_3 ... a_32
> > B = b_1 b_2 b_3 ... b_32
> > C = c_1 c_2 c_3 ... c_32
> >
> > where all a_i's b_i's and c_i's are bits
> > and my shuffle looks like this
> >
> > X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32
> >
> > Is there a way i can generate X using MMX or SIMD instructions
> > such that X is 128 bit integer.
> >
> >
> > --Elijah
> >
>
> well.. 32 bits + 32 bits + 32 bits = 96 bits... so... could you define where
> you'd like the remaining 32 bits in X to be? (or how they should be
> defined?)

The rest can all be zeroes...

```
 0

```I was thinking that there might be a MMX/SSE or some other
instruction that already does this job for me...I've see
shuffling instructions in the INTEL manuals...what are they
for? What does shuffle instruction mean in SSE?

"Matt Taylor" <para@tampabay.rr.com> wrote in message news:<qg8Ib.154113\$%h4.110602@twister.tampabay.rr.com>...
> "Elijah Bailey" <geomrock@hotmail.com> wrote in message
> > Is there a way to shuffle bits on x86 hardware.
> >
> > I have three integers
> >
> > A = a_1 a_2 a_3 ... a_32
> > B = b_1 b_2 b_3 ... b_32
> > C = c_1 c_2 c_3 ... c_32
> >
> > where all a_i's b_i's and c_i's are bits
> > and my shuffle looks like this
> >
> > X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32
> >
> > Is there a way i can generate X using MMX or SIMD instructions
> > such that X is 128 bit integer.
>
> 3 is an awkward number. If you can, I would suggest interleaving 4 32-bit
> variables. There are no convenient x86 instructions for bit swizzling,
> although multiplies can sometimes accomplish that.
>
> I can think of 2 basic strategies:
> 1. Expand 4 32-bit values into 4 separate 128-bit values by inserting 0s in
> between bits, then or them together
> 2. Concatenate all 4 32-bit values together to form a 128-bit value, then
> swizzle using shifts & masking.
>
> I can't think of any decent way of interleaving individual bits. Perhaps you
> could somehow abuse punpck* instructions. Data swizzling is not one of the
> x86's strong points.
>
> -Matt

```
 0

```"Elijah Bailey" <geomrock@hotmail.com> wrote in message
> I was thinking that there might be a MMX/SSE or some other
> instruction that already does this job for me...I've see
> shuffling instructions in the INTEL manuals...what are they
> for? What does shuffle instruction mean in SSE?
<snip>

MMX/SSE shuffles give you 4 multiplexers with 4 inputs. The source and
destination MMX/SSE registers are partitioned into fourths (2-bytes for MMX,
4-bytes for SSE). Each destination fourth can select any of the source
fourths. SSE will also let you select 2-byte quantities from the low or high
8-bytes of the register (pshuflw/pshufhw).

The unpck* instructions are more appropriate since they interleave different
quantities of data. However, you can't use them for interleaving bits.

-Matt

```
 0

```> I was thinking that there might be a MMX/SSE or some other
> instruction that already does this job for me...I've see
> shuffling instructions in the INTEL manuals...what are they
> for? What does shuffle instruction mean in SSE?

They shuffle bytes, not bits. PSHUFW mm1, mm2/mem, imm8 does the following:
mm1[i] = mm2[BITS(imm8, i, 2)], i = 0..3,  where mm1/2 are represented as
arrays of 4 words and BITS(expr, i, j) means bits i to i+j of expr. Again,
SHUFPS xmm1, xmm2/mem, imm8 does the same with XMM registers represented as
4 single-precision floats. There are also SSE2 instructions that operate
with integer

Ivan

```
 0

```"Elijah Bailey" <geomrock@hotmail.com> wrote in message
> Is there a way to shuffle bits on x86 hardware.
>
> I have three integers
>
> A = a_1 a_2 a_3 ... a_32
> B = b_1 b_2 b_3 ... b_32
> C = c_1 c_2 c_3 ... c_32
>
> where all a_i's b_i's and c_i's are bits
> and my shuffle looks like this
>
> X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32
>
> Is there a way i can generate X using MMX or SIMD instructions
> such that X is 128 bit integer.
>
>
> --Elijah
>

Simplify to

A = a_1 a_2 a_3 ... a_8
B = b_1 b_2 b_3 ... b_8
C = c_1 c_2 c_3 ... c_8

For the output

a_1 b_1 c_1 ... a_8 b_8 c_8

index lookup a table (256 entries in this case) for each of A, B and C,
where each entry is

0 0 n_1 0 0 n_2 ... 0 0 n_8

where n_x is set if the corresponding a_x, b_x or c_x bit is set. The entry
is 24 bits long (3 bytes) which is a bit awkward for an OR operation, so pad
with a leading 0 byte, so that the entire table is 1024 bytes. In pseudo
code;

R = TABLE[A 1..8]
R << 1bit
R = R OR TABLE[B 1..8]
R << 1bit
R = R OR TABLE[C 1..8]
R << 24bits
R = TABLE[A 9..16]
R << 1bit
R = R OR TABLE[B 9..16]
R << 1bit

.... and so on for the 4 bytes. If you want to avoid the 1 bit shift, have 3
tables with entries

a_1 0 0 a_2 0 0 ... a_8 0 0
0 b_1 0 0 b_2 0 ... 0 b_8 0
0 0 c_1 0 0 c_2 ... 0 0 c_8

R = TABLEA[A 1..8] OR TABLEB[B 1..8] OR TABLEC[C 1..8]
R << 24bits
R = R OR TABLEA[A 9..16] OR TABLEB[B 9..16] OR TABLEC[C 9..16]
....

Still only 3K of tables, and it'll be about as fast as it gets. The only
awkward fact is that you've got 32 bit registers; I'm not sure whether you
can shift and or with SSE on a 128bit register. If you can't then the
partial bit pattern would need moved and masked to the MMX/SIMD registers
from the general purpose registers. Perhaps someone else can confirm/deny.

--
Regards
Alex McDonald

```
 0

```"Rickey Bowers Jr." <bit@vsavoldi.com> wrote in message news:905a2eba.0312300812.fa4aab9@posting.google.com...
> If you need to do many swizzles the BT and ADC/RCL/RCR instructions are
> the way to go in general - use the carry flag.  The AMD processors do
> quite well on bit instructions.
>
> mov eax, A
> mov ebx, B
> mov ecx, C
> mov edi, pResult ; pointer to three DWORDs
>
> bt eax, 31
> bt ebx, 31
> bt ecx, 31

BT may be a fairly slow instruction on some machines. Instead of
using BT, I would suggest use of ADD to push bits into the carry
flag, and ADC to push them from the carry flag into the destination
operand (the following code is untested):

;; use ESI:EDI:EDX as the destination

MOV    EAX, SRC_A
MOV    EBX, SRC_B
MOV    ECX, SRC_C

;; gather up top 32 bits in ESI

REPT   10
ENDM

;; gather up middle 32 bits in EDI

REPT    10
ENDM

;; gather up bottom 32 bits in EDX

REPT    10
ENDM

;; store result

-- Norbert

```
 0

```Elijah Bailey wrote:

> Is there a way to shuffle bits on x86 hardware.
>
> I have three integers
>
> A = a_1 a_2 a_3 ... a_32
> B = b_1 b_2 b_3 ... b_32
> C = c_1 c_2 c_3 ... c_32
>
> where all a_i's b_i's and c_i's are bits
> and my shuffle looks like this
>
> X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32
>
> Is there a way i can generate X using MMX or SIMD instructions
> such that X is 128 bit integer.

This is actually quite hard, but not as hard as the inverse operation. :-)

I have looked at it a few times over the years, and I still think lookup

Setup a 256-entry table with the bits expanded for a, then use it
multiple times:

x0 = expand[a & 255]|(expand[b & 255]<<1)|(expand[c & 255]<<2);
x1 = expand[(a>>8) & 255]|(expand[(b>>8) & 255]<<1)|
(expand[(c>>8) & 255]<<2);
x2 = expand[(a>>16) & 255]|(expand[(b>>16) & 255]<<1)|
(expand[(c>>16) & 255]<<2);
x3 = expand[a>>24]|(expand[b>>24]<<1)|(expand[c>>24]<<2);

X = x0 | (x1 << 24) | (x2 << 48) | (x3 << 72);  // 96-bit result

The expand[] table will be 1K in size, it might be feasible to use three
separate tables, one each for a, b and c, preshifted to save runtime
shift operations.

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

```
 0

12 Replies
243 Views

Similar Articles

12/6/2013 8:43:13 AM
page loaded in 29457 ms. (0)

Similar Artilces:

Getting at the bits of a 32-bit integer
Hi there list, I'm a beginning programmer, so please correct me if any of the following assumptions are wrong. Suppose I have the decimal number 255. Since integers in Python are 32 bits, it would look like this in binary: 00000000 00000000 00000000 11111111 There are plenty of unused bits to the left of the number itself. If I wanted to use some of these bits as true/false flags, how would I go about it? In Python, how do I write code that gets at the leftmost byte, or the third bit from the left of the set of 32, or the rightmost byte plus the bit to its left, etc... Thanks in advance for any help on this. :) Jacob If you really want to do that, you can use the bitwise "and" and "or" operators to unset/set bits, respectivly. For example, if you want to set bit 9, that's 0x0100, or 0000 0001 0000 0000 in binary. The integer 'i' with bit 9 unset is 'i & ~0x0100' (~ is the one's complement or bitwise not operator; it inverts all the bits), and 'i | 0x0100' is 'i' with bit 9 set. You can also use the bitwise exclusive or "^" to toggle a bit, like 'i ^ 0x0100'. However, doing this sort

About the bits reserved for float variable
I looking at the Chapter 5 of the Bulding Aplication. It says, for float variables that it's a 32 bits number in the range of +/-10^38 withe approximately six or seven decimal places of significance. What I'm missing here? How can a number 32 bit number range between -10^38 and +10^38? In article <c8l0c4\$jle\$1@pegasus.fccn.pt>, "Nuno Oliveira" <nmoliveira@fc.ul.pt> wrote: > I looking at the Chapter 5 of the Bulding Aplication. It says, for > float variables that it's a 32 bits number in the range of +/-10^38 > withe approximately six or seven decimal places of significance. What > I'm missing here? How can a number 32 bit number range between -10^38 > and +10^38? > For an 32 bit floating point number, the first bit is the sign bit. the next 8 bits are the exponent, the last 23 bits are the mantissa (IEEE) The exponent has 8 bits, it can do -128 -> 128 in base 2, 2^128 = 3.4 x 10^38 2^-128 = ...10^-38 The 23 bits of the mantissa represent a number between 0 and 2 (scaled). 2^23 = 8388608, a 7 digit number There's an equation to convert them on the IEEE website I think. Chris. Nuno Oliveira wrote: > I

Reading individual bits of MODIS BRDF quality files
I am running IDL 7.0.6 on Windows vista (with ENVI 4.6). I am using MODIS BRDF corrected 500 m data (MCD43A4). I have sucessfully written a program to open the HDF files and produce the output vegetation indices I'm interested in. However, I need to filter out cloud- contaminated pixels, and am having problem reading the bit-packed quality control file (MCD43A2). I have written a program to extract the layer I'm interested in from the HDF file and save it as a 32-bit unsigned integer. However, I really need to extract the raw values of the first 12 bits of this number (the values of the remaining fields do not matter to me). I started writing a program to list all the numbers (from 0-4294967295) that result from my required bit combination of these first 12 bits, but that approach is not going to work as there are several million 'allowable' numbers. Is there a function in IDL that can read individual bits, or any other clever way to produce an array of 1s and 0s from a 32-bit integer? Many thanks for your help, Edward Mitchard On Fri, 18 Sep 2009 02:54:55 -0700 (PDT), emitchard <emitchard@googlemail.com> wrote: >I am running IDL 7.0.6 on Windows

Trouble setting IPv6 traffic class (QoS) bits with iperf
I am running iperf 2.0.5 on 2.6.23.1-42.fc8. I want to use the -S option to set the IPv6 traffic class (QoS) bits. I used the following command: iperf -c fec0::4:f5d2:c695:40e3:cab6 -b 10k -t 1000000 -S 0x38 -p 5111 -l 64 -V With tcpdump, I can see that the traffic class is still 0x0. I know the -S option works with IPv4. Does it work with IPv6? If so, what am I doing wrong?