Is there a way to shuffle bits on x86 hardware.
I have three integers
A = a_1 a_2 a_3 ... a_32
B = b_1 b_2 b_3 ... b_32
C = c_1 c_2 c_3 ... c_32
where all a_i's b_i's and c_i's are bits
and my shuffle looks like this
X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32
Is there a way i can generate X using MMX or SIMD instructions
such that X is 128 bit integer.
Thanks in advance for your comments,
--Elijah
|
|
0
|
|
|
|
Reply
|
geomrock
|
12/30/2003 1:47:18 AM |
|
"Elijah Bailey" <geomrock@hotmail.com> wrote in message
news:e008fef8.0312291705.63a558dd@posting.google.com...
> Is there a way to shuffle bits on x86 hardware.
>
> I have three integers
>
> A = a_1 a_2 a_3 ... a_32
> B = b_1 b_2 b_3 ... b_32
> C = c_1 c_2 c_3 ... c_32
>
> where all a_i's b_i's and c_i's are bits
> and my shuffle looks like this
>
> X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32
>
> Is there a way i can generate X using MMX or SIMD instructions
> such that X is 128 bit integer.
>
>
> Thanks in advance for your comments,
> --Elijah
>
well.. 32 bits + 32 bits + 32 bits = 96 bits... so... could you define where
you'd like the remaining 32 bits in X to be? (or how they should be
defined?)
|
|
0
|
|
|
|
Reply
|
Bx
|
12/30/2003 2:35:33 AM
|
|
"Elijah Bailey" <geomrock@hotmail.com> wrote in message
news:e008fef8.0312291705.63a558dd@posting.google.com...
> Is there a way to shuffle bits on x86 hardware.
>
> I have three integers
>
> A = a_1 a_2 a_3 ... a_32
> B = b_1 b_2 b_3 ... b_32
> C = c_1 c_2 c_3 ... c_32
>
> where all a_i's b_i's and c_i's are bits
> and my shuffle looks like this
>
> X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32
>
> Is there a way i can generate X using MMX or SIMD instructions
> such that X is 128 bit integer.
3 is an awkward number. If you can, I would suggest interleaving 4 32-bit
variables. There are no convenient x86 instructions for bit swizzling,
although multiplies can sometimes accomplish that.
I can think of 2 basic strategies:
1. Expand 4 32-bit values into 4 separate 128-bit values by inserting 0s in
between bits, then or them together
2. Concatenate all 4 32-bit values together to form a 128-bit value, then
swizzle using shifts & masking.
I can't think of any decent way of interleaving individual bits. Perhaps you
could somehow abuse punpck* instructions. Data swizzling is not one of the
x86's strong points.
-Matt
|
|
0
|
|
|
|
Reply
|
Matt
|
12/30/2003 5:48:09 AM
|
|
"Matt Taylor" <para@tampabay.rr.com> wrote in message
news:qg8Ib.154113$%h4.110602@twister.tampabay.rr.com...
> "Elijah Bailey" <geomrock@hotmail.com> wrote in message
> news:e008fef8.0312291705.63a558dd@posting.google.com...
> > Is there a way to shuffle bits on x86 hardware.
> >
> > I have three integers
> >
> > A = a_1 a_2 a_3 ... a_32
> > B = b_1 b_2 b_3 ... b_32
> > C = c_1 c_2 c_3 ... c_32
> >
> > where all a_i's b_i's and c_i's are bits
> > and my shuffle looks like this
> >
> > X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32
> >
> > Is there a way i can generate X using MMX or SIMD instructions
> > such that X is 128 bit integer.
>
> 3 is an awkward number. If you can, I would suggest interleaving 4 32-bit
> variables. There are no convenient x86 instructions for bit swizzling,
> although multiplies can sometimes accomplish that.
>
> I can think of 2 basic strategies:
> 1. Expand 4 32-bit values into 4 separate 128-bit values by inserting 0s
in
> between bits, then or them together
> 2. Concatenate all 4 32-bit values together to form a 128-bit value, then
> swizzle using shifts & masking.
>
> I can't think of any decent way of interleaving individual bits. Perhaps
you
> could somehow abuse punpck* instructions. Data swizzling is not one of the
> x86's strong points.
>
> -Matt
>
>
yah.. 4 would be a better number... but in either case, i don't know SSE,..
so i don't know any 128-bit media instructions to work with...
the best i'd come up with, would be the double-precision shift instructions,
SHLD and SHRD.... shift a single bit at a time.. combine that with a loop..
taking a bit from each reg, each time through the loop... and you have the
finalized product....
|
|
0
|
|
|
|
Reply
|
Bx
|
12/30/2003 7:27:43 AM
|
|
"Bx. C" <null@the.void> wrote in message
news:0i9Ib.10497$sb4.5001@bignews5.bellsouth.net...
<snip>
> the best i'd come up with, would be the double-precision shift
instructions,
> SHLD and SHRD.... shift a single bit at a time.. combine that with a
loop..
> taking a bit from each reg, each time through the loop... and you have the
> finalized product....
Well at a minimum I would use the shift instructions instead of shld/shrd:
; Swizzle A -> edx inserting 0s
; Trashes eax, ecx
mov eax, A ; eax = A
mov ecx, 8 ; ecx = bits to copy
xor edx, edx ; edx = masked copy of A
...copyloop:
lea edx, [edx*8] ; edx = edx << 3
shr eax, 1 ; CF = next_bit
adc edx, edx ; edx = (edx << 1) | next_bit
dec ecx
jnz .copyloop
Here's how to do it with multiplies:
; Swizzle al -> eax inserting 0s
movzx eax, byte A
imul eax, 0x00000101 ; ah = al
and eax, 0x0000F00F ; ah = <A7, A6, A5, A4>, al = <A3, A2, A1, A0>
imul eax, 0x00001111
and eax, 0x03030303 ; eax = <A7 | A6, A5 | A4, A3 | A2, A1 | A0>
lea eax, [eax+eax*8] ; Swizzle into nibbles
and eax, 0x11111111
Perhaps someone can improve upon either of those. Each produces 8 bits of
the result at a time. The multiply method takes 17 clocks on an Athlon, 8 of
which are spent in the multiply unit. Unrolling could give a throughput of
~8 clocks per 8 bits on the multiply routine.
-Matt
|
|
0
|
|
|
|
Reply
|
Matt
|
12/30/2003 6:34:23 PM
|
|
If you need to do many swizzles the BT and ADC/RCL/RCR instructions are
the way to go in general - use the carry flag. The AMD processors do
quite well on bit instructions.
mov eax, A
mov ebx, B
mov ecx, C
mov edi, pResult ; pointer to three DWORDs
bt eax, 31
adc edx, edx
bt ebx, 31
adc edx, edx
bt ecx, 31
adc edx, edx
bt eax, 30
adc edx, edx
bt ebx, 30
adc edx, edx
bt ecx, 30
adc edx, edx
....{repeat a few times}
bt eax, 21
adc edx, edx
bt ebx, 21
adc edx, edx
mov [edi][4*2], esi
bt ecx, 21
adc edx, edx
dec edx
....{repeat a few times}
bt eax, 10
adc edx, edx
mov [edi][4*1], edx
bt ebx, 10
adc edx, edx
bt ecx, 10
adc edx, edx
dec edx
....{repeat a few times}
bt eax, 1
adc edx, edx
bt ebx, 1
adc edx, edx
bt ecx, 1
adc edx, edx
bt eax, 0
adc edx, edx
bt ebx, 0
adc edx, edx
bt ecx, 0
adc edx, edx
mov [edi][4*0], edx
bitRAKE
geomrock@hotmail.com (Elijah Bailey) wrote in message news:<e008fef8.0312291705.63a558dd@posting.google.com>...
> Is there a way to shuffle bits on x86 hardware.
>
> I have three integers
>
> A = a_1 a_2 a_3 ... a_32
> B = b_1 b_2 b_3 ... b_32
> C = c_1 c_2 c_3 ... c_32
>
> where all a_i's b_i's and c_i's are bits
> and my shuffle looks like this
>
> X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32
>
> Is there a way i can generate X using MMX or SIMD instructions
> such that X is 128 bit integer.
>
>
> Thanks in advance for your comments,
> --Elijah
|
|
0
|
|
|
|
Reply
|
bit
|
12/30/2003 7:29:03 PM
|
|
I was thinking that there might be a MMX/SSE or some other
instruction that already does this job for me...I've see
shuffling instructions in the INTEL manuals...what are they
for? What does shuffle instruction mean in SSE?
"Matt Taylor" <para@tampabay.rr.com> wrote in message news:<qg8Ib.154113$%h4.110602@twister.tampabay.rr.com>...
> "Elijah Bailey" <geomrock@hotmail.com> wrote in message
> news:e008fef8.0312291705.63a558dd@posting.google.com...
> > Is there a way to shuffle bits on x86 hardware.
> >
> > I have three integers
> >
> > A = a_1 a_2 a_3 ... a_32
> > B = b_1 b_2 b_3 ... b_32
> > C = c_1 c_2 c_3 ... c_32
> >
> > where all a_i's b_i's and c_i's are bits
> > and my shuffle looks like this
> >
> > X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32
> >
> > Is there a way i can generate X using MMX or SIMD instructions
> > such that X is 128 bit integer.
>
> 3 is an awkward number. If you can, I would suggest interleaving 4 32-bit
> variables. There are no convenient x86 instructions for bit swizzling,
> although multiplies can sometimes accomplish that.
>
> I can think of 2 basic strategies:
> 1. Expand 4 32-bit values into 4 separate 128-bit values by inserting 0s in
> between bits, then or them together
> 2. Concatenate all 4 32-bit values together to form a 128-bit value, then
> swizzle using shifts & masking.
>
> I can't think of any decent way of interleaving individual bits. Perhaps you
> could somehow abuse punpck* instructions. Data swizzling is not one of the
> x86's strong points.
>
> -Matt
|
|
0
|
|
|
|
Reply
|
geomrock
|
12/30/2003 7:29:43 PM
|
|
"Bx. C" <null@the.void> wrote in message news:<TZ4Ib.8827$sb4.2173@bignews5.bellsouth.net>...
> "Elijah Bailey" <geomrock@hotmail.com> wrote in message
> news:e008fef8.0312291705.63a558dd@posting.google.com...
> > Is there a way to shuffle bits on x86 hardware.
> >
> > I have three integers
> >
> > A = a_1 a_2 a_3 ... a_32
> > B = b_1 b_2 b_3 ... b_32
> > C = c_1 c_2 c_3 ... c_32
> >
> > where all a_i's b_i's and c_i's are bits
> > and my shuffle looks like this
> >
> > X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32
> >
> > Is there a way i can generate X using MMX or SIMD instructions
> > such that X is 128 bit integer.
> >
> >
> > Thanks in advance for your comments,
> > --Elijah
> >
>
> well.. 32 bits + 32 bits + 32 bits = 96 bits... so... could you define where
> you'd like the remaining 32 bits in X to be? (or how they should be
> defined?)
The rest can all be zeroes...
|
|
0
|
|
|
|
Reply
|
geomrock
|
12/30/2003 7:29:43 PM
|
|
"Elijah Bailey" <geomrock@hotmail.com> wrote in message
news:e008fef8.0312301034.60a17c1f@posting.google.com...
> I was thinking that there might be a MMX/SSE or some other
> instruction that already does this job for me...I've see
> shuffling instructions in the INTEL manuals...what are they
> for? What does shuffle instruction mean in SSE?
<snip>
MMX/SSE shuffles give you 4 multiplexers with 4 inputs. The source and
destination MMX/SSE registers are partitioned into fourths (2-bytes for MMX,
4-bytes for SSE). Each destination fourth can select any of the source
fourths. SSE will also let you select 2-byte quantities from the low or high
8-bytes of the register (pshuflw/pshufhw).
The unpck* instructions are more appropriate since they interleave different
quantities of data. However, you can't use them for interleaving bits.
-Matt
|
|
0
|
|
|
|
Reply
|
Matt
|
12/30/2003 8:39:36 PM
|
|
> I was thinking that there might be a MMX/SSE or some other
> instruction that already does this job for me...I've see
> shuffling instructions in the INTEL manuals...what are they
> for? What does shuffle instruction mean in SSE?
They shuffle bytes, not bits. PSHUFW mm1, mm2/mem, imm8 does the following:
mm1[i] = mm2[BITS(imm8, i, 2)], i = 0..3, where mm1/2 are represented as
arrays of 4 words and BITS(expr, i, j) means bits i to i+j of expr. Again,
SHUFPS xmm1, xmm2/mem, imm8 does the same with XMM registers represented as
4 single-precision floats. There are also SSE2 instructions that operate
with integer
words in XMM regs. MMX\SIMD aren't going to help you in your case.
Ivan
|
|
0
|
|
|
|
Reply
|
Ivan
|
12/30/2003 10:08:38 PM
|
|
"Elijah Bailey" <geomrock@hotmail.com> wrote in message
news:e008fef8.0312291705.63a558dd@posting.google.com...
> Is there a way to shuffle bits on x86 hardware.
>
> I have three integers
>
> A = a_1 a_2 a_3 ... a_32
> B = b_1 b_2 b_3 ... b_32
> C = c_1 c_2 c_3 ... c_32
>
> where all a_i's b_i's and c_i's are bits
> and my shuffle looks like this
>
> X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32
>
> Is there a way i can generate X using MMX or SIMD instructions
> such that X is 128 bit integer.
>
>
> Thanks in advance for your comments,
> --Elijah
>
Simplify to
A = a_1 a_2 a_3 ... a_8
B = b_1 b_2 b_3 ... b_8
C = c_1 c_2 c_3 ... c_8
For the output
a_1 b_1 c_1 ... a_8 b_8 c_8
index lookup a table (256 entries in this case) for each of A, B and C,
where each entry is
0 0 n_1 0 0 n_2 ... 0 0 n_8
where n_x is set if the corresponding a_x, b_x or c_x bit is set. The entry
is 24 bits long (3 bytes) which is a bit awkward for an OR operation, so pad
with a leading 0 byte, so that the entire table is 1024 bytes. In pseudo
code;
R = TABLE[A 1..8]
R << 1bit
R = R OR TABLE[B 1..8]
R << 1bit
R = R OR TABLE[C 1..8]
R << 24bits
R = TABLE[A 9..16]
R << 1bit
R = R OR TABLE[B 9..16]
R << 1bit
.... and so on for the 4 bytes. If you want to avoid the 1 bit shift, have 3
tables with entries
a_1 0 0 a_2 0 0 ... a_8 0 0
0 b_1 0 0 b_2 0 ... 0 b_8 0
0 0 c_1 0 0 c_2 ... 0 0 c_8
R = TABLEA[A 1..8] OR TABLEB[B 1..8] OR TABLEC[C 1..8]
R << 24bits
R = R OR TABLEA[A 9..16] OR TABLEB[B 9..16] OR TABLEC[C 9..16]
....
Still only 3K of tables, and it'll be about as fast as it gets. The only
awkward fact is that you've got 32 bit registers; I'm not sure whether you
can shift and or with SSE on a 128bit register. If you can't then the
partial bit pattern would need moved and masked to the MMX/SIMD registers
from the general purpose registers. Perhaps someone else can confirm/deny.
--
Regards
Alex McDonald
|
|
0
|
|
|
|
Reply
|
Alex
|
12/30/2003 10:54:10 PM
|
|
"Rickey Bowers Jr." <bit@vsavoldi.com> wrote in message news:905a2eba.0312300812.fa4aab9@posting.google.com...
> If you need to do many swizzles the BT and ADC/RCL/RCR instructions are
> the way to go in general - use the carry flag. The AMD processors do
> quite well on bit instructions.
>
> mov eax, A
> mov ebx, B
> mov ecx, C
> mov edi, pResult ; pointer to three DWORDs
>
> bt eax, 31
> adc edx, edx
> bt ebx, 31
> adc edx, edx
> bt ecx, 31
> adc edx, edx
BT may be a fairly slow instruction on some machines. Instead of
using BT, I would suggest use of ADD to push bits into the carry
flag, and ADC to push them from the carry flag into the destination
operand (the following code is untested):
;; use ESI:EDI:EDX as the destination
MOV EAX, SRC_A
MOV EBX, SRC_B
MOV ECX, SRC_C
;; gather up top 32 bits in ESI
REPT 10
ADD ECX, ECX
ADC ESI, ESI
ADD EBX, EBX
ADC ESI, ESI
ADD EAX, EAX
ADC ESI, ESI
ENDM
ADD ECX, ECX
ADC ESI, ESI
ADD EBX, EBX
ADC ESI, ESI
;; gather up middle 32 bits in EDI
REPT 10
ADD EAX, EAX
ADC EDI, EDI
ADD ECX, ECX
ADC EDI, EDI
ADD EBX, EBX
ADC EDI, EDI
ENDM
ADD EAX, EAX
ADC EDI, EDI
ADD ECX, ECX
ADC EDI, EDI
;; gather up bottom 32 bits in EDX
REPT 10
ADD EBX, EBX
ADC EDX, EDX
ADD EAX, EAX
ADC EDX, EDX
ADD ECX, ECX
ADC EDX, EDX
ENDM
ADD EBX, EBX
ADC EDX, EDX
ADD EAX, EAX
ADC EDX, EDX
;; store result
-- Norbert
|
|
0
|
|
|
|
Reply
|
Norbert
|
12/31/2003 1:50:15 AM
|
|
Elijah Bailey wrote:
> Is there a way to shuffle bits on x86 hardware.
>
> I have three integers
>
> A = a_1 a_2 a_3 ... a_32
> B = b_1 b_2 b_3 ... b_32
> C = c_1 c_2 c_3 ... c_32
>
> where all a_i's b_i's and c_i's are bits
> and my shuffle looks like this
>
> X = a_1 b_1 c_1 a_2 b_2 c_2 ... a_32 b_32 c_32
>
> Is there a way i can generate X using MMX or SIMD instructions
> such that X is 128 bit integer.
This is actually quite hard, but not as hard as the inverse operation. :-)
I have looked at it a few times over the years, and I still think lookup
tables are your best bet:
Setup a 256-entry table with the bits expanded for a, then use it
multiple times:
x0 = expand[a & 255]|(expand[b & 255]<<1)|(expand[c & 255]<<2);
x1 = expand[(a>>8) & 255]|(expand[(b>>8) & 255]<<1)|
(expand[(c>>8) & 255]<<2);
x2 = expand[(a>>16) & 255]|(expand[(b>>16) & 255]<<1)|
(expand[(c>>16) & 255]<<2);
x3 = expand[a>>24]|(expand[b>>24]<<1)|(expand[c>>24]<<2);
X = x0 | (x1 << 24) | (x2 << 48) | (x3 << 72); // 96-bit result
The expand[] table will be 1K in size, it might be feasible to use three
separate tables, one each for a, b and c, preshifted to save runtime
shift operations.
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
|
|
0
|
|
|
|
Reply
|
Terje
|
1/1/2004 12:59:32 PM
|
|
|
12 Replies
196 Views
(page loaded in 4.434 seconds)
Similiar Articles: Signed shift of 32-bit int using 16-bit instructions? - comp.lang ...128 bit integer - comp.lang.c Signed shift of 32-bit int using 16-bit instructions? - comp.lang ... shuffling bits - comp.lang.asm.x86... 32 Is there a way i can generate ... Randperm, Randi, and Shuffle - comp.soft-sys.matlabProblem is now Shuffle is fast, 0.32sec for the above extreme, but not as fast as randi ... Dell Precision 690, Intel 2xQuad-Core E5345 @ 2.33GHz 16GB RAM, Windows7 64-bit ... sse2 and rolr - comp.lang.asm.x86In general I believe you'd need 2 64-bit shifts and an OR to merge them. Working with 32-bit shifts only could be handled with two initial 4x32 shuffle operations, that ... Fast bit-reverse on an x86? - comp.dspI need to bit-reverse a slew of > bytes, so the faster the better. If you need the bit reversal for a bit-reversal shuffle loop for FFT I may be able to help out ... How to generate random number without replacement? - comp.lang ...$ perldoc -q shuffle Found in /usr/local/lib/perl5/5.10.1/pod/perlfaq4.pod How do I ... You will make this group a way more friendlier place, and hey, maybe it gets a bit ... imellipse to mask (speed up) - comp.soft-sys.matlabI want to create a mask from an ellipse draw with imellipse. Currently I am using the following piece of code (in a loop): figure; % previousl... [comp.publish.cdrom] CD-Recordable FAQ, Part 1/4 - comp.publish ...Archive-name: cdrom/cd-recordable/part1 Posting-Frequency: monthly Last-modified: 2008/10/09 Version: 2.71 Send corrections and updates to And... SSE Programming - comp.lang.asm.x86... 25: 66 0f 58 c1 addpd %xmm1,%xmm0 register __m128d p9; _mm_shuffle ... notice you are using gcc 4.1 on i386. > I am using gcc 3.3.5 on amd64 (on 64-bit ... Packet timestamps when using Windows-7/Vista - comp.protocols.time ...I suspect if you shuffle which boxes have reference clocks, the system clock stepping ... file list archive devel manuals source TIMESTAMP ... 10, 64 ... with running 32-bit ... ms2gt MODIS reprojection toolkit - comp.lang.idl-pvwaveAlso, I assume some kind of 32-bit OS? Have to tried a different projection, just to see if you can get something to work? I have to do some car shuffling this ... Find NT disk size with script/bat file - comp.os.ms-windows.nt ...There is no temp file in the script shown in the posting ... A word about the "without temp files" bit in this ... disk more ... config\temp.txt config.txt The replace ... Umax SuperMac C600/Apus 3000 - comp.sys.mac.hardware.misc ...speaking of the performa 5260, it is only a bit older then the Umax - and uses a 603e as well. Can I cannibalize something from it and put it in the UMAX? PROMAC 2A Eprom Programmer - comp.os.cpm(So, it is 8-bit CP/M-compatible!) I leave it to more knowledgeable readers to ... data in the buffer BLOCK STORE Fill a range with a specific byte SHUFF/SPLIT Shuffle ... Random array? - comp.cad.solidworksSeth Renigar wrote: > By the way, calculating this may take quite a bit of time ... This will give you a series of random cards from ... push_back( buff ); } shuffle ... how to calcule the division modulo 2 reminder - comp.sys.hp48 ...I know that you can shuffle around factors manually or write an RPL program to achieve ... Extended-Precision Division - comp.lang.asm.x86 8 Bit CRC - comp.compression ... LMFAO SHUFFLING BIT UTSAV 2012 - YouTubePublished on Jun 4, 2012 by siyadsalam LMFAO SHUFFLING BIT UTSAV 2012 Dance Shuffle Category: People & Blogs Tags: LMFAO SHUFFLING BIT UTSAV 2012 Dance ... Danny Shuffling to The Time (Dirty Bit) - YouTubeDo not own music. My brother just getting down at the house. Not good quality but Its pretty good (: 7/14/2012 12:55:12 PM
|