nibble array/BCD notation/MMX

  • Follow


Hello there,

I want to access an array of nibbles. I have developed in C the
function
that everyone will do without too much effort, but it is terribly
slow. I am
trying to approach the problem from another point of view in order to
see
if I get something faster. What I am looking for is a magic
instruction
(ala btsl) that given a pointer and an index, it returns the nth
nibble.

This is a similar problem to obtain the nth digit of a BCD (packed)
format.

I am wondering if there is a MMX (or some other extension) that could
do this job (I heard they treat a word in sets of 4/8/32/64/128 bits,
but
I am not into Intel assembly extensions).

Thanks in advance, any light is highly appreciated,

David

0
Reply ditiem 8/13/2008 7:12:30 PM

ditiem schrieb:
> Hello there,
> 
> I want to access an array of nibbles. I have developed in C the
> function
> that everyone will do without too much effort, but it is terribly
> slow. 

Either you're using division and modulus per access or have some
uncommon definition of "terribly slow". Let's see the code.

> I am wondering if there is a MMX (or some other extension) that could
> do this job (I heard they treat a word in sets of 4/8/32/64/128 bits,
> but I am not into Intel assembly extensions).

There's nothing which will help you for extracting a single specific nibble.
There may be something for you if you want many nibbles in fixed patterns
and do some operations on them.


Hendrik vdH

0
Reply Hendrik 8/14/2008 6:41:59 AM


David "ditiem" wrote:

> Hello there,
Hi,
> I want to access an array of nibbles. I have developed in C the
> function that everyone will do without too much effort, but it is
> terribly slow.

Sure. C isn't prone to work at low level :)

> I am trying to approach the problem from another point of view in
> order to see if I get something faster.
> What I am looking for is a magic instruction (ala btsl)
> that given a pointer and an index, it returns the nth nibble.

BTS isn't fast anyway.

My solution is quite simple,
even not optimised except it avoids branches:
(but don't ask me how to do this in C, perhaps Inline.Asm ?)
______________________
;push ...               ;save registers as required
    mov ebx,nibble_index
    mov esi,source_ptr
;mov edi,result_buffer  ;if desired
;push ebx               ;make it a LOCAL  if for a loop
;L0:
;__________________
    shr ebx,1            ;DIV by 2, bit0 -> Carry, indicates ODD/EVEN
    mov al,[esi+ebx]     ;get byte from [esi + ebx/2]
;mov ebx,[esp]        ;restore nibble index
    setc cl              ;CL= 0 or 1 (= bit0 of index)
    shl cl,2             ;*4= 0 or 4
    shr al,cl            ;no change if the index were EVEN
    and al,0f            ;
;__________________
DONE:                    ;AL contains the nibble yet
;mov [edi],al           ;store it
;inc/dec edi            ;advance destination ptr
;inc/dec ebx            ;advance nibble index
;cmp ebx,[last_index]   ;better use a register  ie: edx
;jc/jns/jnbe L0         ;iterate as desired/required
;add esp,4              ;kill the LOCAL after the loop
;pop ...                ;resore what you pushed on top
_______________________
You see this are just 6 lines (16 bytes) for a single nibble access.
Feel free to uncomment/delete/modify it for your own LOOP-solution.

Would be interesting to hear if this register- and dependency-stalling
piece of code performs any faster than your C-solution.

> This is a similar problem to obtain the nth digit of a BCD (packed)
> format.

I treat BCD and HEX-nibbles the same way (except for input limitation).

> I am wondering if there is a MMX (or some other extension) that could
> do this job (I heard they treat a word in sets of 4/8/32/64/128 bits,
> but I am not into Intel assembly extensions).

SSE instruction may be of help for conversion of packed bytes,
but I haven't seen any nibble oriented functionality in there.

__
wolfgang


0
Reply Wolfgang 8/14/2008 12:29:35 PM

"David "ditiem" wrote:

> Hello there,
Hi,
> I want to access an array of nibbles. I have developed in C the
> function that everyone will do without too much effort, but it is
> terribly slow.

Sure. C isn't prone to work at low level :)

> I am trying to approach the problem from another point of view in
> order to see if I get something faster.
> What I am looking for is a magic instruction (ala btsl)
> that given a pointer and an index, it returns the nth nibble.

BTS isn't fast anyway.

My solution is quite simple,
even not optimised except it avoids branches:
(but don't ask me how to do this in C, perhaps Inline.Asm ?)
______________________
;push ...               ;save registers as required
    mov ebx,nibble_index
    mov esi,source_ptr
;mov edi,result_buffer  ;if desired
;push ebx               ;make it a LOCAL  if for a loop
;L0:
;__________________
    shr ebx,1            ;DIV by 2, bit0 -> Carry, indicates ODD/EVEN
    mov al,[esi+ebx]     ;get byte from [esi + ebx/2]
;mov ebx,[esp]        ;restore nibble index
    setc cl              ;CL= 0 or 1 (= bit0 of index)
    shl cl,2             ;*4= 0 or 4
    shr al,cl            ;no change if the index were EVEN
    and al,0f            ;
;__________________
DONE:                    ;AL contains the nibble yet
;mov [edi],al           ;store it
;inc/dec edi            ;advance destination ptr
;inc/dec ebx            ;advance nibble index
;cmp ebx,[last_index]   ;better use a register  ie: edx
;jc/jns/jnbe L0         ;iterate as desired/required
;add esp,4              ;kill the LOCAL after the loop
;pop ...                ;resore what you pushed on top
_______________________
You see this are just 6 lines (16 bytes) for a single nibble access.
Feel free to uncomment/delete/modify it for your own LOOP-solution.

Would be interesting to hear if this register- and dependency-stalling
piece of code performs any faster than your C-solution.

> This is a similar problem to obtain the nth digit of a BCD (packed)
> format.

I treat BCD and HEX-nibbles the same way (except for input limitation).

> I am wondering if there is a MMX (or some other extension) that could
> do this job (I heard they treat a word in sets of 4/8/32/64/128 bits,
> but I am not into Intel assembly extensions).

SSE instruction may be of help for conversion of packed bytes,
but I haven't seen any nibble oriented functionality in there.

__
wolfgang


0
Reply Wolfgang 8/14/2008 12:33:01 PM

sorry for the double post, I may have hit the send button twice ...

__
wolfgang


0
Reply Wolfgang 8/14/2008 1:08:07 PM

On Aug 14, 8:41 am, Hendrik van der Heijden  <spamt...@crayne.org>
wrote:

> Either you're using division and modulus per access or have some
> uncommon definition of "terribly slow". Let's see the code.

Nah! The problem is a bit more complicated than a simple array and an
index.  I will try to explain in more detail. I am implementing a
virtual machine in which I need to store a variable and its type
(because it is dynamic). Let's say that the common technique for these
cases is to use either the 4 highest or lowest bits of a word to store
the type. In my design I dont want this to happen, so I decided that I
can store the types of several the variables all together in one
word. According to this, the memory addresses which has the last 7
bits to 0x0, 0x4,0x8 and 0xC, stores the types of the next (28) memory
addresses. For example, for a given variable (read pointer) C, the
type must be found in C&~0x7F plus the offset of C respect that
address: C & 0x7F (plus a shift of course)

The function is:

#define INDEX2_MASK 0x3F

void set_var_type ( unsigned int *C, unsigned int T )
{
  unsigned int  Cindex = ((unsigned int)(C) &  INDEX_MASK) ;
  unsigned char *CBase  = (unsigned char*)((unsigned int)(C) &
~INDEX_MASK) ;

  CBase  += Cindex >> 3 ;
  *CBase = (Cindex & 0x4)? (*CBase & 0x8F) | (T << 4) : (*CBase &
0xF8) | T ;
}

scared about the if-then-else? well, surprise! that function goes
faster (in my machine, a Pentium III 850MHz) than this:

void set_var_type_B ( unsigned int *C, unsigned int T )
{
  unsigned int  Cindex = ((unsigned int)(C) &  INDEX2_MASK) ;
  unsigned int *CBase  = (unsigned int*)((unsigned int)(C) &
~INDEX2_MASK) ;

  CBase += Cindex >> 5 ;
  Cindex &= 0x7C ;

  *CBase &= ~(0xF << Cindex) ;
  *CBase |= T << Cindex ;
}

Answering your question about "terrible slow". The original function
was:

void set_var_type_original ( unsigned int *C, unsigned char T )
{
    *C &= 0xF0000000 ;
    *C |= T << 28 ;
}

and according to the benchmarks the functions above have and slow-down
of 2.4 and 4.6 respectively (do not ask me why, I have lost whole day
trying to find the reason).

Hope this answer your question :).

Best regards,

David

0
Reply ditiem 8/14/2008 7:13:18 PM

On Aug 14, 2:33 pm, "Wolfgang Kern"  <spamt...@crayne.org> wrote:

> BTS isn't fast anyway.

Oops! may I ask how slow it is?


> My solution is quite simple,
> even not optimised except it avoids branches:
> (but don't ask me how to do this in C, perhaps Inline.Asm ?)

Leave that on my side... I am really fighting with gcc to put some
inline asm :((, really a headache!

> You see this are just 6 lines (16 bytes) for a single nibble access.
> Feel free to uncomment/delete/modify it for your own LOOP-solution.

> Would be interesting to hear if this register- and dependency-stalling
> piece of code performs any faster than your C-solution.

Your code is good for reading (I love the trick of shr and setc :D),
but now I am concentrating in writing. When I do the testing for
reading I will let you know how fast it goes (looks very good to me).

As I tried to explain to Hendrik, one of the good problems here is to
obtain the base address and the "index" of the nibble according to
that base address. Another point is how you map the index of the
nibble in the nibble-array, that is, which bits from the address you
take as the byte index and which bits you take as the nibble index.
The
solution that you propose is exactly the first one that came to my
mind, but today I found another way of mapping the things that can
save 1 or instructions.

I have a nice solution for writing the nibble with no branches in
only 10 instructions (plus the ones generated by the compiler to copy
the arguments in 2 registers). My problem now is that gcc says that I
use too many registers and that it cannot find a way to compile it...
(this should be a temporal problem).

> SSE instruction may be of help for conversion of packed bytes,
> but I haven't seen any nibble oriented functionality in there.

I abandoned that line, I read that they can work with nibbles, but
they just add, subtract and few things more. Really complicated to
obtain the nibble from there.


Best regards and thank a lot for the solution!

David

0
Reply ditiem 8/14/2008 7:29:43 PM

"ditiem" <spamtrap@crayne.org> wrote in message
news:68383c9b-13cd-4bf2-8996-6fb556aa750b@w24g2000prd.googlegroups.com...
> What I am looking for is a magic
> instruction
> (ala btsl) that given a pointer and an index, it returns the nth
> nibble.

As a general rule-of-thumb, "magic instructions" are slow on today's x86 or
later processors...

> I am wondering if there is a MMX (or some other extension) that could
> do this job

No single instruction in the general instruction set that I recognize.
Maybe something in MMX, SSE2, etc...

> I want to access an array of nibbles.

Well, since I'm interested in this issue also, I was waiting to see if
someone posted some SSE2 etc. solution.

Without such a solution, your goal is to reduce the work.  So, I'd first try
a few precomputed 256 byte lookup arrays or tables.  Alternately,  I've had
good optimization results with methods similar to your
'set_var_type_original()' function, using optimization, only 'int' and
'unsigned char' types, and inlining.  The goal is to get C types that the C
compiler will match with registers or move into registers so fewer
instructions are emitted.  This is a bit of trial and error since many
compilers no longer respect the 'register' keyword.  You might even find it
is better to copy file scope variables into procedure auto scope
variables... sometimes.  The inlining, by inline keyword or GCC attribute,
should eliminate the procedure's overhead.  This is preferred to inline
assembly since many compilers won't optimize inline assembly, but will
optimize inlined C.

To break apart an 8-bit register value or unsigned char in C (assuming
8-bits...),  the first array would be precomputed to return the upper
nybble, and the second array would return the lower nybble.  This can be
done in assembly or C easily.  This avoids all shifts, etc. after the
initilization of the tables.  You could use the 'xlatb' instruction in
assembly, but it's a "magic instruction"...

E.g.,

  upper_nybble=upper_nybble_table[byte];  /* e.g., 0xF1 into table returns
0x0F */
  lower_nybble=lower_nybble_table[byte];  /* e.g., 0xF1 into table returns
0x01 */

Hopefully, with optimization, your compiler will reduce that to two 'mov'
instructions.  Or, you could code it that way in assembly.

Depending on the compiler's optimization ability, a simple 'and' may suffice
for the lower nybble, being reduced to a single 'and' instruction.

    lower_nybble=byte&0x0F;

Or, you could code an 'and' in assembly...

To construct an 8-bit register value or unsigned char in C (assuming
8-bits...) from nybbles, an array would be precomputed to return the upper
nybble left shifted four-bits, and you already have the lower nybble.  You
then bitwise-or them together.

E.g.,

  byte=shifted_upper_nybble_table[upper_nybble]|lower_nybble;

HTH,


Rod Pemberton

0
Reply Rod 8/14/2008 10:58:21 PM

ditiem wrote:
> Answering your question about "terrible slow". The original function
> was:
> 
> void set_var_type_original ( unsigned int *C, unsigned char T )
> {
>     *C &= 0xF0000000 ;
>     *C |= T << 28 ;
> }

So basically what you're doing is to take the top byte of a 32-bit 
variable, ORing the top 4 bits together with the bottom 4 bits of T, and 
clearing out the bottom 28 bits?

Many compilers should generate code very similar to the following:

	unsigned c = *C & 0xF0000000;
	c |= T << 28;
	*C = c;

which will compile into (assume ecx, edx contains the input parameters):

	mov eax,[ecx]
	shl edx,28

	and eax,0f0000000h

	or eax,edx

	mov [ecx],eax

I.e. taking 5 cycles, and that's the best you can do without changing 
the problem.

Terje
-- 
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

0
Reply Terje 8/15/2008 7:08:25 AM

On Aug 15, 9:08�am, Terje Mathisen  <spamt...@crayne.org> wrote:

> So basically what you're doing is to take the top byte of a 32-bit
> variable, ORing the top 4 bits together with the bottom 4 bits of T, and
> clearing out the bottom 28 bits?

Exactly

> which will compile into (assume ecx, edx contains the input parameters):
>
> � � � � mov eax,[ecx]
> � � � � shl edx,28
>
> � � � � and eax,0f0000000h
>
> � � � � or eax,edx
>
> � � � � mov [ecx],eax
>
> I.e. taking 5 cycles, and that's the best you can do without changing
> the problem.

Is that solution better than this one?

         shl edx,28
         and 0f0000000h, [ecx]
         or  edx       , [ecx]

I am asking because it has been more than 10 years without touching
assembly and I dont remember neither the restrictions nor the times of
the
instructions.

regards,

David

0
Reply ditiem 8/15/2008 7:38:12 AM

David "ditiem" wrote:

>> BTS isn't fast anyway.
> Oops! may I ask how slow it is?

First, BT and friends work on single bits only.
Second, latency = 8/9 cycles for memory operands.
But it's fast for single-bit manipulation in registers
and quite useful for atomic rd/modify like for MP/MT-flags.

>> My solution is quite simple,
>> even not optimised except it avoids branches:
>> (but don't ask me how to do this in C, perhaps Inline.Asm ?)
> Leave that on my side... I am really fighting with gcc to put some
> inline asm :((, really a headache!

That's why I use asm/machine code tools ...

>> You see this are just 6 lines (16 bytes) for a single nibble access.
>> Feel free to uncomment/delete/modify it for your own LOOP-solution.

>> Would be interesting to hear if this register- and dependency-stalling
>> piece of code performs any faster than your C-solution.

> Your code is good for reading (I love the trick of shr and setc :D),
> but now I am concentrating in writing. When I do the testing for
> reading I will let you know how fast it goes (looks very good to me).

Looking forward to see the difference ...

> As I tried to explain to Hendrik, one of the good problems here is to
> obtain the base address and the "index" of the nibble according to
> that base address. Another point is how you map the index of the
> nibble in the nibble-array, that is, which bits from the address you
> take as the byte index and which bits you take as the nibble index.

I just treat it as even/odd nibbles, where the low nibble is the even
and both got the same address as the associeted byte.
So the relation between nibble-offset and byte-offset is always 2:1
and the remainder of the div by 2 (carry-bit) tells high or low nibble.

> The solution that you propose is exactly the first one that came to my
> mind, but today I found another way of mapping the things that can
> save 1 or instructions.
> I have a nice solution for writing the nibble with no branches in
> only 10 instructions (plus the ones generated by the compiler to copy
> the arguments in 2 registers).

Good, would you care to share it ? ;)

> My problem now is that gcc says that I
> use too many registers and that it cannot find a way to compile it...
> (this should be a temporal problem).

Sorry for I cannot help with this ...

>> SSE instruction may be of help for conversion of packed bytes,
>> but I haven't seen any nibble oriented functionality in there.

> I abandoned that line, I read that they can work with nibbles, but
> they just add, subtract and few things more. Really complicated to
> obtain the nibble from there.

Ok.
__
wolfgang


0
Reply Wolfgang 8/15/2008 7:43:47 AM

"ditiem" <spamtrap@crayne.org> wrote in message 
news:ec1d76e5-a53a-4ace-a156-9431edacb651@r15g2000prh.googlegroups.com...
> On Aug 15, 9:08 am, Terje Mathisen  <spamt...@crayne.org> wrote:
>
>>
>> mov eax,[ecx]
>> shl edx,28
>>
>> and eax,0f0000000h
>>
>> or eax,edx
>>
>> mov [ecx],eax
>>
>> I.e. taking 5 cycles, and that's the best you can do without 
>> changing
>> the problem.
>
> Is that solution better than this one?
>
>         shl edx,28
>         and 0f0000000h, [ecx]
>         or  edx       , [ecx]

Typical x86 assemblers notate the target at the left side. So I think 
you mean:

         shl edx,28
         and [ecx],0f0000000h
         or  [ecx],edx

The problem is, that each logical operation with the [ecx] addressing 
mode will read the memory, modify the value and write back the value to 
memory. That doubles the memory bus cycles.

The five instructions above must be read from memory too, but this is 
done with instruction pipeline, cache etc.

/Helge

0
Reply Helge 8/15/2008 1:50:11 PM

"ditiem" <spamtrap@crayne.org> wrote in message
news:ec1d76e5-a53a-4ace-a156-9431edacb651@r15g2000prh.googlegroups.com...
> On Aug 15, 9:08 am, Terje Mathisen  <spamt...@crayne.org> wrote:
>
> > So basically what you're doing is to take the top byte of a 32-bit
> > variable, ORing the top 4 bits together with the bottom 4 bits of T, and
> > clearing out the bottom 28 bits?
>
> Exactly

....questions...

> > ORing the top 4 bits together with the bottom 4 bits of T,
_and_
> > clearing out the bottom 28 bits?

If the top 4-bits isn't already clear, i.e., zeroed, doesn't the ORing
corrupt setting the upper 4-bits of C to T?  I.e., why does the _setting_ of
a type T into C also  _OR_  T with another pre-existing T value in C?  E.g.,
this _sets_ the upper 4-bits to T, and lower 28-bits to zero, without
modification:

void set_var_type_original ( unsigned int *C, unsigned char T )
{
    *C = T << 28 ;
}

Or, this:

void set_var_type_original ( unsigned int *C, unsigned char T )
{
    *C = (T&0x0F) << 28 ;
}

> > clearing out the bottom 28 bits?

Since the value of C will be cleared, i.e., zeroed, except for the upper
4-bits, wouldn't you prefer to use the lower 4-bits of C for T?  E.g.,

void set_var_type_original ( unsigned int *C, unsigned char T )
{
    *C &= 0x0F;
    *C |= T&0x0F;
}

Or, assuming you didn't want to OR T with a pre-existing T value in C, but
wanted to set the bits to T in C e.g.,

void set_var_type_original ( unsigned int *C, unsigned char T )
{
    *C = T&0x0F;
}


Rod Pemberton

0
Reply Rod 8/15/2008 4:12:39 PM

On Aug 15, 6:12 pm, "Rod Pemberton"  <spamt...@crayne.org> wrote:
> > > ORing the top 4 bits together with the bottom 4 bits of T,
> _and_
> > > clearing out the bottom 28 bits?

Oops! my fault, I copied the code too fast, the first solution I
posted (in reply to Hendrik) was:

void set_var_type_original ( unsigned int *C, unsigned char T )
{
    *C &= 0xF0000000 ;
    *C |= T << 28 ;
}

but it is wrong! there is an ~ missing (I expanded many #define). The
solution should be:

void set_var_type_original ( unsigned int *C, unsigned char T )
{
    *C &= ~0xF0000000 ;
    *C |= T << 28 ;
}

When I read "clearing out" I misunderstood you (in my defense I would
like to say it was too late :D), I thought you meant something like
"leaving the bottom 28 bits free/clear for the value we want to store
in it". Sorry for the confusion... no comments :(

> Since the value of C will be cleared, i.e., zeroed, except for the upper
> 4-bits, wouldn't you prefer to use the lower 4-bits of C for T?  E.g.,

This question maybe has no sense now, but the reason because the top 4
bits of C are used is because C is usually an offset. In virtual
machines, it is very common to have a base address to emulate a memory
area, and many pointers pointing to that memory area. Usually this
pointers are stored as an offset from the base area, so the upper 4
bits are free (and you pray everyday for not to appear an user who
creates a program big enough that would generate an offset so big that
it would overlap the 4 top bits).

Sorry for the confusion again,

David

0
Reply ditiem 8/15/2008 8:20:10 PM

> Typical x86 assemblers notate the target at the left side. So I think
> you mean:

You are right, I meant exactly that. I think in x86 assembly but I
have
to write it in GNU AS (that changes the order of the operands)

> The problem is, that each logical operation with the [ecx] addressing
> mode will read the memory, modify the value and write back the value to
> memory. That doubles the memory bus cycles.

Ummmm I would expect the cache to avoid those memory cycles... but it
is
good to know it!

Thanks a lot,
David

0
Reply ditiem 8/15/2008 8:27:00 PM

"ditiem" <spamtrap@crayne.org> wrote in message
news:5833807d-4a07-4413-b81a-a8400157f085@i20g2000prf.googlegroups.com...
> On Aug 15, 6:12 pm, "Rod Pemberton"  <spamt...@crayne.org> wrote:
> > > > ORing the top 4 bits together with the bottom 4 bits of T,
> > _and_
> > > > clearing out the bottom 28 bits?
>
> Oops! my fault, I copied the code too fast, the first solution I
> posted (in reply to Hendrik) was:
>
> void set_var_type_original ( unsigned int *C, unsigned char T )
> {
>     *C &= 0xF0000000 ;
>     *C |= T << 28 ;
> }
>
> but it is wrong! there is an ~ missing (I expanded many #define). The
> solution should be:
>
> void set_var_type_original ( unsigned int *C, unsigned char T )
> {
>     *C &= ~0xF0000000 ;
>     *C |= T << 28 ;
> }
>
> When I read "clearing out" I misunderstood you

Well, I think you misunderstood Terje...  ;-)

> > Since the value of C will be cleared, i.e., zeroed, except for the upper
> > 4-bits, wouldn't you prefer to use the lower 4-bits of C for T?  E.g.,
>
> This question maybe has no sense now, but the reason because the top 4
> bits of C are used is because C is usually an offset. In virtual
> machines, it is very common to have a base address to emulate a memory
> area, and many pointers pointing to that memory area. Usually this
> pointers are stored as an offset from the base area, so the upper 4
> bits are free (and you pray everyday for not to appear an user who
> creates a program big enough that would generate an offset so big that
> it would overlap the 4 top bits).
>

Well, that makes sense.  :-)

> you pray ... [for] an offset [limited to 28-bits]

The value of this offset can't exceed the size allocated for the memory
area.  To me, this implies your application has no method to limit the
allocation size of the memory area, or something other than your application
is allocating the memory area, or there needs to be size checks of the
offset(s) in use.

Of course, if you are using offsets from a base pointer instead of a pointer
directly, you increase computing overhead.  I.e., you or the compiler must
add the offset to the base pointer to get the actual pointer used.  Repeat
over and over again many times...


Rod Pemberton

0
Reply Rod 8/15/2008 11:25:01 PM

ditiem wrote:
> but it is wrong! there is an ~ missing (I expanded many #define). The
> solution should be:
> 
> void set_var_type_original ( unsigned int *C, unsigned char T )
> {
>     *C &= ~0xF0000000 ;
>     *C |= T << 28 ;
> }

OK, what you are doing here is a classic LISP/Scheme style combining 
variables/pointers with tags!

You should look very closely at the actual usage patterns of the various 
types, it is quite possible that it would be faster to reserve 1,2 or 3 
bits in the bottom, and then use an alternate mechanism to differentiate 
between any extra types.

I.e. since all pointers must be 32 or 64-bit aligned (depending upon 
your OS/platform/malloc library), you can specify that the bottom 2 or 3 
bits can be used as tags, and then simply subtract them out when 
accessing the pointed-to objects:

	mov ebx,[pvar]
	mov eax,[ebx-3]		;; pvar uses +3 as the tag

This has the added benefit that by enabling alignment trapping you can 
get zero-overhead type checking. :-)

Terje

-- 
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

0
Reply Terje 8/17/2008 1:34:09 PM

Hi,

On Aug 16, 5:20 am, ditiem <spamt...@crayne.org> wrote:
> The solution should be:
>
> void set_var_type_original ( unsigned int *C, unsigned char T )
> {
>     *C &= ~0xF0000000 ;
>     *C |= T << 28 ;
>
> }

Um, how about something like:

    mov eax,<C>
    mov edx,<T>

    shl eax,4
    shrd eax,edx,4

    mov <C>,eax


Cheers,

Brendan

0
Reply Brendan 8/17/2008 4:59:14 PM

On Aug 15, 9:43 am, "Wolfgang Kern"  <spamt...@crayne.org> wrote:

> Looking forward to see the difference ...
> (...)
> Good, would you care to share it ? ;)

Ok, I try to gather all the experimental results that have been
driving me to the crazyness during the last week of my life. I must
say that in my case the nibble-array has certain properties that add a
couple of instructions on the head of the assembly programs, but
assuming to have the base address and the index in the registers the
code is easily translatable into a general nibble-array access problem
presented at the beginning of the thread-discusion.

In my case the nibble-array has a fixed size, no longer no shorter
than N, where N is a power of 2. The base address of my nibble-array
is a pointer with the last N + 2 bits clear (set to 0). This is
neccesary because the access to the nibble-array is done from C in the
way:

unsigned int *base_nibble_addr = ... ;

get_nibble( base_nibble_addr + i ) ;

where i is the ith nibble.

This problem can be solved using byte (8bits) or double word (32bits)
access. A double word requires indexing inside the word, while in
byte-access mode we only need 1 bit to distinguish if it is the high
or the low nibble. Let's go into more detail studying how the
address is stored:


1.- For byte access:

   31 ... 8  7 6 5 4  3 2 1 0
    B ... B  B X X X  X N 0 0

   the bits B are the base address

   the bits X are the byte index

   the bit N is the nibble index, in this case for a size of 2^4 = 16
   bytes with 2 nibbles (1 bit in N) each, that is 32 nibbles in
   total.


2.- For double word access:

   31 ... 8  7 6 5 4  3 2 1 0
    B ... B  B X X N  N N 0 0

   in this case we have 4 words (X X bits) with 8 nibbles each (N N N
   bits). Notice that a different mapping can be done:

   31 ... 8  7 6 5 4  3 2 1 0
    B ... B  B N N N  X X 0 0

   This approach is interesting because it requires few instructions
   less (this will come with the next post about "SET", read below).

This pretends to be an alternative to the usual approach (as pointed
out in a previous thread by Terje) in many system that embed the
nibble in the data there are storing (tag-in-data) or in the pointer
to the data (tag-in-pointer). As an alternative, it must compete with
them, so that is the starting meassuring point to determine how "fast"
is our approach. Also it must considered that the problem requires to
SET and to GET the nibble. Latter case will be another post.

For the GET case, as said, the "best" case is:

inline
unsigned int get_var_type_best ( unsigned int *C )
{
    return (*C)&0xF0000000 ;
}


This is one approach that uses "if-then-else":

#define VARS_PER_REGION 32

#define INDEX2_MASK    ((VARS_PER_REGION << 2) - 1)       // MASK TO
FIND THE INDEX WORD


// To optimize
inline
unsigned int get_var_type ( unsigned int *C )
{
  unsigned int   Cindex = ((unsigned int)(C) &  INDEX_MASK) ;
  unsigned char *CBase  = (unsigned char*)((unsigned int)(C) &
~INDEX_MASK) ;

  Cindex >>= 2 ;

  CBase += Cindex >> 1 ;

  return (Cindex & 1)? (((*CBase) & 0x70)>>4) : ((*CBase) & 0x07) ;
}


This is another approach that avoid them:

inline
unsigned int get_var_type_2 ( unsigned int *C )
{
  unsigned int   Cindex = ((unsigned int)(C) &  INDEX_MASK) ;
  unsigned char *CBase  = (unsigned char*)((unsigned int)(C) &
~INDEX_MASK) ;

  return (CBase[ Cindex >> 3 ] >> (Cindex & 0x4)) & 0x7 ;
}

Previous approach optimized by gcc (-O3)

inline
unsigned int get_var_type_2_asm ( unsigned int *C )
{
    unsigned char ret ;

    asm ( "movl	  %%ecx, %%eax\n\t"
	  "movl	  %%ecx, %%edx\n\t"
	  "andl	  $127, %%eax\n\t"
	  "andl	  $-128, %%edx\n\t"
	  "shrl	  $3, %%eax\n\t"
	  "andl   $4, %%ecx\n\t"
	  "movzbl (%%edx,%%eax), %%eax\n\t"
	  "sarl	  %%cl, %%eax\n\t"
	  "andl	  $7, %%eax\n\t"
	  : "=a"(ret) : "c"(C) : "edx" ) ;

    return ret ;
}


This is wolfgang approach:

inline
unsigned char get_var_type_wg ( unsigned int *C )
{
    unsigned char ret ;

    asm ( "mov   %%eax, %%ebx\n\t"
	  "andl  $-128, %%ebx\n\t"
	  "andl  $127, %%eax\n\t"
	  "shrl  $3  , %%eax\n\t"
	  "movzbl (%%ebx,%%eax),%%eax\n\t"
	  "setc  %%cl      \n\t"
	  "shlb  $2, %%cl  \n\t"
	  "shrl  %%cl, %%eax\n\t"
	  "andl  $7, %%eax  \n\t"
	  : "=a"(ret) : "a"(C) : "ebx", "ecx" ) ;

    return ret ;
}

I called this wolfgang-32, because it tries to avoid byte instructions
(movb, shlb, etc.)

inline
unsigned char get_var_type_wg32 ( unsigned int *C )
{
    unsigned char ret ;

    asm ( "mov   %%eax, %%ebx\n\t"
	  "mov   %%eax, %%ecx\n\t"
	  "andl  $4, %%ecx\n\t"
	  "andl  $-128, %%ebx\n\t"
	  "andl  $127, %%eax\n\t"
	  "shrl  $3  , %%eax\n\t"
	  "movzbl (%%ebx,%%eax),%%eax\n\t"
//	  "setc  %%cl      \n\t"
//	  "shlb  $2, %%cl  \n\t"
	  "shrl  %%cl, %%eax\n\t"
	  "andl  $7, %%eax  \n\t"
	  : "=a"(ret) : "a"(C) : "ebx", "ecx" ) ;
    return ret ;
}


The main loop is really simple:

#define test()							\
    memset( arr, 0, sizeof( arr ) ) ;				\
    cnt = 0 ;							\
    begin = times( 0 ) ;					\
    for ( j = 0 ; j < 0x05FFFFF ; j ++, cnt ++ )		\
	for ( i = 130 ; i < 150 ; i ++, cnt ++ )		\
	{ CODE ;  }						\
    end = times( 0 ) ;						\
    printf( "%d\n\n", end - begin ) ;

    printf( "BEST CASE: " ) ;
#define CODE arr[ i ] = get_var_type_best( &arr[ i ] )
   test( )
#undef CODE

    printf( "what we try to optimize: " ) ;
#define CODE arr[ i ] = get_var_type( &arr[ i ] )
   test( )
#undef CODE

etc etc

It is not recomendable to ask GCC to optimize the loops (-O3) because
sometimes it do it too well and it decides to eliminate code that we
want to meassure. Notice that the time of the loop is included, this
should not be a problem because it should be similar in all the
test. Executing several times the program we will get to this
conclusion. The variable cnt is a watchdog to be sure that all the
iterations are executed.

Here are the results:

BEST CASE: 179
CNT 132120555

what we try to optimize: 527
CNT 132120555

what we try to optimize (AMS-OPT): 302
CNT 132120555

Try 2: 354
CNT 132120555

Try 2 (ASM-OPT): 276
CNT 132120555

Try 3 (Wolfgang 32 bits): 326
CNT 132120555

Try 4 (Wolfgang): 439
CNT 132120555


Some conclusions:

First of all, if someone can find any error in the code presented or
can argue about the conclusion I will eternally thank him/her, because
what it was obvious in my asm programming times it seems it is not
obvious anyway.

1.- Using instructions that works with bytes instead of double words
is slower, even is the manual says the opposite. (please, someone can
argue it??)

2.- Less number of cycles does not mean faster (I can imagine reason,
structure dependencies, data depenencies, pipeline...), but my
godness! can anyone point out a documment with an scheme of the
pipeline of a Pentium?

3.- This time compiler wins, it was able to find 2 versions (sorry
wolfgang) faster than the human. I think compiler knows more
information than the one I could get from the manual :P

I tried to find a different approach to my problem (not to locating a
nibble in the array, but attaching a nibble to a word). I used char
instead of nibble (just for curiousity, although I know it takes more
space). I found this function (only 2 instructions!!), using double of
items, 64 instead of 32 and double space, char instead of nibbles:

inline
unsigned char get_var_type_fchar (unsigned int *C)
{
    unsigned char ret ;

    __asm__ ( "shrb    $2, %%al\n\t"
	      "movzbl (%%eax), %%eax\n\t"
	      : "=a"(ret) : "a" (C) ) ;

    return ret ;
}

And this other version of the same idea:

inline
unsigned char get_var_type_fchar2 (unsigned int *C)
{
    unsigned char ret ;

    __asm__ ( "movzbl %%al, %%ecx\n\t"
	      "andl   $-256,%%eax\n\t"
	      "shrl   $2, %%ecx\n\t"
	      "orl    %%eax,%%ecx\n\t"
	      "movzbl (%%ecx), %%eax\n\t"
	      : "=a"(ret) : "a" (C) : "ecx" ) ;

    return ret ;
}

Want to bet which one is faster?

Try 5 (fchar): 422
CNT 132120555

Try 5.1 (fchar2): 243
CNT 132120555

This clearly makes me believe that there is a data-dependency in the
pipeline and I dont know what dark force on the earth make Pentium
delays so much. I have been hitting my head against the wall and
repeating once after another "gcc is my friend, gcc is my friend", so
if anyone is able to give me a good reason about why the timmings are
so controversial I would appreciate it a lot.

David

0
Reply ditiem 8/18/2008 7:23:57 PM

On Aug 16, 1:25 am, "Rod Pemberton"  <spamt...@crayne.org> wrote:
>
> > you pray ... [for] an offset [limited to 28-bits]
>
> The value of this offset can't exceed the size allocated for the memory
> area.  To me, this implies your application has no method to limit the
> allocation size of the memory area, or something other than your application
> is allocating the memory area, or there needs to be size checks of the
> offset(s) in use.

There are methods, but you want the invention to run fast ;)

0
Reply ditiem 8/18/2008 7:26:30 PM

ditiem schrieb:
> This problem can be solved using byte (8bits) or double word (32bits)
> access. A double word requires indexing inside the word, while in
> byte-access mode we only need 1 bit to distinguish if it is the high
> or the low nibble. 

> Try 3 (Wolfgang 32 bits): 326
> CNT 132120555
> 
> Try 4 (Wolfgang): 439
> CNT 132120555

> 1.- Using instructions that works with bytes instead of double words
> is slower, even is the manual says the opposite. (please, someone can
> argue it??)

No, byte instructions should not be slower than i32 ones.
But converting data between both has a penalty. Writing a
partial register (AL, AH, AX) and then reading the full
register (EAX/RAX) causes a "partial register stall".
For memory accesses with different sizes to overlapping
locations its even worse.
> 2.- Less number of cycles does not mean faster (I can imagine reason,
> structure dependencies, data depenencies, pipeline...), but my
> godness! can anyone point out a documment with an scheme of the
> pipeline of a Pentium?

Forget it, it's too complex and changes with each CPU model.

> This clearly makes me believe that there is a data-dependency in the
> pipeline and I dont know what dark force on the earth make Pentium
> delays so much.

Go read the "Intel 64 and IA-32 Architectures Optimization Reference Manual".


Hendrik vdH

0
Reply Hendrik 8/19/2008 6:16:57 AM

On 18 Aug, 20:23, ditiem <spamt...@crayne.org> wrote:
....
> This clearly makes me believe that there is a data-dependency in the
> pipeline and I dont know what dark force on the earth make Pentium
> delays so much. I have been hitting my head against the wall and
> repeating once after another "gcc is my friend, gcc is my friend", so
> if anyone is able to give me a good reason about why the timmings are
> so controversial I would appreciate it a lot.

I think you would find "Inner Loops" immensely helpful to make sense
of the base architecture. It is a book written by Rick Booth which
shows why and how Pentium and Pentium Pro code can be optimised in
assembler and C. Although these chips are old now and the book is
sometimes seen as out of date I've not seen anything better and the
basic lessons should still be a help. It is also very readable. You
may be able to get a copy second hand from Amazon. In fact, before
buying check out Amazon for various customer opinions. They tend to be
very positive.

0
Reply James 8/19/2008 2:01:18 PM

David "ditiem" wrote:
....
> In my case the nibble-array has a fixed size, no longer no shorter
> than N, where N is a power of 2. The base address of my nibble-array
> is a pointer with the last N + 2 bits clear (set to 0). This is
> neccesary because the access to the nibble-array is done from C in the
> way:

> unsigned int *base_nibble_addr = ... ;
> get_nibble( base_nibble_addr + i ) ;
> where i is the ith nibble.

I see, your nibbles actually become dwords just to fit C.
Many thanks to all Gods in LL (LordLight LordLogic and LowLevel)
for saving me and my code from this bloating and detouring needs.

> This problem can be solved using byte (8bits) or double word (32bits)
> access. A double word requires indexing inside the word, while in
> byte-access mode we only need 1 bit to distinguish if it is the high
> or the low nibble.

Dword access may fit well with C, but my solution is made up to access
any nibble in data fields up to 2GB with a dword index, and I may have
variable types of any nibble based size, even odd and only useful for
not otherwise defined input limits.

> Let's go into more detail studying how the address is stored:
> 1.- For byte access:

>    31 ... 8  7 6 5 4  3 2 1 0
>     B ... B  B X X X  X N 0 0
>    the bits B are the base address

>    the bits X are the byte index

>    the bit N is the nibble index, in this case for a size of 2^4 = 16
>    bytes with 2 nibbles (1 bit in N) each, that is 32 nibbles in
>    total.

You sure got a reason for using a 128 byte bound base with a 5 bit index
and access only the first 32 nibbles (16 byte) within this data struct.
I guess this is due to the limited varable definition in C ?.

I'd see the nibble bit and the byte index as one piece:
"the nibble index", while the byte index is always half of it.

> 2.)...
> This pretends to be an alternative to the usual approach (as pointed
> out in a previous thread by Terje) in many system that embed the
> nibble in the data there are storing (tag-in-data) or in the pointer
> to the data (tag-in-pointer). As an alternative, it must compete with
> them, so that is the starting meassuring point to determine how "fast"
> is our approach. Also it must considered that the problem requires to
> SET and to GET the nibble. Latter case will be another post.

Systems are quite different...
ie:
My system got only 32 numeric variable types where 24 are user definable.
The possible range is from signed nibble to unsigned 512 bit extended
by an up to dword exponent. So the type is a 5 bit field which is used
as an index to the definition struct, and here is just a dword:
00  memory image size (max. bytes)
01  mantissa size (nibbles)
02  exponent size (nibbles)
03  decimal image size (max. digits)

The varables itself doesn't contain any info about the stored format,
it's the callers job to define type, output format and precision/input-
limits by preceeding the var-pointer by a set of flags.
So my 'variables' are always 8 byte: 00 xx xx xx yyyyyyyy , and this
allows to display the same stored value in different interpretations.


> For the GET case, as said, the "best" case is:
> inline
> unsigned int get_var_type_best ( unsigned int *C )
> {
>     return (*C)&0xF0000000 ;
> }

mmh..., 179 cycles for 20 iterations of this ??
______________________________________________
Wouldn't this assemble to(NASM/FASM-syntax) ? :
getvar:
 push ebp             ;push is a memory access
 mov ebp,esp
 mov eax,[ebp+8]      ;this too
 and eax,0xF0000000   ;not sure if this is what you wanted
 pop ebp              ;this is a memory access as well
ret 4                 ;and finally this too

and be called by:

mov eax,[varptr]      ;this memory access is the only needed
push eax              ;again to memory
call getvar           ;push the return address to memory...
....

this is what I usually find in C-created code.
__________________

I don't know GCC nor C or C+-, but I've seen ASM-Inline
without PROC-styled detours.
Can't tell how to make GCC work for this:

:ASM                  ;??
mov eax,[varptr]      ;or GAS form if required
and eax,0xf0000000    ;I think you wanted SHR eax,28 instead ?
:endASM               ;??

This should perform in max.3 cycles per iteration plus cache burst
reads. So if I assume your data are aligned then the worst case for
20 iterations is ~100 cycles as theoretical maximum, but when your
loop calls this then add 5 cycles for every CALL/RET which gives a
max. of ~160 cycles, almost close to 179 due to the timer-time  :)

.....
> This is wolfgang approach:

Looks somehow different to what I posted...

> inline
> unsigned char get_var_type_wg ( unsigned int *C )
> {
>     unsigned char ret ;
>
>     asm ( "mov   %%eax, %%ebx\n\t"
>   "andl  $-128, %%ebx\n\t"
>   "andl  $127, %%eax\n\t"
>   "shrl  $3  , %%eax\n\t"
>   "movzbl (%%ebx,%%eax),%%eax\n\t"
**   ^^^^^^  what's this ?  MOVZX byte eax,BL or AL  ??
**       AND ebx,0xFFFFFF80
**       AND eax,0x7F
**       SHR eax,3     ;DIV 8
**    wouldn't it help on speed if you think over the used structure ?

>   "setc  %%cl      \n\t"
**       Ok, the carry comes from eax bit2 yet

>   "shlb  $2, %%cl  \n\t"
>   "shrl  %%cl, %%eax\n\t"
>   "andl  $7, %%eax  \n\t"
>   : "=a"(ret) : "a"(C) : "ebx", "ecx" ) ;
>
>     return ret ;
> }
____________________
> I called this wolfgang-32, because it tries to avoid byte instructions
> (movb, shlb, etc.)
> inline
> unsigned char get_var_type_wg32 ( unsigned int *C )
> {
>     unsigned char ret ;
>
>     asm ( "mov   %%eax, %%ebx\n\t"
>   "mov   %%eax, %%ecx\n\t"
>   "andl  $4, %%ecx\n\t"
>   "andl  $-128, %%ebx\n\t"
>   "andl  $127, %%eax\n\t"
>   "shrl  $3  , %%eax\n\t"
>   "movzbl (%%ebx,%%eax),%%eax\n\t"
> //   "setc  %%cl      \n\t"
> //   "shlb  $2, %%cl  \n\t"
>   "shrl  %%cl, %%eax\n\t"
>   "andl  $7, %%eax  \n\t"
>   : "=a"(ret) : "a"(C) : "ebx", "ecx" ) ;
>     return ret ;
> }

I'm not sure if this will at all work as I had in mind.
____________
....
> Here are the results:
>
> BEST CASE: 179
> CNT 132120555
>
> what we try to optimize: 527
> CNT 132120555
>
> what we try to optimize (AMS-OPT): 302
> CNT 132120555
>
> Try 2: 354
> CNT 132120555
>
> Try 2 (ASM-OPT): 276
> CNT 132120555
>
> Try 3 (Wolfgang 32 bits): 326
> CNT 132120555
>
> Try 4 (Wolfgang): 439
> CNT 132120555


Interesting :)
Yes, the unoptimised loop code I posted performs in 240..300 cycles
for 20 iterations on my machine because it accesses memory too often.
But the (commented) single access version takes only 6..8 cycles
(after caching).

> Some conclusions:
>
> First of all, if someone can find any error in the code presented or
> can argue about the conclusion I will eternally thank him/her, because
> what it was obvious in my asm programming times it seems it is not
> obvious anyway.
>
> 1.- Using instructions that works with bytes instead of double words
> is slower, even is the manual says the opposite. (please, someone can
> argue it??)

Hendrik answered this already.

> 2.- Less number of cycles does not mean faster (I can imagine reason,
> structure dependencies, data depenencies, pipeline...), but my
> godness! can anyone point out a documment with an scheme of the
> pipeline of a Pentium?

It's not the CPU's nor the programmers fault... you know it's C  :)

> 3.- This time compiler wins, it was able to find 2 versions (sorry
> wolfgang) faster than the human. I think compiler knows more
> information than the one I could get from the manual :P

Oh yes, when the target is C-code then the compiler shall win  ;)

.....
> This clearly makes me believe that there is a data-dependency in the
> pipeline and I dont know what dark force on the earth make Pentium
> delays so much. I have been hitting my head against the wall and
> repeating once after another "gcc is my friend, gcc is my friend", so
> if anyone is able to give me a good reason about why the timmings are
> so controversial I would appreciate it a lot.

Have you ever tried to disassemble the final code ?
the smelling fish may be found there...
__
welfgang

0
Reply Wolfgang 8/19/2008 2:18:28 PM

22 Replies
127 Views

(page loaded in 0.224 seconds)


Reply: