Any more graceful way to saturate adds?

  • Follow


I am working in 16-bit 808x assembler (yes, on an old 808x CPU) and
adding two 8-bit numbers to each other and saturating the result.
Here is the best I could come up with:

; AL and DL have unsigned 8-bit values to add
; DL is actually extended into DX (ie. 00xxh)
   cbw            ; Make AL a signed word
   add   dx,ax    ; Do word sized add, into DX
   jns   @@HIGH   ; jump if result not signed/negative
   xor   dx,dx    ; Underflowed; make 0
@@HIGH:
   or   dh,dh     ; Did it overflow?
   jz   @@OK      ; if not, we're ok
   mov   dx,00FFh ; Overflowed; force our saturation value (FFh)
@@OK:

This is part of an ADPCM inner loop.

Is this about as fast as I can expect?  Or am I missing an obvious
technique?  (I am not an 808x assembler expert)

0
Reply Jim 4/10/2008 9:59:55 PM

In article <e9aa9511-4087-485f-822d-6dad776bce9a@c65g2000hsa.googlegroups.com>,
	Jim Leonard  <spamtrap@crayne.org> writes:
> I am working in 16-bit 808x assembler (yes, on an old 808x CPU) and
> adding two 8-bit numbers to each other and saturating the result.
> Here is the best I could come up with:
> 
> ; AL and DL have unsigned 8-bit values to add

And the result should be unsigned also I suppose. In that case you
only need to test for overflow. Unsigned overflow sets the carry
flag. Extend the carry flag to all ones (FFh) for overflow and
zero for the normal case. Then it goes something like this:

     add  dl,al     ; do byte size add
     sbb  dh,dh     ; carry -> FFh no carry -> 0
     or   dl,dh     ; saturate
;    xor  dh.dh     ; Want word size result???

0
Reply Dick 4/11/2008 2:32:55 AM


Jim Leonard wrote:
> I am working in 16-bit 808x assembler (yes, on an old 808x CPU) and
> adding two 8-bit numbers to each other and saturating the result.
> Here is the best I could come up with:
> 
> ; AL and DL have unsigned 8-bit values to add
> ; DL is actually extended into DX (ie. 00xxh)
>    cbw            ; Make AL a signed word
>    add   dx,ax    ; Do word sized add, into DX
>    jns   @@HIGH   ; jump if result not signed/negative
>    xor   dx,dx    ; Underflowed; make 0
> @@HIGH:
>    or   dh,dh     ; Did it overflow?
>    jz   @@OK      ; if not, we're ok
>    mov   dx,00FFh ; Overflowed; force our saturation value (FFh)
> @@OK:
> 
> This is part of an ADPCM inner loop.
> 
> Is this about as fast as I can expect?  Or am I missing an obvious
> technique?  (I am not an 808x assembler expert)

What's wrong with the obvious approach?

  add dl,al	; Carry set if it overflows!
  sbb al,al	; AL = 0xff if overflow
  or dl,al	; Turns any overflow into 0xff

OK?

Terje
-- 
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

0
Reply Terje 4/11/2008 7:41:42 AM

On Apr 11, 2:41 am, Terje Mathisen  <spamt...@crayne.org> wrote:
> Jim Leonard wrote:
> > I am working in 16-bit 808x assembler (yes, on an old 808x CPU) and
> > adding two 8-bit numbers to each other and saturating the result.
> > Here is the best I could come up with:
>
> > ; AL and DL have unsigned 8-bit values to add
> > ; DL is actually extended into DX (ie. 00xxh)
> >    cbw            ; Make AL a signed word
> >    add   dx,ax    ; Do word sized add, into DX
> >    jns   @@HIGH   ; jump if result not signed/negative
> >    xor   dx,dx    ; Underflowed; make 0
> > @@HIGH:
> >    or   dh,dh     ; Did it overflow?
> >    jz   @@OK      ; if not, we're ok
> >    mov   dx,00FFh ; Overflowed; force our saturation value (FFh)
> > @@OK:
>
> > This is part of an ADPCM inner loop.
>
> > Is this about as fast as I can expect?  Or am I missing an obvious
> > technique?  (I am not an 808x assembler expert)
>
> What's wrong with the obvious approach?
>
>   add dl,al     ; Carry set if it overflows!
>   sbb al,al     ; AL = 0xff if overflow
>   or dl,al      ; Turns any overflow into 0xff
>
> OK?

Yes -- I was missing the forest for the trees, again.  Thanks! (and
thanks to Mr. Wesseling as well).

0
Reply Jim 4/11/2008 6:05:19 PM

I wouldn't do it that way.
I would add the two supposedly unsigned 8-bit integers and test the 9
th bit for 1, which if present, the addition has overflowed.

Note that any non-overflow result of 255 is for your purposes
indistinguishable from an overflow,
So you just add, test 9th bit, jump around or load or OR 255h.

0
Reply Terence 4/11/2008 11:40:10 PM

On Apr 12, 9:40�am, Terence  <spamt...@crayne.org> wrote:
> I wouldn't do it that way.
> I would add the two supposedly unsigned 8-bit integers and test the 9
> th bit for 1, which if present, the addition has overflowed.
>
> Note that any non-overflow result of 255 is for your purposes
> indistinguishable from an overflow,
> So you just add, test 9th bit, jump around or load or OR 255h.

Make that last "255h" into 255 or else 0FFh! (Sigh!)

0
Reply Terence 4/12/2008 6:28:22 AM

Terence wrote:

>> I wouldn't do it that way.
>> I would add the two supposedly unsigned 8-bit integers and test the 9
>> th bit for 1, which if present, the addition has overflowed.

>> Note that any non-overflow result of 255 is for your purposes
>> indistinguishable from an overflow,
>> So you just add, test 9th bit, jump around or load or OR 255h.

> Make that last "255h" into 255 or else 0FFh! (Sigh!)

ok. :)

First I thought: 'what's wrong by using the Carry-flag ? ',
but now I may see the flaw in the whole story.
The result can turn out to be FF(h) without an overflow,
so using FF(h) as overflow indication may lead to errors.

;just use carry-flag as indication:
ADD DL,AL
JC error

;or expand it to a true result
MOV DH,0    ;
ADD DL,AL   ;also possible: MOVZX DX,AL
ADC DH,0    ;DH NZ meens overflow but DX holds true value

;or use just bit0 of DH for overflow (a bit shorter/faster):
ADD DL,AL
SETC DH     ;DH NZ means overflow

;or if DL should just be clamped to FF(h):
ADD DL,AL
SETC DH      ;becomes 0 or 1
NEG DH       ;becomes 0 or FF
OR DL,DH     ;DL keeps value or becomes FF(h)

SETcc can avoid branches and gain speed by this,
but it will not work on older CPUs.
(my example suffers from partial register stalls and
dependencies, so it should by used within other pairing
instructions for full speed),

__
wolfgang


0
Reply Wolfgang 4/12/2008 10:51:54 AM

Terence <spamtrap@crayne.org> wrote in part:
> On Apr 12, 9:40�am, Terence  <spamt...@crayne.org> wrote:
>> I wouldn't do it that way.  I would add the two supposedly
>> unsigned 8-bit integers and test the 9 th bit for 1,
>> which if present, the addition has overflowed.

CF is also set with two byte adds.

>> Note that any non-overflow result of 255 is for your
>> purposes indistinguishable from an overflow, So you just
>> add, test 9th bit, jump around or load or OR 255h.
> Make that last "255h" into 255 or else 0FFh! (Sigh!)
 
Well, the OP did specify an old 808x which gives a more
complex answer:

Terje's clever SBB/OR  is fastest on modern CPUs by a 
great deal since there are no poorly predictable branches.

Your JC saves one instruction (two byte-transfers),
but only on the non-saturate case.  For the saturate case,
it costs two more instr ( MOV/JMP ) [4 bytes].

Optimum will depend on the data.

-- Robert

0
Reply Robert 4/12/2008 7:29:18 PM

On 11 Apr, 08:41, Terje Mathisen  <spamt...@crayne.org> wrote:
> What's wrong with the obvious approach?
>
> � add al,dl � � ; Carry set if it overflows!
> � sbb al,al � � ; AL = 0xff if overflow
> � or dl,al � � �; Turns any overflow into 0xff

Very neat.  Now, what about the signed case (saturating at +127 or
-128)?  The best I've managed is:

    add al,dl     ; Signed addition
    jno ok        ; Jump if no overflow
    sbb al,al     ; al is 0x00 or 0xff
    xor al,127    ; Saturate to -128 or +127
ok:

Richard.
http://www.rtrussell.co.uk/
To reply by email change 'news' to my forename.

0
Reply Richard 4/12/2008 8:52:03 PM

On Apr 13, 5:29�am, Robert Redelmeier <red...@ev1.net.invalid> wrote:

> CF is also set with two byte adds.

But this never occurs with adding one byte to one bye in two-byte
registers (with zero bits to the left). The objective is to count to
no more than 254 since 255 itself is the overflow signal.

0
Reply Terence 4/13/2008 12:03:07 AM

"Terence" <spamtrap@crayne.org> wrote in message
news:55e84590-158c-4042-b75c-1bf284404b9d@p25g2000pri.googlegroups.com...
> On Apr 13, 5:29 am, Robert Redelmeier <red...@ev1.net.invalid> wrote:
>
> > CF is also set with two byte adds.
>
> But this never occurs with adding one byte to one bye in two-byte
> registers (with zero bits to the left). The objective is to count to
> no more than 254 since 255 itself is the overflow signal.
>

Do you like these two changes to "Terje's solution"?

  inc al
  adc dl,al ; Carry set if it overflows!
  sbb al,al ; AL = 0xff if overflow
  or dl,al ; Turns any overflow into 0xff


Rod Pemberton

0
Reply Rod 4/13/2008 3:06:56 AM

"Terence" <spamtrap@crayne.org> wrote in message
news:55e84590-158c-4042-b75c-1bf284404b9d@p25g2000pri.googlegroups.com...
> On Apr 13, 5:29 am, Robert Redelmeier <red...@ev1.net.invalid> wrote:
>
> > CF is also set with two byte adds.
>
> But this never occurs with adding one byte to one bye in two-byte
> registers (with zero bits to the left). The objective is to count to
> no more than 254 since 255 itself is the overflow signal.
>

Oh, you didn't like the prior two changes? (inc doesn't update carry...)

Do you like these two changes (instead) to "Terje's solution"?

  add al, 1
  adc dl,al ; Carry set if it overflows!
  sbb al,al ; AL = 0xff if overflow
  or dl,al ; Turns any overflow into 0xff


Rod Pemberton

0
Reply Rod 4/13/2008 3:10:57 AM

Terence wrote:
> On Apr 13, 5:29 am, Robert Redelmeier <red...@ev1.net.invalid> wrote:
> 
>> CF is also set with two byte adds.
> 
> But this never occurs with adding one byte to one bye in two-byte
> registers (with zero bits to the left). The objective is to count to
> no more than 254 since 255 itself is the overflow signal.

That's a red herring:

It really doesn't matter if you say:

   if ((sum = a+b) > 254) sum = 255;

or if you write:

   if ((sum = a+b) > 255) sum = 255;

Please notice that as long as the sum is above 254, saturation will 
always return 255, and at that point there is no way to determine if 
this was the result of saturation after overflow, or simply because the 
sum actually was 255!

OK?

Terje

-- 
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

0
Reply Terje 4/13/2008 1:00:06 PM

Richard Russell wrote:
> On 11 Apr, 08:41, Terje Mathisen  <spamt...@crayne.org> wrote:
>> What's wrong with the obvious approach?
>>
>>   add al,dl     ; Carry set if it overflows!
>>   sbb al,al     ; AL = 0xff if overflow
>>   or dl,al      ; Turns any overflow into 0xff
> 
> Very neat.  Now, what about the signed case (saturating at +127 or
> -128)?  The best I've managed is:
> 
>     add al,dl     ; Signed addition
>     jno ok        ; Jump if no overflow
>     sbb al,al     ; al is 0x00 or 0xff
>     xor al,127    ; Saturate to -128 or +127
> ok:

That is pretty neat code, I like it!

The generic way to handle arbitrary clamping values is something like this:

;;  if ( x < MIN) x = MIN;
;;  else if (x > MAX) x = MAX;

   cmp eax,MIN		; Carry set if (eax < MIN)
   sbb ebx,ebx		; EBX = -1 if underflow
   cmp eax,MAX+1		; Carry clear if (eax > MAX)
   adc ebx,ebx		; Merge carry with previous result

At this point there are 4 possible combinations:

	  second 0  second 1
first  0    OVFL      OK
first  1 impossible  UNDFL

These four corresponds to EBX = 0, 1, -2, -1 which means that it is 
possible to store the original value into a four-element table, then use 
EBX to retrieve either the same value back or the clamped results:

   cmp eax,MIN		; Carry set if (eax < MIN)
   sbb ebx,ebx		; EBX = -1 if underflow
   cmp eax,MAX+1		; Carry clear if (eax > MAX)
   adc ebx,ebx		; Merge carry with previous result
   mov clamp_table[4], eax
   mov eax,clamp_table[ebx*4+8]

This code will only be faster than the naive approach if clamped values 
are both common and non-predictable, if this isn't true then it is much 
better to simply use the generic C-style if/else if/else approach!

Terje
-- 
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

0
Reply Terje 4/13/2008 1:31:59 PM

"Terje Mathisen" <spamtrap@crayne.org> wrote in message
news:rrSdnYmBeZ1Knp_VnZ2dneKdnZydnZ2d@giganews.com...
> Terence wrote:
> > On Apr 13, 5:29 am, Robert Redelmeier <red...@ev1.net.invalid> wrote:
> >
> >> CF is also set with two byte adds.
> >
> > But this never occurs with adding one byte to one bye in two-byte
> > registers (with zero bits to the left). The objective is to count to
> > no more than 254 since 255 itself is the overflow signal.
>
> That's a red herring:
>
> It really doesn't matter if you say:
>
>    if ((sum = a+b) > 254) sum = 255;
>
> or if you write:
>
>    if ((sum = a+b) > 255) sum = 255;
>
> Please notice that as long as the sum is above 254, saturation will
> always return 255, and at that point there is no way to determine if
> this was the result of saturation after overflow, or simply because the
> sum actually was 255!
>
> OK?
>

Let's plug some values into the code you posted:

>  add dl,al ; Carry set if it overflows!
>  sbb al,al ; AL = 0xff if overflow
>  or dl,al ; Turns any overflow into 0xff

if (al+dl) = 0, result al=0, dl=0
if 1< (al+dl) < 254, result al=0, dl=dl<>0
if (al+dl) = 255, result al=0, dl=dl=0xff
if 255 < (al+dl), result al=0xff, dl=0xff

dl is 0xff at sum > 254, but al is 0xff at sum > 255 ...

> Please notice that as long as the sum is above 254, saturation will
> always return 255,

False, if one uses al.  True, if one uses dl.

> and at that point there is no way to determine if
> this was the result of saturation after overflow, or simply because the
> sum actually was 255!

False, if one uses al.  True, if one uses dl.

If Jim needs to distinguish 255 from overflow as Terence noted, perhaps this
is sufficient:

  add dl,al ; Carry set if it overflows!
  sbb dl,dl ; DL = 0xff if overflow

if (al+dl) = 0, result dl=0, al=al
if 1< (al+dl) < 254, result dl=0, al=al
if (al+dl) = 255, result dl=0, al=al
if 255 < (al+dl), result dl=0xff, al=al


Rod Pemberton

0
Reply Rod 4/13/2008 3:27:29 PM

On Apr 13, 8:27 am, "Rod Pemberton"  <spamt...@crayne.org> wrote:
> "Terje Mathisen" <spamt...@crayne.org> wrote in message
....

While we're at it, how about complementing this with a similar
solution for signed saturation (-128 to +127) of signed bytes for the
fun of it? :)

Alex

0
Reply Alexei 4/14/2008 12:19:25 AM

"Alexei A. Frounze" <spamtrap@crayne.org> wrote in message
news:58afc87d-dcb4-4688-97d6-be2c044c8828@s13g2000prd.googlegroups.com...
> On Apr 13, 8:27 am, "Rod Pemberton"  <spamt...@crayne.org> wrote:
> > "Terje Mathisen" <spamt...@crayne.org> wrote in message
> ...
>
> While we're at it, how about complementing this with a similar
> solution for signed saturation (-128 to +127) of signed bytes for the
> fun of it? :)
>

???  How's your server?

http://groups.google.com/group/comp.lang.asm.x86/msg/36af00ed6b8201ae
http://groups.google.com/group/comp.lang.asm.x86/msg/b994b1ae492f9833

RP

0
Reply Rod 4/14/2008 6:41:38 AM

On Apr 13, 11:41 pm, "Rod Pemberton"  <spamt...@crayne.org> wrote:
> "Alexei A. Frounze" <spamt...@crayne.org> wrote in messagenews:58afc87d-dcb4-4688-97d6-be2c044c8828@s13g2000prd.googlegroups.com...
>
> > On Apr 13, 8:27 am, "Rod Pemberton"  <spamt...@crayne.org> wrote:
> > > "Terje Mathisen" <spamt...@crayne.org> wrote in message
> > ...
>
> > While we're at it, how about complementing this with a similar
> > solution for signed saturation (-128 to +127) of signed bytes for the
> > fun of it? :)
>
> ???  How's your server?
>
> http://groups.google.com/group/comp.lang.asm.x86/msg/36af00ed6b8201aehttp://groups.google.com/group/comp.lang.asm.x86/msg/b994b1ae492f9833
>
> RP

Missed the posts. Still I was wondering if someone came up with a
compact method not involving jumps and LUTs.

Alex

0
Reply Alexei 4/14/2008 10:39:12 AM

On Apr 13, 10:27 am, "Rod Pemberton"  <spamt...@crayne.org> wrote:
> If Jim needs to distinguish 255 from overflow as Terence noted, perhaps this
> is sufficient:

I do not need to distinguish 255 from overflow.

The quick background for this is that it is the clamping function for
adding a signed 8-bit integer to an unsigned 8-bit integer.  The
function is an adapted ADPCM routine; whereas most ADPCM works on
signed 16-bit samples, I have a need for it to work on unsigned 8-bit
samples.
The inner loop of the routine has a previous unsigned sample, and a
new signed sample, and needs to add them together to get the new
sample.  The result needs to clamp at both 0 and 255.

With this in mind, what is the best solution, assuming the previous
unsigned sample is in DL and the new signed sample is in AL?

0
Reply Jim 4/14/2008 1:58:12 PM

In article  <7f1061af-6435-47c0-b2e3-38b64cb1f044@a22g2000hsc.googlegroups.com>
           spamtrap@crayne.org "Jim Leonard " writes:

> On Apr 13, 10:27 am, "Rod Pemberton"  <spamt...@crayne.org> wrote:
> > If Jim needs to distinguish 255 from overflow as Terence noted, perhaps this
> > is sufficient:
> 
> I do not need to distinguish 255 from overflow.
> 
> The quick background for this is that it is the clamping function for
> adding a signed 8-bit integer to an unsigned 8-bit integer.  The
> function is an adapted ADPCM routine; whereas most ADPCM works on
> signed 16-bit samples, I have a need for it to work on unsigned 8-bit
> samples.
> The inner loop of the routine has a previous unsigned sample, and a
> new signed sample, and needs to add them together to get the new
> sample.  The result needs to clamp at both 0 and 255.
> 
> With this in mind, what is the best solution, assuming the previous
> unsigned sample is in DL and the new signed sample is in AL?

Just to be absolutely clear about the problem (and without any 
attempt at optimising the code), can you confirm that the below 
is what you're after?

(unsigned sample in DL, signed "update" in AL)

        sub  dh, dh             ;DH = 0
        cbw                     ;sign-extend AL to AX
        add  dx, ax             ;signed addition
Now:
        if DH == 0, nothing to do: ans in DL
        if DH == FF, -ve overflow so clamp to 0: set DL = 0
        if DH == 1, +ve overflow so clamp to FF: set DL = FF

Is that correct?

Pete
-- 
   "We have not inherited the earth from our ancestors,
    we have borrowed it from our descendants."

0
Reply pete 4/15/2008 5:11:19 AM

On Apr 15, 12:11 am, pete <spamt...@crayne.org> wrote:
> Just to be absolutely clear about the problem (and without any
> attempt at optimizing the code), can you confirm that the below
> is what you're after?
>
> (unsigned sample in DL, signed "update" in AL)
>
>         sub  dh, dh             ;DH = 0
>         cbw                     ;sign-extend AL to AX
>         add  dx, ax             ;signed addition
> Now:
>         if DH == 0, nothing to do: ans in DL
>         if DH == FF, -ve overflow so clamp to 0: set DL = 0
>         if DH == 1, +ve overflow so clamp to FF: set DL = FF
>
> Is that correct?

Yes, that's exactly correct.  My beginner's attempt at applying the
clamping has been done with JMPs, which are really costly in an inner
loop (especially on my lowly 808x) and I feel that I'm missing
something obvious.

Was Terje's solution:

  add dl,al     ; Carry set if it overflows!
  sbb al,al     ; AL = 0xff if overflow
  or dl,al      ; Turns any overflow into 0xff

....the right one?  I see overflow handling but not underflow...

0
Reply Jim 4/15/2008 2:30:34 PM

In article  <9c1c6a51-9572-4ffb-9546-b94987c1720f@l42g2000hsc.googlegroups.com>
           spamtrap@crayne.org "Jim Leonard " writes:

> On Apr 15, 12:11 am, pete <spamt...@crayne.org> wrote:
> > Just to be absolutely clear about the problem (and without any
> > attempt at optimizing the code), can you confirm that the below
> > is what you're after?
> >
> > (unsigned sample in DL, signed "update" in AL)
> >
> >         sub  dh, dh             ;DH = 0
> >         cbw                     ;sign-extend AL to AX
> >         add  dx, ax             ;signed addition
> > Now:
> >         if DH == 0, nothing to do: ans in DL
> >         if DH == FF, -ve overflow so clamp to 0: set DL = 0
> >         if DH == 1, +ve overflow so clamp to FF: set DL = FF
> >
> > Is that correct?
> 
> Yes, that's exactly correct.  My beginner's attempt at applying the
> clamping has been done with JMPs, which are really costly in an inner
> loop (especially on my lowly 808x) and I feel that I'm missing
> something obvious.
> 
> Was Terje's solution:
> 
>   add dl,al     ; Carry set if it overflows!
>   sbb al,al     ; AL = 0xff if overflow
>   or dl,al      ; Turns any overflow into 0xff
> 
> ...the right one?  I see overflow handling but not underflow...

Well, I'd guess that Terje has more expertise in his little 
finger that I in my entire body :-) but below is the best I can 
come up with -- still a jmp in there though...

; unsigned sample in DL, signed "update" in AL
        sub     dh, dh
	cbw
	add	dx, ax
	or	dh, dh
	jz	done
; either DH == FF (-ve overflow) or DH == 1 (+ve overflow)
        sar     dh, 1   ;now DH is 0 or FF, but the wrong way round
        not     dh      ;invert it
	mov	dl, dh
  done:

Perhaps (certainly!) someone can improve on that?

Pete
-- 
   "We have not inherited the earth from our ancestors,
    we have borrowed it from our descendants."

0
Reply pete 4/17/2008 5:05:43 PM

pete wrote:
> In article  <9c1c6a51-9572-4ffb-9546-b94987c1720f@l42g2000hsc.googlegroups.com>
>> Was Terje's solution:
>>
>>   add dl,al     ; Carry set if it overflows!
>>   sbb al,al     ; AL = 0xff if overflow
>>   or dl,al      ; Turns any overflow into 0xff
>>
>> ...the right one?  I see overflow handling but not underflow...

That didn't even try to handle a unsigned/signed mix.
> 
> Well, I'd guess that Terje has more expertise in his little 

This only means I have more scars and have made more mistakes than most 
of the rest of you.

> finger that I in my entire body :-) but below is the best I can 
> come up with -- still a jmp in there though...
> 
> ; unsigned sample in DL, signed "update" in AL
>         sub     dh, dh
> 	cbw
> 	add	dx, ax
> 	or	dh, dh
> 	jz	done
> ; either DH == FF (-ve overflow) or DH == 1 (+ve overflow)
>         sar     dh, 1   ;now DH is 0 or FF, but the wrong way round
>         not     dh      ;invert it
> 	mov	dl, dh
>   done:

What about converting the problem (sort of) to an unsigned case?

; The two first instructions can be skipped if the high
; halfs are known to be zero
   and dx,255
   and ax,255
; Move the signed update into the [0-255] range:
   add ax,128

   add ax,dx	; Result must be in [0..510] range, of which
                 ; 128..383 is in range

; The trivial way to clamp this is to adjust & branch:
   sub ax,128
    jc underflow
   cmp ax,256
    jnc overflow

Can we do this differently? Sure, but only if clamping is both common 
and non-predictable:

   sub ax,128
   mov dx,0xff00
   cmovc al,dl
   cmp ax,256
   cmovnc al,dh

How about doing it without conditional moves?

   sub ax,128
   sbb dx,dx	; -1 if underflow
   cmp ax,256
   sbb bx,bx
   xor dx,-1

   and ax,dx
   or ax,bx

Anyway, unless clamping is very common, it is hard to beat the branch 
predictors!

Terje
-- 
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

0
Reply Terje 4/17/2008 6:14:08 PM

On 17 Apr, 20:14, Terje Mathisen  <spamt...@crayne.org> wrote:
> pete wrote:
> >> Was Terje's solution:
>
> >> � add dl,al � � ; Carry set if it overflows!
> >> � sbb al,al � � ; AL = 0xff if overflow
> >> � or dl,al � � �; Turns any overflow into 0xff
>
> >> ...the right one? �I see overflow handling but not underflow...
>
[...]
>
> Anyway, unless clamping is very common, it is hard to beat the branch
> predictors!
>

I don't think the 808x CPU has a branch predictor. I guess one could
just add up the cycles used for each instruction on such an old CPU
and come up with the best performing code for various statistics of
the data. It should be pretty much deterministic (does it even use
cache memory?).

BR,
Danjel McGougan

0
Reply Danjel 4/18/2008 9:31:30 AM

Danjel McGougan wrote:
> On 17 Apr, 20:14, Terje Mathisen  <spamt...@crayne.org> wrote:
>> pete wrote:
>>>> Was Terje's solution:
>>>>   add dl,al     ; Carry set if it overflows!
>>>>   sbb al,al     ; AL = 0xff if overflow
>>>>   or dl,al      ; Turns any overflow into 0xff
>>>> ...the right one?  I see overflow handling but not underflow...
> [...]
>> Anyway, unless clamping is very common, it is hard to beat the branch
>> predictors!
>>
> 
> I don't think the 808x CPU has a branch predictor. I guess one could

Ouch, I forgot about the 8088/8086 target!

> just add up the cycles used for each instruction on such an old CPU
> and come up with the best performing code for various statistics of
> the data. It should be pretty much deterministic (does it even use
> cache memory?).

The only cache at all is a 6/8 byte prefetch queue for instruction 
bytes, but on 8088 that queue is very often empty since it has to 
compete with all memory accesses.

The performance is easy to calculate though:

Without taken branches:
4 cycles per byte touched, code & data.

Taken branches were something like 4-8 cycles afair, might be wrong.

  ; Input: AL = signed byte, DL = unsigned byte
  ; Output: AL unsigned clamped byte


  xor dx,dx
next:			; bytes
  mov dl,[bx+si]		; 2+1
  lodsb			; 1+1
  cbw			; 1	-128 to 127
  add ax,dx		; 2	-128 to 383
   js underflow		; 2
  test ah,ah		; 2
   jnz overflow		; 2
store:
  stosb			; 1+1	Store clamped result
  jmp next		; 2+t

This looks like 14 bytes total, i.e. about 56 cycles for a nonclamped 
result.

underflow:
  xor ax,ax		; 2
  stosb			; 1+1
  jmp next		; 2+t

Underflow adds a taken branch but get rid of 2 code bytes, so we're 
talking about +/- zero extra cost.

overflow:
  mov al,255		; 3
  stosb			; 1+1
  jmp next		; 2+t

Overflow requires 3 extra code bytes and a taken branch, i.e. 
approximately 16+ cycles.

If most samples turn out to be in range, then we should simplify the 
code to optimize this path even further:

  xor dx,dx
next:			; bytes
  mov dl,[bx+si]		; 2+1
  lodsb			; 1+1
  cbw			; 1	-128 to 127
  add ax,dx		; 2	-128 to 383
  test ah,0ffh
   jnz clamp
  stosb
  jmp next

clamp:
   js underflow
  mov al,255
  stosb
  jmp next

underflow:
  xor ax,ax
  stosb
  jmp next

Terje
-- 
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

0
Reply Terje 4/18/2008 6:21:01 PM

On the real 8088 IBM PC the cycle count is total bytes in the
instruction and data being referenced times the wait states.  The PC
had so many waits on memory access that only a few instructions such
as multiply and divide would take more CPU cycles than the fetch.

On Fri, 18 Apr 2008 02:31:30 -0700 (PDT), Danjel McGougan
<spamtrap@crayne.org> wrote:

>On 17 Apr, 20:14, Terje Mathisen  <spamt...@crayne.org> wrote:
>> pete wrote:
>> >> Was Terje's solution:
>>
>> >> � add dl,al � � ; Carry set if it overflows!
>> >> � sbb al,al � � ; AL = 0xff if overflow
>> >> � or dl,al � � �; Turns any overflow into 0xff
>>
>> >> ...the right one? �I see overflow handling but not underflow...
>>
>[...]
>>
>> Anyway, unless clamping is very common, it is hard to beat the branch
>> predictors!
>>
>
>I don't think the 808x CPU has a branch predictor. I guess one could
>just add up the cycles used for each instruction on such an old CPU
>and come up with the best performing code for various statistics of
>the data. It should be pretty much deterministic (does it even use
>cache memory?).
>
>BR,
>Danjel McGougan

0
Reply dave 4/18/2008 8:57:11 PM

On Apr 18, 3:57 pm, dave <spamt...@crayne.org> wrote:
> On the real 8088 IBM PC the cycle count is total bytes in the
> instruction and data being referenced times the wait states.  The PC
> had so many waits on memory access that only a few instructions such
> as multiply and divide would take more CPU cycles than the fetch.

....but they still had to execute.  Just because an opcode takes 4 or 8
cycles to fetch doesn't mean its execution time is magically erased.
Since the (tiny) prefetch queue can be full after a slow operation,
optimization still matters.

I originally asked about the topic because a jump clears the prefetch
queue, and was trying to find a way to clamp without jumping.  I have
to ask, though:  The penalty on my CPU for jumping is 1. clearing the
prefetch queue and 2. 17 cycles.  I will have to look at Terje's
suggestions and see if the solutions he provided actually beat that,
which may have been what he was referring to when he wrote "unless
clamping is very common, it is hard to beat the branch
predictors!"

0
Reply Jim 4/21/2008 2:23:39 PM

On Apr 18, 1:21 pm, Terje Mathisen  <spamt...@crayne.org> wrote:
> The only cache at all is a 6/8 byte prefetch queue for instruction
> bytes, but on 8088 that queue is very often empty since it has to
> compete with all memory accesses.

4 bytes, actually, on 8088.  It's 6 on 8086 and NEC V20/30.

> Taken branches were something like 4-8 cycles afair, might be wrong.

17, actually.  That's where the detective work comes in; to "beat" the
penalty of a jump, the alternate solution has to be 16 cycles or less
(including opcode fetch as you noted).

I think, based on what you wrote, that 17 cycles is not that big a
deal when alternate solutions would most likely be longer/slower.

> If most samples turn out to be in range, then we should simplify the
> code to optimize this path even further:

"Most" is subjective, but in an ADPCM decoder (what this code is for)
I would agree that the ratio of needing clamping to not needing is
about 1:8, so yes, most samples will be in range.

>   xor dx,dx
> next:                   ; bytes
>   mov dl,[bx+si]                ; 2+1
>   lodsb                 ; 1+1
>   cbw                   ; 1     -128 to 127
>   add ax,dx             ; 2     -128 to 383
>   test ah,0ffh
>    jnz clamp
>   stosb
>   jmp next
>
> clamp:
>    js underflow
>   mov al,255
>   stosb
>   jmp next
>
> underflow:
>   xor ax,ax
>   stosb
>   jmp next
>
> Terje

Looks like a winner; thanks for all your help!

0
Reply Jim 4/21/2008 2:33:56 PM

Jim Leonard wrote:
> On Apr 18, 1:21 pm, Terje Mathisen  <spamt...@crayne.org> wrote:
>> The only cache at all is a 6/8 byte prefetch queue for instruction
>> bytes, but on 8088 that queue is very often empty since it has to
>> compete with all memory accesses.
> 
> 4 bytes, actually, on 8088.  It's 6 on 8086 and NEC V20/30.
> 
>> Taken branches were something like 4-8 cycles afair, might be wrong.
> 
> 17, actually.  That's where the detective work comes in; to "beat" the
> penalty of a jump, the alternate solution has to be 16 cycles or less
> (including opcode fetch as you noted).
> 
> I think, based on what you wrote, that 17 cycles is not that big a
> deal when alternate solutions would most likely be longer/slower.
> 
>> If most samples turn out to be in range, then we should simplify the
>> code to optimize this path even further:
> 
> "Most" is subjective, but in an ADPCM decoder (what this code is for)
> I would agree that the ratio of needing clamping to not needing is
> about 1:8, so yes, most samples will be in range.
> 
>>   xor dx,dx
>> next:                   ; bytes
>>   mov dl,[bx+si]                ; 2+1
>>   lodsb                 ; 1+1
>>   cbw                   ; 1     -128 to 127
>>   add ax,dx             ; 2     -128 to 383
>>   test ah,0ffh
>>    jnz clamp
>>   stosb
>>   jmp next
>>
>> clamp:
>>    js underflow
>>   mov al,255
>>   stosb
>>   jmp next
>>
>> underflow:
>>   xor ax,ax
>>   stosb
>>   jmp next
>>
>> Terje
> 
> Looks like a winner; thanks for all your help!
> 

Might it be "better" (at least smaller code) to place the stosb after 
the "next" label? but then you'd need a setup JMP to start,
[and a "cx=0" test at some point!]

  xor dx,dx
  jmp short start

  next:                   ; bytes
    stosb
start:
    mov dl,[bx+si]                ; 2+1
    lodsb                 ; 1+1
    cbw                   ; 1     -128 to 127
    add ax,dx             ; 2     -128 to 383
    test ah,0ffh
    jz next

  clamp:
     js underflow
    mov al,255
    jmp next

  underflow:
    xor ax,ax
    jmp next

0
Reply Esra 4/21/2008 8:07:15 PM

Hello. You can use carry to generate a saturation mask. The idea
revolves around this concept:

0 - 1 = -1 (all bits set)
0 - 0 = 0

add al, bl ; this is ur addition
xor cl, cl ; zero d cl register
sbb cl, cl ; cl = cl - cl - carry = 0 - 0 - carry = 0 - carry
or  al, cl ; this will saturate; the cl = 0xff if carry was set.. else
it is 0x00

w00t++ w00t++

Saturating w/o branching is child's play; enjoy!

0
Reply aku 4/22/2008 7:38:05 AM

Esra Sdrawkcab wrote:
> Might it be "better" (at least smaller code) to place the stosb after 
> the "next" label? but then you'd need a setup JMP to start,
> [and a "cx=0" test at some point!]
> 
>  xor dx,dx
>  jmp short start
> 
>  next:                   ; bytes
>    stosb
> start:
>    mov dl,[bx+si]                ; 2+1
>    lodsb                 ; 1+1
>    cbw                   ; 1     -128 to 127
>    add ax,dx             ; 2     -128 to 383
>    test ah,0ffh
>    jz next

Yes indeed, this is a very common/useful optimization, I didn't do this 
simply because I didn't know what the surrounding code looked like, I 
just wrote a minimal wrapper around the core clamping code.

BTW, there is a very sneaky way to avoid that initial jump:

Instead of

  jmp short start
  next:
    stosb
start:

you can instead write:

   mov al,0AAh
org $-1		; Back up one byte into the immediate constant above
next:
   stosb

I.e. on the first iteration we do a dummy two-byte instruction with a 
one-byte immediate, which we then make sure contains the STOSB opcode byte.

The gain is that we save one opcode byte (MOV instead of JMP SHORT) and 
a taken branch, so the actual cost of rotating the loop like this is 8 
cycles on the first iteration, instead of 17, and each iteration saves 
two opcode bytes or 8 cycles.

Terje

-- 
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

0
Reply Terje 4/22/2008 6:01:06 PM

aku ankka wrote:
> Hello. You can use carry to generate a saturation mask. The idea
> revolves around this concept:
> 
> 0 - 1 = -1 (all bits set)
> 0 - 0 = 0
> 
> add al, bl ; this is ur addition
> xor cl, cl ; zero d cl register

This XOR will overwrite the flags from the previous ADD!
INC/DEC are the only arithmetic opcodes which doesn't modify the carry flag.

> sbb cl, cl ; cl = cl - cl - carry = 0 - 0 - carry = 0 - carry
> or  al, cl ; this will saturate; the cl = 0xff if carry was set.. else
> it is 0x00
> 
> w00t++ w00t++
> 
> Saturating w/o branching is child's play; enjoy!

Children often make simple mistakes. :-)

Terje

-- 
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

0
Reply Terje 4/22/2008 6:03:00 PM

On Apr 22, 9:03�pm, Terje Mathisen  <spamt...@crayne.org> wrote:
> aku ankka wrote:
> > Hello. You can use carry to generate a saturation mask. The idea
> > revolves around this concept:
>
> > 0 - 1 = -1 (all bits set)
> > 0 - 0 = 0
>
> > add al, bl ; this is ur addition
> > xor cl, cl ; zero d cl register
>
> This XOR will overwrite the flags from the previous ADD!
> INC/DEC are the only arithmetic opcodes which doesn't modify the carry flag.
>
> > sbb cl, cl ; cl = cl - cl - carry = 0 - 0 - carry = 0 - carry
> > or �al, cl ; this will saturate; the cl = 0xff if carry was set.. else
> > it is 0x00
>
> > w00t++ w00t++
>
> > Saturating w/o branching is child's play; enjoy!
>
> Children often make simple mistakes. :-)
>
> Terje
>
> --
> - <Terje.Mathi...@hda.hydro.com>
> "almost all programming can be viewed as an exercise in caching"

Yea; d xor was unnecessary, I was about to post follow-up but didn't
see dat post o mine because of moderation lag. But since u tried to to
correct tha sh1t, here's what I wanted to post about five seconds
after hitting d send button:

a - a - carry = 0 - carry

In other words the register doesn't need to be zeroed to begin wid.
I'm baffled u didn't see dat one.. I did but wanted to quote miself so
there u go.. thxbye

0
Reply aku 4/23/2008 6:43:47 AM

Terje Mathisen wrote:

> you can instead write:
> 
>   mov al,0AAh
> org $-1        ; Back up one byte into the immediate constant above
> next:
>   stosb
> 
> I.e. on the first iteration we do a dummy two-byte instruction with a 
> one-byte immediate, which we then make sure contains the STOSB opcode byte.
> 
> The gain is that we save one opcode byte (MOV instead of JMP SHORT) and 
> a taken branch, so the actual cost of rotating the loop like this is 8 
> cycles on the first iteration, instead of 17, and each iteration saves 
> two opcode bytes or 8 cycles.
> 
Neat.

0
Reply Esra 4/23/2008 8:20:41 AM

On Apr 22, 1:01 pm, Terje Mathisen  <spamt...@crayne.org> wrote:
> BTW, there is a very sneaky way to avoid that initial jump:
>
> Instead of
>
>   jmp short start
>   next:
>     stosb
> start:
>
> you can instead write:
>
>    mov al,0AAh
> org $-1         ; Back up one byte into the immediate constant above
> next:
>    stosb
>
> I.e. on the first iteration we do a dummy two-byte instruction with a
> one-byte immediate, which we then make sure contains the STOSB opcode byte.
>
> The gain is that we save one opcode byte (MOV instead of JMP SHORT) and
> a taken branch, so the actual cost of rotating the loop like this is 8
> cycles on the first iteration, instead of 17, and each iteration saves
> two opcode bytes or 8 cycles.

I hate to seem more daft than I already am, but after re-reading the
above for a week I still don't get the trick.  I thought ORG was for
specifying where the assembler assumes the binary code will get
loaded; how does that save two opcode bytes per iteration?

0
Reply Jim 5/1/2008 8:13:06 PM

In article  <25740676-bb54-4871-b103-d3b7135e4f28@j22g2000hsf.googlegroups.com>
           spamtrap@crayne.org "Jim Leonard" writes:

> On Apr 22, 1:01 pm, Terje Mathisen  <spamt...@crayne.org> wrote:
> > BTW, there is a very sneaky way to avoid that initial jump:
> >
> > Instead of
> >
> >   jmp short start
> >   next:
> >     stosb
> > start:
> >
> > you can instead write:
> >
> >    mov al,0AAh
> > org $-1         ; Back up one byte into the immediate constant above
> > next:
> >    stosb
> >
> > I.e. on the first iteration we do a dummy two-byte instruction with a
> > one-byte immediate, which we then make sure contains the STOSB opcode byte.
> >
> > The gain is that we save one opcode byte (MOV instead of JMP SHORT) and
> > a taken branch, so the actual cost of rotating the loop like this is 8
> > cycles on the first iteration, instead of 17, and each iteration saves
> > two opcode bytes or 8 cycles.
> 
> I hate to seem more daft than I already am, but after re-reading the
> above for a week I still don't get the trick.

The "trick" is to locate the next: label in the middle of the 
instruction 'mov al,imm8' -- the 'org $-1' winds back the program 
counter one location and assembles the STOSB opcode in the 
location occupied by the imm8 value of the previous instruction.  
This results in a benign 'mov al,0AAh' first time through the 
code (benign in the sense that AL will soon after be loaded with 
the proper value) but when looping back to the next: label the 
code will find a STOSB (i.e. jumps into the middle of the mov 
instruction).

>  I thought ORG was for
> specifying where the assembler assumes the binary code will get
> loaded; how does that save two opcode bytes per iteration?

This saving comes from having a single STOSB at the "top" of the 
loop; the alternative earlier loop had 3 STOSB instructions (one 
in each branch) -- hence the saving of 2 x STOSB.

Very neat and goes some way to justify my earlier comment about 
Terje's little finger ;-)

Pete

Incidentally, I've used the 'org $-1' technique in the past for 
self-modifying code to jump to alternative addresses (like a 
switch() in C).  Something like

        jmp short +2
        org $-1
disp    db ?

proc1:  some code
        ret
proc2:  more code etc...
        ret

Then, at some point before the jmp you just load up the 
displacement byte (disp) with 0, (proc2-proc1), (proc3-proc1), 
etc. according to which section of code you want to jump to.  
Not much use on modern processors I guess, but it worked well on 
the 8086 and z80!
-- 
   "We have not inherited the earth from our ancestors,
    we have borrowed it from our descendants."

0
Reply pete 5/2/2008 4:38:05 AM

Jim Leonard wrote:
> On Apr 22, 1:01 pm, Terje Mathisen  <spamt...@crayne.org> wrote:
>> The gain is that we save one opcode byte (MOV instead of JMP SHORT) and
>> a taken branch, so the actual cost of rotating the loop like this is 8
>> cycles on the first iteration, instead of 17, and each iteration saves
>> two opcode bytes or 8 cycles.
> 
> I hate to seem more daft than I already am, but after re-reading the
> above for a week I still don't get the trick.  I thought ORG was for
> specifying where the assembler assumes the binary code will get
> loaded; how does that save two opcode bytes per iteration?

Pete wrote a good explanation, using ORG together with $ is for relative 
instead of absolute addresses.

The two bytes saved per iteration was the unconditional jump from the 
bottom of the loop, getting rid of those required the loop to be 
rotated, i.e. moving the STOSB to the top, and getting rid of the 
initial jump past this opcode was the reason for my dummy immediate 
instruction.

Terje

-- 
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

0
Reply Terje 5/2/2008 7:11:22 AM

Terje Mathisen wrote:
> 
> BTW, there is a very sneaky way to avoid that initial jump:
> 
> Instead of
> 
>  jmp short start
>  next:
>    stosb
> start:
> 
> you can instead write:
> 
>   mov al,0AAh
> org $-1        ; Back up one byte into the immediate constant above
> next:
>   stosb
> 
> I.e. on the first iteration we do a dummy two-byte instruction with a 
> one-byte immediate, which we then make sure contains the STOSB opcode byte.
> 

It's worth noting that this will cause an L1 cache miss on some platforms.

	-hpa

0
Reply H 6/2/2008 4:43:53 PM

H. Peter Anvin wrote:
> Terje Mathisen wrote:
>>
>> BTW, there is a very sneaky way to avoid that initial jump:
>>
>> Instead of
>>
>>  jmp short start
>>  next:
>>    stosb
>> start:
>>
>> you can instead write:
>>
>>   mov al,0AAh
>> org $-1        ; Back up one byte into the immediate constant above
>> next:
>>   stosb
>>
>> I.e. on the first iteration we do a dummy two-byte instruction with a 
>> one-byte immediate, which we then make sure contains the STOSB opcode 
>> byte.
>>
> 
> It's worth noting that this will cause an L1 cache miss on some platforms.

Which was why I did note it in one of my posts, and returned here when 
it was clear that the target was an original 808x, which really didn't 
care about stuff like cache coherence or partially decoded instructions. :-)

The trick was a win on every x86 up to and including the 486 which was 
the last time I used it.

Terje

-- 
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

0
Reply Terje 6/2/2008 5:53:52 PM

38 Replies
84 Views

(page loaded in 0.316 seconds)

Similiar Articles:


















7/17/2012 7:18:47 PM


Reply: