Fixed Point Square Root Improvements?

  • Follow


I will be working with an embedded system with no FPU and need to be
able to find the square root of a 16.16 fixed point integer.

My current design is using a lookup table where the index is the root
(with two decimal bits) of the value pointed to (with four decimal
bits.) This doesn't cover every possible value, so I do a search for
the nearest value that is contained in the array.

Building the table:

sqrtTable[0x400];
for (L = 0; L < 0x400; L++) {
	sqrtTable[L] = (L * L);
}

(Actually I write the results out to a file for direct compilation in)

And my current function is as such:

__inline unsigned int fsqrt(unsigned int value) {	//square root
function which works on a 16/16 fixed point integer

#ifndef PC	//straight C code

	unsigned int testValue;

	unsigned int low = 0;
	unsigned int high = 0x3ff;
	unsigned int curr = 0x1ff;

	value >>= 12;

	while (1) {
		testValue = sqrtTable[curr];

		if (value == testValue) {
			break;
		}
		else if (value < testValue) {
			high = curr;
		}
		else {
			low = curr;
		}

		if ((low + 1) == high || low == high) {
			curr = low;
			break;
		}

		curr = (low + high) >> 1;
	}

	return (curr << 14);

#else	//PC only assembly
	unsigned int returnValue;

	_asm {
		mov eax, 0;
		mov edx, 0x3ff;
		mov edi, 0x1ff;

		lea esi, sqrtTable;

		mov ecx, value;
		shr ecx, 12;

		LOOP_TOP:
			mov ebx, [esi + edi * 4];
			cmp ebx, ecx;

			je LOOP_END;
			jl LESS;
				mov edx, edi;

				add edi, eax;
				shr edi, 1;
				jmp COMPARE_END;

			LESS:
				mov eax, edi;

				add edi, edx;
				shr edi, 1;
			COMPARE_END:

			mov ebx, eax;
			cmp ebx, edx;
			je CONVERGED;

			inc ebx;
			cmp ebx, edx;
			je CONVERGED;

			jmp LOOP_TOP;

			CONVERGED:
				mov edi, eax;
		LOOP_END:

		shl edi, 14;
		mov returnValue, edi;
	}
	return returnValue;

#endif
}

The C version appears to work about 2.2 times slower than C's sqrt()
on a float and the assembly at about 1.5.
Was just wondering if anyone had any recommendations? I am relatively
pleased to have gotten that close to the standard library, but
certainly wouldn't complain if I could get more speed and/or more
decimal bits with a different approach.

Thank you,
Chris Williams

0
Reply spamtrap 11/25/2004 4:14:06 AM

On 24 Nov 2004 20:14:06 -0800, spamtrap@crayne.org (Chris Williams)
wrote:

>I will be working with an embedded system with no FPU and need to be
>able to find the square root of a 16.16 fixed point integer.
>
>My current design is using a lookup table where the index is the root
>(with two decimal bits) of the value pointed to (with four decimal
>bits.) This doesn't cover every possible value, so I do a search for
>the nearest value that is contained in the array.

int sqrttable[256]; /* full of sqrt(0..255) */

value=34758934;

sh=0;
while (value>=256)
{ 
  value>>=2;
  sh++;
}
root=table[value]<<sh;

increase table size for more precision

Jim

0
Reply Jim 11/25/2004 6:38:18 AM


spamtrap@crayne.org (Chris Williams) writes:

> I will be working with an embedded system with no FPU and need to be
> able to find the square root of a 16.16 fixed point integer.
> 
> My current design is using a lookup table where the index is the root
> (with two decimal bits) of the value pointed to (with four decimal
> bits.) This doesn't cover every possible value, so I do a search for
> the nearest value that is contained in the array.
> 
> Building the table:
> 
> sqrtTable[0x400];
> for (L = 0; L < 0x400; L++) {
> 	sqrtTable[L] = (L * L);
> }
>
> (Actually I write the results out to a file for direct compilation in)

I trust that the recorded values are actually root of the value 
being pointed to and not the square, as the above indicates.
 
 
> And my current function is as such:
> 
> __inline unsigned int fsqrt(unsigned int value) {	//square root
> function which works on a 16/16 fixed point integer
[SNIP - binary chop]
> }
> 
> The C version appears to work about 2.2 times slower than C's sqrt()
> on a float and the assembly at about 1.5.
> Was just wondering if anyone had any recommendations? I am relatively
> pleased to have gotten that close to the standard library, but
> certainly wouldn't complain if I could get more speed and/or more
> decimal bits with a different approach.

Use Newton-Raphson.
It's easiest to find 1/sqrt(x), and then post-multiply that by x.

Phil
-- 
.... one Marine noticed one of the prisoners was still breathing. A Marine 
can be heard saying on the pool footage provided to Reuters Television: 
"He's fucking faking he's dead. He faking he's fucking dead."
"The Marine then raises his rifle and fires into the man's head. 
The pictures are too graphic for us to broadcast," Sites said. 

0
Reply Phil 11/25/2004 11:16:21 AM

> int sqrttable[256]; /* full of sqrt(0..255) */
> 
> value=34758934;
> 
> sh=0;
> while (value>=256)
> { 
>   value>>=2;
>   sh++;
> }
> root=table[value]<<sh;
> 
> increase table size for more precision
> 
> Jim

Tried this and got around .5<->.6 the speed of C's sqrt() on floats.
Which I like =)
But it also appears to only have half the precision of the bit count
of the array length. (ie, 256 = 8 bits, so the first 4 bits will be
accurate.) To get the level of precision I have now, I would need to
make my array 1,048,575 (0x100000) in length--but I don't think I can
afford to go over 4096.

I will be doing a good bit of trigonometry so only having 2-bits (up
to 1/4 of a unit) of accuracy is already quite worrying. I don't think
I could afford to lose any speed from my current implementation, but
any extra precision would by wonderful.

-Chris

0
Reply spamtrap 11/25/2004 7:22:57 PM

"Chris Williams" <spamtrap@crayne.org> wrote in message 
news:da271e02.0411242014.31d3f168@posting.google.com...
[...]
> The C version appears to work about 2.2 times slower than C's sqrt()
> on a float and the assembly at about 1.5.
> Was just wondering if anyone had any recommendations? I am relatively
> pleased to have gotten that close to the standard library, but
> certainly wouldn't complain if I could get more speed and/or more
> decimal bits with a different approach.

I poked around not too long ago in search of the exact same. I have a 
non-LUT solution which is about 2-2.5 times slower than sqrt() on my Athlon 
with gcc -O3 -finline-functions. The actual speed depends on the data, and 
2.5 seems to be the worst case. While poking around I also found an integer 
sqrt() which ranges from 1.0 to 1.5 times slower than C's sqrt() function. I 
have not taken the time to fully understand it, and thus I have no 
fixed-point version. All of the algorithms in my code are credited to the 
following page: http://www.azillionmonkeys.com/qed/sqroot.html

I compared Newton-Rhapson to these algorithms also, and I seem to recall 
Newton-Rhapson being slower on average, but I'm not 100% certain. For 32-bit 
values, you need 5 iterations of Newton-Rhapson to guarantee convergence.

Note that all the algorithms I am discussing give full precision. My 
implementation of the square root functions follow:

// Size of mantissa in bits.
#define MANTISSA                    (10)

unsigned long lzc(unsigned long x)
{
#ifdef __GNUC__
 asm("bsrl %1, %0" : "=r" (x) : "r" (x));
#else
 _asm
 {
  mov eax, x
  bsr eax, eax
  mov x, eax
 }
#endif /* __GNUC__ */
 return x;
}

// Algorithm is based on polynomial arithmetic:
// 1) Compute initial guess using 2^(msb_index / 2)
// 2) Iterate over remaining bits
// 3) Set bit to 1 if (1 << bit_index) < (x - guess^2)
fix fixsqrt(fix x)
{
 fix guess, gsqr, bsqr, gxb;
 unsigned int i, bitmask;

 if (x == 0)
  return 0;

 // Compute guess
 i = (((lzc(x) - MANTISSA) / 2) + MANTISSA);
 guess = 1 << i;

 // Compute mask containing next bit to be checked
 bitmask = (1 << i) >> 1;

 // Cache guess^2, bitmask^2, and 2*guess*bitmask for fast polynomial
 // eval.
 gsqr = guess << (i - MANTISSA); // guess * guess
 gxb = guess << (i - MANTISSA); // guess * bitmask * 2
 bsqr = (bitmask << (i - MANTISSA)) >> 1; // bitmask * bitmask

 // Determine remaining bits
 do
 {
  // (guess + bitmask)^2 == guess^2 + 2*guess*bitmask + bitmask^2
  //                     == gsqr + gxb + bsqr
  if ((gsqr + gxb + bsqr) <= x)
  {
   guess |= bitmask;
   gsqr += (gxb + bsqr);
   gxb += (bsqr << 1);
  }

  bitmask >>= 1;
  bsqr >>= 2;
  gxb >>= 1;
 }
 while(bitmask != 0);

 return guess;
}

// The integer square root function. Copied directly off the webpage;
// seems to pick 1 bit per iteration, but I haven't looked closely at it.
int isqrt(int x)
{
  unsigned long temp, g=0, b, bshft;
  bshft = lzc(x) / 2;
  b = 1 << bshft;
  do {
    if (x >= (temp = (((g<<1)+b)<<bshft--))) {
      g += b;
      x -= temp;
    }
  } while (b >>= 1);
  return g;
}

-Matt 

0
Reply Matt 11/26/2004 4:06:50 AM

> > sqrtTable[0x400];
> > for (L = 0; L < 0x400; L++) {
> > 	sqrtTable[L] = (L * L);
> > }
> >
> > (Actually I write the results out to a file for direct compilation in)
> 
> I trust that the recorded values are actually root of the value 
> being pointed to and not the square, as the above indicates.

No, the recorded values are the square.

sqrtTable[0] = 0;
sqrtTable[1] = 1;
sqrtTable[2] = 4;
sqrtTable[3] = 9;

If I want to find the root of 5, then I do a search for either 5 in
the array or the first number below it.

Iteration 0:  //binary search
high = 3
low = 0
curr = 1
sqrtTable[curr] = 1

Iteration 1:
high = 3
low = 1
curr = 2
sqrtTable[curr] = 4

Iteration 2:
high = 3
low = 2
curr = 2
sqrtTable[curr] = 4

High and low have converged, so I return low.

Will look up Newton-Raphson.

-Chris

0
Reply spamtrap 11/26/2004 4:06:53 AM

> Use Newton-Raphson.
> It's easiest to find 1/sqrt(x), and then post-multiply that by x.
> 
> Phil

My test function for Newton-Raphson is coming in at about 5 times
slower than C's sqrt. (I assume because of the multiplication and
division.) 10 iterations seems to be the minimum needed for a 16 bit
value to return an accurate 8 bit value.
Unless I have done something incorrect, this would be 2 to 3 times
slower, with two fewer bits of precision than my current
implementation. Could you please review it?

int mysqrt(int value) {
	int temp;
	int result = value >> 1;
	result += 1;

	for (int L = 0; L < 10; L++) {
		temp = result * result;
		temp -= value;
		temp /= (result << 1);

		result = result - temp;
	}

	return result;
}

I was uncertain how one is supposed to generate the seed value.

-Chris

0
Reply spamtrap 11/26/2004 4:06:57 AM

"Phil Carmody" <thefatphil_demunged@yahoo.co.uk> wrote in message news:87is7utcef.fsf@nonospaz.fatphil.org...
> spamtrap@crayne.org (Chris Williams) writes:
>
> > I will be working with an embedded system with no FPU and need to be
> > able to find the square root of a 16.16 fixed point integer.
> >
[...]
>
> Use Newton-Raphson.
> It's easiest to find 1/sqrt(x), and then post-multiply that by x.

I second Phil's recommendation. This works best if one first normalizes
the input, creating something akin to a floating-point mantissa. A table
of 1-byte starting approximations, indexed by some most significant bits
of the normalized inout, results in 9-bit initial approximations. Iterate
twice using the NR iteration for 1/sqrt, then post multiply. If an exact
result is needed, square the result, compute the remainder, and adjust
the result accordingly.

The code below (tested exhaustively) computes the 16.16 square root using
an interesting algorithm I found many years ago in IEEE TOC. The result is
exact, i.e. isqrt(x)*isqrt(x) <= x < (isqrt(x)+0.0001)*(isqrt(x)+0.0001).
The speed of the code below depends very much on the speed of the integer
multiplications. On my Athlon processor, I measured a worst case execution
time of 136 cycles, and an average execution time of 112 cycles. A Newton-
Raphson approach may get you down to 70 cycles or so, but I don't have code
for that handy. When I wrote the code below, I was mostly interested in
exploring the algorithm.

-- Norbert


/*
 * 16.16 fixed-point square root based on: Reza Hashemian, "Square Rooting
 * Algorithms for Integer and Floating-Point Numbers", IEEE Transactions on
 * Computers, Vol. 39, No. 8, August 1990, pp 1025-1029
 */
__declspec(naked) __stdcall isqrt (unsigned x)
{
  __asm {
    mov   eax, [esp+4]       ; x
    bsr   ecx, eax           ; n = (31 - clz(x)); ZF = (x == 0)
    jz    $done              ; sqrt(0) = 0
    push  ebx                ; save per calling convention
    push  esi                ; save per calling convention
    push  eax                ; save x
    ; normalize argument
    not   ecx                ; ~n = clz(x) mod 32
    shl   eax, cl            ; z = x << (~n)
    mov   esi, ecx           ; ~n
    ; compute initial approximation
    mov   ecx, eax           ; oldz = z
    neg   eax                ; -z
    shr   eax, 1             ; z = ((unsigned)-z) >> 1
    mov   ebx, eax           ; z
    mul   eax                ; EDX = t = ((((u64)z)*z) >> 32)
    shr   edx, 1             ; t >> 1
    lea   eax, [ebx + edx]   ; y = z + (t >> 1)
    ; determine necessary number of iterations
    shr   ecx, 28            ; (oldz >> 28)
    sub   ecx, 8             ; (oldz >> 28) - 8
    imul  ecx, 7             ; each iteration comprises 7 bytes
    add   ecx, OFFSET $start ; compute target address
    jmp   ecx                ; jump to appropriate iteration
 $start:
    ; y = eax, z = ebx, t = edx
    mul   eax                ; EDX = t = ((((u64)y)*y) >> 32)
    shr   edx, 1             ; t >> 1
    lea   eax, [ebx + edx]   ; y = z + (t >> 1)
    mul   eax                ; EDX = t = ((((u64)y)*y) >> 32)
    shr   edx, 1             ; t >> 1
    lea   eax, [ebx + edx]   ; y = z + (t >> 1)
    mul   eax                ; EDX = t = ((((u64)y)*y) >> 32)
    shr   edx, 1             ; t >> 1
    lea   eax, [ebx + edx]   ; y = z + (t >> 1)
    mul   eax                ; EDX = t = ((((u64)y)*y) >> 32)
    shr   edx, 1             ; t >> 1
    lea   eax, [ebx + edx]   ; y = z + (t >> 1)
    mul   eax                ; EDX = t = ((((u64)y)*y) >> 32)
    shr   edx, 1             ; t >> 1
    lea   eax, [ebx + edx]   ; y = z + (t >> 1)
    mul   eax                ; EDX = t = ((((u64)y)*y) >> 32)
    shr   edx, 1             ; t >> 1
    lea   eax, [ebx + edx]   ; y = z + (t >> 1)
    mul   eax                ; EDX = t = ((((u64)y)*y) >> 32)
    shr   edx, 1             ; t >> 1
    lea   eax, [ebx + edx]   ; y = z + (t >> 1)
    mul   eax                ; EDX = t = ((((u64)y)*y) >> 32)
    shr   edx, 1             ; t >> 1
    lea   eax, [ebx + edx]   ; y = z + (t >> 1)
    mul   eax                ; EDX = t = ((((u64)y)*y) >> 32)
    shr   edx, 1             ; t >> 1
    lea   eax, [ebx + edx]   ; y = z + (t >> 1)
    mul   eax                ; EDX = t = ((((u64)y)*y) >> 32)
    shr   edx, 1             ; t >> 1
    add   ebx, edx           ; y = z + (t >> 1)
    neg   ebx                ; y = -y = preliminary result
    adc   ebx, 0xffffffff    ; adjust for argument of 0xffffffff
    ; multiply root by sqrt(2) arguments with "odd exponent"
    mov   eax, 0xb504f334    ; sqrt(2)
    mul   ebx                ; (((u64)y)*0xb504f334)
    shr   esi, 1             ; ~n >> 1; CF = ~n & 1
    cmovc ebx, edx           ; y = (n & 1) ? (((u64)y)*0xb504f334) >> 32 : y
    ; denormalize result
    lea   ecx, [esi+8+16]    ; (~n >> 1) + 8 + 16
    shr   ebx, cl            ; y = (y >> ((~n >> 1) + 8 + 16))
    pop   ecx                ; x
    ; compute remainder and adjust result to get exact square root
    mov   eax, ebx           ; y
    mul   ebx                ; EDX:EAX = y*y
    mov   esi, ecx           ; x
    shl   ecx, 16            ; ESI:ECX = x << 16
    shr   esi, 16            ; x >> 16
    sub   ecx, eax           ;
    sbb   esi, edx           ;
    sbb   ebx, 0             ; y = ((y*y) > (x << 16)) ? y-- : y

    mov   eax, ebx           ; return y = isqrt(x)
    pop   esi                ; restore per calling convention
    pop   ebx                ; restore per calling convention
 $done:
    ret   4                  ; pop function argument and return
  }
}

0
Reply Norbert 11/26/2004 8:38:37 AM

spamtrap@crayne.org (Chris Williams) writes:

> > Use Newton-Raphson.
> > It's easiest to find 1/sqrt(x), and then post-multiply that by x.
> > 
> > Phil
> 
> My test function for Newton-Raphson is coming in at about 5 times
> slower than C's sqrt. 

It's quadratic convergence, so start with a lookup table.

> (I assume because of the multiplication and
> division.) 

There is no division apart from shifting right by a bit.

> 10 iterations seems to be the minimum needed for a 16 bit
> value to return an accurate 8 bit value.

Something's very wrong, or your initial estimate is abyssmal.

> Unless I have done something incorrect, this would be 2 to 3 times
> slower, with two fewer bits of precision than my current
> implementation. Could you please review it?
> 
> int mysqrt(int value) {
> 	int temp;
> 	int result = value >> 1;
> 	result += 1;
> 
> 	for (int L = 0; L < 10; L++) {
> 		temp = result * result;
> 		temp -= value;
> 		temp /= (result << 1);
> 
> 		result = result - temp;
> 	}
> 
> 	return result;
> }

Easier to just point you to Lynn Killingbeck's solution off 
Dave Rusin's Atlas of Known Mathematics:
http://www.math.niu.edu/~rusin/known-math/00_incoming/sqrt
<<<

Take it further. Presume that your machine does not even have a divide
instruction! Now, the (x+a/x)/2 is also worthless, since the division is
not available. This requires another form of NR. In this case, one that
works is f(x) = 1/x^2 - a = 0. This x will not be what you want, since
this x will be 1/x^(1/2). But, once you have it, you can multiply by 'a'
to get the desired answer. It sounds like nonsense: to get sqrt(x),
first compute 1/sqrt(x) and then multiply by x. But, getting 1/sqrt(x)
can be done without any division, and this 'nonsense' gives a very
useable method. (From experience, since I've worked with computers that
do not have a divide instruction.)
f(x) = 1/x^2 - a
f'(x) = -2 / x^3
x <== x - (1/x^2 - a) / [-2 / x^3]
    = x + (x - a * x^2) / 2
    = x + x * (1 - a * x^2)/2
    = x * (3 - a * x^2) / 2
Again, check that awkward ASCII algebra. This time, there is also a
constraint on the closeness of the original value of x.
>>>

> I was uncertain how one is supposed to generate the seed value.

Small lookup table. 8 bits should be plenty.

Phil
-- 
.... one Marine noticed one of the prisoners was still breathing. A Marine 
can be heard saying on the pool footage provided to Reuters Television: 
"He's fucking faking he's dead. He faking he's fucking dead."
"The Marine then raises his rifle and fires into the man's head. 
The pictures are too graphic for us to broadcast," Sites said. 

0
Reply Phil 11/27/2004 4:32:44 AM

"Phil Carmody" <thefatphil_demunged@yahoo.co.uk> wrote in message news:87r7mgowt5.fsf@nonospaz.fatphil.org...
> spamtrap@crayne.org (Chris Williams) writes:
>
> > > Use Newton-Raphson.
> > > It's easiest to find 1/sqrt(x), and then post-multiply that by x.
> > >
> > > Phil
> >
> > My test function for Newton-Raphson is coming in at about 5 times
> > slower than C's sqrt.
>
> It's quadratic convergence, so start with a lookup table.
[...]
> Take it further. Presume that your machine does not even have a divide
> instruction! Now, the (x+a/x)/2 is also worthless, since the division is
> not available. This requires another form of NR. In this case, one that
> works is f(x) = 1/x^2 - a = 0. This x will not be what you want, since
> this x will be 1/x^(1/2). But, once you have it, you can multiply by 'a'
> to get the desired answer. It sounds like nonsense: to get sqrt(x),
> first compute 1/sqrt(x) and then multiply by x. But, getting 1/sqrt(x)
> can be done without any division, and this 'nonsense' gives a very
> useable method. (From experience, since I've worked with computers that
> do not have a divide instruction.)
> f(x) = 1/x^2 - a
> f'(x) = -2 / x^3
> x <== x - (1/x^2 - a) / [-2 / x^3]
>     = x + (x - a * x^2) / 2
>     = x + x * (1 - a * x^2)/2
>     = x * (3 - a * x^2) / 2
> Again, check that awkward ASCII algebra. This time, there is also a
> constraint on the closeness of the original value of x.
> >>>
>
> > I was uncertain how one is supposed to generate the seed value.
>
> Small lookup table. 8 bits should be plenty.

I apologize for the absence of any x86 assembly code in this post, but
I did not have time to make an x86 assembly language version of the code
below. It may be easier to glimpse the algorithm from C source anyhow.
The code below is a fully functional (and exhaustively tested) demo of
the technique Phil and I have recommended. A few notes:

1. The computation of "prod" maps directly to the x86 MUL instruction.
2. (prod >> 32) is simply the high 32 bits of the MUL result, i.e. EDX
3. Use the BSR instruction to determine "lz".
4. A smaller table for the initial approximation to 1/sqrt(x) is likely
   possible. I started out with what looked like a reasonable size and
   did not spend any time optimizing the table.
5. Optimization of the fixed-point representations used may be possible.

-- Norbert


unsigned char srtTab[128];

for (k = 0; k < 64; k++) {
  srtTab[k]=(unsigned char)(((1.0/sqrt(2.0+(k+0.5)/32.0))-0.5)*512+0.5);
}
for (k = 0; k < 64; k++) {
  srtTab[k+64]=(unsigned char)(((1.0/sqrt(1.0+(k+0.5)/64.0))-0.5)*512+0.5);
}

/* 16.16 fixed-point square root using NR iterations for 1/sqrt(x) */
unsigned nr_isqrt (unsigned x)
{
  unsigned __int64 prod;
  unsigned r, s, f;
  int rem;
  unsigned oldx = x;
  int lz = 0;

  /* normalize argument */
  while (((int)x) > 0) {
    x <<= 1;
    lz++;
  }
  /* initial approximation */
  r = srtTab[((lz & 1) << 6) | ((x << 1) >> 26)] | 0x100;
  x >>= (lz & 1);
  /* first NR iteration */
  s = r * r;
  prod = ((unsigned __int64)x) * (s << 14);
  f = 0xC0000000 - (prod >> 32);
  prod = ((unsigned __int64)(r << 23)) * f;
  r = prod >> 32;
  /* second NR iteration */
  prod = ((unsigned __int64)r) * (r << 1);
  s = prod >> 32;
  prod = ((unsigned __int64)x) * (s << 1);
  f = 0xC0000000 - (prod >> 32);
  prod = ((unsigned __int64)r) * f;
  r = prod >> 30;
  /* compute sqrt(x) as x * 1/sqrt(x) */
  prod = ((unsigned __int64)x) * r;
  r = prod >> 32;
  /* denormalize result */
  r = (r >> (7 + (lz >> 1)));
  /* adjustments to get the exact result */
  r++;
  rem = (((unsigned __int64)oldx) << 16) - (((unsigned __int64)r)*r);
  if (rem < 0) { rem += 2*r-1; r--; }
  if (rem < 0) { r--; }
  return r;
}

0
Reply Norbert 11/28/2004 2:05:15 AM

> Something's very wrong, or your initial estimate is abyssmal.

Cool. Then it can be fixed =)

> Easier to just point you to Lynn Killingbeck's solution off 
> Dave Rusin's Atlas of Known Mathematics:
> http://www.math.niu.edu/~rusin/known-math/00_incoming/sqrt
> <<<

Hmm, was always better at solving for zero than simply rearranging the
bits around. Unfortunately, I truly dislike using code without
understanding what it is doing. =(

x - ((x^2 - a) / 2x) == (x + (a/x))/2

Which looks a bit better on paper but still boggled my mind for a bit
until I recalled that 3 - 1/4 == 11/4 and such (multiply be the bottom
and put up top.)

  x - (x^2 - a) / 2x       //multiply leading x by the 2x on the
bottom
= (2x^2 - (x^2 - a)) / 2x  //which will still be subtracting the top
= (2x^2 - x^2 + a) / 2x    //have to flip the sign!
= (x^2 + a) / 2x           //and do the subtraction

then if we had say 2x/3x this would equate to ((2x)/x)/3

  (x^2 + a) / 2x
= ((x^2 + a)/x)/2 //divide by one of the multipliers on the bottom
= (x + (a/x))/2   //and do the division

Which makes me quite joyful.

> Take it further. Presume that your machine does not even have a divide
> instruction! Now, the (x+a/x)/2 is also worthless, since the division is
> not available. This requires another form of NR. In this case, one that
> works is f(x) = 1/x^2 - a = 0. This x will not be what you want, since
> this x will be 1/x^(1/2). But, once you have it, you can multiply by 'a'
> to get the desired answer. It sounds like nonsense: to get sqrt(x),
> first compute 1/sqrt(x) and then multiply by x. But, getting 1/sqrt(x)
> can be done without any division, and this 'nonsense' gives a very
> useable method. (From experience, since I've worked with computers that
> do not have a divide instruction.)

Giving a stab at this one:

f(x) = 1/x^2 - a
f'(x) = -2 / x^3

P x <== x - (1/x^2 - a) / [-2 / x^3]      //start

C     = x - (x^3 * (1/x^2 - a)) / -2      //perhaps?
C     = x - (x^3 * 1/x^2 - a * x^3) / -2
C     = x - (x - a * x^3) / -2            //hmmm
C     = x + (-x + a * x^3) / 2            //?
P     = x + (x - a * x^2) / 2

C     = x + (-x + a * x^3) / 2
C     = (2x + -x + a * x^3) / 2
C     = (x + a * x^3) / 2

P     = x + x * (1 - a * x^2)/2
P     = x * (3 - a * x^2) / 2

I tried plugging numbers in, and am getting the same result with my
equations as the starting point but not with yours.
But more importantly, if I set

x = 3;
a = 4;

Then I get the result 55.5

And if I set

x = 2
a = 4

Then I get 17, from my equation. But, I don't understand what these
numbers have to do with 1/sqrt(x).... (Similar with your equation,
except the results are all negative.) And trying to iterate x back
through has the number getting larger. x = 55.5 -> 350,000 about

Thank you,
Chris Williams

PS, Apologies for the latency, I am posting through Google and they
don't submit for anywhere from 3 hours to a day or so.

0
Reply spamtrap 11/29/2004 6:46:28 AM

> I apologize for the absence of any x86 assembly code in this post, but
> I did not have time to make an x86 assembly language version of the code
> below.

No problem, was mostly just testing in x86 assembly to see what sort
of speed increase I could hope for with my conversion skills. I am
still uncertain what processor type it is that I will be working with.

> 2. (prod >> 32) is simply the high 32 bits of the MUL result, i.e. EDX

This is a bit worrying as I am still not certain what processor it
will be and, having never done any assembly outside of x86, am not
certain of the odds as to whether there will be a 64 bit output for
multiplication.
Could probably get around this if I understood the code though.

> 3. Use the BSR instruction to determine "lz".

Yup. Also purchased "Hacker's Delight" over the weekend, so will be
able to grab something out of there if there is no comparable
instruction on the device.

> 4. A smaller table for the initial approximation to 1/sqrt(x) is likely
>    possible. I started out with what looked like a reasonable size and
>    did not spend any time optimizing the table.

I looked at this and have very little idea of what you are doing. Both
you and Phil have suggested a small lookup table--but I don't
understand how one relates 2^16 or 2^32 possible values to a table
with only 128 entries? Obviously you are doing it, I just don't
understand what that relationship is, nor what data it is that you are
putting in there... =(

At the moment, I don't think I could copy and paste your code for my
application as it assumes an __int64, but I do feel like if I
understood the table-creation/access you have shown and the math Phil
has presented then I will be able to make something that would be
impressive (to myself =) .)

Thank you,
Chris Williams

0
Reply spamtrap 11/29/2004 6:46:37 AM

I see many mentions of Newton-Raphson, which puts an expensive divide
in the loop, so I wanted to offer another algorithm.  This technique,
though quite linear in its convergence, uses rotates and a single
subtraction per iteration.  You loop more, but the loops are tiny.

This was designed to be called from C, so the C stack frame was the
only thing preserved.  Also, the argument is an unsigned integer, so
there was no need to check the validity of whatever got passed.  The
"8" in "mov cx,8" (just before Step1) is half the number of bits in the
argument.

I'll state the obvious: for a float, divide the exponent by two, and
run the magnitude through this routine.  (If the exponent is an odd
number, denormalize.)

Here's some Turbo Assembler and Turbo C code.
..MODEL small
..CODE
PUBLIC _Sqroot
_Sqroot PROC
mov ax,0
mov dx,0
push bp
mov bp,sp
mov es,[bp+4]
mov bp,es
;
mov cx,8
Step1:
clc
rcl ax,1
stc
rcl ax,1
clc ;needed for floating point variation
rcl bp,1
rcl dx,1
rcl bp,1
rcl dx,1
;
mov bx,dx ;compare
sub bx,ax
;
jc Below
mov dx,bx
inc ax
rcr ax,1
loop Step1
jmp OutOfHere
;
Below:
clc
rcr ax,1
loop Step1
;
OutOfHere:
pop bp
ret
_Sqroot ENDP
END

extern unsigned Sqroot(unsigned n);
main()
{
unsigned i;
unsigned j=65535;

for(i=0;i<65535;++i)
printf("%u          %u\n",i,Sqroot(i));
printf("%u          %u\n",j,Sqroot(j));
}

By the way, 15 or 16 years ago, this was my first IBM PC (80186!)
assembler program, so I can't say that it's optimized, and I am not
sure what my "floating point variation" would have looked like, had I
ever written one.  (I'll make no comments about the sparse comments,
newbie that I was.)  I recently got re-interested in Assembler, and
visited this group for the first time a couple of days ago.

If you go with Newton-Raphson, I'd try dividing the exponent by 2 to
get the seed value.  Also, I've half convinced myself that the
magnitude's square root should have about half as many bits, and that I
may be able to make one more easy starting adjustment, but I haven't
given it much thought.

- Dan

0
Reply Dan 12/6/2004 8:31:00 PM

12 Replies
233 Views

(page loaded in 0.005 seconds)

Similiar Articles:


















7/23/2012 11:40:59 AM


Reply: