I will be working with an embedded system with no FPU and need to be
able to find the square root of a 16.16 fixed point integer.
My current design is using a lookup table where the index is the root
(with two decimal bits) of the value pointed to (with four decimal
bits.) This doesn't cover every possible value, so I do a search for
the nearest value that is contained in the array.
Building the table:
sqrtTable[0x400];
for (L = 0; L < 0x400; L++) {
sqrtTable[L] = (L * L);
}
(Actually I write the results out to a file for direct compilation in)
And my current function is as such:
__inline unsigned int fsqrt(unsigned int value) { //square root
function which works on a 16/16 fixed point integer
#ifndef PC //straight C code
unsigned int testValue;
unsigned int low = 0;
unsigned int high = 0x3ff;
unsigned int curr = 0x1ff;
value >>= 12;
while (1) {
testValue = sqrtTable[curr];
if (value == testValue) {
break;
}
else if (value < testValue) {
high = curr;
}
else {
low = curr;
}
if ((low + 1) == high || low == high) {
curr = low;
break;
}
curr = (low + high) >> 1;
}
return (curr << 14);
#else //PC only assembly
unsigned int returnValue;
_asm {
mov eax, 0;
mov edx, 0x3ff;
mov edi, 0x1ff;
lea esi, sqrtTable;
mov ecx, value;
shr ecx, 12;
LOOP_TOP:
mov ebx, [esi + edi * 4];
cmp ebx, ecx;
je LOOP_END;
jl LESS;
mov edx, edi;
add edi, eax;
shr edi, 1;
jmp COMPARE_END;
LESS:
mov eax, edi;
add edi, edx;
shr edi, 1;
COMPARE_END:
mov ebx, eax;
cmp ebx, edx;
je CONVERGED;
inc ebx;
cmp ebx, edx;
je CONVERGED;
jmp LOOP_TOP;
CONVERGED:
mov edi, eax;
LOOP_END:
shl edi, 14;
mov returnValue, edi;
}
return returnValue;
#endif
}
The C version appears to work about 2.2 times slower than C's sqrt()
on a float and the assembly at about 1.5.
Was just wondering if anyone had any recommendations? I am relatively
pleased to have gotten that close to the standard library, but
certainly wouldn't complain if I could get more speed and/or more
decimal bits with a different approach.
Thank you,
Chris Williams
|
|
0
|
|
|
|
Reply
|
spamtrap
|
11/25/2004 4:14:06 AM |
|
On 24 Nov 2004 20:14:06 -0800, spamtrap@crayne.org (Chris Williams)
wrote:
>I will be working with an embedded system with no FPU and need to be
>able to find the square root of a 16.16 fixed point integer.
>
>My current design is using a lookup table where the index is the root
>(with two decimal bits) of the value pointed to (with four decimal
>bits.) This doesn't cover every possible value, so I do a search for
>the nearest value that is contained in the array.
int sqrttable[256]; /* full of sqrt(0..255) */
value=34758934;
sh=0;
while (value>=256)
{
value>>=2;
sh++;
}
root=table[value]<<sh;
increase table size for more precision
Jim
|
|
0
|
|
|
|
Reply
|
Jim
|
11/25/2004 6:38:18 AM
|
|
spamtrap@crayne.org (Chris Williams) writes:
> I will be working with an embedded system with no FPU and need to be
> able to find the square root of a 16.16 fixed point integer.
>
> My current design is using a lookup table where the index is the root
> (with two decimal bits) of the value pointed to (with four decimal
> bits.) This doesn't cover every possible value, so I do a search for
> the nearest value that is contained in the array.
>
> Building the table:
>
> sqrtTable[0x400];
> for (L = 0; L < 0x400; L++) {
> sqrtTable[L] = (L * L);
> }
>
> (Actually I write the results out to a file for direct compilation in)
I trust that the recorded values are actually root of the value
being pointed to and not the square, as the above indicates.
> And my current function is as such:
>
> __inline unsigned int fsqrt(unsigned int value) { //square root
> function which works on a 16/16 fixed point integer
[SNIP - binary chop]
> }
>
> The C version appears to work about 2.2 times slower than C's sqrt()
> on a float and the assembly at about 1.5.
> Was just wondering if anyone had any recommendations? I am relatively
> pleased to have gotten that close to the standard library, but
> certainly wouldn't complain if I could get more speed and/or more
> decimal bits with a different approach.
Use Newton-Raphson.
It's easiest to find 1/sqrt(x), and then post-multiply that by x.
Phil
--
.... one Marine noticed one of the prisoners was still breathing. A Marine
can be heard saying on the pool footage provided to Reuters Television:
"He's fucking faking he's dead. He faking he's fucking dead."
"The Marine then raises his rifle and fires into the man's head.
The pictures are too graphic for us to broadcast," Sites said.
|
|
0
|
|
|
|
Reply
|
Phil
|
11/25/2004 11:16:21 AM
|
|
> int sqrttable[256]; /* full of sqrt(0..255) */
>
> value=34758934;
>
> sh=0;
> while (value>=256)
> {
> value>>=2;
> sh++;
> }
> root=table[value]<<sh;
>
> increase table size for more precision
>
> Jim
Tried this and got around .5<->.6 the speed of C's sqrt() on floats.
Which I like =)
But it also appears to only have half the precision of the bit count
of the array length. (ie, 256 = 8 bits, so the first 4 bits will be
accurate.) To get the level of precision I have now, I would need to
make my array 1,048,575 (0x100000) in length--but I don't think I can
afford to go over 4096.
I will be doing a good bit of trigonometry so only having 2-bits (up
to 1/4 of a unit) of accuracy is already quite worrying. I don't think
I could afford to lose any speed from my current implementation, but
any extra precision would by wonderful.
-Chris
|
|
0
|
|
|
|
Reply
|
spamtrap
|
11/25/2004 7:22:57 PM
|
|
"Chris Williams" <spamtrap@crayne.org> wrote in message
news:da271e02.0411242014.31d3f168@posting.google.com...
[...]
> The C version appears to work about 2.2 times slower than C's sqrt()
> on a float and the assembly at about 1.5.
> Was just wondering if anyone had any recommendations? I am relatively
> pleased to have gotten that close to the standard library, but
> certainly wouldn't complain if I could get more speed and/or more
> decimal bits with a different approach.
I poked around not too long ago in search of the exact same. I have a
non-LUT solution which is about 2-2.5 times slower than sqrt() on my Athlon
with gcc -O3 -finline-functions. The actual speed depends on the data, and
2.5 seems to be the worst case. While poking around I also found an integer
sqrt() which ranges from 1.0 to 1.5 times slower than C's sqrt() function. I
have not taken the time to fully understand it, and thus I have no
fixed-point version. All of the algorithms in my code are credited to the
following page: http://www.azillionmonkeys.com/qed/sqroot.html
I compared Newton-Rhapson to these algorithms also, and I seem to recall
Newton-Rhapson being slower on average, but I'm not 100% certain. For 32-bit
values, you need 5 iterations of Newton-Rhapson to guarantee convergence.
Note that all the algorithms I am discussing give full precision. My
implementation of the square root functions follow:
// Size of mantissa in bits.
#define MANTISSA (10)
unsigned long lzc(unsigned long x)
{
#ifdef __GNUC__
asm("bsrl %1, %0" : "=r" (x) : "r" (x));
#else
_asm
{
mov eax, x
bsr eax, eax
mov x, eax
}
#endif /* __GNUC__ */
return x;
}
// Algorithm is based on polynomial arithmetic:
// 1) Compute initial guess using 2^(msb_index / 2)
// 2) Iterate over remaining bits
// 3) Set bit to 1 if (1 << bit_index) < (x - guess^2)
fix fixsqrt(fix x)
{
fix guess, gsqr, bsqr, gxb;
unsigned int i, bitmask;
if (x == 0)
return 0;
// Compute guess
i = (((lzc(x) - MANTISSA) / 2) + MANTISSA);
guess = 1 << i;
// Compute mask containing next bit to be checked
bitmask = (1 << i) >> 1;
// Cache guess^2, bitmask^2, and 2*guess*bitmask for fast polynomial
// eval.
gsqr = guess << (i - MANTISSA); // guess * guess
gxb = guess << (i - MANTISSA); // guess * bitmask * 2
bsqr = (bitmask << (i - MANTISSA)) >> 1; // bitmask * bitmask
// Determine remaining bits
do
{
// (guess + bitmask)^2 == guess^2 + 2*guess*bitmask + bitmask^2
// == gsqr + gxb + bsqr
if ((gsqr + gxb + bsqr) <= x)
{
guess |= bitmask;
gsqr += (gxb + bsqr);
gxb += (bsqr << 1);
}
bitmask >>= 1;
bsqr >>= 2;
gxb >>= 1;
}
while(bitmask != 0);
return guess;
}
// The integer square root function. Copied directly off the webpage;
// seems to pick 1 bit per iteration, but I haven't looked closely at it.
int isqrt(int x)
{
unsigned long temp, g=0, b, bshft;
bshft = lzc(x) / 2;
b = 1 << bshft;
do {
if (x >= (temp = (((g<<1)+b)<<bshft--))) {
g += b;
x -= temp;
}
} while (b >>= 1);
return g;
}
-Matt
|
|
0
|
|
|
|
Reply
|
Matt
|
11/26/2004 4:06:50 AM
|
|
> > sqrtTable[0x400];
> > for (L = 0; L < 0x400; L++) {
> > sqrtTable[L] = (L * L);
> > }
> >
> > (Actually I write the results out to a file for direct compilation in)
>
> I trust that the recorded values are actually root of the value
> being pointed to and not the square, as the above indicates.
No, the recorded values are the square.
sqrtTable[0] = 0;
sqrtTable[1] = 1;
sqrtTable[2] = 4;
sqrtTable[3] = 9;
If I want to find the root of 5, then I do a search for either 5 in
the array or the first number below it.
Iteration 0: //binary search
high = 3
low = 0
curr = 1
sqrtTable[curr] = 1
Iteration 1:
high = 3
low = 1
curr = 2
sqrtTable[curr] = 4
Iteration 2:
high = 3
low = 2
curr = 2
sqrtTable[curr] = 4
High and low have converged, so I return low.
Will look up Newton-Raphson.
-Chris
|
|
0
|
|
|
|
Reply
|
spamtrap
|
11/26/2004 4:06:53 AM
|
|
> Use Newton-Raphson.
> It's easiest to find 1/sqrt(x), and then post-multiply that by x.
>
> Phil
My test function for Newton-Raphson is coming in at about 5 times
slower than C's sqrt. (I assume because of the multiplication and
division.) 10 iterations seems to be the minimum needed for a 16 bit
value to return an accurate 8 bit value.
Unless I have done something incorrect, this would be 2 to 3 times
slower, with two fewer bits of precision than my current
implementation. Could you please review it?
int mysqrt(int value) {
int temp;
int result = value >> 1;
result += 1;
for (int L = 0; L < 10; L++) {
temp = result * result;
temp -= value;
temp /= (result << 1);
result = result - temp;
}
return result;
}
I was uncertain how one is supposed to generate the seed value.
-Chris
|
|
0
|
|
|
|
Reply
|
spamtrap
|
11/26/2004 4:06:57 AM
|
|
"Phil Carmody" <thefatphil_demunged@yahoo.co.uk> wrote in message news:87is7utcef.fsf@nonospaz.fatphil.org...
> spamtrap@crayne.org (Chris Williams) writes:
>
> > I will be working with an embedded system with no FPU and need to be
> > able to find the square root of a 16.16 fixed point integer.
> >
[...]
>
> Use Newton-Raphson.
> It's easiest to find 1/sqrt(x), and then post-multiply that by x.
I second Phil's recommendation. This works best if one first normalizes
the input, creating something akin to a floating-point mantissa. A table
of 1-byte starting approximations, indexed by some most significant bits
of the normalized inout, results in 9-bit initial approximations. Iterate
twice using the NR iteration for 1/sqrt, then post multiply. If an exact
result is needed, square the result, compute the remainder, and adjust
the result accordingly.
The code below (tested exhaustively) computes the 16.16 square root using
an interesting algorithm I found many years ago in IEEE TOC. The result is
exact, i.e. isqrt(x)*isqrt(x) <= x < (isqrt(x)+0.0001)*(isqrt(x)+0.0001).
The speed of the code below depends very much on the speed of the integer
multiplications. On my Athlon processor, I measured a worst case execution
time of 136 cycles, and an average execution time of 112 cycles. A Newton-
Raphson approach may get you down to 70 cycles or so, but I don't have code
for that handy. When I wrote the code below, I was mostly interested in
exploring the algorithm.
-- Norbert
/*
* 16.16 fixed-point square root based on: Reza Hashemian, "Square Rooting
* Algorithms for Integer and Floating-Point Numbers", IEEE Transactions on
* Computers, Vol. 39, No. 8, August 1990, pp 1025-1029
*/
__declspec(naked) __stdcall isqrt (unsigned x)
{
__asm {
mov eax, [esp+4] ; x
bsr ecx, eax ; n = (31 - clz(x)); ZF = (x == 0)
jz $done ; sqrt(0) = 0
push ebx ; save per calling convention
push esi ; save per calling convention
push eax ; save x
; normalize argument
not ecx ; ~n = clz(x) mod 32
shl eax, cl ; z = x << (~n)
mov esi, ecx ; ~n
; compute initial approximation
mov ecx, eax ; oldz = z
neg eax ; -z
shr eax, 1 ; z = ((unsigned)-z) >> 1
mov ebx, eax ; z
mul eax ; EDX = t = ((((u64)z)*z) >> 32)
shr edx, 1 ; t >> 1
lea eax, [ebx + edx] ; y = z + (t >> 1)
; determine necessary number of iterations
shr ecx, 28 ; (oldz >> 28)
sub ecx, 8 ; (oldz >> 28) - 8
imul ecx, 7 ; each iteration comprises 7 bytes
add ecx, OFFSET $start ; compute target address
jmp ecx ; jump to appropriate iteration
$start:
; y = eax, z = ebx, t = edx
mul eax ; EDX = t = ((((u64)y)*y) >> 32)
shr edx, 1 ; t >> 1
lea eax, [ebx + edx] ; y = z + (t >> 1)
mul eax ; EDX = t = ((((u64)y)*y) >> 32)
shr edx, 1 ; t >> 1
lea eax, [ebx + edx] ; y = z + (t >> 1)
mul eax ; EDX = t = ((((u64)y)*y) >> 32)
shr edx, 1 ; t >> 1
lea eax, [ebx + edx] ; y = z + (t >> 1)
mul eax ; EDX = t = ((((u64)y)*y) >> 32)
shr edx, 1 ; t >> 1
lea eax, [ebx + edx] ; y = z + (t >> 1)
mul eax ; EDX = t = ((((u64)y)*y) >> 32)
shr edx, 1 ; t >> 1
lea eax, [ebx + edx] ; y = z + (t >> 1)
mul eax ; EDX = t = ((((u64)y)*y) >> 32)
shr edx, 1 ; t >> 1
lea eax, [ebx + edx] ; y = z + (t >> 1)
mul eax ; EDX = t = ((((u64)y)*y) >> 32)
shr edx, 1 ; t >> 1
lea eax, [ebx + edx] ; y = z + (t >> 1)
mul eax ; EDX = t = ((((u64)y)*y) >> 32)
shr edx, 1 ; t >> 1
lea eax, [ebx + edx] ; y = z + (t >> 1)
mul eax ; EDX = t = ((((u64)y)*y) >> 32)
shr edx, 1 ; t >> 1
lea eax, [ebx + edx] ; y = z + (t >> 1)
mul eax ; EDX = t = ((((u64)y)*y) >> 32)
shr edx, 1 ; t >> 1
add ebx, edx ; y = z + (t >> 1)
neg ebx ; y = -y = preliminary result
adc ebx, 0xffffffff ; adjust for argument of 0xffffffff
; multiply root by sqrt(2) arguments with "odd exponent"
mov eax, 0xb504f334 ; sqrt(2)
mul ebx ; (((u64)y)*0xb504f334)
shr esi, 1 ; ~n >> 1; CF = ~n & 1
cmovc ebx, edx ; y = (n & 1) ? (((u64)y)*0xb504f334) >> 32 : y
; denormalize result
lea ecx, [esi+8+16] ; (~n >> 1) + 8 + 16
shr ebx, cl ; y = (y >> ((~n >> 1) + 8 + 16))
pop ecx ; x
; compute remainder and adjust result to get exact square root
mov eax, ebx ; y
mul ebx ; EDX:EAX = y*y
mov esi, ecx ; x
shl ecx, 16 ; ESI:ECX = x << 16
shr esi, 16 ; x >> 16
sub ecx, eax ;
sbb esi, edx ;
sbb ebx, 0 ; y = ((y*y) > (x << 16)) ? y-- : y
mov eax, ebx ; return y = isqrt(x)
pop esi ; restore per calling convention
pop ebx ; restore per calling convention
$done:
ret 4 ; pop function argument and return
}
}
|
|
0
|
|
|
|
Reply
|
Norbert
|
11/26/2004 8:38:37 AM
|
|
spamtrap@crayne.org (Chris Williams) writes:
> > Use Newton-Raphson.
> > It's easiest to find 1/sqrt(x), and then post-multiply that by x.
> >
> > Phil
>
> My test function for Newton-Raphson is coming in at about 5 times
> slower than C's sqrt.
It's quadratic convergence, so start with a lookup table.
> (I assume because of the multiplication and
> division.)
There is no division apart from shifting right by a bit.
> 10 iterations seems to be the minimum needed for a 16 bit
> value to return an accurate 8 bit value.
Something's very wrong, or your initial estimate is abyssmal.
> Unless I have done something incorrect, this would be 2 to 3 times
> slower, with two fewer bits of precision than my current
> implementation. Could you please review it?
>
> int mysqrt(int value) {
> int temp;
> int result = value >> 1;
> result += 1;
>
> for (int L = 0; L < 10; L++) {
> temp = result * result;
> temp -= value;
> temp /= (result << 1);
>
> result = result - temp;
> }
>
> return result;
> }
Easier to just point you to Lynn Killingbeck's solution off
Dave Rusin's Atlas of Known Mathematics:
http://www.math.niu.edu/~rusin/known-math/00_incoming/sqrt
<<<
Take it further. Presume that your machine does not even have a divide
instruction! Now, the (x+a/x)/2 is also worthless, since the division is
not available. This requires another form of NR. In this case, one that
works is f(x) = 1/x^2 - a = 0. This x will not be what you want, since
this x will be 1/x^(1/2). But, once you have it, you can multiply by 'a'
to get the desired answer. It sounds like nonsense: to get sqrt(x),
first compute 1/sqrt(x) and then multiply by x. But, getting 1/sqrt(x)
can be done without any division, and this 'nonsense' gives a very
useable method. (From experience, since I've worked with computers that
do not have a divide instruction.)
f(x) = 1/x^2 - a
f'(x) = -2 / x^3
x <== x - (1/x^2 - a) / [-2 / x^3]
= x + (x - a * x^2) / 2
= x + x * (1 - a * x^2)/2
= x * (3 - a * x^2) / 2
Again, check that awkward ASCII algebra. This time, there is also a
constraint on the closeness of the original value of x.
>>>
> I was uncertain how one is supposed to generate the seed value.
Small lookup table. 8 bits should be plenty.
Phil
--
.... one Marine noticed one of the prisoners was still breathing. A Marine
can be heard saying on the pool footage provided to Reuters Television:
"He's fucking faking he's dead. He faking he's fucking dead."
"The Marine then raises his rifle and fires into the man's head.
The pictures are too graphic for us to broadcast," Sites said.
|
|
0
|
|
|
|
Reply
|
Phil
|
11/27/2004 4:32:44 AM
|
|
"Phil Carmody" <thefatphil_demunged@yahoo.co.uk> wrote in message news:87r7mgowt5.fsf@nonospaz.fatphil.org...
> spamtrap@crayne.org (Chris Williams) writes:
>
> > > Use Newton-Raphson.
> > > It's easiest to find 1/sqrt(x), and then post-multiply that by x.
> > >
> > > Phil
> >
> > My test function for Newton-Raphson is coming in at about 5 times
> > slower than C's sqrt.
>
> It's quadratic convergence, so start with a lookup table.
[...]
> Take it further. Presume that your machine does not even have a divide
> instruction! Now, the (x+a/x)/2 is also worthless, since the division is
> not available. This requires another form of NR. In this case, one that
> works is f(x) = 1/x^2 - a = 0. This x will not be what you want, since
> this x will be 1/x^(1/2). But, once you have it, you can multiply by 'a'
> to get the desired answer. It sounds like nonsense: to get sqrt(x),
> first compute 1/sqrt(x) and then multiply by x. But, getting 1/sqrt(x)
> can be done without any division, and this 'nonsense' gives a very
> useable method. (From experience, since I've worked with computers that
> do not have a divide instruction.)
> f(x) = 1/x^2 - a
> f'(x) = -2 / x^3
> x <== x - (1/x^2 - a) / [-2 / x^3]
> = x + (x - a * x^2) / 2
> = x + x * (1 - a * x^2)/2
> = x * (3 - a * x^2) / 2
> Again, check that awkward ASCII algebra. This time, there is also a
> constraint on the closeness of the original value of x.
> >>>
>
> > I was uncertain how one is supposed to generate the seed value.
>
> Small lookup table. 8 bits should be plenty.
I apologize for the absence of any x86 assembly code in this post, but
I did not have time to make an x86 assembly language version of the code
below. It may be easier to glimpse the algorithm from C source anyhow.
The code below is a fully functional (and exhaustively tested) demo of
the technique Phil and I have recommended. A few notes:
1. The computation of "prod" maps directly to the x86 MUL instruction.
2. (prod >> 32) is simply the high 32 bits of the MUL result, i.e. EDX
3. Use the BSR instruction to determine "lz".
4. A smaller table for the initial approximation to 1/sqrt(x) is likely
possible. I started out with what looked like a reasonable size and
did not spend any time optimizing the table.
5. Optimization of the fixed-point representations used may be possible.
-- Norbert
unsigned char srtTab[128];
for (k = 0; k < 64; k++) {
srtTab[k]=(unsigned char)(((1.0/sqrt(2.0+(k+0.5)/32.0))-0.5)*512+0.5);
}
for (k = 0; k < 64; k++) {
srtTab[k+64]=(unsigned char)(((1.0/sqrt(1.0+(k+0.5)/64.0))-0.5)*512+0.5);
}
/* 16.16 fixed-point square root using NR iterations for 1/sqrt(x) */
unsigned nr_isqrt (unsigned x)
{
unsigned __int64 prod;
unsigned r, s, f;
int rem;
unsigned oldx = x;
int lz = 0;
/* normalize argument */
while (((int)x) > 0) {
x <<= 1;
lz++;
}
/* initial approximation */
r = srtTab[((lz & 1) << 6) | ((x << 1) >> 26)] | 0x100;
x >>= (lz & 1);
/* first NR iteration */
s = r * r;
prod = ((unsigned __int64)x) * (s << 14);
f = 0xC0000000 - (prod >> 32);
prod = ((unsigned __int64)(r << 23)) * f;
r = prod >> 32;
/* second NR iteration */
prod = ((unsigned __int64)r) * (r << 1);
s = prod >> 32;
prod = ((unsigned __int64)x) * (s << 1);
f = 0xC0000000 - (prod >> 32);
prod = ((unsigned __int64)r) * f;
r = prod >> 30;
/* compute sqrt(x) as x * 1/sqrt(x) */
prod = ((unsigned __int64)x) * r;
r = prod >> 32;
/* denormalize result */
r = (r >> (7 + (lz >> 1)));
/* adjustments to get the exact result */
r++;
rem = (((unsigned __int64)oldx) << 16) - (((unsigned __int64)r)*r);
if (rem < 0) { rem += 2*r-1; r--; }
if (rem < 0) { r--; }
return r;
}
|
|
0
|
|
|
|
Reply
|
Norbert
|
11/28/2004 2:05:15 AM
|
|
> Something's very wrong, or your initial estimate is abyssmal.
Cool. Then it can be fixed =)
> Easier to just point you to Lynn Killingbeck's solution off
> Dave Rusin's Atlas of Known Mathematics:
> http://www.math.niu.edu/~rusin/known-math/00_incoming/sqrt
> <<<
Hmm, was always better at solving for zero than simply rearranging the
bits around. Unfortunately, I truly dislike using code without
understanding what it is doing. =(
x - ((x^2 - a) / 2x) == (x + (a/x))/2
Which looks a bit better on paper but still boggled my mind for a bit
until I recalled that 3 - 1/4 == 11/4 and such (multiply be the bottom
and put up top.)
x - (x^2 - a) / 2x //multiply leading x by the 2x on the
bottom
= (2x^2 - (x^2 - a)) / 2x //which will still be subtracting the top
= (2x^2 - x^2 + a) / 2x //have to flip the sign!
= (x^2 + a) / 2x //and do the subtraction
then if we had say 2x/3x this would equate to ((2x)/x)/3
(x^2 + a) / 2x
= ((x^2 + a)/x)/2 //divide by one of the multipliers on the bottom
= (x + (a/x))/2 //and do the division
Which makes me quite joyful.
> Take it further. Presume that your machine does not even have a divide
> instruction! Now, the (x+a/x)/2 is also worthless, since the division is
> not available. This requires another form of NR. In this case, one that
> works is f(x) = 1/x^2 - a = 0. This x will not be what you want, since
> this x will be 1/x^(1/2). But, once you have it, you can multiply by 'a'
> to get the desired answer. It sounds like nonsense: to get sqrt(x),
> first compute 1/sqrt(x) and then multiply by x. But, getting 1/sqrt(x)
> can be done without any division, and this 'nonsense' gives a very
> useable method. (From experience, since I've worked with computers that
> do not have a divide instruction.)
Giving a stab at this one:
f(x) = 1/x^2 - a
f'(x) = -2 / x^3
P x <== x - (1/x^2 - a) / [-2 / x^3] //start
C = x - (x^3 * (1/x^2 - a)) / -2 //perhaps?
C = x - (x^3 * 1/x^2 - a * x^3) / -2
C = x - (x - a * x^3) / -2 //hmmm
C = x + (-x + a * x^3) / 2 //?
P = x + (x - a * x^2) / 2
C = x + (-x + a * x^3) / 2
C = (2x + -x + a * x^3) / 2
C = (x + a * x^3) / 2
P = x + x * (1 - a * x^2)/2
P = x * (3 - a * x^2) / 2
I tried plugging numbers in, and am getting the same result with my
equations as the starting point but not with yours.
But more importantly, if I set
x = 3;
a = 4;
Then I get the result 55.5
And if I set
x = 2
a = 4
Then I get 17, from my equation. But, I don't understand what these
numbers have to do with 1/sqrt(x).... (Similar with your equation,
except the results are all negative.) And trying to iterate x back
through has the number getting larger. x = 55.5 -> 350,000 about
Thank you,
Chris Williams
PS, Apologies for the latency, I am posting through Google and they
don't submit for anywhere from 3 hours to a day or so.
|
|
0
|
|
|
|
Reply
|
spamtrap
|
11/29/2004 6:46:28 AM
|
|
> I apologize for the absence of any x86 assembly code in this post, but
> I did not have time to make an x86 assembly language version of the code
> below.
No problem, was mostly just testing in x86 assembly to see what sort
of speed increase I could hope for with my conversion skills. I am
still uncertain what processor type it is that I will be working with.
> 2. (prod >> 32) is simply the high 32 bits of the MUL result, i.e. EDX
This is a bit worrying as I am still not certain what processor it
will be and, having never done any assembly outside of x86, am not
certain of the odds as to whether there will be a 64 bit output for
multiplication.
Could probably get around this if I understood the code though.
> 3. Use the BSR instruction to determine "lz".
Yup. Also purchased "Hacker's Delight" over the weekend, so will be
able to grab something out of there if there is no comparable
instruction on the device.
> 4. A smaller table for the initial approximation to 1/sqrt(x) is likely
> possible. I started out with what looked like a reasonable size and
> did not spend any time optimizing the table.
I looked at this and have very little idea of what you are doing. Both
you and Phil have suggested a small lookup table--but I don't
understand how one relates 2^16 or 2^32 possible values to a table
with only 128 entries? Obviously you are doing it, I just don't
understand what that relationship is, nor what data it is that you are
putting in there... =(
At the moment, I don't think I could copy and paste your code for my
application as it assumes an __int64, but I do feel like if I
understood the table-creation/access you have shown and the math Phil
has presented then I will be able to make something that would be
impressive (to myself =) .)
Thank you,
Chris Williams
|
|
0
|
|
|
|
Reply
|
spamtrap
|
11/29/2004 6:46:37 AM
|
|
I see many mentions of Newton-Raphson, which puts an expensive divide
in the loop, so I wanted to offer another algorithm. This technique,
though quite linear in its convergence, uses rotates and a single
subtraction per iteration. You loop more, but the loops are tiny.
This was designed to be called from C, so the C stack frame was the
only thing preserved. Also, the argument is an unsigned integer, so
there was no need to check the validity of whatever got passed. The
"8" in "mov cx,8" (just before Step1) is half the number of bits in the
argument.
I'll state the obvious: for a float, divide the exponent by two, and
run the magnitude through this routine. (If the exponent is an odd
number, denormalize.)
Here's some Turbo Assembler and Turbo C code.
..MODEL small
..CODE
PUBLIC _Sqroot
_Sqroot PROC
mov ax,0
mov dx,0
push bp
mov bp,sp
mov es,[bp+4]
mov bp,es
;
mov cx,8
Step1:
clc
rcl ax,1
stc
rcl ax,1
clc ;needed for floating point variation
rcl bp,1
rcl dx,1
rcl bp,1
rcl dx,1
;
mov bx,dx ;compare
sub bx,ax
;
jc Below
mov dx,bx
inc ax
rcr ax,1
loop Step1
jmp OutOfHere
;
Below:
clc
rcr ax,1
loop Step1
;
OutOfHere:
pop bp
ret
_Sqroot ENDP
END
extern unsigned Sqroot(unsigned n);
main()
{
unsigned i;
unsigned j=65535;
for(i=0;i<65535;++i)
printf("%u %u\n",i,Sqroot(i));
printf("%u %u\n",j,Sqroot(j));
}
By the way, 15 or 16 years ago, this was my first IBM PC (80186!)
assembler program, so I can't say that it's optimized, and I am not
sure what my "floating point variation" would have looked like, had I
ever written one. (I'll make no comments about the sparse comments,
newbie that I was.) I recently got re-interested in Assembler, and
visited this group for the first time a couple of days ago.
If you go with Newton-Raphson, I'd try dividing the exponent by 2 to
get the seed value. Also, I've half convinced myself that the
magnitude's square root should have about half as many bits, and that I
may be able to make one more easy starting adjustment, but I haven't
given it much thought.
- Dan
|
|
0
|
|
|
|
Reply
|
Dan
|
12/6/2004 8:31:00 PM
|
|
|
12 Replies
233 Views
(page loaded in 0.005 seconds)
Similiar Articles: nonlinear newton raphson iteration code - comp.soft-sys.matlab ...... C4=B1 need nonlinear newton raphson iteration code in order to find the > roots ... Please Help with My Fixed Point Iteration Program - comp.soft-sys ... nonlinear newton ... root finding routine - comp.soft-sys.math.mathematicaFixed Point Square Root Improvements? - comp.lang.asm.x86 ..... 64 bit Windows at the moment, adding as many improvements ... Fixed Point/Floating Point Arithmetic Books ... Fixed Point/Floating Point Arithmetic Books - comp.dspFixed Point Square Root Improvements? - comp.lang.asm.x86 ... Math logic - comp.lang.asm.x86... 64 bit Windows at the moment, adding as many improvements ... CRC lookup table: why 0xC0C1? - comp.compressionI'm studying the table-driven CRC implementation for CRC-16 (poly 0x8005). Now, in the standard CRC-16 table (http://www.xs4all.nl/~koekeroe/hacked/cr... Code for sparse stiffness matrix assembly - comp.lang.fortran ...2D Unstructured mesh in the unitary square - comp.soft-sys.matlab ... Code for sparse stiffness matrix assembly - comp.lang.fortran ... 2d unstructured mesh generation ... Message popping from root - comp.unix.solarisHi, There is a server kept popping the following message: Broadcast Message from root (???) on Tue Jun 3 15:33:45... "Communication has been lost"... optimization algorithm on lagrange multiplier method - comp.soft ...Can any one tell how to do optimization using lagrange multiplier method?Where can we find the algorithms and programs based on this? ... Division algorithm - comp.lang.asm.x86Fixed Point Square Root Improvements? - comp.lang.asm.x86 ... Division algorithm - comp.lang.asm.x86 Suitable Integer Square Root Algorithm for 32-64-Bit Integers on ... Table for x86 arithmetic instructions - comp.lang.asm.x86 ...Fixed Point Square Root Improvements? - comp.lang.asm.x86 ..... of "prod" maps directly to the x86 MUL instruction. ... if there is no comparable instruction on the ... while loop perfect square - comp.soft-sys.matlabFixed Point Square Root Improvements? - comp.lang.asm.x86 ..... system with no FPU and need to be able to find the square ... unsigned int curr = 0x1ff; value >>= 12 ... Bit-Shifting in ASM - comp.lang.asm.x86Fixed Point Square Root Improvements? - comp.lang.asm.x86 ... There is no division apart from shifting right by a bit. > 10 iterations seems to be ... CRC, add 16 parity bits in MATLAB - comp.soft-sys.matlab ...Hello, Please help me... I need to add 16 parity bits in matlab to a binary message, in order to form an error detection code C, transmit the mes... [need help]: how to automatically detect which number is mostly ...Dear experts, I have an array of numbers (from 0 ~ unknown) which is needed to be sorted based on the amount of occurance of each number but number... Unsigned Integer Overflow on Multiplication and Division - comp ...Fixed Point Square Root Improvements? - comp.lang.asm.x86 ... > And my current function is as such: > > __inline unsigned int fsqrt(unsigned ... (I assume because of the ... generate N perfect numbers. - comp.lang.prologFixed Point Square Root Improvements? - comp.lang.asm.x86 ... generate N perfect numbers. - comp.lang.prolog... below sqrt(X), so there's a bit of room for improvement ... Fixed-Point Square RootFixed Point Square Root Ken Turkowski Media Technologies: Computer Graphics Advanced Technology Group Apple Computer, Inc. Abstract: The square root of a fixed-point ... Fixed-point calculations (and square roots) in ClojureWe can now wrap up the refactoring with the following small function that uses the generic fixed-point to calculate the square root: (defn sqrt [x tolerance] (fixed ... 7/23/2012 11:40:59 AM
|