
### Hex to ascii


```Hi, I need to convert a hex byte into its 2 byte ascii number. What
I'm doing is building a file name based on a single byte. The file
name is mapxx where xx is a 2 digit ascii number, such as map12.
Anyway my question is if my routine is called with a value of 0Ch, how
do I convert that into the needed 31h 32h, which is plugged into the
file name to make map12?

```


```Displacer <spamtrap@crayne.org> wrote in part:
> Hi, I need to convert a hex byte into its 2 byte ascii
> number. What I'm doing is building a file name based on a
> single byte. The file name is mapxx where xx is a 2 digit
> ascii number, such as map12.  Anyway my question is if my
> routine is called with a value of 0Ch, how do I convert
> that into the needed 31h 32h, which is plugged into the
> file name to make map12?

Very common question.  Assuming hex < 99 in AL:

AAM
OR  ax, 3030h

two ASCII digits are in AX.  The AAM instruction
(ASCII Adjust after Multiply) does the work for you.

-- Robert

```

```On Wed, 13 Sep 2006 17:46:18 GMT, Robert Redelmeier
<redelm@ev1.net.invalid> wrote:

>Displacer <spamtrap@crayne.org> wrote in part:
>> Hi, I need to convert a hex byte into its 2 byte ascii
>> number. What I'm doing is building a file name based on a
>> single byte. The file name is mapxx where xx is a 2 digit
>> ascii number, such as map12.  Anyway my question is if my
>> routine is called with a value of 0Ch, how do I convert
>> that into the needed 31h 32h, which is plugged into the
>> file name to make map12?
>
>Very common question.  Assuming hex < 99 in AL:
>
>  AAM
>  OR ax, 3030h

Hmmm, the routine I use is closer to:

aam
xchg    ah,al
followed by
stosw
to store the result in the output buffer

>
>two ASCII digits are in AX.  The AAM instruction
>(ASCII Adjust after Multiply) does the work for you.
>
>-- Robert
--
ArarghMail609 at [drop the 'http://www.' from ->] http://www.arargh.com
BCET Basic Compiler Page: http://www.arargh.com/basic/index.html

```

```Rats, after I went to all the trouble of going the route of div, bx,
etc., etc.
Thanks for the shortcut!

On Wed, 13 Sep 2006 17:46:18 GMT, Robert Redelmeier
<redelm@ev1.net.invalid> wrote:

>Displacer <spamtrap@crayne.org> wrote in part:
>> Hi, I need to convert a hex byte into its 2 byte ascii
>> number. What I'm doing is building a file name based on a
>> single byte. The file name is mapxx where xx is a 2 digit
>> ascii number, such as map12.  Anyway my question is if my
>> routine is called with a value of 0Ch, how do I convert
>> that into the needed 31h 32h, which is plugged into the
>> file name to make map12?
>
>Very common question.  Assuming hex < 99 in AL:
>
>  AAM
>  OR ax, 3030h
>
>two ASCII digits are in AX.  The AAM instruction
>(ASCII Adjust after Multiply) does the work for you.
>
>-- Robert

```

```----- Original Message -----
From: "Robert Redelmeier" <redelm@ev1.net.invalid>
Newsgroups: comp.lang.asm.x86
Sent: Wednesday, September 13, 2006 1:46 PM
Subject: Re: Hex to ascii

"Robert Redelmeier" <redelm@ev1.net.invalid> wrote in message
news:KtXNg.715$vJ2.191@newssvr12.news.prodigy.com...
> Displacer <spamtrap@crayne.org> wrote in part:
> > Hi, I need to convert a hex byte into its 2 byte ascii
> > number. What I'm doing is building a file name based on a
> > single byte. The file name is mapxx where xx is a 2 digit
> > ascii number, such as map12.  Anyway my question is if my
> > routine is called with a value of 0Ch, how do I convert
> > that into the needed 31h 32h, which is plugged into the
> > file name to make map12?
>
> Very common question.  Assuming hex < 99 in AL:
>
>   AAM
>   OR ax, 3030h
>
> two ASCII digits are in AX.  The AAM instruction
> (ASCII Adjust after Multiply) does the work for you.
>

Nice!  I'm thinking that would produce a nice nybble split for byte hex
routines if you use 'AAM imm8' with 10H for imm8... correct?
http://www.x86.org/secrets/opcodes/aam.htm

You must've missed Randall Hyde's 8/31 post "Faster HexToBuffer Routines"
since you didn't suggest using 'AAM imm8' in my or Chuck's brute force hex
routines.

Rod Pemberton

```

```Displacer wrote:
> Rats, after I went to all the trouble of going the route of div, bx,
> etc., etc.
> Thanks for the shortcut!

Do remember one thing, though, AAM *is* a slow instruction (as is DIV,
btw), so although this is a very *short* implementation of a two-digit
conversion routine, it's also very slow. If speed, at the expense of
space, is more important to you, you might consider the code I've been
working on for the HLA standard library. Probably a 100 times bigger,
but *considerably* faster.  The following code, which prints a 32-bit
value, not just a byte, was posted to alt.lang.asm:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
program t;
#include( "stdlib.hhf" )

procedure _hexToBuf32
(
d       :dword in eax;
width   :dword in ecx;
var buffer  :char in edi
);
@noframe;
@nodisplay;
@noalignstack;

noUSjt  :dword[9] :=
[
&noUS1,
&noUS2,
&noUS3,
&noUS4,
&noUS5,
&noUS6,
&noUS7,
&noUS8
];

#macro emitXDigit( posn );

mov( eax, edx );
shr( posn*4, edx );
and( $f, edx );                     // Strip out unwanted bits.

mov( stdlib.HexDigits[edx], dl );   // Convert digit to hex char.
mov( dl, [edi-posn] );

#endmacro

begin _hexToBuf32;

push( eax );
push( ecx );
push( edx );

// Drop down here if we're not outputting underscores in between
// groups of four digits in the number.

cmp( ecx, 8 );

jmp( noUSjt[ecx*4] );

noUS8:
emitXDigit(7);
noUS7:
emitXDigit(6);
noUS6:
emitXDigit(5);
noUS5:
emitXDigit(4);
noUS4:
emitXDigit(3);
noUS3:
emitXDigit(2);
noUS2:
emitXDigit(1);
noUS1:
mov( eax, edx );
and( $f, edx );                     // Strip out unwanted bits.

mov( stdlib.HexDigits[edx], dl );   // Convert digit to hex char.
mov( dl, [edi] );

htb32Done:
sub( ecx, edi );    // Point EDI at the first char in the buffer
pop( edx );
pop( ecx );
pop( eax );
ret();

raise( ex.WidthTooBig );

end _hexToBuf32;

procedure __hexToBuf32
(
d       :dword in eax;
width   :dword in ecx;
var buffer  :byte in edi
);
@noframe;
@nodisplay;
@noalignstack;

noUSjt  :dword[9] :=
[
&noUS1,
&noUS2,
&noUS3,
&noUS4,
&noUS5,
&noUS6,
&noUS7,
&noUS8
];

hexDigits   :char[16] :=
[
'0', '1', '2', '3',
'4', '5', '6', '7',
'8', '9', 'a', 'b',
'c', 'd', 'e', 'f'
];

hexTbl  :byte[512] :=
[
?i:byte := 0;
#while(i <= 254)
#if( i < $10 )
'0',
#else
char( @substr( string(byte(i)), 0, 1) ),
#endif
char( @substr( string(byte(i)), 1, 1) ),
?i := i + 1;
#endwhile
'F', 'F'
];

#macro emit2XDigits( shift, posn );

mov( eax, edx );
shr( shift, edx );
and( $ff, edx );
mov( (type word hexTbl[edx*2]), dx );
mov( dx, [edi-posn] );

#endmacro

begin __hexToBuf32;

push( eax );
push( ecx );
push( edx );

// If we're outputting underscores, then go handle the output
// using a special version of this routine. (Note: OutputUnderscores
// is a global object in the HLA Standard Library.)

cmp( ecx, 8 );

jmp( noUSjt[ecx*4] );

noUS8:
emit2XDigits( 24, 7);
noUS6:
emit2XDigits( 16, 5);
noUS4:
emit2XDigits( 8, 3 );
noUS2:
and( $ff, eax );
mov( (type word hexTbl[eax*2]), dx );
mov( dx, [edi-1] );
jmp htbDone;

noUS7:
emit2XDigits( 20, 6 );
noUS5:
emit2XDigits( 12, 4 );
noUS3:
emit2XDigits( 4, 2 );
noUS1:
and( $f, eax );                     // Strip out unwanted bits.

mov( hexDigits[eax], dl );  // Convert digit to hex char.
mov( dl, [edi] );

htbDone:
sub( ecx, edi );    // Point EDI at the first char in the buffer
pop( edx );
pop( ecx );
pop( eax );
ret();

raise( ex.WidthTooBig );

end __hexToBuf32;

var
start   :dword[2];
time1   :dword[2];
time2   :dword[2];
buffer  :byte[16];

begin t;

rdtsc();
mov( eax, start );
mov( edx, start[4] );
xor( eax, eax );
loopit1:

lea( edi, buffer[15] );
__hexToBuf32( eax, 8, [edi] );
jnz loopit1;

rdtsc();
sub( start, eax );
sbb( start[4], edx );
mov( eax, time1 );
mov( edx, time1[4] );

rdtsc();
mov( eax, start );
mov( edx, start[4] );
xor( eax, eax );
loopit2:

lea( edi, buffer[15] );
_hexToBuf32( eax, 8, [edi] );
jnz loopit2;

rdtsc();
sub( start, eax );
sbb( start[4], edx );
mov( eax, time2 );
mov( edx, time2[4] );
stdout.put( "Time1: " );
stdout.putd( time1[4] );
stdout.putd( time1[0] );
stdout.newln();
stdout.put( "Time2: " );
stdout.putd( time2[4] );
stdout.putd( time2[0] );
stdout.newln();

end t;

time1 (the new code that processes two characters at a time) reports:
Time1: 000000161DE70A84

time2 (the original code) reports:
Time2: 0000001E56CE2530

<<<<<<<<<<<<<<<<<<<<<<<
Cheers,
Randy Hyde

```

```Rod Pemberton <spamtrap@crayne.org> wrote in part:
> Nice!  I'm thinking that would produce a nice nybble split for byte hex
> routines if you use 'AAM imm8' with 10H for imm8... correct?
> http://www.x86.org/secrets/opcodes/aam.htm

Yes,  AAM  10h   splits nybbles, one of the undocs that pretty much
has been forced into the instruction set.

> You must've missed Randall Hyde's 8/31 post "Faster
> HexToBuffer Routines" since you didn't suggest using 'AAM
> imm8' in my or Chuck's brute force hex routines.

I saw it, but didn't think AAM applied -- it's compact
but slow.  OTOH, I don't know if AMD's MMX 32bit int to hex
routine was posted.  It's rather clever.

-- Robert

```

```If you want any kind of fast binary to ascii conversion, you should take
a close look at the code I posted here several years ago, and which AMD
borrowed, without attribution :-(, for their optimization guide.

The key is that MUL is so much faster than DIV (and the related A*
opcodes) that you can do quite a lot of funky stuff.

In the case of converting a binary value less than 100 to ascii, I'd
look into using multiplication by ceil(65536/10):

; CX has binary value
mov ax,6554
mul cx
; At this point DX(DL) will contain the most significant digit!
mov al,10
mul dl
; If MUL is relatively slow, use LEA and/or shifts to do the same!

sub cx,ax
mov dh,cl

Yes, this is quite a bit larger than using AAM, but it is also 2 to 5
times faster on almost all x86 cpus.

In 32-bit mode the difference is even larger, the code AMD "borrowed"
converts a full 32-bit unsigned integer to 10 decimal digits in less
time than the cpu needs for a single DIV opcode! :-)

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

```

```Terje Mathisen wrote:
> If you want any kind of fast binary to ascii conversion, you should take
> a close look at the code I posted here several years ago, and which AMD
> borrowed, without attribution :-(, for their optimization guide.
>
> The key is that MUL is so much faster than DIV (and the related A*
> opcodes) that you can do quite a lot of funky stuff.
>
> In the case of converting a binary value less than 100 to ascii, I'd
> look into using multiplication by ceil(65536/10):
>
> ; CX has binary value
>    mov ax,6554
>    mul cx
> ; At this point DX(DL) will contain the most significant digit!
>    mov al,10
>    mul dl
> ; If MUL is relatively slow, use LEA and/or shifts to do the same!
>
>    sub cx,ax
>    mov dh,cl
>
> Yes, this is quite a bit larger than using AAM, but it is also 2 to 5
> times faster on almost all x86 cpus.
>
> In 32-bit mode the difference is even larger, the code AMD "borrowed"
> converts a full 32-bit unsigned integer to 10 decimal digits in less
> time than the cpu needs for a single DIV opcode! :-)
>
> Terje

Regards
Jean-Francois Michaud

```

```Jean-François Michaud wrote:
> Terje Mathisen wrote:
>> In 32-bit mode the difference is even larger, the code AMD "borrowed"
>> converts a full 32-bit unsigned integer to 10 decimal digits in less
>> time than the cpu needs for a single DIV opcode! :-)
>>
>> Terje
>
> How about a lookup table?

Lookup tables are indeed good here, in fact they should be the obvious
choice. :-)

However, when working with more digits and/or parallel operations,
straight inline SIMD code tends to beat lookup tables due to the load
stage bottleneck.

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

```

```Terje Mathisen wrote:
> Jean-François Michaud wrote:
> > Terje Mathisen wrote:
> >> In 32-bit mode the difference is even larger, the code AMD "borrowed"
> >> converts a full 32-bit unsigned integer to 10 decimal digits in less
> >> time than the cpu needs for a single DIV opcode! :-)
> >>
> >> Terje
> >
> > How about a lookup table?
>
> Lookup tables are indeed good here, in fact they should be the obvious
> choice. :-)
>
> However, when working with more digits and/or parallel operations,
> straight inline SIMD code tends to beat lookup tables due to the load
> stage bottleneck.

Hmm, possibly, but that's assuming a lot ;-). Most processors out there
don't have the SIMD instruction set/registers.

Regards
Jean-Francois Michaud

```

```Jean-François Michaud wrote:
> > However, when working with more digits and/or parallel operations,
> > straight inline SIMD code tends to beat lookup tables due to the load
> > stage bottleneck.
>
> Hmm, possibly, but that's assuming a lot ;-). Most processors out there
> don't have the SIMD instruction set/registers.

Actually, most do. But there are just enough that don't to make it
dangerous to use them if you have a universal application.

The experiments I've run lately show a 512-byte lookup table (which
translates one byte/two digits at a time) to be the fastest, but I've
not done careful analysis to make sure the cache isn't getting polluted
by having such a large table.  Bertrand posted an SSE version in the
"faster hex to buffer routine" thread. And I've provided a couple of
other solutions.

Note to Terje -- I tried the MUL trick with 32-bit integers (converted
to a decimal string) and wound up having to do a *huge* multiplication
to overcome the loss of precision. For two characters (one byte) you
can get away with this, but not for 32-bit integers. I'd be interested
in seeing a 32-bit conversion to decimal integer string that doesn't
involve at least three MUL instructions.  I found out that a repeated
subtraction for each digit turned out to be the fastest solution I
could come up with.
Cheers,
Randy Hyde

```

```Jean-François Michaud wrote:
> Terje Mathisen wrote:
>> However, when working with more digits and/or parallel operations,
>> straight inline SIMD code tends to beat lookup tables due to the load
>> stage bottleneck.
>
> Hmm, possibly, but that's assuming a lot ;-). Most processors out there
> don't have the SIMD instruction set/registers.

I'm not advocating SIMD opcodes, as in MMX/SSE, but working with
multiple values simultaneously in a register that's wider than the basic
data item. :-)

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

```

```randyhyde@earthlink.net wrote:
> Note to Terje -- I tried the MUL trick with 32-bit integers (converted
> to a decimal string) and wound up having to do a *huge* multiplication
> to overcome the loss of precision. For two characters (one byte) you
> can get away with this, but not for 32-bit integers. I'd be interested
> in seeing a 32-bit conversion to decimal integer string that doesn't
> involve at least three MUL instructions.  I found out that a repeated
> subtraction for each digit turned out to be the fastest solution I
> could come up with.

Huh?

10 digits, each averaging 5 subtractions/branches, and you run faster
than the 30-50 cycles of a 3-MUL direct conversion?

Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

```

```randyhyde@earthlink.net wrote:
> Jean-François Michaud wrote:
> > > However, when working with more digits and/or parallel operations,
> > > straight inline SIMD code tends to beat lookup tables due to the load
> > > stage bottleneck.
> >
> > Hmm, possibly, but that's assuming a lot ;-). Most processors out there
> > don't have the SIMD instruction set/registers.
>
>
> Actually, most do. But there are just enough that don't to make it
> dangerous to use them if you have a universal application.

Quite interesting. It seems we don't live on the same planet then :).
90% of the processors out there don't have those facilities. If you are
talking about desktops then yes, I would agree, but most computers are
not desktop computers.

[SNIP]

Regards
Jean-Francois Michaud

```

```Jean-François Michaud wrote:

> > > Hmm, possibly, but that's assuming a lot ;-). Most processors out there
> > > don't have the SIMD instruction set/registers.
> >
> >
> > Actually, most do. But there are just enough that don't to make it
> > dangerous to use them if you have a universal application.
>
> Quite interesting. It seems we don't live on the same planet then :).
> 90% of the processors out there don't have those facilities. If you are
> talking about desktops then yes, I would agree, but most computers are
> not desktop computers.

This is the comp.lang.asm.x86 newsgroup, so we immediately eliminate
from consideration non-x86 processors. While x86 processors *are* used
in embedded applications that don't support SIMD instructions, I still
argue that the majority of x86s in use, do.

Your argument would have more power in ALA, but here one has to assume
that we're talking x86.
Cheers,
Randy Hyde

```

```Terje Mathisen wrote:
>
> Huh?
>
> 10 digits, each averaging 5 subtractions/branches, and you run faster
> than the 30-50 cycles of a 3-MUL direct conversion?

Well, if it were only three MUL instructions I'd probably be okay with
that. But it took more. And I still had accuracy problems that I had to
hand-tweak for each digit range (and that resulted in several
additional instructions in addition to the MUL). Again, I'd like to see
a good generic MUL solution that works for 32 bits. Maybe my code was
defective, I don't know. But it took a *lot* of work to get it working
correctly for all 4 billion combinations. I wish, now, that I'd kept
the code around rather than throwing it away after determining that the
repeated subtraction worked faster so I could post it here and find out
what was wrong with the approach.

But if anyone has a 32-bit integer to decimal string conversion routine
that uses MUL for the division operation, I'd love to see it. It's not
like I'm happy with the repeated subtraction approach, just that the
repeated subtraction approach is the fastest I've created so far...
Cheers,
Randy Hyde

```

```randyhyde@earthlink.net wrote:
> Terje Mathisen wrote:
>> Huh?
> Well, if it were only three MUL instructions I'd probably be okay with
> that. But it took more. And I still had accuracy problems that I had to

Aha!

You used reciprocal MUL as a simple substitution for DIV, and you
probably didn't do the full error analysis needed to prove how to get
exact results either?

Anyway, my algorithm is totally different: It splits the 32-bit binary
number into two 5-digit (decimal) halves, using a reciprocal
multiplication by 1/100000.

The next step is to scale both of these numbers so as to make them a
binary/decimal scaled fraction, with the (scaled) decimal point after
the first digit.

At this point each pair of digits (one from each half) can be retrieved
by masking away the top digit (after storing it), multiplying the
fraction by 5 (i.e. 10/2), and repeating the process, with the implied
decimal point one bit further down.

> But if anyone has a 32-bit integer to decimal string conversion routine
> that uses MUL for the division operation, I'd love to see it. It's not
> like I'm happy with the repeated subtraction approach, just that the
> repeated subtraction approach is the fastest I've created so far...

Since I invented it, I'll post my original version. As I've stated
previously, you can see AMD's minor tweaks to it in their optimization
manual.

The version below uses 3 MUL operations and a single IMUL. The rest is
mostly LEA, SUB, AND, ADD and SHR by constant.

char *utoa(char *buf, unsigned long n)
{
__asm {
mov ebx,[n]
mov eax,2814749767

mul ebx		// Reciprocal (2^n / 1e5) MUL

shr ebx,1
xor ecx,ecx

mov edi,[buf]

mov eax,100000

shr ecx,16		// ECX = high part
mov ebx,[n]

imul eax,ecx		// High part * 100k

sub ebx,eax		// Low part
mov eax,429497

mul ecx

mov ecx,eax

mov eax,429497
mov [edi],dl

mul ebx

mov ebx,eax

shr ecx,3

mov [edi+5],dl

shr ebx,3
lea ecx,[ecx+ecx*4]

mov edx,ecx
and ecx,0fffffffh

shr edx,28
lea ebx,[ebx+ebx*4]

mov eax,ebx

shr eax,28
mov [edi+1],dl

and ebx,0fffffffh

mov [edi+6],al
lea ecx,[ecx+ecx*4]

lea ebx,[ebx+ebx*4]
mov edx,ecx

mov eax,ebx
and ecx,07ffffffh

shr edx,27
and ebx,07ffffffh

shr eax,27

mov [edi+2],dl

mov [edi+7],al
lea ecx,[ecx+ecx*4]

lea ebx,[ebx+ebx*4]
mov edx,ecx

mov eax,ebx
and ecx,03ffffffh

shr edx,26
and ebx,03ffffffh

shr eax,26

mov [edi+3],dl

mov [edi+8],al
lea ecx,[ecx+ecx*4]

shr ecx,25
lea ebx,[ebx+ebx*4]

shr ebx,25

mov [edi+10],ah

mov [edi+4],cl
mov [edi+9],bl
}

return buf;
}

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

```

```Terje Mathisen wrote:
> > Terje Mathisen wrote:
> >> Huh?
> > Well, if it were only three MUL instructions I'd probably be okay with
> > that. But it took more. And I still had accuracy problems that I had to
>
> Aha!
>
> You used reciprocal MUL as a simple substitution for DIV, and you
> probably didn't do the full error analysis needed to prove how to get
> exact results either?

Oh, I did the analysis and found that in the 10th digit I was getting
only 3.25 bits of precision. That's why I added a ton of extra code and
did a 64-bit multiply to overcome the accuracy problems.

>
> Anyway, my algorithm is totally different: It splits the 32-bit binary
> number into two 5-digit (decimal) halves, using a reciprocal
> multiplication by 1/100000.

Okay, I'll try that and see how it works.
Cheers,
Randy Hyde

```

```randyhyde@earthlink.net <spamtrap@crayne.org> wrote in part:
> But if anyone has a 32-bit integer to decimal string
> conversion routine that uses MUL for the division operation,
> I'd love to see it. It's not like I'm happy with the repeated
> subtraction approach, just that the repeated subtraction
> approach is the fastest I've created so far...  Cheers,

Why not let the firmware do the hard work:  FILD / FBSTP ?

-- Robert

```

```Robert Redelmeier wrote:
> randyhyde@earthlink.net <spamtrap@crayne.org> wrote in part:
>> But if anyone has a 32-bit integer to decimal string
>> conversion routine that uses MUL for the division operation,
>> I'd love to see it. It's not like I'm happy with the repeated
>> subtraction approach, just that the repeated subtraction
>> approach is the fastest I've created so far...  Cheers,
>
> Why not let the firmware do the hard work:  FILD / FBSTP ?

Have you checked the cycle count on FBSTP lately? :-)

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

```

```randyhyde@earthlink.net wrote:
> Terje Mathisen wrote:
>> Anyway, my algorithm is totally different: It splits the 32-bit binary
>> number into two 5-digit (decimal) halves, using a reciprocal
>> multiplication by 1/100000.
>
> Okay, I'll try that and see how it works.

I'm pretty confident it works, I did run an exhaustive test after all. :-)

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

```

```Terje Mathisen <spamtrap@crayne.org> wrote in part:
> Have you checked the cycle count on FBSTP lately? :-)

No, so I just did:  186 on a Pentium3.
Intel doesn't say for a P4, but AMD quotes 198 for K7.

IMHO, rather cheap -- 20 clocks per decimal digit.

-- Robert

```

```Robert Redelmeier wrote:
> Terje Mathisen <spamtrap@crayne.org> wrote in part:
>> Have you checked the cycle count on FBSTP lately? :-)
>
> No, so I just did:  186 on a Pentium3.
> Intel doesn't say for a P4, but AMD quotes 198 for K7.
>
> IMHO, rather cheap -- 20 clocks per decimal digit.

If you want/need 64 bits (or at least more than 32), I totally agree!

For 32-bit unsigned values on a 32-bit cpu, my algorithm beats it by a
factor of about 5.

OTOH, with a 64-bit cpu it should be possible to extend my approach to
work with 64-bit unsigned and 20 output digits. :-)

I'm still waiting for my Core 2 Duo laptop. :-(

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

```

```Terje Mathisen <spamtrap@crayne.org> wrote in part:
> Robert Redelmeier wrote:
>> [FBSTP]Intel doesn't say for a P4, but AMD quotes 198 for K7.
>
> If you want/need 64 bits (or at least more than 32), I totally agree!
>
> For 32-bit unsigned values on a 32-bit cpu, my algorithm
> beats it by a factor of about 5.

Yes, your alg is _very_ clever.  Multiplication by the inverse is a
rather simple trick to beat DIV, always workable on FP, and usable
on integers with enough error analysis. And you also do decomposition
by integers from fixed point fractions.

I particularly like your splitting into two so the decomposition
streams can run in parallel rather than sit on latency dependency
chains.  Bravo!

Exceedingly clever, and to be used in the unlikely event speed is
important in such an operation usually associated with glacial I/O.

-- Robert

```

```Robert Redelmeier wrote:
> Terje Mathisen <spamtrap@crayne.org> wrote in part:
> > Have you checked the cycle count on FBSTP lately? :-)
>
> No, so I just did:  186 on a Pentium3.
> Intel doesn't say for a P4, but AMD quotes 198 for K7.
>
> IMHO, rather cheap -- 20 clocks per decimal digit.

Don't forget, however, that there is still a lot of work left to do
after the FBSTP instruction completes. You've got to extract each
hexadecimal digit (e.g., using the table-driven algorithm I've posted
here or in ALA, based on the 512-byte table algorithm also posted in

Fundamentally, though, I personally dislike using the FPU in this
manner as it requires an otherwise integer-only application to make
sure the FPU is in FPU mode (versus MMX mode) or you have to spend a
*lot* of time saving and restoring the FPU state.
Cheers,
Randy Hyde

```

```Robert Redelmeier wrote:
> Terje Mathisen <spamtrap@crayne.org> wrote in part:
>> Robert Redelmeier wrote:
>>> [FBSTP]Intel doesn't say for a P4, but AMD quotes 198 for K7.
>> If you want/need 64 bits (or at least more than 32), I totally agree!
>>
>> For 32-bit unsigned values on a 32-bit cpu, my algorithm
>> beats it by a factor of about 5.
>
> Yes, your alg is _very_ clever.  Multiplication by the inverse is a
> rather simple trick to beat DIV, always workable on FP, and usable
> on integers with enough error analysis. And you also do decomposition
> by integers from fixed point fractions.
>
> I particularly like your splitting into two so the decomposition
> streams can run in parallel rather than sit on latency dependancy
> chains.  Bravo!

Thanks, I was quite happy when I came up with it, made it work and
proved it to be always correct. :-)

> Exceedingly clever, and to be used in the unlikely event speed is
> important in such an operation usually associated with glacial I/O.

If you take a close look at the 754r standards work, you'll notice that
they suggest implementing decimal fp arithmetic in hardware.

One of the strawman arguments (there are other, more valid) for 754r is
that this can beat regular binary for things like 'the telecom
benchmark', which consists of a lot of ascii processing.

However, if you have a fast enough binary-to-ascii conversion
(ascii-to-binary is of course trivial!), then you can keep running your
cpu in binary mode, saving all those BCD gates, while still running
faster. :-)

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

```

```Thanks to everyone for the information! I had to use a very compact
way so I used AAM. Size was more important than speed as well.

If you're curious as to what we're up to, take a look here, and if you
want to help, just let me know!

http://wasteland.wikispaces.com/

```

```Terje Mathisen wrote:
> > Terje Mathisen wrote:
> >> Anyway, my algorithm is totally different: It splits the 32-bit binary
> >> number into two 5-digit (decimal) halves, using a reciprocal
> >> multiplication by 1/100000.
> >
> > Okay, I'll try that and see how it works.
>
> I'm pretty confident it works, I did run an exhaustive test after all. :-)

I've still got to refactor my version a bit, but the interesting thing
is that your algorithm was barely edged out by the "division by repeated
subtraction" algorithm.

Here are the results I got via RDTSC on my PIV:

Terje: 000000CC04E1F460
repeated subtraction: 000000B0960388E8

Now granted, I had to do a data copy with your algorithm so that it
would leave the data in the format I needed it in (and the subtraction
algorithm was able to do without the extra copy), and I suspect that I
can eliminate this copy by refactoring the results, but I was expecting
a *huge* difference in performance and was surprised by the fact that
the two algorithms came out fairly close to one another in performance.
I've posted my code, so see if you can spot any obvious mistakes that
would skew the results.
Cheers,
Randy Hyde

program t;
#include( "stdlib.hhf" )

static
OutputUnderscores   :boolean := false;

procedure _unsToBuf32
(
d       :dword in eax;
width   :dword in ecx;
var buffer  :char in edi
);
@noframe;
@nodisplay;
@noalignstack;

noUSjt  :dword[11] :=
[
&noUS1,
&noUS2,
&noUS3,
&noUS4,
&noUS5,
&noUS6,
&noUS7,
&noUS8,
&noUS9,
&noUS10
];

#macro subDigit( subValue, posn ):done;
mov( '0', dl );
sub( subValue, eax );
jc done;
inc( dl );
sub( subValue, eax );
jc done;
inc( dl );
sub( subValue, eax );
jc done;
inc( dl );
sub( subValue, eax );
jc done;
inc( dl );
sub( subValue, eax );
jc done;
inc( dl );
sub( subValue, eax );
jc done;
inc( dl );
sub( subValue, eax );
jc done;
inc( dl );
sub( subValue, eax );
jc done;
inc( dl );
sub( subValue, eax );
jc done;
inc( dl );
sub( subValue, eax );
done:
add( subValue, eax );   // undo the overshooting subtraction
mov( dl, [edi-posn] );

#endmacro

begin _unsToBuf32;

// Drop down here if we're not outputting underscores in between
// groups of three digits in the number.

cmp( ecx, 10 );

jmp( noUSjt[ecx*4] );

noUS10:
mov( '1', dl );
sub( 1_000_000_000, eax );
sub( 1_000_000_000, eax );
jc done10;
inc( dl );
sub( 1_000_000_000, eax );
jc done10;
inc( dl );
sub( 1_000_000_000, eax );
jc done10;
inc( dl );
sub( 1_000_000_000, eax );
done10:
add( 1_000_000_000, eax );  // undo the overshooting subtraction
mov( dl, [edi-9] );

noUS9:
subDigit( 100_000_000, 8 );
noUS8:
subDigit( 10_000_000, 7 );
noUS7:
subDigit( 1_000_000, 6 );
noUS6:
subDigit( 100_000, 5 );
noUS5:
subDigit( 10_000, 4 );
noUS4:
subDigit( 1_000, 3 );
noUS3:
subDigit( 100, 2 );
noUS2:
subDigit( 10, 1 );
noUS1:
or( '0', al );
mov( al, [edi] );
sub( ecx, edi );    // Point EDI at the first char in the buffer
ret();

raise( ex.WidthTooBig );

end _unsToBuf32;

procedure __unsToBuf32
(
d       :dword;
width   :dword;
var buffer  :char in edi
);
@noframe;
@nodisplay;
@noalignstack;

var
asciiBuf    :char[12];

noUSjt  :dword[11] :=
[
&noUS1,
&noUS2,
&noUS3,
&noUS4,
&noUS5,
&noUS6,
&noUS7,
&noUS8,
&noUS9,
&noUS10
];

begin __unsToBuf32;

push( ebp );
mov( esp, ebp );
sub( _vars_, esp );
push( eax );
push( ebx );
push( ecx );
push( edx );
push( esi );

// Terje Mathisen

mov( d, ebx );
mov( 2814749767, eax );
mul( ebx );             // Reciprocal (2^n / 1e5) MUL
shr( 1, ebx );
xor( ecx,ecx );
mov( 100_000, eax );
shr( 16, ecx );         // ECX = high part
mov( d, ebx );
intmul( ecx, eax );     // High part * 100k
sub( eax, ebx );
mov( 429497, eax );
mul( ecx );
mov( eax, ecx );
mov( 429497, eax );
mov( dl, asciiBuf[0] );

mul( ebx );
mov( eax, ebx );
shr( 3, ecx );
mov( dl, asciiBuf[5] );

shr( 3, ebx );
lea( ecx, [ecx+ecx*4] );
mov( ecx, edx );
and( $fff_ffff, ecx );
shr( 28, edx );
lea( ebx, [ebx+ebx*4] );
mov( ebx, eax );
shr( 28, eax );
mov( dl, asciiBuf[1] );

and( $fffffff, ebx );
mov( al, asciiBuf[6] );

lea( ecx,[ecx+ecx*4] );
lea( ebx,[ebx+ebx*4] );
mov( ecx, edx );
mov( ebx, eax );
and( $7ff_ffff, ecx );
shr( 27, edx );
and( $7ff_ffff, ebx );
shr( 27, eax );
mov( dl, asciiBuf[2] );
mov( al, asciiBuf[7] );

lea( ecx,[ecx+ecx*4] );
lea( ebx,[ebx+ebx*4] );
mov( ecx, edx );
mov( ebx, eax );
and( $3ff_ffff, ecx );
shr( 26, edx );
and( $3ff_ffff, ebx );
shr( 26, eax );
mov( dl, asciiBuf[3] );
mov( al, asciiBuf[8] );
lea( ecx, [ecx+ecx*4] );
shr( 25, ecx );
lea( ebx, [ebx+ebx*4] );
shr( 25, ebx );
mov( cl, asciiBuf[4] );
mov( bl, asciiBuf[9] );

mov( width, ecx );
cmp( ecx, 10 );

jmp( noUSjt[ecx*4] );

noUS10:
mov( (type dword asciiBuf[0]), ebx );
mov( (type dword asciiBuf[4]), eax );
mov( (type word asciiBuf[8]), cx );
mov( ebx, [edi-9] );
mov( eax, [edi-5] );
mov( cx, [edi-1] );
sub( 9, edi );
jmp done;

noUS9:
mov( (type byte asciiBuf[1]), cl );
mov( (type word asciiBuf[2]), bx );
mov( (type dword asciiBuf[4]), eax );
mov( (type word asciiBuf[8]), dx );
mov( cl, [edi-8] );
mov( bx, [edi-7] );
mov( eax, [edi-5] );
mov( dx, [edi-1] );
sub( 8, edi );
jmp done;

noUS8:
mov( (type dword asciiBuf[2]), ebx );
mov( (type dword asciiBuf[6]), eax );
mov( ebx, [edi-7] );
mov( eax, [edi-3] );
sub( 7, edi );
jmp done;

noUS7:
mov( (type byte asciiBuf[3]), cl );
mov( (type word asciiBuf[4]), bx );
mov( (type dword asciiBuf[6]), eax );
mov( cl, [edi-6] );
mov( bx, [edi-5] );
mov( eax, [edi-3] );
sub( 6, edi );
jmp done;

noUS6:
mov( (type word asciiBuf[4]), bx );
mov( (type dword asciiBuf[6]), eax );
mov( bx, [edi-5] );
mov( eax, [edi-3] );
sub( 5, edi );
jmp done;

noUS5:
mov( (type byte asciiBuf[5]), bl );
mov( (type dword asciiBuf[6]), eax );
mov( bl, [edi-4] );     // store one byte; the dword below supplies the rest
mov( eax, [edi-3] );
sub( 4, edi );
jmp done;

noUS4:
mov( (type dword asciiBuf[6]), eax );
mov( eax, [edi-3] );
sub( 3, edi );
jmp done;

noUS3:
mov( (type byte asciiBuf[7]), bl );
mov( (type word asciiBuf[8]), ax );
mov( bl, [edi-2] );
mov( ax, [edi-1] );
sub( 2, edi );
jmp done;

noUS2:
mov( (type word asciiBuf[8]), ax );
mov( ax, [edi-1] );
sub( 1, edi );
jmp done;

noUS1:
mov( (type byte asciiBuf[9]), al );
mov( al, [edi] );

done:
pop( esi );
pop( edx );
pop( ecx );
pop( ebx );
pop( eax );
leave();
ret( _parms_ );

raise( ex.WidthTooBig );

end __unsToBuf32;

var
start   :dword[2];
time1   :dword[2];
time2   :dword[2];
buffer  :byte[16];
buffer2 :byte[16];

begin t;

xor( ecx, ecx );
mov( 16, ecx );
lea( edi, buffer );
mov( ' ', al );
rep.stosb();
mov( 16, ecx );
lea( edi, buffer2 );
mov( ' ', al );
rep.stosb();
mov( 0, time1 );
mov( 0, time1[4] );
mov( 0, time2 );
mov( 0, time2[4] );

loopit0:

lea( edi, buffer[15] );
mov( ecx, eax );
conv.u32Size( ecx );
mov( eax, ebx );
rdtsc();
mov( eax, start );
mov( edx, start[4] );
__unsToBuf32( ecx, ebx, [edi] );
rdtsc();
sub( start, eax );
sbb( start[4], edx );
add( eax, time1 );      // accumulate elapsed cycles for this call
adc( edx, time1[4] );

push( ecx );
lea( edi, buffer2[15] );
rdtsc();
mov( eax, start );
mov( edx, start[4] );
_unsToBuf32( ecx, ebx, [edi] );
rdtsc();
sub( start, eax );
sbb( start[4], edx );
add( eax, time2 );      // accumulate elapsed cycles for this call
adc( edx, time2[4] );
pop( ecx );

mov( (type dword buffer), eax );
cmp( eax, (type dword buffer2) );
jne Failure;
mov( (type dword buffer[4]), eax );
cmp( eax, (type dword buffer2[4]) );
jne Failure;
mov( (type dword buffer[8]), eax );
cmp( eax, (type dword buffer2[8]) );
jne Failure;
mov( (type dword buffer[12]), eax );
cmp( eax, (type dword buffer2[12]) );
jne Failure;

dec( ecx );             // next input value; loop over all 2^32 inputs
jnz loopit0;

stdout.put( "Time1: " );
stdout.putd( time1[4] );
stdout.putd( time1[0] );
stdout.newln();
stdout.put( "Time2: " );
stdout.putd( time2[4] );
stdout.putd( time2[0] );
stdout.newln();
exit t;

Failure:
stdout.put( "Failed to compare at ", ecx, nl );

end t;

```
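As an aside for readers following the listing above: the "division by repeated subtraction" conversion that `_unsToBuf32` implements can be sketched in a few lines of Python (my own illustration of the technique, not a transcription of the HLA):

```python
# Repeated-subtraction decimal conversion: for each power of ten, count
# how many times it can be subtracted; that count is the next digit.
def uns_to_dec(d):
    assert 0 <= d < 2**32
    digits = []
    for p in (1_000_000_000, 100_000_000, 10_000_000, 1_000_000,
              100_000, 10_000, 1_000, 100, 10):
        count = 0
        while d >= p:           # the HLA code does this with SUB + JC
            d -= p
            count += 1
        digits.append(count)
    digits.append(d)            # what remains is the ones digit
    s = "".join(chr(0x30 | n) for n in digits)   # OR with '0', as in noUS1
    return s.lstrip("0") or "0"
```

The branchy inner loop is exactly what makes this approach fast on predictable inputs and slow on random ones, which is the crux of the benchmarking debate in the posts that follow.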

```randyhyde@earthlink.net wrote:
> Terje Mathisen wrote:
>>> Terje Mathisen wrote:
>>>> Anyway, my algorithm is totally different: It splits the 32-bit binary
>>>> number into two 5-digit (decimal) halfs, using a reciprocal
>>>> multiplication by 1/100000.'
>>> Okay, I'll try that and see how it works.
>> I'm pretty confident it works, I did run an exhaustive test after all. :-)
>
> I've still got to refactor my version a bit, but the interesting thing
> is that your algorithm was barely edged out by the "division by repeated
> subtraction" algorithm.
>
> Here are the results I got via RDTSC on my PIV:
>
> Terje: 000000CC04E1F460
> repeated subtraction: 000000B0960388E8

I.e. my code is slower?

> Now granted, I had to do a data copy with your algorithm so that it
> would leave the data in the format I needed it in (and the subtraction
> algorithm was able to do without the extra copy), and I suspect that I
> can eliminate this copy by refactoring the results, but I was expecting
> a *huge* difference in performance and was surprised by the fact that
> the two algorithms came out fairly close to one another in performance.
> I've posted my code, so see if you can spot any obvious mistakes that
> would skew the results.

First, what is the actual cycle count/iteration that you get on my code?

a) Are you doing this in a loop, with non-random inputs? Are smaller
numbers much more likely than those that have 9-10 significant digits?

This would _really_ speed up the subtraction-based algorithm, as it
would be perfectly branch-predicted on most (all in the case of
identical inputs) iterations.

b) You time my code while introducing a Partial Register Stall, costing
10-20 cycles directly after the end, by branching on ECX when my code
has just updated CL.

c) You run on a P4, which has by far the (relatively) slowest MUL speed
of any modern x86-compatible cpu, while running the inner logic core at
double rate.

It is still interesting that really naive code can outperform the
textbook "div/mod by 10" approach by such a great margin. :-)

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

```
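Terje's split of a 32-bit value into two 5-digit halves by reciprocal multiplication can be sketched in Python. Note that the constant and shift below are my own choice (ceil(2^49/100_000) with a 49-bit shift, which makes the truncated product exact for every 32-bit input); the listing above uses a 48-bit constant:

```python
# Reciprocal multiplication: divide by 100_000 without a DIV instruction.
RECIP = (1 << 49) // 100_000 + 1      # 5_629_499_535, i.e. ceil(2^49 / 100_000)

def split_100000(d):
    assert 0 <= d < 2**32
    hi = (d * RECIP) >> 49            # d // 100_000 (exact for all 32-bit d)
    lo = d - hi * 100_000             # d %  100_000, one multiply-and-subtract
    return hi, lo
```

In assembly the product's high half falls out of MUL for free, so all that remains of the `>> 49` is a short shift of EDX.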

```Terje Mathisen wrote:
> > Here are the results I got via RDTSC on my PIV:
> >
> > Terje: 000000CC04E1F460
> > repeated subtraction: 000000B0960388E8
>
> I.e. my code is slower?

Well, the function that incorporates your algorithm is slightly slower.
Your algorithm by itself would be slightly faster, but I needed to move
the data around (the same way as was needed by my library routines).
This is the part that can be refactored. Upon entry into the routine, I
*know* how many digits will be produced. I am not taking advantage of
that as well as I could with your implementation. I hadn't done this
work before because I blindly assumed that your algorithm would be so
many times faster that it wouldn't be necessary for a first approximation.

>
> > Now granted, I had to do a data copy with your algorithm so that it
> > would leave the data in the format I needed it in (and the subtraction
> > algorithm was able to do without the extra copy), and I suspect that I
> > can eliminate this copy by refactoring the results, but I was expecting
> > a *huge* difference in performance and was surprised by the fact that
> > the two algorithms came out fairly close to one another in performance.
> > I've posted my code, so see if you can spot any obvious mistakes that
> > would skew the results.
>
> First, what is the actual cycle count/iteration that you get on my code?

Those numbers above are RDTSC values for the two different function
calls. In the case of your algorithm, there is the overhead of
preserving some registers (which my code doesn't do) and moving the
data from a temporary buffer to its final resting place, which can be
eliminated by refactoring the code.
>
> a) Are you doing this in a loop, with non-random inputs? Are smaller
> numbers much more likely than those that have 9-10 significant digits?

I do an exhaustive test, with all possible values for 32 bits. In the
real world, of course, smaller numbers appear much more often than
large numbers, but I don't see that running all $1_0000_0000 different
values is going to favor one algorithm over the other.

BTW, you will note in the code that I also compared all the strings
both algorithms produced to ensure they were identical.

>
> This would _really_ speed up the subtraction-based algorithm, as it
> would be perfectly branch-predicted on most (all in the case of
> identical inputs) iterations.

My subtraction-based algorithm relies on the fact that the number of
output digit positions has been precomputed (yours does too -- it just
assumes that the value is 10). However, for the tests I ran, the
assumption was that it always output 10 digits (with leading zeros),
which is the slowest form (assuming fewer than 10 digits could be
output). For the exhaustive test I ran, the efficiencies gained by
processing fewer digits wouldn't matter much, as only about 20% of the
values in the exhaustive test have nine or fewer digits.
For real-world cases, where numbers don't often head into the 10-digit
zone, there might be something to be gained from this, but not in the
test I ran.

>
> b) You time my code while introducing a Partial Register Stall, costing
> 10-20 cycles directly after the end, by branching on ECX when my code
> have just updated CL.

I will switch to ESI and see if it makes a big difference. But quite
frankly I see a bigger gain to be made by refactoring all the
memory-copy code. I'm thinking of having 10 variants of your algorithm,
each optimized for one through ten digits of output.

>
> c) You run on a P4, which has by far the (relatively) slowest MUL speed
> of any modern x86-compatible cpu, while running the inner logic core at
> double rate.

Yep. But that's still something I have to take into account when
designing general-purpose library code.  I can't hand-tweak the
instruction scheduling for a particular CPU and call it the fastest
code around. As a general rule, I stay away from the cycle counting
stuff that varies from CPU to CPU and try to stick mainly with the
algorithmic choices.

>
> It is still interesting that really naive code can outperform the
> textbook "div/mod by 10" approach by such a great margin. :-)

It blew me away. I don't know what motivated me to try it in the first
place other than something along the lines of "I wonder how slow this
one would run."  As I said, my attempt at "divide by multiplication"
(which didn't work out well because I didn't think of first dividing by
100,000) turned out to be slower. That's why I started this line of
questioning in the first place.
Cheers,
Randy Hyde

```
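For reference while reading these measurements: the interleaved digit extraction in `__unsToBuf32` (one multiply per 5-digit half, then an LEA/shift pair per digit) can be modeled in Python as follows. The `429_497` constant is the one from the listing (roughly 2^32/10_000); the shift schedule here is a simplified version of the one the HLA code uses, so treat this as a sketch of the technique rather than a line-for-line port:

```python
# Extract the five decimal digits of a value < 100_000 with one multiply:
# the product's top bits are the first digit and the rest is a binary
# fraction; each *5 with a shrinking shift multiplies the fraction by ten
# (the LEA ecx,[ecx+ecx*4] pattern) and exposes the next digit.
def five_digits(half):
    assert 0 <= half < 100_000
    x = half * 429_497              # ~ half * 2^32 / 10_000
    digits = [x >> 32]              # integer part: the first digit
    x &= (1 << 32) - 1              # keep the fraction
    shift = 32
    for _ in range(4):
        x *= 5                      # *10 overall, since the binary
        shift -= 1                  # point also moves right by one
        digits.append(x >> shift)
        x &= (1 << shift) - 1
    return digits
```

No division or modulo appears anywhere; that is why the per-digit cost is a couple of simple ALU operations rather than a DIV.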

```randyhyde@earthlink.net wrote:
> Terje Mathisen wrote:
>> a) Are you doing this in a loop, with non-random inputs? Are smaller
>> numbers much more likely than those that have 9-10 significant digits?
>
> I do an exhaustive test, with all possible values for 32 bits. In the
> real world, of course, smaller numbers appear much more often than
> large numbers, but I don't see that running all $1_0000_0000 different
> values is going to favor one algorithm over the other.

If you test the values in order, then it makes a _huge_ difference, in
favor of your version, due to minimizing the number of branch misses.

If you run with (say) 1000 random 32-bit values, then your version should
be significantly slower.

Anyway, how fast can your code be?

Subtracting to detect digits will average 5 iterations/digit, right?
(The alternative is a binary search which is even less predictable, and
which has about 3 X the number of branches in the code).

If all iterations except the final one are correctly predicted, then we
get an absolute minimum of 5 cycles + a branch miss costing anywhere
from 5 to 20 cycles (depending upon the current cpu architecture).

With 10 digits, this comes out to 50 + 10 * [5-20] = [100 - 250]

Since my code needs 4 MUL operations, each of which costs 10-11 cycles
on a P4, the entire code will take about 60-80 cycles on the same P4,
while on any AMD or P6 style cpu we're down in the 30-50 cycle range.

>> b) You time my code while introducing a Partial Register Stall, costing
>> 10-20 cycles directly after the end, by branching on ECX when my code
>> has just updated CL.
>
> I will switch to ESI and see if it makes a big difference. But quite
> frankly I see a bigger gain to be made by refactoring all the
> memory-copy code. I'm thinking of having 10 variants of your algorithm,
> each optimized for one through ten digits of output.

I really don't understand how you can know up front how many digits
a number will need. Are you only ever using this as part of a formatted,
fixed-field output function?

>> c) You run on a P4, which has by far the (relatively) slowest MUL speed
>> of any modern x86-compatible cpu, while running the inner logic core at
>> double rate.
>
> Yep. But that's still something I have to take into account when
> designing general-purpose library code.  I can't hand-tweak the
> instruction scheduling for a particular CPU and call it the fastest
> code around. As a general rule, I stay away from the cycle counting
> stuff that varies from CPU to CPU and try to stick mainly with the
> algorithmic choices.

Since the P4 is the outlier, and all other x86 cpus run MUL quite fast,
it is unfortunate for my code that you happen to have a P4. :-(

>> It is still interesting that really naive code can outperform the
>> textbook "div/mod by 10" approach by such a great margin. :-)
>
> It blew me away. I don't know what motivated me to try it in the first
> place other than something along the lines of "I wonder how slow this
> one would run."  As I said, my attempt at "divide by multiplication"
> (which didn't work out well because I didn't think of first dividing by
> 100,000) turned out to be slower. That's why I started this line of
> questioning in the first place.

When working with 64-bit numbers, my assumption would be that really big
numbers are quite rare, and that an approach based on a higher power of
10, but maybe not 100000, could be useful:

I.e. a repeated reciprocal multiplication to generate 3
digits/iteration, as a fraction after division by 1000.
Back-multiplication by 1000 could be handled as (1024 - 8*3).

The modulo-1000 value can be converted to three digits using a single
MUL to convert it into a binary-scaled decimal fraction: 9.99 * 2^n.

This approach would require two MUL operations per three digits, and it
would stop as soon as the most significant triplet had been output.

Instead of doing back-multiplication and decimal fraction scaling, a
more precise reciprocal, of at least 76 bits or so, would allow us to
calculate both the results of the division by 10^n, and the decimal
fraction correctly rounded up so that it could be used to output the
digits directly. This would be easy to do by having two MULs, then
shifting the lower half down before adding to the top half.

Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

```
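The three-digits-per-iteration scheme Terje outlines above (reduce modulo 1000, then turn the remainder into a binary-scaled decimal fraction with a single multiply) might look like this in Python; the constant here is my own, chosen as ceil(2^32/1000):

```python
# Convert a value in [0, 999] to its three digits with one multiply:
# scale it into a 0.32 fixed-point fraction of 1000, then peel digits
# off the integer part after each multiply-by-ten.
RECIP_1000 = (1 << 32) // 1000 + 1     # ceil(2^32 / 1000) = 4_294_968

def three_digits(v):
    assert 0 <= v < 1000
    x = v * RECIP_1000                 # v/1000 as a 0.32 fixed-point fraction
    out = []
    for _ in range(3):
        x *= 10
        out.append(x >> 32)            # integer part: the next digit
        x &= (1 << 32) - 1             # keep only the fraction
    return out
```

Per Terje's sketch, a 64-bit conversion would run this once per triplet and stop at the most significant nonzero group, so small numbers pay for only one or two multiplies.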

```Terje Mathisen wrote:

>
> I really don't know/understand how you can know up front how many digits
> a number will need? Are you only ever using this as part of a formatted,
> fixed-field, output function?

That has to be computed in any case (even when doing a field-sized
operation). The reason the calling code does this is that it needs to
verify that the output string will fit in the allocated space passed
into the routine. This is a computation that is paid for, whether or
not the conversion routine uses it, so I take advantage of that in the
routines. Currently, my implementation of your algorithm doesn't use
this information except for the memory move at the end. My plan is to
refactor the code and come up with ten variants of your algorithm that
compute *only* the number of digits needed, thus saving a bit of work
on numbers with fewer than 10 digits.

>
> Since the P4 is the outlier, and all other x86 cpus run MUL quite fast,
> it is unfortunate for my code that you happen to have a P4. :-(

Yep.
I believe, even on my P4, I can double the speed of your algorithm by
refactoring my code. Still, I was expecting an order of magnitude
difference, as the subtract by 10 algorithm seems so, well, brute
force. If, in the end, I can achieve a 5x speedup with your code I'll
be happier, but I'm still surprised that there isn't a huge difference
between the two.

>
> When working with 64-bit numbers, my assumption would be that really big
> numbers are quite rare, and that an approach based on a higher power of
> 10, but maybe not 100000, could be useful:

Big numbers are definitely rare, even with 32-bit values. For example,
10-digit numbers consume 80% of my test cases, but I suspect in
real life they represent less than 20% of the numbers people actually
use (i.e., the Pareto Principle).  Probably the most common numbers
have one, two, three, or four digits. When I refactor my implementation
of your code, I'm going to take this into account and use specialized
algorithms to speed up 1-3 digits.

>
> I.e. a repeated reciprocal multiplication to generate 3
> digits/iteration, as a fraction after division by 1000.
> Back-multiplication by 1000 could be handled as (1024 - 8*3).
>
> The modulo-1000 value can be converted to three digits using a single
> MUL to convert it into a binary-scaled decimal fraction: 9.99 * 2^n.

Might even try creating a lookup table to convert 2-3 digits in one
shot. It would be interesting to see if the caching (or lack thereof)
kills such an approach.

>
> This approach would require two MUL operations per three digits, and it
> would stop as soon as the most significant triplet had been output.
>
> Instead of doing back-multiplication and decimal fraction scaling, a
> more precise reciprocal, of at least 76 bits or so, would allow us to
> calculate both the results of the division by 10^n, and the decimal
> fraction correctly rounded up so that it could be used to output the
> digits directly. This would be easy to do by having two MULs, then
> shifting the lower half down before adding to the top half.

Good idea. I've still got to do a 64-bit (and 128-bit) conversion
routine for the library, so I will keep this in mind.

Cheers,
Randy Hyde

```
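Both posters lean on knowing the digit count up front (`conv.u32Size` in the benchmark harness, the `width` parameter in the routines). A minimal Python sketch of that computation, comparing against a table of powers of ten:

```python
# Decimal digit count of a 32-bit value: count how many powers of ten
# the value reaches; zero needs one digit by convention.
POW10 = [10 ** k for k in range(10)]    # 1, 10, ..., 1_000_000_000

def u32_size(d):
    assert 0 <= d < 2**32
    return sum(d >= p for p in POW10) or 1
```

A real library might instead combine a bit-scan with a single compare, but the point here is only that the count is cheap relative to the conversion itself, which is why Randy folds it into the calling convention.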