


Whither now, Oh Scheme.

We are amidst a computer programming language renaissance.  New languages
and lots of new users to play with them. Daily it seems.

12-18 months ago the #scheme IRC channel on freenode.net was essentially
abandoned.  Often it was just myself or one other, on occasion a crowd of 3.
As I type, on a Sunday evening, there are 40 users.

Discussions on #scheme are generally threefold in nature: SRFIs, since
several SRFI authors are frequently on; a bit of homework or Scheme-newbie
assistance; and, as one might imagine, a good deal of comparative
implementation discussion.

There seems to be a consensus Scheme (or Schemes) for every situation except
one: serious application development.  And a number of Schemers are interested
in doing just that.  A case can be made that a SIFSAD (Scheme Intended For
Serious Application Development) does not exist today; what is worse, it
is doubtful one will exist tomorrow.

But suppose there was a plan for SIFSAD, a roadmap for a Scheme Intended
For Serious Application Development.  What would it look like?  You would
have to start from somewhere and have a destination in mind; the path then
becomes just a bit of machete work.

Assuming the best opportunity for Sifsad is an evolutionary one from the
core of an existing Scheme implementation, here are some hypothetical
Sifsad bios.

Scheme 48 / PreScheme compiler - The PreScheme compiler is resurrected,
initially emitting C code.  Later, native emitters for Itanium/AMD 64-bit
systems are added.

PLT/MzScheme - mzc compiler is enhanced with aggressive optimizations.
MzScheme becomes not only one of the functionally richest implementations
but the fastest as well.

Chicken/Bigloo/Gambit/Larceny/Scheme->C et al. - Consensus is reached on one
code base; the remaining authors, recognizing the will of the Scheme community,
work to add the best features of each into the common code base.  The
resulting Scheme->C compiler is widely regarded as the best HLL compiler
available.

Chez Scheme - Individual licenses are made available at reasonable cost.
Source is GPL'd for non-commercial use.

MIT Scheme - A port to new 64-bit systems is successfully achieved.  A module
system and syntax-case support are added.  With memory constraints lifted,
development of lightning-fast, large-memory-footprint applications is
possible in an incremental compilation environment.


Whither now, Scheme.





 

-1
ray7279 (14)
10/27/2003 2:45:33 AM

"R Racine" <ray@adelphia.net> wrote in message news:<pan.2003.10.27.03.45.42.943883@adelphia.net>...
> 
> There seems to be a consensus Scheme(s) for every situation except one,
> serious application development.  And a number of Schemers are interested
> in doing just that.  A case can be made that a SIFSAD (Scheme Intended For
> Serious Application Development) does not exist today, what is worse, it
> is doubtful one will exist tomorrow.

Could you elaborate on that? Why do you think (say) Bigloo or PLT
might not be suitable for serious app development?

> 
> But suppose there was a plan for SIFSAD, a roadmap for a Scheme Intended
> For Serious Application Development, what would it look like.  You would
> have to start from somewhere, have a destination in mind,  the path
> becomes just a bit of machete work.
> 
> Assuming the best opportunity for Sifsad is an evolutionary one from the
> core of an existing Scheme implementation, here are some hypothetical
> Sifsad bios.
> 
> Scheme 48 / PreScheme compiler - PreScheme compiler is resurrected,
> initially emitting C code.  Later native emitters Itanium/AMD 64 bit
> systems were added.

PreScheme might not be anyone's favorite Scheme dialect. 

> 
> PLT/MzScheme - mzc compiler is enhanced with aggressive optimizations.
> MzScheme becomes not only one of the functionally richest implementations
> but the fastest as well.

Interesting alternative. But Mzc still has to provide clean interfacing
to the MzScheme runtime system, which is not really tuned for
maximum performance, but for other things (debuggability, ease of use,
robustness, etc.)

> 
> Chicken/Bigloo/Gambit/Larceny/Scheme->C et al. Consensus is reached on one
> code base, remaining authors, recognizing the will of the Scheme community
> work to add the best features of each into the common code base. The
> resulting Scheme->C compiler is widely regarded as the best HLL compiler
> available.

(BTW, Larceny is not a Scheme->C compiler)

So that would mean we reduce all Scheme->C compilation strategies down
to the lowest common denominator:

- drop Chicken's fast continuations
- drop Gambit's (forthcoming) very efficient multithreading system
- drop Bigloo's/Scheme->C's direct compilation style and make it a CPS compiler
  (you want 1st class continuations and TCO, right?)

What you will get is a Scheme implementation that is either unusable,
incomplete or inefficient.

> 
> Chez Scheme - Individual licenses are made available at reasonable cost.
> Source is GPL'd for non-commercial use.

Hm. Can't say much about that...

> 
> MITScheme - Port to new 64 bit systems is successfully achieved.  Module
> system, syntax-case support is added.  With memory constraints lifted,
> development of lightning fast, large memory footprint application are
> possible in an incremental compilation environment.

What many people don't realize is that there CAN BE NO SINGLE ALL-POWERFUL
SCHEME implementation. Tradeoffs have to be made, unless you want to
produce a mediocre one. Chicken (for example) will never beat Bigloo, in
terms of raw performance, yet Bigloo's (or PLT's) continuations are
awfully inefficient. Damn, it's even impossible to pin down a single
perfect implementation strategy (Cheney-on-the-MTA? Direct style?
Trampoline style? Bytecode VM? Threaded VM?). What GC? Conservative?
Ref. counting? Stop-and-copy? Mark-and-sweep? Which is best? Or,
more importantly, which is best for *all* applications? None, I'd say.

Several Scheme implementations are more than adequate for serious development
and people use them for that. In fact, Schemes generally provide better
performance and often have better foreign function interfaces than
languages like Python, Ruby or Perl, which seem to be well accepted for serious
stuff. Scheme is more rigorously defined, is better suited to
compilation and provides incredibly powerful syntactic abstractions.

It *is* easy to get lost in the number of implementations, and many
of those are somewhat half-finished, partly because it's so easy
to whip up a simple Scheme, yet this has absolutely nothing to do
with Scheme not being ready for development of real-world code.


cheers,
felix
-1
felix1812 (33)
10/27/2003 7:25:47 AM
felix@proxima-mt.de (felix) wrote in message news:<e36dad49.0310262325.40bcf6f5@posting.google.com>...
> 
> (BTW, Larceny is not a Scheme->C compiler)
> 

Or is Petit Larceny already available? 
It seems it isn't, but I may be wrong.


cheers,
felix
0
felix1812 (33)
10/27/2003 9:49:14 AM

On Sun, 26 Oct 2003 23:25:47 -0800, felix wrote:

> What many people don't realize is that there CAN'T BE NO SINGLE
> ALL-POWERFUL SCHEME implementation. Tradeoffs have to be made, unless
> you want to produce a mediocre one. Chicken (for example) will never
> beat Bigloo, in terms of raw performance, yet Bigloo's (or PLT's)
> continuations are awfully inefficient. Damn, it's even impossible to pin
> down a single perfect implementation strategy (Cheney-on-the-MTA? Direct
> style? Trampoline style? Bytecode VM? Threaded VM?). What GC?
> Conservative? Ref. counting? Stop-and-copy? Mark-and-sweep? Which is
> best? Or, more importantly, which is best for *all* applications? None,
> I'd say.
> 
> Several Scheme implementations are more than adequate for serious
> development and people use it for that. In fact, Schemes generally
> provide better performance and often have better foreign function
> interfaces than languages like Python, Ruby or Perl, which seem to be
> well accepted for serious stuff. Scheme is more rigorously defined, is
> better suited to compilation and provides incredibly powerful syntactic
> abstractions.
> 
> It *is* easy to get lost in the number of implementations, and many of
> those are somewhat half-finished, partly because it's so easy to whip up
> a simple Scheme, yet this has absolutely nothing to do with Scheme not
> being ready for development of real-world code.
> 
> 
> 
In my previous post I mentioned a threefold path to Nirvana.  Determine a
starting point, define an endpoint, get the machete ready.  To properly
select an implementation to evolve into Sifsad, it only makes sense to
select the implementation that is best to build off of.  There is a
distinct chance that the "best" implementation to move forward with is not
even one of the top 2 or 3 implementations used today.

So a priori, agreed, no debate: compromises must and will occur.  However,
I will debate a) whether it is possible to effectively determine which
tradeoffs to select IF the end goal is adequately defined, b) whether such
compromises can be ameliorated by modular code design, and c) whether such
tradeoffs inevitably result in mediocrity.

For example, the end goal is defined:
	- Speed of application.  Very important.
	- Efficient use of large amounts of memory.  Very important.
	- Full debugging.  Continuation restarts?
	- Core full-blown MOP.  Highly optimized dispatch.
	- Modules, standalone compilation, interfaces/signatures (also parametric
	interfaces/signatures) and runtime-determinable implementations.  [Imagine
	the SRFI-44 debate on the definition of a collections library in the light
	of SIG/UNITs, Scheme48/Chez interfaces, or SML sigs.]
	- Standalone, static exe capability.
	- Real multithreading capable of utilizing multiple processors.
	- ...and so on and so forth.

The point is, define the goal and tradeoffs become a debate in the context
of what is necessary to achieve the goal.

Another point: Larceny [which, as you correctly pointed out, is not just a
Scheme->C system; later tonight I intend to post on why proposing Larceny
makes sense] has 5-6 different GC systems.  The Larceny core is very
well designed and supports pluggable GC systems.  What is the penalty for
this flexibility?  I doubt the efficiency of the Twobit-compiled code is
impacted.  PLT also has 2 GC/VM systems.  Such things can be abstracted in
the code base to support multiple solutions and pluggability with minimal
impact.

Bottom line, I believe it IS possible to allow for flexible pluggable
strategies to many of the issues you raised such as various VM strategies.

Couldn't you, being well versed in the Cheney-on-the-MTA approach, either
show that this approach is decidedly superior to the MzScheme approach,
or that it is a must-have option in Sifsad, and then assist in adding it to
MzScheme?  (Assuming MzScheme makes sense as the base system.)

Must two or more! Scheme distributions exist, complete with different
runtimes and libraries, predicated on the single point of bifurcation as
to how continuation capture occurs??!!

In the context of doing comparative analysis via two small experimental
systems, yes.  In the world of the application developer, where the method
of continuation capture is invisible, it is decidedly not justification
for forking two full-blown Scheme systems.  Just capture the damn things, make
it stable, make it fast, and MAKE IT ONE Scheme System.  Thank you very
much.


Regards,


Ray
0
ray7279 (14)
10/27/2003 1:28:37 PM
> In the context of doing comparative analysis via two small experimental
> systems yes.  In the world of the application developer where the method
> of continuation capture is invisible, it is decidedly not justification
> for forking two blown Scheme systems.  Just capture the damn things, make
> it stable, make it fast and MAKE IT ONE Scheme System.  Thank you very
> much.

I hope I'm wrong, but it seems you have a simplified view of Scheme
architectures.  Continuation capture is probably *the* fundamental
feature that drives selecting the implementation strategy.  One cannot
have a modular continuation-capture implementation.  That's why systems
with slow call/cc are unlikely to get much better without rearchitecting
themselves at a low level.

If I'm using continuations heavily, I'm going to want to choose an
implementation with that property.  If I'm not using them at all, but I
demand high performance otherwise, then I'm likely to make a completely
different choice.  It's these sorts of trade-offs which make Sifsad a bad
idea.  You should ask yourself what the real problem is that prevents
serious application development.  I would argue that it's the lack of a
large (standard?, maybe) library.  This means covering things such as
usable GUI toolkits, extensive database connectivity, mature threading,
networking, data structures... the sort of things career programmers take
for granted from the platform libraries of C++ or Java.

The fallacy is believing this is only possible if we standardize on one 
Scheme.

	Scott

0
scgmille (240)
10/27/2003 2:21:52 PM
"Scott G. Miller" wrote:
> 
> > In the context of doing comparative analysis via two small experimental
> > systems yes.  In the world of the application developer where the method
> > of continuation capture is invisible, it is decidedly not justification
> > for forking two blown Scheme systems.  Just capture the damn things, make
> > it stable, make it fast and MAKE IT ONE Scheme System.  Thank you very
> > much.
> 
> I hope I'm wrong, but it seems you have a simplified view of Scheme
> architectures.  Continuation capture is probably *the* fundamental
> feature that drives selecting the implementation strategy.  One cannot
> have a modular continuation capture implementation.  Thats why systems
> with slow call/cc are unlikely to get much better without rearchitecting
> themselves at a low level.

Partly....  A scheme that compiles to a well-designed intermediate 
form could have two back-ends; one that heap-allocates and garbage 
collects call frames, and one that uses the hardware stack.  These 
back-ends would generate code that obeyed two different runtime 
models, but there's also a "tail" end -- peephole optimization of 
machine code -- that could be shared between them.  The runtime 
symbol table and associated code could also be shared between the 
two models.

So you'd wind up duplicating maybe half of a simple compiler to 
accommodate the fundamentally different designs.  And effort spent 
on the crankiest and most bottomless, nonportable areas -- machine 
code and cache optimization -- would be sharable. By the time 
you'd done aggressive optimizations and ported to a half-dozen 
different hardware/OS combinations, the duplicated effort might 
be a tenth or less of the compiler. 

From a compilation point of view, it's easy to scan scheme code and 
see if you can find places where call/cc is ever used.  You could 
make a first-order choice of which backend to invoke just by 
checking for it.  But the right thing to do would be to profile 
it at the intermediate-code level and make a hard assessment of 
which model is a "win" for the given program.  
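
Just to make that first-order check concrete, here's a rough sketch (my own
illustration, not anyone's actual compiler pass).  It assumes the program has
already been read in as a list of top-level s-expressions, and it deliberately
ignores shadowing, quoted data and macro-introduced uses, so all it can say is
"this program *might* capture continuations":

(define (uses-call/cc? form)
  ;; form: an s-expression, or a whole list of top-level forms
  (cond ((pair? form)
         (or (uses-call/cc? (car form))
             (uses-call/cc? (cdr form))))
        ((symbol? form)
         (or (eq? form 'call/cc)
             (eq? form 'call-with-current-continuation)))
        (else #f)))

;; (uses-call/cc? '((define (f) (+ 1 2))))                    => #f
;; (uses-call/cc? '((define (g) (call/cc (lambda (k) k)))))   => #t

Anything smarter than that really does want to happen at the intermediate-code
level, as described above.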

Much of what would need to be done can only be done in scheme as 
a result of whole-program optimization.  And that means getting 
program code away from the REPL, because as long as you have the 
REPL in the system, you absolutely cannot prove that something 
isn't going to be redefined or mutated.  It also means very 
serious support for optional declarations to eliminate unnecessary
typechecks and very serious support for memory and CPU profiling.

Finally, we really *really* need a linkable object file format 
that we don't have to go through an FFI for.  FFI's distort or 
contort the meaning of scheme code; they introduce special cases,
cause wraparound or length errors in integers, truncate complex 
numbers, create exceptions to garbage collection handling, and 
wreak all kinds of misfits with the runtime model. We paper over 
the problems reasonably well, but still they never quite work 
right. When scheme programs link to scheme libraries they shouldn't 
need to use braindead C calling conventions.


> The fallacy is believing this is only possible if we standardize on one
> Scheme.

I think maybe there needs to be a 'SISFAD' standard, above and 
beyond R5RS, that specifies a lot of things R5RS doesn't specify. 
I'd like to see a bunch of people implement it, much as a bunch
of people have implemented R5RS.  

A SISFAD standard would expressly forbid some of the things that 
make some schemes unusable for serious application development, 
like limits on the memory size (guile and MIT scheme have this 
problem particularly badly) and failure to support the full 
numeric tower.  It would specify a format for libraries portable
across all implementations of SISFAD, define which R5RS and other 
functions are found in what libraries, define a set of OS calls 
accessible through libraries, and straighten out a few things 
like binary I/O primitives for pipes, sockets and files.

It would specify the syntax of performance declarations, but the 
only requirement of implementations should be that they must not 
barf on the syntax -- actually using it for performance enhancement
is a plus, but not barfing on it is crucial.  
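
Strictly as illustration, and with declaration names I just made up (they are
not from any existing standard or implementation), I'm thinking of something
on the order of:

(declare (fixnum-arithmetic)             ; all arithmetic here fits in fixnums
         (no-interrupt-checks)           ; don't poll for interrupts in this code
         (inline vector-ref vector-set!))

(define (sum-vector v)
  (let loop ((i 0) (acc 0))
    (if (= i (vector-length v))
        acc
        (loop (+ i 1) (+ acc (vector-ref v i))))))

An implementation that understands the declarations can exploit them; one that
doesn't just has to read past the form without complaint.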

				Bear
0
bear (1219)
10/27/2003 5:30:59 PM
Ray Dillinger wrote:
> "Scott G. Miller" wrote:
> > The fallacy is believing this is only possible if we standardize
> > on one Scheme.
>
> I think maybe there needs to be a 'SISFAD' standard, above and
> beyond R5RS, that specifies a lot of things R5RS doesn't specify.
> I'd like to see a bunch of people implement it, much as a bunch
> of people have implemented R5RS.

This makes much more sense to me than "standardizing on one Scheme".

Of course, the first thing to be standardized has to be a better acronym
than SIFSAD!!

Anton



0
anton58 (1240)
10/27/2003 6:03:16 PM
Anton van Straaten wrote:
> Ray Dillinger wrote:
> 
>>"Scott G. Miller" wrote:
>>
>>>The fallacy is believing this is only possible if we standardize
>>>on one Scheme.
>>
>>I think maybe there needs to be a 'SISFAD' standard, above and
>>beyond R5RS, that specifies a lot of things R5RS doesn't specify.
>>I'd like to see a bunch of people implement it, much as a bunch
>>of people have implemented R5RS.
> 
> 
> This makes much more sense to me than "standardizing on one Scheme".
> 
> Of course, the first thing to be standardized has to be a better acronym
> than SIFSAD!!
> 

I've been speaking deliberately abstractly, but many of these topics 
were covered at Matthias Radestock's ILC presentation, and will likely 
come up again in some detail around the Scheme Workshop and LL3.  See 
you there!

	Scott

0
scgmille (240)
10/27/2003 7:19:04 PM
"Anton van Straaten" <anton@appsolutions.com> writes:

> Ray Dillinger wrote:
>> "Scott G. Miller" wrote:
>> > The fallacy is believing this is only possible if we standardize
>> > on one Scheme.
>>
>> I think maybe there needs to be a 'SISFAD' standard, above and
>> beyond R5RS, that specifies a lot of things R5RS doesn't specify.
>> I'd like to see a bunch of people implement it, much as a bunch
>> of people have implemented R5RS.
>
> This makes much more sense to me than "standardizing on one Scheme".

As far as I understand it, that's what SRFIs are about.  The existing
ones don't seem to me to go nearly far enough, though.  

Part of what makes (for example) Perl good is CPAN, and all the
conventions (and the resulting community) that make CPAN possible.  So
I can download a tarball, unpack it, run Makefile.PL using my chosen
Perl interpreter, and then "make; make test; make install" will work
(with high probability).

That's all made easier because Perl has a single implementation (give
or take), of course.  Even so, if there were a common FFI (even a
restricted one), and a few extra things (a common module and/or
package system, perhaps a common object system) something similar
could be built for Scheme.  

I'm guessing it won't happen, though.  I'm not sure quite what it is,
but something seems to prevent such cooperation.  

And that seems to mean that there isn't a scheme community in the same
way that there's a Perl community---so I can be confident of getting
Perl's LDAP package and being able to use it, but Bigloo's equivalent
<http://sourceforge.net/projects/bigloo-lib/> doesn't even build with
the current bigloo, presumably because bigloo's community is simply
too small.  (I found much the same with some RScheme libraries, and
doubtless the same is true of most scheme implementations.)
0
usenet44 (324)
10/27/2003 7:56:23 PM
Would you like a pony, too?
0
campbell1 (74)
10/27/2003 8:22:46 PM
On Mon, 27 Oct 2003 13:28:37 GMT, R Racine <ray@adelphia.net> wrote:

> In my previous post I mentioned a threefold path to Nirvana.  Determine a
> starting point, define an endpoint, get the mechete ready.  To properly
> select an implementation to evolve into Sifsad, it only makes sense to
> select an implementation that is best to build of off.  There is a
> distinct chance that the "best" implementation to move forward with is 
> not
> even one of the top 2 or 3 implementations used today.

Possible, *if* a Sifsad (geez, what an awful name! ;-) is possible
and practical, which I seriously doubt...

> For example,
> The end goal is defined.
> 	-Speed of application.  Very important. - Efficient use of large amounts
> 	of memory.  Very important.

No disagreement here.

> 	-Full debugging.  Continuation restarts  ???

But you want speed too, right? Ok, so have several optimization settings.

> 	-Core fullblown MOP.  Highly optimized dispatch.

Oh, how about speed? I assume a simple procedure call is more efficient
(whatever tricks your dynamic dispatch plays, it will not beat the
direct procedure call, naturally). Here you have your first tradeoff.
Why do you want OO baggage in the core, when you want speed at the same 
time?

> - Modules, standalone
> 	compilation, interfaces/signatures (also parametric
> interfaces/signatures) and runtime determinable implementations.  
> [Imagine
> the SRFI-44 debate on the definition of a collections library in the 
> light
> of a SIG/UNITs or Scheme48/Chez interfaces or SML sigs....

What kind of modules? How easy to use should they be? Should they
allow interactive use? Man, do you realize how much work has gone into
Scheme module systems, yet none really satisfies everybody!

> The point is, define the goal and tradeoffs become a debate in the 
> context
> of what is necessary to achieve the goal.

Yes, this is not new. People on c.l.s (and elsewhere) have debated these things
for decades now. Have they reached even the slightest bit of consensus?
No, they haven't. Why, I ask you.

>
> Another point, Larceny [as you correctly pointed out is not just a
> Scheme->C system, later tonight I intend to post on why proposing Larceny
> makes sense] has 5 - 6 different GC systems.  The Larceny core is very
> well designed and supports plugable GC systems.  What is the penalty for
> this flexibility?  I doubt the efficiency of the Twobit compiled code is
> impacted.  PLT also has 2 GC/VM systems.  Such things can be abstracted 
> in
> the code base to support multiple solutions and pluggability with minimal
> impact.

Absolutely. Yet, there are implementation strategies that are very tightly
coupled with their collectors. One example is Cheney-on-the-MTA, another
is "traditional" direct-style compilers that target C, which mostly use
conservative GC.

>
> Bottom line, I believe it IS possible to allow for flexible pluggable
> strategies to many of the issues you raised such as various VM 
> strategies.

Possible, yes. But not always adequate. I claim that the ideal Scheme
implementation you have in mind will be completely unusable for others.

>
> Couldn't you, being well versed on the Cheney-on-the-MTA approach either
> show that this approach is decidedly superior then the MzScheme approach
> or is a must have option in Sifsad and then assist in adding it to
> MzScheme?  (Assuming MzScheme makes sense as the base system.)

It doesn't (if I may say so). I wouldn't touch the MzScheme sources
unless physically forced to do so. That Cheney-on-the-MTA is superior
(to direct style, like Bigloo) is something that I'm firmly convinced
of. And? That doesn't matter to someone who isn't interested in anything
but raw speed of straight-line code. Tradeoffs, again.

>
> Must two or more! Scheme distributions exist, complete with different
> runtimes and libraries, predicated on the single point of bifurcation as
> to how continuation capture is occuring??!!

If you look carefully, you'll find many more differences than only
continuation capture. And capture is only *one* issue with continuations.
How about safe-for-space complexity? Reification? Storage consumption?

>
> In the context of doing comparative analysis via two small experimental
> systems yes.  In the world of the application developer where the method
> of continuation capture is invisible, it is decidedly not justification
> for forking two blown Scheme systems.  Just capture the damn things, make
> it stable, make it fast and MAKE IT ONE Scheme System.  Thank you very
> much.
>

Many people have tried to do so. Yet, the ideal Scheme system hasn't been
done yet.
If the unification of all Scheme implementation efforts is the really
important issue for you, then you effectively strive for mediocrity,
unless you happen to be a Scheme implementation wizard, vastly ahead
of all the others. Mind you, that would be nice!


cheers,
felix
0
felix4557 (46)
10/27/2003 10:42:24 PM
felix <felix@call-with-current-continuation.org> writes:

[...]

> Many people have tried to do so. Yet, the ideal Scheme system hasn't been
> done yet.
> If the unification of all Scheme implementation efforts is the really
> important issue for you, then you effectively strive for mediocrity,
> unless you happen to be a Scheme implementation wizard, vastly ahead
> of all the others. Mind you, that would be nice!

Probably true.  In that sense, Perl, Python, etc., are mediocre---some
reasonable uses of the languages are inefficient. 

On the other hand, if you've got a one-day sort of problem to solve
that requires access to LDAP, SSL, PostgreSQL, and gtk, then the
mediocre solutions win.

Heck, people have been writing reasonable size applications in Tcl for
years, largely because it had a very convenient binding to Tk.  tkman
(a *really* nice manpage reader) was first written (about 10 years
ago, apparently) when Tcl was a strongly string-based interpreter; the
author even wrote a paper about the various hackery he used to make it
fast enough (the files had non-essential spaces removed and ghastly
things like that).  

Even then, there were presumably choices that ought to have been
better (Tcl's far from a perfect language, and it was much worse in
1993); but Tcl had a convenient binding to Tk and an easy to use FFI,
and that was enough for it to be more usable for a large class of
applications.

For a big application, the work necessary to bind a few libraries is
dwarfed by the work necessary to attack the real problem.  However,
that leaves lots of little applications where you're naturally going
to choose a language which has lots of convenient packages.  Perhaps
more importantly, I suspect big applications often start off as small
ones---something like Perl makes it easier to start work on a problem.
0
usenet44 (324)
10/27/2003 11:23:55 PM
Bruce Stephens <bruce+usenet@cenderis.demon.co.uk> wrote:
> For a big application, the work necessary to bind a few libraries is
> dwarfed by the work necessary to attack the real problem.  However,
> that leaves lots of little applications where you're naturally going
> to choose a language which has lots of convenient packages.  Perhaps
> more importantly, I suspect big applications often start off as small
> ones---something like Perl makes it easier to start work on a problem.

Heck yeah. More than a few times, I've started a big project by writing
a prototype in Perl. More precisely, I try to hack it up in Perl, and if
that doesn't work, I do a better implementation in a more appropriate
language. As a bonus, the initial hack-job implementation gives me
enough experience with the problem domain that I can do a better design
for the "real" version.
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
My Usenet e-mail address is temporarily disabled.
Please visit my website to obtain an alternate address.
0
news6468 (51)
10/28/2003 12:06:08 AM
On Mon, 27 Oct 2003 08:21:52 -0600, Scott G. Miller wrote:

> I hope I'm wrong, but it seems you have a simplified view of Scheme
> architectures.

I do.  I represent the pitchfork-wielding, torch-waving, unwashed masses
of frustrated Scheme application developers.  And yes, maybe I am a mass of
one. (shades of a "silent" majority here)

I am not saying that Sifsad will have some trivial property flag and will
then suddenly manifest 3 modes of continuation capture.

I'm just saying that after a decade or two, is it unreasonable to suggest
that there have been enough experimental versions, and multiple approaches,
to reach a "reasonable" conclusion (not a perfect conclusion) with
regard to implementing continuation capture if one were to design Sifsad.

As one of the unwashed, I don't care how it's done; I am sure I wouldn't
understand the internals if I tried.  I can't slam-dunk a basketball
either.  So be it.

SML/NJ is fast (not the fastest, but commercially fast) and supports
continuations.  And no, I am not saying, do it just like SML/NJ.

Ray
0
ray7279 (14)
10/28/2003 12:12:31 AM
R Racine wrote:
> On Mon, 27 Oct 2003 08:21:52 -0600, Scott G. Miller wrote:

>>I hope I'm wrong, but it seems you have a simplified view of Scheme
>>architectures.

> I do.  I represent the pitch fork wielding, torch waving, unwashed masses
> of fustrated Scheme application developers.  And yes, maybe I am a mass of
> one. (shades of a "silent" majority here)

What is missing in DrScheme?

> I am not saying that Sifsad will have some trival property flag and will
> then suddenly manifest 3 modes of continuation capture.

> I'm just saying that after a decade or two, is it unreasonable to suggest
> that there has been enough experimental versions, and multiple approaches
> to the reach a "reasonable" conclusion (not a perfect conclusion) with
> regard to implementing continuation capture if one were to design Sifsad.

The Grand Unified Scheme is nothing but a dream. You will always need
to make compromises in implementations. That's why you ought to be 
thrilled about the wide range of Scheme implementations in existence.
In other languages (e.g. Python/Perl) you are pretty much stuck with
one implementation.

> As one of the unwashed, I don't care how its done, ...

That's a bold statement in these parts of the woods.

See the last discussion on the Grand Unified Scheme:

<http://groups.google.com/groups?hl=da&lr=&ie=UTF-8&th=5f1ec978a3e333dc&rnum=2>


Perhaps a better idea would be to begin making an FFI-SRFI?


-- 
Jens Axel Søgaard

0
usenet153 (246)
10/28/2003 12:27:42 AM
"R Racine" <ray@adelphia.net> writes:

> On Mon, 27 Oct 2003 08:21:52 -0600, Scott G. Miller wrote:
>
>> I hope I'm wrong, but it seems you have a simplified view of Scheme
>> architectures.
>
> I do.  I represent the pitch fork wielding, torch waving, unwashed
> masses of fustrated Scheme application developers.  And yes, maybe I
> am a mass of one. (shades of a "silent" majority here)

I'm sure you're not alone.

That's part of the problem: gathering a community of users seems much
easier when there's only one implementation.

But scheme (even if you add in slib and a selection of SRFIs) is small
enough that it's reasonably straightforward to produce an
implementation.  Certainly not *that* easy, but easy enough that there
seem to be about half a dozen implementations that aren't quite dead
yet.

[...]

> I'm just saying that after a decade or two, is it unreasonable to
> suggest that there has been enough experimental versions, and
> multiple approaches to the reach a "reasonable" conclusion (not a
> perfect conclusion) with regard to implementing continuation capture
> if one were to design Sifsad.

I'd say that STklos and guile are probably acceptable interpreters
(STklos is a byte-coding interpreter; I forget the details of guile),
and that bigloo and rscheme are probably pretty good compilers.  (I'm
judging implementations in terms of speed, popularity, whether I've
heard of them, etc.)

So it seems to me that not only do we have reasonable conclusions about
acceptable solutions, we have several.  Tom Lord's working on another,
and presumably there are other new ones being worked on, too.  (And
there are the other interpreters, native code/C compilers, and JVM and
.NET implementations, too.)

We don't lack choice.

> As one of the unwashed, I don't care how its done, I am sure I
> wouldn't understand the internals if I tried.  I can't slam dunk a
> basketball either.  So be it.
>
> SML/NJ is fast (not the fastest, but commercially fast) and supports
> continuations.  And no, I am not saying, do it just like SML/NJ.

Perhaps the best is to accept what's there, and to build prototypes
and so on with Perl (or Python, Ruby, etc.) and then (once you know
what you're trying to do) build it in your preferred scheme (or lisp).

That feels wrong, though.  I'd welcome a single (even if mediocre)
implementation of scheme that was generally regarded as the one to use
rather than Tcl, Perl, or Python.  (I guess guile is it, or perhaps
Scheme48, with the nice scsh, but I'd really like it to be a compiler;
I think the GNU project messed up there---I think they ought to have
chosen RScheme, or at least cooperated sufficiently that RScheme could
have been substituted later, but perhaps it wouldn't have made a
difference.)
0
usenet44 (324)
10/28/2003 12:42:23 AM
On Mon, 27 Oct 2003 23:42:24 +0100, felix wrote:
> Possible, *if* a Sifsad (geez, what an awful name! ;-) is possible and
> practical, which I seriously doubt...

The Sifsad name was chosen with the intent of it never seeing the light of
day in a real implementation.  But you have to admit googling Sifsad would
minimize the irrelevant.
 
> If the unification of all Scheme implementation efforts is the really
> important issue for you, then you effectively strive for mediocrity,
> unless you happen to be a Scheme implementation wizard, vastly ahead of
> all the others. Mind you, that would be nice!

The sad fact of Scheme life is that if I were a Scheme implementation
wizard, and we all know very well I am not, I would have already announced
Yet Another Scheme Implementation.  Math profs are anointed to generate new
Math profs.  Scheme implementation wizards seem destined to create
endless streams of Scheme implementations.  They are the
Sisyphuses of language implementors.  Doomed by the gods to endlessly
create half-finished implementations in isolation from one another.  

I am not proposing a GUS (Grand Unified Scheme).

Just a useful one.
0
ray7279 (14)
10/28/2003 12:50:57 AM
Jens Axel Søgaard <usenet@jasoegaard.dk> writes:

> R Racine wrote:
>> On Mon, 27 Oct 2003 08:21:52 -0600, Scott G. Miller wrote:
>
>>>I hope I'm wrong, but it seems you have a simplified view of Scheme
>>>architectures.
>
>> I do.  I represent the pitch fork wielding, torch waving, unwashed masses
>> of fustrated Scheme application developers.  And yes, maybe I am a mass of
>> one. (shades of a "silent" majority here)
>
> What is missing in DrScheme?

Bindings to Gtk/GNOME and other random useful libraries?  Speed?

Perhaps there are such bindings, and I just don't know where to look
for them.  It's true that speed isn't the main priority for the
DrScheme family, though, isn't it?

[...]

> The Grand Unified Scheme is nothing but a dream. You will always
> need to make compromises in implementations. That's why you ought to
> be thrilled about the wide range of Scheme implementations in
> existence.

Except that some implementations are virtually dead, and none have
quite the extensions that I want for this particular application...

> In other languages (e.g. Python/Perl) you are pretty much stuck with
> one implementation.

But that's OK, because although it is a compromise, it's a reasonable
one, and because there's only the one, there's an enormous library of
extensions and code that I can use.  There's lots of scheme code, too,
but each blob of code that I find will take a few hours of work to
massage to work with the implementation that I've chosen to use (with
its particular combination of module system and so on).

[...]

> Perhaps a better idea was to begin making an FFI-SRFI?

Probably.  On the other hand, if it were that easy, someone would
already have done it.
0
usenet44 (324)
10/28/2003 12:57:28 AM
Jens Axel Søgaard writes:
> R Racine wrote:
> > I'm just saying that after a decade or two, is it unreasonable to suggest
> > that there has been enough experimental versions, and multiple approaches
> > to the reach a "reasonable" conclusion (not a perfect conclusion) with
> > regard to implementing continuation capture if one were to design Sifsad.
>
> The Grand Unified Scheme is nothing but a dream. You will always need
> to make compromises in implementations. That's why you ought to be
> thrilled about the wide range of Scheme implementations in existence.
> In other languages (e.g. Python/Perl) you are pretty much stuck with
> one implementation.

I think it's interesting & relevant to look at the ways in which this is
*not* true.  First, there's Jython, which is a well-established
implementation of Python on the Java platform.  There's also the Psyco
compiler for Python, which is a kind of JIT compiler.  Then there are
implementations of both Python and Perl under way for .NET.

So I think it's possible that the much-vaunted single implementations of
some languages are merely an artifact of their youth.  Implementations will
multiply over time, because of the need to support significantly different
platforms, if nothing else.  The fact that Scheme has an amazing family of
implementations is an asset - but it also needs to do better at supporting
*reasonable* portability between at least some of those implementations.

Anton



0
anton58 (1240)
10/28/2003 1:02:18 AM
On Tue, 28 Oct 2003 00:50:57 GMT, R Racine <ray@adelphia.net> wrote:

> On Mon, 27 Oct 2003 23:42:24 +0100, felix wrote:
>> Possible, *if* a Sifsad (geez, what an awful name! ;-) is possible and
>> practical, which I seriously doubt...
>
> The Sifsad name was chosen with the intent of it never seeing the light 
> of
> day in a real implementation.  But you have to admit googling Sifsad 
> would
> minimize the irrelevant.

Absolutely.

>
>> If the unification of all Scheme implementation efforts is the really
>> important issue for you, then you effectively strive for mediocrity,
>> unless you happen to be a Scheme implementation wizard, vastly ahead of
>> all the others. Mind you, that would be nice!
>
> The sad fact of Scheme life is that if I were a Scheme implementation
> wizard, and we all know very well I am not, I would have already 
> announced
> Yet Another Scheme Implementation.  Math profs are annointed to generate 
> new
> Math profs.  Scheme implementation wizards seemed destined to create
> endless streams of Scheme implementation.  They are the
> Sysiphus' of language implementors.  Doomed by the gods to endlessly
> create half finished implementations in isolation from one another.

I wouldn't consider PLT (for example) half-finished.

>
> I am not proposing a GUS (Grand Unified Scheme).
>
> Just a useful one.
>

I can name several useful Scheme implementations. Just ask.
Many of those are used commercially and provide splendid FFIs and/or
extension libraries.
If Scheme implementations are insufficient for you, do it yourself.
But I don't think you will do any better than what is currently available,
since the major implementations take most known implementation strategies
pretty far.

Here's an idea: pick an implementation (unimportant which one), sit down
and start writing libraries for it (doesn't matter for what).
Then (and only then) will you really help make Scheme more usable for
real-world development.



cheers,
felix
0
felix4557 (46)
10/28/2003 1:19:02 AM
On Tue, 28 Oct 2003 01:27:42 +0100, Jens Axel Søgaard wrote:


> What is missing in DrScheme?
> 
> 
Not too much AFAIAC.  On a personal level, here are the top 3 things that
have blown me away in the Scheme impl world:

MIT Scheme: The ground breaking work done here.  You see MITScheme code,
concepts and ideas in many of the current Scheme implementations.  It
is/was the fountainhead.

PLT Scheme: An almost endless stream of what Scheme is capable of.
Units/Sigs, Languages, inheritable Structures, Contracts, the Syntax
concept, opaque types, the module system...  You can just randomly click
about the help system and almost stumble into whole new concepts.

Another example from MzScheme, from Eli's Swindle.  I saw that Swindle
had somehow added support for self-evaluating symbols which start with a
colon.  When I installed Swindle, I didn't recall any patching or
recompiling.  So hey, how'd he do that?  So I looked.

(module base mzscheme

(provide (all-from-except mzscheme
          #%module-begin #%top #%app define let let* letrec lambda))

..... stuff....

;;>> (#%top . id)
;;>This special syntax is redefined to make keywords (symbols whose names
;;>begin with a ":") evaluate to themselves.  Note that this does not
;;>interfere with using such symbols for local bindings.
(provide (rename top~ #%top))
(define-syntax (top~ stx)
  (syntax-case stx ()
    ((_ . x)
     (let ((x (syntax-object->datum #'x)))
       (and (symbol? x) (not (eq? x '||))
            (eq? #\: (string-ref (symbol->string x) 0))))
     (syntax/loc stx (#%datum . x)))
    ((_ . x) (syntax/loc stx (#%top . x)))))

.... stuff ...)

That was it! No special compiler hacking, reader hacking, any hacking at
all.  Just suck in the MzScheme language, extend what it means to be a
datum or a top-level symbol with a 7-line macro, and export a new
"extended" Scheme language with self-evaluating colon-prefixed symbols.
Not only that, I could use this extended Scheme, regular MzScheme, or yet
another variant on a controlled module-by-module basis. WOW
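
For instance (a sketch of my own; I'm assuming the module above is reachable
as Swindle's "base" module, so take the exact path with a grain of salt), a
module that opts into the extended language is just:

(module keyword-demo (lib "base.ss" "swindle")
  ;; :color and :size are ordinary expressions here and evaluate to themselves
  (define default-options (list :color 'red :size 12))
  (display default-options)
  (newline))

Everything outside that module still sees plain MzScheme.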

The Larceny Twobit Compiler:  IMHO the finest bits of Scheme code I have
ever beheld.  I have seen Scheme code which tackles far less lofty
targets than a highly optimizing, pluggable-emitter, native compiler and
is not a tenth as readable and elegant.  [BTW Sifsad should be based on
the Twobit compiler :)]

I digress.  What is missing in DrScheme?  Overall I love it.  Mainly a
Sifsad focus.  The system, DrScheme, has an intentional pedagogical focus.
My concerns (efficient memory usage, an optimized VM, speed, debugging) are
not their focus.  The mzc compiler is not on par with some of the other
Scheme->C systems out there.  Is there an inherent architectural tradeoff
which prevents mzc from approaching Chicken or Bigloo in speed?  Don't
know.  If two or three Scheme wizards announced this very night that they
were going to join the PLT team with a Sifsad-prioritized feature list, I
would do a handspring and take up organized religion.

What I find more troubling is some of the other Scheme wizards' disdain for
MzScheme from the aspect of a production-quality Scheme.  What is it that
THEY find missing in PLT?  Do they know something that we simple Joes do
not regarding the inner workings of MzScheme?

What is it that they see that prevents two major groups from focusing on the
PLT code base and providing two releases/versions of PLT: DrScheme and Sifsad?


Ray
0
ray7279 (14)
10/28/2003 2:05:01 AM
At Mon, 27 Oct 2003 23:42:24 +0100, felix wrote:
> 
> What kind of modules? How easy to use should they be? Should they
> allow interactive use? Man, do you realize how much work has gone into
> Scheme module systems, yet none really satisfies everybody!

Would it be too much to ask for a standard *syntax* to the module
system, without specifying the semantics?  No matter how many SRFIs or
libraries we write, if we can consistently load them into a program then
the same program can never run unmodified on two different Schemes.

Suppose we use a syntax encompassing all of the module-system concepts
in use now.  Something like

  (define-module <module-A>
    (use-module <module-B> [<procedure> ...])
    (use-syntax <module-C> [<syntax> ...])
    (autoload <module-D> [<procedure> ...])
    (export <procedure> ...)
    [(export-all)]
    )

  ... module code ...

as a preamble in a module file.  <procedure> may either be a symbol name
or a list of a symbol followed by optional type declarations, which a
Scheme that doesn't use type declarations can ignore.  If your Scheme
doesn't differentiate between importing syntax and importing procedures
then the use-module and use-syntax forms are the same.  Likewise if your
Scheme doesn't support autoloading then that too is equivalent to
use-module.  export-all means export all top-level definitions in the
module, and this could probably be optional (since it's handy for
prototyping but when your module is "finished" and ready for use it's
better style to explicitly declare your exports).
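
To sketch what adopting it might cost an implementor, here is one purely
illustrative way a host Scheme could accept the preamble with plain
syntax-rules, collapsing the distinctions it doesn't care about.  %import and
%export are stand-ins for whatever native forms the host already has; none of
these names are a real proposal:

(define-syntax define-module
  (syntax-rules (use-module use-syntax autoload export export-all)
    ;; base case: nothing left to translate; bind the name as a marker
    ((_ name) (define name 'name))
    ;; this host doesn't distinguish procedure/syntax imports or autoloading,
    ;; so all three clauses map onto the same native import form
    ((_ name (use-module m p ...) clause ...)
     (begin (%import m p ...) (define-module name clause ...)))
    ((_ name (use-syntax m p ...) clause ...)
     (begin (%import m p ...) (define-module name clause ...)))
    ((_ name (autoload m p ...) clause ...)
     (begin (%import m p ...) (define-module name clause ...)))
    ((_ name (export p ...) clause ...)
     (begin (%export p ...) (define-module name clause ...)))
    ;; export-all: this host exports everything by default, so just drop it
    ((_ name (export-all) clause ...)
     (define-module name clause ...))))

A Scheme with a real module system would instead expand the whole form into
its native module wrapper; the point is only that the surface syntax stays
the same across implementations.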

There are issues to be resolved but I don't believe it's impossible to
at least make the syntax work for all the major module systems out
there.  The question is, if a SRFI were to be created that specified a
syntax like the above, would Scheme implementations support it?

-- 
Alex

0
foof (110)
10/28/2003 2:29:28 AM
"Anton van Straaten" <anton@appsolutions.com> writes:

> I think it's interesting & relevant to look at the ways in which this is
> *not* true.  First, there's Jython, which is a well-established
> implementation of Python on the Java platform.  There's also the Psyco
> compiler for Python, which is a kind of JIT compiler.  Then there are
> implementations of both Python and Perl under way for .NET.
> 
> So I think it's possible that the much-vaunted single implementations of
> some languages are merely an artifact of their youth.  

Indeed, isn't that what happened with Stackless Python?  My
understanding is that for a while, Stackless created an Avignon vs
Rome situation in the Python community.  The noise over Stackless
seems to have subsided, but it seems likely Parrot will have
continuations, which means the debate will have to reopen.  And I
believe Tismer and others are now working on something called PyPy,
which means yet another implementation...

Shriram
0
sk1 (223)
10/28/2003 3:24:26 AM
Alex Shinn <foof@synthcode.com> writes:

> Would it be too much to ask for a standard *syntax* to the module
> system, without specifying the semantics?  

This is a troll, right?  I'd expect more from a regular like Alex...

>                                             No matter how many SRFIs or
> libraries we write, if we can consistently load them into a program then
> the same program can never run unmodified on two different Schemes.

I think you mean "...if we cannot consistently...".  What does it mean
to load consistently in the absence of a semantics?

> Suppose we use a syntax encompassing all of the module-system concepts
> in use now.  [...]

Doesn't encompass units.

Shriram
0
sk1 (223)
10/28/2003 3:31:08 AM
On Tue, 28 Oct 2003 02:19:02 +0100, felix wrote:


> I can name several useful Scheme implementations. Just ask.

Few.  Very few have had success writing substantive applications in
Scheme.  Of those few, the majority have been or still are on some endless
merry-go-round of trying it on this impl and then that.  I expect most
give up and use C#, Java, SML, CL or Haskell.

To not recognize that there is an implementation "issue" with Scheme, one that
is impacting its adoption in the real world, hurting retention of the few
application-level coders it has, and constraining a substantial and broad
library code base from forming, is ... I don't know.  A shame.

> Many of those are used commercially and provide splendid FFIs and/or
> extension libraries.
> If Scheme implementations are insufficient for you, do it yourself. But
> I don't think you will do any better than what is currently available,
> since the major implementations take most known implementation
> strategies pretty far.
> 
> Here's an idea: pick an implementation (unimportant which one), sit down

Therein lies the crux.  I have been claiming it does matter.  Ever try using
bigloo-lib's GTK bindings in XYZ impl?  Or grabbing Schematics' SchemeUnit
for use in ABC impl?  Non-starters.  Sure, you can spend a couple of days
porting it to whatever your current impl of choice is.  Then you get to do it
again when the library code has a new version released.

> and start
> writing libraries for it (doesn't matter for what).

My efforts are diluted.  Because 49 other library writers are writing
libraries for some other impl.

> Then (and only then) you really will help making Scheme more usable for
> real-world development.

<sigh>Knew this one was coming eventually.  No comment.</sigh>


Ray
0
ray7279 (14)
10/28/2003 3:37:23 AM
Shriram Krishnamurthi wrote:
> Alex Shinn <foof@synthcode.com> writes:
>
> > Would it be too much to ask for a standard *syntax* to the module
> > system, without specifying the semantics?
>
> This is a troll, right?  I'd expect more from a regular like Alex...

Maybe Alex means something like a standard module declaration syntax which
maps to a minimal set of sufficiently similar semantics on different
Schemes.  Which seems like it could be a workable idea, to me.

> >                                             No matter how many SRFIs or
> > libraries we write, if we can consistently load them into a program then
> > the same program can never run unmodified on two different Schemes.
>
> I think you mean "...if we cannot consistently...".  What does it mean
> to load consistently in the absence of a semantics?

I dunno, Perl seems to manage!  ;)

> > Suppose we use a syntax encompassing all of the module-system concepts
> > in use now.  [...]
>
> Doesn't encompass units.

Standardizing something on the level of units isn't going to happen, I'm
sure.  But I think a lowest-common denominator module system, which would
support writing portable modular code and publishing portable libraries,
would be helpful.

Sure, that won't allow taking an arbitrary whiz-bang library from
implementation A and plugging it in to implementation B, but that's not the
point.  The point, I think, would be to build up the base a bit further, in
a direction that supports some of these pragmatic issues that we're all
aware of - so that there's a plausible portable base for application and
library developers to develop to, if they choose.

Anton



0
anton58 (1240)
10/28/2003 4:12:00 AM
> Jens Axel Søgaard <usenet@jasoegaard.dk> writes:
>> What is missing in DrScheme?

Bruce Stephens <bruce+usenet@cenderis.demon.co.uk> wrote:
> Bindings to Gtk/GNOME and other random useful libraries?  Speed?

It used to have a Gtk binding, and supposedly there's a new one in the
works. I'm not too worried about that, though; the wxWindows binding is
pretty good and probably more portable. A GNOME binding would be a dead
end, portability-wise. The ability to write GUI apps for Windows and X
(without paying a ton of money or relying on Cygnus) was actually *the*
major selling point for PLT, for me.

> Perhaps there are such bindings, and I just don't know where to look
> for them.  It's true that speed isn't the main priority for the
> DrScheme family, though, isn't it?

Apparently not, but that's not necessarily a bad thing. Portability,
robustness, ease of use, and a killer development environment seem to be
the main goals, and those things sell. And it's not like PLT is *slow*
-- it just isn't C, that's all. It compares favorably with other
interpreted languages.

BTW, the development environment was actually a drawback for me -- I'm a
hardcore vim & Makefiles kinda guy. (In fact, I wrote comprehensive vim
syntax-highlighting rules for PLT Scheme. I was originally supposed to
take over maintenance/development from the original author, but I never
got around to finishing and publishing my rules, because there were some
performance issues that I never quite worked out.)
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
My Usenet e-mail address is temporarily disabled.
Please visit my website to obtain an alternate address.
0
news6468 (51)
10/28/2003 4:40:43 AM
Anton van Straaten <anton@appsolutions.com> wrote:
> Maybe Alex means something like a standard module declaration syntax
> which maps to a minimal set of sufficiently similar semantics on
> different Schemes.  Which seems like it could be a workable idea, to
> me.

Agreed. Some folks might rankle at some of the necessary restrictions,
though. For example, you couldn't count on shadowing/redefining imported
identifiers like you can at the top level; some Schemes (like Scheme-48)
support that, but others (like PLT) don't, and for good reasons.

I was actually toying with the idea of implementing modules as FEATURE,
based on the requirements syntax of SRFI-7. However, I decided that
wasn't quite the right way to do it. More on this later if I actually
find time to implement something useful.
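
For anyone who hasn't looked at it, an SRFI 7 program description reads
roughly like this (the feature names, file names, and run-everything are all
made up for illustration, and I'm quoting the clause names from memory):

(program
  (requires srfi-1 srfi-13)              ; features the program needs
  (files "collections.scm" "main.scm")   ; source files to load
  (code (display (run-everything))       ; run-everything is hypothetical
        (newline)))
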
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
My Usenet e-mail address is temporarily disabled.
Please visit my website to obtain an alternate address.
0
news6468 (51)
10/28/2003 5:43:26 AM
At 27 Oct 2003 22:31:08 -0500, Shriram Krishnamurthi wrote:
> 
> Alex Shinn <foof@synthcode.com> writes:
> 
> > Would it be too much to ask for a standard *syntax* to the module
> > system, without specifying the semantics?
> 
> This is a troll, right?

It's not a troll, though perhaps it's not expressed clearly and
certainly isn't completely thought out.

> > No matter how many SRFIs or
> > libraries we write, if we can consistently load them into a program then
> > the same program can never run unmodified on two different Schemes.
> 
> I think you mean "...if we cannot consistently...".

Yes, sorry.

> What does it mean to load consistently in the absence of a semantics?

Not complete absence but a sort of minimal assumption.  Consider every
SRFI that has a reference implementation, every module I see browsing
/usr/lib/plt/collects/mzlib/, the C-parser just posted to c.l.s., and
countless utility modules from all the Scheme implementations.  Many of
them are written in highly portable Scheme, which can be made more
portable with further SRFIs and standardization.  However, at the
beginning of every one is a little incantation that says "this is a
module" with some extra information about what modules it uses and what
procedures it provides.  If we just standardize on the syntax of that
incantation then there suddenly becomes the chance that a module written
in one Scheme would work out-of-the-box on another Scheme.  More
complicated semantics, module-introspection, etc. would still not be
portable.

> > Suppose we use a syntax encompassing all of the module-system concepts
> > in use now.  [...]
> 
> Doesn't encompass units.

From the MzScheme manual:

  In some ways, a unit resembles a module (see Chapter 5 in PLT
  MzScheme: Language Manual), but units and modules serve different
  purposes overall.

I would only suggest this for modules, not units.

-- 
Alex

0
foof (110)
10/28/2003 6:48:19 AM
Bradd W. Szonye wrote:
> Anton van Straaten <anton@appsolutions.com> wrote:
> > Maybe Alex means something like a standard module declaration syntax
> > which maps to a minimal set of sufficiently similar semantics on
> > different Schemes.  Which seems like it could be a workable idea, to
> > me.
>
> Agreed. Some folks might rankle at some of the necessary restrictions,
> though. For example, you couldn't count on shadowing/redefining imported
> identifiers like you can at the top level; some Schemes (like Scheme-48)
> support that, but others (like PLT) don't, and for good reasons.

It would still be better than the restrictions imposed by coding to R5RS, or
some mixture of R5RS+SRFIs+SLIB.  Sure, you can use SLIB's modules, or
Taylor Campbell's lexmod, or roll your own modules, but all of these have
disadvantages which could (I believe) be addressed by some relatively
minimal implementation support for a standard "simple" module system.

Anton



0
anton58 (1240)
10/28/2003 6:48:52 AM
"Bradd W. Szonye" wrote:
> 
> Anton van Straaten <anton@appsolutions.com> wrote:
> > Maybe Alex means something like a standard module declaration syntax
> > which maps to a minimal set of sufficiently similar semantics on
> > different Schemes.  Which seems like it could be a workable idea, to
> > me.
> 
> Agreed. Some folks might rankle at some of the necessary restrictions,
> though. For example, you couldn't count on shadowing/redefining imported
> identifiers like you can at the top level; some Schemes (like Scheme-48)
> support that, but others (like PLT) don't, and for good reasons.
> 
> I was actually toying with the idea of implementing modules as FEATURE,
> based on the requirements syntax of SRFI-7. However, I decided that
> wasn't quite the right way to do it. More on this later if I actually
> find time to implement something useful.

I've been thinking about writing a portable "module mangler."

It would read from disk a bunch of scheme files with some kind 
of standard module syntax, and output a single honkin-large 
scheme file (maybe in a temporary directory) that puts them 
all together with separate namespaces kept separate, and 
strictly-controlled scope for macros, and so on.  

So you could do development in a bunch of different files and
be confident of putting them all together in one program with 
a well-defined semantics, regardless of implementation.  

It would answer namespace and macrology-scope issues, but it 
would never answer the separate-compilation issue.  Even so, 
it might attract enough of a following to standardize a 
module syntax, especially if distributed with a bunch of 
good libraries.
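
To make the renaming part concrete, here is a deliberately naive sketch
(illustration only, not the tool): it prefixes every top-level definition
with its module's name so that separately written files can be
concatenated without their definitions clashing.  A real mangler would
also have to rewrite references and deal with macros, which is where the
hard work is.

  (define (prefix-symbol prefix sym)
    (string->symbol
     (string-append (symbol->string prefix) ":" (symbol->string sym))))

  ;; forms is the list of top-level forms read from one module's file.
  (define (rename-top-level-defines module-name forms)
    (map (lambda (form)
           (if (and (pair? form) (eq? (car form) 'define))
               (let ((target (cadr form)))
                 (if (pair? target)
                     ;; (define (f args ...) ...) style
                     `(define (,(prefix-symbol module-name (car target))
                               ,@(cdr target))
                        ,@(cddr form))
                     ;; (define x ...) style
                     `(define ,(prefix-symbol module-name target)
                        ,@(cddr form))))
               form))
         forms))

  ;; (rename-top-level-defines 'utils '((define (square x) (* x x))))
  ;;   => ((define (utils:square x) (* x x)))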

What do people think of the idea?

			Bear
0
bear (1219)
10/28/2003 6:49:52 AM
Alex Shinn <foof@synthcode.com> wrote in message news:<87vfqap01j.wl@strelka.synthcode.com>...
> At Mon, 27 Oct 2003 23:42:24 +0100, felix wrote:
> > 
> > What kind of modules? How easy to use should they be? Should they
> > allow interactive use? Man, do you realize how much work has gone into
> > Scheme module systems, yet none really satisfies everybody!
> 
> Would it be too much to ask for a standard *syntax* to the module
> system, without specifying the semantics?  No matter how many SRFIs or
> libraries we write, if we can consistently load them into a program then
> the same program can never run unmodified on two different Schemes.
>
>[...]
> 
> There are issues to be resolved but I don't believe it's impossible to
> at least make the syntax work for all the major module systems out
> there.  The question is, if a SRFI were to be created that specified a
> syntax like the above, would Scheme implementations support it?

It's easy: submit a SRFI, and you'll have a good chance of being
able to discuss the relevant questions with the relevant people
(or those who are interested in solving these issues).


cheers,
felix
0
felix1812 (33)
10/28/2003 7:38:58 AM
Shriram Krishnamurthi <sk@cs.brown.edu> wrote in message news:<w7dn0bmavth.fsf@cs.brown.edu>...
> "Anton van Straaten" <anton@appsolutions.com> writes:
> 
> > I think it's interesting & relevant to look at the ways in which this is
> > *not* true.  First, there's Jython, which is a well-established
> > implementation of Python on the Java platform.  There's also the Psyco
> > compiler for Python, which is a kind of JIT compiler.  Then there are
> > implementations of both Python and Perl under way for .NET.
> > 
> > So I think it's possible that the much-vaunted single implementations of
> > some languages are merely an artifact of their youth.  
> 
> Indeed, isn't that what happened with Stackless Python?  My
> understanding is that for a while, Stackless created an Avignon vs
> Rome situation in the Python community.  The noise over Stackless
> seems to have subsided, but it seems likely Parrot will have
> continuations, which means the debate will have to reopen.  And I
> believe Tismer and others are now working on something called PyPy,
> which means yet another implementation...
> 
> Shriram

I think you have got the wrong impression. The concept of "different
implementation" in the Python world is completely different from the
concept of "different implementation" in the Scheme world. 

Somebody saying "Python has only one implementation" wouldn't be far from 
the truth. There is only ONE implementation that matters, which is
CPython. All the others implementors strive to get as close as
possible to CPython. The minimal compatibility is 99%.
Different implementations provide something more and are intented to 
be used in specific situations (you want to script Java, use Jython, 
you want to skip the C-stack restriction, use stackless) but they are 
in no sense competitors of CPython. If the PyPy project succeeds
(and everybody hope so in the community, including Guido van Rossum) 
we will have a faster Python, but it will still be 99.99% compatible
with CPython. At least this is ideal goal of the developers, as I
understand their claims (and I think I do understand them).

I do think Perl/Python/Ruby succeed because they are basically one-man
projects. Of course, there are hundreds of Python developers, but only one
has the last word when essential decisions for the language have to be
taken: Guido van Rossum.
It is also interesting to notice that Guido's ideas are *really* respected in
the community, more respected than you could imagine. Also, a lot of people in
the Python community are practical programmers and not language designers
or academics: this makes a big difference. Let me give a trivial
example: a large minority in the community regularly rants about the fact
that the list .sort() method returns None and not the sorted list. Now, nobody
will *ever* think about making a new implementation correcting this "wart"
(personally, I don't think it is a wart, by the way, at least in the context
of Python). It would be considered foolish to make an implementation which
does the same things Python already does in a different way. Implementations
are free to add, NOT to change. Essentially the idea is "okay, this is a
wart in my opinion, but I will live with it, because forking the community
would be much worse than correcting the wart".

My postings here made me realize that the Scheme community is very
different from the Perl/Python/Ruby communities: a Pythonista has
no difficulty in accepting a BDFL (Benevolent Dictator For Life),
no difficulty in trading performance for ease of use, no difficulty
in accepting a bondage & discipline syntax (actually a rather large
minority would appreciate an even stricter bondage & discipline syntax!).
I could give other examples, but you get the idea. 

Notice that I am not saying that one approach is better than one other:
there are trade-offs. If you chose the one-implementation way you have
advantages (even big advantages), if you choose the way of freedom you
have other advantages (which may be considered even bigger by some).

I've got the impression that there is no way the Perl/Python/Ruby
model will ever work in the Scheme community, for historical and
sociological reasons. This can be considered good (for some reasons) or
bad (for other reasons).

What I (as an outsider to the community) would appreciate is: 

1. make a stricter R5RS (not very strict, but stricter than now);

2. make more SRFIs (many more);

3. make them available on every implementation.

These points are (maybe/maybe not) in the range of realizable things; I don't
think I will ever see a unique (unique in the Python sense) implementation of
Scheme; one could even argue that this is a good thing, BTW.

For the time being, you Schemers are stuck with Perl/Python/Ruby; if
this is of any consolation, think that it could have been worse (i.e.
Java/C++ ;)


          Michele Simionato
0
mis6 (224)
10/28/2003 8:43:02 AM
"R Racine" <ray@adelphia.net> wrote in message news:<pan.2003.10.28.03.36.37.871148@adelphia.net>...
> 
> Few.  Very few, have had success writing substantive applications in
> Scheme.  Of those few, the majority, have or still are on some endless
> merry-go-round of trying it on this impl and then that.  I expect most
> give up and, use C#, Java, SML, CL or Haskell.

Any numbers? You seem to be quite convinced of that. Is Haskell
really used more heavily for substantive applications than Scheme?
Or are you just guessing, since the respective communities appear
more unified?

If C#, Java or CL give you what you want, go ahead, use it.
Personally C#, Java, SML or Haskell don't give me the stuff I need. Neither
does CL, actually.

> 
> To not recognize that there is an implementation "issue" with Scheme that
> is impacting its adoption in the realworld, retention of the few
> application level coders it has and constraining a substantial and broad
> library code base from forming is ... I don't know.  A shame.

Stop whining. You are trying to blame the wrong people. It's a shame
that you think you're entitled to make any demands. If Scheme (or better,
the available implementations) don't (doesn't) serve your needs, fine.
Fix it or try alternatives. Have you tried Common LISP? This might
be exactly what you need. I'm serious.

> 
> > Many of those are used commercially and provide splendid FFIs and/or
> > extension libraries.
> > If Scheme implementations are insufficient for you, do it yourself. But
> > I don't think you will do any better than what is currently available,
> > since the major implementations take most known implementation
> > strategies pretty far.
> > 
> > Here's an idea: pick an implementation (unimportant which one), sit down
> 
> Therein lies the crux.  I have been claiming it is.  Ever try using
> Bigloo-libs GTK bindings in XYZ impl.  Or grabbing Schematics SchemeUnit
> for use in ABC impl.  Non starters.  Sure you can spend a couple of days
> porting it to whatever your current impl of choice.  Then you get to do it
> again when the library code has a new version released.
> 
> > and start
> > writing libraries for it (doesn't matter for what).
> 
> My efforts are diluted.  Because 49 other library writers are writing
> libraries for some other impl.

There are not; you are wildly exaggerating. It *is* possible to write
cross-implementation libraries (see srfi.schemers.org for a couple
of examples), and it is even possible to write libraries for things
like GTK, with a little bit of pre-/post-processing, macros, careful use
of lexical scope and clean design.

(Now it's your turn to start whining about why nobody did this for you already)

This discussion painfully reminds me of the ever-popular cl-is-great-but-
if-it-just-had-this-extension drivel that comes up regularly on
comp.lang.lisp. Yet, it hasn't changed anything.

But we probably won't come to any useful conclusion here.

I will now go to comp.lang.python and complain about the fact
that there is no extension that provides macros, precise space-and-time
efficient GC and tail-call-optimization, all requirements that I find 
very important for serious application development. 
I wonder what they will tell me...?


cheers,
felix
0
felix1812 (33)
10/28/2003 9:01:13 AM
Jens Axel Søgaard <usenet@jasoegaard.dk> wrote in message news:<3f9db83e$0$70001$edfadb0f@dread12.news.tele.dk>...
> R Racine wrote:
> > On Mon, 27 Oct 2003 08:21:52 -0600, Scott G. Miller wrote:
>  
> >>I hope I'm wrong, but it seems you have a simplified view of Scheme
> >>architectures.
>  
> > I do.  I represent the pitch fork wielding, torch waving, unwashed masses
> > of fustrated Scheme application developers.  And yes, maybe I am a mass of
> > one. (shades of a "silent" majority here)
> 
> What is missing in DrScheme?

For me, a major gap is Unicode and multibyte character support. This
is by now standard in implementations of most other widely used
programming languages but surprisingly few Schemes have it.

--
Grzegorz
0
grzegorz1 (80)
10/28/2003 9:39:30 AM
"Bradd W. Szonye" <bradd+news@szonye.com.invalid> writes:

[...]

> Apparently not, but that's not necessarily a bad thing. Portability,
> robustness, ease of use, and a killer development environment seem
> to be the main goals, and those things sell. And it's not like PLT
> is *slow* -- it just isn't C, that's all. It compares favorably with
> other interpreted languages.

Yes, I agree with all that.  I'd just like some language which had
reasonable portability, ease of use, etc., and had the option of
blinding speed, at least on common platforms.  And that doesn't seem
to me to be impossible---there are various very fast scheme
implementations around.  It's just that the various scheme
implementations seem to stay just far enough apart in various respects
(FFI, mostly) that using more than one of them is inconvenient.

[...]

0
usenet44 (324)
10/28/2003 10:21:31 AM
Grzegorz Chrupala wrote:
> Jens Axel Søgaard <usenet@jasoegaard.dk> wrote in message news:<3f9db83e$0$70001$edfadb0f@dread12.news.tele.dk>...
> 
>>R Racine wrote:
>>
>>>On Mon, 27 Oct 2003 08:21:52 -0600, Scott G. Miller wrote:
>>
>> 
>>
>>>>I hope I'm wrong, but it seems you have a simplified view of Scheme
>>>>architectures.
>>
>> 
>>
>>>I do.  I represent the pitch fork wielding, torch waving, unwashed masses
>>>of fustrated Scheme application developers.  And yes, maybe I am a mass of
>>>one. (shades of a "silent" majority here)
>>
>>What is missing in DrScheme?
> 
> 
> For me, a major gap is Unicode and multibyte character support. This
> is by now standard in implementations of most other widely used
> programming languages but surprisingly few Schemes have it.

There is a reason for that.  The R5RS character operators cannot be made 
to work reliably with unicode characters.  SISC for example supports 
unicode characters and arbitrary character maps, but makes no effort to 
contort the standard operators to behave properly.  There was a usenet 
discussion about this in the past which you could probably find by googling.

	Scott

0
scgmille (240)
10/28/2003 4:27:25 PM
"Scott G. Miller" <scgmille@freenetproject.org> writes:

> Grzegorz Chrupala wrote:
>> Jens Axel Søgaard <usenet@jasoegaard.dk> wrote in message news:<3f9db83e$0$70001$edfadb0f@dread12.news.tele.dk>...

[...]

>>>What is missing in DrScheme?
>> For me, a major gap is Unicode and multibyte character support. This
>> is by now standard in implementations of most other widely used
>> programming languages but surprisingly few Schemes have it.
>
> There is a reason for that.  The R5RS character operators cannot be
> made to work reliably with unicode characters.  SISC for example
> supports unicode characters and arbitrary character maps, but makes no
> effort to contort the standard operators to behave properly.  There
> was a usenet discussion about this in the past which you could
> probably find by googling.

I couldn't find it.  I did searches under comp.lang.scheme for
unicode, utf8, utf-8, and most of the threads seemed positive (giving
implementations that support unicode in some form).  I didn't see any
threads showing fundamental problems.
0
usenet44 (324)
10/28/2003 5:26:24 PM
Bruce Stephens wrote:
> "Scott G. Miller" <scgmille@freenetproject.org> writes:
>
> > Grzegorz Chrupala wrote:
> >> Jens Axel Søgaard <usenet@jasoegaard.dk> wrote in message
news:<3f9db83e$0$70001$edfadb0f@dread12.news.tele.dk>...
>
> [...]
>
> >>>What is missing in DrScheme?
> >> For me, a major gap is Unicode and multibyte character support. This
> >> is by now standard in implementations of most other widely used
> >> programming languages but surprisingly few Schemes have it.
> >
> > There is a reason for that.  The R5RS character operators cannot be
> > made to work reliably with unicode characters.  SISC for example
> > supports unicode characters and arbitrary character maps, but makes no
> > effort to contort the standard operators to behave properly.  There
> > was a usenet discussion about this in the past which you could
> > probably find by googling.
>
> I couldn't find it.  I did searches under comp.lang.scheme for
> unicode, utf8, utf-8, and most of the threads seemed positive (giving
> implementations that support unicode in some form).  I didn't see any
> threads showing fundamental problems.

Perhaps you didn't make the proper offerings to the Great God Google...

Dunno if it's what Scott was thinking of, but here's a post in which Bear
describes some issues with Unicode & R5RS:
http://groups.google.com/groups?selm=3D753365.6BE29F0E%40sonic.net
Some of the earlier and later posts in that thread are also relevant.

Anton



0
anton58 (1240)
10/28/2003 5:55:05 PM
Anton van Straaten wrote:
> Bruce Stephens wrote:
> 
>>"Scott G. Miller" <scgmille@freenetproject.org> writes:
>>
>>
>>>Grzegorz Chrupala wrote:
>>>
>>>>Jens Axel Søgaard <usenet@jasoegaard.dk> wrote in message
> 
> news:<3f9db83e$0$70001$edfadb0f@dread12.news.tele.dk>...
> 
>>[...]
>>
>>
>>>>>What is missing in DrScheme?
>>>>
>>>>For me, a major gap is Unicode and multibyte character support. This
>>>>is by now standard in implementations of most other widely used
>>>>programming languages but surprisingly few Schemes have it.
>>>
>>>There is a reason for that.  The R5RS character operators cannot be
>>>made to work reliably with unicode characters.  SISC for example
>>>supports unicode characters and arbitrary character maps, but makes no
>>>effort to contort the standard operators to behave properly.  There
>>>was a usenet discussion about this in the past which you could
>>>probably find by googling.
>>
>>I couldn't find it.  I did searches under comp.lang.scheme for
>>unicode, utf8, utf-8, and most of the threads seemed positive (giving
>>implementations that support unicode in some form).  I didn't see any
>>threads showing fundamental problems.
> 
> 
> Perhaps you didn't make the proper offerings to the Great God Google...
> 
> Dunno if it's what Scott was thinking of, but here's a post in which Bear
> describes some issues with Unicode & R5RS:
> http://groups.google.com/groups?selm=3D753365.6BE29F0E%40sonic.net
> Some of the earlier and later posts in that thread are also relevant.
> 

Nah, it's not his fault, I couldn't find it either (the above is not what
I recall).  I'll try and dig up the reference; it may not have been on
usenet.

	Scott

0
scgmille (240)
10/28/2003 6:11:11 PM
On Tue, 28 Oct 2003 02:05:01 GMT, R Racine <ray@adelphia.net> wrote:
> On Tue, 28 Oct 2003 01:27:42 +0100, Jens Axel Søgaard wrote:
>> What is missing in DrScheme?

> What I find more troubling is some of the other Scheme wiz's disdain for
> MzScheme from the aspect of a production quality Scheme.  What is it that
> THEY find missing in PLT? Do they know something that we simple Joes do
> not regarding the inner workings of MzScheme?

Well

1) I'm not a Scheme 'wiz' for any value of 'wiz'
2) I like PLT

but I don't use it. And haven't for quite a while (like since early v200).
There are a few reasons for this, some rational and some less so:

1) it's just not fast enough. I do Data Mining and IR applications in
    Scheme and I'm starving for CPU cycles, even on my 2Ghz+ machines

2) it was a pain to make fast. The notion of 'standalone executable', while
    ostensibly supported, involved a complete rebuild of the PLT core

3) I write daemons and command-line programs and don't need GUI bells and
    whistles; if I did, PLT would be right up there. Although I'm pretty
    excited about SCX/Scsh, and I found programming raw XLIB under Stalin
    to have a perverse attraction as well...

4) The unit system was impressive ... and intimidating. And I hated all
    the extra punctuation I saw floating around inside of PLT's naming
    conventions

5) MrSpidey can't handle big enough programs - and I *really* wish it did.
    In fact, if MrSpidey could handle 15KLOC+ programs I would probably
    start to make the effort to move back to PLT for pre-production
    development. but did I mention that it's not fast enough for my crippled
    486/133 at home?

6) the v200 release b0rk3d my PLT code base and the performance wasn't good
    enough for me to abandon Gambit & Larceny (which my code also ran on
    since I have put a lot of effort into a portable Scheme programming
    infrastructure)

7) I'm really attached to Scsh's adaptation of Posix to Scheme. Where PLT
    has diverged, I haven't actually found it any better.

8) PLT's library is very big...and very inbred so I can't easily chop off
    parts of it to use under other, faster, Scheme implementations. So
    programming in PLT becomes a painful exercise in figuring out how to
    implement the PLT signatures for my production platforms.

9) PLT is a pain to install. I'm sure that the PLT folks don't think so,
    but I haven't been able to get a fully-working install for quite a while
    now. It doesn't use configure/make to build and it is very finicky about
    file locations. Given that I *usually* need to have a multi-platform
    environment I find the lack of flexibility in PLT's installation very
    irritating.

10) Very good alternatives to PLT also exist...specifically Gambit (gets
     my vote for best all-round), Larceny (if only all the world was SPARC),
     Bigloo (great for speed assuming you can live with its limits). And
     Stalin, which is fast fast fast, but slow slow slow to compile.

Even though I am obsessed with performance, please understand that PLT is,
I think, the second-fastest interpreter out there (Petite Chez is #1). And
remember that I *do* like many things about PLT, even if it doesn't come
out when I'm whingeing. In fact, I am planning to use PLT to teach my kids
programming.

david rush
-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
0
kumo7543 (108)
10/28/2003 6:53:55 PM
On Tue, 28 Oct 2003 11:29:28 +0900, Alex Shinn <foof@synthcode.com> wrote:
> At Mon, 27 Oct 2003 23:42:24 +0100, felix wrote:
>>
>> What kind of modules? How easy to use should they be? Should they
>> allow interactive use? Man, do you realize how much work has gone into
>> Scheme module systems, yet none really satisfies everybody!
>
> Would it be too much to ask for a standard *syntax* to the module
> system, without specifying the semantics?

I don't think you realize just how outrageous this statement is.

Nevertheless I have been writing a GUMS (Grand Unified Module System) for
several *years* now, based on the theory that all module systems can be
modelled as source-to-source compilers which produce a single source 
module.
It works, for certain values of 'work', and if I was an academic I could
probably find the time to finish and polish and add the major missing
module languages to it. If you want to help, please contact me privately
(this is a serious offer). The project is on SourceForge at

	http://mangler.sourceforge.net

but be warned, building it is only straightforward for me (even with the
instructions page, I imagine), and since I have no users, I tend to get
a bit sloppy about maintaining pieces of it. This has turned out to be a
rather larger project than I thought it would be when I started, if only
because maintaining the library is a necessity I didn't foresee.

david rush
-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
0
kumo7543 (108)
10/28/2003 7:08:48 PM
On Tue, 28 Oct 2003 06:49:52 GMT, Ray Dillinger <bear@sonic.net> wrote:
> "Bradd W. Szonye" wrote:
> I've been thinking about writing a portable "module mangler."

Ray - I've been working on this for years. That's what S2 is all
about. It does work, but I just don't have the time to keep the
docs (and libs) up to date.

> It would read from disk a bunch of scheme files with some kind
> of standard module syntax, and output a single honkin-large
> scheme file (maybe in a temporary directory) that puts them
> all together with separate namespaces kept separate, and
> strictly-controlled scope for macros, and so on.

That's exactly what I do. I've got the hooks in for alpha-renaming
top-level symbols, but I've never had the need to fully productize
the code. You want to help? I'll happily help you get your first
builds going (bootstrapping the animal is a bit tricky).

> What do people think of the idea?

obviously I think it's brilliant. I just have a day job so my version
seems doomed to live in the twilight of my needs...

http://mangler.sourceforge.net

david rush
-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
0
kumo7543 (108)
10/28/2003 7:13:59 PM
On Tue, 28 Oct 2003 10:21:31 +0000, Bruce Stephens 
<bruce+usenet@cenderis.demon.co.uk> wrote:
> "Bradd W. Szonye" <bradd+news@szonye.com.invalid> writes:
> implementations around.  It's just that the various scheme
> implementations seem to stay just far enough apart in various respects
> (FFI, mostly) that using more than one of them is inconvenient.

What I do about that is I plonk the FFI-specific parts of the code
into cond-expand blocks. It seems to work pretty well for me anyway, but
then I'm generally not going much beyond POSIX.
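
For instance, something along these lines (the feature identifiers and
branch bodies are only placeholders -- real code would call each
implementation's own FFI there; the sketch keeps the bodies portable so
it stays runnable anywhere cond-expand / SRFI 0 is supported):

  (cond-expand
    (chicken
     (define (posix-backend) "chicken: call its C FFI here"))
    (gambit
     (define (posix-backend) "gambit: call its c-lambda bindings here"))
    (else
     (define (posix-backend) "portable fallback: no FFI available")))

  (display (posix-backend))
  (newline)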

david rush
-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
0
kumo7543 (108)
10/28/2003 7:26:11 PM
Anton van Straaten <anton@appsolutions.com> wrote:
> Dunno if it's what Scott was thinking of, but here's a post in which Bear
> describes some issues with Unicode & R5RS:
> http://groups.google.com/groups?selm=3D753365.6BE29F0E%40sonic.net
> Some of the earlier and later posts in that thread are also relevant.

That article deals with Unicode support in Scheme code. There's also the
issue of Unicode support for data. The former problem is thornier than
the latter, because supporting Unicode in Scheme code includes all the
problems of Unicode in data *plus* the special considerations necessary
for a case-insensitive programming language.

Bear's overview is good, but he missed an alternative:

Use the Unicode algorithms for case-folding equivalence. When the result
is ambiguous, signal an error. Give the programmer a way to resolve
ambiguities. Example:

    A program written in German contains the identifiers "masse" and
    "ma�e." If only one of the two identifiers is in scope, "MASSE"
    refers to the one that's in scope. If both are in scope, "MASSE" is
    ambiguous.

How does a programmer resolve the ambiguity? The simpler method is to
simply disallow ambiguous uses. The programmer must not use "MASSE" when
both "masse" and "ma�e" are in scope. A more sophisticated method could
allow a way to specify which identifier "MASSE" is supposed to be
equivalent to.
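
In code, the lookup discipline for the simpler method might look
something like this toy sketch. Note that fold-id below only folds case
with char-downcase; a real Unicode case folding (the kind that maps ß to
"ss") is what would actually make "MASSE" collide with both "masse" and
"maße." The point is only the resolution policy: zero matches is an
unbound error, one match resolves, more than one is an ambiguity error.

  ;; error is not R5RS, but virtually every implementation provides it.
  (define (fold-id sym)
    (string->symbol
     (list->string
      (map char-downcase (string->list (symbol->string sym))))))

  (define (matches-for name scope)
    (cond ((null? scope) '())
          ((eq? (fold-id (car scope)) (fold-id name))
           (cons (car scope) (matches-for name (cdr scope))))
          (else (matches-for name (cdr scope)))))

  (define (resolve name scope)
    (let ((ms (matches-for name scope)))
      (cond ((null? ms) (error "unbound identifier:" name))
            ((null? (cdr ms)) (car ms))
            (else (error "ambiguous identifier:" name ms)))))

  ;; (resolve 'FOO '(foo))     => foo
  ;; (resolve 'FOO '(foo Foo)) => error: ambiguous identifier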

Unfortunately, this can violate the principle of least surprise. Suppose
that only "maße" is in scope. The programmer writes

    (lambda (MASSE) ... maße ...)

intending to bind MASSE but not maße. Unfortunately, this method shadows
the free variable "maße" because it's "unambiguous." I don't expect that
this would be a common problem, but it would be nasty when it did
happen. And case-folding isn't the only situation where that comes up.
For example, consider the words "resume" and "résumé" under English
collation rules. Depending on context, they may or may not be the same
word. There's a more general problem here: Identifiers that are
ambiguous even without case transformations.

Sometimes, identifiers are ambiguous even when they're spelled
identically. For example, try writing a "resume" (curriculum vitae)
class with "resume" (coroutine yielding) semantics. Oops, there's an
identifier collision! That's a thorny problem all on its own, and
locale-dependent identifiers just make it thornier.

Any identifier clash will tend to violate the principle of least
surprise. Case-folding and other locale-dependent forms of equivalence
just make it more surprising. To the human eye, the addition of accents
sufficiently disambiguates "resume" and "résumé," but to a compiler,
they're just as ambiguous as they are without the accents. That mismatch
between what the human sees and what the machine sees is what adds to
the surprise.

I understand why Bear chose the resolution he did -- simply don't permit
any ambiguous characters -- but unfortunately it doesn't address the
underlying problem.

Of course, even if you can deal with that problem, there's still the
problem of combining code from two different languages, with different
concepts of "equivalent symbols"! It's not too surprising that many
languages just punt on this issue and say, "Different spellings mean
different symbols."
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
My Usenet e-mail address is temporarily disabled.
Please visit my website to obtain an alternate address.
0
news6468 (51)
10/28/2003 8:17:47 PM
Bradd W. Szonye wrote:
> Use the Unicode algorithms for case-folding equivalence. When the result
> is ambiguous, signal an error. Give the programmer a way to resolve
> ambiguities. Example:
> 
>     A program written in German contains the identifiers "masse" and
>     "maße." If only one of the two identifiers is in scope, "MASSE"
>     refers to the one that's in scope. If both are in scope, "MASSE" is
>     ambiguous.

Use of Unicode characters in identifier names is largely irrelevant.
Unicode is essential in many applications such as NLP or XML processing,
but it is mainly needed to deal with *data* (characters, strings, symbols),
not identifier names. The potential ambiguity between maße and MASSE as a
variable name is a non-issue. For variable-name case folding, just use
standard Unicode case mapping, where (char-upcase #\ß) is just #\ß, and be
done with it.

It is red herrings such as the above that mislead people into thinking that 
Unicode support on a basic level is more complicated than it really is. 
-- 
Grzegorz
http://pithekos.net
0
grzegorz1 (80)
10/28/2003 9:18:01 PM
Grzegorz Chrupała <grzegorz@pithekos.net> wrote:
> Use of Unicode characters in indentifier names is largely irrelevant.

That's why I initially mentioned the difference between Unicode support
for data and Unicode support for program code (e.g., identifiers). The
rest of my article was in response to Bear's earlier discussion of the
latter.

> For variable-name case folding, just use standard Unicode case
> mapping, where (char-upcase #\ß) is just #\ß and be done with it.

Is that actually true? If so, I'd consider that a defect in Unicode,
because the correct spelling of "capital esszed" is "SS." And besides,
case-folding is only part of the problem, because it's only one example
of different but equivalent spellings.

> It is red herrings such as the above that mislead people into thinking
> that Unicode support on a basic level is more complicated than it
> really is. 

Unicode support for data is fairly tricky on its own. Many languages
choose not to complicate things by applying the data rules to code. For
example, C++ permits a wide variety of Unicode characters in data and in
code, but it does not attempt locale-dependent equivalence for code --
every different spelling is a different identifier.

However, Schemers like it when the same rules apply to code and data
both. Also, programmers in any case-insensitive language like it when
identifiers "do the right thing" in non-English languages. That's why
any discussion of extended character sets is likely to stray into a
discussion of identifier equivalence.
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
My Usenet e-mail address is temporarily disabled.
Please visit my website to obtain an alternate address.
0
news6468 (51)
10/28/2003 9:52:29 PM
Bradd W. Szonye wrote:

> Grzegorz Chrupała <grzegorz@pithekos.net> wrote:
>> For variable-name case folding, just use standard Unicode case
>> mapping, where (char-upcase #\ß) is just #\ß and be done with it.
> 
> Is that actually true? If so, I'd consider that a defect in Unicode,
> because the correct spelling of "capital esszed" is "SS." And besides,
> case-folding is only part of the problem, because it's only one example
> of different but equivalent spellings.

The basic, non-locale-dependent case mapping of ß is ß. There is a
SpecialCasing table which deals with characters such as ß where case
mappings are not simple 1-to-1 character correspondences.
(http://www.unicode.org/Public/UNIDATA/)
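
As a tiny illustration of the difference (the table below holds just the
one entry, and it assumes the reader accepts a #\ß character literal,
which R5RS does not promise):

  ;; Simple (1-to-1) mapping: char-upcase leaves ß alone.
  ;; Full mapping from a special-casing table: ß uppercases to "SS".
  (define special-upcase-table
    '((#\ß . "SS")))

  (define (string-upcase-full s)
    (apply string-append
           (map (lambda (c)
                  (let ((special (assv c special-upcase-table)))
                    (if special
                        (cdr special)
                        (string (char-upcase c)))))
                (string->list s))))

  ;; (string-upcase-full "maße") => "MASSE"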

> 
> Unicode support for data is fairly tricky on its own. Many languages
> choose not to complicate things by applying the data rules to code. For
> example, C++ permits a wide variety of Unicode characters in data and in
> code, but it does not attempt locale-dependent equivalence for code --
> every different spelling is a different identifier.
> 
> However, Schemers like it when the same rules apply to code and data
> both. Also, programmers in any case-insensitive language like it when
> identifiers "do the right thing" in non-English languages. That's why
> any discussion of extended character sets is likely to stray into a
> discussion of identifier equivalence.

"Doing the right thing" in the general case, in a fully locale sensitive way 
is indeed complicated, if at all possible. IMO the rules for identifiers as
should be well-defined and simple as well as consitent with treatment of
strings on the basic level, i.e. they should use the general, non-locale
dependent case-mappings. 
When dealing with data one could choose to use more refined,
locale-dependent mappings, algorithms etc as needed.

As I see it, it is enough if the core language provides core
Unicode-compatible functionality, including a way to read and write UTF-8
and UTF-16 encoded text, distinguish characters and bytes, get the length
of a string in characters and bytes, provide standard Unicode
case mappings, sorting, and Unicode-aware standard character predicates
such as char-whitespace?, etc. Anything beyond that can be more or less
easily added in libraries or defined by the user as needed.
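
As an example of the character/byte distinction, counting the UTF-8 size
of a string takes only a few lines, assuming (nonportably) that
char->integer yields a Unicode code point; the sizing rule itself is just
standard UTF-8:

  (define (utf8-length-of-char c)
    (let ((cp (char->integer c)))
      (cond ((< cp #x80) 1)        ; ASCII range
            ((< cp #x800) 2)
            ((< cp #x10000) 3)
            (else 4))))

  (define (string-utf8-length s)
    (let loop ((i 0) (n 0))
      (if (= i (string-length s))
          n
          (loop (+ i 1) (+ n (utf8-length-of-char (string-ref s i)))))))

  ;; (string-length "maße")      => 4 characters
  ;; (string-utf8-length "maße") => 5 bytes (ß takes two bytes)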

Cheers,
-- 
Grzegorz
http://pithekos.net
0
grzegorz1 (80)
10/28/2003 11:05:58 PM
> On Tue, 28 Oct 2003 02:05:01 GMT, R Racine <ray@adelphia.net> wrote:
>> What I find more troubling is some of the other Scheme wiz's disdain
>> for MzScheme from the aspect of a production quality Scheme.  What is
>> it that THEY find missing in PLT? Do they know something that we simple
>> Joes do not regarding the inner workings of MzScheme?

[2nd response attempt.  1st Vaporwared, I guess.]

I did not phrase that very well.  Sorry.

Here are a couple of ways to get to Sifsad.

Axiom
-----
There is a general consensus amongst the Scheme user/application
developer community with the direction in which the PLT group has expanded
Scheme beyond R5RS with regard to units, modules, general libraries, etc.
Across the board, taking everything into account, we (most, though not all,
general Scheme users and app developers) are pretty happy with the PLT
philosophy of Scheme.

However, there is also recognition that PLT fails to deliver the speed
necessary for it to become the mainstream implementation for general
application development.


Strawman Proposal #1
--------------------

The Scheme community joins in to assist PLT with aggressively enhancing
mzc performance to be on par with the bulk of the other Scheme->C
compilers.

	- This assumes that mzc can be significantly improved, as the main
priorities of the PLT group are teaching and language research, resulting
in the current overall efficiency of mzc not being what it could be.
	- This is what I was attempting to address above.  When I have
previously proposed this plan to "wizs", those with the technical chops to
work with compiler optimizations, the response can best be categorized as
"can't be done" without much follow-on detail.  (Beyond continuation
capture efficiency, which is not a killer for something that is targeting
most general application development needs.)

Strawman Proposal #2
--------------------

Assuming that the PLT system cannot be improved to the performance levels
of other Scheme->C systems because the basic architecture of the PLT
system was based on priorities other than speed, the Scheme community
adopts an existing, fundamentally strong, fast Scheme->C implementation
with the goal of attaining 100% code compliance with PLT (as far as is
reasonable).

The goal would be to have near identical collections shared between the
two implementations.  "Write once, run on both", as it were.


I claim either one of these solutions would be a bit of a godsend to the
majority of bread-and-butter Scheme users who would like to use Scheme for
general application development.

Of course those with "specialized" applications will choose alternate
implementations that emphasize aspects vital to their application.


Ray
0
ray7279 (14)
10/28/2003 11:42:02 PM
Bruce Stephens wrote:
> Jens Axel Søgaard <usenet@jasoegaard.dk> writes:
>>What is missing in DrScheme?

> Bindings to Gtk/GNOME and other random useful libraries?  

What are the benefits of Gtk/GNOME over the current portable (Windows,
Unix, Macintosh) GUI already in DrScheme?


Which other useful libraries are you thinking about?

 > Speed?

I wanted to hear what mzscheme misses compared to Perl/Python
and as far as I can tell, mzscheme has no problem in the speed
department.

[The existence of the *very* fast Scheme compilers does not imply
that mzscheme is slow]

> Perhaps there are such bindings, and I just don't know where to look
> for them.  It's true that speed isn't the main priority for the
> DrScheme family, though, isn't it?

Yes - but that doesn't mean it is slow.

>>In other languages (e.g. Python/Perl) you are pretty much stuck with
>>one implementation.

> But that's OK, because although it is a compromise, it's a reasonable
> one, and because there's only the one, there's an enormous library of
> extensions and code that I can use.  

Then find a Scheme that makes the same compromises as in Python/Perl
and use that. Ignore the rest.

> There's lots of scheme code, too,
> but each blob of code that I find will take a few hours of work to
> massage to work with the implementation that I've chosen to use (with
> its particular combination of module system and so on).

My experience is that the authors often are willing to do the porting,
if they are asked.

>>Perhaps a better idea was to begin making an FFI-SRFI?
> 
> 
> Probably.  On the other hand, if it were that easy, someone would
> already have done it.

I didn't say it was easy. Far from it. Lars Hansen has done some legwork,
though.

-- 
Jens Axel Søgaard

0
usenet153 (246)
10/28/2003 11:44:49 PM
Bradd W. Szonye wrote:

> BTW, the development environment was actually a drawback for me -- I'm a
> hardcore vim & Makefiles kinda guy. (In fact, I wrote comprehensive vim
> syntax-highlighting rules for PLT Scheme. I was originally supposed to
> take over maintenance/development from the original author, but I never
> got around to finishing and publishing my rules, because there were some
> performance issues that I never quite worked out.)

?

Why didn't you just ignore DrScheme and used mzscheme?

-- 
Jens Axel Søgaard


0
usenet153 (246)
10/28/2003 11:46:11 PM
R Racine wrote:
> On Tue, 28 Oct 2003 01:27:42 +0100, Jens Axel Søgaard wrote:

>>What is missing in DrScheme?

> Not too much AFAIAC.  On a personal level if I list the top 3 things that
> have blown me away in the Scheme impl world:
> 
> MIT Scheme: The ground breaking work done here.  You see MITScheme code,
> concepts and ideas in many of the current Scheme implementations.  It
> is/was the fountainhead.
> 
> PLT Scheme: An almost endless stream of what Scheme is capable of.
> Unit/Sigs, Languages , inheritable Structures, Contracts, the Syntax
> concept, opaque types, module system ...  You can just randomly click
> about the help system and almost stumble into whole new concepts.
> 
> Another example from MzScheme.  From Eli's Swindle.  I saw that Swindle
> had somehow added support for self evaluating symbols which start with a
> colon.  When I installed Swindle, I didn't recall any patching or
> recompiling.  So hey, how'd he do that?  So I looked.

[Very clever example]

Yes, I also love the very high level of flexibility.
It is perfect for defining new languages without having
to write a compiler from scratch.

> I digress.  What is missing in DrScheme? Overall I love it. Mainly a
> Sifsad focus.  The system, DrScheme, has a intensional pedalogical focus.
> My concerns, efficient memory usage, optimized VM, speed, debugging are
> not their focus.  

I don't agree that debugging is not in focus. Part of a pedagogical
environment is producing precise error messages for the user.

Specifically DrScheme has

   - stack traces
   - arrows on top of the source to show calling sequence
   - syntax coloring of live code
   - a tool for building test suites
   - an algebraic stepper (mostly for beginners though)

> The mzc compiler is not on par with some of the other
> Scheme->C systems out there.  Is there an inherant architectural tradeoff
> which prevents mzc from approaching Chicken or Bigloo with speed.  Don't
> know.  If two or three Scheme wizs announced this very night that they
> were going to join the PLT team with a Sifsad prioritized feature list.  I
> would do a hand spring and take up organized religion.

If you compare the speed of mzc executables to Perl and Python what are
your conclusions?

-- 
Jens Axel Søgaard

0
usenet153 (246)
10/28/2003 11:55:03 PM
David Rush wrote:
> On Tue, 28 Oct 2003 02:05:01 GMT, R Racine <ray@adelphia.net> wrote:
>> On Tue, 28 Oct 2003 01:27:42 +0100, Jens Axel Søgaard wrote:
>>> What is missing in DrScheme?

> but I don't use it. And haven't for quite a while (like since early v200).
> There are a few reasons for this, some rational and some less so:

[Relevant speed reasons snipped - I am interested in the other reasons]

> 5) MrSpidey can't handle big enough programs - and I *really* wish it did.
>    In fact, if MrSpidey could handle 15KLOC+ programs I would probably
>    start to make the effort to move back to PLT for pre-production
>    development. but did I mention that it's not fast enough for my crippled
>    486/133 at home?

I have actually never tried MrSpidey - but you can't seriously
list that as a reason, since the competing languages don't have
similar tools.

> 7) I'm really attached to Scsh's adaptation of Posix to Scheme. Where PLT
>    has diverged, I haven't actually found it any better.

POSIX. That would indeed be a good thing to have better support for.

> 8) PLT's library is very big...and very inbred so I can't easily chop off
>    parts of it to use under other, faster, Scheme implementations. So
>    programming in PLT becomes a painful exercise in figuring out how to
>    implement the PLT signatures for my production platforms.

Again. I am narrowmindedly comparing to Perl/Python today, so that
doesn't apply.

> 9) PLT is a pain to install. I'm sure that the PLT folks don't think so, 
> but
>    but I haven't been able to get a fully-working install for quite a while
>    now. It doesn't use configure/make to build and it is very finicky about
>    file locations. Given that I *usually* need to have a multi-platform
>    environment I find the lack of flexibility in PLT's installation very
>    irritating.

Hm. A valid concern.

> Even though I am obsessed with performance, please understand that PLT is
> I think the second-fastest interpreter out there (Petite Chez is #1). And
> remember that I *do* like many things about PLT, even if it doesn't come
> out when I'm whingeing. In fact, I am planning to use PLT to teach my kids
> programming.

How old are they? You could start by showing them the turtles in 
DrScheme. That's great fun.

-- 
Jens Axel Søgaard

0
usenet153 (246)
10/29/2003 12:00:15 AM
> Bradd W. Szonye wrote:
>> BTW, the development environment was actually a drawback for me --
>> I'm a hardcore vim & Makefiles kinda guy.

Jens Axel Søgaard <usenet@jasoegaard.dk> wrote:
> Why didn't you just ignore DrScheme and used mzscheme?

That's what I do.
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
My Usenet e-mail address is temporarily disabled.
Please visit my website to obtain an alternate address.
0
news6468 (51)
10/29/2003 12:09:31 AM
Bradd W. Szonye wrote:
>>Bradd W. Szonye wrote:

>>>BTW, the development environment was actually a drawback for me --
>>>I'm a hardcore vim & Makefiles kinda guy.

> Jens Axel Søgaard <usenet@jasoegaard.dk> wrote:
>>Why didn't you just ignore DrScheme and used mzscheme?

> That's what I do.

So what's the drawback?

-- 
Jens Axel Søgaard

0
usenet153 (246)
10/29/2003 12:15:56 AM
On Wed, 29 Oct 2003 00:55:03 +0100, Jens Axel Søgaard wrote:

> If you compare the speed of mzc executables to Perl and Python what are
> your conclusions?

Hands down, mzc wins.  However, I do not consider Python and Perl serious
application development languages.  In the arena of scripting and small or
one-off applications, IMHO mzc/mzscheme is clearly superior.  No contest.

But I would like to see mzc/mzscheme move from champ of the middleweight
division to heavyweight contender.  For me this means aggregate
benchmark suite performance on par (let's say within a factor of 2x)
with SML/NJ, CMUCL, or C++, and most of the other Scheme->C systems.

Anecdotal story.  Recently, while the "Coins" discussion was taking place
on c.l.s, the author of one of the major Scheme->C systems was on the
#scheme IRC.  I believe both of us were surprised at how competitive mzc was
vs. a well-respected Scheme->C system.  Mzc didn't win but did well.  (I
believe the GMP bindings for large exacts and how cleverly large exacts
are implemented in MzScheme account for its very respectable showing.)

I expect (guessing here) mzc would be less competitive on boyer.scm for
example.

Ray

0
ray7279 (14)
10/29/2003 12:56:59 AM
R Racine wrote:

Some large company located near the northwestern corner of the continental US
has sponsored Will Clinger (Larceny) and PLT to create a merger of the two 
Scheme systems not unlike a mix of the strawman proposals that you have put up
below.

The specific plan is as follows:
  - Will and some others are retargeting Larceny to the intermediate language of
    said company's virtual machine. Will has been calling this project Common
    Larceny.
  - Joe Marshall and some others are porting MrEd to said company's toolbox.
    The result could be a MrEd that's almost completely in Scheme.
    Will is certainly encouraging us to think of Scheme as a systems language.
    Eli's arrival has strengthened this goal even more.
  - Once we have a joint Scheme, we are hoping to retarget it to other platforms.

How realistic is the plan? Producing Larceny was a two-man effort. It's a fast,
reliable R5RS implementation with a few extra goodies. It is particularly 
well-suited for the research ideas that Will wishes to pursue.

PLT Scheme is a many people, many years effort. Matthew (mzscheme), Robby
(drscheme), Shriram (zodiac, server, libs), Cormac (mrspidey), Philippe (mrflow
  = mrspidey successor), Paul Steckler (myster, sister, mzcom), John (the foot,
and soon a debugger), Paul Graunke (the server, soon to be managed by Greg), 
Scott (parser tools), and countless others who are working and/or have worked on 
bits and pieces of the tool suite, not to mention their "day jobs". It is an 
expensive product.

Merging the two projects is not an easy task. It won't be done quickly. If
people really want a top-notch product, however, it may be the route to go.
If you have time or money to contribute, or you want to volunteer friends, please 
do so. The goal is to produce a good platform for the first Schemers and the
rest of the world, too.

-- Matthias


>>On Tue, 28 Oct 2003 02:05:01 GMT, R Racine <ray@adelphia.net> wrote:
>>
>>>What I find more troubling is some of the other Scheme wiz's disdain
>>>for MzScheme from the aspect of a production quality Scheme.  What is
>>>it that THEY find missing in PLT? Do they know something that we simple
>>>Joes do not regarding the inner workings of MzScheme?
> 
> 
> [2nd response attempt.  1st Vaporwared, I guess.]
> 
> I did not phrase that very well.  Sorry.
> 
> Here are couple of way to get to Sifsad.
> 
> Axiom
> -----
> There is a general consensus amoungst the Scheme User/Application
> developer community with the direction the PLT group has expanded Scheme
> beyond R5RS with regard to Units, modules, general libraries etc... Across
> the board, taking everything into account we, most, not all, general
> scheme users and app developers, are pretty happy with PLT philosophy of
> Scheme.
> 
> However, there is also recognition that PLT fails to deliver the speed
> necessary for it to become the mainstream implementation for general
> application development.
> 
> 
> Strawman Proposal #1
> --------------------
> 
> The Scheme Community joins in to assist PLT with aggressively enhancing
> mzc performance to be on par with the bulk with the other Scheme -> C
> compilers.
> 
> 	- This assumes that mzc can be significantly improved as the main
> priorities of the PLT group are teaching and language research resulting
> in the current overall efficiency of mzc not being what it could be.
> 	- This is what I was attempting to address above.  When I have
> previously proposed this plan to "wizs", those with the technical chops to
> work with compiler optimizations, the response can be best categorized as
> "can't be done" without much follow on detail.  (beyond continuation
> capture efficency, which is not a killer for something that is targeting
> most general application developement needs)
> 
> Strawman Proposal #2
> --------------------
> 
> Assuming that the PLT system cannot be improved to the performance levels
> of other Scheme -> C systems because the basic architecture of the PLT
> system was based on other priorities then speed, the Scheme community
> adopts an existing, fundamentally strong, fast Scheme -> C implementation
> with the goal of attaining 100% code compiliance with PLT (as is
> reasonable).
> 
> The goal would be to have near identical collections shared between the
> two implementations.  "Write once, run on both", as it were.
> 
> 
> I claim either one of these solutions would be a bit of godsend to the
> majority of bread and butter Scheme users who would like to use Scheme for
> general application development.
> 
> Of course those with "specialized" applications will chose alternate
> implementations that emphasize aspects vital to their application.
> 
> 
> Ray

0
10/29/2003 1:04:48 AM
Bradd wrote:
>>>> BTW, the [PLT] development environment was actually a drawback for
>>>> me -- I'm a hardcore vim & Makefiles kinda guy.

Jens Axel Søgaard <usenet@jasoegaard.dk> wrote:
>>> Why didn't you just ignore DrScheme and used mzscheme?

>> That's what I do.

> So what's the drawback?

I've gotten the impression that some of the cool debugging and error
reporting features are only available in DrScheme. And in general, I've
gotten the impression that a lot of effort goes into developing the GUI
specifically rather than into improving the suite overall. That's a
bummer for me -- they're creating stuff that I can't make full use of,
because their "showcase" tool is incompatible with my work habits.

It's not a huge drawback, and it's obviously not stopping me from using
PLT, but I would like to see more "hooks" (or documentation on how to
use those tools outside of the GUI).
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
My Usenet e-mail address is temporarily disabled.
Please visit my website to obtain an alternate address.
0
news6468 (51)
10/29/2003 1:08:04 AM
"Anton van Straaten" <anton@appsolutions.com> writes:

[...]

> Perhaps you didn't make the proper offerings to the Great God Google...

Quite possibly.

> Dunno if it's what Scott was thinking of, but here's a post in which Bear
> describes some issues with Unicode & R5RS:
> http://groups.google.com/groups?selm=3D753365.6BE29F0E%40sonic.net
> Some of the earlier and later posts in that thread are also relevant.

I'm not sure that's *so* important.  That seems to be specifically
about having unicode in identifiers (there are obvious issues about
matching (presumably you'd want to canonicalize), and the notion of
case insensitivity is more complex in the unicode world).  My guess is
that mostly people care about unicode in strings, and I/O with files
(or sockets or whatever) which are in particular encodings.  

Of course, there's a strong overlap, especially with a lispy
language---a natural way to process XML is presumably to transform to
and from s-expressions, and to manipulate the s-expressions.  Perhaps
the right things to worry about really are identifiers?
0
usenet44 (324)
10/29/2003 1:24:49 AM
"Bradd W. Szonye" <bradd+news@szonye.com.invalid> writes:

> I've gotten the impression that some of the cool debugging and error
> reporting features are only available in DrScheme. And in general,
> I've gotten the impression that a lot of effort goes into developing
> the GUI specifically rather than into improving the suite overall.

Investing lots of effort in the GUI doesn't imply not improving the
"suite overall".


> That's a bummer for me -- they're creating stuff that I can't make
> full use of, because their "showcase" tool is incompatible with my
> work habits.

You know that you could use the GUI just to debug stuff, and when
you're not debugging just pretend it's not there.


> It's not a huge drawback, and it's obviously not stopping my from
> using PLT, but I would like to see more "hooks"

What hooks?


> (or documentation on how to use those tools outside of the GUI).

What tools exactly?  Take the arrows that you get in the GUI that show
you bindings or the arrows that show you how you arrived at this
point -- how would you do these things outside a GUI?  There's no
documentation on how to use this outside of a GUI simply because such
documentation requires an implementation, and the implementation of
these features without a GUI seems a bit like science fiction.

-- 
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                  http://www.barzilay.org/                 Maze is Life!
0
eli666 (555)
10/29/2003 1:33:53 AM
(very selective replying)

David Rush <kumo@gofree.indigo.ie> writes:

> 2) it was a pain to make fast. The notion of 'standalone
>    executable', while ostensibly supported involved a complete
>    rebuild of the PLT core

The standard meaning of a `standalone executable' never had anything
to do with a complete rebuild.


> 3) I write daemons and command-line programs and don't need GUI
>    bells and whistles; if I did, PLT would be right up
>    there. [...]

I've done this for years, and am still doing this as my heaviest
usage.  I fail to see how having bells and whistles stands in my way.


> 4) The unit system was impressive ... and intimidating. And I hated
>    all the extra punctuation I saw floating around inside of PLT's
>    naming conventions

I can definitely tell you, having gone through the nightmare of
porting tiny-clos to guile, then redoing it for mzscheme (v5x to v10x),
that the module system is one of the most amazing things I have ever
worked with.  Right now Swindle does all kinds of tricks you could not
even dream of doing while keeping your sanity (and the keyword
stuff is far from the most complex thing, btw) -- yet, it works
perfectly on the command line as well as in DrScheme.  Also, even though
Swindle is so drastically hacked, I can still use other Scheme
modules, and other Scheme modules can use Swindle modules -- and there
are no problems at all.

Units are a little harder to get a handle on, but they are not needed
for most straightforward usages.  (But given that you like functors,
you would probably want to use them, and you would probably not have a
hard time learning how to use them.)


> 6) the v200 release b0rk3d my PLT code base [...]

When it did that for Swindle, I had a similar reaction.  Forcing me
into using modules and other stuff that was incompatible sounded
really bad.  I gave it a shot, and the result was so much cleaner to
write, and so much easier to maintain that I actually enjoyed it, and
as a result I could add more stuff which I couldn't before (since the
complexity was close to getting to critical mass).


> 9) PLT is a pain to install. I'm sure that the PLT folks don't think
>    so, but but I haven't been able to get a fully-working install
>    for quite a while now. It doesn't use configure/make to build

Either you're on a different planet than I am, or you're talking about
something else.  It definitely uses configure and make to build.

But even if you don't want to use that, and you're willing to use one
of a few popular platforms, then you can now work with the cvs by a
simple:

  curl http://download.plt-scheme.org/scheme/binaries/<some-path> \
  | tar xzf -
  cd plt
  ./install -u +z


>    and it is very finicky about file locations.

Huh?


>    Given that I *usually* need to have a multi-platform environment
>    I find the lack of flexibility in PLT's installation very
>    irritating.

What lack of flexibility?

-- 
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                  http://www.barzilay.org/                 Maze is Life!
0
eli666 (555)
10/29/2003 1:59:42 AM
"Bradd W. Szonye" <bradd+news@szonye.com.invalid> wrote:
> Is that actually true? If so, I'd consider that a defect in Unicode,
> because the correct spelling of "capital esszed" is "SS." And besides,

That probably depends on your language, surely? The *character*, upcased,
is unchanged. "upcase" means look in the same position in the upper case,
after all.


0
mjr6079 (56)
10/29/2003 2:00:33 AM
On Tue, 28 Oct 2003 20:04:48 -0500, Matthias Felleisen wrote:

> Some large company located near the northwestern corner of the
> continental US has sponsored Will Clinger (Larceny) and PLT to create a
> merger of the two Scheme systems not unlike a mix of the strawman
> proposals that you have put up below.

Ohmy!

> The specific plan is as follows:
>   - Will and some others are retargeting Larceny to the intermediate
>     language of said company's virtual machine. Will has been calling
>     this project Common Larceny.
>   - Joe Marshall and some others are porting MrEd to said company's
>     toolbox. The result could be a MrEd that's almost completely in
>     Scheme. Will is certainly encouraging us to think of Scheme as a
>     systems language. Eli's arrival has strengthened this goal even more.
>   - Once we have a joint Scheme, we are hoping to retarget it to other
>     platforms.
> 
> How realistic is the plan? Producing Larceny was a two-man effort. It's
> a fast, reliable R5RS implementation with a few extra goodies. It is
> particularly well-suited for the research ideas that Will wishes to
> pursue.
> 
> 
Just this week I ran across Larceny (up till then I had thought it was yet
another fairly decent Scheme->C).  I was wrong.

Very impressive accomplishment. I believe in one of my earlier posts I
wrote a line saying I was going to post a proposal for Larceny/Twobit as
the core of a new Scheme system. Never got to it (lost my nerve with all
the flak), though I was still trying to "bend" the conversation that way.
Little did I know I was WAY behind the curve.  Stole my thunder though :(

For any Schemers who have poked around a number of Schemes (and who hasn't)
and have not yet looked at Larceny / Twobit: you should.

It took about 15 or 20 simple compatibility function definitions and I
was hosting the Twobit compiler, emitting petit-larceny Scheme->C code, in
MzScheme.  It's so pluggable: change a pass5p2 include and it spits out
Sparc assembly.  The millicode looked straightforward enough to convince a
duffer like myself into self-delusion.  "Fire up ol' nasm and whip out
millicode for i386.  No problemo.  Just follow the petit C millicode. Got
a Sparc sample to follow as well. Couple weekends..."  Like I said,
delusional for a good 3 minutes there. :)

Targeting the compiler is of course the easy part.  The darn runtime is
the crux.

Just this very morning I reactivated an account I have on a Sun 15K for
the sole purpose of playing a bit with the Larceny runtime.  Small world.


 [Group of dedicated people, whom we thank for their efforts, was here.]
> Merging the two projects is not an easy task. It won't be done quickly.
> If people really want a top-notch product, however, it may be the route
> to go. If you have time to contribute or money or you want to volunteer
> friends, please do so. The goal is to produce a good platform for the
> first Schemers and the rest of the world, too.
> 
> 

The talent of the group is intimidating.  One tricky question is how the
inner core will "open" and run the project so secondary/tertiary players
can effectively contribute without getting in the way.  Watson for the
Holmes, Salieri for the Mozart, batboys for Barry Bonds.

> -- Matthias


Ray

P.S.
- That sound you heard was me doing a couple of handsprings. 
- Was kidding about the organized religion thing.
0
ray7279 (14)
10/29/2003 2:41:13 AM
Matthias Felleisen wrote:
[...]

> Merging the two projects is not an easy task. It won't be done quickly. If
> people really want a top-notch product, however, it may be the route to go.
> If you have time to contribute or money or you want to volunteer 
> friends, please do so. The goal is to produce a good platform for the 
> first Schemers and the
> rest of the world, too.

This is very exciting news. Could you detail how one would go about
contributing? Perhaps a project home page exists somewhere? If not, maybe
one should be created (I'd volunteer but I have poor skills in that
area). I have a feeling you could get a lot of help from people who are
currently forced to target said intermediate language through more
primitive means.

-pp

0
ppinto (14)
10/29/2003 2:41:18 AM
On Tue, 28 Oct 2003 20:04:48 -0500, Matthias Felleisen wrote:


> [stuff]
> Merging the two projects is not an easy task. 
> [stuff]
> 
> -- Matthias
>

BTW, I don't suppose this new Scheme project has been named yet?
Because, ahem, I think I have a real peachy suggestion or two.

Ray 
0
ray7279 (14)
10/29/2003 2:54:42 AM
At Wed, 29 Oct 2003 00:44:49 +0100, Jens Axel Søgaard wrote:
> 
> Bruce Stephens wrote:
> > Jens Axel Søgaard <usenet@jasoegaard.dk> writes:
> >>What is missing in DrScheme?
> 
> > Bindings to Gtk/GNOME and other random useful libraries?
> 
> What are the benefits of Gtk/GNOME over the current portable (Windows,
> Unix, Macintosh) GUI already in DrScheme?

Please correct me if I'm wrong on any of these:

  1) UTF-8 and localization support
  2) OpenGL
  3) tables
  4) trees
  5) misc. compound widgets like dialogs and calendars
  6) efficiency (probably minor importance)
  7) native look&feel (important for newbies & PHB's)
  8) familiarity (many people already know the Gtk API)
  9) mindshare (lots of new work is done for Gtk)

Also, Gtk is fairly portable.  I have GUI Gauche-gtk apps that run
unmodified on both Linux and Mac.  Gtk apparently runs on Windows too
though I would never touch said OS.

-- 
Alex

0
foof (110)
10/29/2003 3:27:39 AM
"R Racine" <ray@adelphia.net> writes:

> > Merging the two projects is not an easy task. 
> 
> BTW, I don't suppose this new Scheme project has been named yet?
> Because, ahem, I think I have a real peachy suggestion or two.

Help implement part of it and we'll name it after your pet peach if
you want.

(Btw, there is a real peachy name that Matthias decided to keep under
wraps.  Fans of Law and Order can guess it pretty easily.)

Shriram
0
sk1 (223)
10/29/2003 4:43:17 AM
"Bradd W. Szonye" <bradd+news@szonye.com.invalid> writes:

> I've gotten the impression that some of the cool debugging and error
> reporting features are only available in DrScheme. 

That's partly true.  On the other hand, it's kinda' hard to draw
arrows on top of a textual interface.

Many of the language extensions, in particular, can be loaded into
MzScheme; those don't need DrScheme.  As for tools, some of the data
can be exposed as data structures with a little work.

After all, the tools always begin textually before becoming graphical.
The problem is that, if a tool is eventually going to have a graphical
interface, it's more work to provide both interfaces.  

A group of users who really cared could push for that second interface
to be documented -- especially if they did the first round of exposure
and documentation, it'd be a lot less work for the developer to
maintain...

Shriram
0
sk1 (223)
10/29/2003 4:47:35 AM
Alex Shinn <foof@synthcode.com> writes:

> > What are the benefits of Gtk/GNOME over the current portable (Windows, 
> > Unix, Macintosh) GUI already in DrScheme?
> 
> Please correct me if I'm wrong on any of these:

Well, the question was which are benefits *over* wxWindows in PLT:

>   1) UTF-8 and localization support

True, but given that DrScheme doesn't quite have this yet...

>   2) OpenGL

Being done independently by Scott Owens (and possibly others).

>   3) tables
>   4) trees

Not sure what these are exactly.  Someone with better knowledge of
both toolkits will have to compare.

>   5) misc. compound widgets like dialogs and calendars

Fair enough, but how much use are these to the average developer?  And
do they actually work on all three platforms in a consistent way,
interfacing with the native tools (eg, on Windows, will it interface
with my Outlook calendar)?

>   6) efficiency (probably minor importance)

Sure.

>   7) native look&feel (important for newbies & PHB's)

Given that DrScheme looks and quacks like a Windows app on Windows, a
Mac app on the Mac...

>   8) familiarity (many people already know the Gtk API)
>   9) mindshare (lots of new work is done for Gtk)

No dispute there.

Shriram
0
sk1 (223)
10/29/2003 4:53:24 AM
At 28 Oct 2003 23:53:24 -0500, Shriram Krishnamurthi wrote:
> 
> Alex Shinn <foof@synthcode.com> writes:
> 
> > > What are the benefits of Gtk/GNOME over the current portable (Windows, 
> > > Unix, Macintosh) GUI already in DrScheme?
> > 
> > Please correct me if I'm wrong on any of these:
> 
> Well, the question was which are benefits *over* wxWindows in PLT:

Yes, the original question was about wxWindows but I was replying to the
quote above about advantages Gtk has over the DrScheme GUI.  Gtk has
existing bindings in at least Bigloo, Gauche and Guile, so is worth
comparing.  I'm not too familiar with the DrScheme widget set so I was
trying to get a feel for whether it could serve as a serious
alternative.

> >   1) UTF-8 and localization support
> 
> True, but given that DrScheme doesn't quite have this yet...

That's a showstopper for me.  I have a (currently alpha) Gtk mail client
and Gtk web browser written in Gauche that I use for English and
Japanese text, among others.

> >   2) OpenGL
> 
> Being done independently by Scott Owens (and possibly others).

Cool!

> >   3) tables
> >   4) trees
> 
> Not sure what these are exactly.  Someone with better knowledge of
> both toolkits will have to compare.

Tables as in spreadsheet-like interfaces with editable cells.  Trees
like a file explorer with a collapsible hierarchy.

> >   7) native look&feel (important for newbies & PHB's)
> 
> Given that DrScheme looks and quacks like a Windows app on Windows, a
> Mac app on the Mac...

And, no offense, looks like a poor Tk substitute on Linux.  So DrScheme
is biased towards 2 platforms while Gtk is biased towards 1 (well, it
looks identical to the Linux Gtk on OS X, as opposed to the native
Aqua).

-- 
Alex

0
foof (110)
10/29/2003 5:29:17 AM
Grzegorz Chrupała <grzegorz@pithekos.net> wrote:
>>> For variable-name case folding, just use standard Unicode case
>>> mapping, where (char-upcase #\ß) is just #\ß and be done with it.

> Bradd W. Szonye wrote:
>> Is that actually true? If so, I'd consider that a defect in Unicode,
>> because the correct spelling of "capital esszed" is "SS." And
>> besides, case-folding is only part of the problem, because it's only
>> one example of different but equivalent spellings.

> The basic non-locale dependent case mapping of ß is ß. There is a
> SpecialCasing table which deals with characters such as ß where case
> mappings are not simple 1-1 character correspondences.

Oh, I see what you're saying now. Ignore collation order in general, and
just use the "non-localized" version of case-insensitivity -- what Unix
geeks would call the "C" locale. That makes some sense. I don't know how
non-English programmers would feel about it (but then again, most of
them are accustomed to programming in ASCII, for better or worse).

> "Doing the right thing" in the general case, in a fully locale
> sensitive way is indeed complicated, if at all possible. IMO the rules
> for identifiers should be well-defined and simple as well as
> consistent with treatment of strings on the basic level, i.e. they
> should use the general, non-locale dependent case-mappings.

Yeah, it's tough. On the one hand, it'd be nice to permit programming in
the local language. On the other hand, that's very hard to do, maybe
impossible, and it causes interoperability problems when you need to use
other people's code (in other languages).

> When dealing with data one could choose to use more refined,
> locale-dependent mappings, algorithms etc as needed.

Definitely.

By the way, I was experimenting with Unicode sources the other day. I
got to wondering how difficult it would be to use a lambda character
instead of the word lambda. There were a few surprises, some pleasant
and some unpleasant.

Bad: It took me a while to configure everything. Luckily, my favorite
monospaced font (Lucida Console) supports the Greek codepage -- it was
only one of three fonts on my system to do so. Unfortunately, it doesn't
include mathematical symbols like for-all and there-exists. I wish we
had better ISO 10646 fonts (and that font vendors did a better job of
advertising which fonts support which character sets). Currently, you
need to be an expert in the field to figure it out, and even then it's
difficult.

Good: MzScheme dealt with my lambda symbol with no tweaking whatsoever
(beyond defining it to mean the same thing as "lambda"). At first, I
thought I might need to use symbol quotes. Then, I realized that UTF-8
encoding makes that unnecessary -- all non-ASCII glyphs have the high bit
set for all bytes, and MzScheme treats all such bytes as identifier
characters. The only drawback is that you don't get case-folding for
free.
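
For the curious, the alias itself can be as small as this -- a minimal,
hedged sketch (it assumes a UTF-8 source file and an MzScheme-era
syntax-rules; the name being defined is the literal Greek lambda
character):

    ;; Alias the Greek lambda character to the standard lambda form.
    ;; Works because the reader accepts the high-bit UTF-8 bytes as
    ;; ordinary identifier characters, as described above.
    (define-syntax λ
      (syntax-rules ()
        ((_ formals body ...)
         (lambda formals body ...))))

    ;; usage:
    ((λ (x) (* x x)) 7)   ; => 49
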
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
My Usenet e-mail address is temporarily disabled.
Please visit my website to obtain an alternate address.
0
news6468 (51)
10/29/2003 6:32:51 AM
> David Rush <kumo@gofree.indigo.ie> writes:
>> 9) PLT is a pain to install. I'm sure that the PLT folks don't think
>>    so, but but I haven't been able to get a fully-working install
>>    for quite a while now. It doesn't use configure/make to build

Eli Barzilay <eli@barzilay.org> wrote:
> Either you're on a different planet than I am, or you're talking about
> something else.  It definitely uses configure and make to build ....
> What lack of flexibility?

He may have overlooked something, but I have a similar complaint in this
area. PLT Scheme ignores the local conventions for file locations:
programs in pfx/bin, docs in pfx/doc, libraries in pfx/lib, etc. Most
autoconf installers let you set the prefix and then distribute stuff in
the standard locations under it. (They'll even let you set the bin and
lib prefixes independently, for slightly non-standard installs.) With
PLT, I need to set up a bunch of symbolic links to support the standard
paths.
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
My Usenet e-mail address is temporarily disabled.
Please visit my website to obtain an alternate address.
0
news6468 (51)
10/29/2003 6:38:42 AM
Shriram Krishnamurthi <sk@cs.brown.edu> wrote:
> (Btw, there is a real peachy name that Matthias decided to keep under
> wraps.  Fans of Law and Order can guess it pretty easily.)

LtVanBuren? MrMcCoy? CptCragen?
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
My Usenet e-mail address is temporarily disabled.
Please visit my website to obtain an alternate address.
0
news6468 (51)
10/29/2003 6:47:03 AM
> At 28 Oct 2003 23:53:24 -0500, Shriram Krishnamurthi wrote:
>> Given that DrScheme looks and quacks like a Windows app on Windows, a
>> Mac app on the Mac...

Alex Shinn <foof@synthcode.com> wrote:
> And, no offense, looks like a poor Tk substitute on Linux.

Heh, yeah. The differences from Gtk are subtle but noticeable. However,
I haven't considered it a deal-breaker, because I'm used to seeing a
combination of Gtk, Qt, Athena, etc. apps on Linux. Personally, I like
Qt's interface the best, but Gtk is close.

There's something subtle missing in the Windows interface too. Mainly
stuff like buttons not being quite where I'd expect them to be, and
"common dialogs" not quite matching the native Windows versions. It's
been a while since I've played with the GUI tools, though, so I may be
misremembering, and it may have improved in 205.
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
My Usenet e-mail address is temporarily disabled.
Please visit my website to obtain an alternate address.
0
news6468 (51)
10/29/2003 6:52:04 AM
Matthias Felleisen <matthias@ccs.neu.edu> wrote:
> Some large company located near the northwestern corner of the
> continental US has sponsored Will Clinger (Larceny) and PLT to create
> a merger of the two Scheme systems not unlike a mix of the strawman
> proposals that you have put up below ....
>   - Once we have a joint Scheme, we are hoping to retarget it to other
>     platforms.

Sounds interesting, although I'll be mighty bummed if Linux support is
late and MzScheme support suffers. I originally chose PLT so that I
could develop and plan on Linux, then deploy on Windows XP.
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
My Usenet e-mail address is temporarily disabled.
Please visit my website to obtain an alternate address.
0
news6468 (51)
10/29/2003 6:54:32 AM
> "Bradd W. Szonye" <bradd+news@szonye.com.invalid> writes:
>> I've gotten the impression that some of the cool debugging and error
>> reporting features are only available in DrScheme. And in general,
>> I've gotten the impression that a lot of effort goes into developing
>> the GUI specifically rather than into improving the suite overall.

Eli Barzilay <eli@barzilay.org> wrote:
> Investing lots of efforts on the GUI doesn't imply not improving the
> "suite overall".

Sorry, didn't mean to imply otherwise. Just noting that *some* resources
go to developing DrScheme (a tool I don't use) and not the tools that I
do use.

> You know that you could use the GUI just to debug stuff, and when
> you're not debugging just pretend it's not there.

I could, although I've had great difficulty doing so in practice. I can
never figure out how to load my code and get it running. I just press
buttons and nothing interesting happens. It's definitely not the most
intuitive debugger I've used. I should probably read the manual so that
I know what I'm doing and give it a fair trial.
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
My Usenet e-mail address is temporarily disabled.
Please visit my website to obtain an alternate address.
0
news6468 (51)
10/29/2003 6:59:26 AM
Bradd W. Szonye wrote:
> Shriram Krishnamurthi <sk@cs.brown.edu> wrote:
> > (Btw, there is a real peachy name that Matthias decided to keep under
> > wraps.  Fans of Law and Order can guess it pretty easily.)
>
> LtVanBuren? MrMcCoy? CptCragen?

I'm betting on spinoff show names: "Special Victims Unit" or perhaps
"Criminal Intent"...

Anton



0
anton58 (1240)
10/29/2003 7:07:39 AM
"Bradd W. Szonye" <bradd+news@szonye.com.invalid> writes:

> He may have overlooked something, but I have a similar complaint in
> this area. PLT Scheme ignores the local conventions for file
> locations: programs in pfx/bin, docs in pfx/doc, libraries in
> pfx/lib, etc.

But you have to realize that these conventions are local to just one
platform, which means that they're completely meaningless in the plt
tree context.  This means that putting files in the right places
should be the job of a platform-specific installer.  For the linux
case, there was an rpm for a while, and I hope to get that back in,
but I don't think that it's high priority (BTW, I always liked the
single tree, but I always used it from my home dir).


> Most autoconf installers let you set the prefix and then distribute
> stuff in the standard locations under it. (They'll even let you set
> the bin and lib prefixes independently, for slightly non-standard
> installs.)

I really don't see the point of doing this.  The file division you
describe seems to mean that on a standard Linux distribution you'll be
happy if you get:

1. libraries go in /usr/lib/plt
2. documentation in /usr/share/doc/plt
3. binaries in /usr/bin
4. include files in /usr/include/plt
5. man files in /usr/man/man1

So:

1. There are very few libraries -- one for compiling extensions by
   mzc, so it should be in the plt tree.  Two others are for embedding
   it in a C application, but at that level I don't think having the
   libraries in a different place would matter much.

2. The plt documentation is really different from other packages' --
   stuff that goes in /usr/share/doc is usually readmes etc, and not
   things that users should read.  So the most I'd put there is the
   readme and the notes directory.  Other documentation should stay
   in the plt tree, where it can be updated automatically, and used
   by the web server for queries etc.

3. most of the binaries are scripts -- and these set the default for
   the PLTHOME variable so they know where to find the collections and
   other stuff.  But is there anything wrong with just using symbolic
   links in the bin directory?

4. there are a few include files (which mzc knows where to find) and a
   few man files (mostly the same as `mzscheme -h' etc).

So I don't see any reason at all to scatter files all over the place
and just make life harder afterwards, when files that are required to
run stuff are not in the place you expect them to be.  So I think that
the best approach would be a single plt directory, plus a few links to
the above stuff, making maintenance of an RPM easy (I don't even want
to do an SRPM).  If you have any reasons for this to not make sense,
or if you have any additional information on politically correct ways
of creating RPMs, mail me directly.

-- 
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                  http://www.barzilay.org/                 Maze is Life!
0
eli666 (555)
10/29/2003 7:11:04 AM
Bradd W. Szonye wrote:
> Eli Barzilay <eli@barzilay.org> wrote:
> > You know that you could use the GUI just to debug stuff, and when
> > you're not debugging just pretend it's not there.
>
> I could, although I've had great difficulty doing so in practice. I can
> never figure out how to load my code and get it running. I just press
> buttons and nothing interesting happens. It's definitely not the most
> intuitive debugger I've used.

That might be because it isn't actually a debugger... :)   Although it does
give good navigable backtraces, and some other nice touches.  I use it
exactly as Eli suggests.

Anton



0
anton58 (1240)
10/29/2003 7:16:08 AM
mis6@pitt.edu (Michele Simionato) wrote in message 
> 
> What I (as an outsider to the community) would appreciate is: 
> 
> 1. make a stricter R5RS (not very strict, but stricter than now);
> 
> 2. make more srfi (much more);
> 
> 3. make them available on every implementation.

I agree with you on these points and you are not the first to have
proposed them. I read on the Bigloo mailing list that R6RS or something
like it should be a big step towards "making Scheme more user
friendly within different Scheme implementations".

But at the same time I am also convinced that some people make a
mental mistake when approaching Scheme. Maybe I am heading in the wrong
direction, but isn't it true that a lot of Python, ... folks think that
when you use Scheme you also have to learn /all/ the different
implementations of Scheme?

I can only advise them not to think so. Most of the time I use
Bigloo and I feel like a Scheme programmer; whether I know PLT or
Chicken or Gambit is not relevant, because I actually use Scheme.
Correspondingly, Chicken programmers feel the same, ...

I have never heard of a C++ programmer refuse to say he programs in an
object-oriented style just because he has never heard of, for example,
Smalltalk.

Apropos SRFIs: from what I saw, Chicken has all the SRFIs, and I think
DrScheme does too. Bigloo has at least some of them natively integrated and
the newer Bigloos have the option to create SRFI libraries (look into
the SRFI folder of the Bigloo distribution). I am not aware of all the
other Scheme implementations.

Nobody should hesitate to use Scheme for real (outside academic)
projects. I am reaching more and more the conclusion that, for example,
the often stressed belief that Common Lisp is the industry standard
and Scheme is the academic standard is nothing more than a gag. Maybe
my programming needs are different, but I cannot see this industry
strength in Common Lisp.

Fensterbrett
0
chain_lube (440)
10/29/2003 7:28:31 AM
"Bradd W. Szonye" <bradd+news@szonye.com.invalid> writes:

> I could, although I've had great difficulty doing so in practice. I
> can never figure out how to load my code and get it running.

Well, I'm not a gui expert, but both of these seem obvious enough.


> I just press buttons and nothing interesting happens. It's
> definitely not the most intuitive debugger I've used. I should
> probably read the manual so that I know what I'm doing and give it a
> fair trial.

Yes.

-- 
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                  http://www.barzilay.org/                 Maze is Life!
0
eli666 (555)
10/29/2003 7:39:49 AM
> "Bradd W. Szonye" <bradd+news@szonye.com.invalid> writes:
>> He may have overlooked something, but I have a similar complaint in
>> this area. PLT Scheme ignores the local conventions for file
>> locations: programs in pfx/bin, docs in pfx/doc, libraries in
>> pfx/lib, etc.

Eli Barzilay <eli@barzilay.org> wrote:
> But you have to realize that these conventions are local to just one
> platform, which means that they're completely meaningless in the plt
> tree context.  This means that putting files in the right places
> should be the job of a platform-specific installer.

That's true for any software package. You don't organize the sources
that way! Some of those directories (like /bin) don't even exist in the
source tree. But the installer should organize the deliverables that
way, and autoconf (configure) makes it easy to get it right. This isn't
a huge problem -- just an annoyance and a minor bit of
"unprofessionalism."

> For the linux case, there was an rpm for a while, and I hope to get
> that back in, but I don't think that it's high priority ....

You don't need RPM to put deliverables in the standard directories. RPM
is just a script to run the actual installer and gather up what it
creates. RPM won't help much unless configure & make install do the
right thing in the first place.

>> Most autoconf installers let you set the prefix and then distribute
>> stuff in the standard locations under it. (They'll even let you set
>> the bin and lib prefixes independently, for slightly non-standard
>> installs.)

> I really don't see what's the point of doing this.  The file division
> you have seems like on a standard linux distribution you'll be happy
> if you get:
> 
> 1. libraries go in /usr/lib/plt
> 2. documentation in /usr/share/doc/plt
> 3. binaries in /usr/bin
> 4. include files in /usr/include/plt
> 5. man files in /usr/man/man1

Yes, that's the standard way to do it on a Red Hat Linux system, but not
all installations use the standard locations, so configure provides
hooks to rearrange the tree if you need to.

> So:
> 
> 1. There are very few libraries -- one for compiling extensions by
> mzc, so it should be in the plt tree.  Two others are for embedding it
> in a C application, but at that level I don't think having the
> libraries in a different place would matter much.

If you put those libraries in the standard locations, it eliminates one
step from the compile & link makefile.

> 2. The plt documentation is really different than other packages --
> stuff that goes in /usr/share/doc is usually readmes etc, and not
> things that users should read.

Not in my experience! Except for manpages, /usr/share/doc is exactly
where you put stuff like PLT's collects/doc directory.

> So the most I'd put there is the readme and the notes directory.
> Other documentations should stay in the plt tree, where they can be
> updated automatically, and used by the web server for queries etc.

Why can't PLT do that if they're in /usr/share/doc/plt? The simple
answer is that it's designed around a "one tree to rule it all"
approach, but that maps poorly to Unix systems.

> 3. most of the binaries are scripts -- and these set the default for
> PLTHOME variable to know where to find the collections and other
> stuff.  But is there anything bad with just using symbolic links in
> the bin directory?

It's one more thing I need to do manually when I install PLT. I can
manually add plt/bin to my path, but either way, it doesn't work out of
the box.

> 4. there are a few include files (which mzc knows where to find) and a
> few man files (mostly the same as `mzscheme -h' etc).

There are only a few include files for *most* libraries. That's not a
good reason to tuck them away in a non-standard directory.

> So I don't see any reason at all to scatter files all over the place
> to just make life harder afterwards when files that are required to
> run stuff are not in a place you expect them to be.

That's the problem: By putting them into a single tree rather than the
standard locations, they *aren't* where Unix developers expect them to
be. We need to manually adjust lots of paths (or create lots of
symlinks) to use this stuff, because it's not where other Unix apps
expect it to be.

> So I think tha the best approach would be a single plt directory, and
> putting a few links to the above stuff, making easy maintenace of an
> RPM (I don't even want to do an SRPM).  If you have any reasons for
> this to not make sense, or if you have any additional information on
> politically correct ways of creating RPMs, mail me directly.

RPM has nothing to do with it. RPM only does what make install tells it
to (plus some glue). RPM's scripting language could create those links,
but it's better to do it in make install. And really, it's even better
to put everything where it belongs rather than linking to it. Symlinks
are nice, but there are some gotchas, and in general they aren't as
convenient as installing in "the Unix way" to begin with.

PLT is not alone in this; for example, I think Perl installs to
/opt/perl on HP-UX systems, with the same annoyances. But on Linux
systems, the Perl installer puts everything where other tools expect to
find it, "scattered" across the directories.

It isn't really scattering, though. For most directories, the only
difference between the PLT way and the Linux standard way is that the
"type" name comes before the "package" way instead of the other way
around. For example:

    /usr/plt/include    /usr/include/plt
    /usr/plt/doc        /usr/doc/plt
    /usr/PKG/TYPE       /usr/TYPE/PKG

There are exceptions, but that's the general idea. While it may seem a
bit odd to put type before package, Unix systems do it that way because
it works better for search paths. If you use /usr/plt/include,
programmers need to explicitly put the path in their makefiles. If you
use /usr/include/plt, programmers can just write "#include <plt/foo.h>"
and go with it. Since each search path has its own "type," you can just
list all of the type directories and then use PKG/FILE to find what you
want.

Even PLT makes use of this concept, with plt/collects/PKG. Think of how
annoying it would be if a PLT collection installed itself into
plt/PKG/collects instead. You'd need to manually create links or update
the PLTCOLLECTS path, or it wouldn't work right. That's exactly the same
thing that the PLT installer does to other Unix tools.
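
For the record, that's exactly how PLT's own collection lookup works from
Scheme code -- roughly a one-liner, assuming a stock install with the
mzlib collection on the collects path:

    ;; "mzlib" is a collection name resolved along the collects search
    ;; path, just as <plt/foo.h> would be resolved along the C include
    ;; path -- the code names PKG/FILE rather than hard-coding a directory.
    (require (lib "list.ss" "mzlib"))
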
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
My Usenet e-mail address is temporarily disabled.
Please visit my website to obtain an alternate address.
0
news6468 (51)
10/29/2003 7:57:53 AM
chain_lube@hotmail.com <chain_lube@hotmail.com> wrote:
> Apropos SRFI: What I saw: Chicken has all the SRFI's, I think Dr.
> Scheme too.

PLT has a lot of them, but not nearly all of them. Some SRFIs are very
difficult to implement in PLT. For example, you can *almost* implement
SRFI-34 (exceptions) with PLT, which provides a native RAISE function.
Unfortunately, PLT's RAISE doesn't permit rethrowing inside the handler.
That doesn't make rethrowing impossible, but it *does* make SRFI-34
semantics impossible, because SRFI-34 requires the handler to run in the
same context as RAISE.

You can work around that by hiding the native RAISE and using the
portable SRFI implementation instead, but hiding built-in functions is
tricky in PLT Scheme. You can do it at the top level, but it's usually
better to use modules instead of the top level in PLT, and you can't
shadow identifiers in modules. There's a way around that too, but it's
cumbersome.
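
To see why the context matters, here's a rough sketch of the SRFI-34 style
of raise (illustrative only -- not PLT's built-in mechanism and not the
SRFI's reference implementation; what happens when a handler returns is
omitted):

    ;; The handler runs in the dynamic context of the raise, with the
    ;; next-outer handler current, so the handler itself can re-raise.
    (define handler-stack
      (list (lambda (obj) (error "unhandled exception:" obj))))

    (define (with-exception-handler handler thunk)
      (dynamic-wind
        (lambda () (set! handler-stack (cons handler handler-stack)))
        thunk
        (lambda () (set! handler-stack (cdr handler-stack)))))

    (define (raise obj)
      (let ((handler (car handler-stack)))
        (dynamic-wind
          (lambda () (set! handler-stack (cdr handler-stack)))
          (lambda () (handler obj))          ; runs "inside" the raise
          (lambda () (set! handler-stack (cons handler handler-stack))))))

    ;; usage: the inner handler re-raises anything it doesn't expect,
    ;; and the outer handler sees it.
    (with-exception-handler
      (lambda (c) (display "outer saw: ") (display c) (newline))
      (lambda ()
        (with-exception-handler
          (lambda (c) (if (eq? c 'expected) 'ok (raise c)))
          (lambda () (raise 'surprise)))))
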
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
My Usenet e-mail address is temporarily disabled.
Please visit my website to obtain an alternate address.
0
news6468 (51)
10/29/2003 8:04:16 AM
"Scott G. Miller" <scgmille@freenetproject.org> wrote in message news:<a8OdnQW-9YBwBQOiRVn-ug@giganews.com>...
> Grzegorz Chrupala wrote:
 
> > For me, a major gap is Unicode and multibyte character support. This
> > is by now standard in implementations of most other widely used
> > programming languages but surprisingly few Schemes have it.
> 
> There is a reason for that.  The R5RS character operators cannot be made 
> to work reliably with unicode characters.  SISC for example supports 
> unicode characters and arbitrary character maps, but makes no effort to 
> contort the standard operators to behave properly.  There was a usenet 
> discussion about this in the past which you could probably find by googling.
> 

Frankly, I don't see anything in R5RS that would prevent Unicode
support.
However if there is indeed some incompatibility, then probably the
following statement from the Scheme FAQs should be updated:

   Are there implementations that support unicode?

    There is nothing in the Scheme standard that conflicts 
    with supporting unicode, however such support is not 
    required. There are some Scheme implementations 
    that handle unicode characters, but most don't. Also, 
    SRFI-13 and SRFI-14 propose string and character 
    processing libraries that are unicode compliant.

--
Grzegorz
0
grzegorz1 (80)
10/29/2003 8:12:31 AM
Bradd W. Szonye wrote:

> There's something subtle missing in the Windows interface too. Mainly
> stuff like buttons not being quite where I'd expect them to be, and
> "common dialogs" not quite matching the native Windows versions. 

But the developer decides which buttons go where and
which menu items go where?

If you experience non-standard placement in e.g. DrScheme,
then file a bug report.

-- 
Jens Axel Søgaard


0
usenet153 (246)
10/29/2003 9:41:26 AM
"Bradd W. Szonye" <bradd+news@szonye.com.invalid> writes:

> Is that actually true? If so, I'd consider that a defect in Unicode,
> because the correct spelling of "capital esszed" is "SS." And besides,
> case-folding is only part of the problem, because it's only one
> example of different but equivalent spellings.

It is not a defect in Unicode. Just like ASCII, ISO-8859-*, etc., Unicode
in itself does not contain any information about character case, but
considers 's', 'S', 'ß', "SS" etc. to be separate characters. Case is up
to the software; the compiler in this case.

Me, I see this as an excellent argument for case sensitivity in
identifiers, but you can't have everything...

-- 
Björn Lindström <bkhl@elektrubadur.se>
http://bkhl.elektrubadur.se/
0
bkhl (10)
10/29/2003 11:34:24 AM
Matthias Felleisen <matthias@ccs.neu.edu> wrote in message news:<bnn3im$q7p$1@camelot.ccs.neu.edu>...

> Some large company located near the northwestern corner of the continental US
> has sponsored Will Clinger (Larceny) and PLT to create a merger of the two 
> Scheme systems ...

> If you have time to contribute or money or you want to volunteer friends, please 
> do so. The goal is to produce a good platform for the first Schemers and the
> rest of the world, too.

Will it be free once it is done?  If not, since this big company sounds 
like one of those that are richer than God, how about people forget about the 
volunteering and get paid to do it?
0
10/29/2003 11:48:37 AM
>
>    Are there implementations that support unicode?
> 

Chris Hanson's latest release of MIT Scheme supports UTF-16.  The extent
of integration "all the way down", I don't know.  But what he does, he
does well and thoroughly.  It is another point of reference for those with
an interest in providing Scheme support beyond basic UTF-8.

Ray
0
ray7279 (14)
10/29/2003 12:22:43 PM
"Bradd W. Szonye" <bradd+news@szonye.com.invalid> wrote in message news:<slrnbpunoj.kij.bradd+news@szonye.com>...
> By the way, I was experimenting with Unicode sources the other day. I
> got to wondering how difficult it would be to use a lambda character
> instead of the word lambda. There were a few surprises, some pleasant
> and some unpleasant.

The Japanese character set has long included some mathematical
symbols and the Greek alphabet, so that's the kind of thing
every Japanese Scheme programmer has tried at least once :-)
Gauche ships with a joke script that replaces some Scheme
syntax and procedures with Japanese or mathematical symbols,
including a Greek lambda.

There are indeed some "Japanese" programming languages as well.
Besides the interoperability issue, the reason such languages
are less convenient to use in production than traditional ones
is that the current keyboard UI is pretty much optimized for
ASCII or ISO8859 characters.  Typical Japanese input methods
are not very convenient for switching frequently back and forth
between Japanese text input and ASCII/symbol input.
For educational purposes (like teaching programming to kids),
I do see a benefit to programming in local languages, though.
0
shiro (31)
10/29/2003 12:50:46 PM
Alex Shinn <foof@synthcode.com> writes:

[...]

> Gtk has existing bindings in at least Bigloo, Gauche and Guile, so
> is worth comparing.

As far as I can tell, there's no working binding for the current
version of bigloo (the bigloo-lib project
<http://sourceforge.net/projects/bigloo-lib/> shows significant signs
of being dead: last release almost a year ago, 0% activity, an old open
bug reporting it to be unbuildable).  Guile seems to have two, both
fairly undocumented; the preferred one is presumably the gobject one,
which looks like a technically nice approach, but seems rather slow at
present.

[...]

0
usenet44 (324)
10/29/2003 1:36:49 PM
"Bradd W. Szonye" <bradd+news@szonye.com.invalid> writes:

> Sounds interesting, although I'll be mighty bummed if Linux support is
> late and MzScheme support suffers. 

MzScheme support won't suffer at all.  Preserving the current
cross-platform nature is a top priority.  All Matthias is saying is that
*new goodies* may get unveiled a platform at a time.  This is no
different from our present situation, where, for instance, MysterX
(the ActiveX interface) works only on Windows, but its creation was
not to the detriment of the cross-platform effort.

>                                     I originally chose PLT so that I
> could develop and plan on Linux, then deploy on Windows XP.

You should be pleased, then -- you'll spend time making it look fast
on Linux, then it'll run like a bat-out-of-hell on XP! (-:  (Or, um,
it'll end up running at the same speed.)

Shriram
0
sk1 (223)
10/29/2003 1:41:07 PM
"Anton van Straaten" <anton@appsolutions.com> writes:

> I'm betting on spinoff show names: "Special Victims Unit" or perhaps

That would be MzScheme's will collector...

> "Criminal Intent"...

....and the work stealer.

LtShriram
0
sk1 (223)
10/29/2003 1:43:18 PM
Jens Axel Søgaard <usenet@jasoegaard.dk> wrote:

> > Speed?
>
>I wanted to hear what mzscheme misses compared to Perl/Python
>and as far as I can tell, mzscheme has no problem in the speed
>department.

IME, mzscheme does just fine against Perl/Python.  The problem is that Perl
and Python are slow.  Sure, they tend to be fast enough for what they're used
for, but their performance limits are still much smaller than those of a
compiled language.

>[The existence of the *very* fast Scheme compilers does not imply
>that mzscheme is slow]

It does when you're looking for maximal performance out of given hardware.
Anything interpreted is slow relative to most compiled things.

Actually, looking at my inefficient Fibonacci benchmark, mzscheme does *quite*
well against Perl.  10x faster.  mzc module --prim does even better, getting
to within 4x of C.  (Perl is 100x slower than C.)  I haven't gotten Chez
Scheme to go that fast.  (It's a lot faster by default though.)  CMUCL can get
within 2x of C, and ocaml matches C.

How much Fibonacci generalizes to anything, I don't know; it's mostly testing
recursive calls and numerics.
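
The benchmark in question is presumably the usual doubly recursive version
-- a sketch of its shape, not necessarily the exact code that was timed:

    ;; Exponential-time naive fib: mostly exercises procedure calls
    ;; and small-integer arithmetic.
    (define (fib n)
      (if (< n 2)
          n
          (+ (fib (- n 1)) (fib (- n 2)))))

    (fib 30)   ; => 832040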

-xx- Damien X-) 
0
dasulliv (21)
10/29/2003 2:45:34 PM
Eli Barzilay <eli@barzilay.org> writes:

> should be the job of a platform-specific installer.  For the linux
> case, there was an rpm for a while, and I hope to get that back in,
> but I don't think that it's high priority (BTW, I always liked the
> single tree, but I always used it from my home dir).

I always like to install under a single tree: then it's particularly
easy to get rid of it later without having to hunt for bits
and pieces.
By the way, I was surprised at how the whole PLT bunch built and
installed and worked properly out of the box on a development
version of OpenBSD 3.4 in /usr/local/plt. I didn't see any
advertising promising OpenBSD compatibility anywhere, but it
worked just like that.


0
toni1 (16)
10/29/2003 3:02:56 PM
"Bradd W. Szonye" <bradd+news@szonye.com.invalid> writes:

> [...]

[I'll take this off to email.]

-- 
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                  http://www.barzilay.org/                 Maze is Life!
0
eli666 (555)
10/29/2003 6:16:43 PM
MJ Ray <mjr@dsl.pipex.com> writes:

> "Bradd W. Szonye" <bradd+news@szonye.com.invalid> wrote:
>> Is that actually true? If so, I'd consider that a defect in Unicode,
>> because the correct spelling of "capital esszed" is "SS." And besides,
>
> That probably depends on your language, surely? The *character* upcased
> is no change. "upcase" means look in the same position in the upper case,
> after all.

Not in Unicode.  Upcase a `LATIN SMALL LETTER SHARP S' and you get
*two* `LATIN CAPITAL LETTER S'.

0
jrm (1310)
10/29/2003 7:23:46 PM
Joe Marshall <jrm@ccs.neu.edu> wrote:
> MJ Ray <mjr@dsl.pipex.com> writes:
> 
>> "Bradd W. Szonye" <bradd+news@szonye.com.invalid> wrote:
>>> Is that actually true? If so, I'd consider that a defect in Unicode,
>>> because the correct spelling of "capital esszed" is "SS." And besides,
>>
>> That probably depends on your language, surely? The *character* upcased
>> is no change. "upcase" means look in the same position in the upper case,
>> after all.
> 
> Not in Unicode.  Upcase a `LATIN SMALL LETTER SHARP S' and you get
> *two* `LATIN CAPITAL LETTER S'.

That depends on the locale. There's a "locale-neutral" implementation
where upcasing esszed doesn't actually change anything.
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
My Usenet e-mail address is temporarily disabled.
Please visit my website to obtain an alternate address.
0
news6468 (51)
10/29/2003 8:10:07 PM
Bradd W. Szonye wrote:
> PLT has a lot of them, but not nearly all of them. Some SRFIs are very
> difficult to implement in PLT. For example, you can *almost* implement
> SRFI-34 (exceptions) with PLT, which provides a native RAISE function.
> Unfortunately, PLT's RAISE doesn't permit rethrowing inside the handler.
> That doesn't make rethrowing impossible, but it *does* make SRFI-34
> semantics impossible, because SRFI-34 requires the handler to run in the
> same context as RAISE.
> 
> You can work around that by hiding the native RAISE and using the
> portable SRFI implementation instead, but hiding built-in functions is
> tricky in PLT Scheme.

If you can work around the issue, how is it impossible to implement SRFI 34 
semantics?

> You can do it at the top level, but it's usually
> better to use modules instead of the top level in PLT, and you can't
> shadow identifiers in modules. There's a way around that too, but it's
> cumbersome.

I don't get this.  The PLT raise is different from the SRFI-34 raise.  The
module system just allows you to say what you mean and mean what you say,
i.e. you don't confuse the two raises.  I don't think it's cumbersome at all.
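
A hedged sketch of what that looks like in practice -- the module path
"my-srfi-34.ss" is hypothetical, standing in for whatever portable SRFI-34
implementation you drop in:

    ;; Keep the two raises distinct by renaming on import.
    (module client mzscheme
      (require (rename "my-srfi-34.ss" srfi:raise raise))
      ;; plain `raise' is still PLT's; `srfi:raise' is SRFI-34's
      (define (fail) (srfi:raise 'oops)))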

-d

0
dvanhorn1 (74)
10/29/2003 9:01:11 PM
Matthias Felleisen wrote:

> Some large company located near the northwestern corner of the continental
> US 

I was just wondering, is there some unwritten rule that forbids mentioning
the word "Microsoft" in usenet posts? I was unable to find anything in the
FAQs. Care to enlighten a usenet newbie?
-- 
Grzegorz
http://pithekos.net
0
grzegorz1 (80)
10/29/2003 9:30:19 PM
Pedro Pinto wrote:


> These are very exciting news. Could you detail how one would go about 
> contributing? Perhaps a project home page exists somewhere? If not maybe 
>   one should be created (I'de volunteer but I have poor skills in that 
> area). I have a feeling you could get a lot of help from people who are 
> currently forced to target said intermediate language through more 
> primitive means.

I will bring up the idea of outside contributors over the next week
with Will and Joe and the rest of the re-targeting team.

To all others: Microsoft will not retain any rights and the results, if
any, will be distributed just like all other PLT software. Note the caveat.

As Shriram has already pointed out, we see .Net for now as just one more
platform, though it is the platform on which we're trying to merge. If
someone had raised the funds to merge Larceny and PLT Scheme on some other
platform, we would have targeted that instead, too.

Keep in mind that this is a large project with a high potential for failure.

-- Matthias

0
10/29/2003 10:29:19 PM
Grzegorz Chrupała <grzegorz@pithekos.net> writes:

> I was just wondering, is there some unwritten rule that forbids mentioning
> the word "Microsoft" in usenet posts? I was unable to find anything in the
> FAQs. Care to enlighten a usenet newbie?

Given the flamewars that it tends to engender (if you haven't already
figured this part out, you will soon), one tends to step softly around
the mention of The Corporation.

Besides, given that the N*S*A monitors Usenet to look out for
subversive activity, and nothing could be more subversive to the US
than staunching the Freedom to Innovate...

Shriram
0
sk1 (223)
10/29/2003 10:48:58 PM
Matthias Felleisen <matthias@ccs.neu.edu> writes:

> As Shriram has already pointed out, we see .Net for now as just one more
> platform, though it is the platform on which we're trying to merge. If
> someone had raised the funds to merge Larceny and PLT Scheme on some
> other platform, we would have targeted that instead, too.

Yeah -- any Linux Innovation Grants out there that will pay for a
programmer or two?

Shriram
0
sk1 (223)
10/29/2003 10:49:57 PM
Grzegorz Chrupała wrote:
> 
> Bradd W. Szonye wrote:
> 
> > Grzegorz Chrupała <grzegorz@pithekos.net> wrote:
> >> For variable-name case folding, just use standard Unicode case
> >> mapping, where (char-upcase #\ß) is just #\ß and be done with it.
> >
> > Is that actually true? If so, I'd consider that a defect in Unicode,
> > because the correct spelling of "capital esszed" is "SS." And besides,
> > case-folding is only part of the problem, because it's only one example
> > of different but equivalent spellings.
> 
> The basic non-locale dependent case mapping of ß is ß. There is a
> SpecialCasing table which deals with characters such as ß where case
> mappings are not simple 1-1 character correspndences.
> (http://www.unicode.org/Public/UNIDATA/)

The Unicode consortium has provided ways to handle the issue of
case-insensitive identifiers in code since I wrote that article.
The 1-to-1 mappings for specialcased characters you're referring
to are a part of that.  It provides *a* way to provide something
you can call "case insensitivity" within the boundaries of the
Unicode standard, but it will still annoy those who speak and
write languages whose rules it violates.

> > Unicode support for data is fairly tricky on its own. Many languages
> > choose not to complicate things by applying the data rules to code. For
> > example, C++ permits a wide variety of Unicode characters in data and in
> > code, but it does not attempt locale-dependent equivalence for code --
> > every different spelling is a different identifier.
> >
> > However, Schemers like it when the same rules apply to code and data
> > both. Also, programmers in any case-insensitive language like it when
> > identifiers "do the right thing" in non-English languages. That's why
> > any discussion of extended character sets is likely to stray into a
> > discussion of identifier equivalence.

Scheme is a LISP.  Our code *is* data.  Whatever rules we have for
handling data must also, sooner or later, apply to code, and vice
versa.  Any imbalances mean mismatches and bugs that we have to
code around until they are fixed.

Anyway, the scheme system I've got (which is now up to most of R4RS,
but still no macros and a lot of missing simple functions) still
disallows the direct inclusion of characters in symbols and
variablenames that cause ambiguity when upcased or downcased in
strings. If you want them, you have to escape them in, and the
escape codes are not case-insensitive.

However, I've overcome several problems in strings.  My strings are
represented as trees that have primitive-strings (limited to ~240
bytes plus overhead) at the leafnodes.  The primitive-strings in
turn can be any of several different formats for handling characters
of different widths.  I was initially mapping everything into UTF-32
codepoints but having arbitrary-width characters turns out to save
space usually and it means I can solve other problems.  In particular,
I can now handle accented and combined characters much better.  I'm
using an open-ended character-set that allows any unicode base
character plus any number of combining characters to map onto a
unique integer.  The integer you get back from calling char->integer
on it may be, um, kinda large, since it's essentially a base-(2^32)
number as many "digits" long as the number of code points, but I
had bignum libs lying around anyway and in practice such heroic
characters are quite rare.

What this means is that my string-indexes actually count characters,
(accents and all) not codepoints.  And this drastically simplifies
the implementation and semantics of most string primitives and
drastically reduces the likelihood of a string growing longer or
shorter (in characters) as a result of being upcased or downcased.
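
Concretely, the packing amounts to something like this (an illustrative
sketch only, not the actual implementation):

    ;; One "character" = base code point plus combining marks, packed as
    ;; big-endian digits in base 2^32; needs bignums past one codepoint.
    (define (codepoints->char-int base-cp combining-cps)
      (let loop ((n base-cp) (cps combining-cps))
        (if (null? cps)
            n
            (loop (+ (* n (expt 2 32)) (car cps)) (cdr cps)))))

    ;; e.g. LATIN SMALL LETTER E (#x65) + COMBINING ACUTE ACCENT (#x301):
    ;; (codepoints->char-int #x65 '(#x301)) => (+ (* #x65 (expt 2 32)) #x301)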

> As I see it, it is enough if the core language provides core
> Unicode-compatible functionality including the way to read and write UTF-8
> and UTF-16 encoded text, distinguish characters and bytes, get the length
> of a string in characters and bytes, provide standard Unicode
> case-mappings, sorting, and Unicode aware standard character predicates
> such as char-whitespace? etc. Anything beyond that can be more or less
> easily added in libraries or defined by the user as needed.

As I'm treating it, the encoding (so far ascii, UTF-8, UTF-16, UTF-32,
(binary 1), (binary 8), (binary 16), or (binary 32) - I will probably
add more character encodings and a fast path for (binary 64) at some
point but haven't yet...) is a property of the input port or output
port that you specify when you open the port. From some types of ports
you read/write characters or <datum>s; from other types of ports you
read/write fixed-width units of binary data, which you get as exact
integers in the range 0-1, 0-255, 0-65535, etc.

Sigh... but my close-port functions aren't working and I can't figure
out why, and I still need to get some kind of

  (Decode-IEEE-Float32 int) => <inexact real>

etc, conversions working to make more sense of binary formats read from
ports.



				Bear

	(the scheme compiler may or may not ever be a released
         piece of software.  Mostly it's me doing katas for my
         code-fu.)
0
bear (1219)
10/30/2003 1:25:02 AM
Joe Marshall <jrm@ccs.neu.edu> wrote:
> Not in Unicode.  Upcase a `LATIN SMALL LETTER SHARP S' and you get
> *two* `LATIN CAPTIAL LETTER S'.

So was the PP wrong about the behaviour of char-upcase, or just confusing?

Regardless, I still find it hard to believe that anyone has ever had
a real physical typecase containing all unicode symbols...


0
mjr6079 (56)
10/30/2003 2:28:43 AM
> Bradd W. Szonye wrote:
>> PLT has a lot of them, but not nearly all of them. Some SRFIs are
>> very difficult to implement in PLT. For example, you can *almost*
>> implement SRFI-34 (exceptions) with PLT, which provides a native
>> RAISE function. Unfortunately, PLT's RAISE doesn't permit rethrowing
>> inside the handler. That doesn't make rethrowing impossible, but it
>> *does* make SRFI-34 semantics impossible, because SRFI-34 requires
>> the handler to run in the same context as RAISE.
>> 
>> You can work around that by hiding the native RAISE and using the
>> portable SRFI implementation instead, but hiding built-in functions
>> is tricky in PLT Scheme.

David Van Horn <dvanhorn@cs.uvm.edu> wrote:
> If you can work around the issue, how is it impossible to implement
> SRFI 34 semantics?

It isn't impossible to implement them in PLT, but it is impossible to
implement them using the native RAISE. It's also difficult (but not
impossible) to override built-in functions in a user-friendly way.

Since the native RAISE is not suitable for SRFI-34 semantics, you'll end
up with two parallel exception-handling systems. The bigger problem is
that the native I/O functions use native RAISE to throw exceptions, not
SRFI-34 RAISE. That means that you cannot fully implement both SRFI-34
and SRFI-36, unless you also create replacements for all of those I/O
functions. By the time you're done, you'll end up duplicating a large
chunk of the library.

>> You can do it at the top level, but it's usually better to use
>> modules instead of the top level in PLT, and you can't shadow
>> identifiers in modules. There's a way around that too, but it's
>> cumbersome.

> I don't get this.  The PLT raise is different from SRFI-34 raise.  The
> module system just allows you say what you mean and mean what you say,
> ie you don't confuse the two raises.  I don't think it's cumbersome at
> all.

It requires a significant amount of extra scaffolding, plus there are
the problems with SRFI-36 I/O conditions (since they use PLT RAISE, not
SRFI-34 raise). Surely you'll agree that that is cumbersome?

By the way, what's up with everybody getting defensive when I write that
something is difficult to do with PLT? I like PLT, and I'm even thinking
about ways to make this stuff easier. But it's silly to get defensive
and insist that these difficulties don't exist.
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
My Usenet e-mail address is temporarily disabled.
Please visit my website to obtain an alternate address.
0
news6468 (51)
10/30/2003 6:13:28 AM
Ray Dillinger <bear@sonic.net> wrote:
> 	(the scheme compiler may or may not ever be a released
>          piece of software.  Mostly it's me doing katas for my
>          code-fu.)

Heh, I know exactly what you mean!
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
My Usenet e-mail address is temporarily disabled.
Please visit my website to obtain an alternate address.
0
news6468 (51)
10/30/2003 6:17:23 AM
MJ Ray <mjr@dsl.pipex.com> wrote:
> Regardless, I still find it hard to believe that anyone has ever had a
> real physical typecase containing all unicode symbols...

I haven't seen one with the whole set of codes, and you probably won't
see one in the future, because (for example) the typographical
conventions for Greek and Japanese characters are very different, so it
doesn't make much sense to put them both in the same typeface.

However, some typefaces do contain a *lot* of symbols. For example,
Lucida Console provides glyphs for just about all of the Western
languages, IIRC, and a few miscellaneous symbols. Unfortunately, it
doesn't have a full set of mathematical operators, so you can write a
lambda character but not a there-exists character.
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
My Usenet e-mail address is temporarily disabled.
Please visit my website to obtain an alternate address.
0
news6468 (51)
10/30/2003 6:30:29 AM
Ray Dillinger <bear@sonic.net> wrote in message news:<3FA0694B.3C4EF1F5@sonic.net>...
> 
> The 1-to-1 mappings for specialcased characters you're referring
> to are a part of that.  It provides *a* way to provide something
> you can call "case insensitivity" within the boundaries of the
> Unicode standard, but it will still annoy those who speak and
> write languages whose rules it violates.

Scheme users as well as users of most other PLs can never hope to
avoid the annoyance associated with the English bias in computing.
Even if you provide a perfect means to deal with case, sorting, etc in
a locale sensitive way, there still remains the fact that you have to
use English words for most identifiers in your program, since that is
what the r5rs and every implementation I know of define. Programming
in one's native language is a luxury only English speakers and maybe
children are afforded (see Shiro Kawai's post).

> Scheme is a LISP.  Our code *is* data.  Whatever rules we have for
> handling data must also, sooner or later, apply to code, and vice
> versa.  Any imbalances mean mismatches and bugs that we have to
> code around until they are fixed.


In this context it may be interesting to have a look at CLisp which:
- Is an implementation of a case-insensitive language
- Is an implementation of Common Lisp (re code is data)
- Has had native Unicode and localization support for a long time
- Is designed and implemented by non-English speakers 

> 
> Anyway, the scheme system I've got ...[snip description of Unicode support]

It may be that people tend to worry too much, in advance and in the
abstract, about the pitfalls of multilingual support instead of giving
it a go and seeing what the actual issues are.
It's good to hear that some just go ahead with it.

--
Grzegorz
0
grzegorz1 (80)
10/30/2003 9:52:14 AM
On 28 Oct 2003 20:59:42 -0500, Eli Barzilay <eli@barzilay.org> wrote:
> David Rush <kumo@gofree.indigo.ie> writes:
>
>> 2) it was a pain to make fast. The notion of 'standalone
>>    executable', while ostensibly supported involved a complete
>>    rebuild of the PLT core
>
> The standard meaning of a `standalone executable' never had anything to
> do with a complete rebuild.

ISTR, that if you didn't want to just glue your bytecode on to the
pre-existing PLT binary, but you wanted an honest-to-god native binary
that went all the way down, you did yes. I've never really considered
the glued-bytecode version to be a standalone executable - just a
cleverly packaged VM+bytecodes. I suppose that reflects my own
terminological difficulties.

>> 3) I write daemons and command-line programs and don't need GUI
>>    bells and whistles; if I did, PLT would be right up
>>    there. [...]
>
> I've done this for years, and am still doing this as my heaviest
> usage.  I fail to see how having bells and whistles stands in my way.

It doesn't. However, having them when they're not needed does not
motivate me to use the system.

>> 6) the v200 release b0rk3d my PLT code base [...]
>
> When it did that for Swindle, I had a similar reaction.  Forcing me
> into using modules and other stuff that was incompatible sounded
> really bad.  I gave it a shot, and the result was so much cleaner to
> write,

At the time it would have meant retargeting my code-generator and
libraries to v200 so I had another meta-layer to work through. That
said targeting v200 *is* on my to-do list - just not at a very high
priority (especially since I have Gambit working on all platforms)

david rush
-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
0
kumo7543 (108)
10/30/2003 12:41:52 PM
> "Bradd W. Szonye" <bradd+news@szonye.com.invalid> writes:

in another thread an excellent exposition of my annoyances w/the PLT
installation scheme. But I have one or two more things to add here.

On 29 Oct 2003 02:11:04 -0500, Eli Barzilay <eli@barzilay.org> wrote:
> But you have to realize that these conventions are local to just one
> platform, which means that they're completely meaningless in the plt
> tree context.  This means that putting files in the right places
> should be the job of a platform-specific installer.

Well, I regularly target mutually incompatible versions of Linux as
well as a couple of versions of Solaris, and HP/UX. For various reasons
I end up installing a lot of stuff in either /usr/local/$OSPATH or
$HOME/opt/$OSPATH where $OSPATH is set via values returned from uname.
Making PLT fit into this is tortuous, to say the least. I would like it
very much if

	./configure --prefix=$HOME/opt/$OSPATH
	make
	make install

just did the right thing, even at the cost of disk wastage. It doesn't,
or at least it didn't last time I tried (v203 or so). Most everything
else (even Bigloo which uses a non-standard configure ... grr) just does
the Right Thing.

david rush
-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
0
kumo7543 (108)
10/30/2003 12:53:19 PM
On Wed, 29 Oct 2003 01:00:15 +0100, Jens Axel Søgaard
<usenet@jasoegaard.dk> wrote:
> David Rush wrote:
>> On Tue, 28 Oct 2003 02:05:01 GMT, R Racine <ray@adelphia.net> wrote:
>>>> On Tue, 28 Oct 2003 01:27:42 +0100, Jens Axel Søgaard wrote:
>>>> What is missing in DrScheme?
>
>> 5) MrSpidey can't handle big enough programs - and I *really* wish it 
>> did. In fact, if MrSpidey could handle 15KLOC+ programs I would probably
>> start to make the effort to move back to PLT for pre-production
>> development.
>
> I have actually never tried MrSpidey - but you can't seriously
> list that as a reason, since the competing languages don't have
> similar tools.

But PLT has *NO OTHER DEBUGGING TOOLS*. Or at least it didn't last time
I used it. I will grant you that it has some of the best error messages
on the planet, but sometimes you just need to be able to inspect a
stack trace...

david rush
-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
0
kumo7543 (108)
10/30/2003 12:56:25 PM
Björn Lindström wrote:

> "Bradd W. Szonye" <bradd+news@szonye.com.invalid> writes:
> 
>> Is that actually true? If so, I'd consider that a defect in Unicode,
>> because the correct spelling of "capital esszed" is "SS." And besides,
>> case-folding is only part of the problem, because it's only one
>> example of different but equivalent spellings.
> 
> It is not a defect in Unicode. Just like ASCII, ISO-8859-* etc. Unicode
> in itself does not contain any information about character case, but
> considers 's', 'S', 'ß', "SS" etc. to be separate characters. Case is up
> to the software; the compiler in this case.
> 
> Me, I see this as an excellent argument for case sensitivity in
> identifiers, but you can't have everything...
> 

Please don't spread misinformation. Unicode *DOES* define case mappings.
Quoting from <http://www.unicode.org/Public/UNIDATA/UCD.html#Case_Mappings>

> Case Mappings
>
> There are a number of complications to case mappings that occur once the
> repertoire of characters is expanded beyond ASCII. For more information,
> see Chapter 3 in Unicode 4.0.
>
> For compatibility with existing parsers, UnicodeData.txt only contains
> case mappings for characters where they are one-to-one mappings; it also
> omits information about context-sensitive case mappings. Information about
> these special cases can be found in a separate data
> file, SpecialCasing.txt.

It's really exasperating to see people argue without first checking the basic
facts.

-- 
Grzegorz
http://pithekos.net
0
grzegorz1 (80)
10/30/2003 1:11:15 PM
On 28 Oct 2003 23:43:17 -0500, Shriram Krishnamurthi <sk@cs.brown.edu> 
wrote:
> (Btw, there is a real peachy name that Matthias decided to keep under
> wraps.  Fans of Law and Order can guess it pretty easily.)

Can it be anything other than 'Grand Larceny'? I mean really...

david rush
-- 
BTW, the only way this could be better news would be if Will were finally
targeting x86 *native* instead of via C
0
kumo7543 (108)
10/30/2003 2:01:03 PM
Bradd W. Szonye wrote:
> It isn't impossible to implement them in PLT, but it is impossible to
> implement them using the native RAISE. It's also difficult (but not
> impossible) to override built-in functions in a user-friendly way.

Well, SRFI-34 specifies RAISE.  It's not surprising that this is different
from what PLT refers to as RAISE.

SRFI-34 isn't the first SRFI that has specified a value that is named the same 
as a different value in PLT (cf. SRFI 1).  But that's one of the reasons for 
having a module system in the first place.  These name clashes are really 
trivial issues.  If SRFI-34 is implemented as a module then the user of the 
module can either rename the provided raise to something else, say 
srfi34:raise, or a new language can be constructed that uses (SRFI-34) raise 
as primitive.
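
For instance, a module that wants both raises side by side only needs the one
rename line (a sketch, assuming the SRFI-34 reference implementation is
installed as (lib "34.ss" "srfi") and exports raise):

(module uses-both-raises mzscheme
  ;; import SRFI-34's raise under a non-clashing name; mzscheme's raise
  ;; stays untouched
  (require (rename (lib "34.ss" "srfi") srfi34:raise raise))
  (define (signal-srfi34) (srfi34:raise 'oops))  ; SRFI-34 raise
  (define (signal-plt)    (raise 'oops))         ; PLT raise
  (provide signal-srfi34 signal-plt))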

> Since the native RAISE is not suitable for SRFI-34 semantics, you'll end
> up with two parallel exception-handling systems.  The bigger problem is...

This is no problem, thanks to the module system.

> that the native I/O functions use native RAISE to throw exceptions, not
> SRFI-34 RAISE. That means that you cannot fully implement both SRFI-34
> and SRFI-36, unless you also create replacements for all of those I/O
> functions. By the time you're done, you'll end up duplicating a large
> chunk of the library.

Nothing has to be duplicated.  If you want IO procedures to (SRFI-34) raise 
(SRFI-36) conditions you can just wrap the PLT primitives to catch PLT 
exceptions and (SRFI-34) raise (SRFI-36) conditions.
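
For instance, a wrapper around open-input-file might look something like this
(only a sketch; srfi34:raise and make-io-condition are stand-ins for the
SRFI-34 raise and a SRFI-35/36 condition constructor, not actual PLT bindings):

(define (checked-open-input-file path)
  ;; catch the PLT exception and re-signal it through the SRFI-34 mechanism
  (with-handlers ((exn? (lambda (e)
                          (srfi34:raise (make-io-condition path (exn-message e))))))
    (open-input-file path)))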

I wrote:
>>I don't get this.  The PLT raise is different from SRFI-34 raise.  The
>>module system just allows you to say what you mean and mean what you say,
>>ie you don't confuse the two raises.  I don't think it's cumbersome at
>>all.
> 
> It requires a significant amount of extra scaffolding, plus there are
> the problems with SRFI-36 I/O conditions (since they use PLT RAISE, not
> SRFI-34 raise). Surely you'll agree that that is cumbersome?

I don't agree that this is cumbersome.  And I don't see this "significant 
amount of extra scaffolding".  In either of the cases I mentioned above you're 
talking about, what, 1 line to specify what you mean when you say "raise"?

Maybe your issue is that defining a new language is not as convenient as you'd 
like in PLT.  There had been some discussion about allowing a module's 
language to be specified by an arbitrary require spec.  This would allow, for 
example:

(module x (all-from-except mzscheme raise)
   (require (lib "34.ss" "srfi"))
   ...)

But I don't recall what the conclusion on this was.  Regardless, it's really 
sugar.  The module system provides all the machinery needed to have SRFI-34 
raise coexist peacefully with PLT raise.

> By the way, what's up with everybody getting defensive when I write that
> something is difficult to do with PLT? I like PLT, and I'm even thinking
> about ways to make this stuff easier. But it's silly to get defensive
> and insist that these difficulties don't exist.

I can't speak for anyone else.  Also, I'm not being defensive; I don't 
understand your point of view here.  You've now made this same point on the 
SRFI-44 mailing list, so I'd like to understand what is at the root of this 
issue you're taking; I don't see it.

-d

0
10/30/2003 4:36:08 PM
David Rush wrote:
> On Wed, 29 Oct 2003 01:00:15 +0100, Jens Axel Søgaard
> <usenet@jasoegaard.dk> wrote:
> > David Rush wrote:
> >> 5) MrSpidey can't handle big enough programs - and I *really* wish it
> >> did. In fact, if MrSpidey could handle 15KLOC+ programs I would
> >> probably
> >> start to make the effort to move back to PLT for pre-production
> >> development.
> >
> > I have actually never tried MrSpidey - but you can't seriously
> > list that as a reason, since the competing languages don't have
> > similar tools.
>
> But PLT has *NO OTHER DEBUGGING TOOLS*. Or at least it didn't last time
> I used it. I will grant you that it has some of the best error messages
> on the planet, but sometimes you just need to be able to inspect a
> stack trace...

Both DrScheme and MzScheme give good stack traces.  The GUI ones in DrScheme
are particularly good, giving highlighted snippets of the source at each
location, so you can often see exactly what's going on without having to
consult the original source.  But if you want to be able to inspect the
values of variables etc., you'll have to wait for the real PLT debugger.  I
just use my ability to mentally simulate MrSpidey to figure out what a
particular value must have been to cause a particular error...  :)

Anton



0
anton58 (1240)
10/30/2003 4:53:56 PM
> >> 6) the v200 release b0rk3d my PLT code base [...]
> >
> > When it did that for Swindle, I had a similar reaction.  Forcing me
> > into using modules and other stuff that was incompatible sounded
> > really bad.  I gave it a shot, and the result was so much cleaner to
> > write,
>
> At the time it would have meant retargeting my code-generator and
> libraries to v200 so I had another meta-layer to work through. That
> > said targeting v200 *is* on my to-do list - just not at a very high
> priority (especially since I have Gambit working on all platforms)

Switching to v200 (now v205) is worth the effort.  It really justified the
+97 increment in the version number.



0
anton58 (1240)
10/30/2003 4:57:45 PM
"Bradd W. Szonye" <bradd+news@szonye.com.invalid> writes:

> MJ Ray <mjr@dsl.pipex.com> wrote:
>> Regardless, I still find it hard to believe that anyone has ever had a
>> real physical typecase containing all unicode symbols...
>
> I haven't seen one with the whole set of codes, and you probably won't
> see one in the future, because (for example) the typographical
> conventions for Greek and Japanese characters are very different, so it
> doesn't make much sense to put them both in the same typeface.

`Code 2000' has made the attempt to put a glyph in every place a glyph
should be.  (Obviously the whitespace is blank, but you get the idea.)

Microsoft Arial Unicode is also extremely complete.

0
jrm (1310)
10/30/2003 6:24:33 PM
In article <bnojqe$lgs$1@hood.uits.indiana.edu>,
Damien R. Sullivan <dasulliv@cs.indiana.edu> wrote:
>Actually, looking at my inefficient Fibonacci benchmark, mzscheme does *quite*
>well against Perl.  10x faster.  mzc module --prim does even better, getting
>to within 4x of C.  (Perl is 100x slower than C.)  I haven't gotten Chez
>Scheme to go that fast.  (It's a lot faster by default though.)  CMUCL can get
>within 2x of C, and ocaml matches C.
>
>How much Fibonacci generalizes to anything, I don't know; it's mostly testing
>recursive calls and numerics.

Speaking of fibonacci, mzscheme's bignum library is just not very fast,
even though it uses gmp.  Here are some timings with the beta Gambit-C
(which uses a bignum library written in Scheme) and current mzscheme on my
Mac Cube (500MHz PowerPC 7410):

[zakon:~/programs/gambc40b4/lib] lucier% gsi
Gambit Version 4.0 beta 4

> (define a (expt 3 1000000))       
> (define b (expt 3 1000001))
> (define c (time (* a a)))         
(time (* a a))
    1465 ms real time
    1090 ms cpu time (1090 user, 0 system)
    3 collections accounting for 35 ms real time (10 user, 0 system)
    17871352 bytes allocated
    no minor faults
    no major faults
> (define d (time (* a b)))    
(time (* a b))
    2077 ms real time
    1660 ms cpu time (1660 user, 0 system)
    1 collection accounting for 16 ms real time (0 user, 0 system)
    21364376 bytes allocated
    no minor faults
    no major faults
> (define e (time (quotient c a)))   
(time (quotient c a))
    8334 ms real time
    6940 ms cpu time (6940 user, 0 system)
    17 collections accounting for 155 ms real time (140 user, 0 system)
    107020216 bytes allocated
    no minor faults
    no major faults
> (define f (time (sqrt a)))
(time (sqrt a))
    7841 ms real time
    6420 ms cpu time (6420 user, 0 system)
    30 collections accounting for 214 ms real time (180 user, 0 system)
    122090160 bytes allocated
    no minor faults
    no major faults
> (define (fib-ratio n)
  (if (= n 1)
      1
      (+ 1 (/ (fib-ratio (- n 1))))))
> (define (fib n)
  (numerator (fib-ratio n)))
> (time (fib 1000))
(time (fib 1000))
    8 ms real time
    10 ms cpu time (10 user, 0 system)
    no collections
    363072 bytes allocated
    no minor faults
    no major faults
70330367711422815821835254877183549770181269836358732742604905087154537118196933579742249494562611733487750449241765991088186363265450223647106012053374121273867339111198139373125598767690091902245245323403501
> (define a (time (expt 3 1000000)))
(time (expt 3 1000000))
    1228 ms real time
    970 ms cpu time (970 user, 0 system)
    5 collections accounting for 37 ms real time (40 user, 0 system)
    17993856 bytes allocated
    no minor faults
    no major faults
> (define b (time (expt 3 1000001)))
(time (expt 3 1000001))
    1856 ms real time
    1630 ms cpu time (1630 user, 0 system)
    6 collections accounting for 39 ms real time (20 user, 0 system)
    28921592 bytes allocated
    no minor faults
    no major faults

and

[zakon:~/Desktop/PLT MzScheme v205/bin] lucier% ./mzscheme 
Welcome to MzScheme version 205, Copyright (c) 1995-2003 PLT
> (define a (expt 3 1000000))
> (define b (expt 3 1000001))
> (define c (time (* a a)))
cpu time: 3060 real time: 3570 gc time: 0
> (define d (time (* a b)))        
cpu time: 4590 real time: 5439 gc time: 0
> (define e (time (quotient c a)))  
cpu time: 10800 real time: 12258 gc time: 0
>  (define f (time (sqrt a)))
cpu time: 7820 real time: 8756 gc time: 0
> (define (fib-ratio n)
  (if (= n 1)
      1
      (+ 1 (/ (fib-ratio (- n 1))))))
> (define (fib n)
  (numerator (fib-ratio n)))
> (time (fib 1000))
cpu time: 379490 real time: 454623 gc time: 450
70330367711422815821835254877183549770181269836358732742604905087154537118196933579742249494562611733487750449241765991088186363265450223647106012053374121273867339111198139373125598767690091902245245323403501
> (define a (time (expt 3 1000000)))
cpu time: 5350 real time: 5983 gc time: 10
> (define b (time (expt 3 1000001)))
cpu time: 5110 real time: 6154 gc time: 0


I find stuff like this (and worse) every time I download PLT scheme.  Including
gmp seems to indicate that bignum and rational arithmetic is part of PLT's
target audience, but they just don't get it right.

Brad
0
bjl (76)
10/30/2003 8:17:52 PM
> Bradd W. Szonye wrote:
>> The bigger problem is that the native I/O functions use native RAISE
>> to throw exceptions, not SRFI-34 RAISE. That means that you cannot
>> fully implement both SRFI-34 and SRFI-36, unless you also create
>> replacements for all of those I/O functions. By the time you're done,
>> you'll end up duplicating a large chunk of the library.

David Van Horn <dvanhorn@emba.uvm.edu> wrote:
> Nothing has to be duplicated.  If you want IO procedures to (SRFI-34)
> raise (SRFI-36) conditions you can just wrap the PLT primitives to
> catch PLT exceptions and (SRFI-34) raise (SRFI-36) conditions.

No, that doesn't work, not for all cases. Here's why:

SRFI-34 requires handlers to execute in the dynamic context of (SRFI-34)
RAISE. A guard can pass on the exception, and a handler can throw a new
exception. In either case, the next handler in the chain also executes in
the dynamic context of the original RAISE. (This is explicit for
uncaught exceptions and implicit for new exceptions.)

It's possible to implement SRFI-34 so that it intercepts PLT RAISE. Just
install a new PLT current-exception-handler that translates PLT
conditions to SRFI-35 conditions (if necessary) and kicks off the
SRFI-34 mechanism.
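
Roughly like this (a sketch; plt-exn->condition and srfi34:raise are
placeholders for the translation function and the SRFI-34 entry point):

(current-exception-handler
 (lambda (exn)
   ;; hand every PLT exception to the SRFI-34 machinery as a SRFI-35 condition
   (srfi34:raise (plt-exn->condition exn))))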

The SRFI-34 handler chain executes entirely in the dynamic context of
the PLT RAISE (which has been translated to a SRFI-34 RAISE). So far, so
good. Now, it's an error to re-invoke PLT RAISE in the dynamic context
of PLT RAISE, but that shouldn't be a problem, right? Re-raises and new
exceptions use the SRFI-34 mechanism, so there's no double-raise from
PLT's point of view.

Not quite. Consider this:

(guard (x2 (else (display "caught again!") (newline)))
  (guard (x1 (else (display "caught!") (newline)))
    (display "this I/O fails") (newline)))

The innermost display fails and raises an I/O condition. A PLT exception
handler translates the condition to SRFI-35 and throws the exception as
a SRFI-34 exception. The inner guard catches the exception and tries to
print an error message, but that fails too.

This second exception *should* be translated into a SRFI-35 condition
and passed on to the outer guard, if I'm reading SRFI-34 correctly.
However, that's not what actually happens. This second PLT RAISE occurs
in the dynamic context of the original PLT RAISE, which is an error.

As written, SRFI-34 appears to support double-exceptions like this.
However, both exceptions use the same dynamic context, and that's not
permitted by PLT RAISE. You're OK so long as two PLT RAISEs don't
collide, but it cannot obey SRFI-34 semantics in all cases.

Regarding extensions to existing languages:

> Maybe your issue is that defining a new language is not as convenient
> as you'd like in PLT.

Yes, that's most of it.

> There had been some discussion about allowing a module's language to
> be specified by an arbitrary require spec.

That would probably help.

> But I don't recall what the conclusion on this was.  Regardless, it's
> really sugar.  The module system provides all the machinery needed to
> have SRFI-34 raise coexist peacefully with PLT raise.

Yes, that's true, but IMO it's sugar for some very bitter coffee.
Without the sugar, that coffee does its job, quite powerfully, but it
tastes yucky.

> ... I'm not being defensive; I don't understand your point of view
> here.  You've now made this same point on the SRFI-44 mailing list, so
> I'd like to understand what is at the root of this issue you're
> taking; I don't see it.

I hope I've explained it sufficiently now.
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
My Usenet e-mail address is temporarily disabled.
Please visit my website to obtain an alternate address.
0
news6468 (51)
10/30/2003 10:17:36 PM
Bradd W. Szonye wrote:
> As written, SRFI-34 appears to support double-exceptions like this.
> However, both exceptions use the same dynamic context, and that's not
> permitted by PLT RAISE. You're OK so long as two PLT RAISEs don't
> collide, but it cannot obey SRFI-34 semantics in all cases.

Search Help desk for `current-exception-handler'.

-d

0
10/30/2003 11:49:50 PM
David Rush <kumo@gofree.indigo.ie> writes:

> Well, I regularly target mutually incompatible versions of Linux as
> well as a couple of versions of Solaris, and HP/UX. For various
> reasons I end up installing a lot of stuff in either
> /usr/local/$OSPATH or $HOME/opt/$OSPATH where $OSPATH is set via
> values returned from uname.  Making PLT fit into this is tortuous,
> to say the least.

*Please* define "tortuous" in terms of things you need to do.  I've
exchanged emails about this with Bradd, and it looks like the best
solution would be to have another make target that will spread a few
links in standard places.  In other words, I don't see anything that
you'd do that could be described as tortuous -- you might have other
stuff that should be done.

-- 
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                  http://www.barzilay.org/                 Maze is Life!
0
eli666 (555)
10/31/2003 1:46:33 AM
David Rush <kumo@gofree.indigo.ie> writes:

> On 28 Oct 2003 20:59:42 -0500, Eli Barzilay <eli@barzilay.org> wrote:
> > David Rush <kumo@gofree.indigo.ie> writes:
> >
> >> 2) it was a pain to make fast. The notion of 'standalone
> >>    executable', while ostensibly supported involved a complete
> >>    rebuild of the PLT core
> >
> > The standard meaning of a `standalone executable' never had anything to
> > do with a complete rebuild.
> 
> ISTR, that if you didn't want to just glue your bytecode on to the
> pre-existing PLT binary, but you wanted an honest-to-god native
> binary that went all the way down, you did yes.

First of all, most references to "standalone" are just that, which is
why I wrote that the *standard* meaning was never related to a
rebuild.

Second, what's unbinary in slapping the bytecode on the executable in
this way?  It's not like it costs anything to load it from itself.  So
it is standalone, and it is executable.  Even with the actual source
attached in this way, the price you pay is the same as loading the
sources which shouldn't be a problem, and shouldn't make it
not-executable or not-standalone.

Third, if you do want such a "real honest-to-god native binary" with
any system, then how can you do it without a "complete rebuild"?


> I've never really considered the glued-bytecode version to be a
> standalone executable - just a cleverly packaged VM+bytecodes. I
> suppose that reflects my own terminological difficulties.

(Would it be better to xor the byte code with some pattern?)

But seriously, I don't see any problem with that.  Especially when mzc
is declared to not be an optimizing compiler.


> >> 3) I write daemons and command-line programs and don't need GUI
> >>    bells and whistles; if I did, PLT would be right up
> >>    there. [...]
> >
> > I've done this for years, and am still doing this as my heaviest
> > usage.  I fail to see how having bells and whistles stands in my
> > way.
> 
> It doesn't. However, having them when they're not needed does not
> motivate me to use the system.

The question is -- does it *bother* you to have stuff you don't need?
And if it does, then what's wrong with downloading just mzscheme?  (It
is available as a separate package.)


> >> 6) the v200 release b0rk3d my PLT code base [...]
> >
> > When it did that for Swindle, I had a similar reaction.  Forcing
> > me into using modules and other stuff that was incompatible
> > sounded really bad.  I gave it a shot, and the result was so much
> > cleaner to write,
> 
> At the time it would have meant retargeting my code-generator and
> libraries to v200 so I had another meta-layer to work through. That
> said targeting v200 *is* on my to-do list - just not at a very high
> priority (especially since I have Gambit working on all platforms)

I can just join Anton and tell you that after a short period of being
frustrated, I found the module system and the way that it handles
syntax to be one of the best things I've ever seen.  (And I don't
spend "best"s without intending to.)  I don't think that there is any
way I could get Swindle as it is done now, where it just works, and
fully cooperate with other modules -- while being able to redefine
stuff as it does.

-- 
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                  http://www.barzilay.org/                 Maze is Life!
0
eli666 (555)
10/31/2003 2:03:39 AM
David Rush <kumo@gofree.indigo.ie> writes:

> [...] but sometimes you just need to be able to inspect a stack
> trace...

I'm sorry to sound like a fanatic, but it really looks like it has been a
while since you tried it:

| Welcome to MzScheme version 205, Copyright (c) 1995-2003 PLT
| > (require (lib "errortrace.ss" "errortrace"))
| > (define (a) (+ (b) (b)))
| > (define (b) (c))
| > (a)
| reference to undefined identifier: c
| STDIN::84: c
| STDIN::83: (c)
| STDIN::58: (+ (b) (b))

-- 
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                  http://www.barzilay.org/                 Maze is Life!
0
eli666 (555)
10/31/2003 2:06:59 AM
> Bradd W. Szonye wrote:
>> As written, SRFI-34 appears to support double-exceptions like this.
>> However, both exceptions use the same dynamic context, and that's not
>> permitted by PLT RAISE. You're OK so long as two PLT RAISEs don't
>> collide, but it cannot obey SRFI-34 semantics in all cases.

David Van Horn <dvanhorn@emba.uvm.edu> wrote:
> Search Help desk for `current-exception-handler'.

I know how current-exception-handler works. I even mentioned how you
would use it to implement the translation from PLT exceptions to SRFI-34
exceptions. It does not change the facts that:

1. PLT RAISE does not permit nested or overlapping exceptions in the
   same dynamic context.
2. SRFI-34 RAISE and GUARD require that nested and overlapping
   exceptions execute in the dynamic context of the original RAISE.

It's possible to handle *most* cases with clever use of current-
exception-handler, but it will not obey SRFI-34 semantics in *all*
cases. If the original RAISE originates from the PLT primitive, a guard
expression *cannot* properly handle a nested or overlapping PLT RAISE,
because it requires behavior that PLT RAISE does not permit.

Specifically, an I/O function can raise an exception using the PLT
primitive. While handling that in SRFI-34 style, you *cannot* handle a
second I/O exception. The only way to avoid that is to rewrite all
library code to use SRFI-34 exceptions instead of PLT exceptions.
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
My Usenet e-mail address is temporarily disabled.
Please visit my website to obtain an alternate address.
0
news6468 (51)
10/31/2003 3:22:32 AM
bjl@cs.purdue.edu (Bradley J Lucier) writes:

> I find stuff like this (and worse) every time I download PLT scheme.

You know, Brad, a lot of people respect your contributions, especially
to getting numerics right.  But remarks like this are basically not
much more than flames.  I'd hope for better.

Could you be more specific about

  stuff like this

and

  (and worse)

and, for that matter,

  every time

(Like, you download it every week, and each time you find more stuff
broken?  Or you download it once every three years, try some toy
benchmark, and conclude that things have gotten worse?)

> Including gmp seems to indicate that bignum and rational arithmetic
> is part of PLT's target audience, but they just don't get it right.

Could you possibly be more helpful?  Unless I missed something, your
message didn't report any errors.  So PLT Scheme doesn't seem to get
it *wrong*, either.  There does seem to be a difference in performance
between MzScheme and Gambit.  Is this what you mean by "just don't get
it right"?  (Surely you recognize that including the word "just" is
provocative, like you're referring to the bumbling earnest striver in
your class who can't seem to make better than a C.  Or perhaps that's
really how you view PLT Scheme, in which case I'd definitely love to
hear more.)

Shriram
0
sk1 (223)
10/31/2003 4:21:24 AM
Eli Barzilay <eli@barzilay.org> wrote:
> I can just join Anton and tell you that after a short period of being
> frustrated, I found the module system and the way that it handles
> syntax to be one of the best things I've ever seen.  (And I don't
> spend "best"s without intending to.)

I agree; PLT's module design is well done. However, for some tasks, it
could use a bit of syntactic sugar, or an interface layer, or a
recommended design pattern, or something.

For example, SRFI-1 extends a couple of R5RS functions. It isn't
difficult to use them in a module, but it takes more than just a simple
REQUIRE statement. You need to do something like this instead:

    ;; define a module language to shadow R5RS functions
    (module srfi-1-scheme mzscheme
      (require (lib "1.ss" "srfi"))
      (provide (all-except mzscheme ...) ; remove the standard versions
               (all-except |1| ...)      ; provide most of SRFI-1
               (rename |1| ...)))        ; provide the extended functions

    ;; use the modified scheme for your actual code
    (module my-program srfi-1-scheme
      ;; actual module code goes here
      ...)

That's not too difficult, but it may not be obvious to a new PLT
Schemer. (It didn't take me long to figure it out, but it wasn't
immediately obvious to me.)

It gets a bit more complicated if you want to use several such
R5RS-extending modules. Suppose that you write three modules. One needs
SRFI-1 extensions, one needs SRFI-13 extensions, and one needs a bit of
both. You could write a single "extended language" module that follows
the above pattern, and use it as the base language for all three program
modules. That's good enough for small programs, but it's not a very good
idea for large systems. (John Lakos explains why in the excellent
/Large-Scale C++ Software Design./ While that book is aimed squarely at
C++ programmers, many of its lessons are applicable to all large
projects, just as HTDP and SICP are useful to more than just Schemers.)

So my experience with large systems tells me that it's better to create
a separate language module for each combination of language extensions
I'll be using. Hrm, yucky. Time to look for a better pattern. This looks
like a good candidate:

    ;; Define a minimal "Scheme" that provides only a module interface
    (module min-require-environment mzscheme
      (provide provide require))

    ;; Build up Scheme from scratch for the code modules
    (module my-program min-require-environment
      (require (all-except mzscheme ...)
               (lib "1.ss" "srfi")
               (lib "13.ss" "srfi"))
      ;; actual module code goes here
      ...)

The idea here is that you create a minimal environment for new modules,
and import exactly what you need. This is effectively the same thing as
permitting an arbitrary require-spec as the initial module language.

It's still a bit of a pain to explicitly state what you want from
mzscheme, SRFI-1, SRFI-13, etc., in every module that you use them. That
kind of tedious and error-prone task seems like it's better suited for
an automated process. That's what automated processes do best: Handle
the tedium and the error-prone aspects.

So I'm currently thinking that "extension modules" should have some sort
of associated data that declares what the conflicts are. I think PLT
Scheme does the right thing by disallowing shadowing by default, but I
also think there should be a way to handle the common case where you
really do want to shadow core language features with extensions. The
above methods do work, but they're cumbersome; it'd be better to use
tools -- macros, or maybe a preprocessor -- to handle the tedious
details when you really do want shadowing.
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
My Usenet e-mail address is temporarily disabled.
Please visit my website to obtain an alternate address.
0
news6468 (51)
10/31/2003 4:25:28 AM
On Thu, 30 Oct 2003 12:53:19 +0000, David Rush wrote:
> Well, I regularly target mutually incompatible versions of Linux as
> well as a couple of versions of Solaris, and HP/UX. For various reasons
> I end up installing a lot of stuff in either /usr/local/$OSPATH or
> $HOME/opt/$OSPATH where $OSPATH is set via values returned from uname.
> Making PLT fit into this is tortuous, to say the least. I would like it
> very much if
> 
> 	./configure --prefix=$HOME/opt/$OSPATH
> 	make
> 	make install
> 
> just did the right thing, even at the cost of disk wastage. It doesn't,
> or at least it didn't last time I tried (v203 or so). Most everything
> else (even Bigloo which uses a non-standard configure ... grr) just does
> the Right Thing.
> 

This is what I do:

./configure --prefix=/usr/local/stow/plt-20x
make
make install
cd /usr/local/stow/plt-20x && ./install
cd /usr/local/stow && stow plt-20x


It does the Right Thing(tm) on Debian.  I'm sure you can substitute
plt-20x-$OSPATH for plt-20x.

- Daniel

0
dsilva (35)
10/31/2003 4:56:28 AM
Daniel P. M. Silva <dsilva@ccs.neu.edu> wrote:
> This is what I do:
> 
> ./configure --prefix=/usr/local/stow/plt-20x
> make
> make install
> cd /usr/local/stow/plt-20x && ./install
> cd /usr/local/stow && stow plt-20x
> 
> It does the Right Thing(tm) on Debian.  I'm sure you can substitute
> plt-20x-$OSPATH for plt-20x.

Hey, that's handy.
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
My Usenet e-mail address is temporarily disabled.
Please visit my website to obtain an alternate address.
0
news6468 (51)
10/31/2003 5:08:41 AM
On Thu, 30 Oct 2003 12:56:25 +0000, David Rush wrote:
> But PLT has *NO OTHER DEBUGGING TOOLS*. Or at least it didn't last time
> I used it.

There is a "Debug" button in the DrScheme window now.  It stands next to
the "Stepper" button.

And I usually load the following to trace a namespace's functions:

(require (lib "trace.ss"))

;; trace all functions in the current namespace
(define (trace-all)
  (eval `(begin (require (prefix mz: (lib "trace.ss")))
                (mz:trace ,@(all-function-symbols)))))

;; untrace all functions in the current namespace
(define (untrace-all)
  (eval `(begin (require (prefix mz: (lib "trace.ss")))
                (mz:untrace ,@(all-function-symbols)))))

;; all-function-symbols: -> (listof symbol)
;; get a list of symbols representing all global functions
(define (all-function-symbols)
  (let ([mz:filter (dynamic-require '(lib "list.ss") 'filter)])
    (mz:filter (lambda (s)
                 (and (procedure? (namespace-variable-value s #f
                                                            (lambda () #f)))
                      (not (regexp-match "mz:"
                                         (symbol->string s)))
                      (not (regexp-match "trace-table"
                                         (symbol->string s)))))
               (namespace-mapped-symbols))))


(trace-all)


0
dsilva (35)
10/31/2003 5:27:10 AM
David Rush wrote:

> ...deleted
> else (even Bigloo which uses a non-standard configure ... grr)

i have added things to gnu configure and i have added things to bigloo 
configure. it was much easier for bigloo.
perhaps somebody that does a lot of gnu configure thinks otherwise? 
(like the ''fact'' that windows is more intuitive than mac-os. ie, 
windows users find mac-os harder to use than windows :-)


bengt

0
10/31/2003 6:38:18 AM
On 30 Oct 2003 21:03:39 -0500, Eli Barzilay <eli@barzilay.org> wrote:
> David Rush <kumo@gofree.indigo.ie> writes:
>> On 28 Oct 2003 20:59:42 -0500, Eli Barzilay <eli@barzilay.org> wrote:
>> > David Rush <kumo@gofree.indigo.ie> writes:
>> >
>> >> 2) it was a pain to make fast. The notion of 'standalone
>> >>    executable', while ostensibly supported involved a complete
>> >>    rebuild of the PLT core
>> >
>> > The standard meaning of a `standalone executable' never had anything to
>> > do with a complete rebuild.
>>
>> ISTR, that if you didn't want to just glue your bytecode on to the
>> pre-existing PLT binary, but you wanted an honest-to-god native
>> binary that went all the way down, you did yes.
>
> First of all, most references to "standalone" are just that, which is
> why I wrote that the *standard* meaning was never related to a
> rebuild.

*I* said that this may just be a terminological problem. Chill.

> Second, what's unbinary in slapping the bytecode on the executable in
> this way?  It's not like it costs anything to load it from itself.

Well, binary is frequently used as shorthand for 'native code', sorry
if my usage was confusing. That version of app delivery with PLT is not
native binary.

> Third, if you do want such a "real honest-to-god native binary" with
> any system, then how can you do it without a "complete rebuild"?

*most* Scheme implementations don't require you to rebuild the run-time
(or even re-link it) in order to ship a native binary. Generally, the
*most* you have to do is link with their standard run-time - some of
them even hide that step from you.

>> I've never really considered the glued-bytecode version to be a
>> standalone executable - just a cleverly packaged VM+bytecodes. I
>> suppose that reflects my own terminological difficulties.
>
> But seriously, I don't see any problem with that.  Especially when mzc
> is declared to not be an optimizing compiler.

Gee, it has optimization flags. It claims to deliver native code via
compilation through C. I just can't use the native code unless I jump
through hoops that cause me (IIRC) to have to make a clone of the PLT
source code tree and go through some convoluted extra link process. And
no I don't remember the details because when I read the instructions I
was so confused by them that I decided that it just wasn't going to be
worth the effort.

There's a bunch of stuff I *do* like about PLT. But you have to admit
that it is *heavily* geared towards interactive development and the use
of the MrEd framework. I mean loading a .so into a running mzscheme or
DrEd is (or was) way simpler than producing a native-code, standalone
executable. If this is just a documentation problem then you have my
apologies, but as it stands PLT gives me more pain than pleasure.

david rush
-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
0
kumo7543 (108)
10/31/2003 12:01:06 PM
Eli Barzilay <eli@barzilay.org> wrote in message news:<skekwumadg.fsf@mojave.cs.cornell.edu>...
> 
> Third, if you do want such a "real honest-to-god native binary" with
> any system, then how can you do it without a "complete rebuild"?
> 

c:\home>csc tak.scm -static -v
csc tak.scm -static -v
tak.c
c:\chicken\chicken tak.scm -output-file tak.c -quiet
cl tak.c /Fotak.obj /nologo /c "/I%CHICKEN_HOME%" /DC_NO_PIC_NO_DLL
del tak.c
link /out:tak.exe /nologo tak.obj
c:\chicken\libstuffed-chicken-static.lib
c:\chicken\libsrfi-chicken-static.lib c:\chicken\libchicken-static.lib
del tak.obj

c:\home>dir tak.exe
dir tak.exe
 Volume in Laufwerk C: hat keine Bezeichnung.
 Volumeseriennummer: C4CC-5729

 Verzeichnis von c:\home

31.10.2003  13:04           548.864 tak.exe
               1 Datei(en)        548.864 Bytes
               0 Verzeichnis(se),  1.832.878.080 Bytes frei

c:\home>dumpbin /dependents tak.exe
dumpbin /dependents tak.exe
Microsoft (R) COFF Binary File Dumper Version 6.00.8168
Copyright (C) Microsoft Corp 1992-1998. All rights reserved.


Dump of file tak.exe

File Type: EXECUTABLE IMAGE

  Image has the following dependencies:

    KERNEL32.dll

  Summary

        B000 .data
        1000 .rdata
       77000 .text

c:\home>


cheers,
felix
0
felix1812 (33)
10/31/2003 12:10:18 PM
In article <w7d7k2mm3zv.fsf@cs.brown.edu>,
Shriram Krishnamurthi  <sk@cs.brown.edu> wrote:
>bjl@cs.purdue.edu (Bradley J Lucier) writes:
>Could you be more specific about
>
>  stuff like this

"Stuff like this" was inefficiencies.

>and
>
>  (and worse)

Bugs.

>and, for that matter,
>
>  every time
>
>(Like, you download it every week, and each time you find more stuff
>broken?  Or you download it once every three years, try some toy
>benchmark, and conclude that things have gotten worse?)

About once a year after I read another PLT love-fest thread
in c.l.s.  And I find less stuff broken.  And that wasn't a "toy
benchmark", it was some code I wrote in the last two weeks in 
notes I am typing about using Scheme for mathematical computations for
beginning math graduate students.  The emphasis was on knowing the intrinsic
algorithmic costs of various operations.  I was trying to show the
students the difference between things that are realistically possible and
the things that might be impossible to compute in a reasonable time.

>> Including gmp seems to indicate that bignum and rational arithmetic
>> is part of PLT's target audience, but they just don't get it right.
>
>Could you possibly be more helpful?  Unless I missed something, your
>message didn't report any errors.  So PLT Scheme doesn't seem to get
>it *wrong*, either.  There does seem to be a difference in performance
>between MzScheme and Gambit.

"difference in performance": On the fibonacci code mzscheme was
38,000 times slower than the beta of Gambit-C.  In my opinion
that's a large enough quantitative difference to make a qualitative
statement about the implementation.  But it's only one small snippet of
code; how many examples like this do you want me to look for?

I was specifically replying to the meta-argument made in this thread that
"mzscheme includes gmp so bignum/rational arithmetic is fast".  This
meta-argument is false (it would be false even if the bignum/rational
arithmetic in mzscheme were fast).

Brad
0
bjl (76)
10/31/2003 1:57:53 PM
Grzegorz Chrupała <grzegorz@pithekos.net> writes:

> Please don't spread misinformation. Unicode *DOES* define case
> mappings.  Quoting from
> <http://www.unicode.org/Public/UNIDATA/UCD.html#Case_Mappings>

Cool, I wasn't aware of that. Still, I don't think this kind of
information is something a compiler should have to bother with. It would
probably be better implemented with some sort of macros.

-- 
Björn Lindström <bkhl@elektrubadur.se>
http://bkhl.elektrubadur.se/
0
bkhl (10)
10/31/2003 2:08:02 PM
David Rush <kumo@gofree.indigo.ie> writes:

> >> > David Rush <kumo@gofree.indigo.ie> writes:
> >> >
> >> >> 2) it was a pain to make fast. The notion of 'standalone executable',
                                                      ^^^^^^^^^^^^^^^^^^^^^
> >> >>    while ostensibly supported involved a complete rebuild of
> >> >>    the PLT core
> 
> Well, binary is frequently used as shorthand for 'native code',
        ^^^^^^
> sorry if my usage was confusing. That version of app delivery with
> PLT is not native binary.


> *most* Scheme implementations don't require you to rebuild the
> run-time (or even re-link it) in order to ship a native
> binary. Generally, the *most* you have to do is link with their
> standard run-time - some of them even hide that step from you.

and felix@proxima-mt.de (felix) writes:

> c:\home>csc tak.scm -static -v
> [...]

It was supposed to be a stupid joke.  You're both talking about using
some program (ld) that takes some binary stuff from different files and
slaps it together, possibly changing a few pointers.  Under a
definition of "binary", this (or anything else that wasn't directly
compiled) doesn't count as "real honest-to-god native binary".  Such a
definition would be similar to David's definition of a "standalone
executable" that exclude some standalone files that are executable...

Apologies for my idea of humor when it's too late...

-- 
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                  http://www.barzilay.org/                 Maze is Life!
0
eli666 (555)
10/31/2003 6:23:35 PM
OK, I'll try to crank down my rhetoric.  Here are some more examples from my
notes.  Perhaps mzscheme could benefit from a later version of gmp.  Perhaps gmp
isn't well enough integrated into mzscheme's bignum/rational arithmetic.
Perhaps it shows nothing at all---after all, mzscheme does beat Gambit-C in some
benchmarks (especially the latter ones that heavily involve the interpreter).

If someone can tell me how to compute an integer sqrt in mzscheme I'll run
the Brent-Salamin code for pi.
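
(For what it's worth, a portable Newton-iteration integer square root is
sketched below.  It is only a sketch, it is not what produced any of the
timings here, and for million-digit operands you would want a much better
initial guess than n itself.)

(define (isqrt n)
  ;; floor of the square root of a nonnegative exact integer
  (if (< n 2)
      n
      (let loop ((x n) (y (quotient (+ n 1) 2)))
        (if (< y x)
            (loop y (quotient (+ y (quotient n y)) 2))
            x))))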

Like I said, this is code from some notes to tell math students how much
different operations cost (among other things).  Notes are at the bottom.

Formula                                               CPU times (ms)
                                            beta Gambit-C      mzscheme 205

(expt 3 1000000)  ; a                             930              4700
(expt 3 1000001)  ; b                            1610              4970
(* a a)           ; c                            1170              3150
(* a b)           ;                              1640              4660
(quotient c a)    ;                              6960             10780
(sqrt c)          ;                             14080             57240
(fib 100000)      ; a, note 1                    4280              7560
(fib 100001)      ; b                            4450              7790
(gcd a b)         ;                             27580             88780
(gcd a b)         ; a=3^100000, b=2^100000      23380             56230
(expt1 3 1000000) ; note 2                        920              1770
(expt2 3 1000000) ; note 3                       6510              6290
(* a a)           ; a=3^1000000                  1180              2950
(expt 10 10000000); a                           42770            486200
(expt 2 10000000) ; b                             990             70300
(quotient a b)    ;                               140           3169230
(expt 2/3 10000)  ; a                               0               130
(expt 3/5 10000)  ; b                              10               370
(* a b)           ;                               280               390
(fib 1000)        ; note 4                         10              2790
(factorial 10000) ; note 5                       1630              1210
(partial-factorial 0 10000) ; note 6              170               160
(binary-splitting-compute-e 1000) ; note 7        980               400
(naive-compute-e 1000) ; note 8                142970             51520
(binary-splitting-compute-pi 1000) ; note 9      2070              1130
(pi-brent-salamin) ; n. 10, beta^k=10^1000000 1345230
(pi-brent-salamin) ; beta^k=2^33219            902040

note 1:

(define (fib n)   ; works for $n\geq 2$
  (let loop ((i       2)
             (fib_i-1 1)
             (fib_i   1))
    (if (= i n)
        fib_i
        (loop (+ i 1) fib_i (+ fib_i-1 fib_i)))))

note 2:

(define (expt1 a b)
  (define (square x) (* x x))
  (cond ((= b 0) 1)
        ((even? b)
         (square (expt1 a (quotient b 2))))
        (else
         (* a (square (expt1 a (quotient b 2)))))))

note 3:

(define (expt2 a b)
  (define (square x) (* x x))
  (cond ((= b 0) 1)
        ((even? b)
         (expt2 (square a) (quotient b 2)))
        (else
         (* a (expt2 (square a) (quotient b 2))))))

note 4:

(define (fib-ratio n)
  (if (= n 1)
      1
      (+ 1 (/ (fib-ratio (- n 1))))))
(define (fib n)
  (numerator (fib-ratio n)))

note 5:

(define (factorial n)
  (let loop ((i 1)
             (result 1))
    (if (> i n)
        result
        (loop (+ i 1)
              (* i result)))))

note 6:

(define (partial-factorial m n)
  ;; computes the product (m+1) * ... * (n-1) * n
  (if (< (- n m) 10)
      (do ((i (+ m 1) (+ i 1))
           (result 1 (* result i)))
          ((> i n) result))
      (* (partial-factorial m (quotient (+ m n) 2))
         (partial-factorial (quotient (+ m n) 2) n))))

note 7:

(define (binary-splitting-partial-sum m n
                                      partial-term
                                      common-factor-ratio)
  ;; sums (partial) terms from m to n-1
  ;; (partial-term n m) is the term at n with the common factors of terms >= m removed
  ;; (common-factor-ratio m n) is the ratio of the common factor of terms >= n divided by
  ;; the common factors of terms >= m
  (if (< (- n m) 10)
      (do ((i m (+ i 1))
           (result 0 (+ result (partial-term m i))))
          ((= i n) result))
      (+ (binary-splitting-partial-sum m (quotient (+ m n) 2) partial-term common-factor-ratio)
         (* (common-factor-ratio m (quotient (+ m n) 2))
            (binary-splitting-partial-sum (quotient (+ m n) 2) n partial-term common-factor-ratio)))))

(define (binary-splitting-sum n partial-term common-factor)
  (binary-splitting-partial-sum 0 n partial-term common-factor))

(define (binary-splitting-compute-e n)
  (binary-splitting-sum n
                       (lambda (m n) (/ (partial-factorial m n)))
                       (lambda (m n) (/ (partial-factorial m n)))))

note 8:

(define (naive-compute-e n)
  (do ((k 0 (+ k 1))
       (sum 0 (+ sum (/ (partial-factorial 0 k)))))
      ((= k n) sum)))

note 9:

(define (binary-splitting-compute-atan n x)
  ;; here we just consider the common factor to be x^(2n+1)
  (* x    ;; common factor for all terms
     (binary-splitting-sum n
                           (lambda (m n) (/ (expt x (* 2 (- n m)))
                                         (* (if (odd? n) -1 1) ( + (* 2 n) 1)))) 
                           (lambda (m n) (expt x (* 2 (- n m)))))))

(define (binary-splitting-compute-pi n)
  (* 4 (- (* 4 (binary-splitting-compute-atan n 1/5))
          (binary-splitting-compute-atan (quotient (* n 10) 34) 1/239))))


note 10:

(define (fixed.+ x y)
  (+ x y))
(define (fixed.- x y)
  (- x y))
(define (fixed.* x y)
  (quotient (* x y) beta^k))
(define (fixed.square x)
  (fixed.* x x))
(define (fixed./ x y)
  (quotient (* x beta^k) y))
(define (fixed.sqrt x)
  (##exact-int.sqrt (* x beta^k)))
(define (number->fixed x)
  (round (* x beta^k)))
(define (fixed->number x)
  (/ x beta^k))

(define (pi-brent-salamin)
  (let ((one (number->fixed 1)))
    (let loop ((a one)
               (b (fixed.sqrt (quotient one 2)))
               (t (quotient one 4))
               (x 1))
      (if (= a b)
          (fixed./ (fixed.square a) t)
          (let ((new-a (quotient (fixed.+ a b) 2)))
            (loop new-a
                  (fixed.sqrt (fixed.* a b))
                  (fixed.- t (* x (fixed.square (fixed.- new-a a))))
                  (* 2 x)))))))

0
bjl (76)
10/31/2003 8:49:56 PM
Bradley J Lucier wrote:
> OK, I'll try to crank down my rhetoric.  Here are some more examples from my
> notes.  Perhaps mzscheme could benefit from a later version of gmp.  Perhaps gmp
> isn't well enough integrated into mzscheme's bignum/rational arithmetic.
> Perhaps it shows nothing at all---after all, mzscheme does beat Gambit-C in some
> benchmarks (especially the latter ones that heavily involve the interpreter).

Which options did you use when you invoked mzscheme?

Or did you compile it with mzc?


-- 
Jens Axel Søgaard

0
usenet153 (246)
10/31/2003 10:04:43 PM
Bradley J Lucier wrote:
> If someone can tell me how to compute an integer sqrt in mzscheme I'll run
> the Brent-Salamin code for pi.

This brings back memories:

<http://groups.google.com/groups?q=jens+axel+dk.videnskab+pi+scheme&hl=en&lr=&ie=UTF-8&selm=snhrn35u.fsf%40soegaard.net&rnum=3>

This uses the Bailey-Borwein-Plouffe algorithm and thus needs nothing
other than floating point!

This program was one of my first in Scheme, so don't pay too
much attention to the style (I suddenly get an urge to rewrite it).

-- 
Jens Axel Søgaard

0
usenet153 (246)
10/31/2003 10:46:36 PM
Bradley J Lucier wrote:
> OK, I'll try to crank down my rhetoric.  Here are some more examples from my
> notes.  Perhaps mzscheme could benefit from a later version of gmp.  Perhaps gmp
> isn't well enough integrated into mzscheme's bignum/rational arithmetic.
> Perhaps it shows nothing at all---after all, mzscheme does beat Gambit-C in some
> benchmarks (especially the latter ones that heavily involve the interpreter).
> 
> If someone can tell me how to compute an integer sqrt in mzscheme I'll run
> the Brent-Salamin code for pi.

I haven't the patience to try the following for n=10^1000000, but here
is one way to get it to run a little faster:

pc125218: /tmp/ $ mzc --prim --unsafe-disable-interrupts 
--unsafe-skip-tests --exe pi-brent-salamin pi-brent-salamin.scm
MzScheme compiler (mzc) version 204.6, Copyright (c) 1996-2003 PLT
  [output to "pi-brent-salamin"]
pc125218: /tmp/ $ time ./pi-brent-salamin
314159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706798214808651328230664709384460955058223172535940812848111745028410270193852110555964462294895493038196442881097566593344612847564823378678316527120190914564856692346034861045432664821339360726024914127372458700660631558817488152092096282925409171536436789259036001133053054882046652138414695194151160943305727036575959195309218611738193261179310511854807446237996274956735188575272489122793818301194912983367336244065664308602139494639522473719070217986094370277053921717629317675238467481846766940513200056812714526356082778577134275778960917363717872146844090122495343014654958537105079227968925892354201995611212902196086403441815981362977477130996051870721134999999837297804995105973173281609631859502445945534690830264252230825334468503526193118817101000313783875288658753320838142061717766914730359825349042875546873115956286388235378759375195778185778053217122680661300192787661119590
921641993800719631802083755674748798620626506933520789559185079203995178534792226473561091405298384024833370412965703904918870009582616872275744356706019138079962770994950847663509750718103178666530734300821650697230830340458296207271747308127917112194660244155162929113671781294159435408464209151891033141032707556427322835309417328739764088836203571856405984643825727608761112705757335215634221637924150693460974641535658320165467058920953415954930968225922133531681276198299951046833474286163539890641494872091074603800501082198251968692425049374776349166101705705865931533123677119336944550975960150969084497916613594788518802614543799025610882966728025353970340335761441578026378922707798090178813575081885149514575723923791179181891368104305536908019441691565890585844592600899660682331700639242783217204264592582867486290394067491761868224171043251157788359158270497234453794881068201132169079776351337105447705267995758434525504213230442292305457902929912885803419707152512332566019
562147888136084617153

real    0m36.565s
user    0m19.610s
sys     0m0.330s



(module pi-brent-salamin mzscheme
   (provide pi-brent-salamin)

   (define n      10)
   (define beta^k (expt 10 1000))

   (define (fixed.+ x y)
     (+ x y))

   (define (fixed.- x y)
     (- x y))

   (define (fixed.* x y)
     (quotient (* x y) beta^k))

   (define (fixed.square x)
     (fixed.* x x))

   (define (fixed./ x y)
     (quotient (* x beta^k) y))

   (define (fixed.sqrt x)
     (integer-square-root (* x beta^k)))

   (define (adjust-fences n lower-bound upper-bound)
     (if (= (add1 lower-bound) upper-bound)
         lower-bound
         (let ((average (quotient (+ lower-bound upper-bound) 2)))
           (if (< n (* average average))
               (adjust-fences n lower-bound average)
               (adjust-fences n average upper-bound)))))

   (define (integer-square-root n)
     (adjust-fences n 0 (add1 n)))

   (define (number->fixed x)
     (round (* x beta^k)))
   (define (fixed->number x)
     (/ x beta^k))

   (define (pi-brent-salamin)
     (let ((one (number->fixed 1)))
       (let loop ((a one)
                  (b (fixed.sqrt (quotient one 2)))
                  (t (quotient one 4))
                  (x 1))
         (if (= a b)
             (fixed./ (* a a) t)
             (let ((new-a (quotient (fixed.+ a b) 2)))
               (loop new-a
                     (fixed.sqrt (fixed.* a b))
                     (fixed.- t (* x (fixed.square (fixed.- new-a a))))
                     (* 2 x)))))))

   (display (pi-brent-salamin))
   (newline)
   )
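
[For reference, a quicker integer square root than the bisection above -- a
sketch, not from the original post: Newton's method on exact integers, which
also returns (floor (sqrt n)).]

(define (integer-sqrt n)
  ;; A sketch only: for huge n a smarter initial guess (roughly a power of
  ;; two with half as many bits as n) saves many of the early iterations.
  (if (< n 2)
      n
      (let loop ((x n))
        (let ((y (quotient (+ x (quotient n x)) 2)))
          (if (< y x)
              (loop y)
              x)))))

;; (integer-sqrt 10)            =>  3
;; (integer-sqrt (expt 10 20))  =>  10000000000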

0
usenet153 (246)
10/31/2003 11:06:29 PM
Eli Barzilay wrote:
> 
> David Rush <kumo@gofree.indigo.ie> writes:
> 
> > On 28 Oct 2003 20:59:42 -0500, Eli Barzilay <eli@barzilay.org> wrote:
> > > David Rush <kumo@gofree.indigo.ie> writes:
> > >
> > >> 2) it was a pain to make fast. The notion of 'standalone
> > >>    executable', while ostensibly supported involved a complete
> > >>    rebuild of the PLT core
> > >
> > > The standard meaning of a `standalone executable' never had anything
> > > to do with a complete rebuild.
> >
> > ISTR, that if you didn't want to just glue your bytecode on to the
> > pre-existing PLT binary, but you wanted an honest-to-god native
> > binary that went all the way down, you did yes.
> 
> First of all, most references of "standalone" are just that, which is
> why I wrote that the *standard* meaning was never related to a
> rebuild.
> 
> Second, what's unbinary in slapping the bytecode on the executable in
> this way?  It's not like it costs anything to load it from itself.  So
> it is standalone, and it is executable.  Even with the actual source
> attached in this way, the price you pay is the same as loading the
> sources which shouldn't be a problem, and shouldn't make it
> not-executable or not-standalone.
> 
> Third, if you do want such a "real honest-to-god native binary" with
> any system, then how can you do it without a "complete rebuild"?
> 
> > I've never really considered the glued-bytecode version to be a
> > standalone executable - just a cleverly packaged VM+bytecodes. I
> > suppose that reflects my own terminological difficulties.
> 
> (Would it be better to xor the byte code with some pattern?)
> 
> But seriously, I don't see any problem with that.  Especially when mzc
> is declared to not be an optimizing compiler.
> 
> > >> 3) I write daemons and command-line programs and don't need GUI
> > >>    bells and whistles; if I did, PLT would be right up
> > >>    there. [...]
> > >
> > > I've done this for years, and am still doing this as my heaviest
> > > usage.  I fail to see how having bells and whistles stands in my
> > > way.
> >
> > It doesn't. However, having them when they're not needed does not
> > motivate me to use the system.
> 
> The question is -- does it *bother* you to have stuff you don't need?
> And if it does, then what's wrong with downloading just mzscheme?  (It
> is available as a separate package.)

What bothers me (I won't speak to particular scheme implementations 
here) is when someone assumes that, because some heavyweight feature
I don't need or want is there, some other system -- one I was relying
on in an environment where the heavy stuff isn't supported -- is no
longer needed. 

The most egregious example I've dealt with lately involves a nice
database product that created a new help and diagnostic system 
using X windows -- and quit shipping the only help system I can 
use on database server boxes (which are mostly headless boxes
accessed via SSH, and which for performance and security reasons 
do not get an X install).

I sent them a nice note explaining why we no longer need their 
product. 

In scheme systems it's more subtle.  Somebody builds a module 
system, gets all excited about it, and suddenly you can't use 
their scheme at all without putting nonportable module declarations
into your source files.  I don't need that.  I want to use the 
same source files without modification on a dozen different 
systems. Somebody else gets all excited about generic programming 
and dispatch, and then multiple function definitions and other 
errors you were relying on having the system tell you about 
slip past you and you get unexpected behavior and/or nonportable 
code. Yet a third implementor gets all excited about exceptions 
and raising and throwing and suddenly you can't write portable 
code there anymore without specializing code for the slightly-broken
versions of primitives that are now niftier because they "throw 
exceptions" instead of doing what R5RS says.  Somebody else gets 
all excited about OO, and then even the tiniest program that does 
any I/O suddenly has to drag the elephantine bulk of their object 
system along with it "because ports are now objects that inherit 
from ...."  

Sometimes I feel like a Luddite, but what I really want is a 
straightforward environment with no exotic bumps or surprises, 
which supports a thoroughly portable dialect, and simple 
facilities that avoid any unnecessary overhead or complications.  

For documentation, I prefer man pages.  Ideally, one man page
for every library function and system call, the same way C's 
man pages work.  I don't like info files because they don't 
have the standard format and organization of man pages and 
they don't respond properly to 'apropos.' That standard format 
helps me find things instantly in man pages, and apropos is 
exactly the thing that helps me figure out which page I need 
to look at.  I don't like hyperhelp because their indexes and 
search engines are never as good as apropos and their pages 
aren't in any standard format.  I especially don't like PDF 
or Postscript books -- you have to figure out the individual 
organization of a new book for every product just to know 
where to look in them.  

And even though I *usually* use Xwindows, the minute your 
system *forces* me to use Xwindows I'm done with it unless 
there's some critical reason why I can't use anything else for 
a particular job. 

I can ignore bells and whistles, but their presence usually 
signals the imminent or recent demise of some system I actually 
use.

			Bear
0
bear (1219)
10/31/2003 11:07:32 PM
Bradley J Lucier wrote:


>>(Like, you download it every week, and each time you find more stuff
>>broken?  Or you download it once every three years, try some toy
>>benchmark, and conclude that things have gotten worse?)
> 
> About once a year after I read another PLT love-fest thread
> in c.l.s.  And I find less stuff broken.  And that wasn't a "toy
> benchmark", it was some code I wrote in the last two weeks in 
> notes I am typing about using Scheme for mathematical computations for
> beginning math graduate students.  The emphasis was on knowing the intrinsic
> algorithmic costs of various operations.  I was trying to show the
> students the difference between things that are realistically possible and
> the things that might be impossible to compute in a reasonable time.
....
> On the fibonacci code mzscheme was
> 38,000 times slower than the beta of Gambit-C.  In my opinion
> that's a large enough quantitative difference to make a qualitative
> statement about the implementation. 
.... and later on ...
> OK, I'll try to crank down my rhetoric.  Here are some more examples from my
> notes.  Perhaps mzscheme could benefit from a later version of gmp.  Perhaps gmp
> isn't well enough integrated into mzscheme's bignum/rational arithmetic.
> Perhaps it shows nothing at all---after all, mzscheme does beat Gambit-C in some
> benchmarks (especially the latter ones that heavily involve the interpreter).

This "love-fest" part sounds a bit like you're jealous. The rest sounds like 
you're looking for a high-performance Scheme. You didn't support your claims 
about bugs with any examples, but I grant you that we have some. Please read the 
very last paragraph first.

Performance:
We have never claimed that we produce a high-performance compiler. We freely 
admit that we have focused our energy on other aspects of Scheme so that we 
could move Scheme forward on other fronts. I think we succeeded. I don't see 
anything wrong with it.

Competition with other Schemes:
Back in 96 (the Philly meeting), we offered to develop our world with enough 
hooks so that others could use what we deliver. Our hope was that others would 
produce a great compiler and we could produce the environment, the libraries, 
and the curriculum to go with it. Nobody took us up on that offer. We're still 
open to work with everyone and we try when we see an opportunity.

Bugs:
I will freely admit that we have errors in our code. We keep track of all of 
them in a database. Show me a Scheme implementation that doesn't, and I'll show you 
one that doesn't have much of an impact on the world. Overall, however, the 
errors hardly ever show up in pieces that should affect the kinds of programs in 
your message. If you think you find them, report them. You probably know 
perfectly well that the rest of the people around here wouldn't use PLT Scheme 
as much as they do, if our system were as bug-ridden as your logically 
unjustified claim about quantity and quality implies.

Thanks for toning down the rhetoric. It hurts the Scheme community tremendously 
that the major implementors don't work together more than they do. It would hurt 
even more if sophisticated users like you were to create schisms with rhetoric 
where there don't need to be any.

-- Matthias

0
11/1/2003 12:32:21 AM
Ray Dillinger <bear@sonic.net> wrote in message news:<3FA2EC26.2A669674@sonic.net>...
> Somebody builds a module 
> system, gets all excited about it, and suddenly you can't use 
> their scheme at all without putting nonportable module declarations
> into your source files.  I don't need that.  I want to use the 
> same source files without modification on a dozen different 
> systems.

This is why Scheme48's module system and SRFI 7 have FILES clauses
rather than forcing you to put everything in BEGIN/CODE clauses.  This
is also why INCLUDE macros exist.  Nothing stops you from writing the
source code files and module definition files (for any number of module
systems) separately.
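
[For instance, a small SRFI 7 program description -- hypothetical file names,
just a sketch -- keeps the module wiring apart from the portable source files:]

(program
  (requires srfi-1 srfi-14)
  (files "char-utils.scm"
         "main.scm"))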

> Somebody else gets all excited about generic programming 
> and dispatch, and then multiple function definitions and other 
> errors you were relying on having the system tell you about 
> slip past you and you get unexpected behavior and/or nonportable 
> code. Yet a third implementor gets all excited about exceptions 
> and raising and throwing and suddenly you can't write portable 
> code there anymore without specializing code for the slightly-broken
> versions of primitives that are now niftier because they "throw 
> exceptions" instead of doing what R5RS says.  Somebody else gets 
> all excited about OO, and then even the tiniest program that does 
> any I/O suddenly has to drag the elephantine bulk of their object 
> system along with it "because ports are now objects that inherit 
> from ...."  
> 
> Sometimes I feel like a Luddite, but what I really want is a 
> straightforward environment with no exotic bumps or surprises, 
> which supports a thoroughly portable dialect, and simple 
> facilities that avoid any unnecessary overhead or complications.  
> 
> [snip]
> 
> I can ignore bells and whistles, but their presence usually 
> signals the imminent or recent demise of some system I actually 
> use.

OK.  So stick to R5RS and SRFIs and completely ignore the bells and
whistles.  What stops you from doing that?  Lack of SRFIs?  Write some!

> 			Bear
0
campbell1 (74)
11/1/2003 2:18:40 AM
In article <bnuupo$3v7$1@camelot.ccs.neu.edu>,
Matthias Felleisen  <matthias@ccs.neu.edu> wrote:
>Bradley J Lucier wrote:
>> 
>> About once a year after I read another PLT love-fest thread
>> in c.l.s.
>
>This "love-fest" part sounds a bit like you're jealous.

I find myself irritated when people speculate about my motives.

> You probably know 
>perfectly well that the rest of the people around here wouldn't use PLT Scheme 
>as much as they do, if our system were as bug-ridden as your logically 
>unjustified claim about quantity and quality implies.

Are you saying that a large enough quantitative difference *can not* make a
qualitative difference? If so I question your design judgement.

I don't try to speculate why so many people use PLT Scheme.  What I'm looking
for is a solid runtime library, a good compiler, and few bugs.  I haven't found
these properties in the past in PLT Scheme.  If people want curriculum, GUIs,
scripts, etc., fine with me.  Those aren't the main properties I'm looking for
in a Scheme implementation, and perhaps the emphasis of the PLT Scheme team
on these aspects is the reason I have not found PLT Scheme suitable for my
purposes.

I wanted to try PLT Scheme again because several posters here commented that
"bignum arithmetic in mzscheme is fast because it includes gmp".  Nobody
from the PLT Scheme team wrote "Sorry guys, you got it wrong, we never claimed
PLT Scheme is a high-performance implementation, it includes gmp for <whatever
reason I can't imagine at the moment, ease of implementation?>," so I thought
I would try it for myself.  (BTW, high performance is relative---I run my
homework-on-the-web system interpreted for access to environment variables
in the exception handlers.)

>Thanks for toning down the rhetoric. It hurts the Scheme community tremendously 
>that the major implementors don't work together more than they do. It
>would hurt 
>even more if sophisticated users like you were to create schisms with rhetoric 
>were there don't need to be any.

Gee, I was called lazy two days ago in a meeting and now I don't play well
with others.

I don't think that anything I say in this forum could possibly hurt "more"
than what the Scheme implementors are doing.  And perhaps we wouldn't agree
on what should change to improve things.

Brad
0
bjl (76)
11/1/2003 5:05:40 AM
In article <3fa2eb3f$0$69944$edfadb0f@dread12.news.tele.dk>,
Jens Axel Søgaard  <usenet@jasoegaard.dk> wrote:
>Bradley J Lucier wrote:
>I haven't the patience to try the following for n=10^1000000, but here
>is one way to get it to run a little faster:
>
>pc125218: /tmp/ $ mzc --prim --unsafe-disable-interrupts 
>--unsafe-skip-tests --exe pi-brent-salamin pi-brent-salamin.scm
>MzScheme compiler (mzc) version 204.6, Copyright (c) 1996-2003 PLT
>  [output to "pi-brent-salamin"]
>pc125218: /tmp/ $ time ./pi-brent-salamin
<output omitted>
>
>real    0m36.565s
>user    0m19.610s
>sys     0m0.330s

>   (define beta^k (expt 10 1000))

I'm sorry, I see that you're compiling it rather than running it in the
interpreter, but as for faster...the Gambit code for this value of beta^k runs
in 120ms on my 500MHz PowerPC.  So this is not competitive (unless this is
on a *very* old machine).

As for your previous message, I didn't compile code either for Gambit
or for mzscheme because I wanted to test their libraries, without getting
into the fact that the Gambit compiler has a better reputation for speed and
the mzscheme interpreter is reputed to be faster than the Gambit interpreter.

Brad
0
bjl (76)
11/1/2003 5:45:20 AM
Bradley J Lucier wrote:
> Jens Axel Søgaard  <usenet@jasoegaard.dk> wrote:
> 
>>Bradley J Lucier wrote:
>>I haven't the patience to try the following for n=10^1000000, but here
>>is one way to get it to run a little faster:

> I'm sorry, I see that you're compiling it rather than running it in the
> interpreter, but as for faster...

I meant to say: faster than running it without a module.
When you enter your code at the top level some optimizations
are not possible, such as inlining top-level functions
(and primitive functions).

> the Gambit code for this value of beta^k runs
> in 120ms on my 500MHz PowerPC.  So this is not competitive (unless this is
> on a *very* old machine).

Have you tried the benchmark in Perl or Python?

-- 
Jens Axel Søgaard

0
usenet153 (246)
11/1/2003 9:35:44 AM
Bradley J Lucier wrote:

> I wanted to try PLT Scheme again because several posters here commented that
> "bignum arithmetic in mzscheme is fast because it includes gmp".  Nobody
> from the PLT Scheme team wrote "Sorry guys, you got it wrong, we never claimed
> PLT Scheme is a high-performance implementation, it includes gmp for <whatever
> reason I can't imagine at the moment, ease of implementation?>," so I thought
> I would try it for myself.  (BTW, high performance is relative---I run my
> homework-on-the-web system interpreted for access to environment variables
> in the exception handlers.)

As you say speed is relative.


Damien wrote:
 >>Jens Axel Søgaard wrote:

 >>> Speed?

 >>I wanted to hear what mzscheme misses compared to Perl/Python
 >>and as far as I can tell, mzscheme has no problem in the speed
 >>department.


 >IME, mzscheme does just fine against Perl/Python.  The problem is that 
 >Perl and Python are slow.  Sure, they tend to be fast enough for what
 > they're used for, but their performance limits are still much smaller
 > than those of a compiled language.

 >>[The existence of the *very* fast Scheme compilers does not imply
 >>that mzscheme is slow]


The interesting part [from my point of view] is to compare with
Python/Perl, which is the camp I think has the most potential
new users of Scheme. Thus I'd like to know the speed compared
to Perl and Python. In that context mzscheme isn't slow.

If you have special needs (e.g. you want to run large numerical
calculations) then you use one of the very fast compilers.

You lose flexibility but gain raw speed.


This reminds me of this insightful comment from Jaffer:

     Lack of experience using bignums apparently motivated a
     developer of   Guile, which is descended from SCM, to replace SCM's
     integrated arithmetics with an external package. Guile uses GNU MP:

     I deleted all code belonging to the old bignum implementation and
     replaced it with calls to some older GMP...

     On my PII 266 with 64MB it now computes (factorial 10000) in about
     two seconds, if no garbage collection occurs. If gc is activated
     however, guile performs the calculation within 200KB and calls gc
     several times. The whole thing takes about 25 seconds (!), but this
     is still twice as fast as the old implementation. (I do not
     guarantee these figures.)

     (factorial 10000) is an absurdly huge (118459-bit) number with a
     length over 35000 decimal digits! Numbers that large occur only in
     number theory. Currently, secure encryption keys are 1024 bits.
     Public keys 100 times wider would take 10000 times longer to
     compute; not practical for Internet shopping.

See the rest at

     <http://www.swiss.ai.mit.edu/~jaffer/CNS/DIMPA>

[But you have probably seen it before]

-- 
Jens Axel Søgaard

0
usenet153 (246)
11/1/2003 9:52:07 AM
On Sat, 01 Nov 2003 00:05:40 -0500, Bradley J Lucier wrote:
> 
> I find myself irritated when people speculate about my motives.
>
>
Now imagine others' irritation over wild and crazy speculations about what
they actually wrote.
 
 
> I wanted to try PLT Scheme again because several posters here commented
> that "bignum arithmetic in mzscheme is fast because it includes gmp".
> 
This is the second time I have seen what appears to be a fictitious
quote used as some kind of suitable justification for rather boorish
behavior.  Is this a direct quote being used without attribution?  Or a
quoted paraphrase of your speculation as to the real intent of what
someone wrote?

As someone (referring to myself) who did use the two words GMP and
PLT/MzScheme in an earlier post, where I was relating an "Anecdotal Story"
about MzScheme with regard to another version of Scheme and a simple
program, not claimed to be an arithmetic benchmark or offered as one --
well, I hope it was not my post which provoked your first post.
The apparent speculative leap from anecdotal to "quod erat demonstrandum"
is irritating, yes.  Especially when it is offered as the basis to sneak in
a couple of low blows.

> I don't think that anything I say in this forum could possibly hurt
> "more" than what the Scheme implementors are doing.

No, but a simple, "Hey guys, was a bit out of line there in that one post
for which I would like to apologize.", wouldn't hurt either.

> And perhaps we wouldn't agree
> on what should change to improve things.

As someone who is at the mercy of the upper echelon of the Scheme
community (those blessed with sufficient chops and talent) to improve
things, my response is "well, how would you know unless you tried?"

Is there really that much fracture that a priori it is  assumed that
agreement is doubtful and therefore the discussion is not worth the
attempt?

Ray
0
ray7279 (14)
11/1/2003 4:09:49 PM
In article <3fa3828d$0$69913$edfadb0f@dread12.news.tele.dk>,
Jens Axel Søgaard  <usenet@jasoegaard.dk> wrote:
>This reminds me of this insightful comment from Jaffer:

I would interpret what Jaffer's saying as "Bolting gmp onto the side of a
Scheme implementation can be a bad design".  I think this is what was done
with mzscheme, and I think it is a bad design.  Contrast that with OpenMCL,
which has well-integrated naive algorithms for bignum arithmetic.  Always
slow, but a good design.

And if value judgements are forever harried from this group, we're in real
trouble as a community.

Brad
0
bjl (76)
11/1/2003 5:55:08 PM
"Bradd W. Szonye" <bradd+news@szonye.com.invalid> writes:

> LtVanBuren? MrMcCoy? CptCragen?

Shouldn't that be ADAMcCoy?  Sounds like a good name for a rap group,
actually.

Shriram
0
sk1 (223)
11/1/2003 8:09:39 PM
Ray Dillinger <bear@sonic.net> writes:

> Eli Barzilay wrote:
> > 
> > The question is -- does it *bother* you to have stuff you don't
> > need?  And if it does, then what's wrong with downloading just
> > mzscheme?  (It is available as a separate package.)
> 
> What bothers me (I won't speak to particular scheme implementations 
> here)
> [...a lot of stuff that bothers you...]
> 
> I can ignore bells and whistles, but their presence usually signals
> the imminent or recent demise of some system I actually use.

As I said above, mzscheme is available as a standalone download.  No
X, no GUI.

-- 
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                  http://www.barzilay.org/                 Maze is Life!
0
eli666 (555)
11/1/2003 9:25:46 PM
joanestes2000@yahoo.com (Joan Estes) writes:

> Will it be free once it is done?  

Yes, of course.  As Matthias has already clarified, it'll be free just
like all other PLT software.  The big company "richer than God" is not
keeping any rights to the product.

Shriram
0
sk1 (223)
11/1/2003 10:39:05 PM
> "Bradd W. Szonye" <bradd+news@szonye.com.invalid> writes:
>> LtVanBuren? MrMcCoy? CptCragen?

Shriram Krishnamurthi <sk@cs.brown.edu> wrote:
> Shouldn't that be ADAMcCoy?  Sounds like a good name for a rap group,
> actually.

Actually EADA or XADA -- he's an executive assistant district attorney.
But the judges and defense attorneys all call him "Counselor" or "Mr
McCoy."
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
My Usenet e-mail address is temporarily disabled.
Please visit my website to obtain an alternate address.
0
news6468 (51)
11/2/2003 7:38:28 AM
"Bradd W. Szonye" <bradd+news@szonye.com.invalid> writes:

> MJ Ray <mjr@dsl.pipex.com> wrote:
> > Regardless, I still find it hard to believe that anyone has ever had a
> > real physical typecase containing all unicode symbols...
> 
> I haven't seen one with the whole set of codes, and you probably won't
> see one in the future, because (for example) the typographical
> conventions for Greek and Japanese characters are very different, so it
> doesn't make much sense to put them both in the same typeface.
> 
> However, some typefaces do contain a *lot* of symbols. For example,
> Lucida Console provides code points for just about all of the Western

MJ Ray is making a somewhat clever joke on the somewhat obscure fact
that the words 'uppercase' and 'lowercase' originally came from the
fact that early printers kept the big letters in one box  and the
little letters in another box, and the former was always positioned
above the latter, hence 'upper case'. 

A 'real physical typecase' containing all the unicode symbols would
have to be a very big box... 

-- 
--
|Andrew Tarr |  http://arc.stuff.gen.nz
|GPG Public Key:- http://arc.stuff.gen.nz/andrew.gpg
|_____
"There is no excellent beauty that hath not 
some strangeness in the proportions" 
--Francis Bacon
|~~~~~
|
0
arc9024 (8)
11/2/2003 1:06:12 PM
>> MJ Ray <mjr@dsl.pipex.com> wrote:
>>> Regardless, I still find it hard to believe that anyone has ever had a
>>> real physical typecase containing all unicode symbols...

> "Bradd W. Szonye" writes:
>> I haven't seen one with the whole set of codes .... However, some
>> typefaces do contain a *lot* of symbols. For example, Lucida Console
>> provides code points for just about all of the Western

Andrew Tarr <arc@stuff.gen.nz> wrote:
> MJ Ray is making a somewhat clever joke on the somewhat obscure fact
> that the words 'uppercase' and 'lowercase' originally came from the
> fact that early printers kept the big letters in one box  and the
> little letters in another box, and the former was always positioned
> above the latter, hence 'upper case'. 
> 
> A 'real physical typecase' containing all the unicode symbols would
> have to be a very big box... 

Oops, duh! I missed the "real, physical" part. I know all about the
origins of type "cases" -- I just didn't get the joke.
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
My Usenet e-mail address is temporarily disabled.
Please visit my website to obtain an alternate address.
0
news6468 (51)
11/2/2003 5:53:11 PM
Scott G. Miller <scgmille@freenetproject.org> wrote:
> If I'm using continuations heavily, I'm going to want to choose an
> implementation with that property.  If I'm not using them all, but I
> demand high performance otherwise, then I'm likely to make a
> completely different choice.

If I understand correctly, you're referring to the heap-based vs
stack-based strategies, yes? Explicit continuation calls are somewhat
expensive in both, because of dynamic context, but each method has
unique advantages and disadvantages beyond that.

Heap-based call frames allocate a *lot* of heap objects, so you need a
very, very good garbage collector. It only gets worse if you want true
concurrency (rather than simulated concurrency), since it's difficult to
create an efficient and correct concurrent garbage collector.

Stack-based call frames tend to be more efficient for most cases,
because allocating hardware stack frames is very cheap. However,
creating continuations is much more expensive, because you need to
transfer the stack to the heap -- effectively, you need to simulate a
heap-based implementation. It also complicates the garbage collector,
which must be able to scan the stack for roots (which is non-trivial for
some hardware).

It seems to me that the heap-based approach is superior for general
cases, but the stack-based approach is superior for specific
applications. You could handle this in a "pluggable" way as a later
article suggests, although it would require special support from the
garbage collector which might degrade performance in both cases.
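
[A toy illustration, not from the thread, of why the heap-based strategy
makes capture cheap: in continuation-passing style every frame is an
ordinary heap-allocated closure, so "the continuation" is always just a
value in hand, at the cost of one closure allocation per non-tail call.]

(define (fact-cps n k)
  (if (zero? n)
      (k 1)
      (fact-cps (- n 1)
                (lambda (r) (k (* n r))))))

;; (fact-cps 5 (lambda (x) x))  =>  120
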
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
0
news152 (508)
11/2/2003 8:25:17 PM
Regarding the joint Scheme implementation:

> "Bradd W. Szonye" writes:
>> Sounds interesting, although I'll be mighty bummed if Linux support is
>> late and MzScheme support suffers. 

Shriram Krishnamurthi <sk@cs.brown.edu> wrote:
> MzScheme support won't suffer at all.  Preserving the current
> cross-platform nature is a top priority.  All Matthias is saying that
> *new goodies* may get unveiled platform-at-a-time.

OK, that's good to hear!
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
0
news152 (508)
11/2/2003 8:26:34 PM
> Ray Dillinger <bear@sonic.net> writes:
>> I can ignore bells and whistles, but their presence usually signals
>> the imminent or recent demise of some system I actually use.

Eli Barzilay <eli@barzilay.org> wrote:
> As I said above, mzscheme is available as a standalone download.  No
> X, no GUI.

That doesn't really address Ray's concern, though. At the very least,
development of the bells & whistles implies lack of support for the
parts he actually uses, given the same resources. And often, it leads to
something much worse: the conclusion that the bells & whistles are
sufficient to *replace* the features that he actually wants, which
leaves him out in the cold.

This happens partly because with limited resources, you can't do
everything. But often it's also psychological: Implementors get
*excited* about the bells & whistles and start thinking of the basics as
old cruft that's a chore to maintain.
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
0
news152 (508)
11/2/2003 8:30:37 PM
"Bradd W. Szonye" <bradd+news@szonye.com> writes:

> [...]
> This happens partly because with limited resources, you can't do
> everything. But often it's also psychological: Implementors get
> *excited* about the bells & whistles and start thinking of the
> basics as old cruft that's a chore to maintain.

I can assure you that the huge effort on the bells does not make
anybody think that core bell-less functionality is any less
important.  But this is a good point to let this thread die.

-- 
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                  http://www.barzilay.org/                 Maze is Life!
0
eli666 (555)
11/2/2003 11:53:48 PM
"Bradd W. Szonye" <bradd+news@szonye.com> writes:

> This happens partly because with limited resources, you can't do
> everything. But often it's also psychological: Implementors get
> *excited* about the bells & whistles and start thinking of the basics as
> old cruft that's a chore to maintain.

Given that most of us use MzScheme for applications such as scripting,
this is unlikely to happen.  In addition, some key applications such
as the PLT Web server depend crucially on having a light, fast,
textual MzScheme.

Shriram
0
sk1 (223)
11/3/2003 1:39:57 AM
> "Bradd W. Szonye" <bradd+news@szonye.com> writes:
>> This happens partly because with limited resources, you can't do
>> everything. But often it's also psychological: Implementors get
>> *excited* about the bells & whistles and start thinking of the basics
>> as old cruft that's a chore to maintain.

Shriram Krishnamurthi <sk@cs.brown.edu> wrote:
> Given that most of us use MzScheme for applications such as scripting,
> this is unlikely to happen.  In addition, some key applications such
> as the PLT Web server depend crucially on having a light, fast,
> textual MzScheme.

Believe me, assurances like this are helpful. Ray has a good point: When
vendors get excited about some doodad, they tend to neglect other parts
of the system. So it's good to know that PLT considers the "console"
elements important. (That said, it's probably best to let the thread
die, as Eli suggested.)
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
0
news152 (508)
11/3/2003 6:25:35 AM
Bradd W. Szonye wrote:
> > "Bradd W. Szonye" <bradd+news@szonye.com> writes:
> >> This happens partly because with limited resources, you can't do
> >> everything. But often it's also psychological: Implementors get
> >> *excited* about the bells & whistles and start thinking of the basics
> >> as old cruft that's a chore to maintain.
>
> Shriram Krishnamurthi <sk@cs.brown.edu> wrote:
> > Given that most of us use MzScheme for applications such as scripting,
> > this is unlikely to happen.  In addition, some key applications such
> > as the PLT Web server depend crucially on having a light, fast,
> > textual MzScheme.
>
> Believe me, assurances like this are helpful. Ray has a good point: When
> vendors get excited about some doodad, they tend to neglect other parts
> of the system. So it's good to know that PLT considers the "console"
> elements important. (That said, it's probably best to let the thread
> die, as Eli suggested.)

<rant>
I'd love to know who these evil vendors are: they're apparently blatantly
ignoring R5RS-compatibility and creating mutant systems which only look
superficially like Scheme.  This thread should die because it's totally
pointless to talk about invented bogeymen, unless you're a four-year old.
If some Scheme implementor is doing something that doesn't follow the letter
or spirit of R5RS, let's talk about the specifics, otherwise let's talk
about something more interesting.
</rant>

OK, *now* the thread can die.  :)

Anton



0
anton58 (1240)
11/3/2003 6:43:39 AM
Ray Dillinger wrote:

> However, I've overcome several problems in strings.  My strings are
> represented as trees that have primitive-strings (limited to ~240
> bytes plus overhead) at the leafnodes.  The primitive-strings in
> turn can be any of several different formats for handling characters
> of different widths.  I was initially mapping everything into UTF-32
> codepoints but having arbitrary-width characters turns out to save
> space usually and it means I can solve other problems.  In particular,
> I can now handle accented and combined characters much better.  I'm
> using an open-ended character-set that allows any unicode base
> character plus any number of combining characters to map onto a
> unique integer.  The integer you get back from calling char->integer
> on it may be, um, kinda large, since it's essentially a base-(2^32)
> number as many "digits" long as the number of code points, but I
> had bignum libs lying around anyway and in practice such heroic
> characters are quite rare.
> 
> What this means is that my string-indexes actually count characters,
> (accents and all) not codepoints.  And this drastically simplifies
> the implementation and semantics of most string primitives and
> drastically reduces the likelihood of a string growing longer or
> shorter (in characters) as a result of being upcased or downcased.

It keeps winning.  The operation of reading a "character" from a 
port, since it keeps going as long as  it finds combiner codes, 
can never leave the port in a weird state with just part of a 
character read.  

But what this also means, I've just realized, is that I can't find 
any reasonable way to implement SRFI-14's character-set operations. 

It's clear that SRFI-14 did not anticipate that the number of characters
representable might be unbounded. I can create "cursors", which are 
essentially lists of UCS-4 codes with base codepoint as car and then 
the infinite variety of combining codes in some canonical ordering 
following it -- but all the possible combinations is an unbounded number 
of characters.  I can return a #,(NaN) or #,(Inf) I suppose, when someone 
asks for an integer that says how many characters are in the character 
set.  But extending the numerics won't rescue me from code that tries 
iterating or folding across the whole set of characters.  

Can I call it SRFI-14 compliant if it starts a nonterminating operation
or an operation that is guaranteed to run out of memory before it halts?  
That seems rude, even if the code is "correct".  

What if I just skip the operation and call 

(error: "can't finish enumerating this char-set before the sun 
explodes.") or (error: "all earthly computers combined don't have 
enough memory to hold the result of this operation.")

instead? 

Character-set algebra can work, mainly in terms of filters and closures, 
but the idea of doing a charset->list or charset->string really, really
won't.
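
[One way to picture that -- a sketch with made-up names, not Bear's code and
not the SRFI-14 reference implementation: represent a char-set purely as a
membership predicate, so the algebra is closure composition and nothing is
ever enumerated.]

(define (chars->char-set . chars)
  (lambda (c) (and (memv c chars) #t)))

(define (char-set-union a b)
  (lambda (c) (or (a c) (b c))))

(define (char-set-intersection a b)
  (lambda (c) (and (a c) (b c))))

(define (char-set-complement a)
  (lambda (c) (not (a c))))

(define (char-set-contains? cs c)
  (cs c))

;; (char-set-contains? (char-set-complement (chars->char-set #\a #\e)) #\z)
;;   =>  #t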

				Bear
0
bear (1219)
11/3/2003 9:11:25 AM
> Bradd W. Szonye wrote:
>> Believe me, assurances like this are helpful. Ray has a good point: When
>> vendors get excited about some doodad, they tend to neglect other parts
>> of the system. So it's good to know that PLT considers the "console"
>> elements important. (That said, it's probably best to let the thread
>> die, as Eli suggested.)

Anton van Straaten <anton@appsolutions.com> wrote:
> <rant>
> I'd love to know who these evil vendors are: they're apparently blatantly
> ignoring R5RS-compatibility and creating mutant systems which only look
> superficially like Scheme.

Who said that this was about R5RS compatibility? Heck, who even said
that it's solely about Scheme implementors?

> This thread should die because it's totally pointless to talk about
> invented bogeymen, unless you're a four-year old.

Note that Ray gave examples of vendors who *have* dropped useful
features in favor of some new doodad. It wasn't R5RS conformance being
dropped -- I don't think it was a Scheme vendor at all -- but this sort
of thing *does* happen in the software world.

> If some Scheme implementor is doing something that doesn't follow the
> letter or spirit of R5RS, let's talk about the specifics, otherwise
> let's talk about something more interesting.
> </rant>

There's more to it than just basic standards conformance. (Although some
vendors don't even bother with that -- how many Scheme implementations
lack define-syntax, null-is-true, and similar features? Schemers are
lucky; some other languages have it much worse.) This is more about
stuff like, "Crap! The vendor dropped support for the batch debugger
when they released the GUI debugger!"

Whenever a vendor starts crowing about some feature that you never use,
it's appropriate to worry, because often they *do* drop support for the
features you do use.

> OK, *now* the thread can die.  :)

Alas, you resurrected it with your off-the-mark rant.
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
0
news152 (508)
11/3/2003 9:15:40 AM
Bradd W. Szonye wrote:
> > Bradd W. Szonye wrote:
> >> Believe me, assurances like this are helpful. Ray has a good point: When
> >> vendors get excited about some doodad, they tend to neglect other parts
> >> of the system. So it's good to know that PLT considers the "console"
> >> elements important. (That said, it's probably best to let the thread
> >> die, as Eli suggested.)
>
> Anton van Straaten <anton@appsolutions.com> wrote:
> > <rant>
> > I'd love to know who these evil vendors are: they're apparently blatantly
> > ignoring R5RS-compatibility and creating mutant systems which only look
> > superficially like Scheme.
>
> Who said that this was about R5RS compatibility? Heck, who even said
> that it's solely about Scheme implementors?

Reread the subthread starting with Bear's post.  Afaict, it's all about
Scheme - the database vendor example was simply an external example used to
segue into the issues with Scheme.  And that's the point: no specific
examples have been raised relative to Scheme.  Bear talked about wanting a
"thoroughly portable dialect" and mentioned things like exception handling
that isn't "doing what R5RS says".  In Scheme, the only thoroughly portable
dialect is R5RS plus some subset of SRFIs (which I meant to include in my
statement).

If the concern is that this is not a sufficient standard for some purposes,
that's a real concern which everyone's aware of, but that's not what was
being discussed.  If the concern is that vendors are doing subtle things to
undermine this standard, then that should be discussed with at least *some*
specifics to ground the discussion.  If you're moaning about software in
general, fine, but don't confuse the issue by vaguely mapping concerns
from other types of software onto Scheme.

> > This thread should die because it's totally pointless to talk about
> > invented bogeymen, unless you're a four-year old.
>
> Note that Ray gave examples of vendors who *have* dropped useful
> features in favor of some new doodad. It wasn't R5RS conformance being
> dropped -- I don't think it was a Scheme vendor at all -- but this sort
> of thing *does* happen in the software world.

Yes, I know.  I've experienced it.  But the example Ray gave was not a
language vendor, and did not have any standards or SRFIs to follow in the
areas in question.  I haven't experienced this in Scheme.  I'm not saying
that means it doesn't exist, but this thread is half a dozen levels deep
since the initial essentially unspecified concerns were raised, and has
gotten even less specific.  I'm saying put up or shut up.

> > If some Scheme implementor is doing something that doesn't follow the
> > letter or spirit of R5RS, let's talk about the specifics, otherwise
> > let's talk about something more interesting.
> > </rant>
>
> There's more to it than just basic standards conformance. (Although some
> vendors don't even bother with that -- how many Scheme implementations
> lack define-syntax, null-is-true, and similar features?

Why don't you tell me.  My answer is I don't count those as real Scheme
implementations - they're either toys or obsolete (or at least outdated).
There are plenty which have substantial compliance.

> Schemers are lucky; some other languages have it much worse.)

Yes, but Scheme was being discussed.

> This is more about stuff like, "Crap! The vendor dropped support for
> the batch debugger when they released the GUI debugger!"

Specific examples in Scheme, please.

> Whenever a vendor starts crowing about some feature that you never use,
> it's appropriate to worry, because often they *do* drop support for the
> features you do use.
>
> > OK, *now* the thread can die.  :)
>
> Alas, you resurrected it with your off-the-mark rant.

The smiley embodies full awareness of my actions.  I needed to call BS.

Anton


0
anton58 (1240)
11/3/2003 3:09:20 PM
[Regarding a Unicode string system that represents combining characters
as a single character:]

Ray Dillinger <bear@sonic.net> wrote:
> ... I've just realized ... that I can't find any reasonable way to
> implement SRFI-14's character-set operations. 
> 
> It's clear that SRFI-14 did not anticipate that the number of
> characters representable might be unbounded. I can create "cursors",
> which are essentially lists of UCS-4 codes with base codepoint as car
> and then the infinite variety of combining codes in some canonical
> ordering following it -- but all the possible combinations is an
> unbounded number of characters.

Possible solutions:

1. Implement it so that the character set enumerators raise an
   exception, simulating the fact that they don't halt. Will probably
   break some systems.

2. Only enumerate the code points, not all possible combinations. Even
   this is likely to break some systems -- not because of your code, but
   because they aren't prepared to deal with thousands of characters.

3. Change the implementation so that there's a limit to the number of
   characters you can combine. (I thought that was true for Unicode
   anyway, but I haven't finished reading the standard yet.) Only
   enumerate the resulting "reasonable" set of characters. Still likely
   to break some systems, just from the large character set.

4. Don't implement all of SRFI-14.

5. Find a more sane way to deal with enumeration in Unicode, get some
   experience with it, and submit a new SRFI.
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
0
news152 (508)
11/3/2003 4:52:28 PM
"Bradd W. Szonye" wrote:
> 
> [Regarding a Unicode string system that represents combining characters
> as a single character:]
> 
> Ray Dillinger <bear@sonic.net> wrote:
> > ... I've just realized ... that I can't find any reasonable way to
> > implement SRFI-14's character-set operations.
> >
> > It's clear that SRFI-14 did not anticipate that the number of
> > characters representable might be unbounded. I can create "cursors",
> > which are essentially lists of UCS-4 codes with base codepoint as car
> > and then the infinite variety of combining codes in some canonical
> > ordering following it -- but all the possible combinations is an
> > unbounded number of characters.
> 
> Possible solutions:
> 
> 1. Implement it so that the character set enumerators raise an
>    exception, simulating the fact that they don't halt. Will probably
>    break some systems.

> 2. Only enumerate the code points, not all possible combinations. Even
>    this is likely to break some systems -- not because of your code, but
>    because they aren't prepared to deal with thousands of characters.

Riffing on the second idea, I think I have a plan.  I can define the 
character repertoire as being comprised of an (infinite?) number 
of char-sets that are each of limited size.  If the infinite character
repertoire is not itself a "char-set" as such, the worst form of this 
doesn't happen.  So...  I could have charset:1code for all characters 
representable in a single unicode codepoint, charset:2code for all 
characters representable in 1 or 2 unicode codepoints, and so on.  
Presumably nobody would be calling out those names without realizing 
how large the set can be.  Each charset in this scheme is finite, 
although some may be ridiculously large.  

And of course, when someone calls for charset:64code, what happens is 
his own damn fault. In practice I don't think we're likely to see 
any legitimate applications for characters that aren't members of 
charset:24code.
 
> 3. Change the implementation so that there's a limit to the number of
>    characters you can combine. (I thought that was true for Unicode
>    anyway, but I haven't finished reading the standard yet.) Only
>    enumerate the resulting "reasonable" set of characters. Still likely
>    to break some systems, just from the large character set.

It's not.  Unicode defines codepoints to be of different "combining 
classes" and then has a canonical ordering for the combining classes, 
but you can have an unbounded number of combining codepoints within 
each combining class.  There's no canonicalization of the ordering 
within a combining class; within each combining class, the order 
is read as providing information about which marks go closest to the 
base character.  


> 4. Don't implement all of SRFI-14.

It would be ....  _offensive_   .... to me to build a system from 
the ground up for unicode support, and then not be able to implement
the "standard" for unicode support.  
 
> 5. Find a more sane way to deal with enumeration in Unicode, get some
>    experience with it, and submit a new SRFI.

Seems likely. I'm treating "characters" as gestalt, monolithic 
entities, considered together with all possible modifiers, etc, 
as a single unit.  And generally I'm finding that is a win, 
because it allows me to say exactly what I mean in manipulating 
strings, ports, etc, one "character" at a time.  Char->integer
and Integer->char are still defined, but rather inappropriate
for some kinds of usage; it seems like Char->Unicodevector and 
Unicodevector->char would be more appropriate accessors.  
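
[A sketch of what such accessors might look like -- hypothetical names, with
a "unicodevector" shown as a plain list of codepoints, base character first,
using the base-2^32 packing described earlier in the thread.]

(define ucs-radix 4294967296)   ; 2^32

(define (codepoints->integer cps)
  ;; Assumes a nonzero base codepoint at the head of the list.
  (let loop ((cps cps) (n 0))
    (if (null? cps)
        n
        (loop (cdr cps) (+ (* n ucs-radix) (car cps))))))

(define (integer->codepoints n)
  (let loop ((n n) (acc '()))
    (if (zero? n)
        acc
        (loop (quotient n ucs-radix)
              (cons (remainder n ucs-radix) acc)))))

;; (integer->codepoints (codepoints->integer '(101 769)))  =>  (101 769)
;; 101 is #\e, 769 is U+0301 COMBINING ACUTE ACCENT.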

			Bear
0
bear (1219)
11/3/2003 5:34:58 PM
David Rush wrote:
> 
> On Tue, 28 Oct 2003 06:49:52 GMT, Ray Dillinger <bear@sonic.net> wrote:
> > "Bradd W. Szonye" wrote:
> > I've been thinking about writing a portable "module mangler."
> 
> Ray - I've been working on this for years. That's what S2 is all
> about. It does work, but I just don't have the time to keep the
> docs (and libs) up to date.
> 
> > It would read from disk a bunch of scheme files with some kind
> > of standard module syntax, and output a single honkin-large
> > scheme file (maybe in a temporary directory) that puts them
> > all together with separate namespaces kept separate, and
> > strictly-controlled scope for macros, and so on.
> 
> That's exactly what I do. I've got the hooks in for alpha-renaming
> top-level symbols, but I've never had the need to fully productize
> the code. You want to help? I'll happily help you get your first
> builds going (bootstrapping the animal is a bit tricky).

I admire the heck out of your project, but I think you're 
aiming a lot higher than I am.  The module mangler would 
define a module language -- one, single, module language
and no others.  As I understand S2, it tries to read and 
cross-compile the module languages of a dozen different 
implementations.  Is that correct?

			Bear
0
bear (1219)
11/3/2003 5:45:41 PM
Bradd writes: 


> 3. Change the implementation so that there's a limit to the number
>    of characters you can combine. (I thought that was true for
>    Unicode anyway, but I haven't finished reading the standard yet.)
>    Only enumerate the resulting "reasonable" set of
>    characters. Still likely to break some systems, just from the
>    large character set.


The unicode standard is explicit that it does not limit the number of
combining characters that might be added to a base character to form
what the user might reasonably think of as a "grapheme".

In other words, if you define Scheme's CHAR? type as an isolated base
char or as a base-char-plus-combining-chars (officially an "isolated
base character or non-defective combining character sequence") then,
indeed, there is an infinite set of non-EQV? CHAR? values.

Ray's idea of defining CHAR? that way is really damn interesting.  For
example: I like the idea of making a globally useful extensible text
editor with an easy-to-learn extension language, the extension
language being applicable to at least an interesting subset of
possible extensions.  Part of what I consider to be the meaning of
"easy-to-learn" is that the conceptual model used by extension writers
and the conceptual model used by users of the text editor should be
highly congruent.  Unicode offers a definition of "grapheme" as "what
the user thinks of as a character".  In Unicode, a grapheme is
_almost_ always consistent with Ray's proposed definition of CHAR?.
In other words, an interactive user and extension writer would both
agree -- a "character"/CHAR? is a grapheme, graphemes being very well
approximated by Ray's definition.

In most natural languages, indeed, the set of actual graphemes is
probably finite.  But what about math?  How many "'" ("prime")
qualifications do you want to limit a variable to, for example?

Part of the "problem", re-srfi-14, I think is this:

Scheme is deliberately elegant by omission and ambiguity.   You can
certainly have a perfectly conformant, quite useful Scheme in which
there is an infinite supply of non-EQV? CHAR? values.

14 is therefore not usefully portable across all schemes.  Period.
It addresses a very large subset -- those with a finite CHAR? type --
but not all possible Schemes.

I think that such problems are inevitable.  It's _good_ to have SRFIs
that address sane subsets of the space of possible Scheme
implementations.  It's also good to be able to float a standard Scheme
in many situations where the scheme you've surfaced couldn't possibly
be a "general purpose programming language" or even "supports all
srfis".

I'd suggest to Ray that 14-support is a so-so issue.  It seems pretty
hard to me (you'd really want a theorem prover for maximal usefulness
and it will never quite evade the termination issue).  If he _really_
wants a character set standard that applies to his dialect, perhaps
the thing to do is to make a srfi that does nothing more than identify
a subset of 14 that is unproblematic for his dialect.


-t
0
lord1 (42)
11/3/2003 10:10:31 PM
On Mon, 03 Nov 2003 17:45:41 GMT, Ray Dillinger <bear@sonic.net> wrote:
> David Rush wrote:
>>
>> On Tue, 28 Oct 2003 06:49:52 GMT, Ray Dillinger <bear@sonic.net> wrote:
>> > "Bradd W. Szonye" wrote:
>> > I've been thinking about writing a portable "module mangler."
>>
>> Ray - I've been working on this for years. That's what S2 is all
>> about. It does work, but I just don't have the time to keep the
>> docs (and libs) up to date.
>>
>> > It would read from disk a bunch of scheme files with some kind
>> > of standard module syntax, and output a single honkin-large
>> > scheme file (maybe in a temporary directory) that puts them
>> > all together with separate namespaces kept separate, and
>> > strictly-controlled scope for macros, and so on.
>>
>> That's exactly what I do. I've got the hooks in for alpha-renaming
>> top-level symbols, but I've never had the need to fully productize
>> the code. You want to help? I'll happily help you get your first
>> builds going (bootstrapping the animal is a bit tricky).
>
> I admire the heck out of your project, but I think you're
> aiming a lot higher than I am.  The module mangler would
> define a module language -- one, single, module language
> and no others.  As I understand S2, it tries to read and
> cross-compile the module languages of a dozen different
> implementations.  Is that correct?

Well, yeah. But practically my user base is such that there are two
module languages in common use: One which is a lightly modified Bigloo
clone (for historical reasons primarily), and one which is used for
prepackaged, implementation-specific, code within the library. FWIW,
the modifications to the Bigloo module system were inspired by Scheme48.
I used to support Guile/SLIB, but I've had absolutely no need to
continue to use that support because of five main (mostly social) phenomena:

       1) S2 got its own library system so it could support the
          'portable' SRFIs
       2) Al* wrote an eval-free syntax-rules expander
       3) The system acquired enough mass that I never actually feel
          the need to use any of the native module systems - it's
          generally easier to subsume target-specific changes into
          the library and then my code works everywhere (well, the
          mostly-compiled impls, Bigloo, Gambit, Larceny, Chicken)
       4) most of the best libraries for what I do are written very
          portably (All praise to *L*, the truest enemies of
          Boskonian programming)
       5) The best implementations for what I do are very compatible
          in terms of their library extensions - some of this is due to
          the spread of SRFI support, some is natural...

So, somewhat to my chagrin, while the architecture (and a fair amount
of the code-base) does support my higher goals, the system *in use*
primarily implements what you're describing. In fact, I've started
working on a new version which dumps the multi-module system support
(which is ... inefficient to say the least) and concentrates on doing
whole-program mangling much more quickly (and pirates your Ocelot
lexer BTW).

Work on the new version proceeds very slowly though. I'm too busy using
the old one for text-routing and data-mining applications...and this is
all on top of my day job.

david rush
-- 
Nobody expects the coders with opinions! Our three main design
principles are...
0
kumo7543 (108)
11/6/2003 11:11:13 AM
David Rush wrote:
> 
> So, somewhat to my chagrin, while the architecture (and a fair amount
> of the code-base) does support my higher goals, the system *in use*
> primarily implements what you're describing. In fact, I've started
> working on a new version which dumps the multi-module system support
> (which is ... inefficient to say the least) and concentrates on doing
> whole-program mangling much more quickly (and pirates your Ocelot
> lexer BTW).

Excellent!  I'm pleased that you're getting some use out of that 
code.  It was interesting and satisfying to write because the 
design goal was complete separation from the hosting 
implementation - ie, taking complete control at the syntax level 
and not relying on anybody else's idea of what numbers and symbols 
can and can't look like or what int<->char mapping the host 
implementation uses for its characters.  I wanted to do unicode 
support and that meant starting with 100% coverage at the syntax 
level.  Which I did and I'm pleased with it.  It does assume 
large integers are available, so it can't be hosted on just anything.
But it can be hosted on almost any real implementation of scheme, 
and quite a few "fake" ones -- it doesn't even use call/cc. 

But adding to the satisfaction was that I also got to drill 
straight to the core of one of my pet peeves.  I've seen so many 
schemes barf on simple input like #i237/520 or #e22.5 or #e22@1.7 
that it was just -- intensely satisfying -- to write a lexer that 
got _every_damn_bit_ of the numeric syntax the scheme standard 
specifies  -- and the common lisp extension for user-specified 
radixes too -- *RIGHT*.

I've even considered writing a SRFI that standardizes a format 
for unicode in scheme ascii source and the ability to read *ALL* 
kinds of scheme numbers, and providing that lexer as a reference 
implementation.  And I may do it yet. 

But, having done that, I'm not going to be using that particular
lexer implementation much longer. I'll still use the syntax it 
defined, and the escape sequences for putting unicode into 
characters, symbols and strings will stay the same.  I will 
be very happy if it spreads and becomes a "normal" way of doing 
things and I can quit worrying about crap like schemes barfing 
on legal number syntax.

But when I was implementing the compiler it was necessary to 
create a Chomsky-style grammar-driven parser.  I made it very 
general (able to handle type 1 and some type 0 languages) because 
I wanted to use it in a natural-language project too.  So it takes 
an arbitrary list of scheme data and a grammar, and produces as 
output another arbitrary list of scheme data.  And since I now 
have one of those, replacing the original lexer is now just a 
matter of writing another grammar.  (admittedly, a hairy and 
complicated one, but just another grammar). 

Right now, I'm using the ocelot lexer to lex the input.  But 
the job it's doing can (and soon will) be done by the same code 
as the parser, just with a character->token grammar to convert 
a stream of input characters into a stream of structured lexemes.  
Anyway, there's a lexeme->codeword grammar that runs on the 
lexeme stream and produces a stream of codewords that are 
basically semantic notation about what the code is supposed to 
do, where scopes begin and end, etc.  Then I run the parser 
with a different grammar on the stream of codewords to produce 
my compile-time symbol table.  Then I run the parser with a 
third grammar on the stream of codewords to produce "instructions" 
-- and then process the instructions again with a fourth grammar 
to do some kinds of optimization.  And finally run the same 
parser with a fifth grammar that eats the "instructions" and 
produces code (at this moment, C code - I haven't a machine-code 
back-end working yet, and C source is way portable so it'll do 
for now).  

MitScheme and Gambit are completely happy with it.  Guile dies 
because it uses too much memory.  Bigloo, Rscheme, and Chicken 
die (on some inputs) because they don't have big integers.  
But I think that's not my problem.

As compilers go it's way slow.  A big project (on the order of 5K 
lines of code) can take 20-30 minutes to recompile, and that doesn't
count the time the makefile spends in gcc!  But I can make the 
parser work faster yet without breaking the way it works, and 
it's way elegant.  Basically there's the parser which takes an 
input stream (list of arbitrary data) and produces an output 
stream (list of arbitrary data), and everything else is grammars 
for it. 

				Bear
0
bear (1219)
11/6/2003 11:10:59 PM
I'll be more explicit here about what these computations tell me about
mzscheme's implementation of bignums and rationals and mzscheme's
integration of gmp into its runtime library.  I haven't looked much at
mzscheme's code; perhaps my comments will help mzscheme and other
implementors with their runtime implementations.

bjl@cs.purdue.edu (Bradley J Lucier) wrote in message news:<bnuhtk$9er@arthur.cs.purdue.edu>...
> Like I said, this is code from some notes to tell math students how much
> different operations cost (among other things).  Notes are at the bottom.
> 
> Formula                                               CPU times (ms)
>                                             beta Gambit-C      mzscheme 205
> 
> (expt 3 1000000)  ; a                             930              4700
> (expt 3 1000001)  ; b                            1610              4970

So far, so good.  Gambit is faster, but not by much.

> (* a a)           ; c                            1170              3150

Here things start to look strange.  mzscheme's multiplication is
faster than its exponential, but in Gambit-C it's slower.  We'll get
back to this later.

> (* a b)           ;                              1640              4660

Similar effect, but smaller.

> (quotient c a)    ;                              6960             10780
> (sqrt c)          ;                             14080             57240

Nothing too strange here, but there is a factor of < 3 in Gambit and >
5 in mzscheme comparing (sqrt c) to (quotient c a), so it seems that
mzscheme's sqrt implementation could be improved.

> (fib 100000)      ; a, note 1                    4280              7560
> (fib 100001)      ; b                            4450              7790

No surprises.

> (gcd a b)         ;                             27580             88780
> (gcd a b)         ; a=3^100000, b=2^100000      23380             56230

gcd is based on a lot of quotients, but here the gcd times for
Gambit-C are < 4 times (quotient c a), and in mzscheme they can be > 8
times (quotient c a), so it seems again that mzscheme's gcd code is
not tremendously good.

> (expt1 3 1000000) ; note 2                        920              1770
> (expt2 3 1000000) ; note 3                       6510              6290

These two are interesting.  Here we have an exponential running in the
interpreter (expt1) that's almost three times as fast as mzscheme's
compiled runtime version of expt.  This shouldn't happen; mzscheme's
integer exponential should at the least be replaced by expt1.
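
A minimal sketch of what such an expt1 could look like -- plain
repeated squaring, O(log n) bignum multiplications.  This is only an
illustration of the idea, not the expt1 from the notes and not any
implementation's actual runtime code:

    (define (expt-by-squaring b n)        ; n a nonnegative exact integer
      (let loop ((b b) (n n) (acc 1))
        (cond ((zero? n) acc)
              ((even? n) (loop (* b b) (quotient n 2) acc))
              (else      (loop (* b b) (quotient n 2) (* acc b))))))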

> (* a a)           ; a=3^1000000                  1180              2950
> (expt 10 10000000); a                           42770            486200

A factor of 10 between a Scheme bignum implementation (Gambit) and a
compiled C bignum implementation (mzscheme) should not be acceptable. 
I can't tell why this arises, except perhaps that mzscheme's basic
expt routine is not based on expt1.

> (expt 2 10000000) ; b                             990             70300

Here we have a factor of 70 difference in speed.  This tells me that
mzscheme does not special case either multiplication by a number with
many low-order zero bits or exponential of such a number.  This is not
really acceptable in a runtime that goes to great lengths to use fast
algorithms for other operations.

I was surprised by this a few months ago with the Gambit-C bignum
implementation and talked in this post about how a naive
multiplication algorithm can compute this in O(N) time while a "fast"
FFT-based algorithm takes O(N log N) time for the same thing:

http://groups.google.com/groups?q=expt+group:comp.lang.scheme+author:bjl%40cs.purdue.edu&hl=en&lr=&ie=UTF-8&scoring=d&selm=b40hah%24a48%40arthur.cs.purdue.edu&rnum=6&filter=0

I took the opportunity at that time to add this to Gambit.  This
message ended the thread; perhaps nobody read it, but the mzscheme
implementors certainly didn't follow up on it.

> (quotient a b)    ;                               140           3169230

Here there is a time difference of over 2000; dividing by a number
with many low-order bits zero should also be fast using a grade school
algorithm.  And this is useful for fixed-point arithmetic computations
to high precision.
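
To make the special case concrete: factor the divisor as b = c * 2^k
with c odd; the 2^k part of the division is then just a shift, and
only the odd part needs a real bignum division.  A minimal sketch,
assuming an arithmetic-shift procedure in the SRFI-33/Gambit style
and positive exact integer arguments:

    (define (trailing-zero-bits n)            ; n a positive exact integer
      (let loop ((n n) (k 0))
        (if (odd? n) k (loop (arithmetic-shift n -1) (+ k 1)))))

    (define (fast-quotient a b)               ; a, b positive exact integers
      (let* ((k (trailing-zero-bits b))
             (c (arithmetic-shift b (- k))))  ; the odd part of b
        (quotient (arithmetic-shift a (- k)) c)))

Multiplication by such a b is the same trick the other way round:
multiply by the odd part and shift the product left by k bits.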

> (expt 2/3 10000)  ; a                               0               130
> (expt 3/5 10000)  ; b                              10               370

These two, where mzscheme is about 40 times slower than Gambit, tell
me that the mzscheme runtime probably doesn't use the fact that

(expt p/q n) = (make-rational (expt p n) (expt q n))

where you don't have to do a gcd to get things in lowest form.  My
notes talk about computing with power series, where this is important.

> (* a b)           ;                               280               390

Doesn't tell you much, but coupled with the previous two means that
mzscheme's runtime probably does not use the formula

(* p/q r/s) => (make-rational (* (quotient p (gcd p s))
                                 (quotient r (gcd r q)))
                              (* (quotient q (gcd r q))
                                 (quotient s (gcd p s))))

and instead uses the naive

(* p/q r/s) => (make-rational (quotient (* p r) (gcd (* p r) (* q s)))
                              (quotient (* q s) (gcd (* p r) (* q s))))
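
A runnable sketch of the gcd-splitting formula above, returning the
numerator and denominator as a pair instead of going through a
make-rational constructor (which is hypothetical here).  It assumes
p/q and r/s are already in lowest terms; the point is that the two
gcds are taken on the small inputs rather than on the big products:

    (define (rational-multiply p q r s)       ; (p/q) * (r/s) => (num . den)
      (let ((g1 (gcd p s))
            (g2 (gcd r q)))
        (cons (* (quotient p g1) (quotient r g2))
              (* (quotient q g2) (quotient s g1)))))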

> (fib 1000)        ; note 4                         10              2790

This time difference of about 300 tells me that mzscheme does not use
that

(+ n p/q) = (make-rational (+ p (* n q))
                           q)

and no gcd is necessary.  Again, this is very useful in computing
power series using binary splitting.

> (factorial 10000) ; note 5                       1630              1210
> (partial-factorial 0 10000) ; note 6              170               160
> (binary-splitting-compute-e 1000) ; note 7        980               400
> (naive-compute-e 1000) ; note 8                142970             51520
> (binary-splitting-compute-pi 1000) ; note 9      2070              1130

I'm not tremendously happy with Gambit's performance here; before
looking at the runtime, however, I'd like to compile both to see if
the interpreter speed is masking the time of the runtime.  But I don't
really want to bother figuring out how to compile code in mzscheme.

> (pi-brent-salamin) ; n. 10, beta^k=10^1000000 1345230
> (pi-brent-salamin) ; beta^k=2^33219            902040

Here's another small point related to this computation---I looked at
the mzscheme runtime to see if exact integer square root was available
as a separate routine.  It is, as a C routine, but there doesn't seem
to be an associated scheme routine.  This is an important operation,
is invoked in (sqrt n) if n is a perfect square, and if you're going
to have a fast implementation of it, which mzscheme has, it should be
available separately.  (Perhaps it is, nobody answered when I asked
how to do exact integer square root in mzscheme.)


So I would characterize mzscheme's bignum and ratnum implementation as
one where they go to great lengths to do some things quickly, probably
within a constant of best possible, by using gmp, and at the same time
do not implement simple algorithms that in certain special cases that
come up quite often in practice can speed performance by a factor of
100s to 1000s.  If I found one such problem it would be one thing,
something that was overlooked; instead, with so many such problems, it
seems that little was done beyond "bolting gmp onto this side of"
mzscheme.

This, to me, indicates a design problem that cannot be fixed just by
going back and adding the few hundred lines of code to implement the
changes I suggested above.  Design is not about fixing bugs and
inefficiencies; it's about organizing and thinking about the code in a
way that lessens the likelihood of such things coming up in the first
place.

Brad
0
lucier (68)
11/10/2003 6:33:12 PM
lucier@math.purdue.edu (Brad Lucier) writes:
>
>> (gcd a b)         ;                             27580             88780
>> (gcd a b)         ; a=3^100000, b=2^100000      23380             56230
>
> gcd is based on a lot of quotients,

That's not usually the best approach.

> but here the gcd times for
> Gambit-C are < 4 times (quotient c a), and in mzscheme they can be > 8
> times (quotient c a), so it seems again that mzscheme's gcd code is
> not tremendously good.

If that's 88 seconds for a gcd of fib(100000) then the gmp gcd code
can probably help quite a bit.  And if the second one is with b equal
to a power of 2 then clearly it can run much faster.
0
user42 (6)
11/11/2003 1:03:00 AM
lucier@math.purdue.edu (Brad Lucier) writes:

> This, to me, indicates a design problem that cannot be fixed just by
> going back and adding the few hundred lines of code to implement the
> changes I suggested above.  

Thanks for your detailed comments.  I am surprised by your conclusion,
though.  Assuming your analysis is fairly thorough, why wouldn't
"going back and adding a few hundred lines of code" do the trick?
Or do you want something other than numeric performance, that you
haven't mentioned?

Shriram
0
sk1 (223)
11/11/2003 2:25:42 AM
               Towards Standard Scheme Unicode Support


* The Problems

  There are two major obstacles to providing nice,
  non-culturally-biased Unicode support in standard Scheme.  First,
  the required standard character and string procedures are
  fundamentally inconsistent with the structure of unicode.  Second,
  attempts to ignore that fact and "force fit" unicode into them
  anyway inevitably result in a set of text-manipulation primitives
  that are too low level -- that require even very simple text
  manipulation programs to be far more "aware" of the details of
  unicode encodings and structure than they ought to be.

** CHAR? Makes No Sense In Unicode

  Consider the unicode character U+00DF "LATIN SMALL LETTER SHARP S"
  (aka Eszett).

  Clearly it should behave this way:

	(char-alphabetic? eszett) => #t
	(char-lower-case? eszett) => #t

  and it is required that:

	(char-ci=? eszett (char-upcase eszett)) => #t
	(char-upper-case? (char-upcase eszett)) => #t

  but now what exactly does:

	(char-upcase eszett)

  return?  The upper case mapping of eszett is a two character
  sequence, "SS".  It's not even a Unicode base character plus
  combining characters -- it's two base characters, a string.

  Eszett is not an isolated anomaly (though, admittedly, is not the
  common case).  Here is a pointer to the data file of similarly
  problematic case mappings:

	http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt

  So, something has to give, somewhere :-)  

  [Case mappings are a particularly clear example but I suspect
  that there are other "character manipulation" operators that
  make sense in Unicode but, similarly, don't map onto a 
  standard CHAR? type.]



** Other Approaches are Too Low Level

  Consider the example of attempting to write a procedure,
  in portable scheme, which performs "studly capitalization".
  It should accept a string like:

	a studly capitalizer

  and return a string like:
  
	a StUDly CaPItalIZer

  In the simple world of the scheme CHAR and STRING types, such a
  procedure is quite simple to write _and_get_completely_correct_.
  It would make a good exercise for a new programming student.

  Let's assume that the student solves the problem in a reasonable
  way:  by iterating over the string and, at random positions, 
  replacing a character with its upper case equivalent.  Simple
  enough.

  Unfortunately, there does not (can not) exist a mapping of
  Unicode onto the standard character and string types that would not
  break our student's program.  His program can still _often_ give
  a correct result, but to produce a completely correct program, 
  he will have to take a far different and, as things stand, more
  complicated approach.
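
  For concreteness, here is roughly what the student's program looks
  like in plain R5RS, with the "random positions" choice abstracted
  into a predicate so the sketch needs nothing outside the standard:

	(define (studly-capitalize! s upcase-here?)
	  (do ((i 0 (+ i 1)))
	      ((= i (string-length s)) s)
	    (if (upcase-here? i)
	        (string-set! s i (char-upcase (string-ref s i))))))

	;; e.g. (studly-capitalize! (string-copy "a studly capitalizer") even?)

  It is exactly this kind of index-and-set! loop that no mapping of
  Unicode onto CHAR? and STRING? can keep correct.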


** One Approach Comes Close

  Ray Dillenger has recently proposed on comp.lang.scheme a 
  treatment of Unicode in which a CHAR? value may be:

	~ a unicode base character
        ~ a unicode base character plus a sequence of 1 or 
          more unicode combining characters

  That goes a very long way towards solving the problem.  For example, 
  if I had asked our student to write an anagram generator instead of
  a studly capitalizer, Ray's solution would preserve the correctness
  of the student's program.

  Unfortunately, Ray's approach still has problems.  It can not handle
  case mappings correctly, as noted above.  In Ray's system, there are
  an infinite number of non-EQV? CHAR? values and therefore
  CHAR->INTEGER may return a bignum (in Indic, Tibetan, and the Hangul
  Jamo alphabets, it would apparently return a bignum frequently).
  With an infinite set of characters, libraries (such as SRFI-14
  "Character-sets"), which are designed with a finite character set in
  mind, can not be ported.  The issue of multi-character case mappings
  aside, it is difficult to see how to preserve the required ordering
  isomorphism between characters and their integer representations.

  Nevertheless, Ray's idea that a "conceptual character" is part of 
  an infinite set of values and a "conceptual string" a sequence of
  those is the basis of this proposal.


* The Proposal

  The proposal has two parts.   Part 1 introduces a new type, TEXT?, 
  which is a string-like type that is compatible with Unicode, and
  a subtype of TEXT?, GRAPHEME?, to represent "conceptual
  characters". 

  Part 2 discusses what can become of the STRING? and CHAR? types in
  this context.


** The TEXT? and GRAPHEME? Types

  [This is a sketch of a specification -- not yet even a first
   draft of a specification.]

    ~ (text? obj) => <boolean>

	True if OBJ is a text object, false otherwise.

	A text object represents a string of printed graphemes.

    ~ (utf8->text string) => <text>
    ~ (utf16->text string) => <text>
    ~ (utf16be->text string) => <text>
    ~ (utf16le->text string) => <text>
    [...]
    ~ (text->utf8 text) => <string> 
    [...]
	The usual conversions from strings (presumed to be
        sequences of octets) to text.

  A subset of text objects are distinguished as graphemes:

    ~ (grapheme? obj) => <boolean>

      True if OBJ is a text object which is a grapheme,
      false otherwise.

      The set of graphemes is defined to be isomorphic to the set of
      all unicode base characters and well formed unicode combining
      character sequences (and is thus an infinite set).

    ~ (grapheme=? g1 g2 [locale]) => <boolean>
    ~ (grapheme<? g1 g2 [locale])
    ~ (grapheme>? g1 g2 [locale])
    [...]
    ~ (grapheme-ci=? g1 g2 [locale])
    ~ (grapheme-ci<? g1 g2 [locale])
    ~ (grapheme-ci>? g1 g2 [locale])

      The usual orderings.

      Here and elsewhere I've left the optional parameter LOCALE there
      as a kind of place-holder.  There are many possible collation
      orders for text and programs need a way to distinguish which
      they mean (as well as have a reasonable default).


  It is important to note that, in general, EQV? and EQUAL?  do _not_
  test for grapheme equality.  GRAPHEME=? must be used instead.
             
  Also note that this proposal does not include GRAPHEME->INTEGER or
  INTEGER->GRAPHEME.   I have not included, but probably should
  include, a hash value procedure which hashes GRAPHEME=? values 
  equally.

    ~ (grapheme-upcase g) => <text>
    ~ (grapheme-downcase g) => <text>
    ~ (grapheme-titlecase g) => <text>

       Note that these return texts, not necessarily graphemes.
       For example, GRAPHEME-UPCASE of eszett would return a 
       text representation of "SS".

  All texts, including graphemes, behave like (conceptual) strings:

    ~ (text-length text) => <integer>

      Return the number of graphemes in TEXT.

    ~ (subtext text start end) => <text>

      Return a subtext of TEXT containing the graphemes beginning at
      index START (inclusive) and ending at END (exclusive).

    ~ (text=? t1 t2 [locale]) => <boolean>
    ~ (text<? t1 t2 [locale]) => <boolean>
    [...]
        The usual ordering predicates.

    ~ (text-append text ...) => <text>
    ~ (list->text list-of-graphemes) => <text>
      
         Various constructors for text ....


    However, instead of TEXT-SET!, we have:

  
    ~ (text-replace! text start end replacement)

      Replace the graphemes at [START, END) in TEXT with 
      the graphemes in text object REPLACEMENT.  Passing
      #t for END is equivalent to passing an index 1
      position beyond START.

      TEXT must be a mutable text object (see below).


  Implementations are permitted to make _some_ graphemes immutable.
  In particular:

    ~ (text-ref text index) => <grapheme>

      Return  the grapheme at position INDEX in TEXT.
      The grapheme returned may be immutable.


    ~ (text->list text) => <list of graphemes>

      The graphemes returned may be immutable.

    ~ (char->grapheme char) => <grapheme>
    ~ (utf8->grapheme string) => <grapheme>
    [....]

       Conversions to possibly immutable graphemes.


  And some simple I/O extensions:

    ~ (read-grapheme [port]) => <grapheme>
    ~ (peek-grapheme [port]) => <grapheme>
    [etc.]


  There is still an awkwardness, however.  Consider writing the "StUDly
  CaPItalIZer" procedure.  It's tempting to write it as a loop that
  uses an integer grapheme index to iterate over the text, randomly
  picking graphemes to change the case of.  That wouldn't work though:
  changing the case of one character can change the length of text,
  right at the point being indexed, and invalidate the indexes.  So,
  texts really need markers that work like those in Emacs:

    ~ (make-text-marker text index) => <marker>
    ~ (text-marker? obj) => <boolean>
    ~ (marker-text marker) => <text>
    ~ (marker-index marker) => <index>
    ~ (set-marker-index! marker index)
    ~ (set-marker! marker text index)
    etc.

	Changes (by TEXT-REPLACE!) to the region of a text object to
        the left of a marker leave the marker in the same position
        relative to the right end of the text, and vice versa.

        Changes to a region which _includes_ a marker leave the
        marker at last grapheme index of the replacement
        text that was inserted, or, if the replacement was empty, 
        at its old index position minus the number of graphemes
        deleted to the marker's left.

        The procedures SUBTEXT, TEXT-REPLACE!, and TEXT-REF 
        and others that expect indexes can accept markers as those
        indexes.
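
  As an illustration of why markers are enough, here is the StUDly
  CaPItalIZer again on top of the sketched interface.  Everything used
  (MAKE-TEXT-MARKER, MARKER-INDEX, SET-MARKER-INDEX!, TEXT-REF,
  TEXT-REPLACE!, TEXT-LENGTH, GRAPHEME-UPCASE) is the hypothetical API
  from this proposal, not anything that exists today:

	(define (studly-capitalize-text! txt upcase-here?)
	  (let ((m (make-text-marker txt 0)))
	    (let loop ((n 0))                   ; n counts graphemes visited
	      (cond ((>= (marker-index m) (text-length txt)) txt)
	            (else
	             (if (upcase-here? n)
	                 (text-replace! txt m #t
	                                (grapheme-upcase (text-ref txt m))))
	             ;; per the marker rules above, a marker inside the
	             ;; replaced region ends on the last grapheme of the
	             ;; insertion, so stepping by one skips the replacement
	             (set-marker-index! m (+ (marker-index m) 1))
	             (loop (+ n 1)))))))

  Even when GRAPHEME-UPCASE of eszett expands to the two-grapheme text
  "SS", the marker keeps the iteration honest.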

  Unlike markers, text properties and overlays aren't strictly needed to
  make TEXT? useful -- but they would make a good addition.   The issue
  is that mutating procedures (like TEXT-REPLACE!) should be aware of
  properties in order to update them properly.    If properties and
  overlays are left out, and people have to implement them in a higher
  layer, then their "attributed text" data type can't be passed to a
  procedure that just expects a text object.



* Optional Changes to CHAR? and STRING?

  The above specification of the TEXT? and GRAPHEME? is useful on its
  own, but it might be considerably more convenient in implementations
  which also adopt the following ideas:

    ~ CHAR? is an octet, STRING? a sequence of octets

    ~ STRING? values are resizable

    ~ STRING? values contain an "encoding" attribute which may be
      any of 
		utf8
                utf16be
                utf16le
                utf32

      or an implementation-defined value.   Note however that
      procedures such as STRING-REF ignore this attribute and 
      view strings as sequences of octets.

      STRING-APPEND implicitly converts its second and subsequent
      arguments to the same encoding as its first.


    ~ (text? "a string") => #t

    ~ (grapheme? #\a) => #t

  In other words, all character values are graphemes, and all strings
  are text values.

  These ideas _could_ be taken even a step further with the addition
  of:

    ~ TEXT? values contain an "encoding" attribute, just as strings
      do (utf-8, etc.)

    ~ (string? a-text-value) => #t

    ~ (char? a-grapheme) => <boolean>

  All text values can be strings;  some graphemes can be characters.


* Summary

  The new TEXT? and GRAPHEME? types present a simple and traditional
  interface to "conceptual strings" and "conceptual characters".  
  They make it easy to express simple algorithms simply and without
  reference to the internal structure of Unicode.

  Reflecting the realities of global text processing, there is
  no bias in the interfaces suggesting that the set of graphemes
  is finite.

  Also reflecting the realities of global text processing: the length
  of a text object may change over time; a sequence replacement
  operator is supplied instead of an element replacement operator; 
  and markers (similar to those in text editors) are provided for 
  iteration and other examples of keeping track of "a position within
  a text value".

  There is no essential difference between a grapheme and a text
  object of length 1, and thus the proposal makes GRAPHEME? a 
  subtype of TYPE.

  If STRING? is suitably extended, then it may be equal to or a subset
  of TEXT?.  Conversely, if TEXT? is suitably extended, it may be
  equal to or a subset of STRING?.  It may be sensible to unify the
  two types (although even analogous string procedures and text
  procedures will still behave differently from one another).

  CHAR? may be safely viewed as a subtype of GRAPHEME?, but the 
  converse is not, and can not, be true.




0
lord1 (42)
11/11/2003 6:53:14 PM
Some good stuff in here -- not bad for a "0th draft." However, it
glosses over a few things that may make it difficult to use in practice.

My thoughts about the concepts that a good i18n system needs:

Encoding: defines operations for manipulating text-as-data
Language: defines operations for manipulating text-as-text

D-Char: an encoding unit
G-Char: a graphical/typesetting unit
L-Char: a language unit

Encoding and language are largely orthogonal. In general, you can write
a language in more than one encoding (for example, en_US.ISO8859-1 and
en_US.UTF-8), and you can write several languages with a single encoding
(for example, en_US.UTF-8 and fr_CA.UTF-8). The encoding defines how you
store the text in memory, how you find individual characters, etc. The
language determines what the text means, how you collate strings, how
you upcase, etc.

The different kinds of characters express the different ways of viewing
a string: as data, as language, and as graphics. For example, consider
the "ffl" ligature in the UTF-8 encoding. From a human reader's point of
view, it's three letters: eff eff ell. To a typesetter, it's a single
graphic: the ffl ligature. To the encoder, it's a multibyte UTF-8 code.

All three ways of viewing the data are important: D-Chars are what you
write to files and store in memory. G-Chars are what the renderer uses
to display the text. L-Chars are what humans see (and generally what
collation algorithms work on).

I'm not sure how to organize these so that they're intuitive and easy to
use, but I'm pretty sure that they're all important elements of a
text-processing system.

Tom Lord <lord@emf.emf.net> wrote:
>   [This is a sketch of a specification -- not yet even a first
>    draft of a specification.]
> 
>     ~ (text? obj) => <boolean>
> 
> 	True if OBJ is a text object, false otherwise.
> 
> 	A text object represents a string of printed graphemes.
> 
>     ~ (utf8->text string) => <text>
>     ~ (utf16->text string) => <text>
>     ~ (utf16be->text string) => <text>
>     ~ (utf16le->text string) => <text>

This suggests that the TEXT type hides the details of encoding from the
user -- maybe it uses UCS-4 internally, maybe it uses an adaptive
algorithm to determine the most efficient type. Users must convert all
external encodings to the internal one.

Another possibility: TEXT is a parameterized type that carries the
encoding information along with each text. When working with multiple
texts in different encodings, the system implicitly converts them to a
mutually-compatible type.

The main difference between the two is that the second approach has no
"priveleged" text encoding. Users must still specify the encoding for
external input streams (so that the system knows how to interpret them),
but the system doesn't need to convert them to anything else unless the
user explicitly requests it or uses multiple encodings. For example, an
all-ASCII system need not convert anything at all, and it can store data
internally in the terse single-byte encoding.

Recommendation: Don't use explicit encoding->encoding procedures.
Instead, create some kind of "encoding" or "locale" object and use it as
a parameter for TEXT types and conversion functions. For example:

    (define hello-utf8 (text utf-8-encoding "Hello, world"))
    (define hello-ucs4 (text ucs-4-encoding hello-utf8))

The first DEFINE creates a TEXT object in the UTF-8 encoding from a
literal string. The second one creates a TEXT object in the UCS-4
encoding by converting the first string. (This is similar to the way
that C++ locale support works, by the way.)

>   A subset of text objects are distinguished as graphemes:
> 
>     ~ (grapheme? obj) => <boolean>
> 
>       True if OBJ is a text object which is a grapheme,
>       false otherwise.
> 
>       The set of graphemes is defined to be isomorphic to the set of
>       all unicode base characters and well formed unicode combinding
>       character sequences (and is thus an infinite set).

It's obvious how this works for combining characters, but how does it
work for ligatures? For example, how does it represent the "ffl"
typesetter's ligature? There's a non-trivial relationship between
dchars, gchars, and lchars.

In other words, I'm not sure whether graphemes are supposed to be gchars
or lchars here. (Note that they might be both: gchars might just be a
way of representing a text in "typesetter's language," and lchars might
just be a way of representing them in "human language." Although that
breaks encoding/language orthogonality a bit, because you'll generally
want to encode those two languages differently.)

> * Summary
> 
>   The new TEXT? and GRAPHEME? types present a simple and traditional
>   interface to "conceptual strings" and "conceptual characters".  
>   They make it easy to express simple algorithms simply and without
>   reference to the internal structure of Unicode.

This is good, but beware of the subtle differences between "conceptual"
from the human's point of view, the typesetter's point of view, and the
disk/memory point of view. It's tempting to gloss over the typesetting
elements, until you realize that that's how you actually display text.

>   Reflecting the realities of global text processing, there is no bias
>   in the interfaces suggesting that the set of graphemes is finite.

Note that the (unspecified) internal encoding does introduce some bias,
which is why I recommend the parameterized approach rather than the
"convert all text to internal representation" approach.

>   Also reflecting the realities of global text processing: the length
>   of a text object may change over time; a sequence replacement
>   operator is supplied instead of an element replacement operator; and
>   markers (similar to those in text editors) are provided for
>   iteration and other examples of keeping track of "a position within
>   a text value".

Good idea, although I don't grok the marker concept. I feel that there
must be a better way to do that -- perhaps rely on FOLD rather than
cursors?

>   There is no essential difference between a grapheme and a text
>   object of length 1, and thus the proposal makes GRAPHEME? a subtype
>   of TYPE.

Good idea, so long as you can deal with the data/graphical/conceptual
issues.

>   If STRING? is suitably extended, then it may be equal to or a subset
>   of TEXT?.  Conversely, if TEXT? is suitably extended, it may be
>   equal to or a subset of STRING?.  It may be sensible to unify the
>   two types (although even analogous string procedures and text
>   procedures will still behave differently from one another).
> 
>   CHAR? may be safely viewed as a subtype of GRAPHEME?, but the 
>   converse is not, and can not, be true.

This is a bit fuzzy, probably because you haven't defined "subtype."
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
0
news152 (508)
11/11/2003 8:29:58 PM
Bradd W. Szonye wrote:


> 
> The different kinds of characters express the different ways of viewing
> a string: as data, as language, and as graphics. For example, consider
> the "ffl" ligature in the UTF-8 encoding. From a human reader's point of
> view, it's three letters: eff eff ell. To a typesetter, it's a single
> graphic: the ffl ligature. To the encoder, it's a multibyte UTF-8 code.

It would probably be a good idea not to clutter character representation with
ligatures, which are basically a glyph-level idea. Although Unicode does
encode some ligatures for compatibility reasons, according to
<http://www.unicode.org/faq/ligature_digraph.html#6>:

---
A: The existing ligatures exist basically for compatibility and
round-tripping with non-Unicode character sets. Their use is discouraged.
No more will be encoded in any circumstances.

Ligaturing is a behavior encoded in fonts: if a modern font is asked to
display "h" followed by "r", and the font has an "hr" ligature in it, it
can display the ligature. Some fonts have no ligatures, some (especially
for non-Latin scripts) have hundreds. It does not make sense to assign
Unicode code points to all these font-specific possibilities.
---

As I see it, typesetting support shouldn't really be dealt with at the same
level as collation and case mappings. It is a matter of presentation rather
than semantics.

-- 
Grzegorz
http://pithekos.net
0
grzegorz1 (80)
11/11/2003 10:51:31 PM
lucier@math.purdue.edu (Brad Lucier) writes:
>
>> (quotient a b)    ;                               140           3169230
>
> Here there is a time difference of over 2000; dividing by a number
> with many low-order bits zero should also be fast using a grade school
> algorithm.  And this is useful for fixed-point arithmetic computations
> to high precision.

Incidentally, if this goes straight to the gmp division then it's
worth noting that gmp deliberately doesn't look for low trailing
zeros.

The theory is that there will only rarely be many zeros, so it's not
worth the time or code to look.  It's left up to callers to check and
make their arrangements if they might be doing something special.

Of course a language interpreter is free to pass that policy up to
application code, or not.  If bit shift functions are available then
it probably makes sense to expect people to be using them, not calling
quotient with a power-of-2 divisor.
0
user42 (6)
11/11/2003 11:11:44 PM
Tom Lord wrote:
 
>   There are two major obstacles to providing nice,
>   non-culturally-biased Unicode support in standard Scheme.  First,
>   the required standard character and string procedures are
>   fundamentally inconsistent with the structure of unicode.  Second,
>   attempts to ignore that fact and "force fit" unicode into them
>   anyway inevitably result in a set of text-manipulation primitives
>   that are too low level -- that require even very simple text
>   manipulation programs to be far more "aware" of the details of
>   unicode encodings and structure than they ought to be.

Agreed.  And this is exactly why I defined "characters" as I 
did.  The programs ought to be manipulating characters (the 
entities that people are aware of), not bothering people with
details of encoding units unless they go down to that level. 

 
>   and it is required that:
> 
>         (char-ci=? eszett (char-upcase eszett)) => #t
>         (char-upper-case? (char-upcase eszett)) => #t

It is required by the R5RS standard, and my system happens to 
achieve it - but R5RS is, in some cases, just plain wrong.  In 
particular there are many lowercase unicode characters which 
are not the preferred lowercasing of their own preferred 
uppercase forms, and vice versa.  Thus, even when the length in 
unicode codepoints is constant, often required behavior such as 

(char-ci=? foo (char-downcase (char-upcase foo))) => #t

cannot be preserved.


>   but now what exactly does:
> 
>         (char-upcase eszett)
>
>   return?

I don't know, really.  I got rid of most of the things 
that change length when capitalizing/lowercasing by 
adopting the idea of the transfinite character set which
folds combining characters and accents into the character 
itself. That handles things whose lengths change because 
precomposed characters of opposite case don't exist. 

But there are fourteen annoying exceptions, of which 
eszett, as you correctly point out, is one.  I handle 
these in a "hacky" way; I have allocated forty-eight 
codes in the private-use area to hold the (conceptual)
other two cases of these fourteen characters, and 
where necessary, other-case mappings of those quasi-
characters.  

Thus, as I handle it, 

    (char-upcase eszett)

returns a quasicharacter 'SS'  - which lowercases into
another quasicharacter 'ss' and titlecases into yet a 
third, 'Ss'. 

But these quasicharacters are non-canonical representations 
of characters, and when you canonicalize a string containing
one of them, it goes away, like any other non-preferred 
encoding for a character in unicode, and decomposes into
two characters. 

This seems to work.  It's admittedly hacky, because 
it can introduce mismatches between the number of characters
seen by eye and the number of 'locations' in the string.
But it gives our student what he needs to write his 
StUdlYcaPPeR naively and have it work correctly, and 
it gives (char-upcase eszett) something besides a string
to return, and provides a semantics for the R5RS cases 
you mentioned above (but alas, not the one I brought up) 
to be true and meaningful. 

>   [Case mappings are a particularly clear example but I suspect
>   that there are other "character manipulation" operators that
>   make sense in Unicode but, similarly, don't map onto a
>   standard CHAR? type.]

Canonicalization and composition create other, different 
problems. 
 
> * The Proposal
> 
>   The proposal has two parts.   Part 1 introduces a new type, TEXT?,
>   which is a string-like type that is compatible with Unicode, and
>   a subtype of TEXT?, GRAPHEME?, to represent "conceptual
>   characters".

Interesting.... I had not thought of making these separate from 
characters and strings.  When you are manipulating unicode 
codepoints individually, that seems to me below the level of 
characters and strings entirely; you should be working with 
vectors of fixnums at that point, not thinking of them as 
"characters" in any meaningful sense.  So I'd been folding 
all the character->fixnum and fixnum->character machinery into 
the input and output ports, and providing string->UTF*vector 
and UTF*vector->string operations to allow getting at the raw 
encodings where necessary from inside programs. 

>       Here and elsewhere I've left the optional parameter LOCALE there
>       as a kind of place-holder.  There are many possible collation
>       orders for text and programs need a way to distinguish which
>       they mean (as well as have a reasonable default).

Right. I see that that will be necessary.  

>   It is important to note that, in general, EQV? and EQUAL?  do _not_
>   test for grapheme equality.  GRAPHEME=? must be used instead.

Um.... why not?  If your graphemes are conceptual characters, 
you don't have to worry about them being represented in different
ways; you should have some kind of internal canonical 
ordering of the combiners, and eqv? and equal? should find the 
same grapheme represented the same way, everywhere. The only way
to screw this up is to let people get inside the representation 
of characters/graphemes directly with byte operations - and at 
that point you're not treating them as characters at all, you're
treating them as bytes.

>        Note that these return texts, not necessarilly graphemes.
>        For example, GRAPHEME-UPCASE of eszett would return a
>        text representation of "SS".

Yes.  That is another good approach, and doesn't introduce 
the hackiness of my special-use quasicharacters.  But does 
it provide what the student needs for his naive implementation 
of the stUdLycaPpeR to be correct?
 
>     ~ (text=? t1 t2 [locale]) => <boolean>
>     ~ (text<? t1 t2 [locale]) => <boolean>
>     [...]
>         The usual ordering predicates.

Ordering predicates are semantically interesting if you're not 
providing a grapheme->integer and integer->grapheme mapping.  
Is the semantic requirement to be simply that the ordering 
represented by these primitives is a total ordering which 
does not change during the program run?  Would it be "correct" 
for a system to simply assign each text a sequential number 
in the order in which it sees them, and then have the total 
ordering (lasting for the run of the program) correspond 
with ordering of the sequence numbers?
 

>     However, instead of TEXT-SET!, we have:
> 
>     ~ (text-replace! text start end replacement)
> 
>       Replace the graphemes at [START, END) in TEXT with
>       the graphemes in text object REPLACEMENT.  Passing
>       #t for END is equivalent to passing an index 1
>       position beyond START.
> 
>       TEXT must be a mutable text object (see below).

Okay... I can see that this is a bit more flexible than 
string-set! because it doesn't require the replacement to 
be the same length as what it replaces.  But I think the 
quasicharacters + text-set! combination is still easier 
to use, because people attempting to use text-replace! 
will still be trying to keep track of indices into the 
string and they're going to have to write code that keeps
track of those indices. 
 

>   That wouldn't work though:
>   changing the case of one character can change the length of text,
>   right at the point being indexed, and invalidate the indexes.  So,
>   texts really need markers that work like those in Emacs:
> 
>     ~ (make-text-marker text index) => <marker>
>     ~ (text-marker? obj) => <boolean>
>     ~ (marker-text marker) => <text>
>     ~ (marker-index marker) => <index>
>     ~ (set-marker-index! marker index)
>     ~ (set-marker! marker text index)
>     etc.

Ah.  And here are the facilitators for writing code that 
keeps track of indexes.  I don't think these are really 
an easier way of dealing with the casing->size-changing 
warts of unicode than introducing forty-eight non-canonical 
decomposing characters.  The crucial difference, to me, is
that these are things you'll have to use and be aware of 
but the decomposing characters will be invisible most of 
the time beyond simply knowing that when you canonicalize
a string (err, text) its length may change. 

>   Unlike markers, text properties and overlays aren't strictly needed to
>   make TEXT? useful -- but they would make a good addition.   The issue
>   is that mutating procedures (like TEXT-REPLACE!) should be aware of
>   properties in order to update them properly.    If properties and
>   overlays are left out, and people have to implement them in a higher
>   layer, then their "attributed text" data type can't be passed to a
>   procedure that just expects a text object.

I recall that commonlisp strings used to have similar properties 
and overlays.  They got rid of them in CLTL2.  There are probably 
interesting and useful debates about their merits archived somewhere. 

> * Optional Changes to CHAR? and STRING?
> 
>   The above specification of the TEXT? and GRAPHEME? is useful on its
>   own, but it might be considerably more convenient in implementations
>   which also adopt the following ideas:
> 
>     ~ CHAR? is an octet, STRING? a sequence of octets

Way too low-level, IMO.  For operations on characters, you need 
characters.  If you have to specify numbers of bits, you're talking
about numbers instead.

>     ~ STRING? values contain an "encoding" attribute which may be
>       any of
>                 utf8
>                 utf16be
>                 utf16le
>                 utf32
> 
>       or an impelementation defined value.   Note however that
>       procedures such as STRING-REF ignore this attribute and
>       view strings as sequences of octets.

So you want to use string-ref to get (say) the third byte of 
the fourth UCS32 value in a string?  Is that any kind of meaningful 
character?  These are operations on bytevectors, not operations on 
strings. 

>   These ideas _could_ be taken even a step further with the addition
>   of:
> 
>     ~ TEXT? values contain an "encoding" attribute, just as strings
>       do (utf-8, etc.)

No...  I like the idea of the string (or text) simply as a sequence 
of characters/graphemes.  There are no circumstances, at all, under 
which you ought to be able to tell, or should need to care, about 
the encoding of a text when you're considering it as a sequence of 
_characters_.  It is read from a particular encoding when it is read 
from something that has a particular encoding, at an input port or 
by conversion from a bytevector.  It gets rendered in a particular 
encoding when it is output to something whose encoding is relevant, 
like a bytevector or a port.  Those formattings matter and the 
programmer needs to be able to specify those conversions; but we 
need be no more fixated on the internal representation of our strings/
texts than we are on the internal representation of our bignums.  That's 
an implementation detail that only the system author should care about. 
 

> * Summary
> 
>   The new TEXT? and GRAPHEME? types present a simple and traditional
>   interface to "conceptual strings" and "conceptual characters".
>   They make it easy to express simple algorithms simply and without
>   reference to the internal structure of Unicode.
> 
>   Reflecting the realities of global text processing, there is
>   no bias in the interfaces suggesting that the set of graphemes
>   is finite.
> 
>   Also reflecting the realities of global text processing: the length
>   of a text object may change over time; a sequence replacement
>   operator is supplied instead of an element replacement operator;
>   and markers (similar to those in text editors) are provided for
>   iteration and other examples of keeping track of "a position within
>   a text value".

I will have to think more about the markers.  While I think that
introducing a very limited set of non-canonical private use characters
is easier conceptually and from the implementor's and programmer's 
viewpoint, Markers may also be useful for some tasks -- such as 
inserting into a string (or text) at a known location before 
canonicalizing that string (or text) so you can then find the 
mark again even though the length has changed. 
 
>   There is no essential difference between a grapheme and a text
>   object of length 1, and thus the proposal makes GRAPHEME? a
>   subtype of TYPE.

You mean, a subtype of TEXT, in the sense that all grapheme values 
are also text values.  I thought some similar thoughts about characters
as a subtype of strings, but eventually decided against it. 

>   If STRING? is suitably extended, then it may be equal to or a subset
>   of TEXT?.  Conversely, if TEXT? is suitably extended, it may be
>   equal to or a subset of STRING?.  It may be sensible to unify the
>   two types (although even analogous string procedures and text
>   procedures will still behave differently from one another).

Well... you know my opinions on this matter. 

				Bear
0
bear (1219)
11/11/2003 11:25:34 PM
> Bradd W. Szonye wrote:
>> The different kinds of characters express the different ways of
>> viewing a string: as data, as language, and as graphics. For example,
>> consider the "ffl" ligature in the UTF-8 encoding. From a human
>> reader's point of view, it's three letters: eff eff ell. To a
>> typesetter, it's a single graphic: the ffl ligature. To the encoder,
>> it's a multibyte UTF-8 code.

Grzegorz Chrupała <grzegorz@pithekos.net> wrote:
> It would probably a good idea not to clutter character representation
> with ligatures, which are basically a glyph-level idea. Although
> Unicode does encode some ligatures for compatibility reasons ...
> [their] use is discouraged.

OK, that makes sense. Given the "graphemes" (i.e., what a human thinks
of as a "character"), a typesetter can generally construct any necessary
ligatures or other graphics-oriented characters.

It's still something that a text system needs to deal with, but it can
pretend that ligatures are just a detail of the encoding. Thanks for the
clarification; I've done a lot of work in this area, but it's a *very*
big area, so I'm still a novice in some ways.
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
0
news152 (508)
11/11/2003 11:44:48 PM
>>>> Bradd == Bradd W. Szonye
>>>> Tom == Tom Lord

You raised two primary issues:

 1) The D-char, G-char, L-char distinction
 2) A purported "encoding bias" in the proposal

I'll reply to those and to your reactions to the summary.

(Thank you for your uptake on this issue, by the way.)


* D-char/G-char/L-char

  Bradd> D-Char: an encoding unit
  Bradd> G-Char: a graphical/typesetting unit
  Bradd> L-Char: a language unit
  [....]

  D-char: The (full) proposal gives you CHAR? as a direct
  representation of octets, and a reasonable basis on which to build
  libraries for differently sized encoding units.  In other words, in
  this proposal, CHAR? and STRING? give you everything you need for
  D-Char.  Note that the full proposal includes encoding tags for both
  STRING? and TEXT? values which tell you how they are seen when
  viewed as sequences of octets.

  G-Char: The intention of the proposal is to give you GRAPHEME? for
  G-Char, very directly.  Thus, a TEXT? value representing:

        U+FB04  (the "ffl" ligature)

  would contain a single grapheme.

  L-Char: L-char is context/text-process specific.  There isn't a
  unilateral definition of L-char -- there's a parameterized
  definition.  I believe that any L-char interface you might want can
  be specified on top of the GRAPHEME? and TEXT? types (admittedly it
  will have to introduce additional structure to the GRAPHEME? type).

  The G-char/L-char distinction does point out some subtleties that
  I didn't explicitly call out in the proposal.  For example:

        (text=? some-locale t1 t2)

  does not imply:

        (= (text-length t1) (text-length t2))

  and: In general, TEXT<? is not a lexical ordering based on
  GRAPHEME<?.  

  The proposal could/should say that TEXT<? for the default locale is
  in fact a lexical ordering based on GRAPHEME<? for the default
  locale.



* Encoding Bias

  Tom>     ~ (utf8->text string) => <text>
  Tom>     ~ (utf16->text string) => <text>
  Tom>     ~ (utf16be->text string) => <text>
  Tom>     ~ (utf16le->text string) => <text>

  Bradd> This suggests that the TEXT type hides the details of encoding
  Bradd> from the user -- maybe it uses UCS-4 internally, maybe it uses
  Bradd> an adaptive algorithm to determine the most efficient
  Bradd> type. Users must convert all external encodings to the internal
  Bradd> one.


  The internal representation is not really a portable application's
  business.  Particular implementations might provide extensions that
  allow them to give "hints" about internal representations.

  The proposal _does_ suggest, if TEXT? and STRING? are to be made 
  identical, that text/string objects have an encoding label.  That
  label determines how the octet-oriented string procedures (e.g.,
  STRING-REF) view a text.

  There is no encoding bias in the proposal: there is simply no
  requirement that implementations internally provide multiple
  encodings other than in the conversion procedures between texts and
  strings.

  The presumed encoding of a port is an interesting question and the
  proposal doesn't, I admit, speak to that.  It does imply some need
  to be able to label a port with an encoding in order to support
  procedures like READ-GRAPHEME.  An alternative would be to declare
  that ports are always streams of octets and add additional
  parameters to procedures like READ-GRAPHEME.  I could go either way,
  at this stage, with a slight personal preference for port labels.

  Bradd> Recommendation: Don't use explicit encoding->encoding
  Bradd> procedures.  Instead, create some kind of "encoding" or
  Bradd> "locale" object and use it as a parameter for TEXT types and
  Bradd> conversion functions. For example:

  Bradd> (define hello-utf8 (text utf-8-encoding "Hello, world"))
  Bradd> (define hello-ucs4 (text ucs-4-encoding hello-utf8))

  The full proposal suggests exactly that: that text objects include a 
  tag that determines their encoding when viewed as strings.

  The proposal does _not_ force an implementation to represent, for
  example, a UTF-8 TEXT? value internally as UTF-8 -- it only says that
  if you view a UTF-8 TEXT? value as a string (of octets) that the 
  sequence of octets should be a UTF-8 encoding of the text.

  Tom> A subset of text objects are distinguished as graphemes:

  Tom>     ~ (grapheme? obj) => <boolean>

  Tom>       True if OBJ is a text object which is a grapheme,
  Tom>       false otherwise.

  Tom>       The set of graphemes is defined to be isomorphic to the set of
  Tom>       all unicode base characters and well formed unicode combining
  Tom>       character sequences (and is thus an infinite set).

  Bradd> It's obvious how this works for combining characters, but how
  Bradd> does it work for ligatures? For example, how does it
  Bradd> represent the "ffl" typesetter's ligature? There's a
  Bradd> non-trivial relationship between dchars, gchars, and lchars.

  U+FB04 ("ffl" ligature) is a grapheme.

  The sequence U+0066 U+0066 U+006C ("f" "f" "l") is a text consisting
  of three graphemes.

  It would be a fine thing to add additional procedures that 
  expose the relationship between those two -- but in the
  proposal, a grapheme, as the name suggests, is a G-char.

  This does have some consequences.  For example, it would _not_ be a
  good first-year-student exercise to "write a culturally
  neutral/parameterized TEXT<? implementation" -- which is part of the
  reason that the TEXT<? in the proposal takes a "locale" argument.


* Reactions to Summary

  Tom> * Summary

  Tom>   The new TEXT? and GRAPHEME? types present a simple and
  Tom>   traditional interface to "conceptual strings" and "conceptual
  Tom>   characters".  They make it easy to express simple algorithms
  Tom>   simply and without reference to the internal structure of
  Tom>   Unicode.

  Bradd> This is good, but beware of the subtle differences between
  Bradd> "conceptual" from the human's point of view, the typesetter's
  Bradd> point of view, and the disk/memory point of view. It's
  Bradd> tempting to gloss over the typesetting elements, until you
  Bradd> realize that that's how you actually display text.

  The "display" issue, in its full splendor, is far larger in scope
  than just the G-char/L-char distinction.  A good goal for the
  proposal is to lay a foundation on which display algorithms can 
  be implemented.

  Graphemes, as I've defined them, are such a foundation as far as I
  know, and are also a reasonable foundation (especially if given
  additional structure by extensions to the proposal) for all
  conceivable L-char-oriented text processes.

  A display engine could be written, for example, to translate the
  text ["f" "f" "l"] into the ligature ["ffl"] before display.

  Tom> Reflecting the realities of global text processing, there is
  Tom> no bias in the interfaces suggesting that the set of
  Tom> graphemes is finite.

  Bradd> Note that the (unspecified) internal [encoding] does
  Bradd> introduce some bias, which is why I recommend the
  Bradd> parameterized approach rather than the "convert all text to
  Bradd> internal representation" approach.

  You are mistaken: that the internal encoding is unspecified does not
  change the fact that the interface is _not_ biased towards the
  presumption that the set of graphemes is finite.

  There are, in Unicode, an infinite set of graphemes so the interface
  has to reflect that.  (There is also an infinite set of L-chars, for 
  some definitions of L-char.)

  Tom> Also reflecting the realities of global text processing: the
  Tom> length of a text object may change over time; a sequence
  Tom> replacement operator is supplied instead of an element
  Tom> replacement operator; and markers (similar to those in text
  Tom> editors) are provided for iteration and other examples of
  Tom> keeping track of "a position within a text value".

  Bradd> Good idea, although I don't grok the marker concept. I feel
  Bradd> that there must be a better way to do that -- perhaps rely on
  Bradd> FOLD rather than cursors?

  A cursor, unlike the invocation of a procedure by a FOLD-family
  procedure, is transportable across dynamic contexts.  It is possible
  to reasonably implement FOLD-family procedures on top of cursors, but
  it is not possible to reasonably implement cursors on top of
  FOLD-family procedures.  
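
  To make the asymmetry concrete, here is a rough sketch of the easy
  direction -- a left fold over a text, built on top of markers.
  (TEXT-LENGTH and TEXT-REF, returning the grapheme at an index, are
  assumed extensions here; they are not in the proposal as written.)

        (define (text-fold kons knil text)
          (let ((m (make-text-marker text 0)))
            (let loop ((acc knil))
              (if (< (marker-index m) (text-length text))
                  (let ((g (text-ref text (marker-index m))))
                    (set-marker-index! m (+ 1 (marker-index m)))
                    (loop (kons g acc)))
                  acc))))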


  Tom> There is no essential difference between a grapheme and a text
  Tom> object of length 1, and thus the proposal makes GRAPHEME? a
  Tom>   subtype of TEXT?.

  Bradd> Good idea, so long as you can deal with the
  Bradd> data/graphical/conceptual issues.

  It is straightforward and even clean (not necessarily _easy_) to add
  context/process-specific "L-char" procedures as extensions to the
  proposal.

  Tom> CHAR? may be safely viewed as a subtype of GRAPHEME?, but the
  Tom> converse is not, and can not, be true.

  Bradd> This is a bit fuzzy, probably because you haven't defined
  Bradd> "subtype."

  I simply mean that it is safe to require that:

        (if (char? obj)
          (grapheme? obj)
          #t)

        => #t

  for all possible values of OBJ but _not_ that:

        (if (grapheme? obj)
          (char? obj)
          #t)

        => #t

        

0
lord1 (42)
11/11/2003 11:49:22 PM
> Tom Lord wrote:
>> These ideas _could_ be taken even a step further with the addition
>> of:
>>     ~ TEXT? values contain an "encoding" attribute, just as strings
>>       do (utf-8, etc.)

Ray Dillinger <bear@sonic.net> wrote:
> No...  I like the idea of the string (or text) simply as a sequence of
> characters/graphemes.  There are no circumstances, at all, under which
> you ought to be able to tell, or should need to care, about the
> encoding of a text when you're considering it as a sequence of
> _characters_.

I disagree. In programs that do significant amounts of text processing,
it's often important to know or specify the encoding used. For one
thing, it can make a large difference in space requirements. For
another, some algorithms care about the representation. Common example:
Regular expression processors can use more efficient data structures and
algorithms when the encoding is known to use single-byte characters
only.

So it depends on what your goals are. An abstract/opaque internal
representation is OK if your text-processing needs are incidental
(internationalizing output, e.g.). If you hope to compete with optimized
text-processing languages, however, it's better to use text types that
are explicitly parameterized over encodings.

The analogy with bignums is apt: If you just want it to work properly
for casual use, an opaque implementation is good enough. If you want it
to compete with Fortran, though, you need to give programmers more
control over the encoding, especially for large-scale processing.
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
0
news152 (508)
11/12/2003 12:34:54 AM
Tom Lord <lord@emf.emf.net> wrote:
> (Thank you for your uptake on this issue, by the way.)

Sure, happy to help.

> * D-char/G-char/L-char
>   Bradd> D-Char: an encoding unit
>   Bradd> G-Char: a graphical/typesetting unit
>   Bradd> L-Char: a language unit

> G-Char: The intention of the proposal is to give you GRAPHEME? for
> G-Char, very directly.  Thus, a TEXT? value representing:
> 
>         U+FB04  (the "ffl" ligature)
> 
> would contain a single grapheme.

Sounds like the Unicode standard actually recommends going in the other
direction: storing strings as L-chars and constructing the G-char
representation for output only. IIRC, that's also the way that actual
typesetting software works; for example, TeX doesn't turn ffl into a
ligature until very late in the translation process.

Bradd wrote:
>> This suggests that the TEXT type hides the details of encoding from
>> the user -- maybe it uses UCS-4 internally, maybe it uses an adaptive
>> algorithm to determine the most efficient type. Users must convert
>> all external encodings to the internal one.

> The internal representation is not really a portable application's
> business.

I disagree. The internal representation is a very big deal for some
text-processing algorithms. For example, many regular expression parsers
for the English language make simplifying assumptions based on the small
size of the character set. Because of that, it makes good sense to
permit direct operation on a string with a "small" character set
encoding.

True, those algorithms don't work on characters in general, but they're
very useful when the character set is known. And if the software is
careful about its inputs, it can make sure that the character set is
known (and small) at all times.

This permits portable algorithms with fallback algorithms: Use the
efficient procedures when possible and the general procedures otherwise.

> The presumed encoding of a port is an interesting question and the
> proposal doesn't, I admit, speak to that.  It does imply some need to
> be able to label a port with an encoding in order to support
> procedures like READ-GRAPHEME.  An alternative would be to declare
> that ports are always streams of octets and add additional parameters
> to procedures like READ-GRAPHEME.  I could go either way, at this
> stage, with a slight personal preference for port labels.

For what it's worth, C++ makes encoding a trait of the strings and the
ports themselves. (It also provides adapters and converters so that you
can convert portions of an input source into a different encoding or
language.) I like that toolkit approach.

>> Note that the (unspecified) internal [encoding] does introduce some
>> bias, which is why I recommend the parameterized approach rather than
>> the "convert all text to internal representation" approach.

>   You are mistaken: that the internal encoding is unspecified does not
>   change the fact that the interface is _not_ biased towards the
>   presumption that the set of graphemes is finite.

Instead, it's biased toward the presumption that the set of graphemes is
infinite. While that's true for the general case (Unicode input), it's
not true for significant classes of inputs -- and a program can easily
ensure that it uses only certain inputs. Since some significant
algorithms rely on those restricted cases, this approach sacrifices
sophistication for generality.

>> Good idea, although I don't grok the marker concept. I feel that
>> there must be a better way to do that -- perhaps rely on FOLD rather
>> than cursors?

> A cursor, unlike the invocation of a procedure by a FOLD-family
> procedure, is transportable across dynamic contexts.

So are folds; it just takes a bit of continuation trickery to get it.
Which is better? Depends on what you're trying to do.

> It is possible to reasonably impement FOLD-family procedures on top of
> cursors, but it is not possible to reasonably implement cursors on top
> of FOLD-family procedures.  

You can actually make the round trip -- it just isn't very efficient.

>>> CHAR? may be safely viewed as a subtype of GRAPHEME?, but the
>>> converse is not, and can not, be true.

>> This is a bit fuzzy, probably because you haven't defined
>> "subtype."

>   I simply mean that it is safe to require that:
>   (if (char? obj) (grapheme? obj) #t) => #t
>   for all possible values of OBJ but _not_ that:
>   (if (grapheme? obj) (char? obj) #t) => #t

OK, thanks for the clarification.
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
0
news152 (508)
11/12/2003 12:53:10 AM
"Bradd W. Szonye" wrote:
> 
> > Tom Lord wrote:
> >> These ideas _could_ be taken even a step further with the addition
> >> of:
> >>     ~ TEXT? values contain an "encoding" attribute, just as strings
> >>       do (utf-8, etc.)
> 
> Ray Dillinger <bear@sonic.net> wrote:
> > No...  I like the idea of the string (or text) simply as a sequence of
> > characters/graphemes.  There are no circumstances, at all, under which
> > you ought to be able to tell, or should need to care, about the
> > encoding of a text when you're considering it as a sequence of
> > _characters_.
> 
> I disagree. In programs that do significant amounts of text processing,
> it's often important to know or specify the encoding used. For one
> thing, it can make a large difference in space requirements. For
> another, some algorithms care about the representation. Common example:
> Regular expression processors can use more efficient data structures and
> algorithms when the encoding is known to use single-byte characters
> only.
> 
> The analogy with bignums is apt: If you just want it to work properly
> for casual use, an opaque implementation is good enough. If you want it
> to compete with Fortran, though, you need to give programmers more
> control over the encoding, especially for large-scale processing.

You know how declarations work in Common Lisp?  You write your code 
using general values or objects, you get it working.  The compiler 
is generating all the code it needs to do type dispatch and handle 
type exceptions as necessary, and maybe it's slower than it has 
to be but you never *ever* had to worry about internal formats. 

Then you add declarations, essentially making promises to the compiler; 
"you don't have to check for this case, I promise it won't happen."  
And the compiler then finds that, if it doesn't have to check, it 
can optimize your operations by using hardware types, fixed-width 
numbers, inlining, etc. CMUCL gets numeric performance that outruns 
C and competes directly with FORTRAN when appropriate declarations 
exist in the code. 

(Side note: We should do that.  Optional declarations ought to take 
the place of both the forms in SRFI-4 homogeneous numeric vectors 
and the forms in the proposed SRFI-47 Arrays.)  

The benefit is that this means correct code need not be refactored 
if an implementation doesn't support something; the 

(declare ((var declaration)(var declaration)...)
   body)

in that case can just macroexpand to body and the general, opaque, 
representation can still handle it and get correct behavior.

Anyway, when you start wanting at the bits-n-bytes for performance
reasons, I think what we have a good case for is optional declarations;
completely not in the user's face unless he wants them, completely 
transparent if the compiler happens not to get benefit from them, 
capable of being inserted into working code after it gets working
without changing any semantics, and providing enough information 
that the compiler can provide and support fast-path operations 
where they are possible.

So I think people ought to be able to write 

(define regmatch 
   (lambda (regexp str)
  ...))

, get it working, and then if and when performance becomes a concern, 
come back later and add a declaration like this: 

(declare ((str (string-of charset:ascii)))
   (define regmatch 
      (lambda (regexp str)
       ...))
)

and the compiler would then know that, inside the function regmatch,
str should always be a string, and that the programmer promises 
the string will never contain any non-ascii characters.  It becomes
"an error" to call regmatch on a string that contains a non-ascii
character and the compiler is allowed to ignore the possibility. 
Type inference can even drive the effect of the declaration back 
into the scope of the calling sites and forward into the routines
that regmatch calls if the compiler is aggressive. 

The difference is that the implementor need do nothing special 
(aside from providing a syntax-rules one-liner to allow declarations
to be ignored) to provide for this unless s/he is providing the 
optimization, and optimized programs are then portable without 
damage to their semantics (though their performance may vary) 
across all implementations whether the optimization itself is 
portable or not. 
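
The "one-liner" I have in mind is nothing more than this (a sketch;
an implementation that actually optimizes would of course do
something smarter than discarding the declarations):

(define-syntax declare
  (syntax-rules ()
    ((_ decls body ...) (begin body ...))))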

				Bear
0
bear (1219)
11/12/2003 1:19:14 AM
Embrace Optional Declarations.  You can separate 
representation from semantics while still providing
all the information an optimizing compiler needs to 
optimize where possible. 

The benefits are staying out of the coders' faces 
until they have a performance problem they need a 
way to solve and providing a way for even heavily-
optimized code to be portable with semantics intact 
to systems that don't support the optimizations. 

			Bear
0
bear (1219)
11/12/2003 1:25:32 AM
Tom> There are two major obstacles to providing nice,
Tom> non-culturally-biased Unicode support in standard Scheme.  First,
Tom> the required standard character and string procedures are
Tom> fundamentally inconsistent with the structure of unicode.  [....]

Ray> Agreed.  And this is exactly why I defined "characters" as I did.
Ray> The programs ought to be manipulating characters (the entities
Ray> that people are aware of), not bothering people with details of
Ray> encoding units unless they go down to that level.

Agreed?!?!?  Dude?!??!??  "I do not think that word means what you
think it means."  Hey, no, you can't do that.

You _can_not_ define the CHAR? type in a unicode-sane way.  I recall
that one of your previous posts mentioned "reducing the _chances_ that
case conversion will change a string's length" -- well, yeah -- that's
a symptom of the problem.  You _can_not_ reconcile Scheme's
requirements for CHAR? with Unicode's requirements concerning case.
It's impossible.  You go on to talk about limiting the problem to 48
"quasi-characters" but you're quite bogus in doing so.  Give up.

I think your _implementation_work_ (as I've heard about it on c.l.s.)
is very promising -- excellent, superb, inspiring, right on the mark,
very valuable.... (hence my proposal).  But I think you are
implementing approximations of TEXT?  and GRAPHEME? not STRING? and
CHAR?.  Repurpose your work!  Take the "Character Sets" SRFI as a hint
that repurposing your work is The Right Thing.

Given your proposed CHAR? type, CHAR-UPCASE and CHAR-DOWNCASE can not
behave in a sane way.  Shall we talk of CHAR-TITLECASE, too?  Eszett
is the canonical example, but there are many, others.  I again refer
you to:

      http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt

And look, "case" is the issue that R5RS happened to pay attention to
and so can be called out on, but it's not the only issue that has such
problems.  So, admittedly more vaguely, this whole idea of a
"character type" is far more limited than R5RS suggests.  The
existence of disjoint CHAR? and the procedures relating to it are
highly problematic.

As came up on the Guile developers' list recently: I'm not so sure
that CHAR? shouldn't go away entirely from its existence as a disjoint
type.  But that'd be a brutal change whereas my TEXT?/GRAPHEME?
addition is delicate and upwards compatible.

Tom> (char-ci=? eszett (char-upcase eszett)) => #t
Tom> (char-upper-case? (char-upcase eszett)) => #t

Ray> It is required by the R5RS standard, and my system happens to
Ray> achieve it - but R5RS is, in some cases, just plain wrong.

Dear god, how does your system "happen to achieve it"?  The UPPER-CASE
of eszett is a sequence of two characters!

Have you extended CHAR-UPPER-CASE? to accept an argument which is a
string?  and CHAR-UPCASE to sometimes return a string? To do so
violates R5RS in a deep way: it overrides the disjointness of the 
CHAR? and STRING? types.

No, taking your reply out of order, I see this:

Tom> Note that these return texts, not necessarilly graphemes.  For
Tom> example, GRAPHEME-UPCASE of eszett would return a text
Tom> representation of "SS".

Ray> Yes.  That is another good approach, and doesn't introduce the
Ray> hackiness of my special-use quasicharacters.  But does it provide
Ray> what the student needs for his naive implementation of the
Ray> stUdLycaPpeR to be correct?

Yikes.  "special-use quasicharacters"?  Your description sounds
appropriately embarrassed :-)

Yes, the student writes his stUdLycaPpeR for TEXT? values using
markers and TEXT-REPLACE!.  Piece of cake.  And as a bonus of his
first-year class, he gets some exposure to the semantics of "global
text processing".


      > often required behavior such as 

      > (char-ci=? foo (char-downcase (char-upcase foo))) => #t


Ray> But there are fourteen annoying exceptions, of which eszett, as
Ray> you correctly point out, is one.

I think it is, in fact, infinitely more than 14.  There is no promise
that the Unicode consortium won't introduce more.   Fundamentally, the
upcase mapping of a character is a string.

Tom> * The Proposal

Tom> The proposal has two parts.  Part 1 introduces a new type, TEXT?,
Tom> which is a string-like type that is compatible with Unicode, and
Tom> a subtype of TEXT?, GRAPHEME?, to represent "conceptual
Tom> characters".

Ray> Interesting.... I had not thought of making these separate from
Ray> characters and strings.  

Yup.  My proposal is just a tweak on your work (as far as I can tell
from your c.l.s. posts).  IMHO, your work is mostly all good -- just
don't call it CHAR? and TEXT? and, by the way, feel free to lose the
integer mapping and replace it with a hash function.

Ray> When you are manipulating unicode codepoints individually, that
Ray> seems to me below the level of characters and strings entirely;

Hence GRAPHEME? which is (deep down) isomorphic to your proposal for
CHAR?.  Where the isomorphism breaks down is where it should break
down.

Ray> you should be working with vectors of fixnums at that point, not
Ray> thinking of them as "characters" in any meaningful sense.  

CHAR? only makes good sense, at best, as a type isomorphic to (but for
historical reasons disjoint from) integer octets.


Ray> So I'd been folding all the character->fixnum and
Ray> fixnum->character machinery into the input and output ports, 

No you haven't.   Reading "characters" on your ports yields something
isomorphic to an integer, not a fixnum.   


Tom> It is important to note that, in general, EQV? and EQUAL?  do
Tom> _not_ test for grapheme equality.  GRAPHEME=? must be used
Tom> instead.

Ray> Um.... why not? 

Informally, I expect EQV? to mean "equal even under mutation"
and EQUAL? to mean "equal when passed to functional procedures".

Neither of those is true of graphemes.

Ray> If your graphemes are conceptual characters, you don't have to
Ray> worry about them being represented in different ways; you should
Ray> have some kind of internal canonical ordering of the combiners,
Ray> and eqv? and equal? should find the same grapheme represented the
Ray> same way, everywhere.  The only way to screw this up is to let
Ray> people get inside the representation of characters/graphemes
Ray> directly with byte operations - and at that point you're not
Ray> treating them as characters at all, you're treating them as
Ray> bytes.

Ok, sort of.   GRAPHEME=? takes a LOCALE parameter.  I would not 
object to defining (extending the proposal):

	EQV? of two graphemes being the same as EQ?

	EQUAL? of two graphemes being the same as GRAPHEME=?
          with the default locale.
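
  In other words (a sketch only -- DEFAULT-LOCALE is an assumed
  accessor for the default locale, not a name from the proposal):

        (define (equal*? a b)
          (if (and (grapheme? a) (grapheme? b))
              (grapheme=? a b (default-locale))
              (equal? a b)))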



Tom>     ~ (text=? t1 t2 [locale]) => <boolean>
Tom>     ~ (text<? t1 t2 [locale]) => <boolean>
Tom>     [...]
Tom>         The usual ordering predicates.

Ray> Ordering predicates are semantically interesting if you're not
Ray> providing a grapheme->integer and integer->grapheme mapping.  Is
Ray> the semantic requirement to be simply that the ordering
Ray> represented by these primitives is a total ordering which does
Ray> not change during the program run?  Would it be "correct" for a
Ray> system to simply assign each text a sequential number in the
Ray> order in which it sees them, and then have the total ordering
Ray> (lasting for the run of the program) correspond with ordering of
Ray> the sequence numbers?

The LOCALE parameter is key (and, really, should be the first, not
last parameter).  In general, collation of texts is _not_ a lexical
ordering based on grapheme order.  I think it might be good to make
the "default locale" subjet to the requirement that text order _is_ a
lexical ordering defined from the (default locale) ordering of
graphemes.




Tom>     ~ (text-replace! text start end replacement)

Tom>       Replace the graphemes at [START, END) in TEXT with
Tom>       the graphemes in text object REPLACEMENT.  Passing
Tom>       #t for END is equivalent to passing an index 1
Tom>       position beyond START.
Tom> 
Tom>       TEXT must be a mutable text object (see below).

Ray> Okay... I can see that this is a bit more flexible than
Ray> string-set! because it doesn't require the replacement to be the
Ray> same length as what it replaces.  But I think the quasicharacters
Ray> + text-set! combination is still easier to use, because people
Ray> attempting to use text-replace!  will still be trying to keep
Ray> track of indices into the string and they're going to have to
Ray> write code that keeps track of those indices.

Your quasicharacters, from what I can tell from your posts, are 
a sure and steady path to hell.   A quasicharacter like "SS" returned
from upcasing eszett -- well, isn't that actually a string (except 
that procedures don't treat it as such)?

Besides: "case" is the least of your problems.  Unicode implies other
mappings (an open set of mappings) from char->string:  are you going
to make quasicharacters include the union of all possible ranges of
those mappings?  By the time you do, you're right back to giving up
on ->integer mappings for characters and nestled right smack dab into 
the middle of the TEXT?/GRAPHEME? types.


Tom>   That wouldn't work though:
Tom>   changing the case of one character can change the length of text,
Tom>   right at the point being indexed, and invalidate the indexes.  So,
Tom>   texts really need markers that work like those in Emacs:

Tom>     ~ (make-text-marker text index) => <marker>
Tom>     ~ (text-marker? obj) => <boolean>
Tom>     ~ (marker-text marker) => <index>
Tom>     ~ (marker-index marker) => <index>
Tom>     ~ (set-marker-index! marker index)
Tom>     ~ (set-marker! marker text index)
Tom>     etc.

Ray> Ah.  And here are the facilitators for writing code that keeps
Ray> track of indexes.  I don't think these are really an easier way
Ray> of dealing with the casing->size-changing warts of unicode than
Ray> introducing forty-eight non-canonical decomposing characters.
Ray> The crucial difference, to me, is that these are things you'll
Ray> have to use and be aware of but the decomposing characters will
Ray> be invisible most of the time beyond simply knowing that when you
Ray> canonicalize a string (err, text), its length may change.

Insertion or replacement by one of your quasicharacters, "SS" still
being a good example, should change the length of a string and
invalidate integer indexes -- hence markers.

I don't understand where you get this "48" foo.  I count more than
that in SpecialCasing.txt and see no promise from the Unicode
consortium not to introduce more.   Fundamentally, case mappings accept a
grapheme-like character and return a string.  No need for special
cases like quasicharacters if you accept that reality.
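
Concretely (using the proposal's GRAPHEME-UPCASE, which returns a
TEXT?; TEXT-LENGTH here is an assumed extension):

      (grapheme-upcase eszett)                 => a text equal to "SS"
      (text-length (grapheme-upcase eszett))   => 2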



Tom> ~ CHAR? is an octet, STRING? a sequence of octets

Ray> Way too low-level, IMO.  For operations on characters, you need
Ray> characters.  If you have to specify numbers of bits, you're
Ray> talking about numbers instead.

Your opinion is not insane.  Just call the things you're thinking of
GRAPHEME? and TEXT? and leave poor, overspecified CHAR? and STRING?
alone.  (As a prize for making that change in your thinking, you can
drop the ->integer and integer-> mappings.)



Ray> So you want to use string-ref to get (say) the third byte of the
Ray> fourth UCS32 value in a string?  Is that any kind of meaningful
Ray> character?  These are operations on bytevectors, not operations
Ray> on strings.

Yes, I do want to use string-ref that way.  Treating CHAR? and STRING?
as octet-based is the only coherent, useful, unicode-friendly, and
upwards compatible interpretation of them available and, yes, that does
undermine the entire reason to have a disjoint CHAR? type.  The proposed
GRAPHEME? type is to be a superior and graceful replacement for CHAR?.


Tom> ~ TEXT? values contain an "encoding" attribute, just as strings
Tom>   do (utf-8, etc.)

Ray> No...  I like the idea of the string (or text) simply as a
Ray> sequence of characters/graphemes.  There are no circumstances, at
Ray> all, under which you ought to be able to tell, or should need to
Ray> care, about the encoding of a text when you're considering it as
Ray> a sequence of _characters_.

You quoted me from a context in which I was describing what happens 
if values of type TEXT? are viewed as type STRING?.   In that context,
a TEXT? value is, very much, a string of octets described by an
encoding label.

-t

0
lord1 (42)
11/12/2003 3:21:48 AM
A nice proposal.

I'm for separating the text/grapheme layer and string/char
layer.  These are different abstractions and can't be
merged.

lord@emf.emf.net (Tom Lord) wrote in message news:<vr2c0q94bmjtc3@corp.supernews.com>...
>     ~ (text->utf8 text) => <string> 
>     [...]
> 	The usual conversions from strings (presumed to be
>         sequences of octets) to text.

I'll comment on this later.

>   It is important to note that, in general, EQV? and EQUAL?  do _not_
>   test for grapheme equality.  GRAPHEME=? must be used instead.

Can't we define equal? to work as grapheme=? on grapheme
arguments?  It seems inconvenient if I have nested lists
and vectors that may contain grapheme objects in them and
I can't compare two such structures with equal?...

>     ~ (read-grapheme [port]) => <grapheme>
>     ~ (peek-grapheme [port]) => <grapheme>

It may be worth noting that, to implement this, a port
must know what character encoding it reads.

>     ~ (make-text-marker text index) => <marker>
>     ~ (text-marker? obj) => <boolean>
>     ~ (marker-text marker) => <index>
>     ~ (marker-index marker) => <index>
>     ~ (set-marker-index! marker index)
>     ~ (set-marker! marker text index)

An alternative idea is to have something similar to srfi-6
string port.

>     ~ CHAR? is an octet, STRING? a sequence of octets

I'm against this idea.   I believe there should be three
layers:

  grapheme (conceptual character)
  character (represents a codepoint in system-dependent
             character encoding)
  octet (raw bits)

The codepoint character is useful in the following situations.

 - to write the text/grapheme layer.
 - to write an encoding conversion routine, using codepoint
   object as a "pivot".
 - when the native encoding of Scheme implementation is not
   Unicode:  Since grapheme is defined in terms of Unicode,
   we need some other way to deal with character codepoints
   that are not Unicode.
 - to write basic string operations, such as scanning a
   codepoint sequence within a string.  Note that in some
   encodings, scanning the octet sequence is not the right way
   to scan the codepoint sequence.
 - to write basic character (codepoint) operations: for example,
   when I read character entity reference "&#309a;" in XML
   text, I want to convert it to a codepoint object for later
   processing (it can't directly be a grapheme, since it is
   just a combining character).
 - if string? is an octet sequence and we use string-set!,
   then we may get a string object that is illegal for any
   character encoding.  The same difficulty arises when 
   substring splits an octet sequence in such a way that
   the original well-formed utf-8 string produces two
   ill-formed utf-8 strings.  So do we need to check encoding
   consistency every time we perform these string operations?

We do need an API to touch the underlying octet sequence
of a string.  For that purpose, something like octet?
and octet-string? will serve better.

>       STRING-APPEND implicitly converts its second and subsequent
>       arguments to the same encoding as its first.

It seems difficult to make it work consistently.
What if two encodings of strings are incompatible
(e.g. string1 is in utf-8 and string2 is just an octet
stream that contains an illegal utf-8 sequence, or
string1 is EUC-JP and string2 is utf-8 which may contain
a character that doesn't have a mapping to EUC-JP?).

Character encoding conversion is pretty nasty business.
Sometimes you want to get an error if there's no mapping;
sometimes you want to replace bad octet sequence with
some alternative character silently; sometimes you want
just to ignore bad sequence; sometimes you want to control
mappings (there is more than one mapping between some
character encodings).   So, I'd like to keep conversions
within well-separated operations, rather than letting them
happen implicitly.

>     ~ (text? "a string") => #t
> 
>     ~ (grapheme? #\a) => #t
> 
>   In other words, all character values are graphemes, and all strings
>   are text values.

I think this doesn't work if a character is an octet.
A character which is an octet value #x80 can't map to a 
grapheme, for example.

>   All text values can be strings;  some graphemes can be characters.

This could work.

--shiro
0
shiro (31)
11/12/2003 3:58:13 AM
Ray Dillinger <bear@sonic.net> wrote:
> Embrace Optional Declarations.  You can separate representation from
> semantics while still providing all the information an optimizing
> compiler needs to optimize where possible. 

This is totally at odds with good software engineering practices! At
some levels, sure, you don't want to worry about implementation details.
That's abstraction. But would you say the same thing about the choice to
represent data with a splay tree vs a hash table, depending on your
algorithmic needs? See /How to Design Programs,/ which is all about
teaching programmers how to find appropriate representations and then
letting them drive the design.

Sure, you can let the translator worry about details like whether to use
a halfword or fullword integer. But general-purpose, mechanical
translators don't have the domain knowledge or analytical sophistication
to make gross algorithmic choices. Example:

When writing a regex state machine, there are a few different ways to
represent character classes (e.g., [0-9A-Za-z]). You can always fall
back on the basic regex operations: catenation, alternation, and
repetition. You can represent the example character class as
0|1|2|3|4|5|6|7|8|9|A|...|z. But if you know that you're working with a
single-byte character set, you can represent it much more efficiently as
a 256-element bitmap. It's faster *and* it requires less space that way.
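
A sketch of what I mean, for a single-byte character set (with a plain
vector of booleans standing in for the bitmap):

    (define (char-class->bitmap chars)   ; chars: a list of characters
      (let ((bm (make-vector 256 #f)))
        (for-each (lambda (c) (vector-set! bm (char->integer c) #t)) chars)
        bm))

    (define (in-class? bm c)             ; one lookup per input character
      (vector-ref bm (char->integer c)))

    ;; (in-class? (char-class->bitmap (string->list "0123456789")) #\7) => #t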

If you know of a compiler that can figure that out automatically from an
optional type declaration, please let me know.
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
0
news152 (508)
11/12/2003 5:33:40 AM
Tom Lord wrote:
> 
> 
> Given your proposed CHAR? type, CHAR-UPCASE and CHAR-DOWNCASE can not
> behave in a sane way.  Shall we talk of CHAR-TITLECASE, too?  Eszett
> is the canonical example, but there are many, others.  I again refer
> you to:
> 
>       http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt

With exactly fourteen exceptions - the ligatures lacking altercase 
mappings and eszett itself - this entire file maps single graphemes
to single graphemes.  

With twenty-eight special characters, I can provide the "missing" 
cases of the exceptions.  With twenty more, I can achieve transitive
closure within case-change operations.
 
> As came up on the Guile developers' list recently: I'm not so sure
> that CHAR? shouldn't go away entirely from its existence as a disjoint
> type.  But that'd be a brutal change whereas my TEXT?/GRAPHEME?
> addition is delicate and upwards compatible.

Right.  Upwards compatibility will allow this to exist within 
existing implementations alongside the finite-charset "char" and 
"string" types, whereas my way of handling it is a wholesale 
replacement. 
 
> Ray> But there are fourteen annoying exceptions, of which eszett, as
> Ray> you correctly point out, is one.
> 
> I think it is, in fact, infinitely more than 14.  There is no promise
> that the Unicode consortium won't introduce more.   Fundamentally, the
> upcase mapping of a character is a string.

The unicode consortium does in fact promise to introduce no more.

"... the ligatures are deprecated; they appear in unicode only for 
 purposes of round-tripping with existing character sets.  When
 canonicalizing strings, they are reduced to other mappings, and 
 no more ligatures will be given unicode codepoints.... "

> Ray> you should be working with vectors of fixnums at that point, not
> Ray> thinking of them as "characters" in any meaningful sense.
> 
> CHAR? only makes good sense, at best, as a type isomorphic (but for
> historical reasons disjoint) from integer octets.

I've used a lot of languages that have worse historical warts. 
C, in fact, is a fine example, where "signed characters" are 
actually an integer type and the actual things that people use 
as characters are now wchar or similar. 
 
But you realize you're proposing the death of "char" as a type 
meaningful as characters at all. 

> Ray> So I'd been folding all the character->fixnum and
> Ray> fixnum->character machinery into the input and output ports,
> 
> No you haven't.   Reading "characters" on your ports yields something
> isomorphic to an integer, not a fixnum.

Ports have an encoding.  If the encoding is UTF32, what you 
read (or write) is a "character" (grapheme, in your parlance) - 
opaque, internal, representation, which may be several unicode 
codepoints.  If the encoding is binary32, what you read (or write) 
is an integer in the range [0..2^32), which you can map to or 
from a codepoint explicitly if need be.  

Codepoints for combining characters and codepoints not mapped 
to anything meaningful by Unicode are also examples of 
quasicharacters; what you get back from (UTF32->char N) when 
N contains such an integer may not in fact be a valid character 
by itself.  Combining codepoints can be added to a valid character
but by themselves they're only quasicharacters.  And codepoints
that don't map to anything at all give quasicharacters that 
aren't very useful.  
 
> Ray> Ordering predicates are semantically interesting if you're not
> Ray> providing a grapheme->integer and integer->grapheme mapping.  

> The LOCALE parameter is key (and, really, should be the first, not
> last parameter).  In general, collation of texts is _not_ a lexical
> ordering based on grapheme order.  

Hmmm. Okay, I'll buy that.  Right now I'm using grapheme order 
because it's simple. But you're right, a charset or locale will
probably provide its own collation order.

> Your quasicharacters, from what I can tell from your posts, are
> a sure and steady path to hell.   A quasicharacter like "SS" returned
> from upcasing eszett -- well, isn't that actually a string (except
> that procedures don't treat it as such)?

Like I said, it's hacky.  But just like the ligatures that Unicode
defines, it's a single character until it gets put into a string and
the string is canonicalized.  The quasicharacters can defer all 
length changes to string-canonicalization operations. 
 
> Besides: "case" is the least of your problems.  Unicode implies other
> mappings (an open set of mappings) from char->string:  are you going
> to make quasicharacters include the union of all possible ranges of
> those mappings?  By the time you do, you're right back to giving up
> on ->integer mappings for characters and nestled right smack dab into
> the middle of the TEXT?/GRAPHEME? types.

I haven't yet seen this: Aside from the thirteen ligatures lacking 
opposite-case mappings, and eszett itself, all the mappings defined 
are of a single grapheme to a single grapheme; they only appear 
"special" because the number of unicode codepoints to represent 
them changes.
 
> I don't understand where you get this "48" foo.  I count more than
> that in SpecialCasing.txt and see no promise from the Unicode
> consortium not to introduce more.   Fundamentally, case mappings accept a
> grapheme-like character and return a string.  No need for special
> cases like quasicharacters if you accept that reality.

There are fourteen problematic graphemes -- eszett, which uppercases 
into two characters, and the thirteen ligatures with no corresponding 
case-mappings, which uppercase or lowercase or titlecase into different 
numbers of characters.  Everything else in SpecialCasing.txt is mappings
of single graphemes to altercase single graphemes that just happen to 
require different numbers of unicode codepoints to represent.  Size 
mismatches on them go away when you are looking at grapheme lengths 
rather than unicode codepoint lengths. 

Furthermore, if you canonicalize a string, all of these problematic 
graphemes except eszett itself go away, replaced by canonical unicode 
mappings that do not cause such problems. 

Each ligature is uppercase, lowercase, or titlecase.  That means just 
twenty-eight special-use characters are required to provide one-to-one
mappings for them into the missing two cases.  Those twenty-eight 
special-use characters represent two (or three) graphemes each, and 
many need further special cases to achieve closure on case 
transformations from them because the ligature that altercases into 
them isn't, in turn, their own preferred altercase image.  I wound 
up with forty-eight characters.

Finally, if you read the unicode consortium's documentation and policy,
they have promised, in fact, that they will not introduce any more 
code points for ligatures that altercase into different numbers of 
characters -- so the thirteen, plus eszett, really and truly are all
we have to worry about. 

				Bear
0
bear (1219)
11/12/2003 5:54:32 AM
Ray Dillinger <bear@sonic.net> wrote:
> Anyway, when you start wanting at the bits-n-bytes for performance
> reasons, I think what we have a good case for is optional
> declarations; completely not in the user's face unless he wants them,
> completely transparent if the compiler happens not to get benefit from
> them, capable of being inserted into working code after it gets
> working without changing any semantics, and providing enough
> information that the compiler can provide and support fast-path
> operations where they are possible.

Not all "bits-n-bytes" optimizations are amenable to that kind of
mechanical analysis. For example, see the literature on garbage
collection. It's often critical to know "implementation details" like
the size of the virtual address space, because they affect
representation at a level that compilers can't (yet) understand.

For example, a subspace map or card-marking scheme that works well for a
32-bit address space might fail miserably for a 64-bit space. That's
because the smaller address space permits some mathematical
transformations that are too inefficient for the larger space. Reverse
pointer lookups are easy in a 32-bit space; just split it into 64KB
pages, mask off the low bits, and look it up in a 64K-word table. Expand
that to a 64-bit address space, and you need a different data structure
and a different algorithm. Compilers can't make that decision.
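
For example (a sketch only; the numbers are the ones above, the names
are mine):

    ;; 32-bit space, 64KB pages: the page number is the address with the
    ;; low 16 bits masked off, an index into a 65536-entry table.
    (define page-table (make-vector 65536 #f))

    (define (page-of addr)
      (quotient addr 65536))

    (define (lookup-page addr)
      (vector-ref page-table (page-of addr)))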

Exactly the same thing comes up in regex machines. Small character sets
(especially single-byte sets) permit some optimizations that simply
aren't possible on a large or infinite character set. While Unicode
specifies an infinite character set, most applications only need to deal
with the small character sets in practice. And again, this isn't
something that a compiler can figure out for you.

You could code this application-specific knowledge into the compiler,
but then what happens when a slightly different application comes along?

And it's not like the "string type parameterized by encoding" is
especially burdensome. Programmers can always convert text internally to
the infinite character set and then forget about it (except for I/O) if
they don't care about these things.

Let me put this another way: How would you feel about programming in a
language that optimizes all ADTs this way? No trees, lists, vectors,
etc., just a generic ADT and a set of optional data declarations. You
can tell the compiler, "A tree would be best for this data," but it
wouldn't be required to honor the request. I think that would be
extremely frustrating.

I feel the same way about a text-processing facility that tells me to
keep my hands off the internals. How am I supposed to implement an
efficient regex machine if I don't have sufficient information to
simplify the general-case math? Sure, you can't always simplify, but
what's good about a system that prevents you from doing it even when it
would otherwise be possible? It's not like modern compilers can figure
out how to do it for me, just from a data declaration. Compilers don't
know that [abc] can be implemented much more efficiently than the
general transformation to a|b|c (for some character sets).
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
0
news152 (508)
11/12/2003 5:58:28 AM
Regarding Tom Lord's Unicode proto-proposal:

Shiro Kawai <shiro@acm.org> wrote:
> A nice proposal.

Yes, very nice, but of course the devil is in the details.

> I'm for separating the text/grapheme layer and string/char layer.
> These are different abstractions and can't be merged.

As you explain later and I mentioned earlier, there are a few different
abstractions when dealing with text, plus a few intrusive implementation
details.

> I believe there should be three layers:
> 
>   grapheme (conceptual character)
>   character (represents a codepoint in system-dependent character encoding)
>   octet (raw bits)

While I generally agree with your assessment here, there are some
complications at the "character" level. Often, the meaning of a code
point depends on context. For example, you can't locate a code point in
a "shift" encoding without knowing the shift state in that part of the
string. Unicode tries to minimize this kind of state-based encoding,
although even in Unicode there are some issues with mixed-language texts
(because languages tend to have special conventions regarding the
inclusion of extra-lingual text).
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
0
news152 (508)
11/12/2003 6:20:16 AM
Ray Dillinger <bear@sonic.net> wrote:
> If the encoding is UTF32 ....

Nitpick: There is no "UTF32" encoding. It's called "UCS-4" (or at least,
that's what it was called when I worked in this area a couple of years
ago). "UTF-8" and "UTF-16" are ways to write UCS-4 in octets and
double-octets. There also used to be "UCS-2," before Unicode outgrew the
16-bit range.
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
0
news152 (508)
11/12/2003 6:28:22 AM
At Wed, 12 Nov 2003 05:58:28 GMT, Bradd W. Szonye <bradd+news@szonye.com> wrote:
> 
> Exactly the same thing comes up in regex machines. Small character sets
> (especially single-byte sets) permit some optimizations that simply
> aren't possible on a large or infinite character set. While Unicode
> specifies an infinite character set, most applications only need to deal
> with the small character sets in practice. And again, this isn't
> something that a compiler can figure out for you.

Why not?  If you use utf-8 or any other character set backwards
compatible with ASCII, then when you compile a regexp with only ASCII
character sets it can safely read a single byte at a time.  So in [abc]
the optimizer could read and compare a single byte.  If it matches then
you had a single-byte char to begin with, if it fails then you need to
backtrack and don't care whether you had read a full character or not.
Likewise in a.*b, the individual .'s have to match valid wide
characters, but if they are followed by a b that's a given (assuming the
string was valid to begin with) and again you can read a byte at a time.
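
(Spelling out the property relied on above: in utf-8 every byte of a
multi-byte character has its high bit set, so a literal ASCII byte can
never match in the middle of one.  A sketch, on exact integer byte
values:)

    (define (ascii-byte? b)               ; the whole of a 1-byte character
      (< b #x80))

    (define (utf8-continuation-byte? b)   ; a trailing byte of a longer one
      (and (>= b #x80) (< b #xC0)))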

-- 
Alex

0
foof (110)
11/12/2003 6:43:00 AM
"Bradd W. Szonye" wrote:

> Nitpick: There is no "UTF32" encoding.

There most certainly is, it's detailed in the Unicode Standard 4.0,
section 2.5, subsection "UTF-32."

-- 
   Erik Max Francis && max@alcyone.com && http://www.alcyone.com/max/
 __ San Jose, CA, USA && 37 20 N 121 53 W && &tSftDotIotE
/  \ The purpose of man's life is not happiness but worthiness.
\__/  Felix Adler
0
max78 (1220)
11/12/2003 8:03:55 AM
"Bradd W. Szonye" <bradd+news@szonye.com> wrote in message news:<slrnbr3k8v.s6e.bradd+news@szonye.com>...
> While I generally agree with your assessment here, there are some
> complications at the "character" level. Often, the meaning of a code
> point depends on context. For example, you can't locate a code point in
> a "shift" encoding without knowing the shift state in that part of the
> string. Unicode tries to minimize this kind of state-based encoding,
> although even in Unicode there are some issues with mixed-language texts
> (because languages tend to have special conventions regarding the
> inclusion of extra-lingual text).

Yeah, that's true.  And how do you write a program that
deals with such stuff?  Surely we can do it at the bottom
level, octet stream, but it seems to me that there should be
one level up----if it's unicode, touching unicode codepoints
is easier than writing routines that deals with utf-8, utf-16,
and ucs-32 individually.  If its euc-jp, which has packed form
(octet stream with single-character shift) and unpacked form 
(2-bytes canonical representation), we can still see a 
canonical codepoint stream.  If we have full iso2022---then 
we'll end up to scan at octet stream level, but to make
anything useful I think we should convert it to some internal 
codepoint stream in some chosen encoding anyway.  And we can 
build a text processing layer on top of it.

(Even in unicode-only world, you don't deny that it is 
convenient to have middle layer that is a sequence of
unicode codepoints, do you?  If you don't take that approach,
then you'll write your text processing library directly on
top of utf-8, AND utf-16, AND ucs-32... If you first
canonicalize those various encoding to, say, utf-8, then
effectively you have your intermediate layer.)
0
shiro (31)
11/12/2003 12:47:55 PM
> "Bradd W. Szonye" wrote:
>> Nitpick: There is no "UTF32" encoding.

Erik Max Francis <max@alcyone.com> wrote:
> There most certainly is, it's detailed in the Unicode Standard 4.0,
> section 2.5, subsection "UTF-32."

Mea culpa! I didn't spot it in my quick flip through the book. There has
long been confusion between the "UCS" and "UTF" encoding families, and I
thought this was just another example.
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
0
news152 (508)
11/12/2003 4:55:33 PM
> Bradd W. Szonye <bradd+news@szonye.com> wrote:
>> Exactly the same thing comes up in regex machines. Small character
>> sets (especially single-byte sets) permit some optimizations that
>> simply aren't possible on a large or infinite character set. While
>> Unicode specifies an infinite character set, most applications only
>> need to deal with the small character sets in practice. And again,
>> this isn't something that a compiler can figure out for you.

Alex Shinn wrote:
> Why not?

Because general-purpose compilers don't have the domain-specific
knowledge necessary to make the appropriate data and algorithm choices.
And it's not like you can program that knowledge into the compiler --
you'd effectively need AI.

> If you use utf-8 or any other character set backwards compatible with
> ASCII, then when you compile a regexp with only ASCII character sets
> it can safely read a single byte at a time.  So in [abc] the optimizer
> could read and compare a single byte.

It's not just reading and comparing a byte at a time. It's also
designing data structures to take advantage of the limited character
set. Compilers can't make that choice for you. (Maybe someday they will,
but that day is not in the near future.)

> If it matches then you had a single-byte char to begin with, if it
> fails then you need to backtrack and don't care whether you had read a
> full character or not.

If you need to test every character/byte like that, you lose much of the
benefit of the domain-specific optimization. Part of the optimization is
guaranteeing in advance that some large set of data will all fall within
a certain range, rather than verifying it one bit at a time.

And again, it's not like it's an extra burden to carry the encoding
information with each string; you effectively need to do that anyway.

> Likewise in a.*b, the individual .'s have to match valid wide
> characters, but if they are followed by a b that's a given (assuming
> the string was valid to begin with) and again you can read a byte at a
> time.

How does a *compiler* know that? Why are you assuming that a compiler
has this kind of domain-specific knowledge? And even assuming that it
does, how do you deal with future extensions that weren't anticipated
when the compiler was written?

Related to this: This is also why it's important to provide some kind of
character->integer conversion. Character sets have many interesting
mathematical properties. That's no accident. ASCII lowercasing is a simple
OR operation, for example, and UTF-8 was carefully designed so that
ASCII and Unicode could live together in numerical harmony. I thought I
heard a suggestion to eliminate the char->int conversion, and I think
it's a bad idea, because it strips away information that was carefully
designed into the character sets specifically to make numerical analysis
easier.
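
For instance (a sketch, written on the integer codes, since the trick
is a property of ASCII's layout: upper- and lowercase letters differ
only in bit #x20):

    (define (ascii-code-downcase n)   ; set bit #x20: A..Z -> a..z
      (if (and (>= n 65) (<= n 90)) (+ n 32) n))

    (define (ascii-code-upcase n)     ; clear bit #x20: a..z -> A..Z
      (if (and (>= n 97) (<= n 122)) (- n 32) n))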

While it's true that Unicode-in-general doesn't permit that kind of
analysis, most smaller encodings do. Applications targeted for a
specific encoding shouldn't need to re-implement the string/text library
just to recover useful information that the general system has thrown
away.
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
0
news152 (508)
11/12/2003 5:16:25 PM
Okay, I volunteer to be the ugly American here.

lord@emf.emf.net (Tom Lord) writes:

>   There are two major obstacles to providing nice,
>   non-culturally-biased Unicode support in standard Scheme.

> ** CHAR? Makes No Sense In Unicode

If we look at it in an unbiased manner, a fair question to ask is "who
is not making sense: CHAR? or Unicode?".

Do I understand correctly that upcasing a single "latin small letter
sharp s" gives you two "latin capital letter s"s, but when you
downcase those, you get two "latin small letter s"s with no sharpness?
What kind of nonsense is that?

What really puzzles me here is that I thought Germany just did a
spelling reform recently.  Asking all users of a language to change
their spelling is a huge and painful undertaking.  Why on earth would
you do that and not take the opportunity to fix total breakage like
the lack of a one-to-one case mapping?  Such obtuseness should be
punished, not accommodated.

Why not just add a sentence to the char-upcase spec that says "It is
an error if your language does not have a char-to-char upcase
function"?

Of course there will always be a need for applications that deal with
obsolete languages and typography, but should we really try to throw
all that cruft into a simple general-purpose (including pedagogical)
language like scheme?  You might be studly enough to handle that
morass, but, please, think of the children.

-al
Americans for ASCII Association -- Keep Unicode out of our schools
"If ASCII's good enough for Jesus, then it's good enough for me."
(What's that?  He spoke HEBREW?)
"ASCIIFY HIM!  ASCIIFY HIM!"
0
11/12/2003 6:02:42 PM
Alphanumeric Petrofsky <alphanumeric@petrofsky.org> writes:

> What really puzzles me here is that I thought Germany just did a
> spelling reform recently.  Asking all users of a language to change
> their spelling is a huge and painful undertaking.  Why on earth would
> you do that and not take the opportunity to fix total breakage like
> the lack of a one-to-one case mapping?  Such obtuseness should be
> punished, not accommodated.

The problem is that the last time people tried to punish Germany, the
Volk regrouped with massive armament.  It seems clear that the same
peril exists here.  The Germans, being master philologists, might just
come up with a standard so magnificently convoluted that we'll for the
simplicity of Unicode yearn.

> Americans for ASCII Association -- Keep Unicode out of our schools
> "If ASCII's good enough for Jesus, then it's good enough for me."

I'm quite sure you mean the prophets.

Shriram, ducking out before someone brings up Chamberlain
0
sk1 (223)
11/12/2003 7:09:02 PM
Alphanumeric Petrofsky wrote:
> 
> Okay, I volunteer be the ugly American here.
> 
> lord@emf.emf.net (Tom Lord) writes:
> 
> >   There are two major obstacles to providing nice,
> >   non-culturally-biased Unicode support in standard Scheme.
> 
> > ** CHAR? Makes No Sense In Unicode
> 
> If we look at it in an unbiased manner, a fair question to ask is "who
> is not making sense: CHAR? or Unicode?".
> 
> Do I understand correctly that upcasing a single "latin small letter
> sharp s" gives you two "latin capital letter s"s, but when you
> downcase those, you get two "latin small letter s"s with no sharpness?
> What kind of nonsense is that?

Err, yeah.  That's exactly the correct nonsense. 

Unicode has a bunch of mappings where the preferred lowercase form 
of something has an uppercase form that's not itself, too; like 
the dotless i that uppercases into latin capital I, but latin 
capital I downcases into latin small i. That's a more subtle kind
of nonsense. 

There are a bunch of ligatures -- like latin lowercase ligature ffl -- 
that uppercase into latin capital F, latin capital F, latin capital L -- 
and that sequence in turn downcases into latin lowercase f, latin 
lowercase f, latin lowercase l. But, in fairness, ligature nonsense 
is limited because the ligatures are deprecated in Unicode, there 
only for roundtripping with existing character sets, and if you 
canonicalized the string the ligature would have been transformed 
into lowercase f, lowercase f, lowercase l anyway.  

latin small letter sharp s is the only example of a grapheme that 
does NOT go away in string canonicalization but which DOES map to 
multiple graphemes when made uppercase or titlecase.  IOW, if you're
counting graphemes rather than codepoints, and you start with a 
canonicalized string, latin small letter sharp s is the only 
character that creates a situation where changing the case of a 
string can change its length.  It _is_ tempting to just treat it 
as an error. 

Even so, you still have hundreds of characters like that dotless i, 
that have uppercases they're not the preferred lowercase of, or 
lowercases they're not the preferred uppercase of.  Sometimes 
if your locale setting is right, Latin capital I _will_ lowercase 
into lowercase dotless i instead of latin lowercase i, which adds 
to the confusion. 

> What really puzzles me here is that I thought Germany just did a
> spelling reform recently.  Asking all users of a language to change
> their spelling is a huge and painful undertaking.  Why on earth would
> you do that and not take the opportunity to fix total breakage like
> the lack of a one-to-one case mapping?  Such obtuseness should be
> punished, not accommodated.

Well, Germany's like that, I guess. Germans have always had their 
own special ways of doing things. 
 
> Why not just add a sentence to the char-upcase spec that says "It is
> an error if your language does not have a char-to-char upcase
> function"?

:-)


> Americans for ASCII Association -- Keep Unicode out of our schools
> "If ASCII's good enough for Jesus, then it's good enough for me."
> (What's that?  He spoke HEBREW?)
> "ASCIIFY HIM!  ASCIIFY HIM!"

Well.... Aramaic actually.  He probably only spoke Hebrew in synagogue 
on Fridays.

				Bear
0
bear (1219)
11/12/2003 7:14:52 PM
Shriram Krishnamurthi wrote:
> 
> Alphanumeric Petrofsky <alphanumeric@petrofsky.org> writes:
> 
> > What really puzzles me here is that I thought Germany just did a
> > spelling reform recently.  Asking all users of a language to change
> > their spelling is a huge and painful undertaking.  Why on earth would
> > you do that and not take the opportunity to fix total breakage like
> > the lack of a one-to-one case mapping?  Such obtuseness should be
> > punished, not accommodated.
> 
> The problem is that the last time people tried to punish Germany, the
> Volk regrouped with massive armament.  It seems clear that the same
> peril exists here.  

If you assign the capital form of sharp-s the most obvious glyph, it 
might look uncomfortably familiar to a lot of Germans....  A thought
that occurred to me when I was inventing the quasicharacters I've 
mentioned and finally realized that sharp-s is the ONLY canonical 
cased glyph *without* a single-glyph uppercase. 
      ______
     /  /  /
    /  /  /
   /  /_  -
   -   /  /
   /  /  /
  /__/__/


			Bear
0
bear (1219)
11/12/2003 7:50:19 PM
On 12 Nov 2003, Ray Dillinger <- bear@sonic.net wrote:

> Shriram Krishnamurthi wrote:

>> The problem is that the last time people tried to punish Germany, the
>> Volk regrouped with massive armament.  It seems clear that the same
>> peril exists here.  

I hope that's your way of humour.  If so I can laugh.

> If you assign the capital form of sharp-s the most obvious glyph, it 
> might look uncomfortably familiar to a lot of Germans....  A thought

A capital sharp s does not exist.  The sharp s (ß) is a ligature from
the small caps letters s and z.


   KP

-- 
            One, two!  One, two!  And through and through
                 The vorpal blade went snicker-snack!
          He left it dead, and with its head
                He went galumphing back.   "Lewis Carroll" "Jabberwocky"
0
sigurd
11/12/2003 9:41:29 PM
Karl Pflästerer wrote:
> 
> On 12 Nov 2003, Ray Dillinger <- bear@sonic.net wrote:
> 
> > Shriram Krishnamurthi wrote:
> 
> >> The problem is that the last time people tried to punish Germany, the
> >> Volk regrouped with massive armament.  It seems clear that the same
> >> peril exists here.
> 
> I hope that's your way of humour.  If so I can laugh.

Yes, he was being sarcastic. Apologies, I know sarcasm is very
strange and sometimes prickly to people whose tradition it's not.

> > If you assign the capital form of sharp-s the most obvious glyph, it
> > might look uncomfortably familiar to a lot of Germans....  A thought
> 
> A capital sharp s does not exist.  The sharp s (ß) is a ligature from
> the small caps letters s and z.

Sorry if I gave offense. Although I'm aware that sharp-s originated
as an SZ ligature, the fact that it's no longer interchangeable in
spelling with the ligatured letters means that now it's no longer a
"ligature" in technical parlance; now it's a "letter" in its own
right.  But as a letter with no independent capital form in an
otherwise cased alphabet, it's unique and exceptional.

Right now it's the only character in all of unicode that frustrates
my efforts to create a clean interface to strings where you can raise
or lower case on strings without changing their length.  All the other
"specials" that change case into more or fewer glyphs are non-canonical
characters - actual ligatures, which are a problem that can be "solved"
simply by replacing them with the letters they're ligatures of.

But sharp-s isn't a ligature in that sense.  If I replaced it with
the letters 'SZ', or 'ss', then it would result in words being
misspelled. 'Maße' is not the same word as 'Masse', and if ß
were actually a ligature, it would be, because a ligature is just
a typesetting convention for the letters it joins.  If it were
a ligature, I could treat it like all the other ligatures and it
wouldn't cause this problem.

I know this isn't really something Germany did just to mess the
rest of us up in our efforts to create some kind of sane API for
character-based programs.  I know this wasn't an act of malice.
It's just...  frustrating, and we were making little sarcastic
jokes about our frustration.  Apologies if we gave offense.


				Bear
0
bear (1219)
11/12/2003 11:00:51 PM
On 13 Nov 2003, Ray Dillinger <- bear@sonic.net wrote:

> Karl Pflästerer wrote:

>> On 12 Nov 2003, Ray Dillinger <- bear@sonic.net wrote:

>> > Shriram Krishnamurthi wrote:

>> >> The problem is that the last time people tried to punish Germany, the
>> >> Volk regrouped with massive armament.  It seems clear that the same
>> >> peril exists here.

>> I hope that's your way of humour.  If so I can laugh.

> Yes, he was being sarcastic. Apologies, I know sarcasm is very
> strange and sometimes prickly to people whose tradition it's
> not.

Well, a lot of people here (in Germany) are ironic, sarcastic, or
cynical.  It's only sometimes hard on Usenet to tell whether people were
just making a joke (then everyone can laugh or smile if it was a good
one) or whether they meant it for real.  A lot of prejudices exist, and
you never know if the other person takes them seriously.

If I read some of the rants against Unicode I can't help but think of
some prejudices about US Americans.

>> A capital sharp s does not exist.  The sharp s (ß) is a ligature from
>> the small caps letters s and z.

> Sorry if I gave offense. Although I'm aware that sharp-s originated 
> as an SZ ligature, the fact that it's no longer interchangeable in 
> spelling with the ligatured letters means that now it's no longer a 
> "ligature" in technical parlance; now it's a "letter" in its own 
> right.  But as a letter with no independent capital form in an 
> otherwise cased alphabet, it's unique and exceptional.  

That's right.  But I'm sure other languages have other special cases;
what about e.g. French?  You write Café but CAFE (you could write CAFÉ
but that would be unusual).  Or other languages with diacritical
letters.

> Right now it's the only character in all of unicode that frustrates 

Really?  But I must admit I'm no Unicode specialist.


[...]
> But sharp-s isn't a ligature in that sense.  If I replaced it with
> the letters 'SZ', or 'ss', then it would result in words being 
> misspelled. 'Maße' is not the same word as 'Masse', and if ß 

Well, that's right and wrong :-)  In e.g. Switzerland you could write
"Masze" for "Maße" IIRC.  But most of the time it would be misspelled,
that's right.

[...]
> I know this isn't really something Germany did just to mess the 

Who knows?  Maybe Gutenberg and his friends said: hey, let's not make it
too easy if people decide one day to no longer use letters made of lead.

> rest of us up in our efforts to create some kind of sane API for 
> character-based programs.  I know this wasn't an act of malice. 
> It's just...  frustrating, and we were making little sarcastic 
> jokes about our frustration.  Apologies if we gave offense. 

No you didn't.  But sometimes I read similar postings where people
really mean the things they write (I don't answer those postings because
there's absolutely no sense in it; you only get flamewars).


   KP

-- 
    'Twas brillig, and the slithy toves
        Did gyre and gimble in the wabe;
    All mimsy were the borogoves,
         And the mome raths outgrabe.   "Lewis Carroll" "Jabberwocky"
0
sigurd
11/13/2003 12:06:46 AM
At Wed, 12 Nov 2003 17:16:25 GMT, Bradd W. Szonye <bradd+news@szonye.com> wrote:
> 
> Alex Shinn wrote:
> > Why not?
> 
> Because general-purpose compilers don't have the domain-specific
> knowledge necessary to make the appropriate data and algorithm choices.
> And it's not like you can program that knowledge into the compiler --
> you'd effectively need AI.

You're saying compilers aren't perfect.  Granted.  However as they get
smarter they tend to perform better than humans, even when the humans
take the time to try to optimize.  Just to be sure I'm not
misunderstanding, in the specific example of regexps, are you really
suggesting the user manually test encodings of strings and then dispatch
accordingly to different regexps?

> > If it matches then you had a single-byte char to begin with, if it
> > fails then you need to backtrack and don't care whether you had read a
> > full character or not.
> 
> If you need to test every character/byte like that, you lose much of the
> benefit of the domain-specific optimization. Part of the optimization is
> guaranteeing in advance that some large set of data will all fall within
> a certain range, rather than verifying it one bit at a time.

I'm talking about optimizing the regexp itself so that it can match
against *any* string as quickly as a pure ASCII regexp engine,
regardless of whether or not the string in question has multi-byte
characters.  That's a much more powerful optimization than you're
proposing.

What I think you're suggesting is optimization in one or both of the
following two cases:

  1) If the string contains chars not allowed in the regexp (e.g. the
     regexp only has single byte chars or charsets and no instances of
     either . or ?, and the string is known to have multi-byte
     characters) then it can guarantee failure.  But that only works if
     you know for *certain* that the multi-byte string could not in fact
     be represented as a single byte string, which would require
     guaranteed normalization of multi-byte strings into single byte
     whenever possible, which is more likely to hurt performance than to
     help it.

  2) Alternately if the regexp *requires* characters (not in an optional
     | group) that don't fit into the strings' encoding, then you can
     guarantee failure.  This occurs when you're matching against
     specific multi-byte characters and the string is single byte.  This
     is fairly rare - the Gauche regexp engine handles native multi-byte
     character and character set literals, but I almost never take
     advantage of this and have seen no other code that does.

So both are uncommon, require extra overhead, and 1 puts harsh
restrictions on what multi-byte encoded strings may hold.  However, if
you did add an encoding tag to strings and did want these optimizations,
then the regexp compiler would be the best place to put these
optimizations.

Note a more general approach to optimizing arbitrary groups of
characters is to do the equivalent of the Perl "study" function and make
a quick once-over of the string to find out what characters it holds,
which allows quick failure when there are mismatches with the regexp
characters.

> > Likewise in a.*b, the individual .'s have to match valid wide
> > characters, but if they are followed by a b that's a given (assuming
> > the string was valid to begin with) and again you can read a byte at a
> > time.
> 
> How does a *compiler* know that? Why are you assuming that a compiler
> has this kind of domain-specific knowledge? And even assuming that it
> does, how do you deal with future extensions that weren't anticipated
> when the compiler was written?

How is any of this domain specific?!  Character sets are a fundamental
part of the programming language, so you're going to optimize them
regardless of regexps.  Gauche *does* only treat strings as multi-byte
in regexps when it has to.  It could be a little smarter in this
respect, but the optimizations aren't hard to add, and certainly better
added to the regexp code than done by hand.  Likewise integer set
implementations can take advantage of compact groups of integers with
bitmaps.  These are general optimizations that can and should be taken
out of the hands of the developers.

What are the future extensions to charsets (once you agree on what a
"char" is)?  If you change the definition of char you'll have a lot more
code to rewrite than just charsets, and in that case I certainly hope
you left all the optimizations up to the compiler so that you can fix it
in one place, rather than have to rewrite all your manual optimizations
*shudder*.

Note I'm not necessarily arguing against multiple string encodings, but
not for speed, and I certainly don't think the user should be checking
and manually optimizing based on that encoding.  The primary argument is
simply one of size.  In UTF-8 many scripts such as Cyrillic double in
size, whereas most Asian scripts grow to 3/2 the original size.  So you
often want to store large texts in the most appropriate format and
convert as you load it into Scheme (which is why ports need encoding
tags).  However, if you do a lot of converting back and forth or if you
store a lot of text in memory it can be expensive - it's like working
with a compressed filesystem.  When 99% of your processing is handled in
a certain charset, it's nice to be able to use that charset internally.
So when I tell Japanese companies about our application server that's
native UTF-8 "... so you can handle Korean and Thai!," they say "Really?
That's nice... how much work would it be to convert it to EUC-JP?"

So string tags are a debatable point, but the programmer shouldn't
access them directly.

-- 
Alex

0
foof (110)
11/13/2003 2:21:00 AM
On Wed, 12 Nov 2003 23:00:51 GMT, Ray Dillinger <bear@sonic.net> wrote:

>All the other "specials" that change case into more or fewer glyphs are
>non-canonical characters - actual ligatures, which are a problem that
>can be "solved" simply by replacing them with the letters they're
>ligatures of.

Note that using "ligature" in the sense of "letter + diacritical
mark(s)" is a bit non-standard, and potentially confusing. In Unicode,
the use of a ligature is presentational (that is, it affects the
appearance, but not the meaning, of the text), whereas the use of either
the precomposed or separate form of a letter + diacritical mark has no
effect on either the meaning or the appearance of the text. 

[It could be argued that ligatures can, in some cases, affect the
meaning of the text, in the sense that the absence of a ligature where
one is expected (in Arabic, for example), might render the text somewhat
unintelligible. In a computer program, it is, of course, the job of the
presentation layer to handle any obligatory ligature conversions; the
back-end code shouldn't have to worry about it.]

In any case, you're right that a canonical decomposition takes care of
the one-to-one mapping problem for all but U+00DF, but there are still a
couple of anomalous cases I can think of (I'm sure there are more):

1) GREEK SMALL LETTER FINAL SIGMA (U+03C2) vs. GREEK SMALL LETTER SIGMA
(U+03C3) - The end of a text string doesn't necessarily represent the
end of a word, so an upper->lower conversion from GREEK CAPITAL LETTER
SIGMA (U+03A3) may be ambiguous.

2) In Dutch, the two-letter combination IJ is treated as a single letter
(or ligature, depending on your point of view) as far as capitalization
is concerned. Thus, we have:
 
 (title-case "IJMUIDEN") -> "IJmuiden"
 (title-case "ijmuiden") -> "IJmuiden"

In this case, using the ligated forms (U+0132 and U+0133) actually makes
case conversion simpler.

-Steve

0
see94 (37)
11/13/2003 4:08:47 AM
Steve Schafer wrote:
> 
> On Wed, 12 Nov 2003 23:00:51 GMT, Ray Dillinger <bear@sonic.net> wrote:
> 
> >All the other "specials" that change case into more or fewer glyphs are
> >non-canonical characters - actual ligatures, which are a problem that
> >can be "solved" simply by replacing them with the letters they're
> >ligatures of.
> 
> Note that using "ligature" in the sense of "letter + diacritical
> mark(s)" is a bit non-standard, and potentially confusing. In Unicode,
> the use of a ligature is presentational (that is, it affects the
> appearance, but not the meaning, of the text), whereas the use of either
> the precomposed or separate form of a letter + diacritical mark has no
> effect on either the meaning or the appearance of the text.

I'm not using "ligature" in the sense of letter + diacritical mark.
A letter plus diacritical marks is a single glyph, and its uppercase
or lowercase is also a single glyph.  They simply appear in the 
specialcasing.txt file because the uppercase and lowercase glyphs 
happen to take different numbers of unicode codepoints to represent. 

				Bear
0
bear (1219)
11/13/2003 4:47:32 AM
> Bradd W. Szonye <bradd+news@szonye.com> wrote:
>> [General]-purpose compilers don't have the domain-specific knowledge
>> necessary to make the appropriate data and algorithm choices [for
>> different character sets]. And it's not like you can program that
>> knowledge into the compiler -- you'd effectively need AI.

Alex Shinn <foof@synthcode.com> wrote:
> You're saying compilers aren't perfect.  Granted.  However as they get
> smarter they tend to perform better than humans, even when the humans
> take the time to try to optimize.

I don't anticipate compilers smart enough to make high-level algorithmic
choices any time soon.

> Just to be sure I'm not misunderstanding, in the specific example of
> regexps, are you really suggesting the user manually test encodings of
> strings and then dispatch accordingly to different regexps?

No. I'm talking about the implementation of the regex parser and state
machine, which is normally library code, not user code. But users
occasionally write these things themselves, so sometimes user code needs
to deal with it too. I'm curious; have you ever implemented a regex,
string-searching, or character classification library? In case you
haven't, I'll explain it in more detail.

If you know that the input strings only contain single-byte characters,
you can make certain optimizations. For example, you can often record
state and character-class information with a 256-element bitmap. That's
much more space- and time-efficient than the general case, which must
store state/class information in a more complicated set ADT.

Here's a toy example that might make it more obvious. When a regex
parser translates [abc] into a state machine, it needs to represent it
somehow in an internal parse tree. If the machine will be used to scan
Unicode strings, it needs a pretty heavy parse tree. But what if the
character set only has eight characters, a-h? Then you can represent
that character class with a small bitmap: 00000111. That requires much
less storage, and it's much faster to interpret at scan time.

The same thing is true for character classification predicates like
CHAR-ALPHABETIC? and CHAR-WHITESPACE?. Those predicates are very quick
and simple when the character set has a small number of characters; just
store the results in a bitmap. In practice, "small" character type
systems usually store the type information in a 256-word array, with one
bit from each word representing a different type. For example:

    (define alphabetic-bitmask 1)
    (define numeric-bitmask    2)
    (define whitespace-bitmask 4) ...

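    ;; type-map is assumed to be a 256-element vector (one bitmask word
    ;; per Latin-1 code point) built elsewhere; it is not defined here.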
    (define (char-type? mask c)
      (let* ((i         (char->integer c))
             (type-info (vector-ref type-map i))
             (type-bits (bitwise-and type-info mask)))
        (not (= 0 type-bits))))

    (define (char-alphabetic? c) (char-type? alphabetic-bitmask c))
    (define (char-numeric?    c) (char-type? numeric-bitmask    c))
    (define (char-whitespace? c) (char-type? whitespace-bitmask c)) ...

In principle, you can do this for any character set. In practice, the
storage requirements are unreasonable for anything much bigger than a
single-byte character set. The definition of CHAR-ALPHABETIC? for UCS-4
is much, much more complicated and difficult to store.

Regex machines need something similar to deal with character classes
like [abc]. If you know in advance that the strings are always ASCII or
Latin-1, you can store the character classes in bitmaps, as above. If
you need to deal with Unicode in general, you need a more general (and
less efficient) way to store the data. Mechanical compilers aren't
nearly intelligent enough to make that kind of decision, and I don't
expect them to get that sophisticated any time soon.

>> If you need to test every character/byte like that, you lose much of
>> the benefit of the domain-specific optimization. Part of the
>> optimization is guaranteeing in advance that some large set of data
>> will all fall within a certain range, rather than verifying it one
>> bit at a time.

> I'm talking about optimizing the regexp itself so that it can match
> against *any* string as quickly as a pure ASCII regexp engine,
> regardless of whether or not the string in question has multi-byte
> characters.  That's a much more powerful optimization than you're
> proposing.

How do you propose to beat a well-tuned, pure ASCII regex machine? You
can't beat it for speed or space efficiency, because the small character
sets permit optimizations that make a big difference at the Big-Oh
level. With a small character set, a character class only requires O(1)
space (typically 4 machine words) and O(1) scanning time. For the
general case (i.e., Unicode), character classes require O(K) space, and
the scanning time is either O(1) with a perfect hash function, O(lg N)
with a tree, or O(N) with a poor hash function. No amount of mechanical
optimization is going to make up for those order-of-magnitude
differences.

> What I think you're suggesting is optimization in one or both of the
> following two cases:
> 
>   1) If the string contains chars not allowed in the regexp ....
>   2) Alternately if the regexp *requires* characters ....

I don't think you're understanding me correctly. There are no "chars not
allowed in the regexp." The regex machine is built to handle strings of
a certain encoding, and it will only ever see characters in that
encoding. That's what makes the data optimizations possible. That's why
it's important for text-processing procedures to be aware of the
encoding (and any optimizations it makes possible).

Now, if you need to deal with the general case anyway, you have no
choice but to use the more general algorithms. (Although even then, you
might be able to save some space and time, if all regex characters fall
into a small band.) But if your application *isn't* general -- if you
know what encodings you'll find in your inputs -- then you can do much
better.

As I said, you might be able to come close to the single-byte
optimizations in some cases. Even if the scanned strings aren't limited
to a small character set, the regex might only recognize a small band of
characters. In that case, it can use a bit of extra analysis, some range
guards, and some math to permit the data optimizations. Dunno whether it
would actually be worth it, though. It certainly wouldn't be as quick as
what you can do when working with a small character set.

That's why I object to using the general-case, infinite character set
for the "internal" encoding. It puts harsher requirements on
applications that really only need a small character set.

> Note a more general approach to optimizing arbitrary groups of
> characters is to do the equivalent of the Perl "study" function and
> make a quick once-over of the string to find out what characters it
> holds, which allows quick failure when there are mismatches with the
> regexp characters.

That doesn't speed up scanning. It only speeds up parsing of the regex
itself. Any decent regex machine does that "quick failure" thing --
that's just how regex machines work. Studying makes no difference. I get
the impression that you haven't actually implemented a regex parser or
scanner.

>> How does a *compiler* know that? Why are you assuming that a compiler
>> has this kind of domain-specific knowledge? And even assuming that it
>> does, how do you deal with future extensions that weren't anticipated
>> when the compiler was written?

> How is any of this domain specific?!

The domain is "regex parsing and scanning," and the details of character
set encoding are important to that domain.

> Character sets are a fundamental part of the programming language, so
> you're going to optimize them regardless of regexps.

Regex parsing, character classification, and similar text-processing
facilities require specific knowledge of the character sets used, if you
want a space- and time-efficient implementation. You can put together
general-purpose versions without that information, but it's much less
efficient than a clued-in version. And it's silly to impose the extra
requirements of the general case on applications that only care about
ASCII or Latin-1. The input data permits significant optimizations, but
by converting it to a more general internal encoding, you throw away
that information and require a less efficient implementation.

> What are the future extensions to charsets (once you agree on what a
> "char" is)?

Not future extensions to charsets, future extensions to the
text-processing facilities. Sure, you can program the details of
character classes into the compiler, but that won't help new text
processing applications.

> Note I'm not necessarily arguing against multiple string encodings,
> but not for speed, and I certainly don't think the user should be
> checking and manually optimizing based on that encoding.  The primary
> argument is simply one of size.

You don't think a speed improvement from O(lg K) to O(1) is significant?
The space improvement from O(K) to O(1) is a big deal too.

> In UTF-8 many scripts such as Cyrillic double in size, whereas most
> Asian scripts grow to 3/2 the original size.  So you often want to
> store large texts in the most appropriate format and convert as you
> load it into Scheme (which is why ports need encoding tags).  However,
> if you do a lot of converting back and forth or if you store a lot of
> text in memory it can be expensive ....

In the applications I'm talking about there is no conversion! You read
the input from an ASCII file, you store it as ASCII in memory, and you
use ASCII-optimized regex machines to scan it. Why convert at all when
the native format is the most efficient one by far, both speed-wise and
space-wise?

Now compare that to an implementation that always stores text internally
in Unicode. You read from an ASCII file, convert it to Unicode, store
each character as a word instead of a byte, and use a slow,
general-purpose regex machine to scan it. Why?! The data was *already*
in the best format for the application, but the system insists on
converting it to a much less appropriate format.

None of this is necessary if you use polymorphic text functions
parameterized on the encoding. In that kind of system, an application
doesn't need to mess with conversions and inefficient, general-purpose
text processing if it doesn't need it. But it can easily convert
everything to Unicode if that's what it needs.

> So string tags are a debatable point, but the programmer shouldn't
> access them directly.

Why not? That stuff is *important* for efficient and sane text
processing. Often, you need to know about it to determine whether a
conversion is even possible. Furthermore, it's information that the
system must already have, in order to work properly, so why hide it
behind unnecessary and intrusive abstraction?
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
0
news152 (508)
11/13/2003 5:30:05 AM
At Thu, 13 Nov 2003 05:30:05 GMT, Bradd W. Szonye <bradd+news@szonye.com> wrote:
> 
> I'm curious; have you ever implemented a regex, string-searching, or
> character classification library?

Yes.

> But what if the character set only has eight characters, a-h? Then you
> can represent that character class with a small bitmap: 00000111.

You can perform the same optimization in any charset.  An optimized
char-set class should handle this for you automatically.  If you create
the charset a-h in Gauche it will in fact create a range for those
codepoints, so that

   (char-set-contains? #[a-h] my-char)

performs the equivalent of

   (char<=? #\a my-char #\h)

which is as fast as you can hope for in this case even though my-char
could be a wide character.  It doesn't convert to a bitmap but could
just as well do that.

The difference between corresponding sets of wide characters such as
char-set:hiragana and char-set:ascii is no different from the difference
between char-set:ascii and char-set:a-h.  In a-h you can use the bitmap
00000111 with an offset of 97, and in hiragana you use a 128-element
bitmap with a higher offset number, but the performance and space
efficiency is the same.
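
A rough sketch of the idea (illustrative only, not Gauche's actual
representation): a charset as a list of inclusive codepoint ranges,
tested by range:

  (define (ranges-contains? ranges c)
    (let ((n (char->integer c)))
      (let loop ((rs ranges))
        (and (pair? rs)
             (or (<= (caar rs) n (cdar rs))
                 (loop (cdr rs)))))))

  ;; #[a-h] compiles to a single range; hiragana would be another
  ;; single range at a higher offset:
  (ranges-contains? '((97 . 104)) #\e)   ; => #t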

> > I'm talking about optimizing the regexp itself so that it can match
> > against *any* string as quickly as a pure ASCII regexp engine,
> > regardless of whether or not the string in question has multi-byte
> > characters.  That's a much more powerful optimization than you're
> > proposing.
> 
> How do you propose to beat a well-tuned, pure ASCII regex machine?

Did you not read or not understand what I wrote?  In UTF-8, the regexp
/[abc]/ can be compiled in *exactly* the same way as in an ASCII
compiler and still work on multi-byte strings.  Was the problem that you
don't understand why this is so?

> > What I think your suggesting is optimization in one or both of the
> > following two cases:
> > 
> >   1) If the string contains chars not allowed in the regexp ....
> >   2) Alternately if the regexp *requires* characters ....
> 
> I don't think you're understanding me correctly. There are no "chars not
> allowed in the regexp."

"Chars not allowed in the regexp" was immediately followed by

    (e.g. the regexp only has single byte chars or charsets and no
    instances of either . or ?, and the string is known to have
    multi-byte characters)

which you snipped.  s/allow/match-anywhere if you prefer.  It refers to
/^[a-z]*$/ not being able to match against a string known to have wide
characters in it.

> The regex machine is built to handle strings of a certain encoding,
> and it will only ever see characters in that encoding. That's what
> makes the data optimizations possible. That's why it's important for
> text-processing procedures to be aware of the encoding (and any
> optimizations it makes possible).

Exactly.  It's very important for the procedures to be aware of the
encoding when optimizing.  Which makes it all the more powerful if you
guarantee a single consistent internal encoding.  Then not only does the
programmer not need to be aware of the encoding in everyday use, he only
needs to worry about one encoding when he optimizes.

> > Note a more general approach to optimizing arbitrary groups of
> > characters is to do the equivalent of the Perl "study" function and
> > make a quick once-over of the string to find out what characters it
> > holds, which allows quick failure when there are mismatches with the
> > regexp characters.
> 
> That doesn't speed up scanning. It only speeds up parsing of the regex
> itself. Any decent regex machine does that "quick failure" thing --
> that's just how regex machines work. Studying makes no difference. I get
> the impression that you haven't actually implemented a regex parser or
> scanner.

I get the impression you don't understand what the Perl study function
does.  It doesn't act on the regexp at all, it acts on the string you're
about to match.

-- 
Alex

0
foof (110)
11/13/2003 8:10:12 AM
Ray Dillinger wrote:

> lowercase f, latin lowercase l. But, in fairness, ligature nonsense
> is limited because the ligatures are deprecated in Unicode, there

Why are ligatures deprecated?  Did the Unicode People decide that they
just didn't exist anymore?

> Well.... Aramaic actually.  He probably only spoke Hebrew in synagogue
> on Fridays.

On Fridays?  Best you read up on Judaism a bit...

David
0
feuer (188)
11/13/2003 4:23:30 PM
Feuer wrote:
> 
> Ray Dillinger wrote:
> 
> > lowercase f, latin lowercase l. But, in fairness, ligature nonsense
> > is limited because the ligatures are deprecated in Unicode, there
> 
> Why are ligatures deprecated?  Did the Unicode People decide that they
> just didn't exist anymore?
> 

They exist, but they're just printing conventions rather than actual 
characters. The reason any of them are in Unicode in the first place 
is not for actual use as characters, but just for interoperation with
existing character sets that (mistakenly in the opinion of the unicode
consortium) have them as characters. 

				Bear
0
bear (1219)
11/13/2003 5:22:51 PM
On Thu, 13 Nov 2003 11:23:30 -0500, Feuer wrote:

>> lowercase f, latin lowercase l. But, in fairness, ligature nonsense
>> is limited because the ligatures are deprecated in Unicode, there
> 
> Why are ligatures deprecated?  Did the Unicode People decide that they
> just didn't exist anymore?

Some ligatures have separate code points in Unicode but they shouldn't be
encoded as such in text. Ligatures like "fi" are a matter of presentation,
not contents. The set of ligatures depends on fonts and the rendering
technique. Unicode is primarily used to encode the semantics of the text,
not just glyphs.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

0
qrczak (1266)
11/13/2003 5:35:03 PM
> Bradd W. Szonye <bradd+news@szonye.com> wrote:
>> But what if the character set only has eight characters, a-h? Then
>> you can represent that character class with a small bitmap: 00000111.

Alex Shinn <foof@synthcode.com> wrote:
> You can perform the same optimization in any charset.  An optimized
> char-set class should handle this for you automatically.

Sure, but there are even better optimizations possible when you can
guarantee a small input charset. For example, with a single-byte
charset, you can use a simple 128- or 256-element array for state
transitions, trading a bit of space for O(1)-time transitions. That
doesn't work as well with a large charset, because you need to deal with
range-checking, and you may need to use two-level data structures (at
best) or general lists/trees/hashtables (at worst).
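
Concretely, a toy sketch (illustrative only, not any particular engine)
of such a transition table for one DFA state:

  ;; One 256-entry vector per state: index by the byte value, get the
  ;; next state (or #f to reject).  O(1), with no range check needed.
  (define (make-state) (make-vector 256 #f))

  (define (add-transition! state char next)
    (vector-set! state (char->integer char) next))

  (define (step state char)
    (vector-ref state (char->integer char)))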

> If you create the charset a-h in Gauche it will in fact create a range
> for those codepoints, resulting in
> 
>    (char-set-contains? #[a-h] my-char)
> 
> to perform the equivalent of
> 
>    (char<? #\a my-char #\h)
> 
> which is as fast as you can hope for in this case even though my-char
> could be a wide character.  It doesn't convert to a bitmap but could
> just as well do that.

"Even though my-char could be a wide character." That qualifier is
important. In many applications, my-char *can't* be a wide character,
which simplifies the procedures and the representation of the state
machine.

> The difference between corresponding sets of wide characters such as
> char-set:hiragana and char-set:ascii is no different from the
> difference between char-set:ascii and char-set:a-h.  In a-h you can
> use the bitmap 00000111 with an offset of 97, and in hiragana you use
> a 128-element bitmap with a higher offset number, but the performance
> and space efficiency is the same.

When dealing with a small character set, that offset isn't necessary,
because it's always 0. That removes one operation from each state
transition. Since there are only about 4 or 5 operations per transition
to begin with, that's about a 25% speed improvement.

Also, if the internal representation is Unicode, you introduce a subtle
performance bias toward English and other Western European languages
(ISO8859-1). That's because the "working set" for those languages is
small and contiguous, which makes it more amenable to O(1) state tables
and bitmaps.

The techniques you've described help to minimize the performance
differences between ISO8859-1 and Unicode. The range checks/offsets do
introduce a significant penalty (about 25% degradation), but that may be
acceptable.

But what about switching from ISO8859-7 to Unicode? Now the "working
set" is split between 00xx (ASCII) and 03xx (Greek). That introduces a
difficult choice between much larger state tables, two-level tables, or
something similar. Depending on the implementation, I would expect the
regex machine would incur at least a +300% space penalty or an
additional +25% performance penalty.

And again, this happens because the system throws away information that
it already has. Also, remember that there's an additional performance
penalty from converting all strings to the internal encoding. Why
bother, when the data is already in the most convenient form?

>> How do you propose to beat a well-tuned, pure ASCII regex machine?

> Did you not read or not understand what I wrote?  In UTF-8, the regexp
> /[abc]/ can be compiled in *exactly* the same way as in an ASCII
> compiler and still work on multi-byte strings.

Three problems:

1. The state machine will need range checking of some kind, which
   imposes a substantial performance penalty.
2. The regex parser will also need to do more work to determine what
   optimizations are possible, since it's no longer /a priori/ true that
   the regex will only include characters within a narrow range.
3. For any ISO8859 script except Latin-1, the character ranges are wider
   than in the single-byte case, which degrades space or time
   efficiency.

Since the regex parser *and* the state machine both become more
complicated, there's going to be a performance hit no matter what
regexes or scanned strings you use.

> Was the problem that you don't understand why this is so?

I certainly don't understand how you can introduce more data and more
complexity without degrading performance. You keep plugging the idea
that this is as efficient as possible, "even though my-char could be a
wide character." But that qualifier makes a significant difference,
because an application that knows /a priori/ that my-char *can't* be a
wide character can do much better. In some cases (e.g., ISO8859-7), the
charset is also laid out more conveniently, so that fewer mathematical
transformations are necessary.

>> I don't think you're understanding me correctly. There are no "chars
>> not allowed in the regexp."

> "Chars not allowed in the regexp" was immediately followed by
> 
>     (e.g. the regexp only has single byte chars or charsets and no
>     instances of either . or ?, and the string is known to have
>     multi-byte characters)
> 
> which you snipped.

How do you determine that the string has multi-byte characters without
scanning it? How do you scan it more quickly than a well-tuned
single-byte regex machine can do it? You can see occasional gains here
if you scan the string many times *and* you know the regexes in advance
-- but you can do even better if you know /a priori/ that the strings
contain no multi-byte characters (because they're in an encoding that
doesn't permit it).

>> The regex machine is built to handle strings of a certain encoding,
>> and it will only ever see characters in that encoding. That's what
>> makes the data optimizations possible. That's why it's important for
>> text-processing procedures to be aware of the encoding (and any
>> optimizations it makes possible).

> Exactly.  It's very important for the procedures to be aware of the
> encoding when optimizing.  Which makes it all the more powerful if you
> guarantee a single consistent internal encoding.  Then not only does
> the programmer not need to be aware of the encoding in everyday use,
> he only needs to worry about one encoding when he optimizes.

No, he needs to worry about at least two: the system's internal
encoding, and the encoding used in the program's inputs. That may be a
net win if the program is heavily internationalized and expects
arbitrary inputs, because he can code solely to the internal encoding.

However, it's a major loss for non-internationalized inputs. Local
charsets are heavily optimized for programming in the local language,
and programmers are accustomed to using them. It makes i18n easier at
the expense of making non-i18n programs slower and more complicated.

Encoding-parameterized texts let the programmer choose which he wants:
To work solely in the local encoding, or to translate everything
internally to a /lingua franca./ Also, he can choose *which* internal
encoding to use. Depending on the application's domain-specific needs,
he might choose UCS-4 (for simple character processing), UTF-8 (for fast
string matching), a specific Asian+Latin encoding (for texts with
limited, bilingual internationalization), etc.

I don't see how it could possibly be better to let the compiler choose
for you when it comes to text processing. There are too many variables,
and most of them are domain-specific. Encoding-parameterized texts lets
the programmer have his cake and eat it too. A single internal encoding
makes him jump through hoops (especially if the encoding is opaque), and
what does the programmer get out of it? Poor performance.

> I get the impression you don't understand what the Perl study function
> does ....

Sorry, I got it mixed up with something else (the "compile this regex
only once" flag). After checking the perldoc, I see that studying is a
win only in some specific cases; the docs specifically note that you
should only use it after performance testing. And it doesn't really help
here; at best, you'll be able to match the performance of a machine
that's aware of the local encoding.
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
0
news152 (508)
11/13/2003 6:49:43 PM
Al Petrofsky writes:

> Do I understand correctly that upcasing a single "latin small letter
> sharp s" gives you two "latin capital letter s"s, but when you
> downcase those, you get two "latin small letter s"s with no sharpness?
> What kind of nonsense is that?

This "nonsense" is absolutely no problem in daily life.  Eszett is
never used at the beginning of words and headlines which are sometimes
set in upper case are usually not converted back to lower case.

> What really puzzles me here is that I thought Germany just did a
> spelling reform recently.  [...] Why on earth would you do that and
> not take the opportunity to fix total breakage like the lack of a
> one-to-one case mapping?

Actually, things got more complicated.  The committee decided to
remove the eszett after short vowels only.  Removing the eszett
completely would have introduced ambiguities (which, by the way, can
be observed in upper case headlines).

> Why not just add a sentence to the char-upcase spec that says "It is
> an error if your language does not have a char-to-char upcase
> function"?

I think that eszett should stay eszett in symbols and characters if
they are converted into upper case.  This means that (char-upper-case?
(char-upcase \#�)) would return #f, for example.  But AFAIK Kanji
characters are neither upper nor lower case so this result wouldn't be
anomalous at all.

On the other hand, a yet to be specified string-upcase function should
convert eszett into upper case double s.  This means that
string-upcase cannot be fully based on char-upcase, but IMHO this is
the way to go.

I'm not 100% sure about string comparison and regular expressions.
While the case-insensitive string comparison functions should probably
use string-upcase, regular expressions should better ignore the eszett
problem.  People can always use disjunctions like (ß|ss|SS) to match
the various incarnations of eszett.

BTW, in Perl uc(chr(0xDF)) gives eszett while uc(pack('U',0xDF))
gives double s.

ICU, a C++ library for Unicode text handling, also keeps the lower
case eszett when a single character is converted into upper case with
u_toupper().  But when u_strToUpper() is used to convert a string
eszett is replaced with double s.

> Of course there will always be a need for applications that deal
> with obsolete languages and typography, but should we really try to
> throw all that cruft into a simple general-purpose (including
> pedagogical) language like scheme?

In which way does the eszett problem affect the user?  This is mostly
an implementation problem.

BTW, if you want to get rid of "obsolete" typography then why not
replace "sch" with "sk" in English?  Why not rename this group into
comp.lang.skeme?  The shorter name will save hundreds of megabytes of
hard disk space and band width in the long run ;-)
0
voegelas (2)
11/13/2003 7:22:48 PM
Ray> With exactly fourteen exceptions - the ligatures lacking
Ray> altercase mappings and eszett itself - this entire file maps
Ray> single graphemes to single graphemes.

Sure, but I don't see any reason to believe that will always be true.
(Doesn't matter if it is always true, though, I still don't think 
your idea quite works out reasonably, see below).

Ray> With twenty-eight special characters, I can provide the "missing"
Ray> cases of the exceptions.  With twenty more, I can achieve
Ray> transitive closure within case-change operations.

The idea of making these special cases, especially for eszett, just
seems horribly hackish to me.

Doesn't it mean that you either have a new canonicalization problem
or strings that spontaneously change lengths in weird ways?

If a STRING-SET! of #\upper-case-eszett changes the length of the 
string, then integer string indexes are hosed.

If STRING-SET! _doesn't_ change the length of the string, then you
have a new canonicalization problem.  Even if a program is careful to
canonicalize all input and not use non-canonical forms internally,
with your system, I can wind up with two versions of a string, one
using upper-case-eszett the other using "SS", but STRING=? tells me
they're different and my trie library treats them as distinct keys.  A
simple database-style application could illustrate the problem.

A model that says "case mappings of graphemes are to strings" seems to
me more consistent with the structure of Unicode -- it's inconsistent
only with the idea of making CHAR? values into Unicode graphemes.

-t

0
lord1 (42)
11/13/2003 9:47:15 PM
Ray Dillinger writes:

> I know this isn't really something Germany did just to mess the rest
> of us up in our efforts to create some kind of sane API for
> character-based programs. [...]

All the proposals that I've seen so far are IMHO much too ambitious :-)

If I were you I would in general keep the lower case eszett and
provide extra functions for string handling (maybe via an SRFI).

When does it really matter that eszett is replaced with double s?

- When a text that is stored in lower or mixed case is output in upper
case, e.g. as a headline in a generated document.

- When two strings that contain a word that may be spelled differently
are compared, e.g. mixed case "Maße" compared to upper case "MASSE".

Are there any other cases that matter?

Most people that have to handle German text could probably live with
the following set of functions:

;;; string-upcase: convert a string from lower to upper case.  Eszett
;;; is replaced by double s.
(define (string-upcase s)
  (apply string-append
   (map (lambda (c)
	  (case c
	    ((#\ß) "SS")
	    (else (string (char-upcase c)))))
	(string->list s))))

;;; string-downcase: convert a string from upper to lower case.  No
;;; special conversion.
(define (string-downcase s)
  (list->string (map char-downcase (string->list s))))

;;; string-ci=?: compare two strings case-insensitively.
;;; string-upcase is NOT used (see below).
(define (string-ci=? a b)
  (string=? (string-downcase a) (string-downcase b)))
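
For example, with the definitions above these would evaluate roughly as
follows:

  (string-upcase "Maße")         ; => "MASSE"
  (string-downcase "MASSE")      ; => "masse"
  (string-ci=? "Maße" "MASSE")   ; => #f with these definitions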

As far as I can see, there are two problems:

Problem 1: String Comparison

"Ma�e" and "Masse" may be different words, i.e. "measure" and "mass".
But "Masse" may also be the ASCII spelling of "Ma�e".  It's not clear
whether string-ci=? should return #t or #f in the following case:

(string-ci=? "Ma�e" "Masse") => #f ???

The umlaut characters ä, ö and ü are another problem (AFAIK not
handled by Unicode at all) that may help to solve this problem.

For example, my surname is normally spelled with an "ö".  But the
ASCII spelling uses "oe" instead of "ö".

(string-ci=? "Vögele" "Voegele") => #f

I think that #f is correct.  Special cases like this ought to be
handled with special functions, e.g. an expression like (canonicalize
"Vögele") could be used to convert "Vögele" into "Voegele" before the
strings are compared:

(string-ci=? (canonicalize "Vögele") "Voegele") => #t

Thus (string-ci=? "Maße" "Masse"), which is a somewhat similar
problem, should return #f.

Of course, two strings can be converted into upper case explicitly
before they are compared:

(string=? (string-upcase "Maße") (string-upcase "Masse")) => #t

Problem 2: Conversion from Strings to Symbols and vice versa

The following expression does not return the same symbol if the given
symbol contains an eszett:

(string->symbol (string-upcase (symbol->string 'maße)))

Not really a problem, people ought to use char-upcase instead of
string-upcase if they (for whatever reason) want to make sure that the
name of a symbol is converted into upper case.
0
voegelas (2)
11/14/2003 12:58:13 AM
On Fri, 14 Nov 2003 01:58:13 +0100, Andreas Voegele <voegelas@gmx.net>
wrote:

>- When two strings that contain a word that may be spelled differently
>are compared, e.g. mixed case "Maße" compared to upper case "MASSE".

In fact, Unicode contains a mechanism for case folding (to facilitate
case-insensitive comparisons) that is _separate_ from case conversion
per se (see the CaseFolding.txt Unicode data file). So case-insensitive
comparisons are actually easier than true case conversion. (In effect,
what happens is that both upper- and lowercase strings are converted
into a "neutral" format. Some of the characters in those "neutral"
strings are uppercase, some lowercase, but it really doesn't matter;
upper- and lowercase versions of characters are mapped onto identical
"neutral" strings.)

>The umlaut characters ä, ö and ü are another problem (AFAIK not
>handled by Unicode at all) that may help to solve this problem.
>
>For example, my surname is normally spelled with an "ö".  But the
>ASCII spelling uses "oe" instead of "ö".
>
>(string-ci=? "Vögele" "Voegele") => #f
>
>I think that #f is correct.  Special cases like this ought to be
>handled with special functions, e.g. an expression like (canonicalize
>"Vögele") could be used to convert "Vögele" into "Voegele" before the
>strings are compared:
>
>(string-ci=? (canonicalize "Vögele") "Voegele") => #t

This would have to be locale-specific. While the ö -> oe
canonicalization is appropriate for German, it isn't necessarily
appropriate for other languages. (Although it's very rarely seen these
days, ö and other umlaut-enhanced vowels can even appear in English
text, where they are used to indicate that two adjacent vowels are in
separate syllables; e.g., "coördinate" or "reëxamine" or "naïve.")

Unicode acknowledges the existence of these kinds of ambiguities, and
distinguishes between weak, strong and exact matching. Any software that
purports to support Unicode, even if it doesn't support all of these
different kinds of matching, should at least be able to specify which
one(s) it does support.

-Steve

0
see94 (37)
11/14/2003 2:11:56 AM
On Thu, 13 Nov 2003 20:22:48 +0100, Andreas Voegele <voegelas@gmx.net>
wrote:

>BTW, if you want to get rid of "obsolete" typography then why not
>replace "sch" with "sk" in English?

Given that I have a personal vested interest in that sequence of
characters, I'd have to say that any change would be a Very Bad Idea. :)

-Steve

0
see94 (37)
11/14/2003 2:12:01 AM
"Bradd W. Szonye" <bradd+news@szonye.com> wrote in message news:<slrnbr7ki5.4oo.bradd+news@szonye.com>...
> Sure, but there are even better optimizations possible when you can
> guarantee a small input charset. For example, with a single-byte
> charset, you can use a simple 128- or 256-element array for state
> transitions, trading a bit of space for O(1)-time transitions. That
> doesn't work as well with a large charset, because you need to deal with
> range-checking, and you may need to use two-level data structures (at
> best) or general lists/trees/hashtables (at worst).

Ah, Gauche uses a 128-bit array for ASCII-range charsets.
The regexp compiler also takes advantage of it---if you compile the
regexp "[abc]", it runs exactly as if you were dealing with a
pure-ASCII charset.

If and only if the regexp contains a charset with chars
out of the ASCII range (a large charset) is some extra calculation done.
Even then, if the matching string is pure ASCII, the overhead is
a few integer comparisons.  It only matters when the
regexp has a large charset _and_ the matching string has multibyte
chars.

Even if the regexp contains multibyte chars, as long as they are
not in a charset, there's no overhead.  Similarly, even if the matching
string contains multibyte chars, there's no overhead unless the
regexp has a large charset.

That's because the regexp engine works on a byte stream.
Theoretically I could also compile the large-charset matching part to
a byte-stream DFA, but that'll be a trade-off between space and speed.

--shiro
0
shiro (31)
11/14/2003 2:38:18 AM
At Thu, 13 Nov 2003 18:49:43 GMT, Bradd W. Szonye <bradd+news@szonye.com> wrote:
> 
> I certainly don't understand how you can introduce more data and more
> complexity without degrading performance.

OK, I'll explain.  I recommend reading the Unicode standard for further
detail.

UTF-8 was designed specifically with the intent that it be backwards
compatible with ASCII.  Moreover, any byte in a UTF-8 char encoding
uniquely determines its position, so that the 2nd byte of a wide
character never occurs in the 3rd byte position.  This absence of
overlap means that in a UTF-8 string, if you search for any character
and find it, you are guaranteed to have found that character.  If you
see the letter "a" you know that it truly is just the letter "a" and
will be preceded and succeeded by valid characters.  If you see the
character encoding for, say, LAO_VOWEL_SIGN_MAI_KON then you can be
certain you found that character.  It doesn't matter if you search for
it a character at a time or a byte at a time.  Thus a
string-{index,contains,prefix?} search for UTF-8 strings within UTF-8
strings doesn't need to change or pay heed to the fact that you're
working with UTF-8 at all and can use simple byte matching.  The exact
same code for ASCII just works in this case for UTF-8.
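
A small sketch of why that is safe (illustrative only; it operates on a
list of byte values rather than a real byte string):

  ;; In UTF-8, the bytes of a multi-byte character are all >= #x80, so
  ;; a search for an ASCII byte can never match inside a wide character.
  (define (byte-index bytes target)
    (let loop ((b bytes) (i 0))
      (cond ((null? b) #f)
            ((= (car b) target) i)
            (else (loop (cdr b) (+ i 1))))))

  ;; "aßb" encodes as the bytes (97 195 159 98); searching for the byte
  ;; of #\b cannot be fooled by the 195/159 bytes of ß.
  (byte-index '(97 195 159 98) (char->integer #\b))   ; => 3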

Regular expressions are slightly more complicated because of two
extensions over basic string scanning:

  1) character classes, which may have wide characters
  2) the . pattern (really a special case of 1 using the class of all chars)

If a regexp doesn't contain either of these, you can fall back on normal
byte-matching.  If the only .'s in a regexp are followed by *'s, again
you can use byte matching.  For example:

  /foo(.*)bar/

works perfectly well using normal byte-matching because the matching of
"foo" and "bar" guarantee valid byte boundaries.  Likewise

  /(.*)foo(.*)bar(.*)/

also works with byte-matching because the greedy nature of regexps
ensures we match from the beginning and to the end of the string.
However with

  /foo.bar/

after matching foo we do need to read a single character (not a byte),
but this is a small cost in an isolated step.  Apart from that location
we can use byte matching.  This is also a much rarer pattern than
you might imagine, which I'll get back to in a moment.

The more general exception is character classes, but these too pose a
minimal problem.  Given

  /[a-h]/

we have a character class, and in UTF-8 characters can be wide, but in
this case we're only matching single byte characters, so the regexp as a
whole can work at the byte level.  With the complement

  /[^a-h]/

although the charset contains wide characters, any byte not in [a-h]
indicates a character not in [a-h] and we can once again use
byte-matching.

When you do have a real multi-byte character set such as

  /[:CyrillicUppercase:]/

then at that point in the regexp you do need to read by character and
then check for presence in the character set.  This more linguistic use
of a regexp implies you're dealing with linguistic text, specifically
you expect the possibility of Cyrillic characters, at which point you
need to work at a character level regardless and using a separate text
encoding won't help.  [Actually, there *are* further optimizations you
can take to get back to byte-matching in the above case, but we'll stick
to a simple compiler for now.]

The question is how often are the linguistic and single . cases used in
practice?  Regexps evolved from system languages like awk and sed, used
as quick and dirty parsers for specifically formatted files.  Largely
because of Perl they gained more wide-spread use as quick parsers for
Internet protocols and standards such as HTTP, HTML, RFC-822 date
formatting and the like.  They're compact and work most of the time for
such simple cases so are often better suited to the task than writing a
full parser.  However, all of the above formats are defined in terms of
ASCII delimiters, and so long as you match those delimiters the text in
between can be UTF-8 and the regexp doesn't change.  You tend to match
specifically on delimiters and arbitrary chunks in between, not worrying
about fixed length, with patterns like

  ^([^:]+):(.*)$

to match the first line of an RFC-822 header.  This is yet another
regexp that only needs byte matching.

When you write too much Perl code you start to see Perl regexps as a
massive sledgehammer for which everything is a nail.  Things that really
shouldn't be done as regexps end up being regexps.  Scanning some 200k
odd lines of in-house Perl code (pity me), we have an unimaginable
number of regexps.  This is in a system written before Perl had Unicode
regexp support, so we just treat everything as UTF-8 strings and use a
$RX_UTF8_CHAR sub-pattern whenever we need to match a single arbitrary
UTF-8 char.  In all that code that variable is used exactly _once_,
despite the fact that we're processing bilingual English/Japanese text.

Although understanding the above optimizations may not be simple, most
of the time it's just a matter of realizing you don't need to do
anything special, and thus the implementation is trivial.  The special
cases are very rare, easy to detect, and incur an almost insignificant
overhead.  It is *far* less complex than tagging strings and writing a
separate regexp engine for every single encoding.

> > Exactly.  It's very important for the procedures to be aware of the
> > encoding when optimizing.  Which makes it all the more powerful if you
> > guarantee a single consistent internal encoding.  Then not only does
> > the programmer not need to be aware of the encoding in everyday use,
> > he only needs to worry about one encoding when he optimizes.
> 
> No, he needs to worry about at least two: the system's internal
> encoding, and the encoding used in the program's inputs.

No, because the ports automatically convert to and from the internal
encoding so the programmer doesn't need to worry about that at all.

  [... snip continued discussion of conversion issues]

You're really confusing me here, because I was actually agreeing with
you and claiming there can be advantages to multiple internal encodings
(though regexps are not one of them).  Why are you being so
argumentative?  If that's what you really want then fine, for the sake
of argument, I'll take the opposite view.

Multiple internal encodings, apart from making your string scanners,
parsers, and regexp engines more complex, incurring more overhead for
basic operations, and making it much more difficult to implement new
low-level string routines because of the barrier to entry of writing for
multiple encodings, can in fact incur far worse performance than using a
single internal encoding.  With a single encoding all text coming into
or out of the system incurs a simple, efficient, one-time conversion.
With multiple encodings you have the danger of implicit conversions to
preferred encodings for different operations, which can be difficult for
the programmer to prevent.  The programmer may be tempted to convert and
optimize for a known fixed incoming charset, only to have clashes when
other parts of the code have optimized for the outgoing charset.  A
simple loop building an output string from alternate encoding sources
can force the accumulated string back and forth between encodings,
incurring O(n) conversions instead of just 2!

-- 
Alex

0
foof (110)
11/14/2003 3:18:58 AM
> Bradd W. Szonye <bradd+news@szonye.com> wrote:
>> I certainly don't understand how you can introduce more data and more
>> complexity without degrading performance.

Alex Shinn <foof@synthcode.com> wrote:
> OK, I'll explain.  I recommend reading the Unicode standard for
> further detail.
> 
> UTF-8 was designed specifically with the intent that it be backwards
> compatible with ASCII ....

Yes, I'm quite familiar with UTF-8 and its advantages.

> Thus a string-{index,contains,prefix?}  search for UTF-8 strings
> within UTF-8 strings doesn't need to change or pay heed to the fact
> that you're working with UTF-8 at all and can use simple byte
> matching.  The exact same code for ASCII just works in this case for
> UTF-8.

Right. UTF-8 is very good for simple string matching.

> Regular expressions are slightly more complicated because of two
> extensions over basic string scanning:
> 
>   1) character classes, which may have wide characters
>   2) the . pattern (really a special case of 1 using the class of all chars)
> 
> If a regexp doesn't contain either of these [or if all "." are
> actually ".*"], you can fall back on normal byte-matching.

Yes, UTF-8 is pretty good at dealing with pure ASCII inputs. However,
you're overlooking a couple of important facts:

First, while full Unicode support still isn't terribly common, the
256-character ISO8859 character sets are. So an internal UTF-8 encoding will
do pretty well for American English inputs, but they won't do nearly as
well for French, German, Greek, etc. -- there's a significant
performance bias toward American English. (The same is also true for
UCS-4 and Western European languages, but not to the same extent.)

Second, UTF-8 is generally regarded as a good encoding for disk storage
but not for internal encodings. It's very good for a few tasks, like
string searching, but very poor for other common tasks, like string
mutation, character indexing, and substring extraction. UCS-4 or an
ISO8859 charset are better general-purpose encodings.

Therefore, if a system has a single internal encoding, I would *not*
expect it to be UTF-8. It simply isn't the best general-purpose
encoding.

>>> It's very important for the procedures to be aware of the encoding
>>> when optimizing.  Which makes it all the more powerful if you
>>> guarantee a single consistent internal encoding.  Then not only does
>>> the programmer not need to be aware of the encoding in everyday use,
>>> he only needs to worry about one encoding when he optimizes.

>> No, he needs to worry about at least two: the system's internal
>> encoding, and the encoding used in the program's inputs.

> No, because the ports automatically convert to and from the internal
> encoding so the programmer doesn't need to worry about that at all.

Programmers do need to worry about the storage format, though. Maybe not
in the program itself, but it's important to the overall system design.
At the very least, he needs to know whether it's possible to convert
from the internal encoding to the required output encoding. (Problems in
that area might even disallow the use of the internal encoding, which is
very bad.)

> You're really confusing me here, because I was actually agreeing with
> you and claiming there can be advantages to multiple internal
> encodings (though regexps are not one of them).  Why are you being so
> argumentative?

Because you're making claims that I know to be untrue. You keep
insisting that programs can be just as efficient when they're forced to
use a generic internal encoding, with an infinite character set, despite
the facts that it requires a round-trip conversion and that the infinite
character set precludes some significant optimizations. That's rubbish.

> Multiple internal encodings, apart from making your string scanners,
> parsers, and regexp engines more complex ....

It doesn't make them more complex from the user's point of view. It's a
bit tougher for the system implementor, but the extra cost is not enough
to justify passing it on to users.

> incurring more overhead for basic operations ....

Encoding-parameterized text does not incur extra overhead for any
operation. This is one place where a smart compiler *can* make a
significant difference. See C++ for a significant example of prior art;
it uses metaprogramming (i.e., compile-time polymorphism) to eliminate
the overhead.

Even in a naive implementation, I wouldn't expect the overhead to exceed
the overhead of round-tripping the encoding from inputs to internal to
outputs.

> and making it much more difficult to implement new low-level string
> routines because of the barrier to entry of writing for multiple
> encodings ....

Again, that's a problem that's already been solved. All you need is a
solid low-level interface that works well for all known encodings. Shiro
touched on this with the "three level interface" idea. The lowest-level
interface lets you get at bytes, but you rarely need to use that. The
mid-level interface abstracts the character sets, shift states, etc., to
a point where you can write encoding-portable string manipulation
routines. The mid-level also provides the traits you need to take
advantage of the data optimizations I've been talking about. That
interface still isn't much fun to work with, so you build even
higher-level interfaces on top of that, to provide the end-user stuff
like regexes, string searching, etc.

> can in fact incur far worse performance than using a single internal
> encoding ....

How?! Normally, programs do one of two things. Many work entirely in the
"local" encoding, with no conversions and full optimization based on the
a priori information you get from knowing exactly which encoding you're
using. The rest convert all data into a single internal encoding chosen
by the *application designer*, not the compiler author, to best fit the
application's specific needs.

> With a single encoding all text coming into or out of the system
> incurs a simple, efficient, one-time conversion.

Exactly the same thing happens in the parameterized case, except that
the internal encoding is chosen by the guy who knows the application
domain. It isn't forced into some general-purpose encoding chosen by an
optimistic compiler writer.

> With multiple encodings you have the danger of implicit conversions to
> preferred encodings for different operations, which can be difficult
> for the programmer to prevent ....

Not if you do it right. You're talking about what happens when a
clueless programmer makes a hash of things (with help from a
poorly-designed system library that encourages it). Programmers who know
what they're doing will either choose a single internal encoding *of
their own choice*, or they'll stick to the "local" encoding throughout.

> The programmer may be tempted to convert and optimize for a known
> fixed incoming charset, only to have clashes when other parts of the
> code have optimized for the outgoing charset.

Why would they do that? Either the input and output are the same, or the
program uses an efficient internal encoding of the application
designer's choosing. Again, you're talking about bad practices.

> A simple loop building an output string from alternate encoding
> sources can force the accumulated string back and forth between
> encodings, incurring O(n) conversions instead of just 2!

Which is why you don't do that. Forcing a particular internal encoding
on the developer prevents this, but that's just a weak solution that is
better addressed with proper training and coding standards.
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
0
news152 (508)
11/14/2003 9:32:40 AM
Tom Lord wrote:
> 
> Ray> With exactly fourteen exceptions - the ligatures lacking
> Ray> altercase mappings and eszett itself - this entire file maps
> Ray> single graphemes to single graphemes.
> 
> Sure, but I don't see any reason to believe that will always be true.

I do.  The Unicode consortium has announced its intention to admit
no more characters in that class, ever.


> The idea of making these special cases, especially for eszett, just
> seems horribly hackish to me.

I don't deny it.  I'm not really satisfied with the solution.  
But unicode already has thousands of unassigned codepoints and
invalid combining sequences and stuff, so the idea of an "invalid" 
character was already there.  And they had a special-use area 
for the implementations to use, so spots were available. All 
I'm really doing is using some implementation-defined characters 
as pivot values.
 
> Doesn't it mean that you either have a new canonicalization problem
> or strings that spontaneously change lengths in weird ways?

The way it works now, strings never change length except when 
you explicitly canonicalize them.  Upcase them, downcase them, 
titlecase them, etc, as you like, but the number of graphemes 
never changes until and unless you canonicalize the string.  In 
the case of those fourteen characters (and eszett is the only 
one of them that can even appear in a canonicalized string), you 
may get some quasicharacters from invalid case changes, but the 
length changes only when you canonicalize the string. 

There are some limitations: eg, string=? is guaranteed to work 
correctly only on canonical strings.  
 
> If STRING-SET! _doesn't_ change the length of the string, then you
> have a new canonicalization problem.  Even if a program is careful to
> canonicalize all input and not use non-canonical forms internally,
> with your system, I can wind up with two versions of a string, one
> using upper-case-eszett the other using "SS", but STRING=? tells me
> they're different and my trie library treats them as distinct keys.  A
> simple database-style application could illustrate the problem.

Right.  And right, I address it with an explicit canonicalization
operation.  upper-case-eszett is not a "real" character, it's a 
place marker for a spot that will get two "s" characters inserted 
in canonicalization. Your trie library should be called only on 
canonical strings. 
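
A toy sketch of that canonicalization step might look like the following,
assuming the place marker lives at some private-use code point (#xE000
here is purely illustrative, not the actual value, and character support
above ASCII varies between Scheme implementations):

  (define upper-eszett-marker (integer->char #xE000))  ; hypothetical

  ;; Replace each place marker with the two characters "SS"; everything
  ;; else is copied through, so the length changes only here.
  (define (canonicalize str)
    (list->string
     (let loop ((cs (string->list str)) (out '()))
       (cond ((null? cs) (reverse out))
             ((char=? (car cs) upper-eszett-marker)
              (loop (cdr cs) (cons #\S (cons #\S out))))
             (else (loop (cdr cs) (cons (car cs) out)))))))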

Note that by giving people direct access to the bits-n-bytes 
of representation, your proposal introduces the power to create 
malformed sequences, and therefore also lets itself in for 
canonicalization issues.  Nobody working with unicode will ever 
completely escape the need to do canonicalization, I think. 

Although at this point I'm convinced that your most recent proposal
(which I've replied to in another thread) is better-designed 
than what I've done so far.  Good work.  

I will point out though that for ridiculously large texts (the 
hundred-megabyte text that holds someone's novel for example), 
it will in the long run be cheaper and more efficient to keep 
count of locations at the grapheme level *ONLY* -- keeping track 
of codepoints as you go back and forth between cases and 
transformations of various kinds will quickly become maddening
or suck up time linear in the size of the novel for each operation. 

				Bear

0
bear (1219)
11/14/2003 9:50:11 AM
Andreas Voegele <voegelas@gmx.net> wrote:
> When does it really matter that eszett is replaced with double s?
> - When a text that is stored in lower or mixed case is output in upper
> case, e.g. as a headline in a generated document.
> - When two strings that contain a word that may be spelled differently
are compared, e.g. mixed case "Maße" compared to upper case "MASSE".
> 
> Are there any other cases that matter?

Not if you're careful. Moral of the story: Text processing is a bit like
working with inexact numbers. Just as you need to be careful about when
and how you round off numbers, you also need to be careful not to do
unnecessary case-changing. In particular, round trips are a bad idea.

> Problem 1: String Comparison
> 
> "Ma�e" and "Masse" may be different words, i.e. "measure" and "mass".
> But "Masse" may also be the ASCII spelling of "Ma�e".  It's not clear
> whether string-ci=? should return #t or #f in the following case:
> 
> (string-ci=? "Ma�e" "Masse") => #f ???

There's no perfect answer to this, at least not without true natural
language processing. This isn't much different from asking

    (string-ci=? "resume" "r�sum�") => #f ???

in an English-language locale. Are these alternate spellings of the same
word, meaning "curriculum vitae"? Or are they two different words, with
the first one meaning "continue"? English collation rules say that both
words sort identically, but that's not the same as saying that they're
both the same word.

The spelling doesn't even need to differ. You can run into the same
problem with any homonym; how about

    (string-ci=? "bear" "bear") => #f ???

when one word means "carry" and the other means "ursine mammal"? To a
human reader, they're different words, but a computer can't understand
that without natural language comprehension.

In short: I wouldn't worry about this. Computers aren't intelligent
enough to do the right thing anyway, so just try to minimize the
potential surprises to users.

> The umlaut characters ä, ö and ü are another problem (AFAIK not
> handled by Unicode at all) that may help to solve this problem.
> 
> For example, my surname is normally spelled with an "ö".  But the
> ASCII spelling uses "oe" instead of "ö".
> 
> (string-ci=? "V�gele" "Voegele") => #f

This is just another version of the same problem. Although I think this
particular sub-problem may already be "solved" by prior art (so far as
you can solve it without natural language comprehension).

> Problem 2: Conversion from Strings to Symbols and vice versa
> 
> The following expression does not return the same symbol if the given
> symbol contains an eszett:
> 
> (string->symbol (string-upcase (symbol->string 'maße)))
> 
> Not really a problem, people ought to use char-upcase instead of
> string-upcase if they (for whatever reason) want to make sure that the
> name of a symbol is converted into upper case.

Better solution: Avoid round-trip case changes like the plague, just as
you'd avoid multiple rounding in a mathematical formula.
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
0
news152 (508)
11/14/2003 10:19:27 AM
> "Bradd W. Szonye" wrote:
>> Sure, but there are even better optimizations possible when you can
>> guarantee a small input charset. For example, with a single-byte
>> charset, you can use a simple 128- or 256-element array for state
>> transitions, trading a bit of space for O(1)-time transitions. That
>> doesn't work as well with a large charset ....

Shiro Kawai <shiro@acm.org> wrote:
> Ah, Gauche uses a 128-bit array for the ASCII range charset. The regexp
> compiler also takes advantage of it---if you compile regexp "[abc]", it
> runs exactly as if you were dealing with a pure-ASCII charset.

Yes, you can approach the efficiency of a small-charset regex machine if
you're careful. It helps if you stick to ASCII. It starts getting ugly
once you introduce ISO 8859 characters -- the trade-offs are different
for UTF-8 and UCS-4, but either way there are some problems.

What bothers me most about a privileged internal encoding, however, is
that there is no single encoding that's universally superior for all
text-processing tasks. This regex stuff is just one concrete example.
ISO8859-1 is much better for regex machines than Unicode, if you're
working solely with Western European inputs, and the bias is even more
obvious for ISO8859-7 (which is split into two ranges in Unicode).

But that's not the only encoding trade-off. For example, UTF-8 is
generally superior to UCS-4 for simple string matching, IIRC, but UCS-4
is far superior for applications which do a lot of string indexing and
character manipulation. (And again, if your inputs are purely
Latin-based, ISO 8859 charsets are superior to both. I suspect that the
same is true for purely Japanese or purely Korean inputs.)

Generality is good, but it's also costly when you don't need it. And
it's especially costly when some language designer decides to hide the
actual implementation from you. For all of these reasons, I'm strongly
opposed to a text-processing library with a "preferred" internal
encoding, especially one with an infinite character set!
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
0
news152 (508)
11/14/2003 10:35:10 AM
Ray Dillinger <bear@sonic.net> wrote:
>>> With exactly fourteen exceptions - the ligatures lacking altercase
>>> mappings and eszett itself - this entire file maps single graphemes
>>> to single graphemes.

> Tom Lord wrote:
>> Sure, but I don't see any reason to believe that will always be true.

> I do.  Unicode consortium has announced their intention to admit no
> more characters in that class ever. 

Be careful here; they don't have authority over natural languages and
how we write them. While I don't see it happening in the near future,
there's really nothing to stop (for example) Americans from turning "th"
into a hard ligature with no uppercase version, like esszed. If that
happens, the Unicode folks have only two choices: accept the character
or become another useless, universally-ignored standards body.
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
0
news152 (508)
11/14/2003 10:44:01 AM
Alex Shinn <foof@synthcode.com> writes:
> UTF-8 was designed specifically with the intent that it be backwards
> compatible with ASCII.  Moreover any byte in a UTF-8 char encoding
> uniquely guarantees its position, so that the 2nd byte of a wide
> character never occurs in the 3rd byte position. 

This can't be right, since it means that escape chars have no useful function.

The UTF-8 RFC at http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2279.html
describes how the US-ASCII chars are indeed never used in a wide char encoding
sequence, and the first byte of a wide char also never shows up in remaining
bytes of an encoded char.

However, the 2nd, 3rd bytes, etc., can indeed overlap. The encoding is nicely
summarized by this table:

   UCS-4 range (hex.)     UTF-8 octet sequence (binary)
   0000 0000-0000 007F    0xxxxxxx
   0000 0080-0000 07FF    110xxxxx 10xxxxxx
   0000 0800-0000 FFFF    1110xxxx 10xxxxxx 10xxxxxx
   
   0001 0000-001F FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
   0020 0000-03FF FFFF    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
   0400 0000-7FFF FFFF    1111110x 10xxxxxx ... 10xxxxxx

which clearly shows overlap.

> This absence of overlap means in a UTF-8 string, if you search for any
> character and find it, you are guaranteed to have found that character.  If
> you see the letter "a" you know that it truly is just the letter "a" and
> will be preceded and succeeded with valid characters.

This is still true, which is of course the important point. The trick is to
search always for all the bytes of a character at once, and never for single
bytes of the form 10xxxxxx.
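
In code the test is a one-liner; here is a small sketch (helper names are
invented for illustration, with bytes represented as integers 0-255) of
how a byte-level search can keep matches on character boundaries:

  ;; A UTF-8 continuation byte has the bit pattern 10xxxxxx, i.e. its
  ;; value lies in [#x80, #xBF].
  (define (utf8-continuation-byte? b)
    (and (>= b #x80) (< b #xC0)))

  ;; Position i in a byte vector is a character boundary if it is at
  ;; either end or does not fall on a continuation byte; a search can
  ;; reject candidate matches whose start or end is not a boundary.
  (define (utf8-character-boundary? bytes i)
    (or (= i 0)
        (= i (vector-length bytes))
        (not (utf8-continuation-byte? (vector-ref bytes i)))))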

-- 
Cheers,                                        The Rhythm is around me,
                                               The Rhythm has control.
Ray Blaak                                      The Rhythm is inside me,
rAYblaaK@STRIPCAPStelus.net                    The Rhythm has my soul.
0
rAYblaaK (362)
11/14/2003 6:15:19 PM
At Fri, 14 Nov 2003 09:32:40 GMT, Bradd W. Szonye <bradd+news@szonye.com> wrote:
> 
> Alex Shinn <foof@synthcode.com> wrote:
> > 
> > If a regexp doesn't contain either of these [or if all "." are
> > actually ".*"], you can fall back on normal byte-matching.
> 
> Yes, UTF-8 is pretty good at dealing with pure ASCII inputs. However,
> you're overlooking a couple of important facts:
> 
> First, while full Unicode support still isn't terribly common, the
> 256-byte ISO8859 character sets are. So an internal UTF-8 encoding will
> do pretty well for American English inputs, but they won't do nearly as
> well for French, German, Greek, etc. -- there's a significant
> performance bias toward American English. (The same is also true for
> UCS-4 and Western European languages, but not to the same extent.)

The optimizations above hold no matter what kind of text you use.

> Second, UTF-8 is generally regarded as a good encoding for disk storage
> but not for internal encodings.

This is simply not true as UTF-8 has been adopted as the internal
encoding for languages such as TCL and Perl, libraries such as Gtk and
SDL, and network encodings such as XML.

> You keep insisting that programs can be just as efficient when they're
> forced to use a generic internal encoding, with an infinite character
> set, despite the facts that it requires a round-trip conversion and
> that the infinite character set precludes some significant
> optimizations. That's rubbish.

I was insisting regular expressions can be just as efficient, which they
can, and such optimizers already exist for this in Perl and Gauche at the
least.  The engine is just as efficient, though granted the strings you
are matching may be slightly longer in UTF-8.

> > Multiple internal encodings, apart from making your string scanners,
> > parsers, and regexp engines more complex ....
> 
> It doesn't make them more complex from the user's point of view. It's a
> bit tougher for the system implementor, but the extra cost is not enough
> to justify passing it on to users.

OK, "user" is ambiguous, as a programmer I consider myself a user of the
language.  I want to be able to write libraries such as regular
expressions and efficient text buffers without having to write
implementations for multiple internal encodings.  You are correct that a
clean API for working at different character levels will help with this,
but optimization becomes more difficult.

> > can in fact incur far worse performance than using a single internal
> > encoding ....
> 
> How?! Normally, programs do one of two things. Many work entirely in the
> "local" encoding, with no conversions and full optimization based on the
> a priori information you get from knowing exactly which encoding you're
> using. The rest convert all data into a single internal encoding chosen
> by the *application designer*, not the compiler author, to best fit the
> application's specific needs.

The problem is when you have multiple internal encodings you can get
mismatches.  The system default encoding that things like symbols get
encoded in is probably going to be different from the specific local
encoding you want to optimize your application for.  If the default
encoding is the same, then the system as a whole is only optimized for
that one locale.  When you perform operations on strings in two
different encodings you either need to do a conversion first, or loop
sequentially through them at the highest level of "character"
definition.  Even an inefficient single character encoding is going to
beat this because it can use byte-level matching.

A better way to handle optimizations for legacy systems is to do what
Gauche does and allow compile-time specification of the single internal
encoding.

> > A simple loop building an output string from alternate encoding
> > sources can force the accumulated string back and forth between
> > encodings, incurring O(n) conversions instead of just 2!
> 
> Which is why you don't do that. Forcing a particular internal encoding
> on the developer prevents this, but that's just a weak solution that is
> better addressed with proper training and coding standards.

You don't necessarily have control over the intermediate
representations.  You've already got the difference between the default
encoding and the local encoding, plus you've got to worry about the
encodings of foreign libraries and network protocols (which are tending
towards UTF-8 and other locale-neutral encodings).  Even working within
one locale you may have disagreements on what the preferred encoding is,
and while writing an optimized Russian language application using
ISO-8859-5 you may want to use someone else's Russian library that
optimized for KOI8-R.  With a single internal encoding you always have a
worst case conversion cost of 2 (zero when working with your own data
and libraries), but with multiple internal encodings you have an
unbounded worst case scenario.

-- 
Alex

0
foof (110)
11/17/2003 2:19:54 AM
At Fri, 14 Nov 2003 10:44:01 GMT, Bradd W. Szonye <bradd+news@szonye.com> wrote:
> 
> Ray Dillinger <bear@sonic.net> wrote:
> 
> > I do.  Unicode consortium has announced their intention to admit no
> > more characters in that class ever. 
> 
> Be careful here; they don't have authority over natural languages and
> how we write them. While I don't see it happening in the near future,
> there's really nothing to stop (for example) Americans from turning "th"
> into a hard ligature with no uppercase version, like esszed. If that
> happens, the Unicode folks have only two choices: accept the character
> or become another useless, universally-ignored standards body.

No, no, we need the uppercase version, otherwise we couldn't spell "The
Holy Ghost" :)

But as has already been pointed out, ligatures are a font distinction,
not a character set distinction.  Even if new ligatures are created, no
one these days would add a ligature to a character set; it's only for
backwards compatibility that they exist at all.

-- 
Alex

0
foof (110)
11/17/2003 2:25:23 AM
lucier@math.purdue.edu (Brad Lucier) wrote in message news:<77776c11.0311101033.63194e64@posting.google.com>...
> I'll be more explicit here about what these computations tell me about
> mzscheme's implementation of bignums and rationals and mzscheme's
> integration of gmp into its runtime library.

Thanks very much for your in-depth analysis. It seems clear that
number crunchers should use the beta Gambit, but I'll accept simple
improvements to MzScheme any day.

Scott Owens and I took a day last week to incorporate your
improvements. We changed rational multiplication and addition as you
suggested, we started using GMP's gcd function, we upgraded the GMP
sqrt function to a newer version, and we made `integer-sqrt'
available.

Below are benchmark results comparing MzScheme 205 to 205.6 on my Mac
(733 MHz G4):

 * The "+" in rows below show where MzScheme improved significantly,
   and I think they include all of places that you predicted.

 * Comparing your benchmarks to mine, it looks like MzScheme is now
   within a small constant factor of Gambit's performance for these
   particular tests. Except for the GC-intensive tests (marked "gc"
   below), my 205 numbers are all about half yours (due to
   hardware, I assume), so I think I'm measuring correctly. For
   GC-intensive functions, my numbers differ, but that's not
   surprising. (A "gc/+" annotation means that the test used to be
   GC-intensive, but isn't anymore.)

 * The last test (using `integer-sqrt') is suspicious. It takes a
   fraction of a second to compute in MzScheme, while Gambit takes
   several minutes. As far as I can tell, I copied the test case
   correctly and MzScheme is computing correctly, so I'm stuck.


> This, to me, indicates a design problem that cannot be fixed just by
> going back and adding the few hundred lines of code to implement the
> changes I suggested above.  Design is not about fixing bugs and
> inefficiencies; it's about organizing and thinking about the code in a
> way that lessens the likelihood of such things coming up in the first
> place.

I don't understand what you're getting at in this context. The
problems that you pointed out are all about domain knowledge for
numbers, and not about program architecture.

If, as the architect, I don't know fast algorithms for square root,
there's no way I'm going to design a program to make a slow
square-root function less likely. The best that a design can do is to
> easily accommodate algorithms supplied by a domain expert, and our
design seems to be doing that. Is there some reason that other
algorithmic improvements could require a different design?

In any case, I'm certain that there are many places where MzScheme
could use better algorithms, and I'm happy to be informed by an
expert!

Matthew

----------------------------------------

Formula                                               CPU times (ms)
                                            mzscheme 205  mzscheme 205.6

(expt 3 1000000)  ; a                            2370          750  +
(expt 3 1000001)  ; b                            2380          750  +
(* a a)           ; c                            1510         1320
(* a b)           ;                              2200         1840
(quotient c a)    ;                              5260         4550
(sqrt c)          ;                             25820         3250  +
(fib 100000)      ; a, note 1                   10610        14400  gc
(fib 100001)      ; b                           12430        15730  gc
(gcd a b)         ;                             18990          200  gc/+
(gcd a b)         ; a=3^100000, b=2^100000      19120            0  gc/+
(expt1 3 1000000) ; note 2                        860          750
(expt2 3 1000000) ; note 3                       3070         2620
(* a a)           ; a=3^1000000                  1500         1320
(expt 10 10000000); a                          244420        40300  +
(expt 2 10000000) ; b                           40000           30  +
(quotient a b)    ;                           1502110           80  +
(expt 2/3 10000)  ; a                             590            0  gc/+
(expt 3/5 10000)  ; b                            1000            0  gc/+
(* a b)           ;                               590            0  +
(fib 1000)        ; note 4                       4040           90  +
(factorial 10000) ; note 5                        870         1150  gc
(partial-factorial 0 10000) ; note 6              100           90
(binary-splitting-compute-e 1000) ; note 7        610          100  +
(naive-compute-e 1000) ; note 8                113390         6640  gc/+
(binary-splitting-compute-pi 1000) ; note 9      1900          180  gc/+
(pi-brent-salamin) ; n. 10, beta^k=10^1000000       -      4290880  
(pi-brent-salamin) ; beta^k=2^33219                 -          270  


Version 205.6 is available as a "nightly build":
    http://download.plt-scheme.org/scheme/
0
mflatt (38)
11/22/2003 1:42:37 PM
Brad Lucier wrote:
> gcd is based on a lot of quotients, 

Why isn't Josef Stein's algorithm used?

It uses only halving and subtraction.

-- 
Jens Axel Søgaard

0
usenet153 (246)
11/22/2003 6:57:52 PM
Jens Axel Søgaard <usenet@jasoegaard.dk> writes:

> Brad Lucier wrote:
>> gcd is based on a lot of quotients,
>
> Why isn't Josef Stein's algorithm used?
>
> It uses only halving and subtraction.

I played around with it.  It is a lot more complicated, but it would
take work to make it run faster than Euclid's algorithm in Scheme.
(There is just too much overhead in testing even and odd and shifting
when you need to tag intermediate results.)
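
For reference, a direct Scheme rendition of Stein's algorithm (a sketch
for illustration, not the code benchmarked above) looks like this; every
step uses only parity tests, halving and subtraction:

  (define (binary-gcd a b)
    ;; gcd of non-negative exact integers by Stein's method.
    (cond ((zero? a) b)
          ((zero? b) a)
          ((and (even? a) (even? b))
           (* 2 (binary-gcd (quotient a 2) (quotient b 2))))
          ((even? a) (binary-gcd (quotient a 2) b))
          ((even? b) (binary-gcd a (quotient b 2)))
          ((>= a b)  (binary-gcd (quotient (- a b) 2) b))
          (else      (binary-gcd (quotient (- b a) 2) a))))
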
0
jrm (1310)
11/24/2003 3:40:41 PM
Bradd wrote:
>> Be careful here; [the Unicode Consortium doesn't] have authority over
>> natural languages and how we write them. While I don't see it
>> happening in the near future, there's really nothing to stop (for
>> example) Americans from turning "th" into a hard ligature with no
>> uppercase version, like esszed. If that happens, the Unicode folks
>> have only two choices: accept the character or become another
>> useless, universally-ignored standards body.

Alex Shinn <foof@synthcode.com> wrote:
> No, no, we need the uppercase version, otherwise we couldn't spell
> "The Holy Ghost" :)

Heh.

> But as has already been pointed out, ligatures are a font distinction,
> not a character set distinction.

Sometimes they're both, as is the case with esszed. That's what I was
referring to when I called it a "hard ligature." For another example
that doesn't require ligatures at all, consider what would happen if
English returned to using the "thorn" character for the "th" sound, but
only in title case (like "Ye Olde Shoppe" -- the Y in "Ye" is actually a
thorn).

> Even if new ligatures are created, no one these days would add a
> ligature to a character set, it's only for backwards compatibility
> that they exist at all.

I'm guessing that no sane programmer would endorse that kind of
character-set wackiness, but not everyone's a programmer.
-- 
Bradd W. Szonye
http://www.szonye.com/bradd
0
news152 (508)
11/26/2003 9:42:20 AM
Matthew:

Re:

> It seems clear that number crunchers should use the beta Gambit ...

You are a gentleman and a scholar, as my grade-nine religion teacher would
say (no smiley).  You are too generous to Gambit.  Here are the times on a
2GHz G5:

Formula                                               CPU times (ms)
                                            beta Gambit-C      mzscheme 205.7

(expt 3 1000000)  ; a                             180              260
(expt 3 1000001)  ; b                             320              260
(* a a)           ; c                             230              500
(* a b)           ;                               280              640
(quotient c a)    ;                              1360             1690
(sqrt c)          ;                              2730             1150
(fib 100000)      ; a, note 1                    1420             2220
(fib 100001)      ; b                            1340             2860 (gc)
(gcd a b)         ;                              9490               70
(gcd a b)         ; a=3^100000, b=2^100000          0                0
(expt1 3 1000000) ; note 2                        190              260
(expt2 3 1000000) ; note 3                       1180              950
(* a a)           ; a=3^1000000                   220              470
(expt 10 10000000); a                            7140            14830
(expt 2 10000000) ; b                              40               10
(quotient a b)    ;                                30               30
(expt 2/3 10000)  ; a                               0                0
(expt 3/5 10000)  ; b                              10                0
(* a b)           ;                                 0                0
(fib 1000)        ; note 4                          0              160
(factorial 10000) ; note 5                        470              290
(partial-factorial 0 10000) ; note 6               60               40
(binary-splitting-compute-e 1000) ; note 7        220              230
(naive-compute-e 1000) ; note 8                 18800             3250
(binary-splitting-compute-pi 1000) ; note 9       590              250
(pi-brent-salamin) ; n. 10, beta^k=10^1000000  271430          1278890
(pi-brent-salamin) ; beta^k=2^3321929          183860           205170

The (gc) note means that there was an extra 800ms of gc for that computation
compared to the previous one.  There was a transcription error in my previous
posts about the value of beta^k when beta=2; that's been corrected here.
(The two values of beta^k give roughly the same number of correct digits
of pi.)

The main thing I see here is that Gambit needs a better algorithm for gcd,
especially if one wants to do serious rational arithmetic.  It could also use
a better sqrt, but the one it has can be improved only by a constant factor
so I don't think I'm going to worry about it.

Re:

> > This, to me, indicates a design problem that cannot be fixed just by
> > going back and adding the few hundred lines of code to implement the
> > changes I suggested above.  Design is not about fixing bugs and
> > inefficiencies; it's about organizing and thinking about the code in a
> > way that lessens the likelihood of such things coming up in the first
> > place.
>
>
> I don't understand what you're getting at in this context. The
> problems that you pointed out are all about domain knowledge for
> numbers, and not about program architecture.

What I mean by "design" is mainly what questions were asked about the goals
of the (sub-)program and what things that program should achieve.  For
example, I can imagine the following rough questions being asked about
numerics:

1.  Do we want to support more than fixnums and flonums, i.e., the complete
    numeric tower?
    No.  Stop here.  (Bigloo, Stalin, Chicken?)
    Yes. Go to next questions.
    (a) Do we want to have a high-performance bignum and ratnum
        implementation?
        No.  Stop here. (Gambit 3.0)
        Yes. Go to next questions. (mzscheme, beta Gambit)
        (i) Do we write it ourselves or use some existing code?
            Write it ourselves ... (New questions) (beta Gambit)
            Use someone else's code. (mzscheme)
            (I) Do we use someone else's code for everything?
                Yes
                No.  (mzscheme)

I feel that any trip through this decision tree leads to a valid design,
up to this point.

At this point it was starting to get confusing to me when I tried mzscheme.
*Any* implementor of a programming language with arbitrary-sized integers
and rationals will need at least *some* "domain knowledge for numbers".
Nearly all the examples I gave in my original post arose from the following
questions about how gmp was integrated into mzscheme.

(A) GCD's are expensive.  Where can we eliminate the need for them in
    rational arithmetic?

(B) It is faster to square a number rather than multiply two different
    numbers of the same size.  Where can we take advantage of that?

(C) In binary arithmetic, various operations on numbers divisible by high
    powers of two (e.g., multiplication) can trivially be sped up tremendously.
    Where can we take advantage of that?

Now, I would expect the premises of (A) and (C) to be known to any programmer
who has programmed anything numerical; (B) might be more obscure.

So it is a question of how someone else's bignum implementation is integrated
into a scheme.  If one relies, e.g., on gmp for all bignum and rational ops,
then one can defer to the gmp developers, who really *are* experts (I'd
rather characterize myself as an enthusiastic amateur).  However, if one
is going to use gmp only for the base ops and roll one's own more complex
operations, then the question of how these are integrated is a valid one.
The question in (A) is covered in the first section of Chapter 4.5 of
Knuth's Seminumerical Algorithms on rational arithmetic, for example.
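
As an example of the kind of saving meant in (A), here is a sketch
(following the idea in that section of Knuth, not MzScheme's or Gambit's
actual code) of adding two rationals already in lowest terms: when the
denominators are coprime no reduction of the result is needed at all, and
otherwise only gcds of smaller numbers are taken.

  ;; Add u/up + v/vp, both assumed reduced with positive denominators.
  ;; Returns a (numerator . denominator) pair in lowest terms.
  (define (rational-add u up v vp)
    (let ((d1 (gcd up vp)))
      (if (= d1 1)
          (cons (+ (* u vp) (* v up)) (* up vp))     ; already reduced
          (let* ((t  (+ (* u (quotient vp d1)) (* v (quotient up d1))))
                 (d2 (gcd t d1)))
            (cons (quotient t d2)
                  (* (quotient up d1) (quotient vp d2)))))))

  ;; (rational-add 1 3 1 4) => (7 . 12)    ; coprime denominators
  ;; (rational-add 1 6 1 4) => (5 . 12)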

Re:

> If, as the architect, I don't know fast algorithms for square root,
> there's no way I'm going to design a program to make a slow
> square-root function less likely. The best that a design can do is to
> easily accommodate algorithms supplied by a domain expert, and our
> design seems to be doing that.

I'm in no way critical of mzscheme because it wasn't using an asymptotically
optimal sqrt algorithm; I was critical of it because I could not see that
gmp's algorithms were properly integrated into its number system, and
trivial things that appear in the first paragraphs of Knuth's chapter on
rational arithmetic are not used. 

In the context of this post, I'd like to reply to Shriram's:

> > This, to me, indicates a design problem that cannot be fixed just by
> > going back and adding the few hundred lines of code to implement the
> > changes I suggested above.
> 
> Thanks for your detailed comments.  I am surprised by your conclusion,
> though.  Assuming your analysis is fairly thorough, why wouldn't
> "going back and adding a few hundred lines of code" not do the trick?

Because *no* analysis is ever complete (I'll leave aside "fairly thorough")
and without the questions behind the analysis, it can never be improved.
For example, I originally missed the optimization

gcd(2^k alpha, 2^j beta)=2^min(j,k) gcd(alpha,beta)

and another one for adding rational numbers whose denominators are relatively
prime.  The gcd optimization was probably picked up by using the new gcd
from gmp, but I don't know if the optimization for adding rationals was
added to mzscheme.  I am not trying to be sarcastic, but the logical limit of
my understanding of what you're saying is: Here is a design for bignum and
rational arithmetic:

;;; begin code
;;; end code

Now send me bug reports and I'll fill in the rest.

Kevin Ryde said:

> > gcd is based on a lot of quotients,

> That's not usually the best approach.

and he's right, I learned a lot about gcd algorithms since that post.  My
understanding is that the gcd in gmp-4.1.2 is asymptotically O(N^2) in the
number of bits, the same as the Euclidean algorithm, but it is much faster
in practice.

Brad
0
lucier (68)
12/3/2003 10:41:54 PM
lucier@math.purdue.edu (Brad Lucier) writes:

> So it is a question of how someone else's bignum implementation is integrated
> into a scheme.  If one relies, e.g., on gmp for all bignum and rational ops,
> then one can defer to the gmp developers, who really *are* experts (I'd
> rather characterize myself as an enthusiastic amateur).  

Though do note this interesting article by Jaffer, which the peerless
Jens Axel recently pointed out on this forum:

  http://www.swiss.ai.mit.edu/~jaffer/CNS/DIMPA

As always, inexpert users can undo the good intentions of the expert
developers.

Shriram
0
sk1 (223)
12/4/2003 4:21:10 AM
Shriram Krishnamurthi wrote:

> Though do note this interesting article by Jaffer, which the peerless
> Jens Axel recently pointed out on this forum:
> 
>   http://www.swiss.ai.mit.edu/~jaffer/CNS/DIMPA

For the record I completely agree with Jaffer's analysis that for many
programs manipulating integers, even symbolic math packages like
JACAL, the vast majority of exact integers manipulated are rather
small and often fit in a fixnum or a minimal size bignum.  This means
that for good performance it is essential to treat the fixnum-fixnum
and fixnum-smallbignum cases first ("inline" in a certain sense), and
only perform a call to the more general bignum-bignum routines when
there is no alternative.

Gambit's numerical library is entirely written in Scheme (extended
with primitives to manipulate the bignum internal representation, such
as extracting a big-digit, adding two big-digits, etc).  The numerical
procedures dispatch on the representation of the parameters (fixnum,
bignum, flonum, ratnum, cpxnum) to select the appropriate operation.
The fixnum-fixnum cases are usually handled by inlined fixnum
primitives and a check for overflow or other special condition.  The
bignum routines are only called when a fixnum-fixnum operation would
overflow or when one of the arguments is a bignum.  The main
difference with SCM (aside from the implementation language which
is C in SCM) is that the bignum routines use fancy algorithms
(for example multiplication uses the naive O(n^2) algorithm, or
Karatsuba's divide-and-conquer or FFT depending on the size of
operands).  This means that Gambit can be efficient at integer
arithmetic when integers are relatively small (but not only fixnums)
and when manipulating huge integers (applications in number theory).
By doing everything in Scheme, Gambit also has a small overhead for
calling the bignum routines (i.e. it is not a Scheme-to-C call and no
expensive representation conversion is needed).
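
A rough sketch of that dispatch pattern (for illustration only, not
Gambit's actual code; the fixnum range and helper names are invented) is
below: try the cheap fixnum path first and fall through to the general
bignum routine only when an operand or the result leaves the fixnum range.

  (define fixnum-max (- (expt 2 29) 1))   ; illustrative 30-bit fixnums
  (define fixnum-min (- (expt 2 29)))

  (define (in-fixnum-range? n)
    (and (exact? n) (integer? n) (<= fixnum-min n fixnum-max)))

  (define (bignum-add x y)                ; stand-in for the slow path
    (+ x y))

  (define (generic-add x y)
    (if (and (in-fixnum-range? x) (in-fixnum-range? y))
        (let ((r (+ x y)))                ; common case, inlinable
          (if (in-fixnum-range? r)
              r
              (bignum-add x y)))          ; result overflowed into bignums
        (bignum-add x y)))                ; at least one bignum operand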

From what I hear, mzscheme and Guile use GMP for implementing bignum
arithmetic.  This is fine for applications manipulating huge integers,
but it is a poor choice for the common case of relatively small
integers, UNLESS the system special cases the fixnum-fixnum case and
makes it so that there is no need for an expensive representation
conversion when fixnums and small bignums need to be passed to GMP.
Another issue is memory management for the large integers. Gambit uses
the GC'ed Scheme heap and GMP probably uses the C heap directly (i.e.
malloc/free).  Using two heaps (Scheme's and C's) makes it very hard
to control memory usage (for example how can you make sure that the
program uses no more than a 10 megabyte RAM footprint, perhaps
because you are on an embedded system with no virtual memory?
how can you ensure that the C heap won't get fragmented and require
much more memory in one run than in a similar previous run? etc).

So although Brad Lucier's measurements are interesting, I would be
more interested in seeing how the systems behave when the integers
manipulated are small, and often cross the fixnum/bignum boundary.

Marc

0
feeley619 (51)
12/4/2003 3:53:58 PM
Thanks Marc -- that's very informative.  To what extent does (the
current) Gambit make these decisions (like inlining) automatically,
and to what extent does it rely on user annotations?

Shriram
0
sk1 (223)
12/4/2003 5:29:52 PM
Marc Feeley <feeley@IRO.UMontreal.CA> writes:
>
> UNLESS the system special cases the fixnum-fixnum case and
> makes it so that there is no need for an expensive representation
> conversion when fixnums and small bignums need to be passed to GMP.

That's so in guile I think.

> GMP probably uses the C heap directly (i.e.  malloc/free).

Yep, but via function pointers an application can set.

> Using two heaps (Scheme's and C's) makes it very hard
> to control memory usage

There's a choice between the gmp mpz level which manages memory, or
the mpn level which operates on supplied arrays of words.
0
user42 (6)
12/6/2003 11:57:11 PM
Reply:

Similar Artilces:

to Scheme or not to Scheme..
I go through SICP book trying to do it's exercises. Some people recommend to use Scheme for that, so at first I started using Scheme. After a week of frustration I switched back to Common Lisp and I enjoy it and make better progress. I guess I don't have to do the exercises by the letter, but I need to learn general principles. What do you think? Bigos <ruby.object@googlemail.com> writes: > I guess I don't have to do the exercises by the letter, but I need to > learn general principles. What do you think? Generally speaking, that is true, though some e...

New Wraith Scheme, Pixie Scheme II, and Pixie Scheme III, and source available now.
Today I released version 2.14 of Wraith Scheme and version 1.01 of Pixie Scheme II. Today also, version 1.00 of Pixie Scheme III was released to the App Store. Wraith Scheme 2.14 is a shareware and open-source full R5 Scheme implementation for the Apple Macintosh, with enhancements for parallel processing, by which I mean multiple copies of the Wraith Scheme application (separate Unix processes) all running at once, sharing Scheme main memory. Wraith Scheme 2.14 is a full 64-bit Macintosh application, that only runs on Macintoshes with Intel processors that can execute 64-bit code, and that are running at least MacOS 10.6 (Snow Leopard). Wraith Scheme 2.14 fixes a few minor bugs and adds a few new utility procedures. See the "What's New" section of the Wraith Scheme Help File (available from the Help Menu within the application) for additional information. Pixie Scheme II 1.01 is a shareware and open-source Scheme implementation for the Apple Macintosh, which is almost R5: It is in great part a design prototype for a possible iPad implementation of Scheme, and since the iPad generally hides its underlying Unix file system from users, Pixie Scheme II does not have any of the R5 procedures for access to files. Pixie Scheme II does have many of the enhancements that Wraith Scheme has. Pixie Scheme II has a rather different graphical user interface than does Wraith Scheme, which is both more iPad-like and more like the graphical user interface of the origina...

New Wraith Scheme, Pixie Scheme II, and Pixie Scheme III, and source available now. #2
Today I released version 2.14 of Wraith Scheme and version 1.01 of Pixie Scheme II. Today also, version 1.00 of Pixie Scheme III was released to the App Store. Wraith Scheme 2.14 is a shareware and open-source full R5 Scheme implementation for the Apple Macintosh, with enhancements for parallel processing, by which I mean multiple copies of the Wraith Scheme application (separate Unix processes) all running at once, sharing Scheme main memory. Wraith Scheme 2.14 is a full 64-bit Macintosh application, that only runs on Macintoshes with Intel processors that can execute 64-bit code, and that are running at least MacOS 10.6 (Snow Leopard). Wraith Scheme 2.14 fixes a few minor bugs and adds a few new utility procedures. See the "What's New" section of the Wraith Scheme Help File (available from the Help Menu within the application) for additional information. Pixie Scheme II 1.01 is a shareware and open-source Scheme implementation for the Apple Macintosh, which is almost R5: It is in great part a design prototype for a possible iPad implementation of Scheme, and since the iPad generally hides its underlying Unix file system from users, Pixie Scheme II does not have any of the R5 procedures for access to files. Pixie Scheme II does have many of the enhancements that Wraith Scheme has. Pixie Scheme II has a rather different graphical user interface than does Wraith Scheme, which is both more iPad-like and more like the graphical user interface of the origina...

Scheme reader (in Scheme)?
Hi, does anyone know of a Scheme reader in Scheme (for the purpose of bootstrapping an implementation, of course)? I.e. something that would build on read-char and perhaps peek-char to provide read ? I know it's not that hard to write the reader, but who knows, maybe I'm overlooking some great piece of work that even has support for the various reader extension SRFI's... Thanks, Dan Muresan vqgdiee02@sneakemail.com wrote: > Hi, does anyone know of a Scheme reader in Scheme (for the purpose of > bootstrapping an implementation, of course)? I.e. something that would > build on read-char and perhaps peek-char to provide read ? From Scheme48: scheme/rts/read.scm vqgdiee02@sneakemail.com skrev: > Hi, does anyone know of a Scheme reader in Scheme (for the purpose of > bootstrapping an implementation, of course)? I.e. something that would > build on read-char and perhaps peek-char to provide read ? > > I know it's not that hard to write the reader, but who knows, maybe > I'm overlooking some great piece of work that even has support for the > various reader extension SRFI's... Here is a start: <http://www.cs.indiana.edu/eip/compile/scanparse.html> -- Jens Axel S�gaard Dan Muresan wrote: > Hi, does anyone know of a Scheme reader in Scheme (for the purpose of > bootstrapping an implementation, of course)? I.e. something that would > build on read-char and perhaps peek-char to provide read ? Most of the...

From python to scheme, but which scheme?
I've learned python over the past few years and like it very much. However it's too slow, and I've recently been re-invigorated intellectually with a reading of SICP. This is mostly for hobby/fun purposes; I'm not a professional programmer. I have a specific small program I'd like to translate to scheme (event driven simulator for battery operated devices), but I'm not sure where to go from here. I've narrowed my choices down to PLT scheme and Gambit-C. PLT because it looks like it's designed for learning and has the HTDP book associated with it, and Gambit-C because it seems to have the potential for embedding scheme into a non-OS driven CPU or microcontroller... and today I discovered a seemingly dead (?) language Dylan which appears to be scheme with a different syntax. I would not have understood what's going with SICP had I not already had a pretty decent understanding of Python, however going through that book showed me some weaknesses in some of python's semantics when you try to view it as a funtional language. There are other recent posts here showing differences in those languages. I'd like to learn more about functional programming, but haskell seems crazy and python seems limited in even the first few things I tried, whic hI think has to do with where variables are bound, scope, and that sort of thing (eg SICP models state with lambdas, ["static" variables in C], but in python you have to use a mutable l...

why not the scheme implementation written in scheme?
Hi. why are almost all of implementatin of scheme written in c? is it possible to write a not very slow scheme compiler & intepreter in scheme-- not toy, but a real implementation? I'm writing a scheme->llvm compiler in scheme as a toy. I don't know whether it is possible to make it as a real compiler. -- Thanks & Regards Changying Li There are various reasons. In case of Gauche, it's because one of my goal was interoperability with C, specifically that I wanted to use core part of Gauche as "convenient list processing library" from C. It is certainly possible to write a real implementation in Scheme. Take a look at Scheme48, for example. (Its core part is written in subset of Scheme which can be compiled to native code via C. Other things are built on top of it.) --shiro On 3$B7n(B24$BF|(B, $B8a8e(B4:36, Changying Li <lchangy...@gmail.com> wrote: > Hi. > why are almost all of implementatin of scheme written in c? > is it possible to write a not very slow scheme compiler & intepreter in > scheme-- not toy, but a real implementation? > > I'm writing a scheme->llvm compiler in scheme as a toy. I don't know whether it > is possible to make it as a real compiler. > > -- > > Thanks & Regards > > Changying Li Changying Li <lchangying@gmail.com> writes: > why are almost all of implementatin of scheme written in c? > is it possible to write a not ve...

Scheme interpreter written in Scheme
Hi all! I'm looking for the sources of a scheme interpreter with a small core written, for example, in C and a library written in Scheme itself to provide r5rs compliance. In particular, I am interested in adding r5rs functionalities to an interpreter that doesn't have it. For example, it cannot accept more than one list for (map) and I think I can solve this problem simply using the (map) implementation from this library. Thank you in advance, Ignazio >>>>> "Ignazio" == neclepsio <neclepsio@hotmail.com> writes: Ignazio> I'm looking for the sources of a scheme interpreter with a small core Ignazio> written, for example, in C and a library written in Scheme itself to provide Ignazio> r5rs compliance. That isn't really the same as what you're saying in the Subject: header. The core of Scheme 48 is written in a subset of Scheme (with small parts---the OS interface and bignum arithmetic---in C), and almost all of the standard library is implemented in regular Scheme as well. Albeit, it already has almost full R5RS compliance. -- Cheers =8-} Mike Friede, V�lkerverst�ndigung und �berhaupt blabla "Michael Sperber" <sperber@informatik.uni-tuebingen.de> ha scritto nel messaggio news:y9l8yc3hsof.fsf@sams.informatik.uni-tuebingen.de... > That isn't really the same as what you're saying in the Subject: > header. The core of Scheme 48 is written in a subset of Scheme (with > small pa...

scheme implementation written in scheme
Hello. I'd like to study/cannibalize an existing scheme implementation written in scheme. I can't determine from the scheme wiki which ones if any are written in scheme. Does anyone know of any, preferrably with a no-restrictions license? At the moment I'm primarily interested in the macro expander - I'm looking for scheme code that just expands a darn R5RS-compliant macro into the equivalent scheme code s-expression, no extensions, no weird symbols in the s-exp, just the exact equivalent of cl's expandmacro-1, that could then be further macro-expanded or compiled into working scheme code. I didn't find alexpander sufficient. - Jake The only implementation I am aware of is Scheme48: http://www.s48.org/index.html jake skrev: > Hello. I'd like to study/cannibalize an existing scheme > implementation written in scheme. I can't determine from the scheme > wiki which ones if any are written in scheme. Does anyone know of > any, preferrably with a no-restrictions license? <http://www.cs.indiana.edu/eip/compile/> Also get "Lisp in Small Pieces" either at the library or a bookstore. -- Jens Axel S�gaard On 2007-08-02 09:10:25 -0500, jake <jacob.miles@gmail.com> said: > At the moment I'm primarily interested in the macro expander - I'm > looking for scheme code that just expands a darn R5RS-compliant macro > into the equivalent scheme code s-expression, no extensions, no weird > sy...

Learning Scheme with PLT Scheme
What is the best book to use when learning Scheme with the latest PLT download? Robert On Fri, 31 Jul 2009 15:21:23 -0400, Robert Hicks <sigzero@gmail.com> wrote: > What is the best book to use when learning Scheme with the latest PLT > download? Use whatever book fits your needs, and PLT will likely work for you. SICP, TSPL, The Little Schemer, HtDP, &c. are all good books, and you can use PLT Scheme with each of them. Aaron W. Hsu -- Of all tyrannies, a tyranny sincerely exercised for the good of its victims may be the most oppressive. -- C. S. Lewis On Jul 31, 3:13 pm, "Aaron W. Hsu" <arcf...@sacrideo.us> wrote: > On Fri, 31 Jul 2009 15:21:23 -0400, Robert Hicks <sigz...@gmail.com> wrote: > > What is the best book to use when learning Scheme with the latest PLT > > download? > > Use whatever book fits your needs, and PLT will likely work for you. SICP, > TSPL, The Little Schemer, HtDP, &c. are all good books, and you can use > PLT Scheme with each of them. It is true. Each book has a different style, you should find the one that you like best. Robert Hicks <sigzero@gmail.com> writes: > What is the best book to use when learning Scheme with the latest > PLT download? Besides the books that you got recommendations for, note that PLT comes with several documents that are intended for learning how to use the system. These are the first block of entries on the docum...

Needed: mergesort (disk-to-disk) in scheme; static scheme code analyzer; scheme profiler
I am developing code for symbolic computations (not general purpose) with quadratic surds and rationals, some thousand to fifteen hundred lines of MIT/GNU Scheme, and I am encountering from time to time memory-related errors such as "Out of memory" and "Maximum recursion depth exceeded". An important part of my code deals with merging two sorted files, each line being a Scheme object, into a single sorted file. I am using no ready-made libraries. Maybe someone can help me in finding a library defining a function that takes two input files, one output file and a predicate for comparing two elements, and manages the work by itself? Some other questions: Is there a way to extract a dependency graph from my rather spaghetti code? Is there a way to decide statically whether my functions really always behave tail-recursively (as I intended)? Is there a way of profiling my Scheme code, so that I can select the data on which my code behaves badly? Thanks in advance Alex ...
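
Since no existing library is named in the thread, here is a minimal sketch of the merge itself against plain R5RS ports; the name merge-files, the one-object-per-read file layout, and the before? predicate are illustrative assumptions, not an existing API:

(define (merge-files file-a file-b file-out before?)
  ;; Merge two files of sorted Scheme objects (one per `read') into one,
  ;; keeping the loop iterative so no stack grows with the file size.
  (let ((in-a (open-input-file file-a))
        (in-b (open-input-file file-b))
        (out  (open-output-file file-out)))
    (define (emit x) (write x out) (newline out))
    (define (drain port x)
      (if (not (eof-object? x))
          (begin (emit x) (drain port (read port)))))
    (let loop ((a (read in-a)) (b (read in-b)))
      (cond ((eof-object? a) (drain in-b b))
            ((eof-object? b) (drain in-a a))
            ((before? b a) (emit b) (loop a (read in-b)))
            (else (emit a) (loop (read in-a) b))))
    (close-input-port in-a)
    (close-input-port in-b)
    (close-output-port out)))

A disk-to-disk merge sort would then sort manageable chunks in memory, write each out as a sorted run, and fold the runs together pairwise with a merge like this one.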

Do scheme programmers *read* scheme programs?
Of course they do, but what I mean is the following: when I read a book, I internally "vocalize" the words as I read them, and to a large extent this is what I mean by "reading". When I read, say, a C program it is (for the most part) much of the same. When I see for(i=0;i<100;i++); I mentally vocalize "for eye equals zero, eye less than 100, eye plus plus". On the other hand, I find that even though I can read and even (sometimes) write short scheme programs, I do so more as a problem in pattern recognition/understanding without (much) internal vocalization. As an experiment, I wrote a short VBScript program that takes any text file you feed it and reads it out loud line by line using a text-to-speech engine. When I feed it a short C program I can to some extent listen along and have at least some idea of the meaning of the code as I hear it. When I feed it a short scheme program the result strikes me as surreal -- almost like a schizophrenic word salad. So my main question is: are experienced Scheme programmers just so much more fluent in the language that they get to the point that they can "read" it in close to the traditional sense, or are they engaged in a different sort of cognitive activity? This second possibility doesn't strike me as absurd: at one time "reading" meant reading out loud or at least with your lips moving, and the advent of silent reading seemed like a radical departure. Maybe scheme programmers h...

Scheme in Basic, Scheme-2-Basic
I'm looking for a scheme interpreter written in BASIC, preferably Visual Basic --- and most preferably Visual Basic for Applications. A Scheme to BASIC translator would also be great. I'm in an environment where I need to write some software and the only development environment I have is Microsoft Office Visual Basic for Applications. (The security 'droids have concluded that because MS Office is "standard," VBA must be "safe.") If you can help me please respond with a copy to pcolsen@comcast.net Engineer wrote: > I'm in an environment where I need to write some software and the only > development environment I have is Microsoft Office Visual Basic for > Applications. (The security 'droids have concluded that because MS > Office is "standard," VBA must be "safe.") Not exactly what you are asking for, but here is a way to call Scheme from VBA among others: <http://www.plt-scheme.org/software/mzcom/> -- Jens Axel Søgaard Thanks for the pointer. Unfortunately the only programming language to which I have access is VBA for Microsoft Office. Organization policy forbids "development" on my machine. Luckily the security 'droids don't consider writing VBA macros as "development." Anything I do must be bootstrapped from VBA. "Engineer" <pcolsen@comcast.net> writes: > Unfortunately the only programming language to which I have access is &...

SchemEd
The author of SymLibEd, part of the circuit editor program SchemEd, says that to enable the program to be run on the Iyonix the Castle Toolbox Module should be replaced with the RO Toolbox. My Iyonix is stable and I find the thought of making this change unattractive. Are there any alternatives or am I worrying unnecessarily? Malcolm Smith -- T M Smith in North Yorkshire, England Using an Iyonix and RiscOS 5.11.3 In article <ff14ec124e.tmsmith@tmsmith.freeuk.com>, Malcolm Smith <tmsmith@freeuk.com> wrote: > The author of SymLibEd, part of the circuit editor program SchemEd, > says that to enable the program to be run on the Iyonix the Castle > Toolbox Module should be replaced with the RO Toolbox. > My Iyonix is stable and I find the thought of making this change > unattractive. > Are there any alternatives or am I worrying unnecessarily? You are worrying unnecessarily. John -- John Williams, Wirral, Merseyside, UK - no attachments to these addresses! Non-RISC OS posters change user to johnrwilliams or put 'risc' in subject for reliable contact! Who is John Williams? http://www.picindex.info/author/ In article <ff14ec124e.tmsmith@tmsmith.freeuk.com>, Malcolm Smith <tmsmith@freeuk.com> wrote: > The author of SymLibEd, part of the circuit editor program SchemEd, > says that to enable the program to be run on the Iyonix the Castle > Toolbox Module should be replaced with the RO Toolbox. > ...

[Ann] dot-scheme: a PLT Scheme interface to .NET
Hi there, I have been working on an FFI for PLT Scheme and the .NET framework, and I have reached a point where I think the code might be useful for others. If you are interested, take a look at: http://www.rivendell.ws/dot-scheme Appended below is an MS SqlServer OLE-DB interface built on top of dot-scheme. This should give an idea of how dot-scheme can be used. Criticisms or insights are quite welcome. -pp

; dot-db provides access to OLE-DB databases through scheme
; use:
;   `open-connection' to, well, open a connection
;   `close-connection' to close it
;   `execute-sql' to execute queries against an open connection
(module dot-db mzscheme
  (provide open-connection close-connection execute-sql)
  (require (lib "etc.ss")
           (lib "system.data.ss" "dot-net")
           (lib "dot-utils.ss" "dot-scheme"))

  (define-struct connection (obj))

  ; (database server) -> connection
  ; opens an ole-db connection to `database' on `server' using integrated
  ; NT security.  returns the opened connection object.
  (define (open-connection database server)
    ; ::ole-db-connection is a dot-scheme object that represents the
    ; OleDbConnection .NET data type.  new is a procedure that invokes
    ; the .NET constructor for the specified type.  Note that scheme
    ; strings are automatically translated to their .NET counterparts.
    (let ((c (new ::ole-db-conne...
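
A hypothetical usage sketch, based only on the description above; the module path, the argument order of execute-sql, and the shape of its result are assumptions rather than details taken from the dot-db source:

; assumed module path; adjust to wherever dot-db.ss is actually installed
(require (lib "dot-db.ss" "dot-scheme"))

; open-connection takes (database server), per the header comment above
(define conn (open-connection "Northwind" "localhost"))
; execute-sql presumably takes the open connection and an SQL string
(define rows (execute-sql conn "SELECT CustomerID, City FROM Customers"))
(close-connection conn)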

getting set up to learn some scheme with mit scheme and sicp
I guess this is the most straightforward way to learn scheme. I tried plt scheme, it was cool; scsh I could not make head nor tail of, docs-wise; but I heard that sicp is the best book for learning programming... gavino wrote: > I guess this is the most straightforward way to learn scheme > > I tried plt scheme, it was cool; scsh I could not make head nor tail of, > docs-wise; but I heard that sicp is the best book for learning > programming... How to Design Programs/PLT Scheme is a good combo as well. I got to ch4, and eyes glazed over. Was fun though. Scheme seems to make everything little chains of computations, kinda cool, and easy to define new procs..... SICP has been around longer, no? gavino wrote: > I got to ch4, and eyes glazed over. > Was fun though. > Scheme seems to make everything little chains of computations, kinda > cool, and easy to define new procs..... > > SICP has been around longer, no? > Yah, SICP has been around since the early '80s, I think. It's a much "heavier" book in terms of theory and technique. Bear Ray Dillinger <bear@sonic.net> writes: > gavino wrote: > > I got to ch4, and eyes glazed over. > > Was fun though. > > Scheme seems to make everything little chains of computations, kinda > > cool, and easy to define new procs..... > > SICP has been around longer, no? > > > > Yah, SICP has been around since the early '80s, I think. >...

Interacting Scheme and Fortran; what is in store for Scheme in terms of scripting
Hi all, Is there any reason why no one posts code showing how to use Scheme for scripting? I have never done any serious Unix scripting. However, recently a colleague shared his Fortran code with me. The Fortran program got all its input from a .sh script. For example: == .... while ( $MONTHCOUNT <= $NMONTH ) ../FORTAN_PROGRAM.out << EOF $YEARS[$YEARCOUNT] $MONTH[$MONTHCOUNT] $MONTHTAG[$MONTHCOUNT] $GOME $MODEL $OUT $LONMIN $LONMAX $LATMIN $LATMAX $HARDYEAR $FILE_TYPE EOF @ MONTHCOUNT ++ .... == What would that look like in Scheme? Is Bigloo capable of doing this? Or which Scheme might be a good .sh replacement? Thanks, Schneewittchen frankenstein <klohmuschel@yahoo.de> writes: > Hi all, > > Is there any reason why no one posts code showing how to use Scheme > for scripting? > > I have never done any serious Unix scripting. However, recently a > colleague shared his Fortran code with me. The Fortran program got all > its input from a .sh script. > > > For example: > while ( $MONTHCOUNT <= $NMONTH ) > ./FORTAN_PROGRAM.out << EOF > $YEARS[$YEARCOUNT] > $MONTH[$MONTHCOUNT] > $MONTHTAG[$MONTHCOUNT] > $GOME > $MODEL > $OUT > $LONMIN > $LONMAX > $LATMIN > $LATMAX > $HARDYEAR > $FILE_TYPE > EOF > @ MONTHCOUNT ++ This is not a Fortran script. This is a shell script. (csh I assume). > What would that look like in Scheme? Is Bigloo capable of doing this? > O...
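
As a rough illustration of the shape such a driver could take in Scheme: the sketch below builds the same block of input lines the csh here-document supplied and hands it to the external program. run-with-input-string is a hypothetical stand-in for whatever the chosen implementation actually offers for running a subprocess with given standard input (scsh's process forms, Gauche's gauche.process, Bigloo's run-process, and so on); plain R5RS has no process facility of its own.

(define (month-input years year-index months month-tags m
                     gome model out lonmin lonmax latmin latmax
                     hardyear file-type)
  ;; Build the block of lines the csh here-document fed to the program.
  (apply string-append
         (map (lambda (x)
                (string-append (if (number? x) (number->string x) x) "\n"))
              (list (vector-ref years year-index)
                    (vector-ref months m)
                    (vector-ref month-tags m)
                    gome model out
                    lonmin lonmax latmin latmax
                    hardyear file-type))))

(define (run-all-months nmonth make-month-input)
  ;; Call the external program once per month (0-based index, unlike csh).
  (let loop ((m 0))
    (if (< m nmonth)
        (begin
          ;; run-with-input-string is hypothetical; see the note above.
          (run-with-input-string "./FORTAN_PROGRAM.out" (make-month-input m))
          (loop (+ m 1))))))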

Pixie Scheme III -- Scheme on the iPad (*NOT* a release notice...)
I gave a talk about my iPad Scheme implementation, Pixie Scheme III, locally in the San Francisco area a few weeks ago, and happened to see a little interest in it on the Internet, so thought I had better post something here: I *do* have an R5 Scheme running under iOS on the iPad. It includes all required features of R5 Scheme except for file-system access (since the iPad pretty much doesn't allow user access to the underlying Unix file system), plus numerous enhancements. Pixie Scheme III much resembles my Macintosh Scheme application, Wraith Scheme, though with more of an iPad style user interface. For further information about Wraith Scheme, see the Software page of my web site, whose home page is http://web.mac.com/Jay_Reynolds_Freeman Pixie Scheme III is *not* in the App Store at the moment. I am still chasing a few bugs and tweaking the user interface. I will quite likely submit it for approval within the next month or two, and if Apple accepts it, I will post here. If Apple does not accept it, it is possible that I will release source code anyway, so that Schemers who are also Xcode/iOS developers can install it themselves and have it to play with. Time will tell. Anyone who has specific questions is welcome to send me EMail. ...

how to run script with umb-scheme and have scheme exit afterwards?
I am brand new to scheme. How do I run a file, call it temp.scm, and have umb-scheme not leave me at an umb-scheme prompt? jani@persian.com (Jani Yusef) wrote in message news:<d3be1825.0407191806.74bb860b@posting.google.com>... > I am brand new to scheme. How do I run a file, call it temp.scm, and > have umb-scheme not leave me at an umb-scheme prompt? I can't answer your question, but .... Have you considered using another implementation? My impression is that umb-scheme is a toy Scheme and is not widely used (although, for some unfathomable reason, it often comes with Linux distributions). For newbies, DrScheme (http://www.drscheme.org/) is usually a good recommendation. -- G. ...

R5 release of Wraith Scheme (shareware scheme for Macintosh)
I have a new R5 release of Wraith Scheme, which is a shareware Scheme implementation for the Apple Macintosh, available for download from the "software" page of my personal web site: http://web.mac.com/Jay_Reynolds_Freeman What follows are portions of the "README" file for the current distribution. This "README" file accompanies the fourth release of Wraith Scheme, version 1.20, release date 8 June, 2007. Wraith Scheme is an implementation of the "R5" version of the Scheme programming language for the Apple Macintosh (trademark). Wraith Scheme was written by me, Jay Reynolds Freeman, and is copyright Jay Reynolds Freeman, 2007. Wraith Scheme is shareware: You are welcome to use Wraith Scheme for free, forever, and to pass out free copies of it to anybody else. If you would like to make a shareware donation for it, that's fine, and there is information in the program about how to go about it, but in no sense do I request, insist, or expect that you do so. [...] System Requirements: Wraith Scheme requires an Apple Macintosh running OS X 10.4 or later. The application takes up about 2.3 MByte of storage on disk, and can run usefully in as little as 10 MByte of memory. Wraith Scheme is Universal Binary (trademark), and should run equally well on Macintoshes using Intel microprocessors and on Macintoshes using PowerPC (t...

scheme code from "exploring computer science with scheme"
does anyone have or know of a link where the code from the book "exploring computer science with scheme" can be downloaded? The link given in the book no longer exists. kelly "Gary Kelly" <garykelly@earthlink.net> writes: >does anyone have or know of a link where the code from the book "exploring >computer science with scheme" can be downloaded? The link given in the book >no longer exists. I don't know if this is all the code from the book, but look here: http://inst/~instcd/inst-cd/classes/cs3s/index.html for the support files needed to use the book. bh@abbenay.CS.Berkeley.EDU (Brian Harvey) writes: > I don't know if this is all the code from the book, but look here: > > http://inst/~instcd/inst-cd/classes/cs3s/index.html > > for the support files needed to use the book. I wouldn't be surprised if that URL worked for Brian, given his e-mail address, but for the rest of us, it's: http://www-inst.eecs.berkeley.edu/~instcd/inst-cd/classes/cs3s/index.html -- Prabhakar Ragde, Professor plragde at uwaterloo dot ca Cheriton School of Computer Science http://www.cs.uwaterloo.ca/~plragde Faculty of Mathematics DC 1314, (519)888-4567,x4660 University of Waterloo Waterloo, Ontario CANADA N2L 3G1 Prabhakar Ragde <plragde@uwaterloo.removethis.ca> writes: >http://www-inst.eecs.berkeley.edu/~instcd/inst-cd/classes/cs3s/ind...

"The Scheme Programming Language", 3rd ed.
I'm thinking of buying Dybvig's TSPL, 3rd edition. Before doing so I want to confirm this edition is based on R5RS. I would expect that to be the case based on the publication date, but the book's page at MIT press mentions only the "revised report" without saying if it is R5 or R4 (like the second edition.) Thanks, Roberto Waltman [ Please reply to the group, return address is invalid ] Roberto Waltman wrote: > I'm thinking of buying Dybvig's TSPL, 3rd edition. > Before doing so I want to confirm this edition is based on R5RS. > I would expect that to be the case based on the publication date, but > the book's page at MIT press mentions only the "revised report" > without saying if it is R5 or R4 (like the second edition.) The book is based on R5RS and is freely available on the net at: http://www.scheme.com/tspl3 You can look before you buy. Aziz,,, Abdulaziz Ghuloum wrote: >The book is based on R5RS and is freely available on the net at: > http://www.scheme.com/tspl3 Thank you, Roberto Waltman [ Please reply to the group, return address is invalid ] ...

Wraith Scheme 2.15 released, also Pixie Schemes ...
Today I released version 2.15 of Wraith Scheme and version 1.02 of Pixie Scheme II. Today also, version 1.01 of Pixie Scheme III was released to the App Store. Wraith Scheme 2.15 is a shareware and open-source full R5 Scheme implementation for the Apple Macintosh, with enhancements for parallel processing, by which I mean multiple copies of the Wraith Scheme application (separate Unix processes) all running at once, sharing Scheme main memory. Wraith Scheme 2.15 is a full 64-bit Macintosh application, that only runs on Macintoshes with Intel processors that can execute 64-bit code, and that are running at least MacOS 10.6 (Snow Leopard). Wraith Scheme 2.15 contains enhancements that provide a number of pushbuttons, sliders, sense switches, and level indicators, for user-programmable input/output. Pixie Scheme II 1.02 is a shareware and open-source Scheme implementation for the Apple Macintosh, which is almost R5: It is in great part a design prototype for a possible iPad implementation of Scheme, and since the iPad generally hides its underlying Unix file system from users, Pixie Scheme II does not have any of the R5 procedures for access to files. Pixie Scheme II does have many of the enhancements that Wraith Scheme has. Pixie Scheme II has a rather different graphical user interface than does Wraith Scheme, which is both more iPad-like and more like the graphical user interface of the original Pixie Scheme, more than twenty years ago. I think it is rather cute. Pix...

Any special support for SICP in GNU/MIT Scheme or PLT Scheme?
Is there support for any Scheme quirks in "Structure and Interpretation of Computer Programs"? Is there support for restricting students to only the subset of the language that they need (No set! before chapter 3)? I seem to recall some older version of DrScheme documentation mentioning a SICP "language" but I don't see anything in PLT version 203 (except Michael Sperber's implementation of the picture language). The documentation for MIT Scheme 7.7.1 has the phrase "`6.001' is the SICP compatibility package" in the user guide http://www.gnu.org/software/mit-scheme/documentation/user_3.html#SEC9 but I do not find any code that is less than a decade old for supporting SICP. http://www-2.cs.cmu.edu/afs/cs/project/ai-repository/ai/lang/scheme/impl/mit_scm/sicp/ has a tarball made in 1995 out of files with 1993 dates but I assume that this is significantly out of date. (A Google search of groups did turn up a 1997 or 1998 revision ftp.swiss.ai.mit.edu/pub/users/ziggy/SICP/) Some of the web pages at http://sicp.ai.mit.edu/ use the terms "6.001 Scheme" and "MIT Scheme" in ways that make me think there is a difference. There is mention on the web of a Scheme 48 library for SICP but the latest version of the Scheme 48 documentation does not mention it in the libraries section. http://s48.org/0.57/manual/s48manual.html Thanks in advance for any help. -- Allyn Dimock dimock dot cs dot uml dot edu Allyn Dimo...

Register now! for the Scheme Workshop
2009 Workshop on Scheme and Functional Programming Coordinated with the Symposium in Honor of Mitchell Wand August 22, 2009 Boston, Massachusetts, USA http://www.schemeworkshop.org/2009 CALL FOR PARTICIPATION To the delight of all and sundry, the 2009 Scheme and Functional Programming Workshop will be held on August 22nd at Northeastern University, and it is a signal honor for me to be able to invite YOU to the WORLD'S FOREMOST WORKSHOP on the marvelous Scheme language, and to present a program PACKED with contributions from familiar faces and new ones, certain to amaze, delight, and edify. Lend us your ears, and we will widen the space between them. - John Clements IMPORTANT DATES August 11, 2009 - Registration deadline August 22, 2009 - Workshop on Scheme and Functional Programming August 23-24, 2009 - Symposium in Honor of Mitchell Wand: http://www.ccs.neu.edu/home/wand-symposium VENUE Northeastern University Boston Massachusetts Building and Room TBA ACCOMMODATION A limited block of hotel rooms has been reserved for participants of the Scheme Workshop and/or the Mitchell Wand Symposium at hotels in Cambridge and Boston. See the workshop web site for more information, and please note that some of these special rates expire soon (one as early as July 27th). REGISTRATION The registration fee will be $40 to help cover the operating costs and lunch ...
