From: pkhuong
Subject: On "lisp machines"
Date: 
Message-ID: <51184814.0408241910.23153720@posting.google.com>
I noticed that siliconising Lisp has been a pretty hot topic lately.
Unfortunately, those who express the most interest in this endeavour
seem not to be as well-read as the admittedly extraordinary standard
on cll. The topic of [virtual] machines for eager functional languages
(of which Lisp is a member, even with its multiparadigm-ness) is not
some virgin research topic. The SECD machine
(http://en.wikipedia.org/wiki/SECD_machine) is a well known
architecture with relatively simple semantics. There have been attempts
to cast an SECD machine in silicon - i unfortunately don't have any
further information on the endeavour, except that it was at a Canadian
university. Additionally, in one of his articles on Linear Lisp, Baker
describes a virtual machine that supports hash consing and
update-in-place of linear objects. It seems to me it might not be
impossible to allow objects external to the cons hash and thus also
have shareable, mutable objects.

Still, even if one wants to explore new avenues, there is one
interesting conclusion that can be reached: stack machines are a
natural fit for functional languages (especially eager ones). It is
also another topic that has been explored in the past, although much
less so than register machines. Koopman
[http://www.ece.cmu.edu/~koopman/stack.html] and Ertl
[http://www.complang.tuwien.ac.at/projects/forth.html] (and c.l.forth,
on its good days) should have interesting pointers on the efficient
implementation of stack machines.

This still leaves the area of garbage collection support uncovered,
but, superficially, automatic forwarding pointers and/or tagged (and
aligned!) pointers should help. Baker also had interesting ideas on
the use of a specialised MMU for GC. In any case, i can only hope less
time will be wasted on reinventing the wheel, and more on useful work.

Paul Khuong

From: Duncan Entwisle
Subject: Re: On "lisp machines"
Date: 
Message-ID: <family*.*entwisle*-4A4F62.08095525082004@newstrial.btopenworld.com>
In article <····························@posting.google.com>,
 ··········@pvk.ca (pkhuong) wrote:

> This still leaves the area of garbage collection support uncovered,
> but, superficially, automatic forwarding pointers and/or tagged (and
> aligned!) pointers should help.

I'm still reading through a lot of the papers on Lisp/Scheme in hardware 
(it's tough going as I'm not a computer scientist, and I'm starting 
learning Lisp/Scheme at the same time :-). However, I thought automatic 
forwarding pointers were already implemented for garbage collection in 
the original, famous Lisp machines?

> Baker also had interesting ideas on the use of a specialised MMU
> for GC.

Thanks. I'm always happy for leads on interesting papers - there's an 
awful lot of stuff out there, and sifting through it as an individual 
without access to an expert in the field can be time-consuming.

I have a *vague* (and probably stupid :-) idea on the hardware garbage 
collection front. However I've no intention of sharing it until I've 
investigated it enough to have some idea what I'm talking about. This 
will probably take me a long time.

> In any case, i can only hope less time will be wasted on
> reinventing the wheel, and more on useful work.

I can understand frustration at seeing people go down blind alleys, but 
I would suggest the best way to counter that is to point out other 
avenues to consider - as you have done with hardware garbage collection 
strategies.

Also the term "useful" will have different meanings for different 
people, as people have different aims, and do things for differing 
reasons.

"Live and let live"

Duncan.
From: Joe Marshall
Subject: Re: On "lisp machines"
Date: 
Message-ID: <isb75ow6.fsf@ccs.neu.edu>
··········@pvk.ca (pkhuong) writes:

> I noticed that siliconising Lisp has been a pretty hot topic lately.

> Still, even if one wants to explore new avenues, there is still one
> interesting conclusion that can be reached: stack machines are a
> natural fit for functional languages (especially eager ones). It is
> also another topic that has been explored in the past, although much
> less so than register machines. Koopman
> [http://www.ece.cmu.edu/~koopman/stack.html] and Ertl
> [http://www.complang.tuwien.ac.at/projects/forth.html] (and c.l.forth,
> on its good days) should have interesting pointers on the efficient
> implementation of stack machines.

The MIT Lisp Machines and their descendants (LMI, Symbolics, TI) used
a stack-machine model at the Lisp level.

We decided to abandon that model.  Although stack machines are easy to
implement, they simply do not have the data processing bandwidth of a
register machine.

> This still leaves the area of garbage collection support uncovered,
> but, superficially, automatic forwarding pointers and/or tagged (and
> aligned!) pointers should help. 

It's been done.  It would be interesting to look at the memory model
that the Aries project (MIT LCS) developed.
From: mikel evins
Subject: Re: On "lisp machines"
Date: 
Message-ID: <Mr1Xc.12042$aX2.11187@newssvr29.news.prodigy.com>
Joe Marshall wrote:
> ··········@pvk.ca (pkhuong) writes:
> 
> 
>>I noticed that siliconising Lisp has been a pretty hot topic lately.
> 
> 
>>Still, even if one wants to explore new avenues, there is still one
>>interesting conclusion that can be reached: stack machines are a
>>natural fit for functional languages (especially eager ones). It is
>>also another topic that has been explored in the past, although much
>>less so than register machines. Koopman
>>[http://www.ece.cmu.edu/~koopman/stack.html] and Ertl
>>[http://www.complang.tuwien.ac.at/projects/forth.html] (and c.l.forth,
>>on its good days) should have interesting pointers on the efficient
>>implementation of stack machines.
> 
> 
> The MIT Lisp Machines and their descendants (LMI, Symbolics, TI) used
> a stack-machine model at the Lisp level.
> 
> We decided to abandon that model.  Although stack machines are easy to
> implement, they simply do not have the data processing bandwidth of a
> register machine.

Your description of the K-machine persuaded me to try making a 
register-based rather than a stack-based VM. Periodically, as I'm 
working on it, I re-read that paper, looking for ideas.

 >>This still leaves the area of garbage collection support uncovered,
 >>but, superficially, automatic forwarding pointers and/or tagged (and
 >>aligned!) pointers should help.
 >
 >
 > It's been done.  It would be interesting to look at the memory model
 > that the Aries project (MIT LCS) developed.
 >

Another interesting tip; thanks!

I believe you mentioned some time back that you had experimented some 
with threaded interpreters, and had liked the results. Is there anything 
more you can say about that?
From: Joe Marshall
Subject: Re: On "lisp machines"
Date: 
Message-ID: <8yc35hgt.fsf@ccs.neu.edu>
mikel evins <·····@evins.net> writes:

> I believe you mentioned some time back that you had experimented some
> with threaded interpreters, and had liked the results. Is there
> anything more you can say about that?

Not an awful lot.  I've been (very slowly) hacking up a VM based on
threaded interpretation.  It's hard to do a threaded interpreter
efficiently in C, so I've been using machine code and I've only been
hacking it in my spare time.  I've been trying to figure out ways to
handle advanced features like upward funargs so that there is zero
overhead if you do *not* use them (so testing a flag to see if it is
safe to stack allocate is definitely out, yet you want to stack
allocate by default, so closure code has to figure out how to
back-patch things).
From: mikel
Subject: Re: On "lisp machines"
Date: 
Message-ID: <9B4Xc.8070$QJ3.2272@newssvr21.news.prodigy.com>
Joe Marshall wrote:
> mikel evins <·····@evins.net> writes:
> 
> 
>>I believe you mentioned some time back that you had experimented some
>>with threaded interpreters, and had liked the results. Is there
>>anything more you can say about that?
> 
> 
> Not an awful lot.  I've been (very slowly) hacking up a VM based on
> threaded interpretation.  It's hard to do a threaded interpreter
> efficiently in C, so I've been using machine code and I've only been
> hacking it in my spare time.  I've been trying to figure out ways to
> handle advanced features like upward funargs so that there is zero
> overhead if you do *not* use them (so testing a flag to see if it is
> safe to stack allocate is definitely out, yet you want to stack
> allocate by default, so closure code has to figure out how to
> back-patch things).

I'm familiar with Koopman's TIGRE machine, and one of my collaborators 
thinks he knows a way to make CLOS-style dispatch fast using a similar 
approach. I'm interested in those approaches, and in whatever results 
you see in your work. I wonder whether it might be feasible to devise a 
set of combinators that are lisp-friendly, that would support 
supercombinator compilation to make typical idioms run fast.

However, in the short term I'm just writing a register-based VM to 
execute instructions that are not too different from SECD-style 
instructions (apart from the shift to a register-based style), and 
writing it in a high-level language in order to get things done quickly, 
but with the likelihood that I'm going to need to recode the inner loop 
in a few assembly languages, after I've more or less decided just what 
that inner loop consists of.

One complication is that I'm trying to make it possible to manage memory 
in several different ways, depending on what you want your VM instance 
to do.
From: pkhuong
Subject: Re: On "lisp machines"
Date: 
Message-ID: <51184814.0408290740.292fb7d6@posting.google.com>
mikel <·····@evins.net> wrote in message news:<···················@newssvr21.news.prodigy.com>...
> Joe Marshall wrote:
> > Not an awful lot.  I've been (very slowly) hacking up a VM based on
> > threaded interpretation.  It's hard to do a threaded interpreter
> > efficiently in C, so I've been using machine code and I've only been
> > hacking it in my spare time.  I've been trying to figure out ways to
> > handle advanced features like upward funargs so that there is zero
> > overhead if you do *not* use them (so testing a flag to see if it is
> > safe to stack allocate is definitely out, yet you want to stack
> > allocate by default, so closure code has to figure out how to
> > back-patch things).

[see my last paragraph, on my own attempt at compiling a functional
language while stack allocating]

> I'm familiar with Koopman's TIGRE machine, and one of my collaborators 
> thinks he knows a way to make CLOS-style dispatch fast using a similar 
> approach. I'm interested in those approaches, and in whatever results 
> you see in your work. I wonder whether it might be feasible to devise a 
> set of combinators that are lisp-friendly, that would support 
> supercombinator compilation to make typical idioms run fast.

Wow. I never thought i'd hear about TIGRE on cll :) From my experience
trying to adapt the concept to modern x86, there's at least one major
hurdle: self-modifying code is extremely bad on x86. This is now such
a common comment for most modern architectures (with D caches) that
it's almost become a cliché. However this is even more true for x86
due to its variable length instructions and backward compatibility
requirements. The generation of code per se isn't bad for performance,
it's the modification of memory that is already in the D cache that's
horrible: it makes the chip flush the whole D cache, resulting in a,
shall we say, extremely slow execution speed for the next couple
cycles.

Another possible problem (ie, i think it exists, but i'm not sure to
which chipset it applies) is that, with the way the units are
arranged, back-to-back pushes can be inefficient (pipeline stall).
TIGRE's un-idiomatic use of chains of calls isn't exactly well
supported by such architectures.

Moreover, in an eager language, graph rewriting is useless (since
branches are never shared). While one could output a tree instead of a
graph (and twist call-by-need into call-by-value), it would be
needlessly inefficient. Moreover, the obvious optimisation of inlining
the calls of branch-less segments into supercombinators puts us right
back where we would have started with a more standard implementation
for an eager language.

> However, in the short term I'm just writing a register-based VM to 
> execute instructions that are not too different from SECD-style 
> instructions (apart from the shift to a register-based style), and 
> writing it in a high-level language in order to get things done quickly, 
> but with the likelihood that I'm going to need to recode the inner loop 
> in a few assembly languages, after I've more or less decided just what 
> that inner loop consists of.

I'm working on my own implementation, but am compiling directly to C
instead of using a VM, and i think it can be simpler or comparable to
using one. It's pretty standard in that it CPSes the program to
simplify continuations and allow infinite tail-calls (save current
continuations, go back to the trampoline). However, instead of
constantly pogo-sticking between the driver and useful code, it
longjmp's back to the driver only when the stack is about to be
exhausted (i believe an ML does that). It's also tagless à la tagless
g-machine, in that it instead devotes a full pointer to a function
that does type-specific operations (like a class) [it makes my life
easier to just put a function pointer than to extend everything to
support new types ;)]. So far, it's pretty standard, but it's *stack
allocating by default*. Like the grandparent poster, i have to
backpatch addresses when elements are evicted from the stack (which
only happens when they are pointed to by the heap or when the
continuation must be saved before longjmp'ing back to the driver, as
described by Baker in "CONS should not CONS its arguments: Cheney On
The MTA, Pt. II"[i think]). However, i believe my tagless approach
makes it easier, since the copying routine can be specialised for the
specific type (e.g., array of int's VS cons cell). In other words,
it's using the stack like a first generation in a generational GC, and
moves things from the first to the second and last generation, the
heap, with a naïve copying collector. It also tries to efficiently use
reference counts (efficiently: don't increment children's refcount more
than twice, use anchored pointers to further minimise updates) to
deduce linearity, allowing functional style to dynamically choose to
use mutation, but that's irrelevant.

It's interesting to see what other amateurs (i'm just beginning
pre-dentistry - no CS or even university experience :) are up to.

Paul Khuong

PS Mr Marshall: What kind of threading are you using (i assume you
mean threaded as in interpretation method, not multi-process?).
From: mikel
Subject: Re: On "lisp machines"
Date: 
Message-ID: <rNoYc.9244$QJ3.5536@newssvr21.news.prodigy.com>
pkhuong wrote:
> mikel <·····@evins.net> wrote in message news:<···················@newssvr21.news.prodigy.com>...
> 
>>Joe Marshall wrote:
>>
>>>Not an awful lot.  I've been (very slowly) hacking up a VM based on
>>>threaded interpretation.  It's hard to do a threaded interpreter
>>>efficiently in C, so I've been using machine code and I've only been
>>>hacking it in my spare time.  I've been trying to figure out ways to
>>>handle advanced features like upward funargs so that there is zero
>>>overhead if you do *not* use them (so testing a flag to see if it is
>>>safe to stack allocate is definitely out, yet you want to stack
>>>allocate by default, so closure code has to figure out how to
>>>back-patch things).
> 
> 
> [see my last paragraph, on my own attempt at compiling a functional
> language while stack allocating]
> 
> 
>>I'm familiar with Koopman's TIGRE machine, and one of my collaborators 
>>thinks he knows a way to make CLOS-style dispatch fast using a similar 
>>approach. I'm interested in those approaches, and in whatever results 
>>you see in your work. I wonder whether it might be feasible to devise a 
>>set of combinators that are lisp-friendly, that would support 
>>supercombinator compilation to make typical idioms run fast.
> 
> 
> Wow. I never thought i'd hear about TIGRE on cll :) From my experience
> trying to adapt the concept to modern x86, there's at least one major
> hurdle: self-modifying code is extremely bad on x86. This is now such
> a common comment for most modern architectures (with D caches) that
> it's almost become a cliché. However this is even more true for x86
> due to its variable length instructions and backward compatibility
> requirements. The generation of code per se isn't bad for performance,
> it's the modification of memory that is already in the D cache that's
> horrible: it makes the chip flush the whole D cache, resulting in a,
> shall we say, extremely slow execution speed for the next couple
> cycles.
> 
> Another possible problem (ie, i think it exists, but i'm not sure to
> which chipset it applies) is that, with the way the units are
> arranged, back-to-back pushes can be inefficient (pipeline stall).
> TIGRE's un-idiomatic use of chains of calls isn't exactly well
> supported by such architectures.

That's what I was afraid of; consequently my VM presently makes no 
attempt to use such a scheme. I don't want to write off my 
collaborator's ideas because he's a clever guy and his particular area 
of expertise is category theory as applied to practical CLOS 
implementation. But his solution may be theoretical in nature, and may 
not take into account the idiosyncrasies of modern microprocessors. 
We'll have to see.


> Moreover, in an eager language, graph rewriting is useless (since
> branches are never shared). While one could output a tree instead of a
> graph (and twist call-by-need into call-by-value), it would be
> needlessly inefficient. Moreover, the obvious optimisation of inlining
> the calls of branch-less segments into supercombinators puts us right
> back where we would have started with a more standard implementation
> for an eager language.

Good point. I was wondering sort of idly, without putting any actual 
effort into thinking about it, whether there might be a set of 
combinators suitable for Lisp (as opposed to suitable for lambda 
calculus) that could benefit from the supercombinator approach. Sounds 
like you're saying that such a set of combinators winds up in practice 
being equivalent to standard Lisp implementation techniques.

>>However, in the short term I'm just writing a register-based VM to 
>>execute instructions that are not too different from SECD-style 
>>instructions (apart from the shift to a register-based style), and 
>>writing it in a high-level language in order to get things done quickly, 
>>but with the likelihood that I'm going to need to recode the inner loop 
>>in a few assembly languages, after I've more or less decided just what 
>>that inner loop consists of.
> 
> 
> I'm working on my own implementation, but am compiling directly to C
> instead of using a VM, and i think it can be simpler or comparable to
> using one. 

I want to do a VM so that I can make a portable implementation whose 
code and image formats are binary-compatible across hardware 
architectures. I like what the Squeak folks have done with this 
approach. In fact I like Squeak well enough that I would use it for a 
lot of work except that

1. I'd rather use Lisp
2. I'd rather it was capable of using the native window system
3. I'd rather it was capable of more easily making smaller programs and 
libraries

So I'm hoping to make a Lisp that can satisfy those requirements without 
giving up the other benefits of Squeak. (Highly portable; code and 
images binary compatible across hardware platforms; built-in graphical 
environment for development and for construction of UI; excellent 
support for networking; network-sharable projects; network-loadable 
partial images; easy to turn into a bootable system; etc, etc.)

I don't think compiling to C is any harder than writing a VM, but it 
would sacrifice the platform-independence of compiled code and images 
that I desire to preserve.

> It's pretty standard in that it CPSes the program to
> simplify continuations and allow infinite tail-calls (save current
> continuations, go back to the trampoline). However, instead of
> constantly pogo-sticking between the driver and useful code, it
> longjmp's back to the driver only when the stack is about to be
> exhausted (i believe an ML does that). It's also tagless à la tagless
> g-machine, in that it instead devotes a full pointer to a function
> that does type-specific operations (like a class) [it makes my life
> easier to just put a function pointer than to extend everything to
> support new types ;)]. So far, it's pretty standard, but it's *stack
> allocating by default*. Like the grandparent poster, i have to
> backpatch addresses when elements are evicted from the stack (which
> only happens when they are pointed to by the heap or when the
> continuation must be saved before longjmp'ing back to the driver, as
> described by Baker in "CONS should not CONS its arguments: Cheney On
> The MTA, Pt. II"[i think]). However, i believe my tagless approach
> makes it easier, since the copying routine can be specialised for the
> specific type (e.g., array of int's VS cons cell). In other words,
> it's using the stack like a first generation in a generational GC, and
> moves things from the first to the second and last generation, the
> heap, with a naïve copying collector. It also tries to efficiently use
> reference counts (efficiently: don't increment children's refcount more
> than twice, use anchored pointers to further minimise updates) to
> deduce linearity, allowing functional style to dynamically choose to
> use mutation, but that's irrelevant.

Chicken Scheme also uses the Cheney on the MTA approach. I've seen 
complaints that in practice Cheney on the MTA results in unacceptable GC 
overhead, but I don't know whether that's true; I guess maybe you'll be 
in a position to confirm or deny it. Chicken at present seems to be 
neither outstandingly fast nor outstandingly slow.

> 
> It's interesting to see what other amateurs (i'm just beginning
> pre-dentistry - no CS or even university experience :) are up to.
> 
> Paul Khuong
> 
> PS Mr Marshall: What kind of threading are you using (i assume you
> mean threaded as in interpretation method, not multi-process?).

Threaded-code interpretation, like a Forth interpreter. Code objects 
have embedded pointers to eval and apply routines. Executing a piece of 
code means jumping to the appropriate embedded routine. Indirect 
threading, so as to avoid the self-modifying code problems you mentioned 
above (and to make it easier to relocate code).

There's a discussion of direct and indirect threading and how to write 
threaded-code interpreters here:

http://www.complang.tuwien.ac.at/forth/threaded-code.html




--me
From: Joe Marshall
Subject: Re: On "lisp machines"
Date: 
Message-ID: <4qmkraqx.fsf@ccs.neu.edu>
mikel <·····@evins.net> writes:

> I've seen complaints that in practice Cheney on the MTA results in
> unacceptable GC overhead, but I don't know whether that's true.

In my experience it seems fine.  The overhead comes from other places:
since you never return, any return-address prediction done by the
hardware is wasted, any stack caching beyond the current frame is
wasted, and all the effort of saving the base-of-frame, previous stack
pointer, etc. is wasted.
From: mikel
Subject: Re: On "lisp machines"
Date: 
Message-ID: <tAHYc.13756$ml3.13175@newssvr29.news.prodigy.com>
Joe Marshall wrote:
> mikel <·····@evins.net> writes:
> 
> 
>>I've seen complaints that in practice Cheney on the MTA results in
>>unacceptable GC overhead, but I don't know whether that's true.
> 
> 
> In my experience it seems fine.  The overhead comes from other places:
> since you never return, any return-address prediction done by the
> hardware is wasted, any stack caching beyond the current frame is
> wasted, and all the effort of saving the base-of-frame, previous stack
> pointer, etc. is wasted.

Interesting; thanks. Do you think these sources of overhead obviate any 
supposed advantages of the approach?
From: Joe Marshall
Subject: Re: On "lisp machines"
Date: 
Message-ID: <isb0ppyh.fsf@ccs.neu.edu>
mikel <·····@evins.net> writes:

> Joe Marshall wrote:
>> mikel <·····@evins.net> writes:
>>
>>>I've seen complaints that in practice Cheney on the MTA results in
>>>unacceptable GC overhead, but I don't know whether that's true.
>> In my experience it seems fine.  The overhead comes from other
>> places:
>> since you never return, any return-address prediction done by the
>> hardware is wasted, any stack caching beyond the current frame is
>> wasted, and all the effort of saving the base-of-frame, previous stack
>> pointer, etc. is wasted.
>
> Interesting; thanks. Do you think these sources of overhead obviate
> any supposed advantages of the approach?

The advantages are portability, interoperability, ease of
call-with-current-continuation, and ease of tail-recursion.  If
performance is less important than all of these, then Cheney on the
MTA is a clear win.
From: mikel evins
Subject: Re: On "lisp machines"
Date: 
Message-ID: <iYOYc.13640$eD3.11984@newssvr27.news.prodigy.com>
Joe Marshall wrote:
> mikel <·····@evins.net> writes:
> 
> 
>>Joe Marshall wrote:
>>
>>>mikel <·····@evins.net> writes:
>>>
>>>
>>>>I've seen complaints that in practice Cheney on the MTA results in
>>>>unacceptable GC overhead, but I don't know whether that's true.
>>>
>>>In my experience it seems fine.  The overhead comes from other
>>>places:
>>>since you never return, any return-address prediction done by the
>>>hardware is wasted, any stack caching beyond the current frame is
>>>wasted, and all the effort of saving the base-of-frame, previous stack
>>>pointer, etc. is wasted.
>>
>>Interesting; thanks. Do you think these sources of overhead obviate
>>any supposed advantages of the approach?
> 
> 
> The advantages are portability, interoperability, ease of
> call-with-current-continuation, and ease of tail-recursion.  If
> performance is less important than all of these, then Cheney on the
> MTA is a clear win.

Check. I can testify to at least some of those advantages in Chicken.
From: felix
Subject: Re: On "lisp machines"
Date: 
Message-ID: <e36dad49.0409060054.11858ea3@posting.google.com>
mikel <·····@evins.net> wrote in message news:<···················@newssvr21.news.prodigy.com>...
> 
> Chicken Scheme also uses the Cheney on the MTA approach. I've seen 
> complaints that in practice Cheney on the MTA results in unacceptable GC 
> overhead, but I don't know whether that's true; I guess maybe you'll be 
> in a position to confirm or deny it. Chicken at present seems to be 
> neither outstandingly fast nor outstandingly slow.
> 

This is not exactly true. GC using this scheme is actually *very*
efficient. Generally you will not find many portable Scheme compilers that
generate faster code, while still supporting general tail-calls and full
continuations.


cheers,
felix
From: mikel evins
Subject: Re: On "lisp machines"
Date: 
Message-ID: <ENa%c.12502$QJ3.11198@newssvr21.news.prodigy.com>
felix wrote:
> mikel <·····@evins.net> wrote in message news:<···················@newssvr21.news.prodigy.com>...
> 
>>Chicken Scheme also uses the Cheney on the MTA approach. I've seen 
>>complaints that in practice Cheney on the MTA results in unacceptable GC 
>>overhead, but I don't know whether that's true; I guess maybe you'll be 
>>in a position to confirm or deny it. Chicken at present seems to be 
>>neither outstandingly fast nor outstandingly slow.
>>
> 
> 
> This is not exactly true. GC using this scheme is actually *very*
> efficient. Generally you will not find many portable Scheme compilers that
> generate faster code, while still supporting general tail-calls and full
> continuations.

I don't mean to impugn Chicken, which I like very much, but while we're 
on the subject, I wonder whether you know how to make the appended 
Chicken code go fast. It is a contrived program that I wrote while 
investigating several compilers for a tools project; basically, I wanted 
to find a compiler for a pleasant language that could read large files 
and extract information about the text in them, and do it rapidly. 
Chicken read the (large) test files quite rapidly, but I couldn't get 
the loop that gets individual characters to perform very well, trying 
various techniques. If there is a very fast way to get each character in 
turn, I'd like to know what it is.

(I ended up using Ocaml for this particular project; it beat the pants 
off everything else I tried, including GCC).

(declare (uses format posix lolevel))

(define $buffer-size 4096)

(define (get-each-character buffer)
     (let ((ch #f))
       (do ((i 0 (+ i 1)))
           ((>= i $buffer-size) i)
         (set! ch (string-ref buffer i)))))

(define (read-each-character in-file buffer)
     (let ((total 0))
       (do ((result (file-read in-file $buffer-size buffer)
                    (file-read in-file $buffer-size buffer)))
           ((<= (list-ref result 1) 0) total)
         (set! total (+ total (list-ref result 1)))
         (get-each-character buffer))))

(define (main)
     (let* ((path (list-ref (argv) 1))
            (fd (file-open path open/rdonly))
            (buffer (make-string $buffer-size))
            (start-time (current-milliseconds))
            (total (read-each-character fd buffer))
            (end-time (current-milliseconds))
            (elapsed (- end-time start-time)))
       (format #t "~%Read ~D characters in ~D milliseconds" total elapsed)
       (file-close fd)))

(main)
From: felix
Subject: Re: On "lisp machines"
Date: 
Message-ID: <aef769bd.0409070321.547f65c8@posting.google.com>
mikel evins <·····@evins.net> wrote in message news:<·····················@newssvr21.news.prodigy.com>...
> 
> I don't mean to impugn Chicken, which I like very much, but while we're 
> on the subject, I wonder whether you know how to make the appended 
> Chicken code go fast. It is a contrived program that I wrote while 
> investigating several compilers for a tools project; basically, I wanted 
>   to find a compiler for a pleasant language that could read large files 
> and extract information about the text in them, and do it rapidly. 
> Chicken read the (large) test files quite rapidly, but I couldn't get 
> the loop that gets individual characters to perform very well, trying 
> various techniques. If there is a very fast way to get each character in 
> turn, I'd like to know what it is.
> 

Your code looks fine (I would replace `(list-ref _ 1)' with `(cadr _)'
perhaps). What optimizations settings are you using? With

csc -Ob slow.scm 

You should get reasonable performance (this switches off all
runtime-checks, disables multithreading and compiles in fixnum mode).
For example, the get-each-character loop boils down to

static C_word C_fcall f_66(C_word t0,C_word t1){
C_word tmp;
C_word t2;
C_word t3;
C_word t4;
C_word t5;
C_word t6;
loop:
t2=t1;
if(C_truep((C_word)C_fixnum_greater_or_equal_p(t2,C_fix(4096)))){
return(t1);}
else{
t3=(C_word)C_subchar(((C_word*)t0)[2],t1);
t4=(C_word)C_u_fixnum_plus(t1,C_fix(1));
t6=t4;
t1=t6;
goto loop;}}

(All the C_...() invocations above are macros)

BTW, gcc (with -O3 -fomit-frame-pointer -fno-strict-aliasing)
optimizes this merrily into:

f_66:
	movl	8(%esp), %eax
	jmp	.L3
	.p2align 4,,7
.L8:
	addl	$2, %eax
.L3:
	cmpl	$8192, %eax
	jle	.L8
	ret

The csc option -O2 really is a must, and -f (fixnum mode) is 
particularly important for this example.


cheers,
felix
From: mikel evins
Subject: Re: On "lisp machines"
Date: 
Message-ID: <xoh%c.17060$0y4.13806@newssvr27.news.prodigy.com>
felix wrote:
> mikel evins <·····@evins.net> wrote in message news:<·····················@newssvr21.news.prodigy.com>...
> 
>>I don't mean to impugn Chicken, which I like very much, but while we're 
>>on the subject, I wonder whether you know how to make the appended 
>>Chicken code go fast. It is a contrived program that I wrote while 
>>investigating several compilers for a tools project; basically, I wanted 
>>  to find a compiler for a pleasant language that could read large files 
>>and extract information about the text in them, and do it rapidly. 
>>Chicken read the (large) test files quite rapidly, but I couldn't get 
>>the loop that gets individual characters to perform very well, trying 
>>various techniques. If there is a very fast way to get each character in 
>>turn, I'd like to know what it is.
>>
> 
> 
> Your code looks fine (I would replace `(list-ref _ 1)' with `(cadr _)'
> perhaps). What optimizations settings are you using? With
> 
> csc -Ob slow.scm 
> 
> You should get reasonable performance (this switches off all
> runtime-checks, disables multithreading and compiles in fixnum mode).

The GCC options you suggested really improved things a lot. The Chicken 
option -Ob by itself produced the results I saw before (where Chicken's 
output was around thirty times slower than Ocaml's). Adding the GCC 
options improves things to the point that the Chicken output squeaks 
ahead of CMUCL into third place, with Ocaml's code only about twice as fast.

I'd say that moves it into the 'fast' category; thanks for the suggestion.

> The csc option -O2 really is a must, and -f (fixnum mode) is 
> particularly important for this example.

Got those; it was the GCC options at the csc command line that really 
made the difference. Thanks again.
From: Joe Marshall
Subject: Re: On "lisp machines"
Date: 
Message-ID: <8ybwrb6o.fsf@ccs.neu.edu>
··········@pvk.ca (pkhuong) writes:

> PS Mr Marshall: What kind of threading are you using (i assume you
> mean threaded as in interpretation method, not multi-process?).

I meant `threaded interpreter' in the same sense used by `Forth'
aficionados.  It is `indirect threading' rather than `direct' because
(as you noted) modifying the instruction stream requires a cache
flush.  There are numerous other advantages to indirect threading, but
ease of garbage collection is probably the most important one.
From: pkhuong
Subject: Re: On "lisp machines"
Date: 
Message-ID: <51184814.0408301553.4aec658d@posting.google.com>
··········@pvk.ca (pkhuong) wrote in message news:<····························@posting.google.com>...
[...]
> Wow. I never thought i'd hear about TIGRE on cll :) From my experience
> trying to adapt the concept to modern x86, there's at least one major
> hurdle: self-modifying code is extremely bad on x86. This is now such
> a common comment for most modern architectures (with D caches) that
> it's almost become a cliché. However this is even more true for x86
> due to its variable length instructions and backward compatibility
> requirements. The generation of code per se isn't bad for performance,
> it's the modification of memory that is already in the D cache that's
> horrible: it makes the chip flush the whole D cache, resulting in a,
> shall we say, extremely slow execution speed for the next couple
> cycles.
Yes, i did mean I Cache.