From: Emilian
Subject: Pure speed
Date: 
Message-ID: <1126616572.763330.107070@f14g2000cwb.googlegroups.com>
Hi,

On advice from c.l.l [1] I discovered http://spoj.sphere.pl/ which
allows me to submit Lisp solutions to some contest problems.

It's all nice and I've become a little format-and-loop master (never
used them that much at university) but there is one thing that's
bothering me: SPEED.

I'm using CLISP, and on some tight time-restricted problems I quite
often get Time Limit Exceeded.

Of course, my Lisp code / algorithm could be inefficient, but even the
raw I/O speed test (http://spoj.sphere.pl/problems/INTEST/) fails in my
case (it should read about 1 MB/s).

Are there any basic things I should know?
- As far as recursion goes, I try to make my functions tail recursive,
which should be almost as fast as a normal loop.
- Where do I get some description of (declare)? I found some code on
the net with (declare (fixnum)) and (declare (optimize)) but I would
really like a complete guide.
- What functions should I use for I/O? Most of the time I use plain
(read) for numbers and (read-line) for strings. Is (parse-integer
(read-line)) better than (read)? As far as I understand, (format) is
really slow. Is any specific replacement available?

Please note that I'm not trying to destroy the spirit of Lisp for the
sake of speed. That is, I would prefer code that I'm able to understand
and that feels right over code that is just full of (declare)s. But I
do need some hints to speed up my code.

Regards,
Emilian Bold

1.
http://groups.google.com/group/comp.lang.lisp/browse_thread/thread/bb71de5c2e245035

From: ···············@yahoo.com
Subject: Re: Pure speed
Date: 
Message-ID: <1126620030.702840.35510@o13g2000cwo.googlegroups.com>
The sad fact is that most Common Lisps don't generally implement tail
recursion as a loop.  This is controversial (see elsewhere in this
group), but the Lisp standard does not require it.  You can implement
your functions in two ways, one tail-recursive and one not, compile
them, and use disassemble (see below) to compare the results.

If you write a tail-recursive function to iterate down a long list, the
main problem is not speed, but stack overflow.  If the implementation
doesn't use a loop, it will have to remember all the nested calls, and
will likely overflow when the nesting is a few thousand deep.
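
For example (an illustrative sketch of my own, not from the original
question):

;; A self tail call that walks a list; without tail-call optimization
;; each element consumes one stack frame.
(defun my-length (list &optional (acc 0))
  (if (null list)
      acc
      (my-length (rest list) (1+ acc))))

;; (my-length (make-list 100000)) ; may overflow the stack if not optimized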

Rather than write a treatise, I'll give some examples of how to use
declare.  I'll use functions that add two numbers.  My assumption is
that the numbers are always small integers, and the sums are also small
integers.  I need to tell the compiler about this assumption.  'add'
tells the compiler nothing, 'add2' tells it part of the story, and
'add3' tells it the whole story.

(defun add (x y) ;; no declarations at all
  (+ x y))

(compile 'add) ;; an alternative to compile-file

(disassemble 'add)

In the implementation I use--Allegro CL 7.0 on Windows--the result
of the disassemble is 56 bytes.

In add2, I tell the compiler that x and y are small integers that
should fit in one machine word, and that we want it to optimize for
speed.  Note: if you want your whole file to be fast, put the line
(declaim (optimize speed))
at the top of the file, and omit (optimize speed) from the individual
functions.  But you shouldn't do that until the code is thoroughly
debugged.  The best while developing is to put this line at the top of
the file:
(declaim (optimize safety debug))

Back to add2:

(defun add2 (x y)
  (declare (optimize speed) (fixnum x y))
  (+ x y))

(compile 'add2)

(disassemble 'add2)

On my platform, this is still 56 bytes.  Didn't the declarations do any
good?  The answer is that the compiler is right not to change anything.
Even if x and y are small enough to fit in one machine word each, there
is no guarantee that x+y will fit in one word.  After x+y is computed,
the system has to "box" it up as a bignum [arbitrary-precision
integer], just in case it doesn't fit.  The disassemble output may give
you a hint as to what's happening: on my platform, I saw
BOX-TO-NEW-BIGNUM.  There are also compiler flags that give such hints.

The way to do it is

(defun add3 (x y)
  (declare (optimize speed) (fixnum x y))
  (the fixnum (+ x y)))

The "the" form tells the compiler that the answer will definitely be a
fixnum.  After compile and disassemble, add3 is only 24 bytes long.  It
comes down to a single assembly instruction "addl", together with trap
and interrupt handling.

Of course, the burden is on the *programmer* to be sure x+y is small.
You may think x will be around 10 or 100, but are you sure it won't
grow into the millions?
From: Jeff M.
Subject: Re: Pure speed
Date: 
Message-ID: <1126625038.169031.263160@g44g2000cwa.googlegroups.com>
A couple of things I would add/comment on:

Implementations that don't optimize tail calls will be slower. Speed
is a factor as well as stack overflow: stack overflow is a symptom of
large data sets, but every call/ret pair also carries a definite speed
hit. Without tail-call optimization any recursive solution will be
slower than an iterative one.
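
As a concrete illustration (a sketch of my own, untested here): the
same sum written both ways. Without tail-call optimization the first
version pays a call/ret per element and risks blowing the stack.

;; Tail-recursive: each element costs a call/ret unless the compiler
;; turns the self tail call into a jump.
(defun sum-rec (list &optional (acc 0))
  (if (null list)
      acc
      (sum-rec (rest list) (+ acc (first list)))))

;; Iterative: no function calls in the inner loop.
(defun sum-iter (list)
  (loop for x in list sum x))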

The problem described on the website is very trivial. The naive
solution in Lisp is a couple of functions, each a couple of lines of
code:

(defun nk (s)
  (let ((n (read s))
        (k (read s)))
    (values n k)))

(defun run (n k s)
  (let ((count 0))
    (dotimes (i n count)
      (let ((x (read s)))
        (when (zerop (mod x k))
          (incf count))))))

(defun test (filename)
  (with-open-file (s filename)
    (multiple-value-bind (n k)
        (nk s)
      (run n k s))))

So, examining this program: there is no recursion, and there are only
two spots that could use optimization (everything else is only called
once): the conditional (mod) and the read. A couple of simple declares
should help quite a bit with the conditional:

(defun run (n k s)
  (declare (optimize speed) (fixnum n k))
  (let ((count 0))
    (dotimes (i n count)
      (let ((x (read s)))
        (declare (fixnum x))
        (when (zerop (the fixnum (mod x k)))
          (incf count))))))

The read is a bit harder to optimize out (perhaps other Lispniks here
have a solution for that, though). However, I usually like to trust
that the language implements its features well, and I'm going to trust
that read does its job well.
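
If I had to try, something like this might be a start (an untested
sketch; read-int is just a name I made up). parse-integer on a line
skips the reader's readtable dispatch:

;; Hypothetical helper: one integer per line, without going through
;; the full Lisp reader.
(defun read-int (&optional (stream *standard-input*))
  (parse-integer (read-line stream) :junk-allowed t))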

So, we're good to go. Generally speaking, though, this is how I program
Lisp: very high level. Get it working. Profile. Optimize a couple of
small sections. Done. And most of the time, the speed benefit in
development time is much more important than the runtime of the
program. Otherwise we'd all still be coding assembly.

Jeff M.
From: Joe Marshall
Subject: Re: Pure speed
Date: 
Message-ID: <oe6wil7l.fsf@alum.mit.edu>
"Jeff M." <·······@gmail.com> writes:

> A couple of things I would add/comment on:
>
> Implementations that don't optimize tail calls will be slower.  Speed
> is a factor as well as stack overflow: stack overflow is a symptom of
> large data sets, but every call/ret pair also carries a definite speed
> hit.  Without tail-call optimization any recursive solution will be
> slower than an iterative one.

Not necessarily.  The Intel Pentium architecture has a special
hardware buffer for call/return pairing that eliminates a lot of the
cost.

> The problem described on the website is very trivial. The naive
> solution in Lisp is a couple functions, each a couple lines of code:
>
> (defun nk (s)
>   (let ((n (read s))
>         (k (read s)))
>     (values n k)))
>
> (defun run (n k s)
>   (let ((count 0))
>     (dotimes (i n count)
>       (let ((x (read s)))
>         (when (zerop (mod x k))
>           (incf count))))))
>
> (defun test (filename)
>   (with-open-file (s filename)
>     (multiple-value-bind (n k)
>         (nk s)
>       (run n k s))))
>
> So, examining this program: there is no recursion, and there are only
> two spots that could use optimization (everything else is only called
> once): the conditional (mod) and the read. 

This program is going to be completely dominated by the call to read.

> A couple of simple declares should help quite a bit with the
> conditional:
>
> (defun run (n k s)
>   (declare (optimize speed) (fixnum n k))
>   (let ((count 0))
>     (dotimes (i n count)
>       (let ((x (read s)))
>         (declare (fixnum x))
>         (when (zerop (the fixnum (mod x k)))
>           (incf count))))))
>
> The read is a bit harder to optimize out (perhaps other Lispniks here
> have a solution for that, though). However, I usually like to trust
> that the language implements its features well, and I'm going to trust
> that read does its job well.

No matter how well READ is implemented, it is a heavyweight procedure.
It has to fetch things from a stream, dispatch through a readtable,
assemble tokens, etc.

> So, we're good to go. Generally speaking, though, this is how I program
> Lisp: very high level. Get it working. Profile. Optimize a couple of
> small sections. Done. And most of the time, the speed benefit in
> development time is much more important than the runtime of the
> program. Otherwise we'd all still be coding assembly.

I agree.  I'm all for tail call optimization, too.  I think it is of
critical importance.  But I don't expect astounding gains in
performance just because of tail call optimization.
From: Pascal Bourguignon
Subject: Re: Pure speed
Date: 
Message-ID: <87fys85pzb.fsf@thalassa.informatimago.com>
Joe Marshall <·········@alum.mit.edu> writes:
> No matter how well READ is implemented, it is a heavyweight procedure.
> It has to fetch things from a stream, dispatch through a readtable,
> assemble tokens, etc.

And no matter how well this is implemented, it will still have to wait
ages for the hard disk to spin, or worse, for a user to decide what to type.

It's worse than before, when the ratio of CPU to disk speed was about
1000.  Now it's about 4000000 and getting bigger every day.

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GCS d? s++:++ a+ C+++ UL++++ P--- L+++ E+++ W++ N+++ o-- K- w--- 
O- M++ V PS PE++ Y++ PGP t+ 5+ X++ R !tv b+++ DI++++ D++ 
G e+++ h+ r-- z? 
------END GEEK CODE BLOCK------
From: Emilian
Subject: Re: Pure speed
Date: 
Message-ID: <1126679507.545964.281750@f14g2000cwb.googlegroups.com>
Thanks guys for the answers so far.

Jeff M., I've also submitted your solution, changed a little to read
from the terminal, and also got Time Limit Exceeded.

(defun run (n k)
  (declare (optimize speed) (fixnum n k))
  (let ((count 0))
    (dotimes (i n count)
      (let ((x (read)))
        (declare (fixnum x))
        (when (zerop (the fixnum (mod x k)))
          (incf count))))
    (format t "~D~%" count)))

(run (read) (read))

elof's version also gets TLE.

(defun run ()
    (do ((n (read) (decf n))
         (k (read))
         (divisible 0))
        ((= n 0) divisible)
      (when (zerop (mod (read) k))
        (incf divisible))))

(format t "~D~%" (run))


mmcconnell17's example was quite nice; I never knew the (the) special
form existed. I'm also looking now at
http://www.lisp.org/HyperSpec/Body/sym_declare.html to see more
(declare) options.

As for (read), I assume it should be really slow compared to
(parse-integer (read-line)), since it does basically much more than a
read-integer function. No?

Regards,
Emilian Bold


From: Pascal Bourguignon
Subject: Re: Pure speed
Date: 
Message-ID: <87d5ncdqzt.fsf@thalassa.informatimago.com>
"Emilian" <···················@gmail.com> writes:

> [...]
> As for (read), I assume it should be really slow compared to
> (parse-integer (read-line)), since it does basically much more than a
> read-integer function. No?

Really, it's not necessarily slower.  If it's compiled to a DFA, the
size of the DFA shouldn't matter for speed.


The problem with SPOJ is that we don't know what options gcl and clisp
are launched with.  I'm not even sure clisp programs are compiled!

But the problem we have with I/O is encodings: the external format of
*standard-input* (and *standard-output*) must be specified on the
command line, and unfortunately with clisp-2.33.2 or clisp-2.35 it's
not possible to change it at run time:

[2]> (setf (stream-external-format *standard-input*) charset:iso-8859-1)
*** - SYSTEM::SET-STREAM-EXTERNAL-FORMAT on #<IO TERMINAL-STREAM> is illegal

Therefore, the speed hit may occur at the character decoding stage.

We can get exactly the same speed in Lisp as in C, on condition that
we do (about) the same thing.  C doesn't handle characters, only
bytes.  So if you want to do I/O at the same speed as in C, you must
do byte I/O.

(with-open-file (in "/tmp/test" :element-type '(unsigned-byte 8))
  (let ((buffer (make-array 1000000 :element-type '(unsigned-byte 8)
                            :fill-pointer 1000000)))
     (setf (fill-pointer buffer) (read-sequence buffer in))
     buffer))

should be as fast as the C equivalent.


Now, since most implementations extend the iso-8859-1 encoding to a
one-to-one mapping between all 256 bytes and characters, we can get
fast enough text I/O if we restrict ourselves to this encoding.

(with-open-file (in "/tmp/test" :element-type 'character
                                :external-format charset:iso-8859-1)
  (let ((buffer (make-array 1000000 :element-type 'character
                            :fill-pointer 1000000)))
     (setf (fill-pointer buffer) (read-sequence buffer in))
     buffer))

should not be much slower than the binary version, because we're
reading the same data (which takes 99.99% of the time), and we're
not doing anything fancy to convert bytes to characters.


[24]> (time (length (let ((bufsiz 4000000))
 (with-open-file (in "/tmp/test" :element-type '(unsigned-byte 8))
  (let ((buffer (make-array bufsiz :element-type '(unsigned-byte 8)
                            :fill-pointer bufsiz)))
     (setf (fill-pointer buffer) (read-sequence buffer in))
     buffer)))))
Real time: 0.119046 sec.
Run time: 0.11 sec.
Space: 4005504 Bytes
GC: 2, GC time: 0.07 sec.
4000000
[25]> (time (length (let ((bufsiz 4000000)) 
 (with-open-file (in "/tmp/test" :element-type 'character
                                 :external-format charset:iso-8859-1)
  (let ((buffer (make-array bufsiz :element-type 'character
                            :fill-pointer bufsiz)))
     (setf (fill-pointer buffer) (read-sequence buffer in))
     buffer)))))
Real time: 0.166368 sec.
Run time: 0.15 sec.
Space: 4005144 Bytes
GC: 2, GC time: 0.06 sec.
4000000


Reading the same amount of data as utf-16 shows the time spent
decoding the characters:

[38]> (time (length (let ((bufsiz 2000000)) (with-open-file (in "/tmp/test" :element-type 'character :external-format charset:utf-16)
  (let ((buffer (make-array bufsiz :element-type 'character
                            :fill-pointer bufsiz)))
     (setf (fill-pointer buffer) (read-sequence buffer in))
     buffer)))))
Real time: 0.225439 sec.
Run time: 0.23 sec.
Space: 6005152 Bytes
GC: 4, GC time: 0.15 sec.
2000000



That said, using CHARACTER instead of (UNSIGNED-BYTE 8) still takes 4
times more memory:

[26]> ARRAY-TOTAL-SIZE-LIMIT
16777215
[27]> SYSTEM::STRING-DIMENSION-LIMIT             
4194304



Finally, to show how unfair life can be, for each read, clisp does 2 mmaps
and 3 mprotects:

[···@thalassa sphere]$ strace  clisp -q -ansi -norc -x '(with-open-file (in "/tmp/test") (loop while (read-line in nil nil)))'  2>&1 |sed -e 's/(.*//'|awk '{m[$1]++}END{for(k in m){printf "%20s %5d\n",k,m[k];}}'
                 dup     1
               uname     2
                   )     1
        gettimeofday     1
                 ---  3109
            readlink     1
          exit_group     1
           getrusage    82
              stat64    63
      rt_sigprocmask    82
             lstat64    11
              execve     2
               write     2
         sigaltstack     1
              access     1
                open   102
               ioctl     4
              munmap    43
           sigreturn  3109
               mmap2  5316
             fcntl64     1
                read  2896
                 brk    22
               close    16
            mprotect  6325
        rt_sigaction    18
             fstat64    20
[···@thalassa sphere]$ strace  ./read /tmp/test 2>&1 |sed -e 's/(.*//'|awk '{m[$1]++}END{for(k in m){printf "%20s %5d\n",k,m[k];}}'
               uname     1
          exit_group     1
              stat64    31
              execve     1
                open    34
               mmap2     5
                read  2859
                 brk     5
               close     1
            mprotect     1
             fstat64     2
[···@thalassa sphere]$ cat read.c
#include <stdio.h>
    int main(int argc,char** argv){
        if(argc>1){
            char line[10000];
            FILE* in=fopen(argv[1],"r");
            while(NULL!=fgets(line,sizeof(line),in));
            return(0);}
        else{return(1);}}


But we can use read-sequence instead:

[···@thalassa sphere]$ strace  clisp -q -ansi -norc -x '(with-open-file (in "/tmp/test") (loop with buffer = (make-string 4000000) while (plusp (read-sequence buffer in))))'  2>&1 |sed -e 's/(.*//'|awk '{m[$1]++}END{for(k in m){printf "%20s %5d\n",k,m[k];}}'
                 dup     1
               uname     2
                   )     1
        gettimeofday     1
                 ---   182
            readlink     1
          exit_group     1
           getrusage     4
              stat64    63
      rt_sigprocmask     4
             lstat64    11
              execve     2
               write     2
         sigaltstack     1
              access     1
                open   102
               ioctl     4
              munmap     3
           sigreturn   182
               mmap2    48
             fcntl64     1
                read  2897
                 brk    22
               close    16
            mprotect   354
        rt_sigaction    18
             fstat64    20

-- 
"Specifications are for the weak and timid!"
From: Emilian
Subject: Re: Pure speed
Date: 
Message-ID: <1126767694.727160.281420@g49g2000cwa.googlegroups.com>
Hello,

I see that tail recursion may not be such a performance gain, but I'm
just looking for speed tips and tail recursion looked like a good
thing.

Pascal Bourguignon mentioned a few interesting things in his post.
Basically, if there is a lot of time spent at the very basic level of
character decoding then it's game over: it will never be possible to
make a fast enough Lisp program under the SPOJ settings (they do seem
to compile the code, btw).

This Lisp program

(let ((n (parse-integer (read-line) :junk-allowed t)))
   (loop repeat n do (read-line)))

can't even read the whole damn input within the time limit (that is, I
don't even get a Wrong Answer, just Time Limit Exceeded).

Is there some way to get a bigger buffer for I/O?

I'm really thinking of abandoning SPOJ now. I planned on using their
site as a Lisp-training site, but I don't see the pleasure in writing
solutions that don't have a chance of getting any points (maybe some of
them are plain wrong, but I'll never know if they don't even finish).

So I should find another way to practice my Lisp. Does anyone know of
a not-so-complex open-source Lisp project that needs some help? (Other
ideas are welcome.)

Regards,
Emilian Bold

From: Patrick Frankenberger
Subject: Re: Pure speed
Date: 
Message-ID: <dgcpkg$pd5$02$1@news.t-online.com>
Emilian wrote:
> Pascal Bourguignon mentioned a few interesting things in his post.
> Basically, if there is a lot of time spent at the very basic level of
> character decoding then it's game over: it will never be possible to
> make a fast enough Lisp program under the SPOJ settings (they do seem
> to compile the code, btw).
> 
> This Lisp program
> 
> (let ((n (parse-integer (read-line) :junk-allowed t)))
>    (loop repeat n do (read-line)))
> 
> can't even read the whole damn input within the time limit (that is, I
> don't even get a Wrong Answer, just Time Limit Exceeded).

You can read the whole thing fast enough with:

(defparameter *buf*
  (make-array 100000 :element-type '(unsigned-byte 8)))

(defun main ()
  (with-open-stream (in (ext:make-stream :input
                                         :element-type '(unsigned-byte 8)))
    (loop while (= (ext:read-byte-sequence *buf* in) 100000))))

(main)

But there are several bad things about this code:
- it is clisp-specific
- it doesn't read characters but bytes
- you need to decode the input yourself (which might very well be
  slower than the builtin decoding function)
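
If you do go that route, parsing integers straight out of the byte
buffer is not hard (a rough, untested sketch of my own;
parse-int-from-bytes is just a name I made up, and it assumes
unsigned decimal numbers):

(defun parse-int-from-bytes (buf pos end)
  ;; Skip anything that isn't an ASCII digit (codes 48..57).
  (loop while (and (< pos end)
                   (not (<= 48 (aref buf pos) 57)))
        do (incf pos))
  ;; Accumulate consecutive digits into an integer.
  (let ((n 0))
    (loop while (and (< pos end)
                     (<= 48 (aref buf pos) 57))
          do (setf n (+ (* n 10) (- (aref buf pos) 48)))
             (incf pos))
    ;; Return the number and the position just past it.
    (values n pos)))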

> I'm really thinking of abandoning SPOJ now. I planned on using their
> site as a Lisp-training site, but I don't see the pleasure in writing
> solutions that don't have a chance of getting any points (maybe some of
> them are plain wrong, but I'll never know if they don't even finish).

A lot of problems can be solved with clisp or gcl without hacking input 
or output routines. I solved "not so fast multiplication" in clisp.

Look for problems that call for algorithmic optimization instead of 
low-level performance.


Patrick
From: Joe Marshall
Subject: Re: Pure speed
Date: 
Message-ID: <8xxxq7uz.fsf@alum.mit.edu>
"Emilian" <···················@gmail.com> writes:

> Hello,
>
> I'm really thinking of abandoning SPOJ now. I planned on using their
> site as a Lisp-training site, but I don't see the pleasure in writing
> solutions that don't have a chance of getting any points (maybe some of
> them are plain wrong, but I'll never know if they don't even finish).

Just from browsing SPOJ, I see there are a few people using CLisp, so
you might not want to throw in the towel just yet.  In fact, it seems
that a number of your solutions work.

It seems that several of your solutions are just timing out.  CLisp
isn't a speed demon as I'm sure you know.

> Pascal Bourguignon wrote:

>> The problem with SPOJ is that we don't know with what options gcl and
>> clisp are launched.  I'm not even sure clisp programs are compiled!

I bet the programs are just fed in without compiling.

>> But the problem we have about I/O is encodings, and about
>> *standard-input* (and *standard-output*) is that their external-format
>> must be specified on the command line, and unfortunately with
>> clisp-2.33.2 or clisp-2.35, it's not possible to change it "at run-time":

All the SPOJ problems read integers from *standard-input*.  It seems
reasonable to poke around in the internals of CLISP to bypass the
character decoding if that's a problem.
From: Duane Rettig
Subject: Re: Pure speed
Date: 
Message-ID: <41x3st795.fsf@franz.com>
Pascal Bourguignon <····@mouse-potato.com> writes:

> Joe Marshall <·········@alum.mit.edu> writes:
>> No matter how well READ is implemented, it is a heavyweight procedure.
>> It has to fetch things from a stream, dispatch through a readtable,
>> assemble tokens, etc.
>
> And no matter how well this is implemented, it will still have to wait
> ages for the hard disk to spin, or worse, for a user to decide what to type.
>
> It's worse than before, when the ratio of CPU to disk speed was about
> 1000.  Now it's about 4000000 and getting bigger every day.

This is not the only latency that's getting relatively worse;
instruction-cycle-to-memory delays, pipeline stalls, all get more and
more expensive.  It is reminiscent of my porting Allegro CL to
the Cray architectures in the late '80s - the advice I was given
for optimal performance on the Cray was "never jump, and never
read from memory" (no smiley here; great pains were always taken
on the Cray to do everything vectorized and to use arithmetic
algorithms rather than branches to decide outcomes).

-- 
Duane Rettig    ·····@franz.com    Franz Inc.  http://www.franz.com/
555 12th St., Suite 1450               http://www.555citycenter.com/
Oakland, Ca. 94607        Phone: (510) 452-2000; Fax: (510) 452-0182   
From: Joe Marshall
Subject: Re: Pure speed
Date: 
Message-ID: <4q8p9fzq.fsf@alum.mit.edu>
···············@yahoo.com writes:

> The sad fact is that most Common Lisps don't generally implement tail
> recursion as a loop.  

This is not true.  Allegro Common Lisp and LispWorks both optimize
tail calls when the compiler optimizations are set properly.  CLisp
will optimize self tail calls.  Corman Common Lisp will as well.
CMUCL will.  I don't know about SBCL or Scieneer.  OpenMCL will.
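
A quick way to check your own implementation (an illustrative,
untested sketch): if this returns :DONE instead of overflowing the
stack, self tail calls are being turned into jumps at the current
settings.

;; Compile first; interpreted code is usually not tail-call optimized.
(defun countdown (n)
  (if (zerop n)
      :done
      (countdown (1- n))))

;; (compile 'countdown)
;; (countdown 100000000)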
From: David Steuber
Subject: Re: Pure speed
Date: 
Message-ID: <87u0gnymwn.fsf@david-steuber.com>
Joe Marshall <·········@alum.mit.edu> writes:

> ···············@yahoo.com writes:
> 
> > The sad fact is that most Common Lisps don't generally implement tail
> > recursion as a loop.  
> 
> This is not true.  Allegro Common Lisp and LispWorks both optimize
> tail calls when the compiler optimizations are set properly.  CLisp
> will optimize self tail calls.  Corman Common Lisp will as well.
> CMUCL will.  I don't know about SBCL or Scieneer.  OpenMCL will.

I don't know what differences exist between the SBCL and CMUCL
versions of the Python compiler, but SBCL will do TCO.

Haven't there been enough threads already where people complained
about TRACE not working as expected because TCO happened?  It might be
easier to enumerate the CL implementations that will never do TCO.

-- 
You should maybe check the chemical content of your breakfast
cereal. --- Bill Watterson
From: Joe Marshall
Subject: Re: Pure speed
Date: 
Message-ID: <zmqh81ap.fsf@alum.mit.edu>
"Emilian" <···················@gmail.com> writes:

> Hi,
>
> Are there any basic things I should know?
> - As far as recursion goes, I try to make my functions tail recursive,
> which should be almost as fast as a normal loop.

Tail call optimization is a space improvement and generally does not
have much impact on speed.  

> I need some hints to speed up my code.

Profile your code to find out where it is spending its time.  If you
don't use a profiler, you're just guessing where the bottlenecks are.
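
Even without an implementation-specific profiler, the standard TIME
macro gives a first cut (illustrative; TEST is Jeff's function from
upthread, and "input.txt" is a made-up filename):

(time (test "input.txt"))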
From: Jeff M.
Subject: Re: Pure speed
Date: 
Message-ID: <1126626323.671591.208870@z14g2000cwz.googlegroups.com>
> Tail call optimization is a space improvement and
> generally does not have much impact on speed.

Wow! I can't believe I'm reading that in this group. Yes it does.

In the games industry you'd be amazed. We routinely go through (albeit
in C) recursive solutions to problems and make them iterative. This
proves to be a big win in speed: traversing physics collision trees,
searching packfiles, etc. If C had tail-call optimization, many
headaches would go away.

Jeff M.
From: ··············@hotmail.com
Subject: Re: Pure speed
Date: 
Message-ID: <1126628340.354575.71250@g14g2000cwa.googlegroups.com>
Jeff M. wrote:
> > Tail call optimization is a space improvement and
> > generally does not have much impact on speed.
>
> Wow! I can't believe I'm reading that in this group. Yes it does.
>

[C anecdotal data snipped]

Are you suggesting you have real data collected on a Lisp
implementation that backs up your belief, or are you just guessing?

Implementors of commercial Lisp compilers have suggested here in the
past that the speed benefits of tail-call optimization are negligible
or negative.
From: Jeff M.
Subject: Re: Pure speed
Date: 
Message-ID: <1126629129.389057.143850@g14g2000cwa.googlegroups.com>
> Are you suggesting you have real data collected on a Lisp
> implementation that backs up your belief, or are you just
> guessing?

On Lisp, no. Other languages (mainly C/C++ and Forth), yes.

> the speed benefits of tail-call optimization are negligible
> or negative.

1. They can't be negative (I'd love to see a case where that's true).
2. That doesn't make sense.

With tail call optimization (converting a call to a jmp), there are
many benefits, not the least of which are:

* No memory access (pushing register to stack and popping it)
* Instruction pre-fetching on many processors can take advantage of the
jump.

The only situation where it would /appear/ that tail-call optimization
doesn't buy anything would be in a recursive algorithm with multiple
recursion points (ie, the naive fib algorithm) or something similar.
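
For reference, the naive fib (an illustrative sketch): neither
recursive call is in tail position -- both are arguments to + -- so
tail-call optimization buys nothing here.

(defun fib (n)
  (if (< n 2)
      n
      (+ (fib (- n 1)) (fib (- n 2)))))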

Jeff M.
From: Ulrich Hobelmann
Subject: Re: Pure speed
Date: 
Message-ID: <3oogtvF6rgeiU1@individual.net>
Jeff M. wrote:
> * No memory access (pushing register to stack and popping it)
> * Instruction pre-fetching on many processors can take advantage of the
> jump.
> 
> The only situation where it would /appear/ that tail-call optimization
> doesn't buy anything would be in a recursive algorithm with multiple
> recursion points (ie, the naive fib algorithm) or something similar.

I think that just turning the call/ret into a jump wouldn't buy much 
speed (but space), but the fact that some compilers might turn the 
function body into an iterative loop (with all the optimizations that 
entails) does.  C's loop optimization is what you profit from when you 
turn C tail-calling functions into iterative loops.  It's not just the 
calling overhead.

-- 
My mouth says the words, my brain is thinking monstertrucks.
Joey (Friends)
From: Duane Rettig
Subject: Re: Pure speed
Date: 
Message-ID: <464t4t7m4.fsf@franz.com>
"Jeff M." <·······@gmail.com> writes:

>> Are you suggesting you have real data collected on a Lisp
>> implementation that backs up your belief, or are you just
>> guessing?
>
> On Lisp, no. Other languages (mainly C/C++ and Forth), yes.
>
>> the speed benefits of tail-call optimization are negligible
>> or negative.
>
> 1. They can't be negative (I'd love to see a case where that's true).

I certainly wouldn't be able to show you one.  As an implementor, I'd
have to hang my head in shame if I allowed that :-)

It is certainly possible to turn any tail position call into a jump,
but it is not always worthwhile.  Some situations require all sorts
of manipulations in-place on the stack which are simply better handled
by not doing the "optimization".  If the called function accepts more
arguments than the caller has received, for example, _and_ if both caller
and callee are passing more arguments than are normally passed in
registers, then some significant stack manipulation might be necessary
that otherwise was better handled by simply pushing the arguments deeper
onto the stack.

> 2. That doesn't make sense.

Which is precisely why we don't do it.

> With tail call optimization (converting a call to a jmp), there are
> many benefits, not the least of which are:
>
> * No memory access (pushing register to stack and popping it)

This is only true for trivial cases.  Those cases, indeed, are
well-suited for tail-call merging.  But if, in between your caller
and its tail call, you have called something else not in tail
position, then you will have had to build up stack space and tear
it down anyway.  See the example below.

> * Instruction pre-fetching on many processors can take advantage of the
> jump.

Branch call/return prediction stacks can take advantage of calls and
returns, as well.  A call is no less predictable than a jump, since
both are unconditional.

> The only situation where it would /appear/ that tail-call optimization
> doesn't buy anything would be in a recursive algorithm with multiple
> recursion points (ie, the naive fib algorithm) or something similar.

And how rare do you really think such non-tail recursions are?

Here's a transcript showing where tail-call merging _is_ beneficial
to speed (the first two examples) and where tail-call merging is
not so fast.  We don't merge tail-calls when there are more arguments
to the callee than the caller had received, so I can't show you an
example of where the timing would have been worse; we simply don't
provide for it.

CL-USER(1): (compile
              (defun foo ()
                (declare (optimize speed (safety 0) (debug 2))
                         (:explain :tailmerging))
                (bar)))
failed to merge a NON-SELF-tail-call to BAR because tail-call-non-self-merge-switch is returning nil
Warning: While compiling these undefined functions were referenced: BAR.
FOO
NIL
NIL
CL-USER(2): (disassemble *)
;; disassembly of #<Function FOO>
;; formals: 
;; constant vector:
0: BAR

;; code start: #x1065be9c:
   0: 55          pushl	ebp
   1: 8b ec       movl	ebp,esp
   3: 56          pushl	esi
   4: 83 ec 24   subl	esp,$36
   7: 8b 5e 12   movl	ebx,[esi+18]  ; BAR
  10: b1 00       movb	cl,$0
  12: ff d7       call	*edi
  14: c9          leave
  15: 8b 75 fc   movl	esi,[ebp-4]
  18: c3          ret
  19: 90          nop
CL-USER(3): (compile
              (defun foo ()
                (declare (optimize speed (safety 0) (debug 0))
                         (:explain :tailmerging))
                (bar)))
;  tail-merging a NON-SELF call to BAR
Warning: While compiling these undefined functions were referenced: BAR.
FOO
NIL
NIL
CL-USER(4): (disassemble *)
;; disassembly of #<Function FOO>
;; formals: 
;; constant vector:
0: BAR

;; code start: #x10660fc4:
   0: 8b 5e 12   movl	ebx,[esi+18]  ; BAR
   3: b1 00       movb	cl,$0
   5: ff e7       jmp	*edi
   7: 90          nop
CL-USER(5): (compile
              (defun foo ()
                (declare (optimize speed (safety 0) (debug 2))
                         (:explain :tailmerging))
                (bas)
                (bar)))
failed to merge a NON-SELF-tail-call to BAR because tail-call-non-self-merge-switch is returning nil
Warning: While compiling these undefined functions were referenced: BAR, BAS.
FOO
NIL
NIL
CL-USER(6): (disassemble *)
;; disassembly of #<Function FOO>
;; formals: 
;; constant vector:
0: BAS
1: BAR

;; code start: #x1066337c:
   0: 55          pushl	ebp
   1: 8b ec       movl	ebp,esp
   3: 56          pushl	esi
   4: 83 ec 24   subl	esp,$36
   7: 8b 5e 12   movl	ebx,[esi+18]  ; BAS
  10: b1 00       movb	cl,$0
  12: ff d7       call	*edi
  14: 8b 5e 16   movl	ebx,[esi+22]  ; BAR
  17: b1 00       movb	cl,$0
  19: ff d7       call	*edi
  21: c9          leave
  22: 8b 75 fc   movl	esi,[ebp-4]
  25: c3          ret
CL-USER(7): (compile
              (defun foo ()
                (declare (optimize speed (safety 0) (debug 0))
                         (:explain :tailmerging))
                (bas)
                (bar)))
;  tail-merging a NON-SELF call to BAR
Warning: While compiling these undefined functions were referenced: BAR, BAS.
FOO
NIL
NIL
CL-USER(8): (disassemble *)
;; disassembly of #<Function FOO>
;; formals: 
;; constant vector:
0: BAS
1: BAR

;; code start: #x10665acc:
   0: 55          pushl	ebp
   1: 8b ec       movl	ebp,esp
   3: 56          pushl	esi
   4: 83 ec 24   subl	esp,$36
   7: 8b 5e 12   movl	ebx,[esi+18]  ; BAS
  10: b1 00       movb	cl,$0
  12: ff d7       call	*edi
  14: c9          leave
  15: 8b 5e 16   movl	ebx,[esi+22]  ; BAR
  18: b1 00       movb	cl,$0
  20: ff e7       jmp	*edi
CL-USER(9): 

Note that in the last two examples, the tail-merged version is indeed
faster than the non-tail-merged version, but not by much; it still
has to build up stack space for the non-tail call to bas, and then
remove it before or after calling bar.  The real optimization is in
the space, not in the speed.  And the real tradeoff is stack space
vs. debuggability, rather than speed: you gain stack reuse at the
expense of foo no longer being on the stack when bar is called (which
matters when debugging the calling sequence).

-- 
Duane Rettig    ·····@franz.com    Franz Inc.  http://www.franz.com/
555 12th St., Suite 1450               http://www.555citycenter.com/
Oakland, Ca. 94607        Phone: (510) 452-2000; Fax: (510) 452-0182   
From: ·······@gmail.com
Subject: Re: Pure speed
Date: 
Message-ID: <1126633331.714256.204980@z14g2000cwz.googlegroups.com>
Jeff M. wrote:
> > Tail call optimization is a space improvement and
> > generally does not have much impact on speed.
>
> Wow! I can't believe I'm reading that in this group. Yes it does.
>
> In the games industry you'd be amazed. We routinely go through (albeit
> in C) recursive solutions to problems and make them iterative. This
> proves to be a big win in speed: traversing physics collision trees,
> searching packfiles, etc. If C had tail-call optimization, many
> headaches would go away.

Tail call optimization is a space improvement because you avoid pushing
and popping stack frames at each recursive call, which can also lead to
a speed improvement.  Space is primary; speed is secondary.

You are also getting the properties of the language and a particular
implementation mixed up.  Common Lisp does not mandate tail call
optimization, yet many popular implementations provide it.  C does not
mandate tail call optimization, yet GCC and Digital's C compiler for
Alpha provide it--and Digital's does a *very* good job at it.  If your
C compiler doesn't have it, maybe you should talk to your vendor.

-Nathan
From: Ulrich Hobelmann
Subject: Re: Pure speed
Date: 
Message-ID: <3oohvaF6u983U2@individual.net>
·······@gmail.com wrote:
> You are also getting the properties of the language and a particular
> implementation mixed up.  Common Lisp does not mandate tail call
> optimization, yet many popular implementations provide it.  C does not
> mandate tail call optimization, yet GCC and Digital's C compiler for
> Alpha provide it--and Digital's does a *very* good job at it.  If your
> C compiler doesn't have it, maybe you should talk to your vendor.

AFAIK gcc only optimizes local tail-calls (in the same function) into 
jumps, not general tail-calls.

-- 
My mouth says the words, my brain is thinking monstertrucks.
Joey (Friends)
From: Lars Brinkhoff
Subject: Re: Pure speed
Date: 
Message-ID: <858xy0q1pl.fsf@junk.nocrew.org>
Ulrich Hobelmann <···········@web.de> writes:
> AFAIK gcc only optimizes local tail-calls (in the same function)
> into jumps, not general tail-calls.

This information may be obsolete, but as far as I recall from my GCC
hacking days, it depends on the back end.  The "middle end" provides
some indications that a call can be turned into a jump, but it's up to
the back end to do something about it.  I know I went to great lengths
to optimize as many calls as possible in my PDP-10 back end.
From: Marcin 'Qrczak' Kowalczyk
Subject: Re: Pure speed
Date: 
Message-ID: <87slw8eoji.fsf@qrnik.zagroda>
Ulrich Hobelmann <···········@web.de> writes:

> AFAIK gcc only optimizes local tail-calls (in the same function)
> into jumps, not general tail-calls.

It optimizes some tail calls to other functions too.

With the calling convention where the caller pops the stack
(the default for C on platforms I know), tail call optimization is
impossible if the caller has a smaller stack frame than the callee.

If addresses of local variables of the caller are taken, it might be
hard for the compiler to prove that the variables are not used after
the call.

On architectures where arguments are passed on the stack, shuffling
them in-place might be less efficient than making another stack frame.

-- 
   __("<         Marcin Kowalczyk
   \__/       ······@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
From: Joe Marshall
Subject: Re: Pure speed
Date: 
Message-ID: <slw8ili4.fsf@alum.mit.edu>
"Jeff M." <·······@gmail.com> writes:

>> Tail call optimization is a space improvement and
>> generally does not have much impact on speed.
>
> Wow! I can't believe I'm reading that in this group. Yes it does.

You'd be amazed at what you can read in this group.

> In the games industry you'd be amazed.  We routinely go through (albeit
> in C) recursive solutions to problems and make them iterative.  This
> proves to be a big win in speed:  traversing physics collision trees,
> searching packfiles, etc.  If C had tail-call optimization, many
> headaches would go away.

That is a bit of a different story.  When you go through a recursive
solution and rework it to be iterative you have much more information
available than a compiler typically has.  Compiler tail call
optimization generally works by noting that there is no `further work'
to be done when calling out to another function, so rather than
waiting for the caller to return before discarding the stack frame,
the stack frame is discarded immediately.

The easiest, most naive thing to do is to arrange for the stack frame
of the callee to be moved down on top of the stack frame of the caller
just before the call.  This is slower than a normal call because it is
exactly a normal call plus a block copy.  If the compiler is clever,
it can incrementally destroy the current stack frame as it builds the
new one, but still, in the general case, it may need to keep the old
stack frame around until the new one is completely constructed.

Of course, function calling in C is generally more expensive than it
need be (see Steele:  Debunking the Expensive Procedure Call Myth or,
Procedure Call Implementations Considered Harmful or, LAMBDA: The
Ultimate GOTO), so reworking a function into a loop will have a
greater effect.
From: Ulrich Hobelmann
Subject: Re: Pure speed
Date: 
Message-ID: <3oq4q3F6tpigU1@individual.net>
Joe Marshall wrote:
> That is a bit of a different story.  When you go through a recursive
> solution and rework it to be iterative you have much more information
> available than a compiler typically has.  Compiler tail call
> optimization generally works by noting that there is no `further work'
> to be done when calling out to another function, so rather than
> waiting for the caller to return before discarding the stack frame,
> the stack frame is discarded immediately.

And that's why, in theory, it shouldn't take the tiniest bit of extra work.

> The easiest, most naive thing to do is to arrange for the stack frame
> of the callee to be moved down on top of the stack frame of the caller
> just before the call.  This is slower than a normal call because it is
> exactly a normal call plus a block copy.  If the compiler is clever,
> it can incrementally destroy the current stack frame as it builds the
> new one, but still, in the general case, it may need to keep the old
> stack frame around until the new one is completely constructed.

Well, on architectures that were actually *designed* (not thrown 
together), like most RISC machines, all you have to do is maybe reorder 
values in three or so registers, restore the callee-save stuff (which 
you have to do later anyway), and jump.  Actually the PPC gcc does this 
once in a while, much more often than the x86 one.  Usually the values 
already exist in r5, r6, r7 (just an example), so it's not a block copy, 
just three register moves that can be pipelined perfectly, using maybe 
ONE cycle.  The big cost is still the jump/call/indirect jump, and 
fetching the new code stream.

> Of course, function calling in C is generally more expensive than it
> need be (see Steele:  Debunking the Expensive Procedure Call Myth or,
> Procedure Call Implementations Considered Harmful or, LAMBDA: The
> Ultimate GOTO), so reworking a function into a loop will have a
> greater effect.

Yep.  But it's interesting how much faster you can make function calls 
even on PPC, and still be within the C calling conventions (and save 
maybe 20-60 bytes per call).  Some people were smoking serious s**t.

-- 
My mouth says the words, my brain is thinking monstertrucks.
Joey (Friends)
From: Joe Marshall
Subject: Re: Pure speed
Date: 
Message-ID: <ll1zep8s.fsf@alum.mit.edu>
> Joe Marshall wrote:

>> If the compiler is clever, it can incrementally destroy the current
>> stack frame as it builds the new one, but still, in the general
>> case, it may need to keep the old stack frame around until the new
>> one is completely constructed.

Ulrich Hobelmann <···········@web.de> writes:

> Well, on architectures that were actually *designed* (not thrown
> together), like most RISC machines, all you have to do is maybe
> reorder values in three or so registers, restore the callee-save
> stuff (which you have to do later anyway), and jump.

I designed the function calling hardware for the LMI K-machine.  It
had non-overlapping register windows and a hardware freelist for stack
frames so you could free register frames out of order.  Tail call
optimization was supported in hardware by changing the register frame
pointer.  The return address was managed on a separate stack, so there
was no work to be done there.  In the general case, tail calling ended
up faster than trying to shuffle registers and looping.

> Yep.  But it's interesting how much faster you can make function calls
> even on PPC, and still be within the C calling conventions (and save
> maybe 20-60 bytes per call).  Some people were smoking serious s**t.

They still are.  The standard CDECL calling convention puts some
boilerplate stack frame manipulation at the beginning and end of every
function.  This saves the old base of frame, adjusts the base of frame
for the new frame, allocates locals, and undoes it all at the end of
the function.  But the x86 chips these days specially recognize and
optimize this particular sequence of instructions so they execute
extremely fast.  Ugh.
From: Rob Thorpe
Subject: Re: Pure speed
Date: 
Message-ID: <1126708601.068674.302110@o13g2000cwo.googlegroups.com>
> > Yep.  But it's interesting how much faster you can make function calls
> > even on PPC, and still be within the C calling conventions (and save
> > maybe 20-60 bytes per call).  Some people were smoking serious s**t.
>
> They still are.  The standard CDECL calling convention puts some
> boilerplate stack frame manipulation at the beginning and end of every
> function.  This saves the old base of frame, adjusts the base of frame
> for the new frame, allocates locals, and undoes it all at the end of
> the function.  But the x86 chips these days specially recognize and
> optimize this particular sequence of instructions so they execute
> extremely fast.  Ugh.

I don't think they do.  Have you seen any of the manufacturer's
literature indicate this?

Some machines certainly accelerate stack operations in general.
From: Joe Marshall
Subject: Re: Pure speed
Date: 
Message-ID: <zmqfd3ou.fsf@alum.mit.edu>
"Rob Thorpe" <·············@antenova.com> writes:

>> > Yep.  But it's interesting how much faster you can make function calls
>> > even on PPC, and still be within the C calling conventions (and save
>> > maybe 20-60 bytes per call).  Some people were smoking serious s**t.
>>
>> They still are.  The standard CDECL calling convention puts some
>> boilerplate stack frame manipulation at the beginning and end of every
>> function.  This saves the old base of frame, adjusts the base of frame
>> for the new frame, allocates locals, and undoes it all at the end of
>> the function.  But the x86 chips these days specially recognize and
>> optimize this particular sequence of instructions so they execute
>> extremely fast.  Ugh.
>
> I don't think they do.  Have you seen any of the manufacturer's
> literature indicate this?

I thought I did, but I can't find it now.  Maybe they don't.
From: Ulrich Hobelmann
Subject: Re: Pure speed
Date: 
Message-ID: <3or69bF77h3bU1@individual.net>
Joe Marshall wrote:
> "Rob Thorpe" <·············@antenova.com> writes:
> 
>>>> Yep.  But it's interesting how much faster you can make function calls
>>>> even on PPC, and still be within the C calling conventions (and save
>>>> maybe 20-60 bytes per call).  Some people were smoking serious s**t.
>>> They still are.  The standard CDECL calling convention puts some
>>> boilerplate stack frame manipulation at the beginning and end of every
>>> function.  This saves the old base of frame, adjusts the base of frame
>>> for the new frame, allocates locals, and undoes it all at the end of
>>> the function.  But the x86 chips these days specially recognize and
>>> optimize this particular sequence of instructions so they execute
>>> extremely fast.  Ugh.
>> I don't think they do.  Have you seen any of the manufacturer's
>> literature indicate this?
> 
> I thought I did, but I can't find it now.  Maybe they don't.

I think most code is stack-pointer + displacement loads and then the 
usual call-return; I haven't seen actual "push" and "pop" in x86 code 
for a while.

The x86 has an internal buffer to predict and prefetch the return stack 
"ret" target, IIRC, so it's not slowed down by having to read the return 
stack from memory.  Of course that means popping or modifying the return 
stack is *very* expensive.

-- 
My mouth says the words, my brain is thinking monstertrucks.
Joey (Friends)
From: Rob Thorpe
Subject: Re: Pure speed
Date: 
Message-ID: <1126891705.894012.24270@f14g2000cwb.googlegroups.com>
Ulrich Hobelmann wrote:
> Joe Marshall wrote:
> > "Rob Thorpe" <·············@antenova.com> writes:
> >
> >>>> Yep.  But it's interesting how much faster you can make function calls
> >>>> even on PPC, and still be within the C calling conventions (and save
> >>>> maybe 20-60 bytes per call).  Some people were smoking serious s**t.
> >>> They still are.  The standard CDECL calling convention puts some
> >>> boilerplate stack frame manipulation at the beginning and end of every
> >>> function.  This saves the old base of frame, adjusts the base of frame
> >>> for the new frame, allocates locals, and undoes it all at the end of
> >>> the function.  But the x86 chips these days specially recognize and
> >>> optimize this particular sequence of instructions so they execute
> >>> extremely fast.  Ugh.
> >> I don't think they do.  Have you seen any of the manufacturer's
> >> literature indicate this?
> >
> > I thought I did, but I can't find it now.  Maybe they don't.
>
> I think most code is stack-pointer + displacement loads and then the
> usual call-return; I haven't seen actual "push" and "pop" in x86 code
> for a while.

Yes, they're not that useful; adjusting the frame with stack-pointer
arithmetic is more flexible.

> The x86 has an internal buffer to predict and prefetch the return stack
> "ret" target, IIRC, so it's not slowed down by having to read the return
> stack from memory.  Of course that means popping or modifying the return
> stack is *very* expensive.

Don't confuse the architecture with the microarchitecture; some x86
machines work as you describe, but not all.

The Pentium M has a "stack manager" that preprocesses stack operations
before they reach the execution units; the operations then execute
with the aid of the execution units.

Pentium 4s and Athlons do it the way you describe, but for calls only,
with a separate return stack on the processor, I think.

I don't know about the PPro, PII, or PIII.

The original Pentium did something odd that I can't remember.

Most of these internal stacks have different depths.
From: elof
Subject: Re: Pure speed
Date: 
Message-ID: <pan.2005.09.13.19.23.51.595647@image.dk>
On Tue, 13 Sep 2005 06:02:52 -0700, Emilian wrote:

> Hy,
> 
> With an advice from c.l.l[1] I discovered http://spoj.sphere.pl/ which
> allows me to submit lisp solutions to some contest problems.
> 
> It's all nice and I've become a little format and loop master(never
> used them that much in the uni) but there is one thing that's bothering
> me: SPEED.
> 
> I'm using CLISP and managed quite often for some tight time-restricted
> problems to get Time Limit Exceeded.
> 
> Of course, my lisp code / algorithm could be inefficient but even the
> raw io speed test( http://spoj.sphere.pl/problems/INTEST/ ) fails in my
> case (it should get about 1MB/s).
> 
> Are there any basic things I should know ?
> - as far of recursion, I try to make them tail recursive which should
> be almost as fast as a normal loop.
> - where do I get some description of (declare) ? I found some code on
> the net with (declare (fixnum)) (declare (optimize)) but I would really
> like some complete guide.
> - what functions should I use for I/O? Most of the time I use plain
> (read) for  numbers and (read-line) for strings. Is (parse-integer
> (read-line)) better than (read) ? As far as I understand (format) is
> really slow. Any specific replacement available ?
> 
> Please note that I'm not trying to destroy the spirit of lisp for the
> sake of speed. That is, I would prefer code that I'm able to understand
> and that feels right than code that is just full of (declare)s. But --
> I need some hints to speed up my code.
> 
> Regards,
> Emilian Bold


Well, if you post your code, we can comment on it and maybe help you better.

format is not slow as such, but if you do something like

(format file "~D~%" (random (expt 10 9)))

for each line written to the file, and something like

(read file) 

for each line read back in, then you are asking
Lisp to ask the operating system to perform a full system call and a write
to or read from the filesystem for each line. So one million numbers
written or read leads to one million system calls, and that takes a lot of
time. Your Lisp program will spend most of its time twiddling its
virtual thumbs waiting for the OS and its filesystem to return. This would
be the same in Java, Perl, C++, etc., by the way.
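
On the writing side, one way to cut the per-line calls, as a rough,
untested sketch (the function name is my own invention, not part of the
code below): build the whole output in a string stream and hand it to
the file in a single write-string.

;; Untested sketch: accumulate all output in memory with
;; WITH-OUTPUT-TO-STRING, then write it to the file in one go,
;; instead of one FORMAT call per line against the file stream.
(defun generate-lots-of-numbers-batched (&key (numbers (expt 10 6))
                                              (filename "/tmp/lotsofnumbers"))
  (let ((text (with-output-to-string (s)
                (format s "~D ~D~%" numbers (random (expt 10 7)))
                (loop repeat numbers
                      do (format s "~D~%" (random (expt 10 9)))))))
    (with-open-file (file filename :direction :output :if-exists :supersede)
      (write-string text file))))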

I made an implementation of the challenge at
http://spoj.sphere.pl/problems/INTEST/ that uses exactly this naive
line-at-a-time approach and got runtimes of roughly 4 MB per second
using CLISP and roughly 1.4 MB per second using CMUCL on a 1.3 GHz
Centrino laptop running Linux. So even with such a slow approach the
challenge can be passed.

However, it would likely be worthwhile to explore an approach where the
file is read into memory in larger chunks of, say, 8 to 64 KB at a time,
and the numbers extracted from that in-memory buffer.
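
A minimal, untested sketch of that chunked approach (the buffer size,
the helper names, and the hand-rolled digit parsing are my own
assumptions; it expects non-negative integers separated by whitespace,
as in the INTEST input):

;; Untested sketch: pull the file in with READ-SEQUENCE in large
;; chunks and parse unsigned decimal integers by hand, instead of
;; calling READ once per number.
(defun process-lots-of-numbers-chunked (&key (filename "/tmp/lotsofnumbers")
                                             (buffer-size 65536))
  (with-open-file (file filename :direction :input)
    (let ((buffer (make-string buffer-size))
          (end 0)
          (pos 0))
      (labels ((refill ()
                 ;; Returns true while the file still has data.
                 (setf end (read-sequence buffer file)
                       pos 0)
                 (plusp end))
               (next-char ()
                 (if (or (< pos end) (refill))
                     (prog1 (char buffer pos) (incf pos))
                     nil))
               (next-integer ()
                 ;; Skip separators, then accumulate decimal digits.
                 (let ((ch (next-char)))
                   (loop while (and ch (not (digit-char-p ch)))
                         do (setf ch (next-char)))
                   (when ch
                     (let ((n 0))
                       (loop while (and ch (digit-char-p ch))
                             do (setf n (+ (* n 10) (digit-char-p ch))
                                      ch (next-char)))
                       n)))))
        (let ((n (next-integer))
              (k (next-integer))
              (divisible 0))
          (loop repeat n
                do (when (zerop (mod (next-integer) k))
                     (incf divisible)))
          divisible)))))

The naive line-at-a-time implementation, which produced the timings
below, is: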

(in-package :cl-user)
(defpackage :lotsofnumbers
  (:use :common-lisp)
  (:export generate-lots-of-numbers-naive process-lots-of-numbers-naive)
  (:documentation "http://spoj.sphere.pl/problems/INTEST/ and c.l.l, 13 September 2005"))
(in-package :lotsofnumbers)

(defun generate-lots-of-numbers-naive (&key (numbers (expt 10 6)) (filename "/tmp/lotsofnumbers"))
  "The naive approach. Issues a system call per line written to the filesystem."
  (with-open-file (file filename :if-exists :supersede :direction :output)
    (let ((k (random (expt 10 7))))
      ;; First line of the input: the count of numbers and the divisor k.
      (format file "~D ~D~%" numbers k)
      (loop repeat numbers
            do (format file "~D~%" (random (expt 10 9)))))))

(defun process-lots-of-numbers-naive (&key (filename "/tmp/lotsofnumbers"))
  "The naive approach. Issues a system call per line read from the filesystem."
  (with-open-file (file filename :direction :input)
    (do ((n (read file) (1- n))   ; how many numbers remain to be read
         (k (read file))          ; the divisor
         (divisible 0))           ; how many numbers k divides evenly
        ((= n 0) divisible)
      (when (zerop (mod (read file) k))
        (incf divisible)))))

····@billbox:~/prog/tests/lisp/tests$ clisp
WARNING: *FOREIGN-ENCODING*: reset to ASCII
  i i i i i i i       ooooo    o        ooooooo   ooooo   ooooo
  I I I I I I I      8     8   8           8     8     o  8    8
  I  \ `+' /  I      8         8           8     8        8    8
   \  `-+-'  /       8         8           8      ooooo   8oooo
    `-__|__-'        8         8           8           8  8
        |            8     o   8           8     o     8  8
  ------+------       ooooo    8oooooo  ooo8ooo   ooooo   8

Copyright (c) Bruno Haible, Michael Stoll 1992, 1993
Copyright (c) Bruno Haible, Marcus Daniels 1994-1997
Copyright (c) Bruno Haible, Pierpaolo Bernardi, Sam Steingold 1998
Copyright (c) Bruno Haible, Sam Steingold 1999-2000
Copyright (c) Sam Steingold, Bruno Haible 2001-2004


[1]> (compile-file "lotsofnumbers.lisp")

Compiling file /home/elof/prog/tests/lisp/tests/lotsofnumbers.lisp ...

Wrote file /home/elof/prog/tests/lisp/tests/lotsofnumbers.fas
0 errors, 0 warnings
#P"/home/elof/prog/tests/lisp/tests/lotsofnumbers.fas" ;
NIL ;
NIL
[2]> (load "lotsofnumbers")
;; Loading file /home/elof/prog/tests/lisp/tests/lotsofnumbers.fas ...
;; Loaded file /home/elof/prog/tests/lisp/tests/lotsofnumbers.fas
T
[3]> (time (lotsofnumbers:generate-lots-of-numbers-naive))

Real time: 7.737365 sec.
Run time: 7.480863 sec.
Space: 375734840 Bytes
GC: 716, GC time: 0.93785 sec.
NIL
[4]> (time (lotsofnumbers:process-lots-of-numbers-naive))

Real time: 3.278943 sec.
Run time: 3.162519 sec.
Space: 15734184 Bytes
GC: 30, GC time: 0.053993 sec.
0
[5]> (lisp-implementation-version)
"2.33.2 (2004-06-02) (built on mcmurdo.warthogs.hbd.com [82.211.81.147])"
[6]> (quit)
Bye.

····@billbox:~/prog/tests/lisp/tests$ ls -l /tmp/lotsofnumbers
-rw-r--r--  1 elof elof 9888489 2005-09-13 21:16 /tmp/lotsofnumbers

····@billbox:~/prog/tests/lisp/tests$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 9
model name      : Intel(R) Pentium(R) M processor 1300MHz
stepping        : 5
cpu MHz         : 1296.155
cache size      : 1024 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr mce cx8 sep mtrr pge mca cmov pat clflush dts acpi mmx fxsr sse sse2 tm pbe est tm2
bogomips        : 2564.09

····@billbox:~/prog/tests/lisp/tests$ uname -a
Linux billbox 2.6.10-5-386 #1 Thu Aug 18 22:23:56 UTC 2005 i686 GNU/Linux


	Kristian