From: Ralph Richard Cook
Subject: Asking for Alioth shootout help on "fasta" benchmark
Date: 
Message-ID: <42bcd573.7722464@newsgroups.bellsouth.net>
I'm working on a CMUCL submission to the Alioth shootout; it's a new
submission to the "fasta" benchmark. Right now my version runs in
about 30 seconds on my machine, but the Java version runs in only 4.
Since I've made a new submission for reverse complement that's faster
than that Java version, I'm hoping that I can get some assistance to
speed this up.

After placing some time functions I believe I've narrowed down my
problem to the "select-random" function, that you'll see in a minute.
I've ripped out anything not needed to show the problem. I included a
time on the gen_random function to show that it doesn't take up the
majority of the time.

When I compile I see warnings about being unable to inline, but I
don't know enough to change the type declarations to clean this up.

Different languages in the shootout went about selecting a value in
different ways. Some did a linear search, some did a binary search.
I've put both versions in the example, but the timing on them is about
the same.

The benchmark says that the floats need to be double floats.

I'm separating files with -----, you run help.cmucl_compile.sh and
help.cmucl_run.sh do to things. The file breakdown is similar to other
benchmarks.

Thanks in advance, and I plan on including credits for those who
assist, unless asked otherwise.

-- "help.lisp" below ---------------------
(defconstant IM     139968)
(defconstant IA       3877)
(defconstant IC     29573)

(defparameter THE_LAST 42)

(declaim (inline gen_random))
(defun gen_random (max)
  (declare (type (signed-byte 32) IM IA IC THE_LAST))
  (declare (double-float max))
  (setq THE_LAST (mod (+ (* THE_LAST IA) IC) IM))
  (/ (* max THE_LAST) IM))

(defstruct (freq (:type vector))
  (c #\x :type base-char)
  (p 0.0d0 :type double-float))

(defparameter HomoSapiens (make-array 0 :adjustable t :fill-pointer
0))
   (vector-push-extend (make-freq :c #\a :p 0.3029549426680d0)
HomoSapiens)
   (vector-push-extend (make-freq :c #\c :p 0.1979883004921d0)
HomoSapiens)
   (vector-push-extend (make-freq :c #\g :p 0.1975473066391d0)
HomoSapiens)
   (vector-push-extend (make-freq :c #\t :p 0.3015094502008d0)
HomoSapiens)

(defun make-cumulative (freqs)
  (let ((cp 0.0d0))
    (declare (double-float cp))
      (dotimes (i (length freqs))
        (incf cp (freq-p (aref freqs i)))
        (setf (freq-p (aref freqs i)) cp))))

(declaim (inline select-random-1))
(defun select-random-1 (freqs len)
  (declare (fixnum len))
  (let ((r (gen_random 1.0d0)))
       (declare (double-float r))
       (dotimes (i len)
         (when (< r (freq-p (aref freqs i)))
           (return-from select-random-1 (freq-c (aref freqs i)))))
       (freq-c (aref freqs (1- len)))))


(declaim (inline select-random-2))
(defun select-random-2 (freqs len)
  (declare (fixnum len))
  (let ((r (gen_random 1.0d0)))
       (declare (double-float r))
       (do*
         ((lo 0)
          (hi (1- len))
          (i (floor (+ hi lo) 2) (floor (+ hi lo) 2)))
         ((<= hi (1+ lo)) (freq-c (aref freqs hi)))
         (declare (fixnum lo hi i))
         (if (< r (freq-p (aref freqs i))) (setf hi i) (setf lo i)))))

(defun main ()

    (make-cumulative HomoSapiens)

    (time (dotimes (i 7500000) (gen_random 1.0d0)))
    (time (dotimes (i 7500000) (select-random-1 HomoSapiens (length
HomoSapiens))))
    (time (dotimes (i 7500000) (select-random-2 HomoSapiens (length
HomoSapiens)))))
-- "help.cmucl_compile" below ------------
(proclaim '(optimize (speed 3) (safety 0) (debug 0) (compilation-speed
0) (space 0)))
(compile-file "help.lisp" :block-compile t  :entry-points '(main))
(quit)
-- "help.cmucl_compile.sh" below ---------
/usr/local/bin/lisp -noinit -batch -eval "(load
\"help.cmucl_compile\")"
-- "help.cmucl_run" below ----------------
(proclaim '(optimize (speed 3) (safety 0) (debug 0) (compilation-speed
0) (space 0)))
(setq *gc-verbose* nil)
(load "help.x86f" :verbose nil :print nil) ; change to help.ppcf for
macs
(main) (quit)
-- "help.cmucl_run.sh" below -------------
/usr/local/bin/lisp -noinit -batch -eval "(load \"help.cmucl_run\"
:verbose nil :print nil)"

From: Manuel Giraud
Subject: Re: Asking for Alioth shootout help on "fasta" benchmark
Date: 
Message-ID: <87ll4yetpm.fsf@pc-m-ronsard.cetp.ipsl.fr>
·············@bellsouth.net (Ralph Richard Cook) writes:

> I'm working on a CMUCL submission to the Alioth shootout; it's a new
> submission to the "fasta" benchmark. Right now my version runs in
> about 30 seconds on my machine, but the Java version runs in only 4.
> Since I've made a new submission for reverse complement that's faster
> than that Java version, I'm hoping that I can get some assistance to
> speed this up.

I've translated a version of fasta to CL. It is far less elaborate than
yours, but I think the problem with this benchmark is that it's a stream
I/O benchmark (and AFAIK, CMUCL is not so good at that). For instance,
to print output switching from 'format' to 'princ' to 'myprint'� seems
to improve things a lot.

� (defun my-print (obj)
    (let ((*print-array* nil)
	  (*print-base* nil)
	  (*print-case* nil)
	  (*print-circle* nil)
	  (*print-escape* nil)
	  (*print-gensym* nil)
	  (*print-length* nil)
	  (*print-level* nil)
	  (*print-lines* nil)
	  (*print-miser-width* nil)
	  (*print-pprint-dispatch* nil)
	  (*print-pretty* nil)
	  (*print-radix* nil)
	  (*print-readably* nil)
	  (*print-right-margin* nil))
      (write obj)))

-- 
Manuel Giraud
From: Ralph Richard Cook
Subject: Re: Asking for Alioth shootout help on "fasta" benchmark
Date: 
Message-ID: <42be6f6d.1567744@newsgroups.bellsouth.net>
I don't think I/O is my problem. In the stripped-down "help" program I
posted, I took out all the I/O and the times on one of the
"select-random" loops are taking longer than the entire run of the
Java version, that does I/O. 

Also, there is part of the benchmark that doesn't do the
floating-point compares and looping through the structure. It just
cycles through a string and prints out the characters. In the full
fasta program, this part takes only about a second.

Manuel Giraud <·············@gmail.com> wrote:

>
>I've translated a version of fasta to CL. It is far less elaborate than
>yours, but I think the problem with this benchmark is that it's a stream
>I/O benchmark (and AFAIK, CMUCL is not so good at that). For instance,
>to print output switching from 'format' to 'princ' to 'myprint'� seems
>to improve things a lot.
>
From: Nicolas Neuss
Subject: Re: Asking for Alioth shootout help on "fasta" benchmark
Date: 
Message-ID: <871x6nfvtt.fsf@ortler.iwr.uni-heidelberg.de>
·············@bellsouth.net (Ralph Richard Cook) writes:

> -- "help.lisp" below ---------------------
> (defconstant IM     139968)
> (defconstant IA       3877)
> (defconstant IC     29573)
>
> (defparameter THE_LAST 42)
>
> (declaim (inline gen_random))
> (defun gen_random (max)
>   (declare (type (signed-byte 32) IM IA IC THE_LAST))
>   (declare (double-float max))
>   (setq THE_LAST (mod (+ (* THE_LAST IA) IC) IM))
>   (/ (* max THE_LAST) IM))

Looks as if you would fast multiplication of 32-bit to 64-bit numbers and
the modulus of 64-bit by 32-bit numbers here.  Maybe there is some
implementation-specific approach for CMUCL.  There are also several fast
random generators in CLOCC where you might look for inspiration.

Nicolas.
From: Nicolas Neuss
Subject: Re: Asking for Alioth shootout help on "fasta" benchmark
Date: 
Message-ID: <87hdfhmp84.fsf@ortler.iwr.uni-heidelberg.de>
Nicolas Neuss <·······@iwr.uni-heidelberg.de> writes:

> ·············@bellsouth.net (Ralph Richard Cook) writes:
>
>> (declaim (inline gen_random))
>> (defun gen_random (max)
>>   (declare (type (signed-byte 32) IM IA IC THE_LAST))
>>   (declare (double-float max))
>>   (setq THE_LAST (mod (+ (* THE_LAST IA) IC) IM))
>>   (/ (* max THE_LAST) IM))
>
> Looks as if you would fast multiplication of 32-bit to 64-bit numbers and
> the modulus of 64-bit by 32-bit numbers here.  Maybe there is some
> implementation-specific approach for CMUCL.  There are also several fast
> random generators in CLOCC where you might look for inspiration.

Sorry, I was wrong, IM is small enough.  Try declaring the size of the
product more precisely (it is smaller than IM*IA), e.g.:

(defun gen_random (max)
  (declare (type (signed-byte 32) IM IA IC THE_LAST))
  (declare (double-float max))
  (setq THE_LAST (mod (+ (the fixnum (* THE_LAST IA)) IC) IM))
  (/ (* max THE_LAST) IM))

Maybe it also helps to use simple arrays instead of adjustable ones.

Yours, Nicolas.
        
From: Nicolas Neuss
Subject: Re: Asking for Alioth shootout help on "fasta" benchmark
Date: 
Message-ID: <87d5q5mon8.fsf@ortler.iwr.uni-heidelberg.de>
Nicolas Neuss <·······@iwr.uni-heidelberg.de> writes:

>   (setq THE_LAST (mod (+ (the fixnum (* THE_LAST IA)) IC) IM))

Sorry, fixnum is too small.  try (unsigned-byte 31).

> Maybe it also helps to use simple arrays instead of adjustable ones.

I reread your message now.  Since you said that GEN_RANDOM is not the
problem, maybe you should try this suggestion.

Nicolas
From: Nicolas Neuss
Subject: Re: Asking for Alioth shootout help on "fasta" benchmark
Date: 
Message-ID: <878y0tmnks.fsf@ortler.iwr.uni-heidelberg.de>
OK, here is a version which is about 3 times faster than the original on my
computer.

Nicolas.

---------------------------------------------------------------------------

(defconstant IM     139968)
(defconstant IA       3877)
(defconstant IC     29573)

(defparameter THE_LAST 42)

(declaim (inline gen_random select-random-1 select-random-2))

(defun gen_random (max)
  (declare (type (unsigned-byte 30) IM IA IC THE_LAST))
  (declare (double-float max))
  (setq THE_LAST (mod (+ (the (unsigned-byte 31) (* THE_LAST IA)) IC) IM))
  (/ (* max THE_LAST) IM))

(defstruct freq
  (c #\x :type base-char)
  (p 0.0d0 :type double-float))

(defparameter HomoSapiens
  (vector
   (make-freq :c #\a :p 0.3029549426680d0)
   (make-freq :c #\c :p 0.1979883004921d0)
   (make-freq :c #\g :p 0.1975473066391d0)
   (make-freq :c #\t :p 0.3015094502008d0)))

(defun make-cumulative (freqs)
  (let ((cp 0.0d0))
    (declare (double-float cp))
      (dotimes (i (length freqs))
        (incf cp (freq-p (aref freqs i)))
        (setf (freq-p (aref freqs i)) cp))))

(defun select-random-1 (freqs len)
  (declare (fixnum len) (simple-vector freqs))
  (let ((r (gen_random 1.0d0)))
       (declare (double-float r))
       (dotimes (i len)
         (when (< r (freq-p (aref freqs i)))
           (return-from select-random-1 (freq-c (aref freqs i)))))
       (freq-c (aref freqs (1- len)))))

(defun select-random-2 (freqs len)
  (declare (fixnum len) (simple-vector freqs))
  (let ((r (gen_random 1.0d0)))
       (declare (double-float r))
       (do*
         ((lo 0)
          (hi (1- len))
          (i (floor (+ hi lo) 2) (floor (+ hi lo) 2)))
         ((<= hi (1+ lo)) (freq-c (aref freqs hi)))
         (declare (fixnum lo hi i))
         (if (< r (freq-p (aref freqs i))) (setf hi i) (setf lo i)))))

(defun main ()
  (make-cumulative HomoSapiens)
  (time (dotimes (i 7500000) (gen_random 1.0d0)))
  (let ((n (length HomoSapiens)))
    (time (dotimes (i 7500000) (select-random-1 HomoSapiens n)))
    (time (dotimes (i 7500000) (select-random-2 HomoSapiens n)))))
From: Ralph Richard Cook
Subject: Re: Asking for Alioth shootout help on "fasta" benchmark
Date: 
Message-ID: <42c8a59a.22017609@newsgroups.bellsouth.net>
Nicolas Neuss <·······@iwr.uni-heidelberg.de> wrote:

>OK, here is a version which is about 3 times faster than the original on my
>computer.
>
>Nicolas.
>
First of all, sorry for the delay, I've been away from Usenet for a
few days.

Second of all, HOLY CRAP! With just a change of a couple of
optimizations, I can't believe
1) How much faster your version is, and
2) The fact that my version conses 5 MILLION bytes and yours conses
ZERO bytes. 

Here are the comparison timings. This was done on a 1 Ghz mac.

Here's the timings on my original version:

; Evaluation took:
;   1.22 seconds of real time
;   1.141605 seconds of user run time
;   0.021987 seconds of system run time
;   1,281,484,830 CPU cycles
;   0 page faults and
;   5,127,264 bytes consed.
; 

; Evaluation took:
;   6.51 seconds of real time
;   6.303516 seconds of user run time
;   0.041191 seconds of system run time
;   6,844,236,696 CPU cycles
;   0 page faults and
;   5,129,280 bytes consed.
; 

; Evaluation took:
;   9.61 seconds of real time
;   8.175848 seconds of user run time
;   0.071087 seconds of system run time
;   10,096,467,228 CPU cycles
;   [Run times include 0.04 seconds GC run time]
;   0 page faults and
;   5,128,656 bytes consed.
; 

Here's with Nicolas' optimizations:

; Evaluation took:
;   0.14 seconds of real time
;   0.133887 seconds of user run time
;   3.87e-4 seconds of system run time
;   143,469,684 CPU cycles
;   0 page faults and
;   0 bytes consed.
; 

; Evaluation took:
;   0.93 seconds of real time
;   0.853737 seconds of user run time
;   0.002775 seconds of system run time
;   973,447,368 CPU cycles
;   0 page faults and
;   0 bytes consed.
; 

; Evaluation took:
;   4.38 seconds of real time
;   4.216071 seconds of user run time
;   0.015618 seconds of system run time
;   4,601,837,088 CPU cycles
;   0 page faults and
;   0 bytes consed.
; 

This is now on par with the Java version, and it's what I'll post,
with a credit.

Thanks.
From: Nicolas Neuss
Subject: Re: Asking for Alioth shootout help on "fasta" benchmark
Date: 
Message-ID: <87fyuvvwl6.fsf@ortler.iwr.uni-heidelberg.de>
·············@bellsouth.net (Ralph Richard Cook) writes:

> Nicolas Neuss <·······@iwr.uni-heidelberg.de> wrote:
>
>>OK, here is a version which is about 3 times faster than the original on my
>>computer.
>>
>>Nicolas.
>>
> First of all, sorry for the delay, I've been away from Usenet for a
> few days.
>
> Second of all, HOLY CRAP! With just a change of a couple of
> optimizations, I can't believe
> 1) How much faster your version is, and
> 2) The fact that my version conses 5 MILLION bytes and yours conses
> ZERO bytes.

> Here are the comparison timings. This was done on a 1 Ghz mac.
> ...
> This is now on par with the Java version, and it's what I'll post,
> with a credit.
>
> Thanks.

Very nice, thank you.  Here are two further structural optimizations:

a) Think about using names like *THE-LAST* or *homo-sapiens* (with stars)
for special variables.  One could also think about avoiding the special
variable *THE-LAST* altogether using

(let ((THE_LAST 42))
  (declare (type (unsigned-byte 30) THE_LAST))
  (defun gen_random (max)
    ...
    ))

b) The declaration of cp in make-cumulative does only help for omitting
warnings.  For the sake of brevity it could be dropped.

You could post the final version here another time for people to have a
look.

Yours, Nicolas.