From: Sandeep Koranne
Subject: Performance issues with CMUCL vs ACL or Lispworks
Date: 
Message-ID: <3CDDC20D.5060706@spam.adelphia.net>
Hi All,
I have developed a small application for test scheduling for ICs
and am running it on a Linux box with a Pentium III 500 MHz, 128 MB RAM
system. The Linux distribution is Red Hat 7.3.

My question is: when I run it within LispWorks Personal (4.2), the
application takes 365 seconds. With ACL 5.0 Trial it takes 305 seconds,
but with the latest CMUCL 18d release it takes 695 seconds.

I have tried to set the optimize flags as:
(proclaim '(optimize (speed 3) (safety 0) (debug 0)))

Is this difference common (and to be expected)?

TIA
Sandeep Koranne

From: Pierre R. Mai
Subject: Re: Performance issues with CMUCL vs ACL or Lispworks
Date: 
Message-ID: <87bsbl8jla.fsf@orion.bln.pmsf.de>
Sandeep Koranne <········@spam.adelphia.net> writes:

> Hi All,
> I have developed a small application for test scheduling for ICs
> and am running it on a Linux box with Pentium III 500 MHz 128 MB RAM
> system. The linux distribution is RedHat 7.3.
> 
> 
> My question is: when I run it within LispWorks Personal (4.2), the application
> takes 365 seconds. With ACL 5.0 Trial it takes 305 seconds, but with
> CMUCL 18d latest release it takes 695 seconds.
> 
> 
> I have tried to set the optimize flags as:
> (proclaim '(optimize (speed 3) (safety 0) (debug 0)))

Aside:

You should never _globally_ proclaim safety to 0.  At safety = 0, your
implementation is given leeway to remove any and all safety checks, so
that any incorrect code can silently produce wrong results, corrupt
the lisp image, delete your hard-disk, etc.

A safety level of 0 should only ever be declared locally, in a small,
completely tested, performance-critical piece of code that always
receives correct input or explicitly checks its input.

In all other cases, a safety level of at least 1 should be specified.
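
A safe pattern is to keep a moderately safe global default and drop
safety only inside the hot function; for instance (sum-squares here is
just a made-up example, not from the posted code):

```lisp
;; Reasonable global default: full speed, but keep safety 1 so
;; type errors still signal instead of corrupting the image.
(proclaim '(optimize (speed 3) (safety 1) (debug 0)))

;; Hypothetical, well-tested hot spot: safety 0 is declared locally,
;; and only because the input type is declared and checked upstream.
(defun sum-squares (v)
  (declare (type (simple-array double-float (*)) v)
           (optimize (speed 3) (safety 0)))
  (let ((sum 0d0))
    (declare (double-float sum))
    (dotimes (i (length v) sum)
      (incf sum (* (aref v i) (aref v i))))))
```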

> Is this difference common (and to be expected)?

Differences in performance of this magnitude (and sometimes even
greater magnitude) can occur between conforming ANSI CL
implementations, and they often occur in both directions, since
implementations place different emphasis on various areas of the
language.

That said, such differences in performance can be influenced by the
way the code has been optimized, etc.  Different implementations
require different declarations or other hints to achieve their optimal
level of performance.  So it can happen that code that performs
optimally on LispWorks performs suboptimally on CMUCL or ACL.  It is
usually possible to change or add a few declarations and achieve
better results across the board, or for certain implementations.

Seeing that there is a factor of two between CMUCL and ACL/LWL, it
would be interesting to hear what kind of application this is, and
what areas of CL it exercises, in order to study ways of improving
your application's performance, and/or improving certain areas of
CMUCL.  So if you are at liberty to offer more in the way of
information, or even code snippets that turn out to be bottlenecks,
that would be good.  If you are not at liberty to do this in a public
forum, such as c.l.l, private communications might also be an option.

Regs, Pierre.

-- 
Pierre R. Mai <····@acm.org>                    http://www.pmsf.de/pmai/
 The most likely way for the world to be destroyed, most experts agree,
 is by accident. That's where we come in; we're computer professionals.
 We cause accidents.                           -- Nathaniel Borenstein
From: Sandeep Koranne
Subject: Re: Performance issues with CMUCL vs ACL or Lispworks
Date: 
Message-ID: <3CE0A6B3.4000302@spam.adelphia.net>
Pierre R. Mai wrote:
> Sandeep Koranne <········@spam.adelphia.net> writes:
> 
> 
>>Hi All,
>>I have developed a small application for test scheduling for ICs
>>and am running it on a Linux box with Pentium III 500 MHz 128 MB RAM
>>system. The linux distribution is RedHat 7.3.
>>
>>
>>My question is: when I run it within LispWorks Personal (4.2), the application
>>takes 365 seconds. With ACL 5.0 Trial it takes 305 seconds, but with
>>CMUCL 18d latest release it takes 695 seconds.
>>
>>
>>I have tried to set the optimize flags as:
>>(proclaim '(optimize (speed 3) (safety 0) (debug 0)))
> 
> 
> Aside:
> 
> You should never _globally_ proclaim safety to 0.  At safety = 0, your
> implementation is given leeway to remove any and all safety checks, so
> that any incorrect code can silently produce wrong results, corrupt
> the lisp image, delete your hard-disk, etc.
> 
> A safety level of 0 should only ever be declared locally, in a small,
> completely tested, performance-critical piece of code that always
> receives correct input or explicitly checks its input.
> 
> In all other cases, at least a safety-level of 1 should be specified.
> 
> 
>>Is this difference common (and to be expected)?
> 
> 
> Differences in performance of this magnitude (and sometimes even
> greater magnitude) can occur between conforming ANSI CL
> implementations, and they often occur in both directions, since
> implementations place different emphasis on various areas of the
> language.
> 
> That said, such differences in performance can be influenced by the
> way the code has been optimized, etc.  Different implementations
> require different declarations or other hints to achieve their optimal
> level of performance.  So it can happen that code that performs
> optimally on LispWorks performs suboptimally on CMUCL or ACL.  It is
> usually possible to change or add a few declarations and achieve
> better results across the board, or for certain implementations.
> 
> Seeing that there is a factor of two between CMUCL and ACL/LWL, it
> would be interesting to hear what kind of application this is, and
> what areas of CL it exercises, in order to study ways of improving
> your application's performance, and/or improving certain areas of
> CMUCL.  So if you are at liberty to offer more in the way of
> information, or even code snippets that turn out to be bottlenecks,
> that would be good.  If you are not at liberty to do this in a public
> forum, such as c.l.l, private communications might also be an option.
> 
> Regs, Pierre.
> 

Point taken about the global proclaim declaration.
The application is a scheduler:
given n jobs and m machines, where machine i can process job j in
time t_ij and each job j has a work w_j associated with it, calculate
the optimal assignment of jobs to machines so that all jobs are
completed in the smallest possible time.

I wrote this as a single-source unsplittable flow problem, and wrote a
graph data structure which I am also using in other programs involving
graph theory.

One graph looks like:
(setq gl '(:graph-name sandy :num-vertices 3 :num-edges 3 :vertex-vec
#1A(1 2 3) :edge-vec #1A((1 2 10) (1 3 20) (2 3 5))))
where each vertex can be interpreted as a weighted vertex (or label),
and each edge, e.g. (1 2 10), goes from vertex 1 to vertex 2 with
weight 10.
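
Since the graph is a property list, the fields above can be read back
with plain getf calls, e.g.:

```lisp
;; Reading fields of the example graph gl shown above:
(getf gl :num-vertices)      ; => 3
(aref (getf gl :edge-vec) 0) ; => (1 2 10)
```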

It's a simple representation, but I have macros like (for-all-edges
(e G) (form)) which iterate over all edges.
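
Against the plist layout above, such a macro might be sketched like
this (a guess at the shape, not the actual graph-ds code):

```lisp
(defmacro for-all-edges ((e g) &body body)
  ;; Sketch only: iterate E over each non-nil slot of the :edge-vec
  ;; of plist graph G, assuming the representation shown above.
  (let ((ev (gensym "EDGE-VEC"))
        (i  (gensym "I")))
    `(let ((,ev (getf ,g :edge-vec)))
       (dotimes (,i (length ,ev))
         (let ((,e (aref ,ev ,i)))
           (when ,e ,@body))))))

;; e.g. (for-all-edges (e gl) (format t "~%edge ~S" e))
```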

Such graphs are created multiple times in one iteration and deleted
after computing some property, like the most heavily weighted edge
between vertices 1 and 3.

After profiling, most of the time is spent in the function which
creates these graphs:

(defun make-ufp-graph (soc spl-list t-pi &optional (print-flag nil) (soc-name 'benchmark))
  "Make a graph from the splittable flow list."
  (let* ((g-name soc-name)
         (num-tam (length t-pi))
         (num-cores (length soc))
         (nv (+ 1 num-tam num-cores))
         (ne (+ num-tam (* num-tam num-cores))) ; we are not considering the conflict graph yet
         (va (make-array nv :initial-element nil))
         (ea (make-array ne :initial-element nil))
         (tam-time (second spl-list))
         (answer-list nil)
         (new-schedule 0)
         (min-schedule most-positive-fixnum)
         (g (list :graph-name g-name
                  :num-vertices nv
                  :num-edges ne
                  :vertex-vec va
                  :edge-vec ea))
         (count 0)
         (push-list-local nil)
         (v nil)
         (wtam 0))
    (declare (fixnum num-tam num-cores nv ne new-schedule min-schedule))
    #+nil(format *debug-io* "~%ne = ~D ea = ~S" ne ea)
    ;; now we have made all the vertices
    (dotimes (i nv)
      (setf (aref (graph-ds:vertex-list g) i) (list i nil)))

    ;; for all the cores we must make the demands.
    ;; the core vertices start from num-tam, as the source is 0.
    ;; the demand is the number of bits that must be transported in the
    ;; best case; this is the work done on a 1-bit TAM.
    (dotimes (i num-cores)
      (setf (aref (graph-ds:vertex-list g) (+ 1 num-tam i))
            (list (+ 1 num-tam i) (evaluate-core-tam-width (nth i soc) 1))))

    ;; add an edge between the source vertex and each TAM vertex
    (dotimes (i num-tam)
      #+nil(format *debug-io* "~%Adding an edge between ~D and ~D of width ~D"
                   0 (1+ i) (nth i t-pi))
      (setf (aref (graph-ds:edge-list g) i) (list 0 (1+ i) (nth i t-pi))))

    ;; add edges of infinite capacity between the TAMs and the demands.
    ;; the third of spl-list contains an array.
    #+nil(format *debug-io* "~%Third spl-list = ~S" spl-list)
    (dotimes (i num-tam)
      (dolist (val (aref (third spl-list) i))
        (setf (aref (graph-ds:edge-list g) (+ count num-tam))
              (list (1+ i) (+ 1 num-tam val) 'infty))
        (incf count)))

    #+nil(graph-ds:forall-edges x g
           (when x (format t "~% I am edge ~S" x)))
    #+nil(graph-ds:forall-vertices v g
           (format t "~% I am vertex ~S of in-degree ~D" v (graph-ds:in-degree g v)))

    (setq count 0)
    #+nil(format t "~% Preprocessing the graph.........")
    #+ignore(format t "~% Initial TAM times = ~A" tam-time)
    (graph-ds:forall-vertices v g
      (when (and (> (first v) num-tam)
                 (<= (graph-ds:in-degree g v) 1))
        #+nil(format t "~%Moving the demand of core ~A to source"
                     (core-info-name (nth (- (first v) 1 num-tam) soc)))
        (setq wtam (which-tam (- (first v) 1 num-tam) (third spl-list)))
        (assert wtam)
        (push (list (- (first v) 1 num-tam) wtam) answer-list)
        #+nil(format t "~% Reducing capacity on TAM ~D from ~D by ~D"
                     wtam
                     (aref tam-time wtam)
                     (evaluate-core-tam-width (nth (- (first v) 1 num-tam) soc)
                                              (nth wtam t-pi)))
        #+nil(format *debug-io* "Going to reduce on ~D of ~S by ~D" wtam t-pi
                     (evaluate-core-tam-width (nth (- (first v) 1 num-tam) soc)
                                              (nth wtam t-pi)))
        (decf (aref tam-time wtam)
              (evaluate-core-tam-width (nth (- (first v) 1 num-tam) soc)
                                       (nth wtam t-pi)))))
    #+nil(format t "~% Current TAM times = ~A" tam-time)

    #+ignore(print-splittable-cores soc g t-pi spl-list)
    ;; if the vertex label is > num-tam+1 then it is a demand vertex,
    ;; and if its in-degree is > 1 then it has a splittable flow.
    ;; perhaps we can write a separate function that can pre-process it.

    ;; we have to assign these cores to one of the TAMs, calculate epsilons.
    ;; first we try a greedy heuristic.
    (graph-ds:forall-vertices v g
      (when (and (> (first v) num-tam)
                 (> (graph-ds:in-degree g v) 1)) ; this is a splittable vertex
        ;; maintain a list of decisions and at the end take the one which is best.
        (setq min-schedule most-positive-fixnum)
        (setq push-list-local nil)
        (dolist (val (which-tam-list (core-index v num-tam) (third spl-list)))
          (let ((local-ans-list answer-list))
            (push (list (core-index v num-tam) val) local-ans-list)
            #+nil(format t "~% which-tam-list = ~S, local-ans-list = ~S"
                         (which-tam-list (core-index v num-tam) (third spl-list))
                         local-ans-list)
            (setq new-schedule (evaluate-schedule-makespan soc local-ans-list t-pi))
            (when (< new-schedule min-schedule)
              (setq min-schedule new-schedule)
              (setq push-list-local (list (core-index v num-tam) val)))
            #+ignore(format t "~%Trying schedule ~A with makespan ~D. PLL = ~D"
                            local-ans-list new-schedule push-list-local)))
        #+ignore(format t "~%Pushing ~S into answer " push-list-local)
        (push push-list-local answer-list)))
    ;; was ignored below
    (when print-flag
      (format t "~%pflag = ~S Answer list = ~S" print-flag answer-list)
      (setf global-answer-list answer-list))
    (evaluate-schedule-makespan soc answer-list t-pi)))
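
One implementation-independent thing I suspect: the repeated
(nth i soc) and (nth wtam t-pi) calls inside loops. nth walks the list
from its head, so a loop over n cores costs O(n^2) in total. Coercing
each list to a simple vector once at the top of the function makes
every access O(1); roughly (sum-list-slow/fast are made-up
illustrations, not the original code):

```lisp
;; Illustration of the NTH-in-a-loop cost, not the original function:
;; NTH is O(i) per access, so the whole loop below is O(n^2).
(defun sum-list-slow (items)
  (let ((sum 0))
    (dotimes (i (length items) sum)
      (incf sum (nth i items)))))

;; Coercing to a simple vector once is O(n); each SVREF is then O(1),
;; so the loop is O(n) overall.
(defun sum-list-fast (items)
  (let ((v (coerce items 'simple-vector))
        (sum 0))
    (dotimes (i (length v) sum)
      (incf sum (svref v i)))))
```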

There are many details missing, like what an soc is (it's a list of
cores, which are structures containing the job info).

I am pitching this implementation against a C implementation on a
common benchmark set at an international conference, and I would
really love to beat, or at least equal, its performance.

Any common sources of inefficiency you can point out would be very
much appreciated.

TIA,
Sandeep.
mailto: ········@adelphia.net