Re: Optimize a function for speed

From: Waldek Hebisch
Subject: Re: Optimize a function for speed
Date: Fri, 27 Feb 2009 03:54:21 +0000
Message-ID: <go7o5d$iqi$1@z-news.wcss.wroc.pl>

Francogrex <······@grex.org> wrote:
> Hello, I have this run function but it's going really slow and I would
> like it to go much faster. I thought of hash-tables but how can we fit
> array structures into hash tables? It would be a learning experience
> for me if someone can give me a few hints on how to optimize this for
> speed. I am using SBCL. Thanks
> 

On my machine (using sbcl-1.0.16) your original version takes 42.842s

1) As noted in Dimiter post access to nth element of the mlist is
   takinig a lot of time.  Using array instead of mlist reduced time to
   1.615s
2) Using (proclaim '(optimize (speed 3) (safety 0))) to generate fast
   but unsafe code reduced time to 1.105s
3) You have a big array where you store integes and floats -- such
   array has siginificant cost.  Since apparently you do not need
   the array later the simplest thing is to eliminate array completly
   and just use a bunch of variables to keep current result. This
   reduced time to 0.384s
4) You use global variables and use almost no declarations.  Using let
   to introduce local variables and adding a few declarations reduced
   time to 0.03s -- about 1300 times faster that original version.

Of course, if you need your mtable later you will have to keep it,
but al least you should you specialized integer array for indices
and separate float array for values.  Also, indices are easily
computable -- recomputiong indices may be cheaper then storing them
up in the array.

If you use 32-bit machine and need big n you should replace
"the fixnum" in the progrma below by "the (signed byte 64)"
(but on 64-bit machine the second version gave me slower code).
Also, if you need better accuracy replace 'single-float' by
'double-float' and adjust constants (use 1.0d0 instead of 1.0, etc).

Anyway, my version below:

;; The while macro
(defmacro while (condition &rest body)
  (let ((var (gensym)))
    `(do ((,var nil (progn ,@body)))
         ((null ,condition) ,var))))

(defvar mlist (loop repeat 4000 collect (random 1.0))) ;;I need to use
;;; even 10000 repeats or more

;;The function to optimize
(defun run (mlist)
  (let (
     (n (1- (length mlist)))
     (amlist (coerce (cdr mlist) '(simple-array single-float (*))))
     (maxi 0)
     (maxj 0)
     (curPm 0.0)
     (Pm 0.0)
     (i 0))
  (while (<= i (1- n))
    (let (
      ( sP 0.0)
      (j (1+ i)))
      (declare (fixnum i j n))
      (declare (single-float sP Pm))
    (while (< j n)
      (setf sP (+ sP (aref amlist (1- j))))
      (setf curPm (if (< sP 1) (/ 1 (* (float (the fixnum (* (- j i)
 (+ (- n j) (1+ i))))) sP)) 1.0))
      (when (> curPm Pm) (setf
          Pm curPm
          maxi i
          maxj j))
      (incf j))
    (incf i)))
  (setf mtable2 (list maxi maxj Pm)))
  )

(time (run mlist))

-- 
                              Waldek Hebisch
·······@math.uni.wroc.pl

Re: Optimize a function for speed Dimiter "malkia" Stanev
- Re: Optimize a function for speed Raffael Cavallaro
- Re: Optimize a function for speed Alex Mizrahi
- Re: Optimize a function for speed William James
  - Re: Optimize a function for speed Marco Antoniotti

From: Dimiter "malkia" Stanev
Subject: Re: Optimize a function for speed
Date: Fri, 27 Feb 2009 09:27:18 +0000
Message-ID: <dc11482c-487e-49e4-af17-55032068a01e@v35g2000pro.googlegroups.com>

FYI: A little more optimized version, based on Waldeks:

(defmacro while (condition &rest body)
  (let ((var (gensym)))
    `(do ((,var nil (progn ,@body)))
         ((null ,condition) ,var))))

(defvar mlist (loop repeat 4000 collect (random 1.0e0))) ;;I need to
use
;;; even 10000 repeats or more
;;The function to optimize

(defun run (mlist)
  (declare (optimize (speed 3) (space 0) (debug 0) (compilation-speed
0) (safety 0)
                     #+lispworks (fixnum-safety 0)
                     #+lispworks (float 0)))
  (let ((n (1- (length mlist)))
        (amlist (coerce (cdr mlist) '(simple-array single-float
(*))))
        (maxi 0)
        (maxj 0)
        (curPm 0.0e0)
        (Pm 0.0e0)
        (i 0))
    (declare (fixnum i n maxi maxj))
    (declare (single-float curPm Pm))
    (declare ((simple-array single-float (*)) amlist))
    (while (< i n)
           (let ((sP 0.0e0)
                 (j (+ i 1)))
             (declare (fixnum j))
             (declare (single-float sP))
             (while (< j n)
                    (setf sP (+ sP (aref amlist (- j 1))))
                    (setf curPm
                          (if (< sP 1.0e0)
                              (/ 1.0e0
                                 (* (coerce (* (- j i) (+ (- n j) (+ i
1))) 'single-float)
                                    sP))
                            1.0e0))
                    (when (> curPm Pm)
                      (setf Pm curPm
                            maxi i
                            maxj j))
                    (incf j))
             (incf i)))
    (values Pm maxi maxj)))

(time (run mlist))

; Under Lispworks 32-bit 5.1.2 (Mac OS X, PowerBook 2.6 Ghz Intel Core
2 Duo 4GB RAM 667 Mhz DDR2 SDRAM)
; User time    =        0.031
; System time  =        0.000
; Elapsed time =        0.031
; Allocation   = 16060 bytes
; 0 Page faults

; Note that this time includes the coercion to vector, and I was lazy
to verify how much time it actually took.
; Ideally your data would be prepared in most suitable formats
(specialized arrays)

; So from 48 seconds, went to 0.031s - that's close to 1500 speedup
from the original version

Here is btw, the disassembly from Lispworks, look at this beauty (Yes
I know it's not the most optimized code, and Intel C/C++ would beat
it's pants, but for such a high-level language as Lisp this is simply
amazing):

And the thing is, a peephole optimizer would need just the generate
the bytecodes to optimize it even more (for example jge L8, jumping
back to L2 and adding 4 to a register).

And other lisp compilers might generate even better code than this...

CL-USER 19 > (disassemble 'run)
200C415A:
       0:      55               push  ebp
       1:      89E5             move  ebp, esp
       3:      50               push  eax
       4:      50               push  eax
       5:      50               push  eax
       6:      50               push  eax
       7:      FF35F4330020     push  [200033F4]       ; SYSTEM:*
%DYNAMIC-ENVIRONMENT
      13:      FF3530D76921     push  [2169D730]       ; MLIST
      19:      681FD76921       push  2169D71F         ; MLIST
      24:      8925F4330020     move  [200033F4], esp  ; SYSTEM:*
%DYNAMIC-ENVIRONMENT
      30:      890530D76921     move  [2169D730], eax  ; MLIST
      36:      8B0530D76921     move  eax, [2169D730]  ; MLIST
      42:      B501             moveb ch, 1
      44:      FF152CE50320     call  [2003E52C]       ; LENGTH
      50:      89C6             move  esi, eax
      52:      83EE04           sub   esi, 4
      55:      8975FC           move  [ebp-4], esi
      58:      8B3D30D76921     move  edi, [2169D730]  ; MLIST
      64:      83FF56           cmp   edi, 56
      67:      7403             je    L1
      69:      8B7F07           move  edi, [edi+7]
L1:   72:      57               push  edi
      73:      B502             moveb ch, 2
      75:      B865CC0F20       move  eax, 200FCC65    ; (SIMPLE-ARRAY
SINGLE-FLOAT (*))
      80:      FF15B4761F20     call  [201F76B4]       ;
SYSTEM::COERCE-TO-VECTOR
      86:      8945F0           move  [ebp-10], eax
      89:      C745F800000000   move  [ebp-8], 0
      96:      C745F400000000   move  [ebp-C], 0
     103:      F30F102DB42E0F20 movss xmm5, [200F2EB4]  ; 0.0
     111:      33FF             xor   edi, edi

; BEGIN MAIN LOOP
L2:  113:      3B7DFC           cmp   edi, [ebp-4]
     116:      0F8D80000000     jge   L7
     122:      F30F1035B42E0F20 movss xmm6, [200F2EB4]  ; 0.0
     130:      8D5F04           lea   ebx, [edi+4]
L3:  133:      3B5DFC           cmp   ebx, [ebp-4]
     136:      0F8DB3000000     jge   L8
     142:      89DA             move  edx, ebx
     144:      83EA04           sub   edx, 4
     147:      8B75F0           move  esi, [ebp-10]
     150:      F30F10641605     movss xmm4, [esi+5+edx]
     156:      F30F58F4         addss xmm6, xmm4
     160:      F30F1025C42E0F20 movss xmm4, [200F2EC4]  ; 1.0
     168:      0F2FE6           comiss xmm4, xmm6
     171:      7643             jbe   L6
     173:      89D8             move  eax, ebx
     175:      2BC7             sub   eax, edi
     177:      8B55FC           move  edx, [ebp-4]
     180:      2BD3             sub   edx, ebx
     182:      8D4F04           lea   ecx, [edi+4]
     185:      03D1             add   edx, ecx
     187:      C1FA02           sar   edx, 2
     190:      0FAFD0           imul  edx, eax
     193:      C1FA02           sar   edx, 2
     196:      F30F2ADA         cvtsi2ss xmm3, edx
     200:      F30F59DE         mulss xmm3, xmm6
     204:      F30F1025C42E0F20 movss xmm4, [200F2EC4]  ; 1.0
     212:      F30F5EE3         divss xmm4, xmm3
L4:  216:      0F2FE5           comiss xmm4, xmm5
     219:      760E             jbe   L5
     221:      F20F10EC         movsd xmm5, xmm4
     225:      89FA             move  edx, edi
     227:      8955F8           move  [ebp-8], edx
     230:      89DA             move  edx, ebx
     232:      8955F4           move  [ebp-C], edx
L5:  235:      83C304           add   ebx, 4
     238:      EB95             jmp   L3
L6:  240:      F30F1025C42E0F20 movss xmm4, [200F2EC4]  ; 1.0
     248:      EBDE             jmp   L4
; END MAIN LOOP, BUT TAKE A LOOK ALSO AT Label L8

L7:  250:      83EC08           sub   esp, 8
     253:      C7042486080000   move  [esp], 886
     260:      F30F116C2404     movss [esp+4], xmm5
     266:      B501             moveb ch, 1
     268:      FF15CC0F1420     call  [20140FCC]       ; SYSTEM::RAW-
FAST-BOX-SINGLE
     274:      8B7DF8           move  edi, [ebp-8]
     277:      8B5DF4           move  ebx, [ebp-C]
     280:      893D64300020     move  [20003064], edi  ; SYSTEM:*
%MULTIPLE-VALUE-0
     286:      891D68300020     move  [20003068], ebx  ; T
     292:      FC               cld
     293:      C7056030002000030000 move  [20003060], 300  ; SYSTEM:*
%MULTIPLE-VALUE-COUNT
     303:      8B25F4330020     move  esp, [200033F4]  ; SYSTEM:*
%DYNAMIC-ENVIRONMENT
     309:      5E               pop   esi
     310:      8F4611           pop   [esi+11]
     313:      8F05F4330020     pop   [200033F4]       ; SYSTEM:*
%DYNAMIC-ENVIRONMENT
     319:      C9               leave
     320:      C3               ret
L8:  321:      83C704           add   edi, 4
     324:      E928FFFFFF       jmp   L2
     329:      90               nop
NIL

From: Raffael Cavallaro
Subject: Re: Optimize a function for speed
Date: Fri, 27 Feb 2009 18:19:05 +0000
Message-ID: <66f698a5-9f87-4c98-bc72-39ed2fe1b027@q11g2000yqh.googlegroups.com>

On Feb 27, 4:27 am, "Dimiter \"malkia\" Stanev" <······@gmail.com>
wrote:
> FYI: A little more optimized version, based on Waldeks:
>

And here it is using loop instead of while-incf pair:

(defun run (mlist)
  (declare (optimize (speed 3) (space 0) (debug 0) (compilation-speed
0) (safety 0)
                     #+lispworks (fixnum-safety 0)
                     #+lispworks (float 0)))
  (let ((n (1- (length mlist)))
        (amlist (coerce (cdr mlist) '(simple-array single-float
(*))))
        (maxi 0)
        (maxj 0)
        (curPm 0.0e0)
        (Pm 0.0e0))
    (declare (fixnum n maxi maxj))
    (declare (single-float curPm Pm))
    (declare ((simple-array single-float (*)) amlist))
    (loop for i fixnum below n do
      (let ((sP 0.0e0))
        (declare (single-float sP))
        (loop for j fixnum from (1+ i) below n do
          (setf sP (+ sP (aref amlist (- j 1))))
          (setf curPm
                (if (< sP 1.0e0)
                  (/ 1.0e0
                     (* (coerce (* (- j i) (+ (- n j) (+ i 1)))
'single-float)
                        sP))
                  1.0e0))
          (when (> curPm Pm)
            (setf Pm curPm
                  maxi i
                  maxj j)))))
    (values Pm maxi maxj)))

From: Alex Mizrahi
Subject: Re: Optimize a function for speed
Date: Fri, 27 Feb 2009 17:40:16 +0000
Message-ID: <49a8258a$0$90269$14726298@news.sunsite.dk>

 DmS> Here is btw, the disassembly from Lispworks,

why does it use XMM registers, simple FPU ain't cool anymore?

From: William James
Subject: Re: Optimize a function for speed
Date: Fri, 27 Feb 2009 20:15:48 +0000
Message-ID: <go9hlk01rbj@enews2.newsguy.com>

Dimiter "malkia" Stanev wrote:

> FYI: A little more optimized version, based on Waldeks:
> 
> (defmacro while (condition &rest body)
>   (let ((var (gensym)))
>     `(do ((,var nil (progn ,@body)))
>          ((null ,condition) ,var))))
> 
> (defvar mlist (loop repeat 4000 collect (random 1.0e0))) ;;I need to
> use
> ;;; even 10000 repeats or more
> ;;The function to optimize
> 
> (defun run (mlist)
>   (declare (optimize (speed 3) (space 0) (debug 0) (compilation-speed
> 0) (safety 0)
>                      #+lispworks (fixnum-safety 0)
>                      #+lispworks (float 0)))
>   (let ((n (1- (length mlist)))
>         (amlist (coerce (cdr mlist) '(simple-array single-float
> (*))))
>         (maxi 0)
>         (maxj 0)
>         (curPm 0.0e0)
>         (Pm 0.0e0)
>         (i 0))
>     (declare (fixnum i n maxi maxj))
>     (declare (single-float curPm Pm))
>     (declare ((simple-array single-float (*)) amlist))
>     (while (< i n)
>            (let ((sP 0.0e0)
>                  (j (+ i 1)))
>              (declare (fixnum j))
>              (declare (single-float sP))
>              (while (< j n)
>                     (setf sP (+ sP (aref amlist (- j 1))))
>                     (setf curPm
>                           (if (< sP 1.0e0)
>                               (/ 1.0e0
>                                  (* (coerce (* (- j i) (+ (- n j) (+ i
> 1))) 'single-float)
>                                     sP))
>                             1.0e0))
>                     (when (> curPm Pm)
>                       (setf Pm curPm
>                             maxi i
>                             maxj j))
>                     (incf j))
>              (incf i)))
>     (values Pm maxi maxj)))
> 
> (time (run mlist))
> 
> ; Under Lispworks 32-bit 5.1.2 (Mac OS X, PowerBook 2.6 Ghz Intel Core
> 2 Duo 4GB RAM 667 Mhz DDR2 SDRAM)
> ; User time    =        0.031

For comparison, here's a version that uses doubles instead of singles
and takes 0.082 seconds on a 2GHz laptop using the F# interpreter.

No type annotations are used.



#nowarn "62" ;;

let gen = new System.Random() ;;
let mlist = Array.init 4000 (fun _ -> gen.NextDouble() ) ;;

let run mlist =
  let mlist = Array.sub mlist 1 (pred (Array.length mlist))  in
  let n = Array.length mlist  and
      sp = ref 0.0  and
      pm = ref 0.0  and
      cur_pm = ref 0.0  and
      maxi = ref 0  and
      maxj = ref 0  in
  for i = 0 to pred n do
    sp := 0.0;
    for j = succ i  to  pred n  do
      sp := !sp + mlist.[pred j] ;
      cur_pm := 
        if  !sp >= 1.0  then  1.0  else
          1.0 / ( float_of_int ((j - i) * (n - j + i + 1))  * !sp) ;
      if  !cur_pm > !pm  then
        ( pm := !cur_pm ;
          maxi := i;
          maxj := j ) ;
    done ;
  done ;
  printf "%d  %d  %f\n" !maxi !maxj !pm ;;

let timer = System.Diagnostics.Stopwatch.StartNew () in
run mlist;
printf "\n%d milliseconds\n" timer.ElapsedMilliseconds;

From: Marco Antoniotti
Subject: Re: Optimize a function for speed
Date: Fri, 27 Feb 2009 20:53:08 +0000
Message-ID: <b8a2f83f-e598-4180-809e-78307e6f2cd6@b16g2000yqb.googlegroups.com>

On Feb 27, 9:15 pm, "William James" <> wrote:
> Dimiter "malkia" Stanev wrote:
> > FYI: A little more optimized version, based on Waldeks:
>
> > (defmacro while (condition &rest body)
> >   (let ((var (gensym)))
> >     `(do ((,var nil (progn ,@body)))
> >          ((null ,condition) ,var))))
>
> > (defvar mlist (loop repeat 4000 collect (random 1.0e0))) ;;I need to
> > use
> > ;;; even 10000 repeats or more
> > ;;The function to optimize
>
> > (defun run (mlist)
> >   (declare (optimize (speed 3) (space 0) (debug 0) (compilation-speed
> > 0) (safety 0)
> >                      #+lispworks (fixnum-safety 0)
> >                      #+lispworks (float 0)))
> >   (let ((n (1- (length mlist)))
> >         (amlist (coerce (cdr mlist) '(simple-array single-float
> > (*))))
> >         (maxi 0)
> >         (maxj 0)
> >         (curPm 0.0e0)
> >         (Pm 0.0e0)
> >         (i 0))
> >     (declare (fixnum i n maxi maxj))
> >     (declare (single-float curPm Pm))
> >     (declare ((simple-array single-float (*)) amlist))
> >     (while (< i n)
> >            (let ((sP 0.0e0)
> >                  (j (+ i 1)))
> >              (declare (fixnum j))
> >              (declare (single-float sP))
> >              (while (< j n)
> >                     (setf sP (+ sP (aref amlist (- j 1))))
> >                     (setf curPm
> >                           (if (< sP 1.0e0)
> >                               (/ 1.0e0
> >                                  (* (coerce (* (- j i) (+ (- n j) (+ i
> > 1))) 'single-float)
> >                                     sP))
> >                             1.0e0))
> >                     (when (> curPm Pm)
> >                       (setf Pm curPm
> >                             maxi i
> >                             maxj j))
> >                     (incf j))
> >              (incf i)))
> >     (values Pm maxi maxj)))
>
> > (time (run mlist))
>
> > ; Under Lispworks 32-bit 5.1.2 (Mac OS X, PowerBook 2.6 Ghz Intel Core
> > 2 Duo 4GB RAM 667 Mhz DDR2 SDRAM)
> > ; User time    =        0.031
>
> For comparison, here's a version that uses doubles instead of singles
> and takes 0.082 seconds on a 2GHz laptop using the F# interpreter.
>
> No type annotations are used.
>
> #nowarn "62" ;;
>
> let gen = new System.Random() ;;
> let mlist = Array.init 4000 (fun _ -> gen.NextDouble() ) ;;
>
> let run mlist =
>   let mlist = Array.sub mlist 1 (pred (Array.length mlist))  in
>   let n = Array.length mlist  and
>       sp = ref 0.0  and
>       pm = ref 0.0  and
>       cur_pm = ref 0.0  and
>       maxi = ref 0  and
>       maxj = ref 0  in
>   for i = 0 to pred n do
>     sp := 0.0;
>     for j = succ i  to  pred n  do
>       sp := !sp + mlist.[pred j] ;
>       cur_pm :=
>         if  !sp >= 1.0  then  1.0  else
>           1.0 / ( float_of_int ((j - i) * (n - j + i + 1))  * !sp) ;
>       if  !cur_pm > !pm  then
>         ( pm := !cur_pm ;
>           maxi := i;
>           maxj := j ) ;
>     done ;
>   done ;
>   printf "%d  %d  %f\n" !maxi !maxj !pm ;;
>
> let timer = System.Diagnostics.Stopwatch.StartNew () in
> run mlist;
> printf "\n%d milliseconds\n" timer.ElapsedMilliseconds;

Shouldn't you be working on something else?  There are two pending
code requests on you.  You can also try the last one in F# if you
decided to drop Ruby in the last few days. :)

Cheers
--
Marco