From: dpapathanasiou
Subject: How can "cons per call" be so different for these two very similar functions?
Date: 
Message-ID: <1156977853.357826.228400@e3g2000cwe.googlegroups.com>
I've written two functions to compare two lists of string tokens.

The first one finds out which elements are common to both:

  (defun atoms-in-common (atoms corpus)
    "Given atoms (a list of strings) and corpus (also a list of strings),
return a list of those atoms common to both."
    (remove-if #'null
               (mapcar #'(lambda (x)
                           (when (member x corpus :test #'string=) x))
                       atoms)))

And the second one finds out which elements of the first list are not
present in the second:

  (defun atoms-in-variance (atoms corpus)
    "Given atoms (a list of strings) and corpus (also a list of strings),
return a list of those atoms not found in corpus."
    (remove-if #'null
               (mapcar #'(lambda (x)
                           (when (not (member x corpus :test #'string=)) x))
                       atoms)))

Logically, both are fine.

But when I ran them under the "Metering System" utility, I got some
dramatic differences in "cons per call":
                            %      %                          Cons     Total     Total
Function                   Time   Cons    Calls  Sec/Call  Per Call    Time      Cons
-------------------------------------------------------------------------------------
ATOMS-IN-COMMON:          12.87   0.21      960  0.000406       116   0.390    111192
ATOMS-IN-VARIANCE:         9.57   0.16       12  0.024167      7131   0.290     85568

Though (atoms-in-common) and (atoms-in-variance) were not always
called together, when they were (12 times in the above run), the same
two lists were passed to both functions, in that sequence.

So why such a big difference -- 116 cons per call for (atoms-in-common)
versus 7,131 for (atoms-in-variance)?

Does that single (not) statement in (atoms-in-variance) make such an
impact?

Or can either function be re-written more efficiently?

From: Carl Taylor
Subject: Re: How can "cons per call" be so different for these two very similar functions?
Date: 
Message-ID: <wRpJg.11342$5i3.6613@bgtnsc04-news.ops.worldnet.att.net>
dpapathanasiou wrote:
> I've written two functions to compare two lists of string tokens.

[...]

> Or can either function be re-written more efficiently?


Why not try an experiment with Lisp's set operators in your particular 
implementation?

Carl Taylor


(defparameter *corpus*
  '("ab" "cd" "ef" "ghi" "jkl" "mn" "opq" "rs" "tuvw" "xyz"))
(defparameter *atoms*
  '("cd" "mn" "**" "?!"  "w4" "rs" "pw3"))

(defun atoms-in-common (atoms corpus)
  (set-exclusive-or corpus
                    (set-difference corpus
                                    atoms
                                    :test #'string=)
                    :test #'string=))

(compile *)

(defun atoms-in-variance (atoms corpus)
  (set-difference atoms corpus :test #'string=))

(compile *)

(atoms-in-common *atoms* *corpus*)
(atoms-in-variance *atoms* *corpus*) 
From: Pascal Bourguignon
Subject: Re: How can "cons per call" be so different for these two very similar functions?
Date: 
Message-ID: <87y7t5pyw6.fsf@informatimago.com>
"dpapathanasiou" <···················@gmail.com> writes:

> I've written two functions to compare two lists of string tokens.
>
> The first one finds out which elements are common to both:
>
>   (defun atoms-in-common (atoms corpus)
>     "Given atoms (a list of strings) and corpus (also a list of
> strings), return a list of those atoms
>   common to both."
>     (remove-if #'null (mapcar #'(lambda (x) (when (member x corpus
> :test #'string=) x)) atoms)))
>
> And the second one finds out which elements of the first list are not
> present in the second:
>
>   (defun atoms-in-variance (atoms corpus)
>     "Given atoms (a list of strings) and corpus (also a list of
> strings), return a list of those atoms
>   not found in corpus."
>     (remove-if #'null (mapcar #'(lambda (x) (when (not (member x corpus
> :test #'string=)) x)) atoms)))
>
> Logically, both are fine.
>
> But when I ran them under the "Metering System" utility, I got some
> dramatic differences in "cons per call":
>                             %      %                          Cons     Total     Total
> Function                   Time   Cons    Calls  Sec/Call  Per Call    Time      Cons
> -------------------------------------------------------------------------------------
> ATOMS-IN-COMMON:          12.87   0.21      960  0.000406       116   0.390    111192
> ATOMS-IN-VARIANCE:         9.57   0.16       12  0.024167      7131   0.290     85568
>
> Though both (atoms-in-common) and (atoms-in-variance) were not always
> called together, when they were (12 times in the above run), the same
> two lists were sent to both (atoms-in-common) and (atoms-in-variance),
> in that sequence.
>
> So why such a big difference -- 116 cons per call for (atoms-in-common)
> versus 7,131 for (atoms-in-variance)?
>
> Does that single (not) statement in (atoms-in-variance) make such an
> impact?

No.


> Or can either function be re-written more efficiently?

Yes.


1) The results you obtain depend on the data you feed in.

2) You can expect, a priori, that the sum of the cons per call
   for (atoms-in-common atoms corpus) and (atoms-in-variance atoms corpus)
   will be constant for a given (length atoms) and corpus.

3) Your implementations are O(length(atoms)*length(corpus)). This is bad.
   (Try with lists of 1, 5, 10, 50, 100, 500, 1000, 5000, 10000 atoms.)

Also note that most implementations use rather naive algorithms for
the set functions and for functions like DELETE-DUPLICATES.  So you had
better benchmark them or read their sources, and implement this kind of
function with a hash table or some other more efficient data structure
if you need the performance.


;; untested code follows:

(defun preprocess-corpus (corpus &key (test (function eql)))
  ;; For string atoms, call this with :test #'equal.
  (let ((table (make-hash-table :test test)))
    (dolist (item corpus table)
      (setf (gethash item table) t))))

(defun atoms-in-common (atoms preprocessed-corpus)
  (loop :for atom :in atoms
        :when (gethash atom preprocessed-corpus)
        :collect atom))

(defun atoms-in-variance (atoms preprocessed-corpus)
  (loop :for atom :in atoms
        :unless (gethash atom preprocessed-corpus)
        :collect atom))


(dolist (corpus collection-of-corpuses)
  (let ((preprocessed-corpus (preprocess-corpus corpus)))
    (dolist (atoms collection-of-atoms)
      (assert (= (length atoms)
                 (+ (length (atoms-in-common   atoms preprocessed-corpus))
                    (length (atoms-in-variance atoms preprocessed-corpus))))))))


-- 
__Pascal Bourguignon__
From: dpapathanasiou
Subject: Re: How can "cons per call" be so different for these two very similar functions?
Date: 
Message-ID: <1157026582.389774.204270@p79g2000cwp.googlegroups.com>
Hi Pascal,

I understand that both implementations are O( length()*length() ), but
what confuses me is why metering shows a huge difference in consing,
when the two functions are essentially the same.

As I said, I thought initially that the extra (not) statement in the
second function may be the culprit, but as you said, that should have
no bearing on it.

Is something wrong with the metering package, or with my use of it?  I
tried it several times, and in each case the cons per call ratios were
the same, i.e. about 110 cons per call for the first function, and over
7,100 for the second.

P.S. Thanks for the idea of pre-processing one of the lists: the first
of the two lists gets run through both functions over and over again
(i.e. we are checking sets of candidate lists versus a target list, so
preprocessing the target list would save us processing cycles).
From: Thomas A. Russ
Subject: Re: How can "cons per call" be so different for these two very similar functions?
Date: 
Message-ID: <ymiac5ksbfz.fsf@blackcat.isi.edu>
"dpapathanasiou" <···················@gmail.com> writes:

> Hi Pascal,
> 
> I understand that both implementations are O( length()*length() ), but
> what confuses me is why metering shows a huge difference in consing,
> when the two functions are essentially the same.

But they are not essentially the same.
The consing comes from the work that REMOVE has to do to create
new list structure with the null values no longer present.  The
amount of consing required to do this depends on the number of
NIL values that need to be removed and on the length of the tail
that remains unchanged.
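
To make the data-dependence concrete, here is a toy run (the data is
made up; the definitions are the ones from the first post): when most
atoms are present in the corpus, the -COMMON result is large and the
-VARIANCE result small, and the consing of each follows the size of
its own result.

```lisp
;; The two definitions from the first post, unchanged:
(defun atoms-in-common (atoms corpus)
  (remove-if #'null (mapcar #'(lambda (x)
                                (when (member x corpus :test #'string=) x))
                            atoms)))

(defun atoms-in-variance (atoms corpus)
  (remove-if #'null (mapcar #'(lambda (x)
                                (when (not (member x corpus :test #'string=)) x))
                            atoms)))

;; Made-up data: 90 of the 93 atoms ARE in the corpus.
(defparameter *corpus* (loop :for i :below 100 :collect (princ-to-string i)))
(defparameter *atoms*  (append (subseq *corpus* 0 90) (list "x" "y" "z")))

(length (atoms-in-common   *atoms* *corpus*))    ; => 90
(length (atoms-in-variance *atoms* *corpus*))    ; => 3
```

Swap the mix of hits and misses and the two result sizes (and cons
counts) swap with them.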

> As I said, I thought initially that the extra (not) statement in the
> second function may be the culprit, but as you said, that should have
> no bearing on it.

Correct.  Or rather, partly correct.  The effect of the NOT is to
influence how many matches exist; it does not affect consing in and
of itself.

> Is it something wrong with the metering package or my use of it?  I
> tried it several times, and in each case the cons per call ratios were
> the same, i.e. about 110 cons per call for the first function, and over
> 7,100 for the second.

No.  As Pascal said, it really depends on the particulars of the data
that you feed to the functions.

> P.S. Thanks for the idea of pre-processing one of the lists: the first
> of the two lists gets run through both functions over and over again
> (i.e. we are checking sets of candidate lists versus a target list, so
> preprocessing the target list would save us processing cycles).
> 

-- 
Thomas A. Russ,  USC/Information Sciences Institute
From: Pascal Bourguignon
Subject: Re: How can "cons per call" be so different for these two very similar functions?
Date: 
Message-ID: <87y7t4vfo4.fsf@thalassa.informatimago.com>
"dpapathanasiou" <···················@gmail.com> writes:

> Hi Pascal,
>
> I understand that both implementations are O( length()*length() ), but
> what confuses me is why metering shows a huge difference in consing,
> when the two functions are essentially the same.

Because one conses the elements of atoms that are in corpus, and the
other conses the elements of atoms that are not in corpus.

So if you have in atoms 1000 elements that are in corpus and 10 that
are not, obviously the first one will cons 100 times more than the
other.


> As I said, I thought initially that the extra (not) statement in the
> second function may be the culprit, but as you said, that should have
> no bearing on it.

Well, when you profile the consing of lisp functions you usually
discard the consing done on purpose, to find hidden inefficiencies.  In
that respect, a NOT won't cons at all (unless we're running with some
special debugging option or something like that).

The point is that for example, a mere numerical function, like:

(defun isquare (x) (cond ((minusp x) (isquare (- x)))
                         ((zerop x) 0)
                         (t (+ (isquare (1- x)) (* 2 (1- x)) 1))))

will cons, to allocate the bignum values generated.  You might want to
optimize such a function to avoid consing.

On the other hand, a function such as:

(defun explode (str) (loop :for ch :across str :collect ch))

will cons O(length(str)) cons cells, and this is not something you want
to avoid, since it's the purpose of the function.  But then, you
cannot complain that:

  (explode "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")

conses more than:

  (explode "y")



> Is it something wrong with the metering package or my use of it?  I
> tried it several times, and in each case the cons per call ratios were
> the same, i.e. about 110 cons per call for the first function, and over
> 7,100 for the second.

Check the documentation of this metering package.  The column is
titled "Cons per call", not "Cons cells per call".  Consing includes
both the consing of cons cells and the consing of other lisp objects.
This may be misleading, since a function that explicitly conses a
resulting list will increment this counter the same way as a function
that inadvertently conses discarded intermediate objects.


> P.S. Thanks for the idea of pre-processing one of the lists: the first
> of the two lists gets run through both functions over and over again
> (i.e. we are checking sets of candidate lists versus a target list, so
> preprocessing the target list would save us processing cycles).

Well, if you always use these two functions together, it would be more
efficient to generate both results at once:

(defun split-on-corpus (atoms preprocessed-corpus)
  (loop :for atom :in atoms
        :if (gethash atom preprocessed-corpus)
              :collect atom :into present
        :else :collect atom :into absent
        :finally (return (list present absent))))

(split-on-corpus '(a b c d e f) (preprocess-corpus '(a c e)))
--> ((A C E) (B D F))


-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

"You cannot really appreciate Dilbert unless you read it in the
original Klingon"
From: dpapathanasiou
Subject: Re: How can "cons per call" be so different for these two very similar functions?
Date: 
Message-ID: <1157116582.210288.143940@b28g2000cwb.googlegroups.com>
> Because one conses the elements of atoms that are in corpus, and the
> other conses the elements of atoms that are not in corpus.

Ah, that's right -- given enough differences in the data (and I'll go
back over the lists we used in the test runs), the in-variance function
could easily be much higher.

> Well, if you always use these two functions together, it would be more
> efficient to generate both results at once:

We're driven (generally) by the unix philosophy: i.e. a collection of
short, discrete, independent functions, each with simple but
well-defined actions, is preferable to building larger functions which
encapsulate or couple multiple behaviors together.

The case could be made either way for this one, though, and certainly
it would be better in terms of efficiency; thanks again for the
suggestion.
From: Stephen Compall
Subject: Re: How can "cons per call" be so different for these two very similar functions?
Date: 
Message-ID: <tr_Jg.18365$Df2.4209@fe05.news.easynews.com>
dpapathanasiou wrote:
>> Well, if you always use these two functions together, it would be more
>> efficient to generate both results at once:
> 
> We're driven (generally) by the unix philosophy: i.e. a collection of
> short, discrete, independent functions, each with simple but
> well-defined actions is preferable to building larger functions which
> encapsulate or couple multiple behaviors together.

SRFI-1
(http://srfi.schemers.org/srfi-1/srfi-1.html#FilteringPartitioning)
has a general version of Mr Bourguignon's suggestion called PARTITION;
it takes a partitioning predicate and answers PRESENT and ABSENT as
multiple values.  I think its presence in SRFI-1 is indication enough
that it is acceptable to use such a protocol in your code.

-- 
Stephen Compall
http://scompall.nocandysw.com/blog
From: dpapathanasiou
Subject: Re: How can "cons per call" be so different for these two very similar functions?
Date: 
Message-ID: <1157134501.900248.110540@m73g2000cwd.googlegroups.com>
Many thanks for the reference to SRFI-1; I did not know such a project
existed.
From: Pascal Bourguignon
Subject: Re: How can "cons per call" be so different for these two very similar functions?
Date: 
Message-ID: <87r6yvshnb.fsf@thalassa.informatimago.com>
"dpapathanasiou" <···················@gmail.com> writes:

>> Because one conses the elements of atoms that are in corpus, and the
>> other conses the elements of atoms that are not in corpus.
>
> Ah, that's right -- given enough differences in the data (and I'll go
> back over the lists we used in the test runs), the in-variance function
> could easily be much higher.
>
>> Well, if you always use these two functions together, it would be more
>> efficient to generate both results at once:
>
> We're driven (generally) by the unix philosophy: i.e. a collection of
> short, discrete, independent functions, each with simple but
> well-defined actions is preferable to building larger functions which
> encapsulate or couple multiple behaviors together.
>
> The case could be made either way for this one, though, and certainly
> it would be better in terms of efficiency; thanks again for the
> suggestion.

What is simpler?

  (defun atoms-in-common (atoms corpus)
    "Given atoms (a list of strings) and corpus (also a list of strings), 
return a list of those atoms common to both."
    (remove-if #'null (mapcar
                         #'(lambda (x) 
                              (when (member x corpus :test #'string=) x)) 
                          atoms)))

  (defun atoms-in-variance (atoms corpus)
    "Given atoms (a list of strings) and corpus (also a list of strings), 
return a list of those atoms not found in corpus."
    (remove-if #'null (mapcar
                         #'(lambda (x) 
                            (when (not (member x corpus :test #'string=)) x))
                          atoms)))

or:

(defun split-on-corpus (atoms preprocessed-corpus)
  (loop :for atom :in atoms
        :if (gethash atom preprocessed-corpus)
              :collect atom :into present
        :else :collect atom :into absent
        :finally (return (list present absent))))


(Note the names of the functions, too.)
And anyway, the unix philosophy works for programs, not for algorithms
and functions.

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

THIS IS A 100% MATTER PRODUCT: In the unlikely event that this
merchandise should contact antimatter in any form, a catastrophic
explosion will result.
From: dpapathanasiou
Subject: Re: How can "cons per call" be so different for these two very similar functions?
Date: 
Message-ID: <1157132908.594232.94280@74g2000cwt.googlegroups.com>
> (defun split-on-corpus (atoms preprocessed-corpus)
>   (loop :for atom :in atoms
>         :if (gethash atom preprocessed-corpus)
>               :collect atom :into present
>         :else :collect atom :into absent
>         :finally (return (list present absent))))

Thanks for posting this; despite reading Seibel's chapter on "loop",
I've been avoiding its use in favor of primitive iterators.

> And anyways, the unix philosophy works for programs, not for algorithms
> and functions.

Well, our approach has always been to get it working first and optimize
later; having libraries of discrete, well-defined functions does make
it possible to build (and try variations of those designs) quickly.
From: Rob Warnock
Subject: Re: How can "cons per call" be so different for these two very similar functions?
Date: 
Message-ID: <GsGdnW4beKGSqWfZnZ2dnUVZ_q6dnZ2d@speakeasy.net>
dpapathanasiou <···················@gmail.com> wrote:
+---------------
| > (defun split-on-corpus (atoms preprocessed-corpus)
| >   (loop :for atom :in atoms
| >         :if (gethash atom preprocessed-corpus)
| >               :collect atom :into present
| >         :else :collect atom :into absent
| >         :finally (return (list present absent))))
| 
| Thanks for posting this; despite reading Seibel's chapter on "loop",
| I've been avoiding its use in favor of primitive iterators.
+---------------

Pascal's point about using multiple results to avoid doing
the work twice was completely independent of using the LOOP
macro per se. One could do it just as easily with DOLIST, say:

    (defun split-on-corpus (atoms preprocessed-corpus)
      (let (present absent)
	(dolist (atom atoms (list (reverse present) (reverse absent)))
	  (if (gethash atom preprocessed-corpus)
	    (push atom present)
	    (push atom absent)))))

or even [somewhat less cleanly] with plain ol' DO:

    (defun split-on-corpus (atoms preprocessed-corpus)
      (do ((atoms atoms (cdr atoms))
           (atom (car atoms) (cadr atoms)) ; DO steps in parallel, so look ahead
           (present)
           (absent))
          ((null atoms) (list (reverse present) (reverse absent)))
        (if (gethash atom preprocessed-corpus)
            (push atom present)
            (push atom absent))))

The main advantage to the LOOP version is that it [probably]
uses an internal cached tail pointer to implement the COLLECTs,
so the two sub-results don't need to be REVERSE'd. One could
of course do that with DOLIST/DO as well, but learning to use
LOOP/COLLECT is far simpler!  ;-}
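
For the record, that cached-tail-pointer technique can also be spelled
out by hand (a sketch; the header-cons trick and the -TC name are mine,
not from any post above):

```lisp
(defun split-on-corpus-tc (atoms preprocessed-corpus)
  ;; Each result list gets a dummy header cons; the tail pointers
  ;; always reference the last cell, so each collect is a single
  ;; SETF of a cdr, and no final REVERSE is needed.
  (let* ((present (cons nil nil)) (p-tail present)
         (absent  (cons nil nil)) (a-tail absent))
    (dolist (atom atoms (list (cdr present) (cdr absent)))
      (if (gethash atom preprocessed-corpus)
          (setf (cdr p-tail) (cons atom nil)
                p-tail (cdr p-tail))
          (setf (cdr a-tail) (cons atom nil)
                a-tail (cdr a-tail))))))
```

It behaves like the LOOP version, just with the COLLECT machinery made
explicit.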


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: Stephen Compall
Subject: Re: How can "cons per call" be so different for these two very similar functions?
Date: 
Message-ID: <iMqKg.504580$Em2.441536@fe10.news.easynews.com>
Rob Warnock wrote:
> The main advantage to the LOOP version is that it [probably]
> uses an internal cached tail pointer to implement the COLLECTs,

ANSI seems to say that you can access a COLLECT ... INTO variable at any
point and get the list collected so far.  Therefore, unless you want
O(n^2) collecting, this is highly likely.

> so the two sub-results don't need to be REVERSE'd. One could
> of course do that with DOLIST/DO as well, but learning to use
> LOOP/COLLECT is far simpler!  ;-}

On some implementations, PUSH/NREVERSE can be faster than chasing tails.

-- 
Stephen Compall
http://scompall.nocandysw.com/blog
From: Rob Warnock
Subject: Re: How can "cons per call" be so different for these two very similar functions?
Date: 
Message-ID: <fsydnat4LKyV3mfZnZ2dnUVZ_tydnZ2d@speakeasy.net>
Stephen Compall  <···············@gmail.com> wrote:
+---------------
| Rob Warnock wrote:
| > The main advantage to the LOOP version is that it [probably]
| > uses an internal cached tail pointer to implement the COLLECTs,
...
| 
| On some implementations, PUSH/NREVERSE can be faster than chasing tails.
+---------------

And on some implementations, PUSH/REVERSE can be faster than
PUSH/NREVERSE -- it all depends on the size of the list being
reversed, the nature of the GC, and, for generational GCs, the size
of the nursery. Oh, and the cost of recording stores into older
generations ["remembered sets"]. Specifically, with generational
GCs and with very large argument lists, chasing tails is often the
fastest, since it leaves the oldest members of the result list(s)
unmodified, and thus the mutator doesn't have to record remembered
set information for stores into the part of the result list(s) that
was already promoted out of the nursery [since there aren't any such
stores, except for the one variable that holds the tail pointer].


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: Thomas A. Russ
Subject: Re: How can "cons per call" be so different for these two very similar functions?
Date: 
Message-ID: <ymiy7t3zahg.fsf@blackcat.isi.edu>
Pascal Bourguignon <···@informatimago.com> writes:

> "dpapathanasiou" <···················@gmail.com> writes:
> 
> >> Because one conses the elements of atoms that are in corpus, and the
> >> other conses the elements of atoms that are not in corpus.
> >
> > Ah, that's right -- given enough differences in the data (and I'll go
> > back over the lists we used in the test runs), the in-variance function
> > could easily be much higher.
> >
>> Well, if you always use these two functions together, it would be more
>> efficient to generate both results at once:
...
> > The case could be made either way for this one, though, and certainly
> > it would be better in terms of efficiency; thanks again for the
> > suggestion.
...
> (defun split-on-corpus (atoms preprocessed-corpus)
>   (loop :for atom :in atoms
>         :if (gethash atom preprocessed-corpus)
>               :collect atom :into present
>         :else :collect atom :into absent
>         :finally (return (list present absent))))

This is also a place where one could use the multiple-value return
feature of Lisp to good effect.  That would involve changing the return
form to (return (VALUES present absent)).  The advantage to doing it
this way is that if all you care about is the list of matches, you don't
have to bother with dealing with the absences.

Assuming that you want the present list more often, this gives you the
efficiency of doing the operations together and the convenience of
having a simple function for what you want.  The ROUND, TRUNCATE,
etc. functions in Common Lisp are designed in the same way.
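
Spelled out, that is Pascal's SPLIT-ON-CORPUS with only the RETURN form
changed:

```lisp
(defun split-on-corpus (atoms preprocessed-corpus)
  (loop :for atom :in atoms
        :if (gethash atom preprocessed-corpus)
              :collect atom :into present
        :else :collect atom :into absent
        :finally (return (values present absent))))

;; Callers who only want the matches can ignore the second value:
;;   (split-on-corpus atoms table)        ; primary value: the matches
;; Callers who want both bind them explicitly:
;;   (multiple-value-bind (present absent)
;;       (split-on-corpus atoms table)
;;     ...)
```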

-- 
Thomas A. Russ,  USC/Information Sciences Institute
From: Pascal Bourguignon
Subject: Re: How can "cons per call" be so different for these two very similar functions?
Date: 
Message-ID: <87psefqpv8.fsf@thalassa.informatimago.com>
···@sevak.isi.edu (Thomas A. Russ) writes:

> Pascal Bourguignon <···@informatimago.com> writes:
>
>> "dpapathanasiou" <···················@gmail.com> writes:
>> 
>> >> Because one conses the elements of atoms that are in corpus, and the
>> >> other conses the elements of atoms that are not in corpus.
>> >
>> > Ah, that's right -- given enough differences in the data (and I'll go
>> > back over the lists we used in the test runs), the in-variance function
>> > could easily be much higher.
>> >
>> >> Well if you alway use these two functions together, it would be more
>> >> efficient to generate both the results at once:
> ...
>> > The case could be made either way for this one, though, and certainly
>> > it would be better in terms of efficiency; thanks again for the
>> > suggestion.
> ...
>> (defun split-on-corpus (atoms preprocessed-corpus)
>>   (loop :for atom :in atoms
>>         :if (gethash atom preprocessed-corpus)
>>               :collect atom :into present
>>         :else :collect atom :into absent
>>         :finally (return (list present absent))))
>
> This is also a place where one could use the multiple-value return
> feature of Lisp to good effect.  That would involve changing the return
> form to (return (VALUES present absent)).  The advantage to doing it
> this way is that if all you care about is the list of matches, you don't
> have to bother with dealing with the absences.

Yes, but since it was said that both functions were always called
together, I inferred no preference for one or the other, hence the LIST
instead of VALUES.

If you needed only one it would be ridiculous to cons the other.

> Assuming that you want the present list more often, this gives you the
> efficiency of doing the operations together and the convenience of
> having a simple function for what you want.  The ROUND, TRUNCATE,
> etc. functions in Common Lisp are designed in the same way.
>
> -- 
> Thomas A. Russ,  USC/Information Sciences Institute

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
I need a new toy.
Tail of black dog keeps good time.
Pounce! Good dog! Good dog!