From: gtasso
Subject: sort unique
Date: 
Message-ID: <b9706300-f30a-4fe8-887e-5f29b08e1afd@2g2000hsn.googlegroups.com>
hello all,

I have been doing this for some time now.  I want to extract a unique
list from a list of items:

(let* ((cl-list '("Pitman" "Kizales" "Gabriel" "Moon" "Moon" "Pitman"
                  "Kizales"))
       (unique-list nil)
       (exist nil))
  (dolist (e cl-list)
    (setf exist nil)
    (dolist (f unique-list)
      (if* (equal e f) then
           (setf exist t)))
    (if* (not exist) then
         (push e unique-list)))
  unique-list)

Could anyone show me a better unique-extractor algorithm, or is there
a built-in one in CL?


Regards

George.

From: D Herring
Subject: Re: sort unique
Date: 
Message-ID: <XrOdnbGdE5wQFnHanZ2dnUVZWhednZ2d@comcast.com>
gtasso wrote:
> hello all,
> 
> I have been doing this for sometime now  i want to extract a unique
> list with from a list of items:
> 
> ((let* ((cl-list '("Pitman" "Kizales" "Gabriel" "Moon" "Moon" "Pitman"
> "Kizales"))
>          (unique-list nil)
>          (exist nil))
>        (dolist (e cl-list)
>         (setf exist nil)
>   (dolist (f unique-list)
>            (if* (equal e f) then
>    (setf exist t)))
> (if* (not exist) then
>      (push e unique-list)))
>                    unique-list)
> 
> Could anyone show me a better unique extractor algorithm or is there a
> build in one in cl.

(sort '("Pitman" "Kizales" "Gabriel" "Moon" "Moon" "Pitman" "Kizales") 
#'string<)
(remove-duplicates * :test #'string=)

- Daniel
From: Ken Tilton
Subject: Re: sort unique
Date: 
Message-ID: <47ecc5c1$0$15164$607ed4bc@cv.net>
D Herring wrote:
> gtasso wrote:
> 
>> hello all,
>>
>> I have been doing this for sometime now  i want to extract a unique
>> list with from a list of items:
>>
>> ((let* ((cl-list '("Pitman" "Kizales" "Gabriel" "Moon" "Moon" "Pitman"
>> "Kizales"))
>>          (unique-list nil)
>>          (exist nil))
>>        (dolist (e cl-list)
>>         (setf exist nil)
>>   (dolist (f unique-list)
>>            (if* (equal e f) then
>>    (setf exist t)))
>> (if* (not exist) then
>>      (push e unique-list)))
>>                    unique-list)
>>
>> Could anyone show me a better unique extractor algorithm or is there a
>> build in one in cl.
> 
> 
> (sort '("Pitman" "Kizales" "Gabriel" "Moon" "Moon" "Pitman" "Kizales") 
> #'string<)

That is a destructive operation on a quoted list, something noobs do 
anyway, we better not encourage them. :)

> (remove-duplicates * :test #'string=)

Sweet.

kt

-- 
http://smuglispweeny.blogspot.com/
http://www.theoryyalgebra.com/

"In the morning, hear the Way;
  in the evening, die content!"
                     -- Confucius
From: Pascal Bourguignon
Subject: Re: sort unique
Date: 
Message-ID: <87ej9vqqpd.fsf@thalassa.informatimago.com>
D Herring <········@at.tentpost.dot.com> writes:

> gtasso wrote:
>> hello all,
>> I have been doing this for sometime now  i want to extract a unique
>> list with from a list of items:
>> ((let* ((cl-list '("Pitman" "Kizales" "Gabriel" "Moon" "Moon"
>> "Pitman"
>> "Kizales"))
>>          (unique-list nil)
>>          (exist nil))
>>        (dolist (e cl-list)
>>         (setf exist nil)
>>   (dolist (f unique-list)
>>            (if* (equal e f) then
>>    (setf exist t)))
>> (if* (not exist) then
>>      (push e unique-list)))
>>                    unique-list)
>> Could anyone show me a better unique extractor algorithm or is there
>> a
>> build in one in cl.
>
> (sort '("Pitman" "Kizales" "Gabriel" "Moon" "Moon" "Pitman" "Kizales")
> #'string<)
> (remove-duplicates * :test #'string=)

SORT is not needed here.  It's even extremely bad, since SORT is a
destructive function and you're applying it to a literal!

(remove-duplicates  '("Pitman" "Kizales" "Gabriel" "Moon" "Moon" "Pitman" "Kizales") 
                    :test (function string=))
--> ("Gabriel" "Moon" "Pitman" "Kizales")

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

"Our users will know fear and cower before our software! Ship it!
Ship it and let them flee like the dogs they are!"
From: Ken Tilton
Subject: Re: sort unique
Date: 
Message-ID: <47ecc3ff$0$15182$607ed4bc@cv.net>
gtasso wrote:
> hello all,
> 
> I have been doing this for sometime now  i want to extract a unique
> list with from a list of items:
> 
> ((let* ((cl-list '("Pitman" "Kizales" "Gabriel" "Moon" "Moon" "Pitman"
> "Kizales"))
>          (unique-list nil)
>          (exist nil))
>        (dolist (e cl-list)
>         (setf exist nil)
>   (dolist (f unique-list)
>            (if* (equal e f) then
>    (setf exist t)))
> (if* (not exist) then
>      (push e unique-list)))
>                    unique-list)
> 
> Could anyone show me a better unique extractor algorithm or is there a
> build in one in cl.

A progression of alternates:

(defparameter *raw* (list "Pitman" "Kizales" "Gabriel" ;; 0
                       "Moon" "Moon" "Pitman" "Kizales"))

(let (unique-list)
   (dolist (e *raw* (nreverse unique-list))
     (dolist (f unique-list (push e unique-list))
       (when (string= e f)
         (return)))))

(let (unique-list)
   (dolist (e *raw* (nreverse unique-list))
     (unless (find e unique-list :test 'string=)
       (push e unique-list))))

(loop for e in *raw*
     unless (find e unique :test 'string=)
     collect e into unique
     finally (return unique))

(let (unique-list)
   (dolist (e *raw* (nreverse unique-list))
     (pushnew e unique-list :test 'string=)))

(remove-duplicates *raw* :test 'string=)

kt

-- 
http://smuglispweeny.blogspot.com/
http://www.theoryyalgebra.com/

"In the morning, hear the Way;
  in the evening, die content!"
                     -- Confucius
From: William James
Subject: Re: sort unique
Date: 
Message-ID: <4e67782c-5be9-4e31-9b56-d3897f912da7@m71g2000hse.googlegroups.com>
Ken Tilton wrote:

> (let (unique-list)
>    (dolist (e *raw* (nreverse unique-list))
>      (pushnew e unique-list :test 'string=)))

Arc:

(let uniqs nil
  (each e *raw*  (pushnew e uniqs))
  rev.uniqs)

>
> (remove-duplicates *raw* :test 'string=)

Arc:

dedup.*raw*

Ruby:

raw.uniq
From: Ken Tilton
Subject: Re: sort unique
Date: 
Message-ID: <47f025b2$0$15172$607ed4bc@cv.net>
Ken Tilton wrote:
> 
> 
> gtasso wrote:
> 
>> hello all,
>>
>> I have been doing this for sometime now  i want to extract a unique
>> list with from a list of items:
>>
>> ((let* ((cl-list '("Pitman" "Kizales" "Gabriel" "Moon" "Moon" "Pitman"
>> "Kizales"))
>>          (unique-list nil)
>>          (exist nil))
>>        (dolist (e cl-list)
>>         (setf exist nil)
>>   (dolist (f unique-list)
>>            (if* (equal e f) then
>>    (setf exist t)))
>> (if* (not exist) then
>>      (push e unique-list)))
>>                    unique-list)
>>
>> Could anyone show me a better unique extractor algorithm or is there a
>> build in one in cl.
> 
> 
> A progression of alternates:
> 
> (defparameter *raw* (list "Pitman" "Kizales" "Gabriel" ;; 0
>                       "Moon" "Moon" "Pitman" "Kizales"))
> 
> (let (unique-list)
>   (dolist (e *raw* (nreverse unique-list))
>     (dolist (f unique-list (push e unique-list))
>       (when (string= e f)
>         (return)))))
> 
> (let (unique-list)
>   (dolist (e *raw* (nreverse unique-list))
>     (unless (find e unique-list :test 'string=)
>       (push e unique-list))))
> 
> (loop for e in *raw*
>     unless (find e unique :test 'string=)
>     collect e into unique
>     finally (return unique))
> 
> (let (unique-list)
>   (dolist (e *raw* (nreverse unique-list))
>     (pushnew e unique-list :test 'string=)))
> 
> (remove-duplicates *raw* :test 'string=)

(a) Had you responded here instead of by email I could have eviscerated 
you in public (the only thing I really enjoy, the only thing that sets 
me apart from serial killers).

(b) As for "I seem to like the last"... do you not have the word 
"progression" on your planet? <sigh>

The sad thing is I went to a lot of trouble with that post and it turns 
out you are just... I better stop here.
From: Scott Burson
Subject: Re: sort unique
Date: 
Message-ID: <2457514a-7632-4ed0-8682-66b420548b58@r9g2000prd.googlegroups.com>
(antispam)
From: Alex Mizrahi
Subject: Re: sort unique
Date: 
Message-ID: <47ecdceb$0$90269$14726298@news.sunsite.dk>
 g> Could anyone show me a better unique extractor algorithm or is there a
 g> build in one in cl.

remove-duplicates (or delete-duplicates, the destructive version) is
built in.  however, it typically requires ~N^2 operations (comparing
each to each), which is not efficient if your list is large.
in this case it's better to use a hash table:

(defun remove-duplicates-ht (list)
  (loop with ht = (make-hash-table :test 'equal)
        for w in list
        do (setf (gethash w ht) t)
        finally (return (loop for w being each hash-key of ht
                              collect w))))
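
for what it's worth, here is a variant along the same lines that also
preserves the order of first occurrence (an untested sketch;
remove-duplicates-ht/ordered is a made-up name):

(defun remove-duplicates-ht/ordered (list)
  (let ((seen (make-hash-table :test 'equal))
        (result '()))
    (dolist (w list (nreverse result))
      (unless (gethash w seen)
        (setf (gethash w seen) t)   ; mark as seen
        (push w result)))))

still ~N operations, one hash lookup per element.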
From: Scott Burson
Subject: Re: sort unique
Date: 
Message-ID: <20c5d04a-60f5-4af4-8f3e-65641184b30f@s13g2000prd.googlegroups.com>
On Mar 28, 4:56 am, "Alex Mizrahi" <········@users.sourceforge.net>
wrote:
>  g> Could anyone show me a better unique extractor algorithm or is there a
>  g> build in one in cl.
>
> remove-duplicates (or delete-duplicates, destructuve version) are built in.
> however typically they require ~N^2 operations (comparing each to each),
> that is not efficient if your list is large.
> in this case it's better to use hash table:

Or my FSet library:

  (defun remove-duplicates-fset (list)
    (fset:convert 'list (fset:convert 'fset:set list)))

Of course, if you're making much use of FSet, you might want to leave
the collection as an fset:set rather than converting it back to a
list.

http://common-lisp.net/project/fset/

-- Scott
From: Ken Tilton
Subject: Re: sort unique
Date: 
Message-ID: <47ed2489$0$25028$607ed4bc@cv.net>
Scott Burson wrote:
> On Mar 28, 4:56 am, "Alex Mizrahi" <········@users.sourceforge.net>
> wrote:
> 
>> g> Could anyone show me a better unique extractor algorithm or is there a
>> g> build in one in cl.
>>
>>remove-duplicates (or delete-duplicates, destructuve version) are built in.
>>however typically they require ~N^2 operations (comparing each to each),
>>that is not efficient if your list is large.
>>in this case it's better to use hash table:
> 
> 
> Or my FSet library:
> 
>   (defun remove-duplicates-fset (list)
>     (fset:convert 'list (fset:convert 'fset:set list))
> 
> Of course, if you're making much use of FSet, you might want to leave
> the collection as an fset:set rather than converting it back to a
> list.
> 
> http://common-lisp.net/project/fset/
> 

I would love to see the time output on some benchmark comparing your 
best effort using standard CL with what you have done with FSet.

By "best CL effort" I do not mean some dazzling cdr/car/rplacd dense 
monstrosity -- let some yobbo do that to see if they can beat FSet -- I 
mean the best code one would naturally write.

Of course you want a nice fat test -- well, maybe set it up so you can 
run at different orders of magnitude to see where it really starts to 
get interesting.

cheers, kenny

-- 
http://smuglispweeny.blogspot.com/
http://www.theoryyalgebra.com/

"In the morning, hear the Way;
  in the evening, die content!"
                     -- Confucius
From: Ken Tilton
Subject: Re: sort unique
Date: 
Message-ID: <47ed2d58$0$5636$607ed4bc@cv.net>
Ken Tilton wrote:
> 
> 
> Scott Burson wrote:
> 
>> On Mar 28, 4:56 am, "Alex Mizrahi" <········@users.sourceforge.net>
>> wrote:
>>
>>> g> Could anyone show me a better unique extractor algorithm or is 
>>> there a
>>> g> build in one in cl.
>>>
>>> remove-duplicates (or delete-duplicates, destructuve version) are 
>>> built in.
>>> however typically they require ~N^2 operations (comparing each to each),
>>> that is not efficient if your list is large.
>>> in this case it's better to use hash table:
>>
>>
>>
>> Or my FSet library:
>>
>>   (defun remove-duplicates-fset (list)
>>     (fset:convert 'list (fset:convert 'fset:set list))
>>
>> Of course, if you're making much use of FSet, you might want to leave
>> the collection as an fset:set rather than converting it back to a
>> list.
>>
>> http://common-lisp.net/project/fset/
>>
> 
> I would love to see the time output on some benchmark comparing your 
> best effort using standard CL with what you have done with FSet.
> 
> By "best CL effort" I do not mean some dazzling cdr/car/rplacd dense 
> monstrosity -- let some yobbo do that to see if they can beat FSet -- I 
> mean the best code one would naturally write.
> 
> Of course you want a nice fat test -- well, maybe set it up so you can 
> run at different orders of magnitude to see where it really starts to 
> get interesting.

btw, I am guessing the underlying b-tree stuff is a nice deliverable in 
its own right? Or did you use some implementation you can recommend?

kenny

-- 
http://smuglispweeny.blogspot.com/
http://www.theoryyalgebra.com/

"In the morning, hear the Way;
  in the evening, die content!"
                     -- Confucius
From: Scott Burson
Subject: Re: sort unique
Date: 
Message-ID: <e590ff0b-346f-43b3-8c0e-366e93e78c75@j1g2000prb.googlegroups.com>
On Mar 28, 10:02 am, Ken Tilton <···········@optonline.net> wrote:
> Scott Burson wrote:
> >http://common-lisp.net/project/fset/
>
> I would love to see the time output on some benchmark comparing your
> best effort using standard CL with what you have done with FSet.

I should do this, but (a) it would be a lot of work, and (b) on very
small collections, to be quite honest, FSet doesn't look very good.

FSet's purpose is not to win benchmark wars.  Rather, its goal vis-a-
vis performance is to perform well enough to make functional
collections usable for most application programming tasks.  Functional
collections have a variety of stylistic benefits, I argue, which are
worth trading some raw performance to get.  My belief is that so long
as the time _complexity_ -- the scaling behavior -- of the various
FSet operations is as good as possible, their performance in absolute
terms will rarely be a problem (and when it is, some other data
structure can be substituted).

After all, as collections get larger, O(N) or O(N log N) always beats
O(N^2) eventually.  And the thing about FSet is, since it uses
balanced trees, it has _no_ quadratic operations.  No operation is
worse than O(N log N) -- and some are better than you might expect;
set union, for instance, is O(N).  None of the data structures Lisp
provides has this property.  Sequences are an interesting example.
You might think it silly to put a sequence into a balanced tree, as it
makes indexing O(log N) instead of O(1) as you would get with a
vector.  But with trees, you also get O(log N) concatenation.  What
other data structure do you have lying around that will give you that?
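
For instance, a quick sketch with sets (using the fset:set constructor
macro):

  (fset:union (fset:set "Pitman" "Moon") (fset:set "Moon" "Gabriel"))
  ;; a set containing "Pitman", "Moon", and "Gabriel"; the union of
  ;; two sets of size N is O(N), because the balanced trees are
  ;; merged directly rather than element by element.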

I've also tried to make FSet very fully featured.  Here's an imperfect
but nonetheless useful analogy.  Where the raw Lisp types -- lists and
hash tables -- are like a Lotus Exige, FSet is like a Mercedes sedan.
On the track, the Mercedes is not in the same class as the Lotus.  But
it's smooth and quiet, has every feature you've ever heard of and some
you may not have, seats five (scaling! :), the stereo kicks ass, and
the air conditioner works.  Which car would you rather do your daily
commute in???

-- Scott
From: Russell McManus
Subject: Re: sort unique
Date: 
Message-ID: <87d4p1yd1p.fsf@thelonious.cl-user.org>
Scott Burson <········@gmail.com> writes:

> I've also tried to make FSet very fully featured.  Here's an
> imperfect but nonetheless useful analogy.  Where the raw Lisp types
> -- lists and hash tables -- are like a Lotus Exige, FSet is like a
> Mercedes sedan.  On the track, the Mercedes is not in the same class
> as the Lotus.  But it's smooth and quiet, has every feature you've
> ever heard of and some you may not have, seats five (scaling! :),
> the stereo kicks ass, and the air conditioner works.  Which car
> would you rather do your daily commute in???

I love the FSet idea.  This should be in every lisp programmer's
toolbox.

Do you have a solution / idiom for integrating FSet with CL:LOOP?
I've often thought that loop was missing generic iteration
capabilities.

Of course it's possible to define a generic iteration protocol and use
it with CL:LOOP.  The usage however remains somewhat clumsy.  I
include below a sketch of such an implementation, and an example
CL:LOOP for that exercises the protocol.

Anyone have any ideas about how this can be made tighter / nicer to
use?


(defpackage iterator (:use :cl))
(in-package :iterator)

(defclass list-iterator ()
  ((list :initarg :list :reader list-iterator-list))
  (:documentation "An implementation of the iterator protocol for lists."))

(defmethod make-iterator ((cons cons))
  (make-instance 'list-iterator :list cons))
(defmethod make-iterator ((obj (eql 'nil)))
  (make-instance 'list-iterator :list nil))
(defmethod has-next-item ((list-iterator list-iterator))
  (not (null (list-iterator-list list-iterator))))
(defmethod next-item ((list-iterator list-iterator))
  (with-slots (list) list-iterator
    (let ((result (car list)))
      (setf list (cdr list))
      result)))

;; here is an example of how to use the protocol with CL:LOOP

(let ((list (list 1 2 3 4)))
  (loop with iterator = (make-iterator list)
     while (has-next-item iterator)
     for item = (next-item iterator)
     collect (1+ item)))

=> (2 3 4 5)
From: Scott Burson
Subject: Re: sort unique
Date: 
Message-ID: <4ad4b063-e7d3-4d9e-a62f-fb5a952d4dc6@s33g2000pri.googlegroups.com>
On Apr 7, 7:06 am, Russell McManus <···············@yahoo.com> wrote:
> Scott Burson <········@gmail.com> writes:
> > I've also tried to make FSet very fully featured.  Here's an
> > imperfect but nonetheless useful analogy.  Where the raw Lisp types
> > -- lists and hash tables -- are like a Lotus Exige, FSet is like a
> > Mercedes sedan.  On the track, the Mercedes is not in the same class
> > as the Lotus.  But it's smooth and quiet, has every feature you've
> > ever heard of and some you may not have, seats five (scaling! :),
> > the stereo kicks ass, and the air conditioner works.  Which car
> > would you rather do your daily commute in???
>
> I love the FSet idea.  This should be in every lisp programmer's
> toolbox.

Thanks!  That's certainly my opinion :)

> Do you have a solution / idiom for integrating FSet with CL:LOOP?

Uh -- er -- I'm afraid I happen to be one of those who has always
hated `loop'.  So I haven't put any effort into that specifically.

> I've often thought that loop was missing generic iteration
> capabilities.

Yes, and you may wish to check out `iterate', which attempts to clean
up the syntax a little bit and also incorporate extensibility.

http://common-lisp.net/project/iterate/

I'm not an `iterate' user myself, however.

> Of course it's possible to define a generic iteration protocol and use
> it with CL:LOOP.

Ah, good point.

FSet has very generous support for iteration, supporting four
different styles.  One is the set-theoretic style, using operators
`image', `filter', and `fold' (`fold' is like `reduce'); this style is
best when the purpose of the iteration is to construct a new
collection or to compute some function over the elements of the
collection.  (For example, (fold #'union ...) is a very common idiom
that takes a set of sets and returns the union of the elements.)

For imperative iteration (e.g., to print the contents of a
collection), FSet provides iteration macros `do-set', `do-map', `do-
seq', etc. in the style of `dolist'.  These are very fast (amortized
constant time per element) but don't support parallel iteration over
two or more collections.

For cases where you want to do parallel iteration, as well as to
interface to other iteration constructs, FSet provides stateful
iterators very much like you propose (except I used a closure rather
than a CLOS instance).
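
Roughly, the closure style looks like this (a sketch for plain lists,
not FSet's actual code):

  (defun make-list-iterator (list)
    (lambda (op)
      (ecase op
        (:done? (null list))
        (:get (pop list)))))

  (let ((it (make-list-iterator '(1 2 3))))
    (loop until (funcall it :done?)
          collect (funcall it :get)))

  => (1 2 3)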

And finally, though I don't know if anyone will ever use it but me,
FSet is integrated with my old `gmap' macro, which I believe even
predates `loop' and has long been the way I have preferred to write
many functional iterations (those that compute a value rather than
having side effects).  It is like `image', `filter', and `reduce'
rolled into one, but unlike these FSet operations it supports parallel
iteration, index generation, and all the Lisp collection types
including vectors, strings, alists, and even disembodied plists.  For
example:

  (gmap :alist nil (:index 0 nil) (:list foo))

constructs an alist pairing successive integers starting at 0 with the
elements of list `foo'.  To construct an FSet map instead, just change
`:alist' to `:map'.  (The `nil' after `:alist' is an abbreviation for
the identity function -- in this case, the identity function of two
arguments that returns two values.)

Okay, back to your point.  You certainly could use FSet's existing
iterators with `loop' in the manner you demonstrate.  A tighter
integration with `iterate' might also be worthwhile, but I have not
undertaken this.

(Whoops, gotta go... to be continued.)

-- Scott
From: Scott Burson
Subject: Re: sort unique
Date: 
Message-ID: <3f3ee8ba-d768-4699-a32a-0f3cd9b18088@b9g2000prh.googlegroups.com>
On Apr 7, 10:27 am, Scott Burson <········@gmail.com> wrote:
>
> FSet has very generous support for iteration, supporting four
> different styles.

You see where the Mercedes analogy comes from?  I really have tried to
include the telescoping steering wheel and power seats with 2-driver
memory, the heated seats, the traction control, the navigation system,
the infrared-blocking glass to keep the car cool in the summer, the
DVD player in the back for the kids, and the Peltier-effect cupholder
heater/coolers with autosensor to keep your drink the same temperature
it started at (ha! there's an idea!! I wonder if anyone has done
that???).

-- Scott
From: Scott Burson
Subject: Re: sort unique
Date: 
Message-ID: <de9e3cb6-fe1b-4183-b864-141809501524@x19g2000prg.googlegroups.com>
(Just another reply to keep the thread warm.)

-- Scott
From: Thomas F. Burdick
Subject: Re: sort unique
Date: 
Message-ID: <37711de6-8a16-47f8-a201-4bc2e586e441@w5g2000prd.googlegroups.com>
On Apr 7, 7:27 pm, Scott Burson <········@gmail.com> wrote:
> On Apr 7, 7:06 am, Russell McManus <···············@yahoo.com> wrote:
>
> > Of course it's possible to define a generic iteration protocol and use
> > it with CL:LOOP.
>
> Ah, good point.

Well, ANSI CL doesn't provide a way to extend CL:LOOP, but MIT-derived
LOOPs have extension mechanisms.  They seem to be sitting unexported
in most implementations internals, but I'd guess that might have
something to do with a lack of readily available non-built-in
collections libraries.  FSet could be just the thing to nudge
implementors into supporting LOOP universes.
From: Marco Antoniotti
Subject: Re: sort unique
Date: 
Message-ID: <e30b2e69-657b-4bbc-8adc-4c91590c4cdd@m73g2000hsh.googlegroups.com>
On Apr 8, 9:32 am, "Thomas F. Burdick" <········@gmail.com> wrote:
> On Apr 7, 7:27 pm, Scott Burson <········@gmail.com> wrote:
>
> > On Apr 7, 7:06 am, Russell McManus <···············@yahoo.com> wrote:
>
> > > Of course it's possible to define a generic iteration protocol and use
> > > it with CL:LOOP.
>
> > Ah, good point.
>
> Well, ANSI CL doesn't provide a way to extend CL:LOOP, but MIT-derived
> LOOPs have extension mechanisms.  They seem to be sitting unexported
> in most implementations internals, but I'd guess that might have
> something to do with a lack of readily available non-built-in
> collections libraries.  FSet could be just the thing to nudge
> implementors into supporting LOOP universes.

I'd bet this is a hopeless cause.  First of all there are at least two
different LOOP implementations supported by different CLs (I'd also
bet that Lispworks is now supporting both, as they are still holding
Lucid/Liquid CL code).  Secondly, writing extensions for LOOP is still
a horrible thing to do; cf. CLSQL.  Third, if CLSQL (which does have
the FOR x BEING EACH TUPLE ...) did not push the vendors/implementors
to agree and export a common interface, I doubt that FSet could
achieve that.

pessimistically yours.
--
Marco Antoniotti
From: Russell McManus
Subject: Re: sort unique
Date: 
Message-ID: <87iqyswmke.fsf@thelonious.cl-user.org>
"Thomas F. Burdick" <········@gmail.com> writes:

> Well, ANSI CL doesn't provide a way to extend CL:LOOP, but
> MIT-derived LOOPs have extension mechanisms.  They seem to be
> sitting unexported in most implementations internals, but I'd guess
> that might have something to do with a lack of readily available
> non-built-in collections libraries.  FSet could be just the thing to
> nudge implementors into supporting LOOP universes.

Does anyone know where to find some example code that extends MIT
LOOP?

-russ
From: Marco Antoniotti
Subject: Re: sort unique
Date: 
Message-ID: <d2ec29cc-4090-4045-9617-2b4b5cd3068b@z38g2000hsc.googlegroups.com>
On Apr 8, 2:36 pm, Russell McManus <···············@yahoo.com> wrote:
> "Thomas F. Burdick" <········@gmail.com> writes:
>
> > Well, ANSI CL doesn't provide a way to extend CL:LOOP, but
> > MIT-derived LOOPs have extension mechanisms.  They seem to be
> > sitting unexported in most implementations internals, but I'd guess
> > that might have something to do with a lack of readily available
> > non-built-in collections libraries.  FSet could be just the thing to
> > nudge implementors into supporting LOOP universes.
>
> Does anyone know where to find some example code that extends MIT
> LOOP?
>
> -russ

Check out CLSQL.  Then compare with ITERATE.  Alas, we are stuck with
LOOP.  (Also check out CL-ENUMERATION http://common-lisp.net/project/cl-enumeration/
:) )

Cheers
--
Marco
From: Thomas A. Russ
Subject: Re: sort unique
Date: 
Message-ID: <ymiod8km9cb.fsf@blackcat.isi.edu>
Russell McManus <···············@yahoo.com> writes:
> 
> Does anyone know where to find some example code that extends MIT
> LOOP?

Well, I would start here.  Chapters 24-30 cover the loop macro.

  http://www.unlambda.com/lmman/lmman_24.html#SEC207

Specifically, you want to look at "The Iteration Framework" [29] and
"Iteration Paths"[30]:
   
  http://www.unlambda.com/lmman/lmman_29.html#SEC221
  http://www.unlambda.com/lmman/lmman_30.html#SEC226


which has an example of defining an iterator for iterating over the
characters in strings.

-- 
Thomas A. Russ,  USC/Information Sciences Institute
From: Marco Antoniotti
Subject: Re: sort unique
Date: 
Message-ID: <126013b0-74d4-4a6a-8ef2-cceae97d13ea@s8g2000prg.googlegroups.com>
On Apr 7, 4:06 pm, Russell McManus <···············@yahoo.com> wrote:
> Scott Burson <········@gmail.com> writes:
> > I've also tried to make FSet very fully featured.  Here's an
> > imperfect but nonetheless useful analogy.  Where the raw Lisp types
> > -- lists and hash tables -- are like a Lotus Exige, FSet is like a
> > Mercedes sedan.  On the track, the Mercedes is not in the same class
> > as the Lotus.  But it's smooth and quiet, has every feature you've
> > ever heard of and some you may not have, seats five (scaling! :),
> > the stereo kicks ass, and the air conditioner works.  Which car
> > would you rather do your daily commute in???
>
> I love the FSet idea.  This should be in every lisp programmer's
> toolbox.
>
> Do you have a solution / idiom for integrating FSet with CL:LOOP?
> I've often thought that loop was missing generic iteration
> capabilities.

You cannot extend LOOP in an easy way.  If you check CL-ENUMERATION
(shameless plug: it is just what you suggest) you will see that I
added a FOREACH macro that is kind of a - er - kludge.

Extending ITERATE is relatively easy; extending LOOP, when possible,
is ugly as hell and there is no way to avoid the stupid

    for x BEING THE WHATEVER OF baz

Cheers

Marco







>
> Of course it's possible to define a generic iteration protocol and use
> it with CL:LOOP.  The usage however remains somewhat clumsy.  I
> include below a sketch of such an implementation, and an example
> CL:LOOP for that exercises the protocol.
>
> Anyone have any ideas about how this can be made tighter / nicer to
> use?
>
> (defpackage iterator (:use :cl))
> (in-package :iterator)
>
> (defclass list-iterator ()
>   ((list :initarg :list :reader list-iterator-list))
>   (:documentation "An implementation of the iterator protocol for lists."))
>
> (defmethod make-iterator ((cons cons))
>   (make-instance 'list-iterator :list cons))
> (defmethod make-iterator ((obj (eql 'nil)))
>   (make-instance 'list-iterator :list nil))
> (defmethod has-next-item ((list-iterator list-iterator))
>   (not (null (list-iterator-list list-iterator))))
> (defmethod next-item ((list-iterator list-iterator))
>   (with-slots (list) list-iterator
>     (let ((result (car list)))
>       (setf list (cdr list))
>       result)))
>
> ;; here is an example of how to use the protocol with CL:LOOP
>
> (let ((list (list 1 2 3 4)))
>   (loop with iterator = (make-iterator list)
>      while (has-next-item iterator)
>      for item = (next-item iterator)
>      collect (1+ item)))
>
> => (2 3 4 5)
From: John Thingstad
Subject: Re: sort unique
Date: 
Message-ID: <op.t88x1fgeut4oq5@pandora.alfanett.no>
On Mon, 07 Apr 2008 16:06:42 +0200, Russell McManus  
<···············@yahoo.com> wrote:

>
> Anyone have any ideas about how this can be made tighter / nicer to
> use?
>

I have been playing around with something similar: a set library based
on ordering.  Like sort, the functions work for both arrays and lists.
Most of the functionality is achieved without duplicating code, by
using iterators.  My experience is that this approach leads to
inefficient code in Lisp.  Though it works great in C++ using
templates, or with foreach iterators in C#, it is a bad idea in Lisp
because of the run-time type checking.  In particular, the iterator
can't determine what type to dispatch on, and must do this at run time
every time it-value or it-next is called.  (My library, not yours.)

Lisp achieves reasonable efficiency by providing type-specific
operations as well as general ones; reasonable performance comes from
using the most specific operation for the problem at hand, like char=
or = rather than equal.  Trying to eliminate this distinction just
doesn't seem like a good idea in Lisp.

Here is a link to the code as it is.  Note that this is a work in
progress and not properly tested.  Nevertheless it should give you
some idea.

http://paste.lisp.org/display/58759

--------------
John Thingstad
From: Kent M Pitman
Subject: Re: sort unique
Date: 
Message-ID: <utziq64mw.fsf@nhplace.com>
"Alex Mizrahi" <········@users.sourceforge.net> writes:

>  g> Could anyone show me a better unique extractor algorithm or is there a
>  g> build in one in cl.
>
> remove-duplicates (or delete-duplicates, destructuve version) are
> built in.  however typically they require ~N^2 operations (comparing
> each to each), that is not efficient if your list is large.

That's just a bug, though, right?  There's nothing in the spec that 
requires this.  And you're betting against the implementation doing
better if you implement it yourself.

> in this case it's better to use hash table:
>
> (defun remove-duplicates-ht (list)
> (loop with ht = (hash-table :test 'equal)
>           for w in list
>           do (setf (gethash e ht) t)
>           finally (return (loop for w being each hash-key of ht
>                                           collect w))))

Why are you assuming that the system version of remove-duplicates does
not do

 (defun remove-duplicates (list &key ... (test #'eql))
   (if (or (eq test 'eql)    (eq test #'eql)
           (eq test 'eq)     (eq test #'eq)
           (eq test 'equal)  (eq test #'equal)
           (eq test 'equalp) (eq test #'equalp))
        (secretly-use-hash-tables ...)
      (...do something more general...)))

In particular, you don't know that the system doesn't secretly use hash
tables even for the more general case, at least in the case where the 
operator is something the system can analyze.  So by doing it yourself,
you're betting against it.

IMO, you're better off just calling the system thing and sending a bug
report if you observe it to be foolishly slow.  That way, you get the
benefit of improvements as they arrive, you don't have to maintain as
much code, and other people's code gets faster too.
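
(An aside: the remove-duplicates-ht quoted above won't run as posted,
since hash-table is not a constructor (make-hash-table is) and the loop
binds w while gethash is applied to e. A corrected sketch, with the
caveat that hash-table iteration order, and therefore the result order,
is unspecified:)

```lisp
(defun remove-duplicates-ht (list)
  "Remove EQUAL-duplicates from LIST via a hash table, in O(n)
expected time.  The order of the result is unspecified."
  (loop with ht = (make-hash-table :test 'equal)
        for w in list
        do (setf (gethash w ht) t)
        finally (return (loop for k being each hash-key of ht
                              collect k))))
```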
From: John Thingstad
Subject: Re: sort unique
Date: 
Message-ID: <op.t8qsl8xxut4oq5@pandora.alfanett.no>
På Fri, 28 Mar 2008 20:16:23 +0100, skrev Kent M Pitman  
<······@nhplace.com>:

>
> Why are you assuming that the system version of remove-duplicates does
> not do
>
>  (defun remove-duplicates (list &key ... (test #'eql))
>    (if (or (eq test 'eql)    (eq test #'eql)
>            (eq test 'eq)     (eq test #'eq)
>            (eq test 'equal)  (eq test #'equal)
>            (eq test 'equalp) (eq test #'equalp))
>         (secretly-use-hash-tables ...)
>       (...do something more general...)))
>
> In particular, you don't know that the system doesn't secretly use hash
> tables even for the more general case, at least in the case where the
> operator is something the system can analyze.  So by doing it yourself,
> you're betting against it.
>
> IMO, you're better off just calling the system thing and sending a bug
> report if you observe it to be foolishly slow.  That way, you get the
> benefit of improvements as they arrive, you don't have to maintain as
> much code, and other people's code gets faster too.

Wouldn't a define-compiler-macro be more appropriate here?
It seems to me the purpose of compiler macros is to allow you to
dispatch code based on type at compile time (if you know it), although
the ANSI Hyperspec is vague about its purpose.
Anyhow, it could save you from having to check the type every time the
function is run (if it is compiled).

Anyhow, there are some functions, like union, that run in n^2 no matter
how you implement them.
This was discussed earlier, and we found that with the specification as
it is you can't make it run in n. If your sets are really large, then it
would make sense to write a more specialized function using hash tables
or sorted lists.

Also, I find that most implementations aren't all that clever..

--------------
John Thingstad
From: Kent M Pitman
Subject: Re: sort unique
Date: 
Message-ID: <ubq4xxgju.fsf@nhplace.com>
"John Thingstad" <·······@online.no> writes:

> På Fri, 28 Mar 2008 20:16:23 +0100, skrev Kent M Pitman
> <······@nhplace.com>:
>
> > Why are you assuming that the system version of remove-duplicates does
> > not do
> >
> >  (defun remove-duplicates (list &key ... (test #'eql))
> >    (if (or (eq test 'eql)    (eq test #'eql)
> >            (eq test 'eq)     (eq test #'eq)
> >            (eq test 'equal)  (eq test #'equal)
> >            (eq test 'equalp) (eq test #'equalp))
> >         (secretly-use-hash-tables ...)
> >       (...do something more general...)))
> >
> > In particular, you don't know that the system doesn't secretly use hash
> > tables even for the more general case, at least in the case where the
> > operator is something the system can analyze.  So by doing it yourself,
> > you're betting against it.
> >
> > IMO, you're better off just calling the system thing and sending a bug
> > report if you observe it to be foolishly slow.  That way, you get the
> > benefit of improvements as they arrive, you don't have to maintain as
> > much code, and other people's code gets faster too.
> 
> Wouldn't a define-compiler-macro be more appropriate here?

Keep in mind that as a user, you are not permitted to write such a thing.

As to the system, they are permitted to use arbitrary magic, so
whether or not that is used is a judgment call.  But because the
effect of O(n^2) can be so pronounced, I'd be surprised if
compile-time-only was right since that could not possibly hope to
optimize interactive use (unless the implementation were compiled) nor
APPLY or FUNCALL with opaque first arguments.

It was hotly debated in the design of ANSI CL whether to have
define-compiler-macro at all, since it potentially happens at a
different time than regular compiler optimizations, and some vendors
worried we were requiring them to do optimization a particular way. (A
similar concern, and much more valid, occurs with the MOP, which
purports to explain a lot about how things work in the object system,
and where if you don't actually do it that way, leads to some amount
of confusion.  But in the case of compiler macros, since you can do it
BOTH ways both usefully and transparently, it's not as big a deal.)

In particular, some implementations have pattern matching facilities
and multiple-definitions-per-function that optimize certain cases
(with the definitions of such optimizers more like the way defmethod
is done, with a method combination more like an 'or' method
combination, than like the way defun is done).  There was also a worry
that define-compiler-macro meant that this couldn't happen, but of course
that was also not something that needed to be worried about, since the
call to any user-defined compiler macro can be among those things in the
internal system list, or can be done at another time altogether.

> It seems to me the purpose of compiler macros is to allow you to
> dispatch code based on type at compile time (if you know it) although
> the ANSI Hyperspec is vague about its purpose.

The purpose is to allow you to do something similar in user code to
what the system does.  I was the one that proposed adding it, although
certainly not the one who thought up the idea of such concepts; my
proposal was intended to allow a relatively portable way to allow some
programs to make some compile-time optimizations about functions
written by users.  The specific reason I thought it was critical was
to allow functions like those in CLIM that are heavily keyword-driven,
or functions that are like the sequence functions, to be permitted to
have rewrites into non-keyword forms so they can be called efficiently.

> Anyhow it could save you from having to check for type every time the
> function is run. (If it is compiled.)

Certainly it can.  And it's a reasonable optimization notwithstanding
the fact that I think it's worth the constant time overhead to also
check at runtime if the function did not match.  An implementation can
always have the optimizer turn into
(remove-duplicates-known-to-be-not-equal-or-friends ...) in the case
where there is a constant name of a function that is definitively not
one of the ones that can be analyzed. :)

But, to be clear, the implementation doesn't have to do this via 
define-compiler-macro.
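
For concreteness, here is a minimal sketch of the kind of compile-time
rewrite under discussion, written for a hypothetical user function
(conforming programs may not define compiler macros on standard CL
functions, and %fast-remove-duplicates is an assumed internal helper,
not a real API):

```lisp
;; Sketch only.  The compiler macro rewrites calls whose :TEST
;; argument is a constant standard predicate into a call to an
;; assumed hash-based helper, and declines (returns the form
;; unchanged) for anything it cannot analyze.
(define-compiler-macro my-remove-duplicates (&whole form list &key test)
  (if (member test '('eql #'eql 'equal #'equal) :test #'equal)
      `(%fast-remove-duplicates ,list ,test)  ; hypothetical fast path
      form))
```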

> Anyhow there are some functions like union that run in n^2 no matter
> how  you implement it.

Perhaps my brain is just not functioning at this particular moment,
but I don't see offhand with 3 seconds thought why UNION cannot be
implemented by APPEND (linear time) followed by remove duplicates
(certainly not O(n^2), some say O(n), for some reason  I have a recollection
that O(1) was not quite a fair description of GETHASH, though I know
people who use that... why do I think it should be counted as O(log(n))
or at least O(log(k)) for some value of k that is not wholly constant...?)

> This was discussed earlier and we found that with the specification as
> it  is you can't make it run in n. If your sets are really large then
> it  would make sense to make a more specialized function using
> hash-tables or  sorted lists.

This is a good recipe for having Lisp never get better.

Honestly, I understand what you're saying about the happenstance of what
vendors do now, but if you just recommend people program around the 
implementations, they have no incentive to get better.

Get a vendor to change, and then call out the vendors who didn't
follow suit.  Use the power of the market...
 
> Also I find that most implementations aren't all that clever..

How do you think anything else in the history of lisp ever became fast?

The "sufficiently clever compiler" may indeed be a ways away.  But then,
this doesn't require cleverness--just common sense and market pressure.
It's ok to work around this kind of thing as a temporary stopgap while you
wait for a specific fix, but I still think you should not advocate doing
so as anything other than a stopgap measure.  To ask for less is to 
tell the vendors (in which I include free software makers) it's ok to
ignore the matter, that you have it well in hand.
From: Alex Mizrahi
Subject: Re: sort unique
Date: 
Message-ID: <47ed6c18$0$90272$14726298@news.sunsite.dk>
 ??>> however typically they require ~N^2 operations (comparing
 ??>> each to each), that is not efficient if your list is large.

 KMP> Why are you assuming that the system version of remove-duplicates does
 KMP> not do
 KMP>         (secretly-use-hash-tables ...)

well, you see, i wrote "typically" -- that's what i'd expect from reasonable 
implementation developers: they optimize for the case of small lists, 
assuming that people with special/bigger needs will figure out from 
profiling that it's slow, and implement it the way that works best for them.

but i'd be pleased to know if some implementations actually have smarter 
version, such as checking list size and picking optimal solution. do you 
know any such implementation, by chance?

i've checked source code of SBCL, CMUCL, CLISP, ECL, ABCL and CormanCL -- it 
appears they all have this "bug", working on list without using advanced 
data structures/algorithms.

and, btw, if we take a look at SBCL (which is considered quite an advanced 
open source implementation), the code doesn't look very optimizable:

               (if (> (count (funcall key e) sequence :test test :key key
                             :start (if from-end start (+ start step 1))
                             :end (if from-end (- end step 1) end))
                      0)
                   (progn
                     (incf c)
                     (incf step)
                     (setq state2 (funcall step2 sequence state2 from-end2))
                     (when (funcall endp2 sequence state2 limit2 from-end2)
                       (return-from sequence:delete-duplicates (finish)))
                     (setq e (funcall elt2 sequence state2)))

i won't dare to say that a "sufficiently smart" compiler can make this as 
optimal as it could be for a given case.
that requires a compiler "smart as hell".
From: Richard M Kreuter
Subject: Re: sort unique
Date: 
Message-ID: <87zlsiwi3w.fsf@progn.net>
"Alex Mizrahi" <········@users.sourceforge.net> writes:

>  ??>> however typically they require ~N^2 operations (comparing
>  ??>> each to each), that is not efficient if your list is large.
>
>  KMP> Why are you assuming that the system version of remove-duplicates does
>  KMP> not do
>  KMP>         (secretly-use-hash-tables ...)
>
> well, you see, i wrote "typically" -- that's what i'd expect from reasonable 
> implementation developers: they optimize for a case of small list, implying 
> that people with special/bigger needs will figure out that it's slow from 
> profiling, and implement the way it works best for them.
>
> but i'd be pleased to know if some implementations actually have smarter 
> version, such as checking list size and picking optimal solution. do you 
> know any such implementation, by chance?
>
> i've checked source code of SBCL, CMUCL, CLISP, ECL, ABCL and
> CormanCL -- it appears they all have this "bug", working on list
> without using advanced data structures/algorithms.

This isn't correct about SBCL.  SBCL's REMOVE-DUPLICATES
implementation for lists uses a hash table internally when all these
conditions hold:

(1) the difference between END and START is greater than 20,
(2) no KEY is supplied,
(3) no TEST-NOT is supplied,
(4) TEST is one of #'EQL, #'EQ, #'EQUAL, #'EQUALP.

See src/code/seq.lisp in SBCL's source tree.  (For some reason it
doesn't look as though a hash table is used for REMOVE-DUPLICATES on
vectors; dunno why.)

> and, btw, if we take a look at SBCL (which is considered quite an advanced 
> open source implementation), the code doesn't look very optimizable:
>
>                (if (> (count (funcall key e) sequence :test test :key key
>                              :start (if from-end start (+ start step 1))
>                              :end (if from-end (- end step 1) end))
>                       0)
>                    (progn
>                      (incf c)
>                      (incf step)
>                      (setq state2 (funcall step2 sequence state2 from-end2))
>                      (when (funcall endp2 sequence state2 limit2 from-end2)
>                        (return-from sequence:delete-duplicates (finish)))
>                      (setq e (funcall elt2 sequence state2)))
>
> i won't dare to say that "sufficiently smart" compiler can make this as 
> optimal as it could be for a given case.
> that requires compiler "smart as hell".

The sample above comes from SBCL's generic sequence implementation and
is not executed on lists or vectors.

--
RmK
From: Robert Maas, see http://tinyurl.com/uh3t
Subject: Re: sort unique
Date: 
Message-ID: <rem-2008apr05-002@yahoo.com>
> From: Kent M Pitman <······@nhplace.com>
> ... you're betting against the implementation doing better if you
> implement it yourself.

Agreed. When in doubt, do benchmark timing tests to see whether the
behaviour provided by the vendor is O(N**2) or O(N * logN) or what.
Then decide whether, on the basis of that info, it's worth
re-inventing a better wheel.

> > in this case it's better to use hash table:
> > (defun remove-duplicates-ht (list)
> > (loop with ht = (hash-table :test 'equal)
> >           for w in list
> >           do (setf (gethash e ht) t)
> >           finally (return (loop for w being each hash-key of ht
> >                                           collect w))))

> Why are you assuming that the system version of remove-duplicates
> does not do
>  (defun remove-duplicates (list &key ... (test #'eql))
>    (if (or (eq test 'eql)    (eq test #'eql)
>            (eq test 'eq)     (eq test #'eq)
>            (eq test 'equal)  (eq test #'equal)
>            (eq test 'equalp) (eq test #'equalp))
>         (secretly-use-hash-tables ...)
>       (...do something more general...)))

Actually in those common cases, given that a hash table is not
*guaranteed* decent speed, you *might* have data where everything
hashes into a single bucket per the vendor-supplied hashing
algorithm, using a hash table might not really be a good thing to
secretly use without the advice and consent of the application
programmer. If the test somehow supports a total ordering on the
possible data values, then sorting per that total ordering (N *
logN) followed by collating adjacent elements in the sorted list,
might be the safest thing for a vendor to secretly supply. If the
test is EQ, then machine address could be secretly used as sort
key. If the test is EQL, then data could be split into classes
(numeric, character, CONS cell, etc.), each such group sorted per
its appropriate total-ordering, adjacent duplicates eliminated
within each group, and finally an APPEND of all the groups. That's
also less set-up overhead (for small datasets) than building a
temporary hash table.

> IMO, you're better off just calling the system thing and sending
> a bug report if you observe it to be foolishly slow.

IMO if the vendor's algorithm does something "obvious", which is
slower than a really tricky or innovative algorithm would be, and
the result is *correct* per the spec, that's not grounds for a bug
report. Maybe something gentler would be appropriate, such as a
"suggestion for further improvement"?

> That way, you get the benefit of improvements as they arrive, you
> don't have to maintain as much code, and other people's code gets
> faster too.

That is, if they don't get pissed at you for filing a BUG REPORT
about something that is merely a performance issue, and simply
discard your report, writing you off as another annoying person they'd
rather not have as a customer anyway. If you can serve 90% of the
potential customers, discard the other 10%, and thereby avoid an
order of magnitude extra cost dealing with nasty customers ...

Now in general it really isn't even possible to make the
remove-duplicates algorithm more efficient than o(N**2) with a
user-supplied comparison function. Imagine for example a comparison
function to eliminate keys that hash to the same bucket per some
user-defined really-weird hashing function. (Why did the
application programmer want this? Because he/she was deliberately
trying to show an example of a "perfect hash" for exposition
purposes in some tutorial. So he/she wanted to generate a large set
of random keys, eliminate duplicates, and thereby obtain a set of
keys with no hash-bucket duplicates which filled or almost filled
the hash table, then if almost-filled run a simple loop to try to
fill that last one or two buckets in the table.) Or maybe there
really was no purpose to the complicated user-defined hashing
function except to provoke the vendor-supplied remove-duplicates
function into o(N**2) timing. For example, make a list of random
strings of random lengths, and two strings (ignoring length) are
considered "the same" if their middle character is the same (for
odd-length strings) or middle two characters are both the same (for
even-length strings). Thus
(remove-duplicates listOfStrings :test #'middle-strings-same)
couldn't be optimized automatically by the vendor's algorithm, but
(remove-duplicates listOfStrings :test #'string= :key #'middle-string)
could indeed be optimized by pre-computing the keys then sorting
via #'string< then collating adjacent elements.
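
(A sketch of the middle-string key function described above; it assumes
non-empty strings and is only meant to illustrate the :key-based
rewrite:)

```lisp
(defun middle-string (s)
  "Return the middle character of S for odd lengths, or the middle
two characters for even lengths, as a string.  Assumes S is non-empty."
  (let* ((n (length s))
         (m (floor n 2)))
    (if (oddp n)
        (subseq s m (1+ m))
        (subseq s (1- m) (1+ m)))))

;; The keys are plain strings, so the vendor-optimizable form applies:
;; (remove-duplicates list-of-strings :test #'string= :key #'middle-string)
```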

Crap, XLISP-PLUS version 2.1g doesn't have multiple-value-setq!! I
was going to write middle-string, using XLISP to debug it, locally
on my Mac, without needing to go online to use CMUCL, just now
(12:04 PDT) while composing newsgroup articles offline. Nevermind.
From: Pascal Bourguignon
Subject: Re: sort unique
Date: 
Message-ID: <8763uvk1d8.fsf@thalassa.informatimago.com>
·······@yahoo.com (Robert Maas, see http://tinyurl.com/uh3t) writes:

>> From: Kent M Pitman <······@nhplace.com>
>> ... you're betting against the implementation doing better if you
>> implement it yourself.
>
> Agreed. When in doubt, do benchmark timing tests to see whether the
> behaviour provided by the vendor is O(N**2) or O(N * logN) or what.
> Then decide whether, on the basis of that info, it's worth
> re-inventing a better wheel.
>
>> > in this case it's better to use hash table:
>> > (defun remove-duplicates-ht (list)
>> > (loop with ht = (hash-table :test 'equal)
>> >           for w in list
>> >           do (setf (gethash e ht) t)
>> >           finally (return (loop for w being each hash-key of ht
>> >                                           collect w))))
>
>> Why are you assuming that the system version of remove-duplicates
>> does not do
>>  (defun remove-duplicates (list &key ... (test #'eql))
>>    (if (or (eq test 'eql)    (eq test #'eql)
>>            (eq test 'eq)     (eq test #'eq)
>>            (eq test 'equal)  (eq test #'equal)
>>            (eq test 'equalp) (eq test #'equalp))
>>         (secretly-use-hash-tables ...)
>>       (...do something more general...)))
>
> Actually in those common cases, given that a hash table is not
> *guaranteed* decent speed, you *might* have data where everything
> hashes into a single bucket per the vendor-supplied hashing
> algorithm, using a hash table might not really be a good thing to
> secretly use without the advise and consent of the application
> programmer. If the test somehow supports a total ordering on the
> possible data values, then sorting per that total ordering (N *
> logN) followed by collating adjacent elements in the sorted list,
> might be the safest thing for a vendor to secretly supply. If the
> test is EQ, then machine address could be secretly used as sort
> key. If the test is EQL, then data could be split into classes
> (numeric, character, CONS cell, etc.), each such group sorted per
> its appropriate total-ordering, adjacent duplicates eliminated
> within each group, and finally an APPEND of all the groups. That's
> also less set-up overhead (for small datasets) than building a
> temporary hash table.

Note that STL (for C++) uses lessp instead of equalp predicates in
general.  equalp can always be inferred as (not (or (lessp a b) (lessp b a))).

So we could add a :LESSP keyword to remove-duplicates, and use a
sorting algorithm when it's provided instead of an equal :TEST.

However, this doesn't replace :TEST EQ, because while SXHASH is
defined on all lisp objects, there is no pointer comparison in lisp
like in C/C++ (and can't be since we want to allow reallocating
garbage collectors).  

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

READ THIS BEFORE OPENING PACKAGE: According to certain suggested
versions of the Grand Unified Theory, the primary particles
constituting this product may decay to nothingness within the next
four hundred million years.
From: Robert Maas, see http://tinyurl.com/uh3t
Subject: Re: sort unique
Date: 
Message-ID: <rem-2008apr06-002@yahoo.com>
> From: Pascal Bourguignon <····@informatimago.com>
> Note that STL (for C++) uses lessp instead of equalp predicates
> in general.  equalp can always be inferred as (not (or (lessp a b)
> (lessp b a)))

That works only in cases where you have a well-defined total
ordering on equivalence classes, such that exactly one of the
following is true for any two items a and b:
(< (class a) (class b))
(= (class a) (class b))
(> (class a) (class b))
Consider for example this simple case where you do *not* have
such a total ordering: Consider sets (in set theory), using
set-equality (same elements) as the equality predicate.
What sort of total ordering can you possibly define for equivalence
classes of sets per set-equality?

> So we could add a :LESSP keyword to remove-duplicates, and use a
> sorting algorithm when it's provided instead of an equal :TEST.

That will work only when there is in fact such a total ordering per
a lessp function you can define.

> However, this doesn't replace :TEST EQ, because while SXHASH is
> defined on all lisp objects, there is no pointer comparison in lisp
> like in C/C++ (and can't be since we want to allow reallocating
> garbage collectors).

And it doesn't work for :TEST #'set-equal either.
And it doesn't work for :TEST {a lot of stuff}!
I suppose it would work exactly in the cases where the application
programmer could have sorted the data per the :LESSP comparator
then collated adjacent elements per the :TEST comparator (or used
the :LESSP comparator in reverse), right? Something like this perhaps:
  (mapcon #'(lambda (tl)
              (if (or (endp (cdr tl))
                      (<lesspPredicate> (car tl) (cadr tl)))
                  (list (car tl))))
    (sort (copy-list original) #'<lesspPredicate>))
From: Scott Burson
Subject: Re: sort unique
Date: 
Message-ID: <8508fe34-df7c-4a1d-abed-314f606bc7e4@m1g2000pre.googlegroups.com>
On Apr 6, 4:35 pm, ·······@yahoo.com (Robert Maas, see http://tinyurl.com/uh3t)
wrote:
> > From: Pascal Bourguignon <····@informatimago.com>
> > Note that STL (for C++) uses lessp instead of equalp predicates
> > in general.  equalp can always be infered as (not (or (lessp a b)
> > (lessp b a)))
>
> That works only in cases where you have a well-defined total
> ordering on equivalence classes, such that exactly one of the
> following is true for any two items a and b:
> (< (class a) (class b))
> (= (class a) (class b))
> (> (class a) (class b))
> Consider for example this simple case where you do *not* have
> such a total ordering: Consider sets (in set theory), using
> set-equality (same elements) as the equality predicate.
> What sort of total ordering can you possibly define for equivalence
> classes of sets per set-equality?

Sets aren't the problem, actually.  If you have a total ordering for
the elements of the sets, then you can sort the elements of each set
into that order; then you can compare the sets lexicographically, as
if they were sequences.  This is basically how my FSet set-theoretic
collections library works.

I actually went to the trouble of handling unequal-but-equivalent
elements.  FSet gives you a way of saying that two values are unequal
even though neither compares less than the other.  When that happens,
FSet falls back from using balanced binary trees to just putting the
elements in question into a list.

Exactly how useful a feature this is in real life remains to be seen,
but one example of where it could be useful is this.  Suppose you have
a type that you want to put into FSet sets, and for whatever reason,
the way you want to write the ordering function is to compute an
integer hash code for each object and then compare the hash codes.
(Maybe, if you cache the hash codes, this would be a lot faster than
doing it some other way.)  Provided you give FSet a way to know that
two objects of your type are distinct even though they hash to the
same value -- there is, as I say, an interface for doing that -- then
occasional hash collisions will not cause FSet to malfunction.  And as
long as the collisions really are rare, the performance impact will be
negligible.

http://common-lisp.net/project/fset/

-- Scott
From: Brian
Subject: Re: sort unique
Date: 
Message-ID: <81633df0-5a81-4fdc-9d22-a7b71a048106@a23g2000hsc.googlegroups.com>
REMOVE-DUPLICATES on SBCL can be slow, but what is stopping you from
doing something like:
(defun dd (sequence &key sort (equal #'eql))
  "A possably faster remove-duplicates"
  (declare (optimize speed)
	   (function equal))
  (setf sequence (sort (copy-list sequence) sort))
  (let (ans)
    (dolist (e sequence)
      (if (null ans)
	  (push e ans)
	  (unless (funcall equal e (first ans))
	    (push e ans))))
    (nreverse ans)))

On a list of 10000 random integers (at least), something like
(delete-duplicates (sort (copy-list <list>) #'>) :test #'=) is:
  2.196 seconds of real time
  2.176136 seconds of user run time
  0.004 seconds of system run time
  0 calls to %EVAL
  0 page faults and
  81,920 bytes consed.
... and (dd <list> :sort #'> :equal #'=) is:
  0.01 seconds of real time
  0.012001 seconds of user run time
  0.0 seconds of system run time
  0 calls to %EVAL
  0 page faults and
  127,048 bytes consed.
From: William James
Subject: Re: sort unique
Date: 
Message-ID: <f82985d2-86fe-4a1f-b083-0ef3b88602a0@e67g2000hsa.googlegroups.com>
Brian wrote:
> REMOVE-DUPLICATES on SBCL can be slow, but what is stopping you from
> doing something like:
> (defun dd (sequence &key sort (equal #'eql))
>   "A possibly faster remove-duplicates"
>   (declare (optimize speed)
> 	   (function equal))
>   (setf sequence (sort (copy-list sequence) sort))
>   (let (ans)
>     (dolist (e sequence)
>       (if (null ans)
> 	  (push e ans)
> 	  (unless (funcall equal e (first ans))
> 	    (push e ans))))
>     (nreverse ans)))
>
> On a list of 10000 random integers (at least), something like
> (delete-duplicates (sort (copy-list <list>) #'>) :test #'=) is:
>   2.196 seconds of real time
>   2.176136 seconds of user run time
>   0.004 seconds of system run time
>   0 calls to %EVAL
>   0 page faults and
>   81,920 bytes consed.
> ... and (dd <list> :sort #'> :equal #'=) is:
>   0.01 seconds of real time
>   0.012001 seconds of user run time
>   0.0 seconds of system run time
>   0 calls to %EVAL
>   0 page faults and
>   127,048 bytes consed.

Ruby:

a = Array.new(10_000){ rand( 5999 ) }
t = Time.now
u = a.uniq
puts "#{u.size} unique integers in < #{Time.now-t} seconds."

--- output ---
4888 unique integers in < 0.031 seconds.

This is on a slow (800MHz) laptop.
From: Alex Mizrahi
Subject: Re: sort unique
Date: 
Message-ID: <47ef44da$0$90264$14726298@news.sunsite.dk>
 WJ> Ruby:

 WJ> a = Array.new(10_000){ rand( 5999 ) }
 WJ> t = Time.now
 WJ> u = a.uniq
 WJ> puts "#{u.size} unique integers in < #{Time.now-t} seconds."

 WJ> --- output ---
 WJ> 4888 unique integers in < 0.031 seconds.

CL-USER> (let ((list (loop repeat 10000 collect (random 5999))))
    (length (time (remove-duplicates list))))
Evaluation took:
  0.003 seconds of real time
  0.0 seconds of user run time
  0.004 seconds of system run time
  0 calls to %EVAL
  0 page faults and
  307,512 bytes consed.
4872

it appears delete-duplicates is not optimized the way remove-duplicates 
is, although destructive operations are generally faster..

CL-USER> (let ((list (loop repeat 10000 collect (random 5999))))
    (length (time (delete-duplicates list))))
Evaluation took:
  0.71 seconds of real time
  0.712044 seconds of user run time
  0.0 seconds of system run time
  0 calls to %EVAL
  0 page faults and
  0 bytes consed.


but there is no consing with delete-duplicates, so i guess it's optimized 
for space rather than for time..
From: Marco Antoniotti
Subject: Re: sort unique
Date: 
Message-ID: <a2446264-94ab-45ef-b058-29ea621e3982@8g2000hsu.googlegroups.com>
On Mar 30, 9:44 am, "Alex Mizrahi" <········@users.sourceforge.net>
wrote:
>  WJ> Ruby:
>
>  WJ> a = Array.new(10_000){ rand( 5999 ) }
>  WJ> t = Time.now
>  WJ> u = a.uniq
>  WJ> puts "#{u.size} unique integers in < #{Time.now-t} seconds."
>
>  WJ> --- output ---
>  WJ> 4888 unique integers in < 0.031 seconds.
>
> CL-USER> (let ((list (loop repeat 10000 collect (random 5999))))
>     (length (time (remove-duplicates list))))
> Evaluation took:
>   0.003 seconds of real time
>   0.0 seconds of user run time
>   0.004 seconds of system run time
>   0 calls to %EVAL
>   0 page faults and
>   307,512 bytes consed.
> 4872
>
> it appears delete-duplicates is not optimized as remove-duplicates, while
> generally destructive operations are faster..
>
> CL-USER> (let ((list (loop repeat 10000 collect (random 5999))))
>     (length (time (delete-duplicates list))))
> Evaluation took:
>   0.71 seconds of real time
>   0.712044 seconds of user run time
>   0.0 seconds of system run time
>   0 calls to %EVAL
>   0 page faults and
>   0 bytes consed.
>
> but there is no consing with delete-duplicates, so i guess it's optimized
> for space rather than for time..

FWIW, in general on a MacBook Pro Intel, LW is about 3 times faster
than Ruby with the code above.

Cheers
--
Marco