From: Thaddeus L Olczyk
Subject: Faster remove-duplicates with sorted list.
Date: 
Message-ID: <3c6c1dab.139781203@nntp.interaccess.com>
Is there a fast way to remove duplicate elements in a sorted list
where the equality function is compatible with the sorting function?
( Meaning that if the sort function says  x=>y and y=>x then x=y
and that if not (x=>y and y<=x) then not (x=y). )

I believe that remove-duplicates has to assume that the list is not
sorted. Therefore it has to compare every pair of elements
( ~N^2 ), whereas a side-by-side walk of a sorted list is about 2*N.

The only problem I have is excessive consing which significantly slows
down the algorithm.

From: Marc Spitzer
Subject: Re: Faster remove-duplicates with sorted list.
Date: 
Message-ID: <slrna6gcja.2bqv.marc@oscar.eng.cv.net>
In article <··················@nntp.interaccess.com>, Thaddeus L Olczyk wrote:
> Is there a fast way to remove duplicate elements in a sorted list
> where the function sorting is compatible with the "sorting" function?
> ( Meaning that if the sort function says  x=>y and y=>x then x=y
> and that if not (x=>y and y<=x) then not (x=y). )
> 
> I believe that remove-duplicates has to assume that the list is not
> sorted. Therefore it has to compare each element, element by element
> ( ~N^2 ). Whereas a side by side goes like 2*N.
> 
> The only problem I have is excessive consing which significantly slows
> down the algorithm.

First, how much duplication is there?

One quick way to do this is to make each element of the list a hash
key and then pull the keys back out as a unique sublist of your original
list.  I cannot tell you how to do that yet, but it can be done. And here
is some rough code that should be close if the list is sorted (anybody feel
free to fix):

(defun strip-dups (first rest)
  (cond ((null rest)
         (list first))
        ((eql first (car rest)) ; insert proper test
         (strip-dups first (cdr rest)))
        (t
         (cons first (strip-dups (car rest) (cdr rest))))))

You could also use an iterative structure, but I have not gotten that
far in the book.
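Something like this might be the iterative shape (a rough sketch, again
with eql standing in for the proper test):

```lisp
(defun strip-dups-iter (list)
  ;; Walk the list, keeping an element only when it differs from its
  ;; successor.  This works because equal elements are adjacent in a
  ;; sorted list.
  (loop for (element . rest) on list
        unless (and rest (eql element (first rest))) ; insert proper test
        collect element))
```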

marc
From: Thomas A. Russ
Subject: Re: Faster remove-duplicates with sorted list.
Date: 
Message-ID: <ymilmdzfp11.fsf@sevak.isi.edu>
····@oscar.eng.cv.net (Marc Spitzer) writes:
> one quick way to do this is to make each element of the list a hash
> key and then pull the keys as a uniq sublist of your orignal list.  I
> cannot tell you how to do yet but it can be done.

(defun fast-remove-duplicates (list)
  (let ((ht (make-hash-table :size (length list))))
    (loop for element in list
          unless (gethash element ht)
          do (setf (gethash element ht) t)
          and collect element)))
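Collecting elements as they are first seen (instead of pulling the hash
keys back out afterwards, whose iteration order is unspecified) also
preserves the original order of the list:

```lisp
;; The first occurrence of each element is kept, in input order:
(fast-remove-duplicates '(3 1 3 2 1)) ; => (3 1 2)
```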

-- 
Thomas A. Russ,  USC/Information Sciences Institute          ···@isi.edu    
From: Wade Humeniuk
Subject: Re: Faster remove-duplicates with sorted list.
Date: 
Message-ID: <a4cg53$p0p$1@news3.cadvision.com>
>
> (defun strip-dups (first rest)
>   (cond ((null rest)
>          first)
>         ((eql first (car rest)) ; insert proper test
>          (strip-dup first (cdr rest)))
>         ( t
>           (cons first (strip-dup (car rest) (cdr rest))))))
>

(defun nremove-duplicates-sorted (list &key (test #'eql))
  (declare (type list list))
  (when list
    (loop with place = list do
          (cond
           ((null (cdr place)) (return-from nremove-duplicates-sorted list))
           ((funcall test (car place) (cadr place))
            (setf (cdr place) (cddr place)))
           (t (setf place (cdr place)))))))

CL-USER 12 > (nremove-duplicates-sorted '(1 2 2 3 3 3 4 5 5 5 5))
(1 2 3 4 5)

CL-USER 13 > (nremove-duplicates-sorted nil)
NIL

CL-USER 14 > (nremove-duplicates-sorted '(1 2 3 4 5 6 7 8))
(1 2 3 4 5 6 7 8)

CL-USER 15 > (nremove-duplicates-sorted '(1 1 1 1 1 1 1 1 1 1 1 1 1 1 1))
(1)

CL-USER 17 > (time (nremove-duplicates-sorted '(1 2 2 3 3 3 4 5 5 5 5)))

0.0 seconds used.
Standard Allocation 0 bytes.
Fixlen Allocation 0 bytes.
(1 2 3 4 5)

No consing.

Wade
From: Barry Margolin
Subject: Re: Faster remove-duplicates with sorted list.
Date: 
Message-ID: <7vX98.26$g01.145682@burlma1-snr2>
In article <··················@nntp.interaccess.com>,
Thaddeus L Olczyk <······@interaccess.com> wrote:
>Is there a fast way to remove duplicate elements in a sorted list
>where the function sorting is compatible with the "sorting" function?
>( Meaning that if the sort function says  x=>y and y=>x then x=y
>and that if not (x=>y and y<=x) then not (x=y). )
>
>I believe that remove-duplicates has to assume that the list is not
>sorted. Therefore it has to compare each element, element by element
>( ~N^2 ). Whereas a side by side goes like 2*N.
>
>The only problem I have is excessive consing which significantly slows
>down the algorithm.

Untested:

(defun remove-duplicates-sorted (list)
  (loop for element in list
	and last-element = '(nil) then element
    unless (eql element last-element)
      collect element))

-- 
Barry Margolin, ······@genuity.net
Genuity, Woburn, MA
*** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.
From: Jeff Greif
Subject: Re: Faster remove-duplicates with sorted list.
Date: 
Message-ID: <Gz0a8.7205$DC2.3524080@typhoon1.we.ipsvc.net>
Using a hashtable also works, as long as the :test function of
remove-duplicates is one of the legal hashtable equality functions.
Something like this (untested) is close (but among other things, it
doesn't use test-not if present):

(defun my-remove-duplicates (seq
      &key test start end from-end test-not &rest args)
  (if (or (null test) (memq test '(#'equal #'eq #'equalp #'eql)))
     (let ((tab (make-hash-table :size (* 2 (length seq)) :test (or test
#'eql))))
        ;; handle the start/end/from-end keywords
        (apply #'map #'(lambda (x) (setf (gethash x tab) x)) seq args)
        (coerce (type-of seq)
           (loop for k being the hash-keys of tab collect k)))
   (apply #'remove-duplicates seq args)))

"Thaddeus L Olczyk" <······@interaccess.com> wrote in message
·······················@nntp.interaccess.com...
> Is there a fast way to remove duplicate elements in a sorted list
> where the function sorting is compatible with the "sorting" function?
> ( Meaning that if the sort function says  x=>y and y=>x then x=y
> and that if not (x=>y and y<=x) then not (x=y). )
>
> I believe that remove-duplicates has to assume that the list is not
> sorted. Therefore it has to compare each element, element by element
> ( ~N^2 ). Whereas a side by side goes like 2*N.
>
> The only problem I have is excessive consing which significantly slows
> down the algorithm.
From: Jeff Greif
Subject: Re: Faster remove-duplicates with sorted list.
Date: 
Message-ID: <RD1a8.7524$DC2.3579267@typhoon1.we.ipsvc.net>
Whoops.  How creatively I invented new arguments for standard functions
and incorporated other errors.
This version is better, but still doesn't handle test-not:

(defun my-remove-duplicates (seq &rest args
                             &key test start end from-end test-not)
  (declare (ignore test-not))
  (if (or (null test) (member test (list #'equal #'eq #'equalp #'eql)
                              :test #'eql))
      (let ((tab (make-hash-table :size (* 2 (length seq))
                                  :test (or test #'eql))))
        (let ((seq-to-process
               (if (or start end) (subseq seq start end) seq)))
          (if (not from-end)
              (setf seq-to-process (reverse seq-to-process)))
          (if (listp seq-to-process)
              (loop for x in seq-to-process do
                    (setf (gethash x tab) x))
              (loop for x across seq-to-process do
                    (setf (gethash x tab) x)))
          (let ((result (loop for k being the hash-keys of tab collect k)))
            (if (not (listp seq))
                (coerce result (type-of seq))
                result))))
      (apply #'remove-duplicates seq args)))
From: Jeff Greif
Subject: Re: Faster remove-duplicates with sorted list.
Date: 
Message-ID: <DL1a8.7562$DC2.3585195@typhoon1.we.ipsvc.net>
and the hashing version runs about 100x faster than regular
remove-duplicates on a list of 10000 random integers generated using
(random 10000) in a loop (in Corman Lisp 1.5)

Jeff
From: Thaddeus L Olczyk
Subject: Re: Faster remove-duplicates with sorted list.
Date: 
Message-ID: <3c68de80.189146109@nntp.interaccess.com>
On Tue, 12 Feb 2002 05:30:11 GMT, "Jeff Greif"
<······@spam-me-not.alumni.princeton.edu> wrote:

>and the hashing version runs about 100x faster than regular
>remove-duplicates on a list of 10000 random integers generated using
>(random 10000) in a loop (in Corman Lisp 1.5)
>
>Jeff
>
If your algorithm runs only 100 times faster on a 10000 element set,
then your algorithm sucks.
From: Raymond Wiker
Subject: Re: Faster remove-duplicates with sorted list.
Date: 
Message-ID: <86y9hzrqmx.fsf@raw.grenland.fast.no>
······@interaccess.com (Thaddeus L Olczyk) writes:

> On Tue, 12 Feb 2002 05:30:11 GMT, "Jeff Greif"
> <······@spam-me-not.alumni.princeton.edu> wrote:
> 
> >and the hashing version runs about 100x faster than regular
> >remove-duplicates on a list of 10000 random integers generated using
> >(random 10000) in a loop (in Corman Lisp 1.5)
> >
> >Jeff
> >
> If your algorithm runs only 100 times faster an a 10000 element set,
> then your algorithm sucks.

        That's a _phenomenally_ stupid statement. Back in the kill-file
with you...

-- 
Raymond Wiker                        Mail:  ·············@fast.no
Senior Software Engineer             Web:   http://www.fast.no/
Fast Search & Transfer ASA           Phone: +47 23 01 11 60
P.O. Box 1677 Vika                   Fax:   +47 35 54 87 99
NO-0120 Oslo, NORWAY                 Mob:   +47 48 01 11 60

Try FAST Search: http://alltheweb.com/
From: Tim Bradshaw
Subject: Re: Faster remove-duplicates with sorted list.
Date: 
Message-ID: <ey3aduff0m4.fsf@tfeb.org>
* Thaddeus L Olczyk wrote:

> If your algorithm runs only 100 times faster an a 10000 element set,
> then your algorithm sucks.

Erm.  The algorithm REMOVE-DUPLICATES uses isn't specified, so you
have absolutely no basis at all for this statement, do you?

--tim
From: Dr. Edmund Weitz
Subject: Re: Faster remove-duplicates with sorted list.
Date: 
Message-ID: <m3k7tj9c2s.fsf@bird.agharta.de>
······@interaccess.com (Thaddeus L Olczyk) writes:

> If your algorithm runs only 100 times faster an a 10000 element set,
> then your algorithm sucks.


                      ______
                     /      \
                   .' PLEASE `.
                   |  DO NOT  |      _____
                   | FEED THE |    ,'.....`.
                   `. TROLLS ,'  ,'........ )
                     \_    _/   |........ ,'
                       |  |     `. .... _/
                       |  |      ,'.,'-'
                       |  |     /../
                       |  |   ,'.,'
                       |  |  /../
                     . |  | /..'
                   .\_\|  |/_/,
                   ___ |  | ___
                     . `--' .
                      .    .


(shamelessly stolen from Gareth McCaughan)

-- 

Dr. Edmund Weitz
Hamburg
Germany

The Common Lisp Cookbook
<http://agharta.de/cookbook/>
From: David Hanley
Subject: Re: Faster remove-duplicates with sorted list.
Date: 
Message-ID: <281e82b2.0202121707.132306a4@posting.google.com>
······@interaccess.com (Thaddeus L Olczyk) wrote in message news:<··················@nntp.interaccess.com>...
> Is there a fast way to remove duplicate elements in a sorted list
> where the function sorting is compatible with the "sorting" function?

The sorting function has no equality test, so no.

If you have an = function then it's obviously very easy.

(defun r-d-s (list equal-test)
  (let ((res (list (first list))))
    (loop for element in (rest list) do
      (unless (funcall equal-test element (first res))
        (push element res)))
    (nreverse res)))


dave
From: Pierre R. Mai
Subject: Re: Faster remove-duplicates with sorted list.
Date: 
Message-ID: <87aduer778.fsf@orion.bln.pmsf.de>
······@interaccess.com (Thaddeus L Olczyk) writes:

> Is there a fast way to remove duplicate elements in a sorted list
> where the function sorting is compatible with the "sorting" function?
> ( Meaning that if the sort function says  x=>y and y=>x then x=y
> and that if not (x=>y and y<=x) then not (x=y). )

This                        ^^^^ should be y=>x, else it is redundant
and insufficient for the following.  Furthermore you need to ensure
that your compare relation constitutes a total ordering, so that there
are no "uncomparable" elements.  In that case:

Your precondition ensures that "equal" (according to "=") items will
be sorted consecutively, without unequal elements intervening in the
sorted list:  Regardless of stability issues, if x!=y, then either
x=>y or y=>x is false (because of totality, both cannot be false, so one
is true), hence y is sorted either before or after all z, for which
x=z holds.

Hence you can use a simple O(n) list walk, using O(1) working space
(the last seen element).  Since remove-duplicates is specified not to
destroy the original list, you will of course use up to O(n) space for
the result list.  If you can destroy the original list (i.e. you
really want to have delete-duplicates), then you don't need that
additional storage.
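A sketch of that destructive walk (essentially Wade's
nremove-duplicates-sorted elsewhere in this thread), splicing out equal
successors in place:

```lisp
(defun delete-duplicates-sorted (list &key (test #'eql))
  ;; Destructive O(n) walk with O(1) working space: whenever the current
  ;; cons and its successor hold "equal" elements, splice the successor
  ;; out of the list; otherwise advance.
  (loop with tail = list
        while (cdr tail)
        do (if (funcall test (car tail) (cadr tail))
               (setf (cdr tail) (cddr tail))
               (setf tail (cdr tail))))
  list)
```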

> I believe that remove-duplicates has to assume that the list is not
> sorted. Therefore it has to compare each element, element by element
> ( ~N^2 ). Whereas a side by side goes like 2*N.
> 
> The only problem I have is excessive consing which significantly slows
> down the algorithm.

(defun uniquify-sorted-list (list &key (key #'identity) (test #'eql))
  (loop for element in list
        for element-key = (funcall key element)
        and last-element-key = (load-time-value (gensym))
        then element-key
        unless (funcall test element-key last-element-key)
        collect element))

Regs, Pierre.

-- 
Pierre R. Mai <····@acm.org>                    http://www.pmsf.de/pmai/
 The most likely way for the world to be destroyed, most experts agree,
 is by accident. That's where we come in; we're computer professionals.
 We cause accidents.                           -- Nathaniel Borenstein