split-sequence/partition

From: Christophe Rhodes
Subject: split-sequence/partition
Date: Thu, 07 Jun 2001 13:44:06 +0000
Message-ID: <sqr8ww2ymx.fsf@lambda.jesus.cam.ac.uk>

At the risk of reopening a debate long thought dead, I'd like to
gather comments on the split-sequence or partition function.

I've taken the last version that I could find of Arthur Lemmens'
SPLIT, which I like (in Message-ID: <·················@simplex.nl>),
and would like to suggest the following changes:

* naming the function PARTITION rather than SPLIT.

* altering the behaviour of the :from-end keyword argument to
  return the subsequences in original order, for consistency with
  CL:REMOVE, CL:SUBSTITUTE et al. (:from-end being non-NIL only
  affects the answer if :count is less than the number of
  subsequences, by analogy with the above-referenced functions).
  
* changing the :maximum keyword argument to :count, by analogy
  with CL:REMOVE, CL:SUBSTITUTE, and so on.

* adding PARTITION-IF and PARTITION-IF-NOT.

I realise that this is a little far-fetched, but should there be a new
CL standard I hope that something of this nature could be added to it;
failing that, a community standard might be a good thing. I realise
also that many people will already have some similar functionality in
their lisp image, and obviously I don't expect everyone to jump for
joy.

Here's an attempt at a specification for PARTITION and friends; I'm
afraid I'm not very experienced at writing standardese, so this
probably isn't as precise as it needs to be:

       partition  delimiter sequence &key :count :keep-empty-sub�
       seqs :from-end :start :end :test :test-not :key

       partition-if predicate sequence &key  :count  :keep-empty-
       subseqs :from-end :start :end :key

       partition-if-not  predicate  sequence  &key  :count :keep-
       empty-subseqs :from-end :start :end :key

       The primary return value is a list  of  sequences  of  the
       same  kind as the argument sequence that has elements con�
       sisting of subsequences of the argument sequence that were
       delimited in the argument by elements satisfying the test.
       The second value is a sequence of the  same  type  as  the
       argument  sequence  containing  a  copy of the unprocessed
       portion of the argument.

       The :count argument, if supplied,  limits  the  number  of
       subsequences  in  the  first  return  value;  if more than
       :count  delimited  subsequences  exist  in  sequence,  the
       :count leftmost delimited subsequences will be in order in
       the first return value, and the second return  value  will
       contain a copy of the unprocessed remainder of the initial
       sequence.

       If :from-end is non-null, the argument sequence is concep�
       tually processed from right to left, accumulating the sub�
       sequences in reverse order; :from-end only makes a differ�
       ence  in  the  case  of a non-null :count argument. In the
       presence of :from-end, the :count rightmost delimited sub�
       sequences  will  be  in  the  order  that  they are in the
       sequence argument in  the  first  return  value,  and  the
       second  as  before  contains  a  copy  of  the unprocessed
       remainder of the initial sequence.

       If :keep-empty-subseqs  is  non-null,  then  empty  subse�
       quences will be included in the result.

       In  all  cases, the subsequences in the first return value
       will be in the order that they appeared  in  the  original
       sequence.

A reference implementation for this specification is at
<URL:http://www-jcsu.jesus.cam.ac.uk/~csr21/partition.lisp>.

All comments welcome,

Cheers,

Christophe
-- 
Jesus College, Cambridge, CB5 8BL                           +44 1223 524 842
http://www-jcsu.jesus.cam.ac.uk/~csr21/                  (defun pling-dollar 
(str schar arg) (first (last +))) (make-dispatch-macro-character #\! t)
(set-dispatch-macro-character #\! #\$ #'pling-dollar)

Re: split-sequence/partition Frank A. Adrian
- Re: split-sequence/partition Pierre R. Mai
  - Re: split-sequence/partition Kent M Pitman
Re: split-sequence/partition Daniel Barlow
Re: split-sequence/partition Pierre R. Mai
- Re: split-sequence/partition Christophe Rhodes
  - Re: split-sequence/partition Daniel Barlow
    - Re: split-sequence/partition Christophe Rhodes
Re: split-sequence/partition Christophe Rhodes
- Re: split-sequence/partition Pierre R. Mai
  - Re: split-sequence/partition Christophe Rhodes

From: Frank A. Adrian
Subject: Re: split-sequence/partition
Date: Thu, 07 Jun 2001 14:20:28 +0000
Message-ID: <M4MT6.15$XC6.57451@news.uswest.net>

The "-if-not" functions in the standard are deprecated.  The proper usage du
jour is to use the functional #'complement to achieve the same thing.  As
such, I'd not include your partition-if-not function.

"Christophe Rhodes" <·····@cam.ac.uk> wrote in message
···················@lambda.jesus.cam.ac.uk...
> A reference implementation for this specification is at
> <URL:http://www-jcsu.jesus.cam.ac.uk/~csr21/partition.lisp>.
>
> All comments welcome,
>
> Cheers,
>
> Christophe
> --
> Jesus College, Cambridge, CB5 8BL                           +44 1223 524
842
> http://www-jcsu.jesus.cam.ac.uk/~csr21/                  (defun
pling-dollar
> (str schar arg) (first (last +))) (make-dispatch-macro-character #\! t)
> (set-dispatch-macro-character #\! #\$ #'pling-dollar)

From: Pierre R. Mai
Subject: Re: split-sequence/partition
Date: Thu, 07 Jun 2001 15:25:46 +0000
Message-ID: <87elswuxad.fsf@orion.bln.pmsf.de>

"Frank A. Adrian" <·······@uswest.net> writes:

> The "-if-not" functions in the standard are deprecated.  The proper usage du
> jour is to use the functional #'complement to achieve the same thing.  As
> such, I'd not include your partition-if-not function.

While they were deprecated in the same action[1] that deprecated the
semantically problematic :test-not arguments, the -if-not functions
never exhibited any sematic problem, and hence it seems that maybe
X3J13 was overcome by a sudden attack of creeping schemism.  Since
by eliminating e.g. remove-if-not, at least one very common idiom for
filtering a sequence would be removed from the language, just in order
to satisfy some minimality requirement, which when taken to its
logical conclusion would remove UNLESS in favor of WHEN, etc.

The last few times this was discussed on c.l.l (which probably is the
main area of discourse for "community standards" nowadays), most
participants seemed to agree that deprecation of the "-if-not"
functions was premature, and should be reversed at some later stage.

At the very least I'd therefore recommend to include PARTITION-IF-NOT
in this proposal, maybe "deprecating" it at the same time, to defer on
this decission to some time in the future, when the status of all
-IF-NOT functions is to be rediscussed in some forum.

Regs, Pierre.

Footnotes: 
[1]  See Issue TEST-NOT-IF-NOT:FLUSH-ALL in the HyperSpec for details.

-- 
Pierre R. Mai <····@acm.org>                    http://www.pmsf.de/pmai/
 The most likely way for the world to be destroyed, most experts agree,
 is by accident. That's where we come in; we're computer professionals.
 We cause accidents.                           -- Nathaniel Borenstein

From: Kent M Pitman
Subject: Re: split-sequence/partition
Date: Thu, 07 Jun 2001 18:42:43 +0000
Message-ID: <sfwiti86sik.fsf@world.std.com>

"Pierre R. Mai" <····@acm.org> writes:

> "Frank A. Adrian" <·······@uswest.net> writes:
> > The "-if-not" functions in the standard are deprecated. [...]
> 
> While they were deprecated in the same action[1] that deprecated the
> semantically problematic :test-not arguments, the -if-not functions
> never exhibited any sematic problem, and hence it seems that maybe
> X3J13 was overcome by a sudden attack of creeping schemism.  Since
> by eliminating e.g. remove-if-not, at least one very common idiom for
> filtering a sequence would be removed from the language, just in order
> to satisfy some minimality requirement, which when taken to its
> logical conclusion would remove UNLESS in favor of WHEN, etc.
> 
> The last few times this was discussed on c.l.l (which probably is the
> main area of discourse for "community standards" nowadays), most
> participants seemed to agree that deprecation of the "-if-not"
> functions was premature, and should be reversed at some later stage.

That's certainly what I'd like to see (though I'm no longer part of the 
official process).  And the base standard isn't likely to change, so we'll
have to live with this "confusion" for a long time, I bet.
 
> At the very least I'd therefore recommend to include PARTITION-IF-NOT
> in this proposal [...]

I concur.

> maybe "deprecating" it at the same time, to defer on
> this decission to some time in the future, when the status of all
> -IF-NOT functions is to be rediscussed in some forum.

Personally I wouldn't bother with this part.

Of course, I wish all vendors optimized COMPLEMENT correctly, but I can just
imagine some that don't and the -IF-NOT version can avoid both some consing 
and probably some execution inefficiency in some cases.

> Footnotes: 
> [1]  See Issue TEST-NOT-IF-NOT:FLUSH-ALL in the HyperSpec for details.

From: Daniel Barlow
Subject: Re: split-sequence/partition
Date: Thu, 07 Jun 2001 15:03:45 +0000
Message-ID: <87hexs9vse.fsf@noetbook.telent.net>

Christophe Rhodes <·····@cam.ac.uk> writes:

>        The second value is a sequence of the  same  type  as  the
>        argument  sequence  containing  a  copy of the unprocessed
>        portion of the argument.

Need it be a copy?  If we allow the second return value to share
structure with the input, the function could potentially be more
efficient - makes sense for long inputs (especially lists, where
copying involves a lot of pointer chasing), especially if the user is
not interested in the secondary value anyway.

-dan

-- 

  http://ww.telent.net/cliki/ - Link farm for free CL-on-Unix resources

From: Pierre R. Mai
Subject: Re: split-sequence/partition
Date: Thu, 07 Jun 2001 17:09:55 +0000
Message-ID: <87r8wwtdwc.fsf@orion.bln.pmsf.de>

Christophe Rhodes <·····@cam.ac.uk> writes:

>        The second value is a sequence of the  same  type  as  the
>        argument  sequence  containing  a  copy of the unprocessed
>        portion of the argument.
> 
>        The :count argument, if supplied,  limits  the  number  of
>        subsequences  in  the  first  return  value;  if more than
>        :count  delimited  subsequences  exist  in  sequence,  the
>        :count leftmost delimited subsequences will be in order in
>        the first return value, and the second return  value  will
>        contain a copy of the unprocessed remainder of the initial
>        sequence.

Might it not be better to return a bounding index as the second value,
in a manner similar to parse-integer or read-from-string?  This will
leave it to the caller to decide what to do with the unprocessed
remainder.

Regs, Pierre.

-- 
Pierre R. Mai <····@acm.org>                    http://www.pmsf.de/pmai/
 The most likely way for the world to be destroyed, most experts agree,
 is by accident. That's where we come in; we're computer professionals.
 We cause accidents.                           -- Nathaniel Borenstein

From: Christophe Rhodes
Subject: Re: split-sequence/partition
Date: Thu, 07 Jun 2001 18:17:22 +0000
Message-ID: <sqlmn42lzh.fsf@lambda.jesus.cam.ac.uk>

"Pierre R. Mai" <····@acm.org> writes:

> Christophe Rhodes <·····@cam.ac.uk> writes:
> 
> >        The second value is a sequence of the  same  type  as  the
> >        argument  sequence  containing  a  copy of the unprocessed
> >        portion of the argument.
> > 
> >        The :count argument, if supplied,  limits  the  number  of
> >        subsequences  in  the  first  return  value;  if more than
> >        :count  delimited  subsequences  exist  in  sequence,  the
> >        :count leftmost delimited subsequences will be in order in
> >        the first return value, and the second return  value  will
> >        contain a copy of the unprocessed remainder of the initial
> >        sequence.
> 
> Might it not be better to return a bounding index as the second value,
> in a manner similar to parse-integer or read-from-string?  This will
> leave it to the caller to decide what to do with the unprocessed
> remainder.

Quite possibly. I'm not entirely sure how this would interact with
:from-end, though.

Actually, in its current state it's underspecified, in that I'm not
convinced I would understand what should happen with

(partition #\; "foo;bar;baz" :start 1 :end 9)

from reading the `specification' (should the second return value be
"faz"?)

Cheers,

Christophe
-- 
Jesus College, Cambridge, CB5 8BL                           +44 1223 524 842
http://www-jcsu.jesus.cam.ac.uk/~csr21/                  (defun pling-dollar 
(str schar arg) (first (last +))) (make-dispatch-macro-character #\! t)
(set-dispatch-macro-character #\! #\$ #'pling-dollar)

From: Daniel Barlow
Subject: Re: split-sequence/partition
Date: Thu, 07 Jun 2001 18:40:55 +0000
Message-ID: <87ae3k9lqg.fsf@noetbook.telent.net>

Christophe Rhodes <·····@cam.ac.uk> writes:

> "Pierre R. Mai" <····@acm.org> writes:
> > Might it not be better to return a bounding index as the second value,
> 
> Quite possibly. I'm not entirely sure how this would interact with
> :from-end, though.

It returns the index of the leftmost character it _did_ process, I'd
guess.  Something you can use as the second value of a SUBSEQ call to
pick up the unprocessed segment.

> Actually, in its current state it's underspecified, in that I'm not
> convinced I would understand what should happen with
> 
> (partition #\; "foo;bar;baz" :start 1 :end 9)
> 
> from reading the `specification' (should the second return value be
> "faz"?)

I wouldn't want it to be (yes, that's not the question).  I agree that
I dont know if "az" or "" is the right answer, though.  

I like Pierre's bounding index idea, personally; that makes both this issue
and my problem with copying sequences go away.

-dan

-- 

  http://ww.telent.net/cliki/ - Link farm for free CL-on-Unix resources

From: Christophe Rhodes
Subject: Re: split-sequence/partition
Date: Fri, 08 Jun 2001 08:29:48 +0000
Message-ID: <sqg0db2x37.fsf@lambda.jesus.cam.ac.uk>

Daniel Barlow <···@telent.net> writes:

> Christophe Rhodes <·····@cam.ac.uk> writes:
> 
> > "Pierre R. Mai" <····@acm.org> writes:
> > > Might it not be better to return a bounding index as the second value,
> > 
> > Quite possibly. I'm not entirely sure how this would interact with
> > :from-end, though.
> 
> It returns the index of the leftmost character it _did_ process, I'd
> guess.  Something you can use as the second value of a SUBSEQ call to
> pick up the unprocessed segment.
> 
> > Actually, in its current state it's underspecified, in that I'm not
> > convinced I would understand what should happen with
> > 
> > (partition #\; "foo;bar;baz" :start 1 :end 9)
> > 
> > from reading the `specification' (should the second return value be
> > "faz"?)
> 
> I wouldn't want it to be (yes, that's not the question).  I agree that
> I dont know if "az" or "" is the right answer, though.  
> 
> I like Pierre's bounding index idea, personally; that makes both this issue
> and my problem with copying sequences go away.

I don't think that it makes this problem go away (though I agree that
it makes the copying question irrelevant); there is (in CL:REMOVE and
friends) a difference between

(remove 'a '(a b c a d e f a) :start 2 :end 7) => (A B C D E F A)

and

(remove 'a (subseq '(a b c a d e f a) 2 7)) => (C D E F)

Yurgh, this is getting more complicated the more I think about it[1].

Conceivably

(partition #\; "foo;bar;baz" :start 1 :end 9) 
should actually return
("foo" "bar" "baz") 
as its primary value, and 
(partition #\; (subseq "foo;bar;baz" 1 9)) 
should return ("oo" "bar" "b"); further
(partition #\; ";oo;bar;baz" :start 1 :end 9)
should return (";oo" "bar" "baz").

I think that this behaviour would enable a consistent index to be
returned as the second return value in a similar way to Pierre's
suggestion.

Cheers,

Christophe

[1] `Doctor, it hurts when I do this.' `Don't do that, then.'
-- 
Jesus College, Cambridge, CB5 8BL                           +44 1223 524 842
http://www-jcsu.jesus.cam.ac.uk/~csr21/                  (defun pling-dollar 
(str schar arg) (first (last +))) (make-dispatch-macro-character #\! t)
(set-dispatch-macro-character #\! #\$ #'pling-dollar)

From: Christophe Rhodes
Subject: Re: split-sequence/partition
Date: Sat, 09 Jun 2001 21:31:43 +0000
Message-ID: <sq7kylibls.fsf@lambda.jesus.cam.ac.uk>

Christophe Rhodes <·····@cam.ac.uk> writes:

> At the risk of reopening a debate long thought dead, I'd like to
> gather comments on the split-sequence or partition function.
> 
> I've taken the last version that I could find of Arthur Lemmens'
> SPLIT, which I like (in Message-ID: <·················@simplex.nl>),
> and would like to suggest the following changes:
> 
> * naming the function PARTITION rather than SPLIT.
> 
> * altering the behaviour of the :from-end keyword argument to
>   return the subsequences in original order, for consistency with
>   CL:REMOVE, CL:SUBSTITUTE et al. (:from-end being non-NIL only
>   affects the answer if :count is less than the number of
>   subsequences, by analogy with the above-referenced functions).
>   
> * changing the :maximum keyword argument to :count, by analogy
>   with CL:REMOVE, CL:SUBSTITUTE, and so on.
> 
> * adding PARTITION-IF and PARTITION-IF-NOT.

In the light of the comments I've had so far, I'd like to propose the
following extra changes:

* changing the behaviour of the :start and :end arguments to be
  more consistent with CL:REMOVE, meaning that
  (partition delimiter sequence :start start :end end)
  is no longer equivalent to
  (partiton delimiter (subseq sequence start end).

* The second return value is now an index rather than a copy of a
  portion of the sequence; this index is the `right' one to feed to
  CL:SUBSEQ for continued processing.

This makes the current specification:

       partition  delimiter sequence &key :count :keep-empty-sub-
       seqs :from-end :start :end :test :test-not :key

       partition-if predicate sequence &key  :count  :keep-empty-
       subseqs :from-end :start :end :key

       partition-if-not  predicate  sequence  &key  :count :keep-
       empty-subseqs :from-end :start :end :key

       The primary return value is a list  of  sequences  of  the
       same  kind as the argument sequence that has elements con-
       sisting of subsequences of the argument sequence that were
       delimited in the argument by elements satisfying the test.
       The second value is an index into the sequence  indicating
       the  unprocessed region, suitable as an argument to subseq
       to continue processing in the same manner if desired.

       The :count argument, if supplied,  limits  the  number  of
       subsequences  in  the  first  return  value;  if more than
       :count  delimited  subsequences  exist  in  sequence,  the
       :count leftmost delimited subsequences will be in order in
       the first return value, and the second return  value  will
       be  the  index  into  the  sequence  at  which  processing
       stopped.

       If :from-end is non-null, the argument sequence is concep-
       tually processed from right to left, accumulating the sub-
       sequences in reverse order; :from-end only makes a differ-
       ence  in  the  case  of a non-null :count argument. In the
       presence of :from-end, the :count rightmost delimited sub-
       sequences  will  be  in  the  order  that  they are in the
       sequence argument in  the  first  return  value,  and  the
       second  is the index indicating the end of the unprocessed
       region.

       If :keep-empty-subseqs  is  non-null,  then  empty  subse-
       quences will be included in the result.

       In  all  cases, the subsequences in the first return value
       will be in the order that they appeared  in  the  original
       sequence.

Improvements to the language of this specification are welcome; I
_hope_ that the intent is both clear and useful, and that the
behaviour of these functions is now consistent with the rest of CL.

A reference implementation for this updated specification can be
obtained at
<URL:http://www-jcsu.jesus.cam.ac.uk/~csr21/partition2.lisp>.

Thanks,

Christophe
-- 
Jesus College, Cambridge, CB5 8BL                           +44 1223 524 842
http://www-jcsu.jesus.cam.ac.uk/~csr21/                  (defun pling-dollar 
(str schar arg) (first (last +))) (make-dispatch-macro-character #\! t)
(set-dispatch-macro-character #\! #\$ #'pling-dollar)

From: Pierre R. Mai
Subject: Re: split-sequence/partition
Date: Sun, 10 Jun 2001 11:40:43 +0000
Message-ID: <878zj0sguc.fsf@orion.bln.pmsf.de>

Christophe Rhodes <·····@cam.ac.uk> writes:

> In the light of the comments I've had so far, I'd like to propose the
> following extra changes:
> 
> * changing the behaviour of the :start and :end arguments to be
>   more consistent with CL:REMOVE, meaning that
>   (partition delimiter sequence :start start :end end)
>   is no longer equivalent to
>   (partiton delimiter (subseq sequence start end).

I don't think that REMOVE is the right function to take as a
guide-line here:  REMOVE "modifies" the subsequence of the argument
passed in, but returns the whole argument passed in (modulo copying).

Partition creates a list of completely new subsequences, which bear
only an indirect relation to the argument passed in.  It seems both
counter-intuitive and restricting that

(partition #\; "foo;bar;baz" :start 2 :end 9)

and 

(partition #\; "foo;bar;baz")

return the same values, namely ("foo" "bar" "baz") and 11.

As it stands the only use of the start and end bounding indices is to
prevent PARTITION from recognizing the delimiter outside of them:

(partition #\; "foo;bar;baz" :start 4 :end 9)

=> ("foo;bar" "baz"), 11

I'd suggest that PARTITION behave like this:

(partition #\; "foo;bar;baz" :start 2 :end 9)

=> ("o" "bar" "b"), 9

This enables one important use that the other version doesn't allow,
namely that of processing part of a large buffer in-place, without
consing a new subsequence, that is then immediately discarded.

The original kind of behaviour (which I consider to be very rarely
useful[1]) can be re-added on top of the new version, whereas the
contrary isn't possible.  And if you consider that PARTITION would be
used to parse largish CSV files from a buffer, the unnecessary consing
entailed by SUBSEQ can easily run into 100s of bytes per line, which
on the files we process daily (>= 80000 lines) can easily run into MBs
of unnecessary consing, with no way out, besides rewriting PARTITION.
While the GC costs are likely small on an ephemeral GC, the copying
costs aren't.

Regs, Pierre.

Footnotes: 
[1]  I'd guess that treating the part outside the bounding indices
     completely specially, i.e. not merging them with the first and
     last partition respectively to be much more commonly useful.

-- 
Pierre R. Mai <····@acm.org>                    http://www.pmsf.de/pmai/
 The most likely way for the world to be destroyed, most experts agree,
 is by accident. That's where we come in; we're computer professionals.
 We cause accidents.                           -- Nathaniel Borenstein

From: Christophe Rhodes
Subject: Re: split-sequence/partition
Date: Fri, 15 Jun 2001 09:57:48 +0000
Message-ID: <sqelsmawv7.fsf@lambda.jesus.cam.ac.uk>

"Pierre R. Mai" <····@acm.org> writes:

> Christophe Rhodes <·····@cam.ac.uk> writes:
> 
> > In the light of the comments I've had so far, I'd like to propose the
> > following extra changes:
> > 
> > * changing the behaviour of the :start and :end arguments to be
> >   more consistent with CL:REMOVE, meaning that
> >   (partition delimiter sequence :start start :end end)
> >   is no longer equivalent to
> >   (partiton delimiter (subseq sequence start end).
> 
> [snip]
>
> I'd suggest that PARTITION behave like this:
> 
> (partition #\; "foo;bar;baz" :start 2 :end 9)
> 
> => ("o" "bar" "b"), 9
> 
> This enables one important use that the other version doesn't allow,
> namely that of processing part of a large buffer in-place, without
> consing a new subsequence, that is then immediately discarded.
> 
> The original kind of behaviour (which I consider to be very rarely
> useful[1]) can be re-added on top of the new version, whereas the
> contrary isn't possible.  And if you consider that PARTITION would be
> used to parse largish CSV files from a buffer, the unnecessary consing
> entailed by SUBSEQ can easily run into 100s of bytes per line, which
> on the files we process daily (>= 80000 lines) can easily run into MBs
> of unnecessary consing, with no way out, besides rewriting PARTITION.
> While the GC costs are likely small on an ephemeral GC, the copying
> costs aren't.

I'm willing to buy this argument, and since no-one else seems to care
one way or the other (or else they're all implicitly agreeing with
you) I shall put out a version 3 shortly.

Cheers,

Christophe
-- 
Jesus College, Cambridge, CB5 8BL                           +44 1223 524 842
http://www-jcsu.jesus.cam.ac.uk/~csr21/                  (defun pling-dollar 
(str schar arg) (first (last +))) (make-dispatch-macro-character #\! t)
(set-dispatch-macro-character #\! #\$ #'pling-dollar)