At the risk of reopening a debate long thought dead, I'd like to
gather comments on the split-sequence or partition function.
I've taken the last version that I could find of Arthur Lemmens'
SPLIT, which I like (in Message-ID: <·················@simplex.nl>),
and would like to suggest the following changes:
* naming the function PARTITION rather than SPLIT.
* altering the behaviour of the :from-end keyword argument to
return the subsequences in original order, for consistency with
CL:REMOVE, CL:SUBSTITUTE et al. (:from-end being non-NIL only
affects the answer if :count is less than the number of
subsequences, by analogy with the above-referenced functions).
* changing the :maximum keyword argument to :count, by analogy
with CL:REMOVE, CL:SUBSTITUTE, and so on.
* adding PARTITION-IF and PARTITION-IF-NOT.
I realise that this is a little far-fetched, but should there be a new
CL standard I hope that something of this nature could be added to it;
failing that, a community standard might be a good thing. I realise
also that many people will already have some similar functionality in
their lisp image, and obviously I don't expect everyone to jump for
joy.
Here's an attempt at a specification for PARTITION and friends; I'm
afraid I'm not very experienced at writing standardese, so this
probably isn't as precise as it needs to be:
partition delimiter sequence &key :count :keep-empty-sub�
seqs :from-end :start :end :test :test-not :key
partition-if predicate sequence &key :count :keep-empty-
subseqs :from-end :start :end :key
partition-if-not predicate sequence &key :count :keep-
empty-subseqs :from-end :start :end :key
The primary return value is a list of sequences of the
same kind as the argument sequence that has elements con�
sisting of subsequences of the argument sequence that were
delimited in the argument by elements satisfying the test.
The second value is a sequence of the same type as the
argument sequence containing a copy of the unprocessed
portion of the argument.
The :count argument, if supplied, limits the number of
subsequences in the first return value; if more than
:count delimited subsequences exist in sequence, the
:count leftmost delimited subsequences will be in order in
the first return value, and the second return value will
contain a copy of the unprocessed remainder of the initial
sequence.
If :from-end is non-null, the argument sequence is concep�
tually processed from right to left, accumulating the sub�
sequences in reverse order; :from-end only makes a differ�
ence in the case of a non-null :count argument. In the
presence of :from-end, the :count rightmost delimited sub�
sequences will be in the order that they are in the
sequence argument in the first return value, and the
second as before contains a copy of the unprocessed
remainder of the initial sequence.
If :keep-empty-subseqs is non-null, then empty subse�
quences will be included in the result.
In all cases, the subsequences in the first return value
will be in the order that they appeared in the original
sequence.
A reference implementation for this specification is at
<URL:http://www-jcsu.jesus.cam.ac.uk/~csr21/partition.lisp>.
All comments welcome,
Cheers,
Christophe
--
Jesus College, Cambridge, CB5 8BL +44 1223 524 842
http://www-jcsu.jesus.cam.ac.uk/~csr21/ (defun pling-dollar
(str schar arg) (first (last +))) (make-dispatch-macro-character #\! t)
(set-dispatch-macro-character #\! #\$ #'pling-dollar)
The "-if-not" functions in the standard are deprecated. The proper usage du
jour is to use the functional #'complement to achieve the same thing. As
such, I'd not include your partition-if-not function.
"Christophe Rhodes" <·····@cam.ac.uk> wrote in message
···················@lambda.jesus.cam.ac.uk...
> A reference implementation for this specification is at
> <URL:http://www-jcsu.jesus.cam.ac.uk/~csr21/partition.lisp>.
>
> All comments welcome,
>
> Cheers,
>
> Christophe
> --
> Jesus College, Cambridge, CB5 8BL +44 1223 524
842
> http://www-jcsu.jesus.cam.ac.uk/~csr21/ (defun
pling-dollar
> (str schar arg) (first (last +))) (make-dispatch-macro-character #\! t)
> (set-dispatch-macro-character #\! #\$ #'pling-dollar)
"Frank A. Adrian" <·······@uswest.net> writes:
> The "-if-not" functions in the standard are deprecated. The proper usage du
> jour is to use the functional #'complement to achieve the same thing. As
> such, I'd not include your partition-if-not function.
While they were deprecated in the same action[1] that deprecated the
semantically problematic :test-not arguments, the -if-not functions
never exhibited any sematic problem, and hence it seems that maybe
X3J13 was overcome by a sudden attack of creeping schemism. Since
by eliminating e.g. remove-if-not, at least one very common idiom for
filtering a sequence would be removed from the language, just in order
to satisfy some minimality requirement, which when taken to its
logical conclusion would remove UNLESS in favor of WHEN, etc.
The last few times this was discussed on c.l.l (which probably is the
main area of discourse for "community standards" nowadays), most
participants seemed to agree that deprecation of the "-if-not"
functions was premature, and should be reversed at some later stage.
At the very least I'd therefore recommend to include PARTITION-IF-NOT
in this proposal, maybe "deprecating" it at the same time, to defer on
this decission to some time in the future, when the status of all
-IF-NOT functions is to be rediscussed in some forum.
Regs, Pierre.
Footnotes:
[1] See Issue TEST-NOT-IF-NOT:FLUSH-ALL in the HyperSpec for details.
--
Pierre R. Mai <····@acm.org> http://www.pmsf.de/pmai/
The most likely way for the world to be destroyed, most experts agree,
is by accident. That's where we come in; we're computer professionals.
We cause accidents. -- Nathaniel Borenstein
"Pierre R. Mai" <····@acm.org> writes:
> "Frank A. Adrian" <·······@uswest.net> writes:
> > The "-if-not" functions in the standard are deprecated. [...]
>
> While they were deprecated in the same action[1] that deprecated the
> semantically problematic :test-not arguments, the -if-not functions
> never exhibited any sematic problem, and hence it seems that maybe
> X3J13 was overcome by a sudden attack of creeping schemism. Since
> by eliminating e.g. remove-if-not, at least one very common idiom for
> filtering a sequence would be removed from the language, just in order
> to satisfy some minimality requirement, which when taken to its
> logical conclusion would remove UNLESS in favor of WHEN, etc.
>
> The last few times this was discussed on c.l.l (which probably is the
> main area of discourse for "community standards" nowadays), most
> participants seemed to agree that deprecation of the "-if-not"
> functions was premature, and should be reversed at some later stage.
That's certainly what I'd like to see (though I'm no longer part of the
official process). And the base standard isn't likely to change, so we'll
have to live with this "confusion" for a long time, I bet.
> At the very least I'd therefore recommend to include PARTITION-IF-NOT
> in this proposal [...]
I concur.
> maybe "deprecating" it at the same time, to defer on
> this decission to some time in the future, when the status of all
> -IF-NOT functions is to be rediscussed in some forum.
Personally I wouldn't bother with this part.
Of course, I wish all vendors optimized COMPLEMENT correctly, but I can just
imagine some that don't and the -IF-NOT version can avoid both some consing
and probably some execution inefficiency in some cases.
> Footnotes:
> [1] See Issue TEST-NOT-IF-NOT:FLUSH-ALL in the HyperSpec for details.
Christophe Rhodes <·····@cam.ac.uk> writes:
> The second value is a sequence of the same type as the
> argument sequence containing a copy of the unprocessed
> portion of the argument.
Need it be a copy? If we allow the second return value to share
structure with the input, the function could potentially be more
efficient - makes sense for long inputs (especially lists, where
copying involves a lot of pointer chasing), especially if the user is
not interested in the secondary value anyway.
-dan
--
http://ww.telent.net/cliki/ - Link farm for free CL-on-Unix resources
Christophe Rhodes <·····@cam.ac.uk> writes:
> The second value is a sequence of the same type as the
> argument sequence containing a copy of the unprocessed
> portion of the argument.
>
> The :count argument, if supplied, limits the number of
> subsequences in the first return value; if more than
> :count delimited subsequences exist in sequence, the
> :count leftmost delimited subsequences will be in order in
> the first return value, and the second return value will
> contain a copy of the unprocessed remainder of the initial
> sequence.
Might it not be better to return a bounding index as the second value,
in a manner similar to parse-integer or read-from-string? This will
leave it to the caller to decide what to do with the unprocessed
remainder.
Regs, Pierre.
--
Pierre R. Mai <····@acm.org> http://www.pmsf.de/pmai/
The most likely way for the world to be destroyed, most experts agree,
is by accident. That's where we come in; we're computer professionals.
We cause accidents. -- Nathaniel Borenstein
"Pierre R. Mai" <····@acm.org> writes:
> Christophe Rhodes <·····@cam.ac.uk> writes:
>
> > The second value is a sequence of the same type as the
> > argument sequence containing a copy of the unprocessed
> > portion of the argument.
> >
> > The :count argument, if supplied, limits the number of
> > subsequences in the first return value; if more than
> > :count delimited subsequences exist in sequence, the
> > :count leftmost delimited subsequences will be in order in
> > the first return value, and the second return value will
> > contain a copy of the unprocessed remainder of the initial
> > sequence.
>
> Might it not be better to return a bounding index as the second value,
> in a manner similar to parse-integer or read-from-string? This will
> leave it to the caller to decide what to do with the unprocessed
> remainder.
Quite possibly. I'm not entirely sure how this would interact with
:from-end, though.
Actually, in its current state it's underspecified, in that I'm not
convinced I would understand what should happen with
(partition #\; "foo;bar;baz" :start 1 :end 9)
from reading the `specification' (should the second return value be
"faz"?)
Cheers,
Christophe
--
Jesus College, Cambridge, CB5 8BL +44 1223 524 842
http://www-jcsu.jesus.cam.ac.uk/~csr21/ (defun pling-dollar
(str schar arg) (first (last +))) (make-dispatch-macro-character #\! t)
(set-dispatch-macro-character #\! #\$ #'pling-dollar)
Christophe Rhodes <·····@cam.ac.uk> writes:
> "Pierre R. Mai" <····@acm.org> writes:
> > Might it not be better to return a bounding index as the second value,
>
> Quite possibly. I'm not entirely sure how this would interact with
> :from-end, though.
It returns the index of the leftmost character it _did_ process, I'd
guess. Something you can use as the second value of a SUBSEQ call to
pick up the unprocessed segment.
> Actually, in its current state it's underspecified, in that I'm not
> convinced I would understand what should happen with
>
> (partition #\; "foo;bar;baz" :start 1 :end 9)
>
> from reading the `specification' (should the second return value be
> "faz"?)
I wouldn't want it to be (yes, that's not the question). I agree that
I dont know if "az" or "" is the right answer, though.
I like Pierre's bounding index idea, personally; that makes both this issue
and my problem with copying sequences go away.
-dan
--
http://ww.telent.net/cliki/ - Link farm for free CL-on-Unix resources
Daniel Barlow <···@telent.net> writes:
> Christophe Rhodes <·····@cam.ac.uk> writes:
>
> > "Pierre R. Mai" <····@acm.org> writes:
> > > Might it not be better to return a bounding index as the second value,
> >
> > Quite possibly. I'm not entirely sure how this would interact with
> > :from-end, though.
>
> It returns the index of the leftmost character it _did_ process, I'd
> guess. Something you can use as the second value of a SUBSEQ call to
> pick up the unprocessed segment.
>
> > Actually, in its current state it's underspecified, in that I'm not
> > convinced I would understand what should happen with
> >
> > (partition #\; "foo;bar;baz" :start 1 :end 9)
> >
> > from reading the `specification' (should the second return value be
> > "faz"?)
>
> I wouldn't want it to be (yes, that's not the question). I agree that
> I dont know if "az" or "" is the right answer, though.
>
> I like Pierre's bounding index idea, personally; that makes both this issue
> and my problem with copying sequences go away.
I don't think that it makes this problem go away (though I agree that
it makes the copying question irrelevant); there is (in CL:REMOVE and
friends) a difference between
(remove 'a '(a b c a d e f a) :start 2 :end 7) => (A B C D E F A)
and
(remove 'a (subseq '(a b c a d e f a) 2 7)) => (C D E F)
Yurgh, this is getting more complicated the more I think about it[1].
Conceivably
(partition #\; "foo;bar;baz" :start 1 :end 9)
should actually return
("foo" "bar" "baz")
as its primary value, and
(partition #\; (subseq "foo;bar;baz" 1 9))
should return ("oo" "bar" "b"); further
(partition #\; ";oo;bar;baz" :start 1 :end 9)
should return (";oo" "bar" "baz").
I think that this behaviour would enable a consistent index to be
returned as the second return value in a similar way to Pierre's
suggestion.
Cheers,
Christophe
[1] `Doctor, it hurts when I do this.' `Don't do that, then.'
--
Jesus College, Cambridge, CB5 8BL +44 1223 524 842
http://www-jcsu.jesus.cam.ac.uk/~csr21/ (defun pling-dollar
(str schar arg) (first (last +))) (make-dispatch-macro-character #\! t)
(set-dispatch-macro-character #\! #\$ #'pling-dollar)
Christophe Rhodes <·····@cam.ac.uk> writes:
> At the risk of reopening a debate long thought dead, I'd like to
> gather comments on the split-sequence or partition function.
>
> I've taken the last version that I could find of Arthur Lemmens'
> SPLIT, which I like (in Message-ID: <·················@simplex.nl>),
> and would like to suggest the following changes:
>
> * naming the function PARTITION rather than SPLIT.
>
> * altering the behaviour of the :from-end keyword argument to
> return the subsequences in original order, for consistency with
> CL:REMOVE, CL:SUBSTITUTE et al. (:from-end being non-NIL only
> affects the answer if :count is less than the number of
> subsequences, by analogy with the above-referenced functions).
>
> * changing the :maximum keyword argument to :count, by analogy
> with CL:REMOVE, CL:SUBSTITUTE, and so on.
>
> * adding PARTITION-IF and PARTITION-IF-NOT.
In the light of the comments I've had so far, I'd like to propose the
following extra changes:
* changing the behaviour of the :start and :end arguments to be
more consistent with CL:REMOVE, meaning that
(partition delimiter sequence :start start :end end)
is no longer equivalent to
(partiton delimiter (subseq sequence start end).
* The second return value is now an index rather than a copy of a
portion of the sequence; this index is the `right' one to feed to
CL:SUBSEQ for continued processing.
This makes the current specification:
partition delimiter sequence &key :count :keep-empty-sub-
seqs :from-end :start :end :test :test-not :key
partition-if predicate sequence &key :count :keep-empty-
subseqs :from-end :start :end :key
partition-if-not predicate sequence &key :count :keep-
empty-subseqs :from-end :start :end :key
The primary return value is a list of sequences of the
same kind as the argument sequence that has elements con-
sisting of subsequences of the argument sequence that were
delimited in the argument by elements satisfying the test.
The second value is an index into the sequence indicating
the unprocessed region, suitable as an argument to subseq
to continue processing in the same manner if desired.
The :count argument, if supplied, limits the number of
subsequences in the first return value; if more than
:count delimited subsequences exist in sequence, the
:count leftmost delimited subsequences will be in order in
the first return value, and the second return value will
be the index into the sequence at which processing
stopped.
If :from-end is non-null, the argument sequence is concep-
tually processed from right to left, accumulating the sub-
sequences in reverse order; :from-end only makes a differ-
ence in the case of a non-null :count argument. In the
presence of :from-end, the :count rightmost delimited sub-
sequences will be in the order that they are in the
sequence argument in the first return value, and the
second is the index indicating the end of the unprocessed
region.
If :keep-empty-subseqs is non-null, then empty subse-
quences will be included in the result.
In all cases, the subsequences in the first return value
will be in the order that they appeared in the original
sequence.
Improvements to the language of this specification are welcome; I
_hope_ that the intent is both clear and useful, and that the
behaviour of these functions is now consistent with the rest of CL.
A reference implementation for this updated specification can be
obtained at
<URL:http://www-jcsu.jesus.cam.ac.uk/~csr21/partition2.lisp>.
Thanks,
Christophe
--
Jesus College, Cambridge, CB5 8BL +44 1223 524 842
http://www-jcsu.jesus.cam.ac.uk/~csr21/ (defun pling-dollar
(str schar arg) (first (last +))) (make-dispatch-macro-character #\! t)
(set-dispatch-macro-character #\! #\$ #'pling-dollar)
Christophe Rhodes <·····@cam.ac.uk> writes:
> In the light of the comments I've had so far, I'd like to propose the
> following extra changes:
>
> * changing the behaviour of the :start and :end arguments to be
> more consistent with CL:REMOVE, meaning that
> (partition delimiter sequence :start start :end end)
> is no longer equivalent to
> (partiton delimiter (subseq sequence start end).
I don't think that REMOVE is the right function to take as a
guide-line here: REMOVE "modifies" the subsequence of the argument
passed in, but returns the whole argument passed in (modulo copying).
Partition creates a list of completely new subsequences, which bear
only an indirect relation to the argument passed in. It seems both
counter-intuitive and restricting that
(partition #\; "foo;bar;baz" :start 2 :end 9)
and
(partition #\; "foo;bar;baz")
return the same values, namely ("foo" "bar" "baz") and 11.
As it stands the only use of the start and end bounding indices is to
prevent PARTITION from recognizing the delimiter outside of them:
(partition #\; "foo;bar;baz" :start 4 :end 9)
=> ("foo;bar" "baz"), 11
I'd suggest that PARTITION behave like this:
(partition #\; "foo;bar;baz" :start 2 :end 9)
=> ("o" "bar" "b"), 9
This enables one important use that the other version doesn't allow,
namely that of processing part of a large buffer in-place, without
consing a new subsequence, that is then immediately discarded.
The original kind of behaviour (which I consider to be very rarely
useful[1]) can be re-added on top of the new version, whereas the
contrary isn't possible. And if you consider that PARTITION would be
used to parse largish CSV files from a buffer, the unnecessary consing
entailed by SUBSEQ can easily run into 100s of bytes per line, which
on the files we process daily (>= 80000 lines) can easily run into MBs
of unnecessary consing, with no way out, besides rewriting PARTITION.
While the GC costs are likely small on an ephemeral GC, the copying
costs aren't.
Regs, Pierre.
Footnotes:
[1] I'd guess that treating the part outside the bounding indices
completely specially, i.e. not merging them with the first and
last partition respectively to be much more commonly useful.
--
Pierre R. Mai <····@acm.org> http://www.pmsf.de/pmai/
The most likely way for the world to be destroyed, most experts agree,
is by accident. That's where we come in; we're computer professionals.
We cause accidents. -- Nathaniel Borenstein
"Pierre R. Mai" <····@acm.org> writes:
> Christophe Rhodes <·····@cam.ac.uk> writes:
>
> > In the light of the comments I've had so far, I'd like to propose the
> > following extra changes:
> >
> > * changing the behaviour of the :start and :end arguments to be
> > more consistent with CL:REMOVE, meaning that
> > (partition delimiter sequence :start start :end end)
> > is no longer equivalent to
> > (partiton delimiter (subseq sequence start end).
>
> [snip]
>
> I'd suggest that PARTITION behave like this:
>
> (partition #\; "foo;bar;baz" :start 2 :end 9)
>
> => ("o" "bar" "b"), 9
>
> This enables one important use that the other version doesn't allow,
> namely that of processing part of a large buffer in-place, without
> consing a new subsequence, that is then immediately discarded.
>
> The original kind of behaviour (which I consider to be very rarely
> useful[1]) can be re-added on top of the new version, whereas the
> contrary isn't possible. And if you consider that PARTITION would be
> used to parse largish CSV files from a buffer, the unnecessary consing
> entailed by SUBSEQ can easily run into 100s of bytes per line, which
> on the files we process daily (>= 80000 lines) can easily run into MBs
> of unnecessary consing, with no way out, besides rewriting PARTITION.
> While the GC costs are likely small on an ephemeral GC, the copying
> costs aren't.
I'm willing to buy this argument, and since no-one else seems to care
one way or the other (or else they're all implicitly agreeing with
you) I shall put out a version 3 shortly.
Cheers,
Christophe
--
Jesus College, Cambridge, CB5 8BL +44 1223 524 842
http://www-jcsu.jesus.cam.ac.uk/~csr21/ (defun pling-dollar
(str schar arg) (first (last +))) (make-dispatch-macro-character #\! t)
(set-dispatch-macro-character #\! #\$ #'pling-dollar)