Splitting sequence at object

From: Florian Weimer
Subject: Splitting sequence at object
Date: Wed, 19 Feb 2003 20:53:49 +0000
Message-ID: <87y94cuh42.fsf@deneb.enyo.de>

How can I use the LOOP facility (or something else which is quite
Lisp-like) to split a sequence at a specific object?  For example, I'd
like to translate "A.B.C.D" to ("A" "B" "C" "D").

Re: Splitting sequence at object Sam Steingold
Re: Splitting sequence at object Edi Weitz
- Re: Splitting sequence at object Florian Weimer
  - Re: Splitting sequence at object Thomas F. Burdick
    - Re: Splitting sequence at object Larry Hunter
      - Re: Splitting sequence at object Joe Marshall
        Re: Splitting sequence at object Larry Hunter
        Re: Splitting sequence at object Thomas F. Burdick
        Re: Splitting sequence at object Larry Hunter
        Re: Splitting sequence at object Duane Rettig
        Re: Splitting sequence at object Joe Marshall
        Re: Splitting sequence at object Thomas F. Burdick
        Re: Splitting sequence at object Joe Marshall
        Re: Splitting sequence at object Thomas F. Burdick
        Re: Splitting sequence at object Joe Marshall
        Re: Splitting sequence at object Joerg Hoehle
        Re: Splitting sequence at object Joe Marshall
        Re: Splitting sequence at object Joe Marshall
Re: Splitting sequence at object K.Khawi
- Re: Splitting sequence at object Matthew Danish

From: Sam Steingold
Subject: Re: Splitting sequence at object
Date: Wed, 19 Feb 2003 21:37:29 +0000
Message-ID: <m3adgsrlxd.fsf@loiso.podval.org>

> * In message <··············@deneb.enyo.de>
> * On the subject of "Splitting sequence at object"
> * Sent on Wed, 19 Feb 2003 21:53:49 +0100
> * Honorable Florian Weimer <··@deneb.enyo.de> writes:
>
> How can I use the LOOP facility (or something else which is quite
> Lisp-like) to split a sequence at a specific object?  For example, I'd
> like to translate "A.B.C.D" to ("A" "B" "C" "D").

<http://clisp.cons.org/impnotes/modules.html#regexp>
<http://clisp.cons.org/impnotes/modules.html#re-with-loop-split>

-- 
Sam Steingold (http://www.podval.org/~sds) running RedHat8 GNU/Linux
<http://www.camera.org> <http://www.iris.org.il> <http://www.memri.org/>
<http://www.mideasttruth.com/> <http://www.palestine-central.com/links.html>
As a computer, I find your faith in technology amusing.

From: Edi Weitz
Subject: Re: Splitting sequence at object
Date: Wed, 19 Feb 2003 21:03:18 +0000
Message-ID: <87n0ksoueh.fsf@duke.agharta.de>

Florian Weimer <··@deneb.enyo.de> writes:

> How can I use the LOOP facility (or something else which is quite
> Lisp-like) to split a sequence at a specific object?  For example,
> I'd like to translate "A.B.C.D" to ("A" "B" "C" "D").

  <http://www.cliki.net/SPLIT-SEQUENCE>

or - if your sequence is a string - use one of the regex
facilities. Shameless self-plug:

  <http://weitz.de/cl-ppcre/#split>

Edi.

From: Florian Weimer
Subject: Re: Splitting sequence at object
Date: Sat, 01 Mar 2003 14:21:48 +0000
Message-ID: <87bs0v181v.fsf@deneb.enyo.de>

Edi Weitz <···@agharta.de> writes:

> Florian Weimer <··@deneb.enyo.de> writes:
>
>> How can I use the LOOP facility (or something else which is quite
>> Lisp-like) to split a sequence at a specific object?  For example,
>> I'd like to translate "A.B.C.D" to ("A" "B" "C" "D").
>
>   <http://www.cliki.net/SPLIT-SEQUENCE>
>
> or - if your sequence is a string - use one of the regex
> facilities. Shameless self-plug:
>
>   <http://weitz.de/cl-ppcre/#split>

Thanks.  Even Debian packages exist.  Nice. 8-)

From: Thomas F. Burdick
Subject: Re: Splitting sequence at object
Date: Sun, 02 Mar 2003 00:39:54 +0000
Message-ID: <xcvk7fi8uud.fsf@conquest.OCF.Berkeley.EDU>

Florian Weimer <··@deneb.enyo.de> writes:

> Edi Weitz <···@agharta.de> writes:
> 
> > Florian Weimer <··@deneb.enyo.de> writes:
> >
> >> How can I use the LOOP facility (or something else which is quite
> >> Lisp-like) to split a sequence at a specific object?  For example,
> >> I'd like to translate "A.B.C.D" to ("A" "B" "C" "D").
> >
> >   <http://www.cliki.net/SPLIT-SEQUENCE>
> >
> > or - if your sequence is a string - use one of the regex
> > facilities. Shameless self-plug:
> >
> >   <http://weitz.de/cl-ppcre/#split>
> 
> Thanks.  Even Debian packages exist.  Nice. 8-)

FWIW, I think SPLIT-SEQUENCE takes the wrong approach by unnecessarily
consing up strings.  I use a different utility:

<http://groups.google.com/groups?selm=xcvvghhpw95.fsf%40conquest.OCF.Berkeley.EDU>

-- 
           /|_     .-----------------------.                        
         ,'  .\  / | No to Imperialist war |                        
     ,--'    _,'   | Wage class war!       |                        
    /       /      `-----------------------'                        
   (   -.  |                               
   |     ) |                               
  (`-.  '--.)                              
   `. )----'

From: Larry Hunter
Subject: Re: Splitting sequence at object
Date: Tue, 04 Mar 2003 05:27:31 +0000
Message-ID: <m3y93vzoos.fsf@huge.uchsc.edu>

Thoms Burdick said:

  FWIW, I think SPLIT-SEQUENCE takes the wrong approach by unnecessarily
  consing up strings.  

A reasonable complaint, so I made a version of my LEX-STRING function
(called NLEX-STRING) which returns displaced vectors and doesn't cons.
The only change is the call to MAKE-ARRAY replaces the SUBSEQ.

Larry

(defconstant *whitespace-characters*
  '(#\space #\tab #\page #\newline #\backspace #\return #\linefeed)
  "The whitespace characters")

(defun nlex-string (string &optional (whitespace-chars *whitespace-characters*))
    "Separates a string at whitespace, returning a list of strings (sharing structure)"
    (assert (and (stringp string) (every #'characterp whitespace-chars)))
    (flet ((whitespace-char? (char) (find char whitespace-chars :test #'char=)))
      (let ((tokens nil)
            (token-end nil))
        (do ((token-start
              (position-if (complement #'whitespace-char?) string) 
              (when token-end (position-if (complement #'whitespace-char?) string
                                           :start (1+ token-end)))))
             ((null token-start) (nreverse tokens))
          (setq token-end (when token-start (position-if #'whitespace-char? string
                                                             :start token-start)))
          (push (make-array (- (or token-end (length string)) token-start)
                            :element-type (array-element-type string)
                            :displaced-to string
                            :displaced-index-offset token-start)
                tokens)))))



-- 

Lawrence Hunter, Ph.D.
Director, Center for Computational Pharmacology
Associate Professor of Pharmacology, PMB & Computer Science

phone  +1 303 315 1094           UCHSC, Campus Box C236    
fax    +1 303 315 1098           School of Medicine rm 2817b   
cell   +1 303 324 0355           4200 E. 9th Ave.                 
email: ············@uchsc.edu    Denver, CO 80262       
PGP key on public keyservers     http://compbio.uchsc.edu/hunter

From: Joe Marshall
Subject: Re: Splitting sequence at object
Date: Tue, 04 Mar 2003 16:07:00 +0000
Message-ID: <of4r2k0r.fsf@ccs.neu.edu>

Larry Hunter <············@uchsc.edu> writes:

> Thoms Burdick said:
> 
>   FWIW, I think SPLIT-SEQUENCE takes the wrong approach by unnecessarily
>   consing up strings.  
> 
> A reasonable complaint, so I made a version of my LEX-STRING function
> (called NLEX-STRING) which returns displaced vectors and doesn't cons.

It calls MAKE-ARRAY, so that conses.  Admittedly, it may not cons as
much (depends on how much storage a displaced array consumes), but the
cost of double-indirection may outweigh the consing cost.

From: Larry Hunter
Subject: Re: Splitting sequence at object
Date: Tue, 04 Mar 2003 17:24:01 +0000
Message-ID: <m31y1ncafi.fsf@huge.uchsc.edu>

Joe Marshall said:

  It calls MAKE-ARRAY, so that conses. Admittedly, it may not cons as
  much (depends on how much storage a displaced array consumes), but
  the cost of double-indirection may outweigh the consing cost.

Geez, Joe the object of my assertion was "strings" not "anything."
Burdick complained that SPLIT-SEQUENCE conses _strings_; my version
doesn't cons any _new strings_. Of course, by the nature of the task
it has to cons _something_ in order to return a list of the split!

Do note that a quick test (in ACL 6.2) does show that Joe's concern
has some merit:

 cl-user(39): (time (nlex-string "this is a test of the lexer."))
 ; space allocation:
 ;  22 cons cells, 432 other bytes, 0 static bytes
 ("this" "is" "a" "test" "of" "the" "lexer.")
 cl-user(40): (time (lex-string "this is a test of the lexer."))
 ; space allocation:
 ;  8 cons cells, 168 other bytes, 0 static bytes
 ("this" "is" "a" "test" "of" "the" "lexer.")

Larry

-- 

Lawrence Hunter, Ph.D.
Director, Center for Computational Pharmacology
Associate Professor of Pharmacology, PMB & Computer Science

phone  +1 303 315 1094           UCHSC, Campus Box C236    
fax    +1 303 315 1098           School of Medicine rm 2817b   
cell   +1 303 324 0355           4200 E. 9th Ave.                 
email: ············@uchsc.edu    Denver, CO 80262       
PGP key on public keyservers     http://compbio.uchsc.edu/hunter

From: Thomas F. Burdick
Subject: Re: Splitting sequence at object
Date: Tue, 04 Mar 2003 23:55:34 +0000
Message-ID: <xcvk7feu1op.fsf@fallingrocks.OCF.Berkeley.EDU>

Larry Hunter <············@uchsc.edu> writes:

> Joe Marshall said:
> 
>   It calls MAKE-ARRAY, so that conses. Admittedly, it may not cons as
>   much (depends on how much storage a displaced array consumes), but
>   the cost of double-indirection may outweigh the consing cost.
> 
> Geez, Joe the object of my assertion was "strings" not "anything."
> Burdick complained that SPLIT-SEQUENCE conses _strings_; my version
> doesn't cons any _new strings_.

Well, to be pedantic, it does, as

  (every #'stringp (nlex-string ...))

would show, but it does avoid the pathological situation, where
SPLIT-SEQUENCE could generate *huge* amounts of garbage.  But...

> Of course, by the nature of the task it has to cons _something_ in
> order to return a list of the split!

Wel, no, that was the root of my objection.  I suspect, based on
looking at my own code, back when I used a SPLIT-VECTOR utility, that
most uses of something like SPLIT-SEQUENCE generate unneeded
intermediate results.  Often, these strings are generated, only to be
operated on and thrown away -- in most cases, a start and end index
into the original string would suffice.  You don't need to cons up a
list of these indexes, you can just iterate over them.  In cases where
you eventually need a list of results, it's best to delay constructing
that list as long as possible.

> Do note that a quick test (in ACL 6.2) does show that Joe's concern
> has some merit:
> 
>  cl-user(39): (time (nlex-string "this is a test of the lexer."))
>  ; space allocation:
>  ;  22 cons cells, 432 other bytes, 0 static bytes
>  ("this" "is" "a" "test" "of" "the" "lexer.")
>  cl-user(40): (time (lex-string "this is a test of the lexer."))
>  ; space allocation:
>  ;  8 cons cells, 168 other bytes, 0 static bytes
>  ("this" "is" "a" "test" "of" "the" "lexer.")

Well, in all fairness to your approach, that doesn't show it at its
best.  The average length of your resulting strings is about 3
characters.  Think about what information is reified in a displaced
array: start and end indexes, the base array, and an element type.
That's at least four pieces of information, which is probably a bad
way to encode 3-character strings.  But if your results could
potentially contain 1k strings, you will win big.

-- 
           /|_     .-----------------------.                        
         ,'  .\  / | No to Imperialist war |                        
     ,--'    _,'   | Wage class war!       |                        
    /       /      `-----------------------'                        
   (   -.  |                               
   |     ) |                               
  (`-.  '--.)                              
   `. )----'

From: Larry Hunter
Subject: Re: Splitting sequence at object
Date: Wed, 05 Mar 2003 01:09:30 +0000
Message-ID: <m3vfyybovp.fsf@huge.uchsc.edu>

Thomas Burdick says:

 Well, to be pedantic, it does, as
 
   (every #'stringp (nlex-string ...)) 

 would show. 

I'm sorry, I don't get this. Are you saying they are "new" strings
because each of the displaced arrays are strings in their own right?
Since they all share structure, I find it counter-intuitive to say
that we are "consing up new strings." Now there is no ANSI definition
of what "consing up a string" might be so the language lawyers are out
of luck. My intuition says that (unlike subseq) it's cheap enough so
that I'm never going to have to worry about how much garbage it
creates.

Thomas then goes on to say:

  that was the root of my objection. I suspect, based on looking at my
  own code, back when I used a SPLIT-VECTOR utility, that most uses of
  something like SPLIT-SEQUENCE generate unneeded intermediate
  results. Often, these strings are generated, only to be operated on
  and thrown away -- in most cases, a start and end index into the
  original string would suffice. 

OK, so we could do 

(defun lex-string-positions (string &optional (whitespace-chars *whitespace-characters*))
  "Separates a string at whitespace and returns a list of split positions"
  (assert (and (stringp string) (every #'characterp whitespace-chars)))
  (flet ((whitespace-char? (char) (find char whitespace-chars :test #'char=)))
    (let ((token-positions nil)
	  token-end)
      (do ((token-start
	    (position-if-not #'whitespace-char? string) 
	    (when token-end
	      (position-if-not #'whitespace-char? string :start (1+ token-end)))))
	  ((null token-start) (nreverse token-positions))
	(setq token-end
	      (when token-start
		(position-if #'whitespace-char? string :start token-start)))
	(push (cons token-start (or token-end (length string))) token-positions)))))

And then he says

  You don't need to cons up a list of these indexes, you can just
  iterate over them.

Picky, picky. How about this, which takes a function of three
arguments (string, start, end) and applies it to each
would-have-been-lexed item in the string. Is this garbage-light enough
for you?

(defun burdick-map (function string &optional (whitespace *whitespace*))
  (assert (and (stringp string) (every #'characterp whitespace)))
  (flet ((whitespace-char? (char) (find char whitespace :test #'char=)))
    (let ((results nil)
          token-end)      
      (do ((token-start
	    (position-if-not #'whitespace-char? string) 
	    (when token-end
	      (position-if-not #'whitespace-char? string 
                               :start (1+ token-end)))))
	  ((null token-start) (nreverse results))
	(setq token-end
	      (when token-start
		(position-if #'whitespace-char? string 
                             :start token-start)))
        (push (funcall function string token-start 
                       (or token-end (length string)))
              results)))))

By this point, I'm really wondering what that function you want to
burdick-map is going to be like. Unless it is going to be operating
only on the characters, you are going to end up creating the same
garbage in a different place. 

[And yes, I know that the displaced array solution becomes more
efficient than subseq as the substrings get larger. I offered my
example to show how Joe could be right sometimes.]

Larry



-- 

Lawrence Hunter, Ph.D.
Director, Center for Computational Pharmacology
Associate Professor of Pharmacology, PMB & Computer Science

phone  +1 303 315 1094           UCHSC, Campus Box C236    
fax    +1 303 315 1098           School of Medicine rm 2817b   
cell   +1 303 324 0355           4200 E. 9th Ave.                 
email: ············@uchsc.edu    Denver, CO 80262       
PGP key on public keyservers     http://compbio.uchsc.edu/hunter

From: Duane Rettig
Subject: Re: Splitting sequence at object
Date: Wed, 05 Mar 2003 06:32:11 +0000
Message-ID: <4llzus4r8.fsf@beta.franz.com>

Larry Hunter <············@uchsc.edu> writes:

> Thomas Burdick says:
> 
>  Well, to be pedantic, it does, as
>  
>    (every #'stringp (nlex-string ...)) 
> 
>  would show. 
> 
> I'm sorry, I don't get this. Are you saying they are "new" strings
> because each of the displaced arrays are strings in their own right?
> Since they all share structure, I find it counter-intuitive to say
> that we are "consing up new strings." Now there is no ANSI definition
> of what "consing up a string" might be so the language lawyers are out
> of luck. My intuition says that (unlike subseq) it's cheap enough so
> that I'm never going to have to worry about how much garbage it
> creates.

What is being consed up is the "array header" that comes with a displaced
array.  The underlying simple-vector that is the displaced-to array
remains the only simple-string and is not consed, but since these array
headers are one-dimensional arrays of element-type character, the are
still considered strings, though not simple-strings.

In a 32-bit Allegro CL an array-header is 24 bytes, and in a 64-bit lisp
it is 48 bytes.

Besides the space hit, there is an access hit as well; aref and its inverse
can't be quite as highly optimized as simple-strings.

One possible alternate approach you might consider, if you don't mind
an Allegro CL extension, is to use "pure" strings.  The advisability of
such an approach would depend on how likely these strings are to repeat.
If the liklihood is high, then the function excl:pure-string
(ttp://www.franz.com/support/documentation/6.2/doc/pages/operators/excl/pure-string.htm)
can be used to grab any string that has been created in the .pll (pure
lisp library) file.  This file is usually created automatically, but you
can create one of your own and throw as many strings as you like into
this file.  Then, when you build your lisp with this .pll specified,
you can then grab strings to your heart's content with all consing being
purely ephemeral.  Note the addresses of car's name and the string
gotten by pure-string:

CL-USER(1): (inspect 'car)
The symbol CAR @ #x71006487
  which is an EXTERNAL symbol in the COMMON-LISP package
   0 type ---------> Bit field: #x07
   1 flags --------> Bit field: #x24
   2 hash ---------> Bit field: #x4886
   3 value --------> ..unbound..
   4 package ------> The COMMON-LISP package
   5 function -----> #<Function CAR>
   6 name ---------> A simple-string (3) "CAR"
   7 plist --------> (EXCL::.ARGS. (1 . 1) ...), a proper list with 12 elements
[1i] CL-USER(2): :i 6
A simple-string (3) "CAR" @ #x40d7dac2
   0-> The character #\C [#x0043]
   1-> The character #\A [#x0041]
   2-> The character #\R [#x0052]
[1i] CL-USER(3): (inspect (pure-string "CAR"))
A simple-string (3) "CAR" @ #x40d7dac2
   0-> The character #\C [#x0043]
   1-> The character #\A [#x0041]
   2-> The character #\R [#x0052]
[2i] CL-USER(4): (simple-string-p *)
T
[2i] CL-USER(5): 

Note also that it is a simple-string, so accesses on it can be compiled
to very fast code.

-- 
Duane Rettig    ·····@franz.com    Franz Inc.  http://www.franz.com/
555 12th St., Suite 1450               http://www.555citycenter.com/
Oakland, Ca. 94607        Phone: (510) 452-2000; Fax: (510) 452-0182

From: Joe Marshall
Subject: Re: Splitting sequence at object
Date: Wed, 05 Mar 2003 15:33:35 +0000
Message-ID: <vfyx25gw.fsf@ccs.neu.edu>

Larry Hunter <············@uchsc.edu> writes:

> [I offered my example to show how Joe could be right sometimes.]

Can I quote you on this?

From: Thomas F. Burdick
Subject: Re: Splitting sequence at object
Date: Thu, 06 Mar 2003 02:48:45 +0000
Message-ID: <xcv7kbdnraq.fsf@famine.OCF.Berkeley.EDU>

Larry Hunter <············@uchsc.edu> writes:

> Thomas Burdick says:
> 
>  Well, to be pedantic, it does, as
>  
>    (every #'stringp (nlex-string ...)) 
> 
>  would show. 
> 
> I'm sorry, I don't get this. Are you saying they are "new" strings
> because each of the displaced arrays are strings in their own right?

I just meant this as a smart-ass comment, which I probably should have
cut, because I think it distracted from my point.

> And then he says
> 
>   You don't need to cons up a list of these indexes, you can just
>   iterate over them.
> 
> Picky, picky. How about this, which takes a function of three
> arguments (string, start, end) and applies it to each
> would-have-been-lexed item in the string. Is this garbage-light enough
> for you?

[snip]

> By this point, I'm really wondering what that function you want to
> burdick-map is going to be like. Unless it is going to be operating
> only on the characters, you are going to end up creating the same
> garbage in a different place. 

I'm not out to absolutely destroy consing, but to view the operation
of splitting a string, and other similar operations, as an iteration
path.  Rather than make one pass, creating intermediate results which
are passed over again and thrown out, and repeating -- instead, I
think it's preferable in general to do this as a nested loop.

-- 
           /|_     .-----------------------.                        
         ,'  .\  / | No to Imperialist war |                        
     ,--'    _,'   | Wage class war!       |                        
    /       /      `-----------------------'                        
   (   -.  |                               
   |     ) |                               
  (`-.  '--.)                              
   `. )----'

From: Joe Marshall
Subject: Re: Splitting sequence at object
Date: Wed, 05 Mar 2003 15:35:52 +0000
Message-ID: <ptp525d3.fsf@ccs.neu.edu>

···@fallingrocks.OCF.Berkeley.EDU (Thomas F. Burdick) writes:

> Often, these strings are generated, only to be
> operated on and thrown away -- in most cases, a start and end index
> into the original string would suffice.  You don't need to cons up a
> list of these indexes, you can just iterate over them.  In cases where
> you eventually need a list of results, it's best to delay constructing
> that list as long as possible.

It isn't clear that avoiding the consing is a win, though.  On a
modern machine, with a decent amount of RAM and a generational
collector, the cost of consing and collecting a string is extremely
small.

From: Thomas F. Burdick
Subject: Re: Splitting sequence at object
Date: Thu, 06 Mar 2003 02:50:06 +0000
Message-ID: <xcv4r6hnr8h.fsf@famine.OCF.Berkeley.EDU>

Joe Marshall <···@ccs.neu.edu> writes:

> ···@fallingrocks.OCF.Berkeley.EDU (Thomas F. Burdick) writes:
> 
> > Often, these strings are generated, only to be
> > operated on and thrown away -- in most cases, a start and end index
> > into the original string would suffice.  You don't need to cons up a
> > list of these indexes, you can just iterate over them.  In cases where
> > you eventually need a list of results, it's best to delay constructing
> > that list as long as possible.
> 
> It isn't clear that avoiding the consing is a win, though.  On a
> modern machine, with a decent amount of RAM and a generational
> collector, the cost of consing and collecting a string is extremely
> small.

Small, but it's still there.  And that approach leads to code that
makes several passes over the data where one would do.

-- 
           /|_     .-----------------------.                        
         ,'  .\  / | No to Imperialist war |                        
     ,--'    _,'   | Wage class war!       |                        
    /       /      `-----------------------'                        
   (   -.  |                               
   |     ) |                               
  (`-.  '--.)                              
   `. )----'

From: Joe Marshall
Subject: Re: Splitting sequence at object
Date: Thu, 06 Mar 2003 14:49:49 +0000
Message-ID: <3cm01rea.fsf@ccs.neu.edu>

···@famine.OCF.Berkeley.EDU (Thomas F. Burdick) writes:

> Joe Marshall <···@ccs.neu.edu> writes:
> 
> > ···@fallingrocks.OCF.Berkeley.EDU (Thomas F. Burdick) writes:
> > 
> > > Often, these strings are generated, only to be
> > > operated on and thrown away -- in most cases, a start and end index
> > > into the original string would suffice.  You don't need to cons up a
> > > list of these indexes, you can just iterate over them.  In cases where
> > > you eventually need a list of results, it's best to delay constructing
> > > that list as long as possible.
> > 
> > It isn't clear that avoiding the consing is a win, though.  On a
> > modern machine, with a decent amount of RAM and a generational
> > collector, the cost of consing and collecting a string is extremely
> > small.
> 
> Small, but it's still there.  

The way I figure it, if it adds less than a few percent, then it isn't.
The payback for squeezing out the last couple of percent of
performance by avoiding consing is far too low to justify it.  The
effort would be better spent on fixing bugs, finding better
algorithms, or even waiting for the next generation of processor.

> And that approach leads to code that makes several passes over the
> data where one would do. 

True.  If this were in an inner loop, then I'd look for a mechanism
that simply managed the starting and ending indices of the
substrings.  But if this were for parsing command line input I
wouldn't bother.

From: Joerg Hoehle
Subject: Re: Splitting sequence at object
Date: Thu, 13 Mar 2003 14:20:10 +0000
Message-ID: <uznnztkkl.fsf@dont.t-systems.UCE.spam.no.com>

Joe Marshall <···@ccs.neu.edu> writes:
> ···@fallingrocks.OCF.Berkeley.EDU (Thomas F. Burdick) writes:
> > Often, these strings are generated, only to be
> > operated on and thrown away -- in most cases, a start and end index
> > into the original string would suffice.  You don't need to cons up a
> > list of these indexes, you can just iterate over them.
Exactly!

> It isn't clear that avoiding the consing is a win, though.  On a
> modern machine, with a decent amount of RAM and a generational
> collector, the cost of consing and collecting a string is extremely
> small.

Your argument forgets that all this garbage once had to be filled with
sensible data, which may mean megabytes of data to be written to
memory (not just read from) memory. That eats cycles - even with a
modern optimized GC!

In an application I recently wrote, which parses hundreds of MB of
logfiles 5-10 times faster than perl when using CLISP, I believe the
single most important performance optimization was: working on large
string buffers (filled via READ-SEQUENCE, 128KB a piece) and working
with index positions, using the :START and :END keywords of functions
to POSITION, STRING=, STRING<, PARSE-INTEGER etc.

I defined a mini domain-specific macro-based language like this:

(column-op action = "accept")
-> (STRING= buffer "accept" :start1 ... :end1 ...)

(column bytes integer)
-> (PARSE-INTEGER buffer :start ... :end ...)

BTW, I also tried out displaced-arrays and found that creating them
took a lot of time (in CLISP, possibly due to the array-bounds
checking or whatever). I rejected this option for this application.

In another application, some 8 years ago, using CMUCL on Sun/Sparc, I
experimentally found a length threshold of 40 bytes below which it's
better to SUBSEQ than to create a displaced array (of course, any such
value if application-specific). SUBSEQ is straightforward to a CPU and
caches, (MAKE-ARRAY :displaced-to ... :displaced-offset) is not so.

Regards,
	Jorg Hohle
Telekom/T-Systems Technology Center

From: Joe Marshall
Subject: Re: Splitting sequence at object
Date: Thu, 13 Mar 2003 14:35:13 +0000
Message-ID: <u1e7qqqm.fsf@ccs.neu.edu>

Joerg Hoehle <············@dont.t-systems.UCE.spam.no.com> writes:

> Joe Marshall <···@ccs.neu.edu> writes:
> 
> > It isn't clear that avoiding the consing is a win, though.  On a
> > modern machine, with a decent amount of RAM and a generational
> > collector, the cost of consing and collecting a string is extremely
> > small.
> 
> Your argument forgets that all this garbage once had to be filled with
> sensible data, which may mean megabytes of data to be written to
> memory (not just read from) memory. That eats cycles - even with a
> modern optimized GC!

To be sure there are cases where avoiding consing *is* a substantial
win.  In the past, there was no question that reducing the consing in
an application would yield substantial performance benefits, but
these days, that isn't necessarily the case.  In general, I don't try
to avoid consing unless I have a strong reason to believe it is
hurting performance.

> In an application I recently wrote, which parses hundreds of MB of
> logfiles 5-10 times faster than perl when using CLISP, I believe the
> single most important performance optimization was: working on large
> string buffers (filled via READ-SEQUENCE, 128KB a piece) and working
> with index positions, using the :START and :END keywords of functions
> to POSITION, STRING=, STRING<, PARSE-INTEGER etc.
> 
> I defined a mini domain-specific macro-based language like this:
> 
> (column-op action = "accept")
> -> (STRING= buffer "accept" :start1 ... :end1 ...)
> 
> (column bytes integer)
> -> (PARSE-INTEGER buffer :start ... :end ...)

This is a prime example of where avoiding needless consing is a win.

From: Joe Marshall
Subject: Re: Splitting sequence at object
Date: Tue, 04 Mar 2003 21:17:10 +0000
Message-ID: <d6l63k89.fsf@ccs.neu.edu>

Larry Hunter <············@uchsc.edu> writes:

> Joe Marshall said:
> 
>   It calls MAKE-ARRAY, so that conses. Admittedly, it may not cons as
>   much (depends on how much storage a displaced array consumes), but
>   the cost of double-indirection may outweigh the consing cost.
> 
> Geez, Joe the object of my assertion was "strings" not "anything."

Sorry.  My pedantic hat was on too tight.  I've adjusted it.

From: K.Khawi
Subject: Re: Splitting sequence at object
Date: Thu, 20 Feb 2003 10:16:18 +0000
Message-ID: <bad977ad.0302200216.67451d74@posting.google.com>

My first post to CLL :-)

Florian Weimer <··@deneb.enyo.de> wrote in message news:<··············@deneb.enyo.de>...
> How can I use the LOOP facility (or something else which is quite
> Lisp-like) to split a sequence at a specific object?  For example, I'd
> like to translate "A.B.C.D" to ("A" "B" "C" "D").

Here is mine:

;;;
(defun split-string (string object)
  (if (plusp (length string))
      (let ((index (position object string)))
	(cond ((null index) string) 	;object not found
	      ((zerop index)	     ;ignore empty sub-lists
	       (split-string (subseq string 1) object))
	      (t (cons (subseq string 0 index)
		       (split-string (subseq string index) object)))))))

;;;

From: Matthew Danish
Subject: Re: Splitting sequence at object
Date: Thu, 20 Feb 2003 16:18:50 +0000
Message-ID: <20030220111850.C22536@lain.cheme.cmu.edu>

On Thu, Feb 20, 2003 at 02:16:18AM -0800, K.Khawi wrote:
> (defun split-string (string object)
>   (if (plusp (length string))
>       (let ((index (position object string)))
> 	(cond ((null index) string) 	;object not found
> 	      ((zerop index)	     ;ignore empty sub-lists
> 	       (split-string (subseq string 1) object))
> 	      (t (cons (subseq string 0 index)
> 		       (split-string (subseq string index) object)))))))

The trouble with this function is that
  (a) it's damn consy (look at all those SUBSEQ calls!)
  (b) it's already been done, better, and widely available
      (see SPLIT-SEQUENCE)

If you want to complete this exercise, purely for educational purposes,
rewrite the function to use :START and :END arguments to sequence
functions, rather than using SUBSEQ.  Also, perhaps handle the wide
range of options that SPLIT-SEQUENCE does.

-- 
; Matthew Danish <·······@andrew.cmu.edu>
; OpenPGP public key: C24B6010 on keyring.debian.org
; Signed or encrypted mail welcome.
; "There is no dark side of the moon really; matter of fact, it's all dark."