From: Ola Rinta-Koski
Subject: split
Date: 
Message-ID: <wkln2fgqwk.fsf@ANJOVIS.i-did-not-set--mail-host-address--so-shoot-me>
  Inspired by similar functionality in Perl, I wrote an implementation
  of "split" that takes two parameters, "where" (the separator between
  different fields in the string) and "str" (the string to be split)
  and returns a list of fields in the source string separated by the
  given separator. It works, but a) it's not too readable b) I suspect
  it is not too efficient. Any ideas how to improve readability and/or
  efficiency (and perhaps robustness)?

(defun split (where str)
  (when str
    (loop for n = -1 then (position where str :start (+ n 1))
      while (and n (< n (+ (length str) 1)))
      for m = (if n (position where str :start (+ n 1)))
      collect (subseq str (+ n 1) m))))
-- 
I'm reporting for duty as a modern person.  I want to do the Latin
Hustle now! Were these parsnips CORRECTLY MARINATED in TACO SAUCE?

From: Hannu Koivisto
Subject: Re: split
Date: 
Message-ID: <8766tjdmvg.fsf@senstation.vvf.fi>
Ola Rinta-Koski <···@iki.fi.REMOVE.THIS> writes:

|   given separator. It works, but a) it's not too readable b) I suspect
|   it is not too efficient. Any ideas how to improve readability and/or
|   efficiency (and perhaps robustness)?

It indeed looked somewhat unreadable; I didn't even bother to read
it aside from a quick glance, so I don't claim this does exactly
the same, but here's what I wrote (I think I've stolen the
documentation from a similar function written by someone else but
I've lost that information) for at least approximately the same
purpose:

(defun string-split (str &optional (separator #\Space))
  "Splits the string STR at each SEPARATOR character occurrence.
The resulting substrings are collected into a list which is returned.
A SEPARATOR at the beginning or at the end of the string STR results
in an empty string in the first or last position of the list
returned."
  (loop for start = 0 then (1+ end)
        for end   = (position separator str :start start)
        collect (subseq str start end)
        until (null end)))

I guess it could be called SEQ-SPLIT or something like that.  I
don't really see any pressing efficiency or robustness issues here
so I won't say anything more about those concerns.

-- 
Hannu
From: Christophe Rhodes
Subject: Re: split
Date: 
Message-ID: <squ2h2wcdl.fsf@lambda.jesus.cam.ac.uk>
Ola Rinta-Koski <···@iki.fi.REMOVE.THIS> writes:

>   Inspired by similar functionality in Perl, I wrote an implementation
>   of "split" that takes two parameters, "where" (the separator between
>   different fields in the string) and "str" (the string to be split)
>   and returns a list of fields in the source string separated by the
>   given separator. It works, but a) it's not too readable b) I suspect
>   it is not too efficient. Any ideas how to improve readability and/or
>   efficiency (and perhaps robustness)?
> 
> (defun split (where str)
>   (when str
>     (loop for n = -1 then (position where str :start (+ n 1))
>       while (and n (< n (+ (length str) 1)))
>       for m = (if n (position where str :start (+ n 1)))
>       collect (subseq str (+ n 1) m))))

You might want to have a look at Deja, for a thread around September
1998; doing a search on comp.lang.lisp for "delimited-substrings" or
"split-sequence" would probably get you some results.

Christophe
-- 
(macrolet ((f (a b) `(unless (equal ,a "") (princ (aref ,a 0)) (setf ,a 
(subseq ,a 1)) (,b))) (z (&rest a) `(format nil ··@{~35r~^ ~}" ,@a))) (
let ((w (z 693 29204 28104384)) (x (z 1984050605838 12977))) (labels ((
y () (f w z))(z () (f x y))) (y) (terpri))))#|Jesus College,Cambridge|#
From: David Hanley
Subject: Re: split
Date: 
Message-ID: <38FB64DB.E3DB2A96@ncgr.org>
Well, I would guess that efficency of your code isn't the single
largest issue here, since allocation and copying of arrays is going
to dominate runtime.  In instances where you may only want the
first token or two, or there are long tokens, you might want to do the
following in the name of efficiency:

(defun make-tokenizer( string &optional (token #\Space) )
  "make a funbction(closure) that returns the tokens
of the source string until there are no more; then it
returns nil.  This is structured this way, because in
many instances, you only want the first few tokens
anyways, and it also uses displaced arrays, because
is cases with long tokens, this may be more efficient."
  (let ((start 0)
        (end -1)
        (len (length string)))
    #'(lambda()
        (setf start (1+ end))
        (if (>= start len) nil
          (progn
            (setf end (position token string :start start))
            (when (not end) (setq end len))
            (make-array (- end start)
                        :element-type 'base-char
                        :displaced-to string
                        :displaced-index-offset start))))))


This actually ran on the first time i tried it.  I must be getting better.
:)

Of course there's a lot to do in this, but I'm sure you get the idea.
I *do* think there should be a way to modify displaced arrays, so this
function could just make one displaced array in the initial let() and
always return modified versions of that.  Then the whole thing
would only allocate a few extra bytes, even if you were stepping
though a very long string.

dave
From: Larry Hunter
Subject: Re: split
Date: 
Message-ID: <3isk8hwbgxs.fsf@work.nci.nih.gov>
Ola Rinta-Koski said:

  Inspired by similar functionality in Perl, I wrote an implementation
  of "split" that takes two parameters, "where" (the separator between
  different fields in the string) and "str" (the string to be split)
  and returns a list of fields in the source string separated by the
  given separator. It works, but a) it's not too readable b) I suspect
  it is not too efficient. Any ideas how to improve readability and/or
  efficiency (and perhaps robustness)?

This is a question that comes up fairly often on c.l.l for some
reason. Here's my version of a string lexer, which I have posted several
times before.  It takes a list of delimiter characters instead of a single
character.  I think it's fairly readable, and it is faster and conses less
than the loop version you posted.  (see timings at the end).

Larry

(defun lex-string (string &key (whitespace '(#\space #\newline)))
  "Separates a string at whitespace and returns a list of strings"
  (flet ((whitespace? (char) (member char whitespace :test #'char=)))
    (let ((tokens nil))
      (do* ((token-start
             (position-if-not #'whitespace? string) 
             (when token-end
               (position-if-not #'whitespace? string :start (1+ token-end))))
            (token-end
             (when token-start
               (position-if #'whitespace? string :start token-start))
             (when token-start
               (position-if #'whitespace? string :start token-start))))
           ((null token-start) (nreverse tokens))
        (push (subseq string token-start token-end) tokens)))))

;; Here are timings for my lex-string and your split in Allegro 5.0.1 on SGI
;; Unix.  Both functions are compiled (optimize (speed 3) (Debug 0) (safety 0))

USER(10): (setq test-string "Split me at my spaces.")
"Split me at my spaces."
[...]
USER(18): (time (dotimes (i 1000 nil) (declare (ignore i)) (lex-string test-string)))
; cpu time (non-gc) 60 msec user, 0 msec system
; cpu time (gc)     0 msec user, 0 msec system
; cpu time (total)  60 msec user, 0 msec system
; real time  57 msec
; space allocation:
;  13,004 cons cells, 0 symbols, 96,000 other bytes, 0 static bytes
NIL
USER(19): (time (dotimes (i 1000 nil) (declare (ignore i)) (split #\space test-string)))
; cpu time (non-gc) 70 msec user, 0 msec system
; cpu time (gc)     0 msec user, 0 msec system
; cpu time (total)  70 msec user, 0 msec system
; real time  68 msec
; space allocation:
;  15,004 cons cells, 0 symbols, 96,000 other bytes, 0 static bytes
NIL


-- 
Lawrence Hunter, Ph.D.        Chief, Molecular Statistics and Bioinformatics
National Cancer Institute                             email: ·······@nih.gov
Federal Building, Room 318                          phone: +1 (301) 402-0389
7550 Wisconsin Ave.                                   fax: +1 (301) 480-0223
Bethesda, MD 20892  USA
From: Bernhard Pfahringer
Subject: Re: split
Date: 
Message-ID: <gyug0skdwi0.fsf@borg.i-did-not-set--mail-host-address--so-shoot-me>
For completeness' sake, this is the (slightly debugged) version modelled after
the other general sequence function. It was kind of the final result of the
split-sequence thread 1.5 years ago. It is general, therefore will of course
cons a bit more, but then this could be addressed by appropriate compiler-macros
if necessary.

;;; full-fledged version ala position
;;; BUG FIX: 24-08-1999 (bp): :from-end t caused sub-sequences to be reversed
;;; copyright notice: this function is in the PUBLIC DOMAIN
(defun split-sequence (delimiter seq
			   &key
			   (empty-marker nil keep-empty-subseqs)
			   (from-end nil)
			   (start 0)
			   (end nil)
			   (test nil test-supplied)
			   (test-not nil test-not-supplied)
			   (key nil key-supplied)
			   &aux
			   (len (length seq)))

  "Return list of subsequences in SEQ delimited by DELIMITER.
   If an EMPTY-MARKER is supplied, empty subsequences will be
   represented by EMPTY-MARKER, otherwise they will be discarded.
   All other keywords work analogously to POSITION."

  (declare (optimize (speed 3)(safety 0)(space 0)(debug 0)))
  
  (unless end (setq end len))
  
  (when from-end
    (setf seq (reverse seq))
    (psetf start (- len end)
	   end (- len start)))

  (loop with other-keys = (nconc (when test-supplied (list :test test))
				 (when test-not-supplied (list :test-not test-not))
				 (when key-supplied (list :key key)))
	for left = start then (+ right 1)
	for right = (min (or (apply #'position delimiter seq :start left other-keys)
			     len)
			 end)
	if (< left right)
	collect (let ((subseq (subseq seq left right)))
		  (if from-end
		      (nreverse subseq)
		      subseq))
	else when keep-empty-subseqs collect empty-marker
	until (eq right end)))

--------------------------------------------------------------------------
Bernhard Pfahringer 
Dept. of Computer Science, University of Waikato
G1.25, phone: +64-7-838-4041
········@cs.waikato.ac.nz
--------------------------------------------------------------------------
From: Will Hartung
Subject: Re: split
Date: 
Message-ID: <W9%K4.162370$bm.696812@news1.alsv1.occa.home.com>
>This is a question that comes up fairly often on c.l.l for some
>reason. Here's my version of a string lexer, which I have posted several
>times before.  It takes a list of delimiter characters instead of a single
>character.  I think it's fairly readable, and it is faster and conses less
>than the loop version you posted.  (see timings at the end).
>
>Larry


With a humble *YMMV*, I had far too much time on my hands, so I went and did
this:
Lispworks 4.0.1 on W98, 128MB mem, 400MHz Celeron.

(mark-and-sweep ...) cleans up the first generation garbage, just to be
tidy.

Also, I put the (declare (optimize ...)) line from split-sequence into the
other two routines.

I did 100000 iterations because of the coarse granularity of the LWW timer.

(defun do-timings ()
  (let ((cnt 100000)
        (str "Split at my spaces"))
    (mark-and-sweep 0)
    (print "split-sequence")
    (extended-time (dotimes (i cnt nil) (declare (ignore i))
            (split-sequence #\space str)))
    (mark-and-sweep 0)
    (print "lex-string")
    (extended-time (dotimes (i cnt nil) (declare (ignore i))
            (lex-string str :whitespace '(#\space))))
    (mark-and-sweep 0)
    (print "split")
    (extended-time (dotimes (i cnt nil) (declare (ignore i))
            (split #\space str)))))

Interesting results:

--------
"split-sequence"
2.6 seconds used.
Standard Allocation 11200288 bytes.
Fixlen Allocation 4000192 bytes.
total gc activity   =       0.550
main promote (     1 calls)     =       0.000
mark and sweep (     42 calls)   =       0.550
internal promote (     1 calls) =       0.000
promote (     0 calls)          =       0.000
fixup (     1 calls)            =       0.000
compact (     0 calls)          =       0.000
"lex-string"
4.9 seconds used.
Standard Allocation 11200312 bytes.
Fixlen Allocation 3199992 bytes.
total gc activity   =       0.780
main promote (     1 calls)     =       0.000
mark and sweep (     43 calls)   =       0.780
internal promote (     1 calls) =       0.000
promote (     0 calls)          =       0.000
fixup (     1 calls)            =       0.000
compact (     0 calls)          =       0.000
"split"
3.2 seconds used.
Standard Allocation 11200312 bytes.
Fixlen Allocation 3999992 bytes.
total gc activity   =       0.900
main promote (     1 calls)     =       0.000
mark and sweep (     43 calls)   =       0.900
internal promote (     1 calls) =       0.000
promote (     0 calls)          =       0.000
fixup (     1 calls)            =       0.000
compact (     0 calls)          =       0.000
NIL
-------
split and split-sequence cons pretty much the same, but somehow
split-sequence does it more efficiently as it makes the gc work less. And
while lex-string conses about 25% less than the others, it's almost twice as
slow split-sequence.

However, it must be noted that lex-string will work with multiple
delimiters.

Another mildly interesting observation:

CL-USER 49 > (split-sequence 'a '(a b a c a d a e))
((B) (C) (D) (E))

CL-USER 50 > (split 'a '(a b a c a d a e))
(NIL (B) (C) (D) (E))

CL-USER 51 > (lex-string '(a b a c a d a e) :whitespace '(a))

Error: Type check failed: CHARACTERP is not true of A.
  1 (abort) Return to level 0.
  2 Return to top loop level 0.

Type :b for backtrace, :c <option number> to proceed,  or :? for other
options

split-sequence is more general as a SEQUENCE splitter, rather than just a
string splitter.

Finally, with the following definition:

(defun whitespace? (dummy char)
  (declare (ignore dummy))
  (let ((whitespace '(#\space)))
    (member char whitespace :test #'char=)))

And using the function call:

(split-sequence #\space str :test #'whitespace?)

We get:

"general split-sequence"
4.6 seconds used.
Standard Allocation 11200312 bytes.
Fixlen Allocation 5599992 bytes.
total gc activity   =       0.320
main promote (     1 calls)     =       0.50
mark and sweep (     43 calls)   =       0.270
internal promote (     1 calls) =       0.000
promote (     0 calls)          =       0.000
fixup (     1 calls)            =       0.000
compact (     0 calls)          =       0.000

This offers the same functionality as lex-string to handle multiple
delimiters.

In this case, I would say that it has similiar performance as lex-string,
but conses more.

Also, it should be noted that split-sequence was the effort of many folks on
c.l.l over a period of about two weeks. It suffered many variations.

Share and Enjoy!

Will Hartung
(······@home.com)
From: Larry Hunter
Subject: Re: split
Date: 
Message-ID: <3isg0sjb4yi.fsf@work.nci.nih.gov>
Will,

Thanks for the other info on split, etc.  I thought the 

 CL-USER 51 > (lex-string '(a b a c a d a e) :whitespace '(a))
 Error: Type check failed: CHARACTERP is not true of A.

comment wasn't quite fair, since it's a trivial change to make it handle
this, and it wasn't part of the original task...  I do like the generality
of split-sequence.

Larry

-- 
Lawrence Hunter, Ph.D.        Chief, Molecular Statistics and Bioinformatics
National Cancer Institute                             email: ·······@nih.gov
Federal Building, Room 318                          phone: +1 (301) 402-0389
7550 Wisconsin Ave.                                   fax: +1 (301) 480-0223
Bethesda, MD 20892  USA
From: Will Hartung
Subject: Re: split
Date: 
Message-ID: <F5jL4.163302$bm.713603@news1.alsv1.occa.home.com>
Larry Hunter wrote in message <···············@work.nci.nih.gov>...
>
>Will,
>
>Thanks for the other info on split, etc.  I thought the
>
> CL-USER 51 > (lex-string '(a b a c a d a e) :whitespace '(a))
> Error: Type check failed: CHARACTERP is not true of A.
>
>comment wasn't quite fair, since it's a trivial change to make it handle
>this, and it wasn't part of the original task...  I do like the generality
>of split-sequence.


No malice intended. I was just interested in how the "simple" versions
compared to the "complicated" "general" version, and I was actually a bit
surprised by the results. I was just surprised to find the "general" version
appeared to be faster (though, perhaps, more expensive) than a simpler, more
specific version.

However, in defense of the simple versions, 'split-sequence' caught the
fancy of many folks on c.l.l and it evolved over about two weeks of work by
5 or 6 people, whereas when most people have a need for this kind of
function, they put the 20 minutes or so of work and casual testing into it
until it functions to their specifications, and then leave it.

Seriously, if it doesn't show up at the top of your profiler, who'd ever go
back and try to improve it in any way if it's working fine?

So, again, it wasn't a criticism of your function, just an observation.

Best Regards,

Will Hartung
(······@home.com)
From: Jochen Schmidt
Subject: Re: split
Date: 
Message-ID: <38FCFD6C.DD392B84@gmx.de>
Larry Hunter wrote:

> ;; Here are timings for my lex-string and your split in Allegro 5.0.1 on SGI
> ;; Unix.  Both functions are compiled (optimize (speed 3) (Debug 0) (safety 0))
> 
> USER(10): (setq test-string "Split me at my spaces.")
> "Split me at my spaces."
> [...]
> USER(18): (time (dotimes (i 1000 nil) (declare (ignore i)) (lex-string test-string)))
> ; cpu time (non-gc) 60 msec user, 0 msec system
> ; cpu time (gc)     0 msec user, 0 msec system
> ; cpu time (total)  60 msec user, 0 msec system
> ; real time  57 msec
> ; space allocation:
> ;  13,004 cons cells, 0 symbols, 96,000 other bytes, 0 static bytes
> NIL
> USER(19): (time (dotimes (i 1000 nil) (declare (ignore i)) (split #\space test-string)))
> ; cpu time (non-gc) 70 msec user, 0 msec system
> ; cpu time (gc)     0 msec user, 0 msec system
> ; cpu time (total)  70 msec user, 0 msec system
> ; real time  68 msec
> ; space allocation:
> ;  15,004 cons cells, 0 symbols, 96,000 other bytes, 0 static bytes
> NIL
> 

I've tried some timings with Allegro and CMUCL on my Linux AMD-K6-2
300Mhz 128 MB Machine:

CMUCL:

(time (dotimes (i 1000 nil) (declare (ignore i)) (split #\Space
test-string)))
Evaluation took:
  0.11 seconds of real time
  0.1 seconds of user run time
  0.0 seconds of system run time
  0 page faults and
  212344 bytes consed.

(time (dotimes (i 1000 nil) (declare (ignore i)) (lex-string
test-string)))
Evaluation took:
  0.15 seconds of real time
  0.13 seconds of user run time
  0.0 seconds of system run time
  0 page faults and
  761728 bytes consed.

(time (dotimes (i 1000 nil) (declare (ignore i)) (tokenize-string
test-string)))
Evaluation took:
  0.13 seconds of real time
  0.11 seconds of user run time
  0.01 seconds of system run time
  0 page faults and
  364440 bytes consed.

Allegro:

(time (dotimes (i 1000 nil) (declare (ignore i)) (split #\Space
test-string)))
; cpu time (non-gc) 110 msec user, 0 msec system
; cpu time (gc)     0 msec user, 0 msec system
; cpu time (total)  110 msec user, 0 msec system
; real time  110 msec
; space allocation:
;  15,004 cons cells, 0 symbols, 96,000 other bytes, 0 static bytes    


(time (dotimes (i 1000 nil) (declare (ignore i)) (lex-string
test-string)))
; cpu time (non-gc) 100 msec user, 0 msec system
; cpu time (gc)     0 msec user, 0 msec system
; cpu time (total)  100 msec user, 0 msec system
; real time  100 msec
; space allocation:
;  13,004 cons cells, 0 symbols, 96,000 other bytes, 0 static bytes

(time (dotimes (i 1000 nil) (declare (ignore i)) (tokenize-string
test-string)))
; cpu time (non-gc) 120 msec user, 0 msec system
; cpu time (gc)     0 msec user, 0 msec system
; cpu time (total)  120 msec user, 0 msec system
; real time  124 msec
; space allocation:
;  13,004 cons cells, 0 symbols, 336,000 other bytes, 0 static bytes  

My own version of a string-splitter...

(defun make-char-in-seq-p (char-seq)
  #'(lambda (a-char)
    (find a-char char-seq)))

(defun tokenize-string (str &optional (del '(#\Space)) &key (start 0)
(end nil))
  (labels ((rec (pos acc)
				(if (null pos)
					(nreverse acc)
				  (let ((tpos (position-if (make-char-in-seq-p del) str :start
pos)))
					(if (null tpos)
						(nreverse (cons (subseq str pos) acc))
					  (rec (position-if-not (make-char-in-seq-p del) str :start tpos)
						   (cons (subseq str pos tpos) acc)))))))
    (rec (position-if-not (make-char-in-seq-p del) str :start start :end
end) '())))


; It works on sequences too

* (tokenize-string '(a b a c a d a e) '(a))
((B) (C) (D) (E))

-- 
cya,
Jochen Schmidt
···@dataheaven.de
http://www.dataheaven.de
From: Reini Urban
Subject: Re: split
Date: 
Message-ID: <38ff4e03.34289585@judy>
Ola Rinta-Koski wrote:
>  Inspired by similar functionality in Perl, I wrote an implementation
>  of "split" that takes two parameters, "where" (the separator between
>  different fields in the string) and "str" (the string to be split)
>  and returns a list of fields in the source string separated by the
>  given separator. It works, but a) it's not too readable b) I suspect
>  it is not too efficient. Any ideas how to improve readability and/or
>  efficiency (and perhaps robustness)?
>(defun split (where str) ...

I have a functional note on this popular question.
Your definition of the tokenizer maybe too limited.
I for myself came up with the following useful organization:

string-tokenize vs. string-split (and for all other sequence types)
  split keeps empty ("null") tokens, tokenize throws them away.

  (split    ",1,2," '(#\,)) => ("" "1" "2" "")
  (tokenize ",1,2," '(#\,)) => ("1" "2")

seperator argument types:
  single char vs list of chars vs regex

your function only does the first, it cannot split at whitespace
(space, newline, tab), nor at any regex as in perl.

The best implementation requires the loop macro, which I try avoid, or
the equivalent macro expansion of this, which is also unpleasant to
read, so I cannot help there.
--
Reini Urban
http://xarch.tu-graz.ac.at/autocad/news/faq/autolisp.html
From: Bernhard Pfahringer
Subject: Re: split
Date: 
Message-ID: <gyubt34dsxd.fsf@borg.i-did-not-set--mail-host-address--so-shoot-me>
······@x-ray.at (Reini Urban) writes:

> 
> string-tokenize vs. string-split (and for all other sequence types)
>   split keeps empty ("null") tokens, tokenize throws them away.
> 
>   (split    ",1,2," '(#\,)) => ("" "1" "2" "")
>   (tokenize ",1,2," '(#\,)) => ("1" "2")
> 

SPLIT-SEQUENCE can do this as well:

TEST(31): (split-sequence #\, ",1,2," )
("1" "2")
TEST(32): (split-sequence #\,  ",1,2," :empty-marker "")
("" "1" "2" "")

Bernhard
--------------------------------------------------------------------------
Bernhard Pfahringer
Dept. of Computer Science, University of Waikato
G1.25, 838 4041
········@cs.waikato.ac.nz
--------------------------------------------------------------------------
From: Reini Urban
Subject: Re: split
Date: 
Message-ID: <390023c0.2006505@judy>
Bernhard Pfahringer wrote:
>······@x-ray.at (Reini Urban) writes:
>> string-tokenize vs. string-split (and for all other sequence types)
>>   split keeps empty ("null") tokens, tokenize throws them away.
>> 
>>   (split    ",1,2," '(#\,)) => ("" "1" "2" "")
>>   (tokenize ",1,2," '(#\,)) => ("1" "2")
>> 
>
>SPLIT-SEQUENCE can do this as well:
>
>TEST(31): (split-sequence #\, ",1,2," )
>("1" "2")
>TEST(32): (split-sequence #\,  ",1,2," :empty-marker "")
>("" "1" "2" "")

this is a fine trick.

so the only thing your SPLIT-SEQUENCE is missing, is a set of delimiters
(multiple chars, orderless), a string type of delimiter (implied order)
and a regex delimiter. (just to justify the "lex" in some function
names)

and the argument ordering is questionable.
--
Reini Urban
http://xarch.tu-graz.ac.at/autocad/news/faq/autolisp.html