From: ··············@yahoo.com
Subject: Most efficient way to read words from string.
Date: 
Message-ID: <770c57$vrk$1@nnrp1.dejanews.com>
What I'd like to do is take a string like "I am a string." and return the
words and punctuation as multiple values such as:

(my-read-from-string "I am a string.")
=> "I"
   "am"
   "a"
   "string"
   "."

Or have it build a list, whichever is more efficient:
(my-read-from-string "I am a string.")
=> ("I" "am" "a" "string" ".")

`read-from-string' uppercases everything, and returns `STRING.' as the last
word. I've been trying at this for a while now, and it seems that the only
function that can do what I want would have to understand the list of
alphabetic characters a-z and A-Z and use `subseq'. Is there any built-in
function that almost does what I want, or any pointers about how to do this
most efficiently, and what to name it when it's done? The iterative `subseq'
version of this function would be enormous if I ever got it working right,
and it would make a lot of garbage.


From: Lyman S. Taylor
Subject: Re: Most efficient way to read words from string.
Date: 
Message-ID: <770gc1$3ee@pravda.cc.gatech.edu>
In article <············@nnrp1.dejanews.com>,
 <··············@yahoo.com> wrote:
>What I'd like to do is take a string like "I am a string." and return the
>words and punctuation as multiple values such as:
>
>(my-read-from-string "I am a string.")
>=> "I"
>   "am"
>   "a"
>   "string"
>   "."
>
>Or have it build a list. Which ever is more efficient:
>(my-read-from-string "I am a string.")
>=> ("I" "am" "a" "string" ".")

   I would imagine that you'd have to build a list ( or some sequence )
   to get to the point where you could do 

      (funcall #'values  seq-of-values ) 

   anyway.  Since the number of values varies here.
   So the latter seems the better way to go. 


>
>`read-from-string' uppercases everything, and returns `STRING.' as the last
>word.

    Not only that, it returns symbols, not strings. 

    I'm afraid you'll need to write your own lexer 
    (i.e., walk down the string, testing for word breaks at non-alpha-char-p characters). 
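
    For illustration, a minimal lexer along those lines might look like the
    following (a rough sketch, not from the post; the whitespace set and the
    "punctuation is a one-character token" rule are assumptions):

(defun my-read-from-string (string)
  (let ((tokens '())
        (i 0)
        (len (length string)))
    (loop while (< i len)
          do (let ((ch (char string i)))
               (cond ((alpha-char-p ch)
                      ;; scan to the end of the alphabetic run
                      (let ((end (or (position-if-not #'alpha-char-p string
                                                      :start i)
                                     len)))
                        (push (subseq string i end) tokens)
                        (setq i end)))
                     ((member ch '(#\Space #\Tab #\Newline))
                      (incf i))                 ; skip whitespace
                     (t                         ; punctuation: one-char token
                      (push (subseq string i (1+ i)) tokens)
                      (incf i)))))
    (nreverse tokens)))

;; (my-read-from-string "I am a string.")  =>  ("I" "am" "a" "string" ".")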
           
    
>function that almost does what I want, or any pointers about how to do this
>most efficiently, and what to name it when it's done? 

> The iterative `subseq'
>version of this function would be enormous if I got it working right and make
>a lot of garbage.

     subseq takes a  "start" and "end" argument.   If sequentially called with 
     the proper pair of indices this should create no garbage. Just the
     strings you want.  In this case, the lexer's "job" is to find these
     positions. 

          012345678901234
         "I am a string." 

USER(1): (subseq "I am a string." 0 1 )
"I"
USER(2): (subseq "I am a string." 2 4 )
"am"
USER(3): (subseq "I am a string." 5 6 )
"a"
USER(4): (subseq "I am a string." 7 13 )
"string"
USER(5): (subseq "I am a string." 13 14 )
"."

     [ This is probably an O(n^2) process since subseq must skip
        the ever lengthening prefix each time.... if subseq is
        naive about the sequence in question.  Probably not worth
        worrying about. ]


     If you're not so concerned with garbage, then string-trim
     (or string-left-trim) could help tokenize the given string. 
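
     A rough sketch of that approach (hypothetical names; note that every
     string-left-trim and subseq call allocates a fresh string, hence the
     garbage caveat):

(defun tokenize-by-trimming
    (string &optional (whitespace '(#\Space #\Tab #\Newline)))
  ;; Repeatedly drop leading whitespace, then cut at the next whitespace.
  (loop for rest = (string-left-trim whitespace string)
              then (string-left-trim whitespace (subseq rest end))
        for end = (position-if #'(lambda (c) (member c whitespace)) rest)
        until (zerop (length rest))
        collect (subseq rest 0 end)
        while end))

;; (tokenize-by-trimming "I am a string.")  =>  ("I" "am" "a" "string.")
;; (the period stays attached; only whitespace delimits here)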


-- 

Lyman S. Taylor            "I'm a Doctor! Not a commando." 
(·····@cc.gatech.edu)         The enhanced EMH Doctor in a ST:Voyager episode.
From: David B. Lamkins
Subject: Re: Most efficient way to read words from string.
Date: 
Message-ID: <PUQk2.5258$W_.3447095@news1.teleport.com>
In article <··········@pravda.cc.gatech.edu> , ·····@cc.gatech.edu (Lyman S.
Taylor) wrote:

> In article <············@nnrp1.dejanews.com>,
>  <··············@yahoo.com> wrote:
>>What I'd like to do is take a string like "I am a string." and return the
>>words and punctuation as multiple values such as:
>>
>>(my-read-from-string "I am a string.")
>>=> "I"
>>   "am"
>>   "a"
>>   "string"
>>   "."
>>
>>Or have it build a list. Which ever is more efficient:
>>(my-read-from-string "I am a string.")
>>=> ("I" "am" "a" "string" ".")
>
>    I would imagine that you'd have to build a list ( or some sequence )
>    to get to the point where you could do
>
>       (funcall #'values  seq-of-values )
>
>    anyway.  Since the number of values varies here.
>    So the latter seems the better way to go.

Also, with the funcall form you can run into CALL-ARGUMENTS-LIMIT, which can
be as low as 50.
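
A small illustration (not from the post): since the ceiling is
implementation-defined, code that insists on returning the words as multiple
values should guard the call, or simply return the list.

;; Illustration only; the helper name is made up.  APPLY/FUNCALL is bounded
;; by CALL-ARGUMENTS-LIMIT (>= 50), and even VALUES-LIST is bounded by
;; MULTIPLE-VALUES-LIMIT (>= 20).
(defun tokens-as-values (tokens)
  (if (< (length tokens) (min call-arguments-limit multiple-values-limit))
      (values-list tokens)
      tokens))   ; too many: just return the list itself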

[snip]

>> The iterative `subseq'
>>version of this function would be enormous if I got it working right and make
>>a lot of garbage.
>
>      subseq takes a  "start" and "end" argument.   If sequentially called with
>      the proper pair of indices this should create no garbage. Just the
>      strings you want.  In this case, the lexer's "job" is to find these
>      positions.
>

"subseq creates a sequence that is a copy of the subsequence of sequence
bounded by start and end."  (CLHS)

Therefore, subseq allocates additional memory (which may eventually turn
into garbage).

[snip]

>
>      [ This is probably an O(n^2) process since subseq must skip
>         the ever lengthening prefix each time.... if subseq is
>         naive about the sequence in question.  Probably not worth
>         worrying about. ]
>

Not if the argument is a string; then subseq can index directly to the
character of interest in constant time.

[snip]

--
--
David B. Lamkins <http://www.teleport.com/~dlamkins/>

((lambda (x) (list x (list 'quote x)))
   '(lambda (x) (list x (list 'quote x))))
From: Lyman S. Taylor
Subject: Re: Most efficient way to read words from string.
Date: 
Message-ID: <770on1$4ef@pravda.cc.gatech.edu>
In article <·····················@news1.teleport.com>,
David B. Lamkins <········@teleport.com> wrote:
>In article <··········@pravda.cc.gatech.edu> , ·····@cc.gatech.edu (Lyman S.
>Taylor) wrote:
...
>>      subseq takes a  "start" and "end" argument.   If sequentially called 
>
>"subseq creates a sequence that is a copy of the subsequence of sequence
>bounded by start and end."  (CLHS)
>
>Therefore, subseq allocates additional memory (which may eventually turn
>into garbage).

    I was operating under the presumption that the result consisted of 
    individual strings, not aliases to portions of the original, 
    and that they were "useful", so not "garbage".  [ In contrast to 
    something like 
	          (subseq (subseq "A string" 2 8) 0 3 )

    which does create some intermediate garbage. ]

    If you later wished to flip these strings into symbols you'd need 
    to (or should) make a copy of the strings anyway.

    Additionally, if the "source string" is some reusable buffer, then 
    you can just refill that buffer, as opposed to creating a new
    string for each "line".  [ e.g., using read-sequence to fill/refill
    the buffer, in contrast to repeatedly calling read-line. ]
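
    A sketch of that buffer-reuse idea (the file name, buffer size, and the
    process-chunk helper are hypothetical):

;; READ-SEQUENCE refills the same string each time, where READ-LINE would
;; allocate a fresh string per line.  PROCESS-CHUNK is a hypothetical
;; stand-in for "lex BUFFER between 0 and END"; tokens that span refills
;; need extra care.
(let ((buffer (make-string 4096)))
  (with-open-file (in "words.txt")
    (loop for end = (read-sequence buffer in)
          until (zerop end)
          do (process-chunk buffer end))))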

    Most everything eventually turns into garbage, even the aliases. 

    Also, if your "source" string consists of lots of "white space", then 
    letting it go and keeping just the tokens gives you a net space
    reduction. 


    







    


-- 
					
Lyman S. Taylor                "Twinkie Cream; food of the Gods" 
(·····@cc.gatech.edu)                     Jarod, "The Pretender" 
From: Espen Vestre
Subject: Re: Most efficient way to read words from string.
Date: 
Message-ID: <w6sodns5ia.fsf@gromit.nextel.no>
"David B. Lamkins" <········@teleport.com> writes:

> "subseq creates a sequence that is a copy of the subsequence of sequence
> bounded by start and end."  (CLHS)
> 
> Therefore, subseq allocates additional memory (which may eventually turn
> into garbage).

I sometimes use displaced arrays where the space usage of subseq gets 
critical:

(defun substr (string from &optional (to (length string)))
  (make-array (- to from) :element-type 'character 
	      :displaced-to string 
	      :displaced-index-offset from))

However, with ACL 5, the initial overhead of this function compared 
to subseq seems to be 32 bytes, which means, I guess, that you'll 
need to have substrings longer than 32 characters to make any 
sensible use of it...
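
One property worth noting (an illustration, not from the post): the displaced
result shares storage with the original string, so it is an alias rather than
a copy:

;; Writing through the displaced "substring" changes the original:
(let ((s (copy-seq "I am a string.")))
  (setf (char (substr s 2 4) 0) #\X)
  s)
;; => "I Xm a string."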

-- 

  regards,
    Espen Vestre
From: Barry Margolin
Subject: Re: Most efficient way to read words from string.
Date: 
Message-ID: <GOQk2.117$Pu3.10214@burlma1-snr1.gtei.net>
In article <··········@pravda.cc.gatech.edu>,
Lyman S. Taylor <·····@cc.gatech.edu> wrote:
>   I would imagine that you'd have to build a list ( or some sequence )
>   to get to the point where you could do 
>
>      (funcall #'values  seq-of-values ) 
>
>   anyway.  Since the number of values varies here.
>   So the latter seems the better way to go. 

I think you meant:

	(apply #'values seq-of-values)

which is equivalent to:

	(values-list seq-of-values)

-- 
Barry Margolin, ······@bbnplanet.com
GTE Internetworking, Powered by BBN, Burlington, MA
*** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
Don't bother cc'ing followups to me.
From: Lyman S. Taylor
Subject: Re: Most efficient way to read words from string.
Date: 
Message-ID: <770mfa$47b@pravda.cc.gatech.edu>
In article <···················@burlma1-snr1.gtei.net>,
Barry Margolin  <······@bbnplanet.com> wrote:
>In article <··········@pravda.cc.gatech.edu>,
>Lyman S. Taylor <·····@cc.gatech.edu> wrote:
...
>>      (funcall #'values  seq-of-values ) 
>
>I think you meant:
>
>	(apply #'values seq-of-values)

   Yes,  sorry about that. :-).


-- 
					
Lyman S. Taylor                "Twinkie Cream; food of the Gods" 
(·····@cc.gatech.edu)                     Jarod, "The Pretender" 
From: SLong
Subject: Tokenizer (Re: Most efficient way to read words from string.)
Date: 
Message-ID: <36C7C0EE.4DE1@isomedia.com>
Here's a way to break up strings.  It's not bit-fiddling, but then worrying
about efficiency when using LISP is really like rearranging deck chairs on
the Titanic.

For efficiency, use C/C++.

Parentheses have been known to go awry during cut and paste...

(defun string-token
    ;;;
    ;;; DESCRIPTION:   Tokenize a string similar to the C string library 
    ;;;                function strtkn. Locates the position p of any member of 
    ;;;                list token within the input string and returns the 
    ;;;                substring consisting of the characters from position 0 
    ;;;                to p-1 and a substring from p+1 to the end of the input 
    ;;;                string. If no member of token is not found, the string 
    ;;;                and nil are returned.
    ;;;
    ;;;                Examples:
    ;;;                  (string-token "frog-went-a-courtin'" #\-)
    ;;;                     =>"frog"
    ;;;                       "went-a-courtin'"
    ;;;
    ;;;                  (string-token "frog-went-a-courtin'" #\z)
    ;;;                     =>"frog-went-a-courtin'"
    ;;;                       NIL
    ;;;
    ;;;                  (string-token "frog-went-a-courtin'" (list #\g #\a))
    ;;;                     =>"fro"
    ;;;                       "-went-a-courtin'"
    ;;;
    ;;;                  (string-token "frog-went-a-courtin'" #\f)
    ;;;                     =>""
    ;;;                       "rog-went-a-courtin'"
    ;;;
    ;;; PARAMETERS:    string         -string-
    ;;;                token          -character or list of characters-
    (string
     token
     )
  (if (and string token)
      (do* ((delim-list (if (and token (listp token)) token (list token)))
            (char-list (coerce string 'list))
            (limit (list-length char-list))
            (char-count 0 (+ 1 char-count))
            (char (car char-list) (nth char-count char-list))
            )
          ((or
            (member char delim-list)
            (= char-count limit)
            )
           (return
             (values-list
              (if (= char-count limit)
                  (list string nil)
                (list
                 (coerce (butlast char-list (- limit char-count)) 'string)
                 (coerce (nthcdr (+ char-count 1) char-list) 'string)
                 )
                )
              )
             )
           )
        )
    )
  )
From: Marco Antoniotti
Subject: Re: Tokenizer (Re: Most efficient way to read words from string.)
Date: 
Message-ID: <lwemnroedu.fsf@copernico.parades.rm.cnr.it>
Hello "new troll on the block"!

SLong <·········@isomedia.com> writes:

> Here's a way to break up strings.  It's not bit-fiddling, but then worrying
> about efficiency when using LISP is really like rearranging deck chairs on
> the Titanic.
> 
> For efficiency, use C/C++.

For efficiency, use the best functions and abstractions that are
available to you.

> 
> Parentheses have been known to go awry during cut and paste...

Parentheses have also been known to be used much more effectively than
you use them.


> 
> (defun string-token
>     ;;;
>     ;;; DESCRIPTION:   Tokenize a string similar to the C string library 
>     ;;;                function strtkn. Locates the position p of any
> member of 
>     ;;;                list token within the input string and returns
> the 
>     ;;;                substring consisting of the characters from
> position 0 
>     ;;;                to p-1 and a substring from p+1 to the end of the
> input 
>     ;;;                string. If no member of token is not found, the
> string 
>     ;;;                and nil are returned.
>     ;;;
>     ;;;                Examples:
>     ;;;                  (string-token "frog-went-a-courtin'" #\-)
>     ;;;                     =>"frog"
>     ;;;                       "went-a-courtin'"
>     ;;;
>     ;;;                  (string-token "frog-went-a-courtin'" #\z)
>     ;;;                     =>"frog-went-a-courtin'"
>     ;;;                       NIL
>     ;;;
>     ;;;                  (string-token "frog-went-a-courtin'" (list #\g
> #\a))
>     ;;;                     =>"fro"
>     ;;;                       "-went-a-courtin'"
>     ;;;
>     ;;;                  (string-token "frog-went-a-courtin'" #\f)
>     ;;;                     =>""
>     ;;;                       "rog-went-a-courtin'"
>     ;;;
>     ;;; PARAMETERS:    string         -string-
>     ;;;                token          -character or list of characters-
>     (string
>      token
>      )
>   (if (and string token)
>       (do* ((delim-list (if (and token (listp token)) token (list
> token)))
>             (char-list (coerce string 'list))
>             (limit (list-length char-list))
>             (char-count 0 (+ 1 char-count))
>             (char (car char-list) (nth char-count char-list))
>             )
>           ((or
>             (member char delim-list)
>             (= char-count limit)
>             )
>            (return
>              (values-list
>               (if (= char-count limit)
>                   (list string nil)
>                 (list
>                  (coerce (butlast char-list (- limit char-count))
> 'string)
>                  (coerce (nthcdr (+ char-count 1) char-list) 'string)
>                  )
>                 )
>               )
>              )
>            )
>         )
>     )
>   )


e.g. your code could be rewritten as

(defun string-token (string token)
  "DESCRIPTION:   Tokenize a string similar to the C string library 
                  function strtkn. Locates the position p of any
                  member of list token within the input string and returns
                  the substring consisting of the characters from
                  position 0 to p-1 and a substring from p+1 to the end of the
                  input string. If no member of token is not found, the
                  string and nil are returned.
    
                    Examples:
                      (string-token \"frog-went-a-courtin'\" #\-)
                         =>\"frog\"
                           \"went-a-courtin'\"
    
                      (string-token \"frog-went-a-courtin'\" #\z)
                         =>\"frog-went-a-courtin'\"
                            NIL
    
                      (string-token \"frog-went-a-courtin'\" (list #\g #\a))
                         =>\"fro\"
                           \"-went-a-courtin'\"
    
                      (string-token \"frog-went-a-courtin'\" #\f)
                         =>\"\"
                           \"rog-went-a-courtin'\"
    
     PARAMETERS:    string         -string-
                    token          -character or list of characters-"

  (if (and string token) 
      (do* ((delim-list (if (and token (listp token)) token (list token)))
	    (char-list (coerce string 'list))
	    (limit (list-length char-list))
	    (char-count 0 (+ 1 char-count))
	    (char (car char-list) (nth char-count char-list))
	    )
	   ((or (member char delim-list)
		(= char-count limit))
	    (return
	      (values-list
	       (if (= char-count limit)
		   (list string nil)
		   (list (coerce (butlast char-list (- limit char-count))
				 'string)
			 (coerce (nthcdr (+ char-count 1) char-list) 'string))
		   )))))))


If you do not understand why it is better, too bad!

Note also that, for your instruction, your code could be rewritten as follows
(there is one minor difference: where is it?)

(defun string-token (string &optional (separators '(#\Space)))
  (let ((p (position-if #'(lambda (c) (member c separators :test #'char=))
                        string)))
    (if (null p)
        string
        (values (subseq string 0 p)
                (subseq string (1+ p))))))

which is equivalent, faster and less memory hungry.  No wonder your
Lisp code is so slow.

Moreover, what you usually really want is

(defun string-split (str &optional (separator #\Space))
  "Splits the string STR at each SEPARATOR character occurrence.
The resulting substrings are collected into a list which is returned.
A SEPARATOR at the beginning or at the end of the string STR results
in an empty string in the first or last position of the list
returned."
  (declare (type string str)
	   (type character separator))
  (loop	for start  = 0 then (1+ end)
	for end    = (position separator str :start 0)
	             then (position separator str :start start)
	for substr = (subseq str start end)
	             then (subseq str start end)
	collect substr into result
	when (null end)	do (return result)
	))
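
For example, on the original poster's string (illustrative calls, not from
the post; the period stays attached to the last token, since only SEPARATOR
splits):

(string-split "I am a string.")    ; => ("I" "am" "a" "string.")
(string-split " I am a string.")   ; => ("" "I" "am" "a" "string.")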

However, in order to achieve the same effect in C/C++ you'd have to
manually allocate (i.e. with 'malloc') some arrays, which in *your*
case would probably result in memory leaks around your application
too.

Finally, what is usually very useful is something more complex that
implements a tokenizer in the spirit of the Java StringTokenizer class. (I
have it. Do you want it?)

Cheers

-- 
Marco Antoniotti ===========================================
PARADES, Via San Pantaleo 66, I-00186 Rome, ITALY
tel. +39 - (0)6 - 68 10 03 17, fax. +39 - (0)6 - 68 80 79 26
http://www.parades.rm.cnr.it
From: Martti Halminen
Subject: Re: Tokenizer (Re: Most efficient way to read words from string.)
Date: 
Message-ID: <36C85D6F.481A@rm_spam_trap.dpe.fi>
SLong wrote:

> Here's a way to break up strings.  It's not bit-fiddling, but then
<snip>

> (defun string-token
>     ;;; PARAMETERS:    string         -string-
>     ;;;                token          -character or list of 
>     ;;;                                 characters-
>     (string
>      token
>      )
>   (if (and string token)
>       (do* ((delim-list (if (and token (listp token)) token (list
>                                                             token)))
>             (char-list (coerce string 'list))
>             (limit (list-length char-list))
>             (char-count 0 (+ 1 char-count))
>             (char (car char-list) (nth char-count char-list))
>             )
>           ((or
>             (member char delim-list)
>             (= char-count limit)
>             )
>            (return
>              (values-list
>               (if (= char-count limit)
>                   (list string nil)
>                 (list
>                  (coerce (butlast char-list (- limit char-count))
>                           'string)
>                  (coerce (nthcdr (+ char-count 1) char-list) 'string)
>                  )
   <add sufficient parens here>

If the previous seems a little complicated for the beginners, there are
other ways of doing this, too. For example:

(defun string-token-B (str token)
  (let ((pos (position token str)))
    (if (null pos)
	(values str nil)
      (values (subseq str 0 pos)
	      (subseq str (1+ pos))))))

-- simplified version: only accepts single tokens, no input sanity
checking. Seems to be slightly faster, too.
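
For example (illustrative calls, mirroring the examples in the original post):

(string-token-B "frog-went-a-courtin'" #\-)   ; => "frog", "went-a-courtin'"
(string-token-B "frog-went-a-courtin'" #\z)   ; => "frog-went-a-courtin'", NIL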









-- 
________________________________________________________________
    ^.          Martti Halminen
   / \`.        Design Power Europe Oy
  /   \ `.      Tekniikantie 12, FIN-02150 Espoo, Finland
 /\`.  \ |      Tel:+358 9 4354 2306, Fax:+358 9 455 8575
/__\|___\|      ······················@dpe.fi   http://www.dpe.fi
From: Erik Naggum
Subject: Re: Tokenizer (Re: Most efficient way to read words from string.)
Date: 
Message-ID: <3128099211492121@naggum.no>
* SLong <·········@isomedia.com>
| It's not bit-fiddling, but then worrying about efficiency when using LISP
| is really like rearranging deck chairs on the Titanic.

  well, the interesting question is how _did_ you survive?  if you arranged
  deck chairs the way you write code, it's no wonder the whole ship sank.

#:Erik
From: SLong
Subject: Re: Tokenizer (Re: Most efficient way to read words from string.)
Date: 
Message-ID: <36C93796.53C4@isomedia.com>
Yeah, sure.  Folks have told me all about you.
From: Erik Naggum
Subject: Re: Tokenizer (Re: Most efficient way to read words from string.)
Date: 
Message-ID: <3128175105600080@naggum.no>
* SLong <·········@isomedia.com>
| Yeah, sure.  Folks have told me all about you.

  no, they haven't, but it's actually comforting to learn that you think so.

#:Erik
From: Larry Hunter
Subject: Re: Tokenizer (Re: Most efficient way to read words from string.)
Date: 
Message-ID: <rbsoc61j45.fsf@work.nlm.nih.gov>
Steve Long <·········@isomedia.com> is apparently trolling in the lisp
groups.  He says:

 Here's a way to break up strings.

And translates his C code into lisp (badly).  As others have pointed out,
the whole thing could be done better and much more simply with POSITION and
SUBSEQ.  I want to add that I write parsers in lisp all the time (usually to
process human-friendly commands to complex systems).  Here's the lexer I
use, which does a good bit more than Steve's tokenizer, and is quite
efficient. 


(defun lex-string (string &key (whitespace '(#\space #\newline)))
  "Separates a string at whitespace and returns a list of strings"
  (flet ((whitespace? (char) (member char whitespace :test #'char=)))
    (let ((tokens nil))
      (do* ((token-start
             (position-if-not #'whitespace? string) 
             (when token-end
               (position-if-not #'whitespace? string :start (1+ token-end))))
            (token-end
             (when token-start
               (position-if #'whitespace? string :start token-start))
             (when token-start
               (position-if #'whitespace? string :start token-start))))
           ((null token-start) (nreverse tokens))
        (push (subseq string token-start token-end) tokens)))))
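
For example (illustrative calls; punctuation is not whitespace here, so it
stays attached to the word):

(lex-string "I am a string.")    ; => ("I" "am" "a" "string.")
(lex-string "  spread  out  ")   ; => ("spread" "out")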


-- 
Lawrence Hunter, PhD.
National Library of Medicine               phone: +1 (301) 496-9303
Bldg. 38A, 9th fl, MS-54                   fax:   +1 (301) 496-0673
Bethesda. MD 20894 USA                     email: ······@nlm.nih.gov
From: SLong
Subject: Re: Tokenizer (Re: Most efficient way to read words from string.)
Date: 
Message-ID: <36C9391F.7708@isomedia.com>
Larry Hunter wrote:
> 
> Steve Long <·········@isomedia.com> is apparently trolling in the lisp
> groups.  He says:
> 
>  Here's a way to break up strings.
> 
 

Larry, 

There's always a better way of doing things, especially in a LISP
newsgroup.

Thanks for posting your version of this.  It didn't run any more
efficiently in terms of time or resources, but I have learned something
new.
From: David B. Lamkins
Subject: Re: Most efficient way to read words from string.
Date: 
Message-ID: <vIQk2.5243$W_.3441093@news1.teleport.com>
In article <············@nnrp1.dejanews.com> , ··············@yahoo.com  
wrote:

> What I'd like to do is take a string like "I am a string." and return the
> words and punctuation as multiple values such as:
>
> (my-read-from-string "I am a string.")
> => "I"
>    "am"
>    "a"
>    "string"
>    "."
>
> Or have it build a list. Which ever is more efficient:
> (my-read-from-string "I am a string.")
> => ("I" "am" "a" "string" ".")
>
> `read-from-string' uppercases everything, and returns `STRING.' as the last
> word. I've been trying at this for awhile now and it seems that the only
> function that can do what I want would have to understand the list of
> alphabetic characters a-z and A-Z and use `subseq'. Is there any built in
> function that almost does what I want, or any pointers about how to do this
> most efficiently, and what to name it when it's done? The iterative `subseq'
> version of this function would be enormous if I got it working right and make
> a lot of garbage.
>

Actually, the subseq version may not be as bad as you think.  Here are a
couple of versions I hacked together; one uses displaced arrays and the
other uses subseq.  WRT memory allocation, the displaced arrays version wins
for long words, while the subseq version wins for short words.

In MCL 4.2, the boundary seems to be at a "word" size of 20 characters. The
actual allocation has to do with granularity of memory requests and the
relative sizes of simple vs non-simple array headers.

So if I were going to use MCL 4.2 to chunk English text and were concerned
about total memory allocation, the subseq version would save me about 8
bytes per word.

OTOH, if I were more concerned about stressing the GC, I'd probably
preallocate a bunch of displaced arrays, and change the code slightly to use
them as resources (rather than creating them for each use and then
discarding them).
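
A sketch of that resourcing idea (an illustration only, not the code timed
below; the names are hypothetical): ADJUST-ARRAY can re-point an adjustable
displaced array at a new region, so a few preallocated arrays can be recycled
instead of consing one per word.

;; Sketch only: recycle one adjustable displaced array rather than
;; consing a new one for every word.
(defvar *word-resource*
  (make-array 0 :element-type 'character :adjustable t
                :displaced-to (make-string 0)))

(defun repoint-word (word string start end)
  "Make WORD designate STRING's characters from START below END."
  (adjust-array word (- end start)
                :displaced-to string
                :displaced-index-offset start))

;; (repoint-word *word-resource* "I am a string." 7 13)  => an alias of "string"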

;; Tests of subseq version

? (let ((string "T "))
    (time (read-short-words-from-string string)))
(READ-SHORT-WORDS-FROM-STRING STRING) took 0 milliseconds (0.000 seconds) to
run.
 16 bytes of memory allocated.
("T")

? (let ((string "ThisThat "))
    (time (read-short-words-from-string string)))
(READ-SHORT-WORDS-FROM-STRING STRING) took 0 milliseconds (0.000 seconds) to
run.
 24 bytes of memory allocated.
("ThisThat")

? (let ((string "ThisThatThisThatThis "))
    (time (read-short-words-from-string string)))
(READ-SHORT-WORDS-FROM-STRING STRING) took 0 milliseconds (0.000 seconds) to
run.
 32 bytes of memory allocated.
("ThisThatThisThatThis")

? (let ((string "ThisThatThisThatThisX "))
    (time (read-short-words-from-string string)))
(READ-SHORT-WORDS-FROM-STRING STRING) took 0 milliseconds (0.000 seconds) to
run.
 40 bytes of memory allocated.
("ThisThatThisThatThisX")

;; Tests of displaced arrays version

? (let ((string "ThisThatThisThatThisX "))
    (time (read-long-words-from-string string)))
(READ-LONG-WORDS-FROM-STRING STRING) took 0 milliseconds (0.000 seconds) to
run.
 32 bytes of memory allocated.
("ThisThatThisThatThisX")

? (let ((string "ThisThatThisThatThisThisThatThisThatThisThisThatThisThat"))
    (time (read-long-words-from-string string)))
(READ-LONG-WORDS-FROM-STRING STRING) took 0 milliseconds (0.000 seconds) to
run.
 32 bytes of memory allocated.
("ThisThatThisThatThisThisThatThisThatThisThisThatThisThat")


;; Note: These only recognize words composed of alpha characters, and
;; require that the last word be followed by a non-alpha character.
;; Marking the "word" boundaries properly is left as an exercise. <g>

(defun read-long-words-from-string (string)
  (let ((result ())
        (in-word-p nil)
        (start-word nil))
    (dotimes (i (length string))
      (setq in-word-p (alpha-char-p (char string i)))
      (if start-word
        (unless in-word-p
          (push (make-array (- i start-word)
                            :displaced-to string
                            :displaced-index-offset start-word)
                result)
          (setq start-word nil))
        (when in-word-p
          (setq start-word i))))
    (nreverse result)))

(defun read-short-words-from-string (string)
  (let ((result ())
        (in-word-p nil)
        (start-word nil))
    (dotimes (i (length string))
      (setq in-word-p (alpha-char-p (char string i)))
      (if start-word
        (unless in-word-p
          (push (subseq string start-word i)
                result)
          (setq start-word nil))
        (when in-word-p
          (setq start-word i))))
    (nreverse result)))

--
David B. Lamkins <http://www.teleport.com/~dlamkins/>

((lambda (x) (list x (list 'quote x)))
   '(lambda (x) (list x (list 'quote x))))
From: Pierre Mai
Subject: Re: Most efficient way to read words from string.
Date: 
Message-ID: <87lnjebq0f.fsf@orion.dent.isdn.cs.tu-berlin.de>
··············@yahoo.com writes:

> What I'd like to do is take a string like "I am a string." and return the
> words and punctuation as multiple values such as:
> 
> (my-read-from-string "I am a string.")
> => "I"
>    "am"
>    "a"
>    "string"
>    "."
> 
> Or have it build a list. Which ever is more efficient:
> (my-read-from-string "I am a string.")
> => ("I" "am" "a" "string" ".")

Here is a simplistic tokenization function most Perl users will find
vaguely familiar.  It currently will only tokenize along a given set
of whitespace (which by definition will be elided from the result), so 
you will have to change it a bit for your punctuation characters,
along which you'll split, but which should be _kept_ in the result
sequence:

(defun split (string &optional max (ws '(#\Space #\Tab)))
  "Split `string' along whitespace as defined by the sequence `ws'.
The whitespace is elided from the result.  The whole string will be
split, unless `max' is a non-negative integer, in which case the
string will be split into `max' tokens at most, the last one
containing the whole rest of the given `string', if any."
  (flet ((is-ws (char) (find char ws)))
    (loop for start = (position-if-not #'is-ws string)
	  then (position-if-not #'is-ws string :start index)
	  for index = (and start
			   (if (and max (= (1+ word-count) max))
			       nil
			       (position-if #'is-ws string :start start)))
	  while start
	  collect (subseq string start index)
	  count 1 into word-count
	  while index)))
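
For example (illustrative calls, not from the post; the period stays
attached, since only `ws' characters split):

(split "I am a string.")     ; => ("I" "am" "a" "string.")
(split "I am a string." 2)   ; => ("I" "am a string.")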

Note:

a) This is in no way optimized.  Depending on your usage patterns and
   your implementation this can be tuned quite a lot.  Modify it first
   to meet your needs and use it; if it's too slow, profile it (this can
   have very interesting ramifications!), and then optimize it if
   need be.
b) The documentation string is slightly dense...

Regs, Pierre.

PS: I still think that these kinds of string-processing functions
should be collected somewhere in a nice library, so that new users
will not find Perl more suitable for simple string-bashing things...
As time permits (which it seldom does), I'll try to put something
usable together and release it to the PD.  Anyone who has something
nice to contribute, feel free to contact me... ;)

-- 
Pierre Mai <····@acm.org>               http://home.pages.de/~trillian/
  "One smaller motivation which, in part, stems from altruism is Microsoft-
   bashing." [Microsoft memo, see http://www.opensource.org/halloween1.html]
From: Christopher R. Barry
Subject: Re: Most efficient way to read words from string.
Date: 
Message-ID: <87emp6ppw1.fsf@2xtreme.net>
Pierre Mai <····@acm.org> writes:

> ··············@yahoo.com writes:
> 
> > What I'd like to do is take a string like "I am a string." and return the
> > words and punctuation as multiple values such as:
> > 
> > (my-read-from-string "I am a string.")
> > => "I"
> >    "am"
> >    "a"
> >    "string"
> >    "."
> > 
> > Or have it build a list. Which ever is more efficient:
> > (my-read-from-string "I am a string.")
> > => ("I" "am" "a" "string" ".")

[tokenizing function deleted]

While looking through the HyperSpec just recently I found this
TOKENIZE-SENTENCE function on the LOOP-FINISH page:

;; While LOOP-FINISH can be used in a variety of 
;; situations it is really most needed in a situation where a need
;; to exit is detected at other than the loop's `top level'
;; (where UNTIL or WHEN often work just as well), or where some 
;; computation must occur between the point where a need to exit is
;; detected and the point where the exit actually occurs.  For example:
 (defun tokenize-sentence (string)
   (macrolet ((add-word (wvar svar)
                `(when ,wvar
                   (push (coerce (nreverse ,wvar) 'string) ,svar)
                   (setq ,wvar nil))))
     (loop with word = '() and sentence = '() and endpos = nil
           for i below (length string)
           do (let ((char (aref string i)))
                (case char
                  (#\Space (add-word word sentence))
                  (#\. (setq endpos (1+ i)) (loop-finish))
                  (otherwise (push char word))))
           finally (add-word word sentence)
                   (return (values (nreverse sentence) endpos)))))
=>  TOKENIZE-SENTENCE
 
 (tokenize-sentence "this is a sentence. this is another sentence.")
=>  ("this" "is" "a" "sentence"), 19
 
 (tokenize-sentence "this is a sentence")
=>  ("this" "is" "a" "sentence"), NIL


> 
> Regs, Pierre.
> 
> PS: I still think that these kinds of string-processing functions
> should be collected somewhere in a nice library, so that new users
> will not find Perl more suitable for simple string-bashing things...
> As time permits (which it seldom does), I'll try to put something
> usable together and release it to the PD.  Anyone who has something
> nice to contribute, feel free to contact me... ;)

I think I remember this issue coming up before, and someone mentioning
that at least one vendor put together a comprehensive strings
package. I also think that a nice PD library of string bashing
functions that is portable across implementations would be extremely
useful, and I'll help out.

Does anyone know of a strings package that has already been put
together by a vendor, and where API-related documentation for it might
be accessed?

If we don't have to reinvent the wheel here and have two incompatible
interfaces, that would be wonderful.

Christopher
From: David Bakhash
Subject: Re: Most efficient way to read words from string.
Date: 
Message-ID: <wku2y1og73.fsf@mit.edu>
Pierre's `split' blows the doors off of the `tokenize-sentence' example,
based on some simple profiling I did.  It's also more straightforward.  I
just think that if this will be done, then it should be well
thought-out, and use of CLOS should be considered (i.e. as to whether
these functions should be "extensible" in a sense, so we can "split" our
own data structures, etc.).  Also, Lisp really doesn't limit many of these
manipulations to strings, so maybe we can apply these things to
sequences more generally.

This is a very serious issue.  These functions should be
implementation-dependent, and they should be completely PD, IMO.
Anyway, I began working on this years ago in elisp (i.e. trying to write
many Perl functions in Lisp), but I'll have to find it, and when/if I
do, I'll also make it public.  My `split' was almost a facsimile of
Pierre's, and I think that it should not be too hard to do this
(although to do it _right_ (e.g. extensible, versatile, etc) is another
story altogether).
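
A rough sketch of what such an extensible interface might look like (an
illustration only, not an agreed design):

;; Sketch only: a generic SPLIT that user code can extend by adding
;; methods for its own container classes.
(defgeneric split (thing &key separator)
  (:documentation "Split THING at SEPARATOR elements; return a list of pieces."))

(defmethod split ((seq sequence) &key (separator #\Space))
  (loop for start = 0 then (1+ end)
        for end = (position separator seq :start start)
        collect (subseq seq start end)
        while end))

;; (split "I am a string.")             => ("I" "am" "a" "string.")
;; (split #(1 2 0 3 0 4) :separator 0)  => (#(1 2) #(3) #(4))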

dave
From: SLong
Subject: Re: Most efficient way to read words from string.
Date: 
Message-ID: <36C7BC97.69E4@isomedia.com>
Remember, the least efficient way to read words from a string is for YOU
to rewrite code for different cases.  This is real-world software
engineering.  So write one defun that handles the general case of large
and small strings, make it easy to understand, and don't sweat CPU time
or memory. 

slong