String manipulation in CLISP

From: Weiguang Shi
Subject: String manipulation in CLISP
Date: Tue, 13 Feb 2001 02:49:45 +0000
Message-ID: <slrn98h869.4tn.wgshi@namao.cs.ualberta.ca>

Hi,

It seems to me that the CLISP that I am learning lacks string manipulation
functions (please do correct me if I am wrong). I am wondering particularly if
there is a function that can split all the space-separated words in a string,
which might be just read through 

        (setq string (read-line)).

into an array or, even better, a list.


Thanks very much
Weiguang

From: Vebjorn Ljosa
Subject: Re: String manipulation in CLISP
Date: Tue, 13 Feb 2001 10:40:44 +0000
Message-ID: <cy37l2usw5v.fsf@proto.pvv.ntnu.no>

* ·····@namao.cs.ualberta.ca (Weiguang Shi)
| 
| It seems to me that the CLISP that I am learning lacks string
| manipulation functions (please do correct me if I am wrong). I am
| wondering particularly if there is a function that can split all the
| space-separated words in a string, which might be just read through
| 
|         (setq string (read-line)).
| 
| into an array or, even better, a list.

Common Lisp has a plethora of functions for manipulating strings and
other sequences.  a string is a sequence, so the functions for
manipulating sequences can also be used for manipulating strings.

I recommend reading the appendix to Graham's "ANSI Common Lisp" from
beginning to end.  then make it a habit to look up details in the
Common Lisp Hyperspec.

here's a function which uses POSITION and SUBSEQ to do what you want:

(defun split-sequence (sequence &key (separator #\space))
  (loop
      with start = 0
      for end = (position separator sequence :start start)
      collect (subseq sequence start end)
      until (null end)
      do
	(setf start (1+ end))))
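
note that with this definition two adjacent separators produce an empty
string, so "a  b" splits into ("a" "" "b").  if empty fields are not
wanted, something along the following lines skips runs of separators
instead.  this is only a minimal sketch, and SPLIT-WORDS is a
hypothetical name, not a standard function:

```lisp
;; SPLIT-WORDS is a hypothetical helper name, not part of the standard.
(defun split-words (string &key (separator #\Space))
  "Return the non-empty SEPARATOR-delimited substrings of STRING as a list."
  (let ((words '())
        (start 0))
    (loop
      ;; Skip over any run of separators to find where the next word begins.
      (let ((word-start (position-if (lambda (c) (char/= c separator))
                                     string :start start)))
        (when (null word-start)
          (return (nreverse words)))
        ;; The word ends at the next separator, or at the end of the string.
        (let ((word-end (position separator string :start word-start)))
          (push (subseq string word-start word-end) words)
          (when (null word-end)
            (return (nreverse words)))
          (setf start word-end))))))

(split-words "  foo  bar baz ")   ; => ("foo" "bar" "baz")
```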

-- 
  Vebjorn

From: Weiguang Shi
Subject: Re: String manipulation in CLISP
Date: Tue, 13 Feb 2001 17:02:49 +0000
Message-ID: <slrn98iq5p.68v.wgshi@namao.cs.ualberta.ca>

On 13 Feb 2001 02:40:44 -0800, Vebjorn Ljosa <·····@ljosa.com> wrote:
>* ·····@namao.cs.ualberta.ca (Weiguang Shi)
>
>Common Lisp has a plethora of functions for manipulating strings and
>other sequences.  a string is a sequence, so the functions for
>manipulating sequences can also be used for manipulating strings.
>
>I recommend reading the appendix to Graham's "ANSI Common Lisp" from
>beginning to end.  then make it a habit to look up details in the
>Common Lisp Hyperspec.
Thanks. I will.

>
>here's a function which uses POSITION and SUBSEQ to do what you want:
>
>(defun split-sequence (sequence &key (separator #\space))
>  (loop
>      with start = 0
>      for end = (position separator sequence :start start)
>      collect (subseq sequence start end)
>      until (null end)
>      do
>	(setf start (1+ end))))
Thanks. It worked!

Weiguang

From: Marco Antoniotti
Subject: Re: String manipulation in CLISP
Date: Tue, 13 Feb 2001 16:09:41 +0000
Message-ID: <y6cy9vazhru.fsf@octagon.mrl.nyu.edu>

·····@namao.cs.ualberta.ca (Weiguang Shi) writes:

> Hi,
> 
> It seems to me that the CLISP that I am learning lacks string manipulation
> functions (please do correct me if I am wrong). I am wondering particularly if
> there is a function that can split all the space-separated words in a string,
> which might be just read through 
> 
>         (setq string (read-line)).
> 
> into an array or, even better, a list.

Come on.  First of all let's think before we ask such a question.
Suppose you have a string (usually a line from a file) which contains
blanks as "field" separators.  Suppose the fields are numbers.  The
simple solution to your question is

	(setf the-fields
              (read-from-string (concatenate 'string
                                             "("
                                             (read-line stream)
                                             ")")))

Now 'the-fields' will contain a list of NUMBERs. "Look ma', no
parsing!"

Of course if you want a much more general solution, SPLIT-SEQUENCE is
just 5 or 6 lines away :)

Cheers

-- 
Marco Antoniotti =============================================================
NYU Courant Bioinformatics Group		 tel. +1 - 212 - 998 3488
719 Broadway 12th Floor                          fax  +1 - 212 - 995 4122
New York, NY 10003, USA				 http://galt.mrl.nyu.edu/valis
             Like DNA, such a language [Lisp] does not go out of style.
			      Paul Graham, ANSI Common Lisp

From: Tim Bradshaw
Subject: Re: String manipulation in CLISP
Date: Tue, 13 Feb 2001 17:04:49 +0000
Message-ID: <nkjsnlisedq.fsf@tfeb.org>

Marco Antoniotti <·······@cs.nyu.edu> writes:

> Come on.  First of all let's think before we ask such a question.
> Suppose you have a string (usually a line from a file) which contains
> blanks as "field" separators.  Suppose the fields are numbers.  The
> simple solution to your question is
> 
> 	(setf the-fields
>               (read-from-string (concatenate 'string
>                                              "("
>                                              (read-line stream)
>                                              ")")))
> 
> Now 'the-fields' will contain a list of NUMBERs. "Look ma', no
> parsing!"
> 

This is the Lisp equivalent of the buffer-overflow problems which are
so pervasive in C.

--tim

From: Sashank Varma
Subject: Re: String manipulation in CLISP
Date: Wed, 14 Feb 2001 15:36:41 +0000
Message-ID: <sashank.varma-1402010936410001@129.59.212.53>

In article <···············@octagon.mrl.nyu.edu>, Marco Antoniotti
<·······@cs.nyu.edu> wrote:

>·····@namao.cs.ualberta.ca (Weiguang Shi) writes:
[snip]
>> there is a function that can split all the space-separated words in a string,
>> which might be just read through 
>> 
>>         (setq string (read-line)).
[snip]
>        (setf the-fields
>              (read-from-string (concatenate 'string
>                                             "("
>                                             (read-line stream)
>                                             ")")))
>
>Now 'the-fields' will contain a list of NUMBERs. "Look ma', no
>parsing!"

i have taken to rendering this trick as:

(setf the-fields (read-from-string (format nil "(~A)" (read-line stream))))

sashank

From: Tim Bradshaw
Subject: Re: String manipulation in CLISP
Date: Wed, 14 Feb 2001 17:10:43 +0000
Message-ID: <nkj1yt1fawc.fsf@tfeb.org>

·············@vanderbilt.edu (Sashank Varma) writes:

> 
> (setf the-fields (read-from-string (format nil "(~A)" (read-line stream))))
> 

Sometimes I feel that I must be some kind of obsessive loony, but
whenever I see this kind of thing it just makes me terrified.  If I
was going to tokenise some string using this technique then before I
did so I'd want to go through the string character by character to
check it had no bad things in it.  While I was doing this I'd tokenise
it, since this adds about a line to the checking code.  Then I'd take
out most of the checks because if you're not using READ you don't need
to be so paranoid.

I'm not saying that READ doesn't have its place -- indeed I wrote a
whole bunch of stuff a while ago about making READ safe(r), and I
regularly use READ/PRINT as a cheap, sane way of doing what XML does
in an expensive, insane way.  But I think if you want to split some
string on whitespace you should, well, split it on whitespace.  It's
like saying that the way to split a string is to use an XML parser.
(Of course this is probably *exactly* the kind of insanity that the
XML conspiracy has in mind for us all, but never mind that.)

I guess partly this horror comes from the fact that in my real life
I'm a systems person, and so I have to update some security-critical
package to fix some buffer-overflow vulnerability approximately once a
week.  So I'm kind of biassed I guess.

--tim

From: Pierre R. Mai
Subject: Re: String manipulation in CLISP
Date: Wed, 14 Feb 2001 18:14:32 +0000
Message-ID: <87elx1xhbr.fsf@orion.bln.pmsf.de>

Tim Bradshaw <···@tfeb.org> writes:

> ·············@vanderbilt.edu (Sashank Varma) writes:
> 
> > 
> > (setf the-fields (read-from-string (format nil "(~A)" (read-line stream))))
> > 
> 
> Sometimes I feel that I must be some kind of obsessive loony, but
> whenever I see this kind of thing it just makes me terrified.  If I

FWIW I feel the same way about such uses of read, so that would make
at least two obsessive loonies... ;)

Regs, Pierre.

-- 
Pierre R. Mai <····@acm.org>                    http://www.pmsf.de/pmai/
 The most likely way for the world to be destroyed, most experts agree,
 is by accident. That's where we come in; we're computer professionals.
 We cause accidents.                           -- Nathaniel Borenstein

From: Marco Antoniotti
Subject: Re: String manipulation in CLISP
Date: Wed, 14 Feb 2001 22:32:32 +0000
Message-ID: <y6cwvas28vz.fsf@octagon.mrl.nyu.edu>

"Pierre R. Mai" <····@acm.org> writes:

> Tim Bradshaw <···@tfeb.org> writes:
> 
> > ·············@vanderbilt.edu (Sashank Varma) writes:
> > 
> > > 
> > > (setf the-fields (read-from-string (format nil "(~A)" (read-line stream))))
> > > 
> > 
> > Sometimes I feel that I must be some kind of obsessive loony, but
> > whenever I see this kind of thing it just makes me terrified.  If I
> 
> FWIW I feel the same way about such uses of read, so that would make
> at least two obsessive loonies... ;)
> 

Well.  I agree.  But the truth is that we do not have a *PORTABLE* Lex
for Common Lisp.  So, if you know that pretty much what is on the line
fits what can be fed to READ, it is simpler to go ahead and do the
dirty thing.

Cheers.


From: Tim Bradshaw
Subject: Re: String manipulation in CLISP
Date: Thu, 15 Feb 2001 13:23:51 +0000
Message-ID: <nkj8zn8oza0.fsf@tfeb.org>

Marco Antoniotti <·······@cs.nyu.edu> writes:

> Well.  I agree.  But the truth is that we do not have a *PORTABLE* Lex
> for Common Lisp.  So, if you know that pretty much what is on the line
> fits what can be fed to READ, it is simpler to go ahead and do the
> dirty thing.
> 

But this is just the danger I'm terrified of.  The people who wrote
sendmail/BIND/blah _knew_ that the stuff they had fit within a buffer
of length x and it was all just OK, so they just wrote the obvious
code.  Unfortunately someone else knew that they knew this too...

Lisp doesn't have these issues, thank God, but it has other issues,
and one of them is using READ on data about which you don't know
enough.  Unless you have a suitably armour-plated READ (and I think
such a thing is more-or-less possible to create) then knowing `pretty
much' what the data is still leaves you vulnerable to catastrophic
magazine explosions.  Even when it doesn't leave you at the bottom of
the North sea you still have the problem that it's much too general --
the original question was to tokenize a string -- gratuitously parsing
it into symbols, numbers &c may not be what you want.

This isn't to say that I don't think READ has its uses -- since it
does all the interesting bits of XML it obviously has uses, I'm just a
bit wary of people saying it's a good way of doing things which it's
really not good at, and which might leave you vulnerable to bad
problems when used without considerable understanding.

--tim

From: Marco Antoniotti
Subject: Re: String manipulation in CLISP
Date: Thu, 15 Feb 2001 15:08:49 +0000
Message-ID: <y6cu25w0yri.fsf@octagon.mrl.nyu.edu>

Tim Bradshaw <···@tfeb.org> writes:

> Marco Antoniotti <·······@cs.nyu.edu> writes:
> 
> > Well.  I agree.  But the truth is that we do not have a *PORTABLE* Lex
> > for Common Lisp.  So, if you know that pretty much what is on the line
> > fits what can be fed to READ, it is simpler to go ahead and do the
> > dirty thing.
> > 
> 
> But this is just the danger I'm terrified of.  The people who wrote
> sendmail/BIND/blah _knew_ that the stuff they had fit within a buffer
> of length x and it was all just OK, so they just wrote the obvious
> code.  Unfortunately someone else knew that they knew this too...

	...

> This isn't to say that I don't think READ has its uses -- since it
> does all the interesting bits of XML it obviously has uses, I'm just a
> bit wary of people saying it's a good way of doing things which it's
> really not good at, and which might leave you vulnerable to bad
> problems when used without considerable understanding.
> 

Look, I agree with you.  But the problem remains.  If you want to really
"parse" something, you have - at least - to (1) build some form of AST,
and (2) check that the parsed input matches a given specification.

Short of that, you are "unsafe" one way or the other.

I actually judged the intentions of the original poster, by picking up
the lurking argument: "how come I can do this in Perl and the latest
wheel on the block avec forced indentation, and I cannot do it in CL?" :)

Cheers


From: Johann Hibschman
Subject: Re: String manipulation in CLISP
Date: Thu, 15 Feb 2001 19:09:43 +0000
Message-ID: <mthf1voj9k.fsf@astron.berkeley.edu>

Marco Antoniotti writes:

> I actually judged the intentions of the original poster, by picking up
> the lurking argument: "how come I can do this in Perl and the latest
> wheel on the block avec forced indentation, and I cannot do it in CL?" :)

Speaking of which, there is a split-string function in the CLOCC, at
http://clocc.sourceforge.net.

I haven't used it, so I can't comment on its efficiency, but it's
there.

To start a new subthread, what string functions would people want?  I
volunteer to collect any code and forward it to the CLOCC people, for
possible inclusion.  (I've never dealt with them, so I don't know how
hard it is to get anything included.)

The obvious string methods that we're missing are strip (remove
whitespace), rstrip, lstrip (left and right ends), transform (needs
some concept of character sets), find, concatenate (yes, it exists,
but string-cat is a nice abbreviation for (concatenate 'string ...)),
and so on.
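
The string-cat abbreviation, for instance, is nearly a one-liner.  A
minimal sketch, assuming the name STRING-CAT (it is hypothetical, not a
standard function) and that the number of arguments stays below the
implementation's CALL-ARGUMENTS-LIMIT:

```lisp
;; STRING-CAT is a hypothetical convenience name, not a standard function.
;; Because it uses APPLY, the number of strings is bounded by
;; CALL-ARGUMENTS-LIMIT.
(defun string-cat (&rest strings)
  "Concatenate STRINGS into one fresh string."
  (apply #'concatenate 'string strings))

(string-cat "foo" "-" "bar")   ; => "foo-bar"
```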

I should look at Olin Shivers's Scheme string utilities; they're
well-designed, even if I suspect that judicious use of keyword
arguments would help them quite a bit.

-- 
Johann Hibschman                           ······@physics.berkeley.edu

From: Vebjorn Ljosa
Subject: Re: String manipulation in CLISP
Date: Thu, 15 Feb 2001 19:42:59 +0000
Message-ID: <cy31yszspfg.fsf@proto.pvv.ntnu.no>

* Johann Hibschman <······@physics.berkeley.edu>
| The obvious string methods that we're missing are strip (remove
| whitespace), rstrip, lstrip (left and right ends), 

don't STRING-TRIM, STRING-RIGHT-TRIM and STRING-LEFT-TRIM give you
what you want here?
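
for example (the *WHITESPACE* variable is just an illustrative name; each
function takes the bag of characters to strip as its first argument):

```lisp
;; *WHITESPACE* is an illustrative name; which characters count as
;; whitespace is an assumption, adjust the bag as needed.
(defparameter *whitespace* '(#\Space #\Tab #\Newline))

(string-trim       *whitespace* "  hello  ")   ; => "hello"
(string-left-trim  *whitespace* "  hello  ")   ; => "hello  "
(string-right-trim *whitespace* "  hello  ")   ; => "  hello"
```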

-- 
Vebjorn

From: Raymond Wiker
Subject: Re: String manipulation in CLISP
Date: Thu, 15 Feb 2001 19:24:11 +0000
Message-ID: <86d7cj7ns4.fsf@raw.grenland.fast.no>

Johann Hibschman <······@physics.berkeley.edu> writes:

> Marco Antoniotti writes:
> 
> > I actually judged the intentions of the original poster, by picking up
> > the lurking argument: "how come I can do this in Perl and the latest
> > wheel on the block avec forced indentation, and I cannot do it in CL?" :)
> 
> Speaking of which, there is a split-string function in the CLOCC, at
> http://clocc.sourceforge.net.
> 
> I haven't used it, so I can't comment on its efficiency, but it's
> there.
> 
> To start a new subthread, what string functions would people want?  I
> volunteer to collect any code and forward it to the CLOCC people, for
> possible inclusion.  (I've never dealt with them, so I don't know how
> hard it is to get anything included.)
> 
> The obvious string methods that we're missing are strip (remove
> whitespace), rstrip, lstrip (left and right ends), 

        You mean something like string-left-trim, string-right-trim
and string-trim :-? 


-- 
Raymond Wiker
·············@fast.no

From: Johann Hibschman
Subject: Re: String manipulation in CLISP
Date: Thu, 15 Feb 2001 19:28:04 +0000
Message-ID: <mtd7cjoiez.fsf@astron.berkeley.edu>

Raymond Wiker writes:

>> The obvious string methods that we're missing are strip (remove
>> whitespace), rstrip, lstrip (left and right ends), 

>         You mean something like string-left-trim, string-right-trim
> and string-trim :-? 

Yes, of course.  I'm enthusiastic, but inexperienced.  I've done quite
a bit of python work, so I was simply quoting their list of functions,
without checking the standard.  Scratch two off the list.

Although it's still a debatable point whether having aliases to those
functions available under a string package would be useful.

--J

-- 
Johann Hibschman                           ······@physics.berkeley.edu

From: ········@hex.net
Subject: Re: String manipulation in CLISP
Date: Thu, 15 Feb 2001 20:30:09 +0000
Message-ID: <wk4rxvbsfi.fsf@mail.hex.net>

>>>>> "Johann" == Johann Hibschman <······@physics.berkeley.edu>
>>>>> writes:
Johann> Raymond Wiker writes:
>>> The obvious string methods that we're missing are strip (remove
>>> whitespace), rstrip, lstrip (left and right ends),

>> You mean something like string-left-trim, string-right-trim and
>> string-trim :-?

Johann> Yes, of course.  I'm enthusiastic, but inexperienced.  I've
Johann> done quite a bit of python work, so I was simply quoting their
Johann> list of functions, without checking the standard.  Scratch two
Johann> off the list.

Johann> Although it's still a debatable point whether having aliases
Johann> to those functions available under a string package would be
Johann> useful.

The _big_ difference in CL is that strings are merely another sort of
sequence; the "library" of string functions includes all those
designed to work with sequences.  There's no "STRING-SUBSEQ," for
instance; Just Plain SUBSEQ does the job for anything that works as a
sequence.
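
A few standard sequence functions applied directly to a string, as a
quick illustration (the variable name *TEXT* is only for the example):

```lisp
;; *TEXT* is an illustrative name; all functions below are standard
;; sequence functions, none of them string-specific.
(defparameter *text* "hello world")

(subseq *text* 6)          ; => "world"
(position #\o *text*)      ; => 4
(count #\o *text*)         ; => 2
(remove #\l *text*)        ; => "heo word"
(search "wor" *text*)      ; => 6
(reverse "abc")            ; => "cba"
```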

-- 
(reverse (concatenate 'string ····················@" "454aa"))
http://www.ntlug.org/~cbbrowne/lisp.html
"Bawden is misinformed.  Common Lisp has no philosophy.  We are held
together only by a shared disgust for all the alternatives."
-- Scott Fahlman, explaining why Common Lisp is the way it is....

From: Raymond Wiker
Subject: Re: String manipulation in CLISP
Date: Thu, 15 Feb 2001 19:35:07 +0000
Message-ID: <868zn77n9w.fsf@raw.grenland.fast.no>

Johann Hibschman <······@physics.berkeley.edu> writes:

> Raymond Wiker writes:
> 
> >> The obvious string methods that we're missing are strip (remove
> >> whitespace), rstrip, lstrip (left and right ends), 
> 
> >         You mean something like string-left-trim, string-right-trim
> > and string-trim :-? 
> 
> Yes, of course.  I'm enthusiastic, but inexperienced.  I've done quite
> a bit of python work, so I was simply quoting their list of functions,
> without checking the standard.  Scratch two off the list.

        Heh. The only reason I knew of these, is that I was reading
CLtL2 last night. It may not be the _definitive_ word on ANSI CL, but
it's a bit handier than the HyperSpec for reading in bed :-)

> Although it's still a debatable point whether having aliases to those
> functions available under a string package would be useful.

-- 
Raymond Wiker
·············@fast.no

From: Hrvoje Niksic
Subject: Re: String manipulation in CLISP
Date: Thu, 15 Feb 2001 13:26:09 +0000
Message-ID: <sxshf1w6pse.fsf@florida.arsdigita.de>

Tim Bradshaw <···@tfeb.org> writes:

> Lisp doesn't have these issues, thank God, but it has other issues,
> and one of them is using READ on data about which you don't know
> enough.  Unless you have a suitably armour-plated READ (and I think
> such a thing is more-or-less possible to create)

What's the big deal with READ, if you disable the obvious `#.'?

From: Tim Bradshaw
Subject: Re: String manipulation in CLISP
Date: Thu, 15 Feb 2001 13:39:58 +0000
Message-ID: <nkj7l2soyj5.fsf@tfeb.org>

Hrvoje Niksic <·······@arsdigita.com> writes:

> 
> What's the big deal with READ, if you disable the obvious `#.'?

Disabling #. is the most important thing.  Other issues are things
like it can return circular structure -- (mapcar ... (read ...)) may
fail to terminate.  It may intern symbols in random packages &c which
may not be desirable. You need to be very sure you know everything
about your readtable and the reader-control variables.  If you're
using it for some constrained purpose, you need to check ruthlessly
that what it returns is something like what you expect it to return
(if you expect a compound object, this check probably needs to do an
occurs check...).

As I said, an armour-plated READ is possible, I think -- there was a
thread a while ago (last year? maybe 1999) where I think I posted
something that claimed to be such a thing.  The trick is to control
the readtable &c, and then to do a walk over the result to check it's
`good'.  And you have to trust the implementation's READ not to blow
up in bad ways -- I have a scheme to write a test-harness that fires
random data at READ for a few hours, but I've not done that yet --
this should not be a huge problem anyway.
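
To give the flavour of it for the list-of-numbers case discussed
earlier, here is a rough sketch.  It is not the code from that older
thread; READ-NUMBER-FIELDS is a hypothetical name, and the checks cover
only #., improper or circular top-level lists, and non-numeric
elements.  Symbols in the input still get interned (in CL-USER, because
of WITH-STANDARD-IO-SYNTAX) before the line is rejected, and anything
after the first closing parenthesis is silently ignored:

```lisp
;; READ-NUMBER-FIELDS is a hypothetical name.  This is a sketch of the
;; "control the reader, then check the result" idea, not a complete
;; armour-plated READ.
(defun read-number-fields (line)
  "Read LINE as a list and return it only if it is a proper list of reals."
  (let ((fields (with-standard-io-syntax
                  ;; WITH-STANDARD-IO-SYNTAX binds *READ-EVAL* to T,
                  ;; so it has to be rebound to NIL inside.
                  (let ((*read-eval* nil))
                    (read-from-string
                     (concatenate 'string "(" line ")"))))))
    ;; LIST-LENGTH returns NIL for a circular list.  For a dotted list it
    ;; should signal an error in safe code, which IGNORE-ERRORS turns
    ;; into NIL; this relies on the implementation doing that check.
    (unless (and (listp fields)
                 (ignore-errors (list-length fields))
                 (every #'realp fields))
      (error "Unexpected data in line: ~S" line))
    fields))

(read-number-fields "1 2 3/4")   ; => (1 2 3/4)
```

A line such as "#.(+ 1 2)" signals a reader error instead of being
evaluated, and a line containing a symbol or a nested list is rejected
by the final check.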

--tim

From: Pierre R. Mai
Subject: Re: String manipulation in CLISP
Date: Thu, 15 Feb 2001 17:38:05 +0000
Message-ID: <87ofw3voci.fsf@orion.bln.pmsf.de>

Marco Antoniotti <·······@cs.nyu.edu> writes:

> Well.  I agree.  But the truth is that we do not have a *PORTABLE* Lex
> for Common Lisp.  So, if you know that pretty much what is on the line
> fits what can be fed to READ, it is simpler to go ahead and do the
> dirty thing.

In the context of this thread, where the goal was to split a
string at whitespace boundaries, I don't think anyone uses LEX in the
C world, unless all of this is part of some larger task, which does
warrant the use of a full-blown lexer generator.

So whether a portable lexer generator is available or not for CL seems
beside the point.  It is very, very easy to just do the right thing in
CL with 4 or 5 lines of code, which is not much more verbose than
read, but obviously much less problematic.

Furthermore there are numerous implementations of portable
partitioning and splitting functions that have been posted to c.l.l
over the course of the years, which will reliably solve this and
related problems in a simple one-liner.

Now if we were talking about writing a complete lexer for a complex
input language, together with a corresponding parser, that would be a
completely different matter, and in that context I could see the use
of either some lexer/parser generator (like e.g. Zebu) or indeed a
suitably set-up call to read.

Regs, Pierre.


From: Lieven Marchand
Subject: Re: String manipulation in CLISP
Date: Thu, 15 Feb 2001 21:01:50 +0000
Message-ID: <m34rxvlkxt.fsf@localhost.localdomain>

"Pierre R. Mai" <····@acm.org> writes:

> In the context of this thread, where the goal was to split a
> string at whitespace boundaries, I don't think anyone uses LEX in the
> C world, unless all of this is part of some larger task, which does
> warrant the use of a full-blown lexer generator.

They'll use strtok which can be used as a poster child in a class for
bletcherous API design.

Apparently the GNU libc man page writer has seen the light:

BUGS
       Never use this function.  This function modifies its first
       argument.   The  identity  of  the delimiting character is
       lost.  This function cannot be used on constant strings.

Of course, this is by no means an exhaustive list of its problems.

-- 
Lieven Marchand <···@wyrd.be>
Glaðr ok reifr skyli gumna hverr, unz sinn bíðr bana.