getting a simple-base-string from read-line

From: Jan Rychter
Subject: getting a simple-base-string from read-line
Date: Tue, 13 Dec 2005 12:09:10 +0000
Message-ID: <m2acf5p5ax.fsf@tnuctip.rychter.com>

I've been doing some reading from files in SBCL and I was hoping I could
have read-line return a simple-base-string, since that is what my files
contain, and since I really need the input processing to be fast.

So, here's what I do:

(with-open-file (s "abc" :element-type 'base-char)
  (let ((l (read-line s)))
    (format t "~&~A ~A" (type-of l) (stream-element-type s))))

the results are surprising:

clisp and cmucl produce:
  (SIMPLE-BASE-STRING 3) CHARACTER

SBCL (0.9.7):
  (SIMPLE-ARRAY CHARACTER (3)) CHARACTER

ECL:
  (SIMPLE-STRING 3) BASE-CHAR

AllegroCL 7.0 trial:
  (SIMPLE-ARRAY CHARACTER (3)) CHARACTER

... while I would have expected:
  (SIMPLE-BASE-STRING 3) BASE-CHAR

My particular problem that triggered all this was that I declared my
lines to be of type simple-base-string in SBCL and SBCL complained loudly
that they aren't.

What gives?

--J.

Re: getting a simple-base-string from read-line Thomas A. Russ
- Re: getting a simple-base-string from read-line Duane Rettig
- Re: getting a simple-base-string from read-line Jan Rychter
  - Re: getting a simple-base-string from read-line Christophe Rhodes
    - Re: getting a simple-base-string from read-line Jan Rychter
      - Re: getting a simple-base-string from read-line Christophe Rhodes
  - Re: getting a simple-base-string from read-line Pascal Bourguignon

From: Thomas A. Russ
Subject: Re: getting a simple-base-string from read-line
Date: Tue, 13 Dec 2005 18:09:13 +0000
Message-ID: <ymiek4goomu.fsf@sevak.isi.edu>

Jan Rychter <···@rychter.com> writes:

> 
> I've been doing some reading from files in SBCL and I was hoping I could
> have read-line return a simple-base-string, since that is what my files
> contain, and since I really need the input processing to be fast.
> 
> So, here's what I do:
> 
> (with-open-file (s "abc" :element-type 'base-char)
>   (let ((l (read-line s)))
>     (format t "~&~A ~A" (type-of l) (stream-element-type s))))
> 
> the results are surprising:
> 
> clisp and cmucl produce:
>   (SIMPLE-BASE-STRING 3) CHARACTER
> 
> SBCL (0.9.7):
>   (SIMPLE-ARRAY CHARACTER (3)) CHARACTER
> 
> ECL:
>   (SIMPLE-STRING 3) BASE-CHAR
> 
> AllegroCL 7.0 trial:
>   (SIMPLE-ARRAY CHARACTER (3)) CHARACTER
> 
> ... while I would have expected:
>   (SIMPLE-BASE-STRING 3) BASE-CHAR
> 
> My particular problem that triggered all this was that I declared my
> lines to be of type simple-base-string in SBCL and SBCL compained loudly
> that they aren't.
> 
> What gives?

READ-LINE is not required to return anything of a more specific type
than STRING.  If you look at MCL, it gets even worse, since what is
returned is actually a string with a fill pointer.

The reason for this flexibility is to allow implementations to choose
various optimizations for the way they do I/O and also for how they
choose to represent strings.  The actual type of the array that is used
for string representations varies among implementations, usually for
very good (internal) reasons.

Unfortunately, this is an area where any optimizations or declarations
that you use will either have to be implementation-specific, or you will
have to use COERCE to make sure you get things of the type you expect.

To make the types implementation-specific, the easiest way would be to
use a DEFTYPE with the appropriate #+, #- coding to define the type and
then use that in various declarations.  For example:

(deftype readline-string ()
  #+(or :cmucl :clisp)            'SIMPLE-BASE-STRING
  #+:sbcl                         '(SIMPLE-ARRAY CHARACTER *)
  #+:ecl                          'SIMPLE-STRING
  #-(or :cmucl :clisp :sbcl :ecl) 'STRING)

(defun process-string (s)
   (declare (type readline-string s)))
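
For the COERCE route, a minimal sketch might look like the following.
READ-BASE-LINE is a hypothetical name, not something any implementation
provides, and it assumes every character on the line is a BASE-CHAR;
otherwise COERCE should signal an error.

```lisp
;; Hypothetical helper: like READ-LINE, but the result is always a
;; SIMPLE-BASE-STRING, whatever string type the implementation returns.
(defun read-base-line (stream)
  "Return the next line of STREAM as a SIMPLE-BASE-STRING, or NIL at EOF."
  (let ((line (read-line stream nil nil)))
    (when line
      ;; COERCE returns LINE itself if it already has the requested type,
      ;; and a fresh copy otherwise (the copy costs time and memory).
      (coerce line 'simple-base-string))))
```

The result can then be declared SIMPLE-BASE-STRING portably, at the price
of one extra copy per line on implementations whose READ-LINE returns some
other string type.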

-- 
Thomas A. Russ,  USC/Information Sciences Institute

From: Duane Rettig
Subject: Re: getting a simple-base-string from read-line
Date: Wed, 14 Dec 2005 08:23:15 +0000
Message-ID: <o0lkyo3x58.fsf@franz.com>

···@sevak.isi.edu (Thomas A. Russ) writes:

> To make the types implementation specific, the easiest way would be to
> use a DEFTYPE with the appropriate #+, #- coding to define the type and
> then use that in various declarations.  For example:
>
> (deftype readline-string #+(or :cmucl :clisp) SIMPLE-BASE-STRING
>                          #+:sbcl              (SIMPLE-ARRY CHARACTER *)
=======================================================================^
> 	                 #+:ecl               SIMPLE-STRING
>                          #-(or :cmucl :clisp :sbcl) STRING)


Two points:

 1. Why (in the flagged line) an unspecified dimensionality? Strings
are always one-dimensional.

 2. You forgot Allegro CL:

                         #+:allegro           (SIMPLE-ARRAY CHARACTER (*))

-- 
Duane Rettig    ·····@franz.com    Franz Inc.  http://www.franz.com/
555 12th St., Suite 1450               http://www.555citycenter.com/
Oakland, Ca. 94607        Phone: (510) 452-2000; Fax: (510) 452-0182

From: Jan Rychter
Subject: Re: getting a simple-base-string from read-line
Date: Wed, 14 Dec 2005 14:06:11 +0000
Message-ID: <m27ja7db8s.fsf@tnuctip.rychter.com>

Thomas A. Russ:
> Jan Rychter <···@rychter.com> writes:
> > 
> > I've been doing some reading from files in SBCL and I was hoping I could
> > have read-line return a simple-base-string, since that is what my files
> > contain, and since I really need the input processing to be fast.
> > 
> > So, here's what I do:
> > 
> > (with-open-file (s "abc" :element-type 'base-char)
> >   (let ((l (read-line s)))
> >     (format t "~&~A ~A" (type-of l) (stream-element-type s))))
> > 
> > the results are surprising:
> > 
> > clisp and cmucl produce:
> >   (SIMPLE-BASE-STRING 3) CHARACTER
> > 
> > SBCL (0.9.7):
> >   (SIMPLE-ARRAY CHARACTER (3)) CHARACTER
> > 
> > ECL:
> >   (SIMPLE-STRING 3) BASE-CHAR
> > 
> > AllegroCL 7.0 trial:
> >   (SIMPLE-ARRAY CHARACTER (3)) CHARACTER
> > 
> > ... while I would have expected:
> >   (SIMPLE-BASE-STRING 3) BASE-CHAR
> > 
> > My particular problem that triggered all this was that I declared my
> > lines to be of type simple-base-string in SBCL and SBCL compained loudly
> > that they aren't.
> > 
> > What gives?
> 
> READ-LINE is not required to return anything more specific type than
> STRING.  If you look at MCL, it gets even worse, since what is returned
> is actually a string with a fill pointer.
> 
> The reason for this flexibility is to allow implementations to choose
> various optimizations for the way they do I/O and also for how they
> choose to represent strings.  The actual type of the array that is used
> for string representations varies among implementations, usually for
> very good (internal) reasons.
> 
> Unfortunately, this is an area where any optimizations or declarations
> that you use will have to either be implementation-specific or you will
> have to use COERCE to make sure you get things of the type you expect.
> 
> To make the types implementation specific, the easiest way would be to
> use a DEFTYPE with the appropriate #+, #- coding to define the type and
> then use that in various declarations.  For example:
> 
> (deftype readline-string #+(or :cmucl :clisp) SIMPLE-BASE-STRING
>                          #+:sbcl              (SIMPLE-ARRY CHARACTER *)
> 	                 #+:ecl               SIMPLE-STRING
>                          #-(or :cmucl :clisp :sbcl) STRING)
> 
> 
> (defun process-string (s)
>    (declare (type readline-string s)))

Thanks for the explanation. I don't really see the logic behind this,
though: READ-LINE reads from a stream that has a certain element
type. I can't see why it wouldn't return an array with this precise
element type -- SIMPLE-BASE-STRING in this case (since I used BASE-CHAR
as the element type.)

However, your solution doesn't really solve my problem. My goal was to
optimize reading from files. Profiling shows that most of the time is
being spent in two functions, and these mostly process strings. From
what I understand, sticking to SIMPLE-BASE-STRING instead of the more
complex types would improve things under SBCL, and notes from the
compiler seem to confirm that:

   unable to
     optimize
   due to type uncertainty:
     The first argument is a STRING, not a SIMPLE-BASE-STRING.

If I have to write my own read-line, I will consider all this a failure...

--J.

From: Christophe Rhodes
Subject: Re: getting a simple-base-string from read-line
Date: Wed, 14 Dec 2005 14:22:05 +0000
Message-ID: <sq8xunbvxu.fsf@cam.ac.uk>

Jan Rychter <···@rychter.com> writes:

> Thanks for the explanation. I don't really see the logic behind this,
> though: READ-LINE reads from a stream, that has a certain element
> type. I can't see why it wouldn't return an array with this precise
> element type -- SIMPLE-BASE-ARRAY in this case (since I used BASE-CHAR
> as the element type.)

Well, one thing is that the implementation is permitted to upgrade
stream element types much as it is permitted to upgrade array element
types: and indeed if you ask sbcl for the stream-element-type of your
stream, it will tell you CHARACTER, not BASE-CHAR.  It's possible that
there's a lost opportunity for higher performance here, but it might
be a little tricky to get that right.  Ask on sbcl-devel if it turns
out that this is what you actually need...

> However, your solution doesn't really solve my problem. My goal was to
> optimize reading from files. Profiling shows that most of the time is
> being spent in two functions, and these mostly process strings. From
> what I understand, sticking to SIMPLE-BASE-STRING instead of the more
> complex types would improve things under SBCL, and notes from the
> compiler seem to confirm that:
>
>    unable to
>      optimize
>    due to type uncertainty:
>      The first argument is a STRING, not a SIMPLE-BASE-STRING.
>
> If I have to write my own read-line, I will consider all this a failure...

I've slightly lost track of your requirements here, for example with
respect to portability, robustness, and similar such issues.  Speaking
only in terms of SBCL here, READ-LINE at the moment returns only
objects of (SIMPLE-ARRAY CHARACTER (*)), unless you've hit EOF.  So
you can declare the objects you're processing as being of that type,
perhaps with an explicit coercion beforehand for future-proofing.
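
A sketch of that suggestion follows.  COUNT-LINES-WITH-PREFIX is a
hypothetical example function, and the declared type is simply the one
reported for SBCL above; it may differ in other versions or
implementations.

```lisp
;; Hypothetical example: coerce once per line, then declare the exact type
;; so the compiler can specialize the string operations that follow.
(defun count-lines-with-prefix (stream prefix)
  "Count the lines read from STREAM that begin with the string PREFIX."
  (let ((n 0)
        (plen (length prefix)))
    (loop for raw = (read-line stream nil nil)
          while raw
          do (let ((line (coerce raw '(simple-array character (*)))))
               ;; Assumption: this is the type READ-LINE returns in SBCL, in
               ;; which case COERCE just hands back the same object.
               (declare (type (simple-array character (*)) line))
               (when (and (>= (length line) plen)
                          (string= prefix line :end2 plen))
                 (incf n))))
    n))
```

If a later release returns a different string type, the COERCE copies the
line instead of the declaration silently becoming wrong.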

It's possible, also, that there are code transformations that aren't
being performed on your data objects that could be.  Some more details
would be appreciated, such as the kind of processing you're doing, and
ideally some profiling data, preferably using both the function-based
and the sampling profilers.

As I understand it, though, you're not actually after optimizing the
IO, but after optimizing the processing of the results of IO?  Is that
right?

Christophe

From: Jan Rychter
Subject: Re: getting a simple-base-string from read-line
Date: Wed, 14 Dec 2005 16:43:31 +0000
Message-ID: <m2k6e7bpe4.fsf@tnuctip.rychter.com>

Christophe:
> Jan Rychter <···@rychter.com> writes:
> > Thanks for the explanation. I don't really see the logic behind this,
> > though: READ-LINE reads from a stream, that has a certain element
> > type. I can't see why it wouldn't return an array with this precise
> > element type -- SIMPLE-BASE-ARRAY in this case (since I used BASE-CHAR
> > as the element type.)
> 
> Well, one thing is that the implementation is permitted to upgrade
> stream element types much as it is permitted to upgrade array element
> types: and indeed if you ask sbcl for the stream-element-type of your
> stream, it will tell you CHARACTER, not BASE-CHAR.  It's possible that
> there's a lost opportunity for higher performance here, but it might
> be a little tricky to get that right.  Ask on sbcl-devel if it turns
> out that this is what you actually need...

Ok, understood.

> > However, your solution doesn't really solve my problem. My goal was to
> > optimize reading from files. Profiling shows that most of the time is
> > being spent in two functions, and these mostly process strings. From
> > what I understand, sticking to SIMPLE-BASE-STRING instead of the more
> > complex types would improve things under SBCL, and notes from the
> > compiler seem to confirm that:
> >
> >    unable to
> >      optimize
> >    due to type uncertainty:
> >      The first argument is a STRING, not a SIMPLE-BASE-STRING.
> >
> > If I have to write my own read-line, I will consider all this a failure...
> 
> I've slightly lost track of your requirements here, for example with
> respect to portability, robustness, and similar such issues.  Speaking
> only in terms of SBCL here, READ-LINE at the moment returns only
> objects of (SIMPLE-ARRAY CHARACTER (*)), unless you've hit EOF.  So
> you can declare the objects you're processing as being of that type,
> perhaps with an explicit coercion beforehand for future-proofing.
> 
> It's possible, also, that there are code transformations that aren't
> being performed on your data objects that could be.  Some more details
> would be appreciated, such as the kind of processing you're doing, and
> ideally some profiling data, preferably using both the function-based
> and the sampling profilers.
> 
> As I understand it, though, you're not actually after optimizing the
> IO, but after optimizing the processing of the results of IO?  Is that
> right?

I'm reading and parsing PDB files. Those are pure ASCII files, in a
columnized format, that Fortran people just read using a single READ
plus a FORMAT spec. Here's an example line:

ATOM      3  O2P CYT D   2      26.535  -9.106   4.384  1.00  0.00      DNA1

What I'm after is whatever will make this job happen faster. Right now
what I do is roughly: 1) read-line, 2) perform a series of
read-from-strings mixed with subseqs, with some whitespace trimming and
cl-ppcre helping in some cases. The result is usually one CLOS object
per line or per N lines.

These files can be rather large, easily reaching 20 million lines (1.5GB).

So it's really the parsing that I'd like to optimize. And I was hoping
to do it without thinking too much, hence my interest in
SIMPLE-BASE-STRINGS.

As for other requirements: portability is nice and I'd like to maintain
it whenever possible. Robustness isn't important, those files are very
well behaved and anything I do in CL will be more robust than Fortran
code.

--J.

From: Christophe Rhodes
Subject: Re: getting a simple-base-string from read-line
Date: Wed, 14 Dec 2005 16:58:18 +0000
Message-ID: <sqzmn3aa51.fsf@cam.ac.uk>

Jan Rychter <···@rychter.com> writes:

> I'm reading and parsing PDB files. Those are pure ASCII files, in a
> columnized format, that Fortran people just read using a single READ
> plus a FORMAT spec. Here's an example line:
>
> ATOM      3  O2P CYT D   2      26.535  -9.106   4.384  1.00  0.00      DNA1
>
> What I'm after is whatever will make this job happen faster. Right now
> what I do is roughly: 1) read-line, 2) perform a series of
> read-from-strings mixed with subseqs, with some whitespace trimming and
> cl-ppcre helping in some cases. The result is usually one CLOS object
> per line or per N lines.
>
> These files can be rather large, easily reaching 20 million lines (1.5GB).
>
> So it's really the parsing that I'd like to optimize. And I was hoping
> to do it without thinking too much, hence my interest in
> SIMPLE-BASE-STRINGS.

There shouldn't be a noticeable difference in speed between using
SIMPLE-BASE-STRINGS and (SIMPLE-ARRAY CHARACTER (*))s; the most likely
effect arises from cache pressure because all the data is four times
larger, but I would expect that to be relatively minor compared with
the effort involved in parsing a float.

READ-FROM-STRING and SUBSEQ are both potentially expensive function
calls: SUBSEQ always returns fresh sequences, so it has to allocate
the result; READ-FROM-STRING must both decide what kind of object it's
going to return, and then create that object.  Float parsing is a
little bit of an art; efficient ways to implement it are known, though
looking at your data makes me suspect that your data file has
fixed-width printed representation of floats rather than the real
thing.

Without seeing any profiling or other data, though, all this is
speculation.
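
One way to avoid both SUBSEQ and READ-FROM-STRING is to parse the fixed
columns in place.  The sketch below is only an illustration:
PARSE-FIXED-DECIMAL and PARSE-ATOM-COORDINATES are hypothetical helpers,
the column offsets are assumptions read off the sample ATOM line quoted
above, and there is no exponent support or serious validation.  The :START
and :END arguments to PARSE-INTEGER are standard.

```lisp
;; Hypothetical helper: parse a plain decimal such as "  -9.106" directly
;; from LINE between START and END, without allocating a substring.
(defun parse-fixed-decimal (line start end)
  "Return the decimal number in LINE[START, END) as a DOUBLE-FLOAT."
  (declare (type string line) (type fixnum start end))
  (let ((sign 1) (mantissa 0) (scale 1) (seen-point nil))
    (loop for i from start below end
          for ch = (char line i)
          do (cond ((char= ch #\Space))            ; skip column padding
                   ((char= ch #\-) (setf sign -1))
                   ((char= ch #\+))
                   ((char= ch #\.) (setf seen-point t))
                   (t (let ((d (digit-char-p ch)))
                        (unless d
                          (error "Unexpected character ~S at index ~D" ch i))
                        ;; Accumulate all digits as one integer and remember
                        ;; how many of them came after the decimal point.
                        (setf mantissa (+ (* 10 mantissa) d))
                        (when seen-point
                          (setf scale (* 10 scale)))))))
    (* sign (/ (coerce mantissa 'double-float) scale))))

;; Assumed 0-based column ranges: serial 6-11, x 30-38, y 38-46, z 46-54.
(defun parse-atom-coordinates (line)
  "Return the serial number and the x, y, z coordinates of an ATOM line."
  (values (parse-integer line :start 6 :end 11)
          (parse-fixed-decimal line 30 38)
          (parse-fixed-decimal line 38 46)
          (parse-fixed-decimal line 46 54)))
```

Whether this actually beats the READ-FROM-STRING version would still have
to be confirmed by profiling.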

Christophe

From: Pascal Bourguignon
Subject: Re: getting a simple-base-string from read-line
Date: Wed, 14 Dec 2005 16:03:53 +0000
Message-ID: <87psnz1x92.fsf@thalassa.informatimago.com>

Jan Rychter <···@rychter.com> writes:
> However, your solution doesn't really solve my problem. My goal was to
> optimize reading from files. Profiling shows that most of the time is
> being spent in two functions, and these mostly process strings. From
> what I understand, sticking to SIMPLE-BASE-STRING instead of the more
> complex types would improve things under SBCL, and notes from the
> compiler seem to confirm that:
>
>    unable to
>      optimize
>    due to type uncertainty:
>      The first argument is a STRING, not a SIMPLE-BASE-STRING.
>
> If I have to write my own read-line, I will consider all this a failure...

If your "optimize" means "fast I/O as in C", then you must do no more
processing than in C, that is, you must forget about characters, since
they don't exist in C.  Try :element-type '(unsigned-byte 8) and
read-sequence / write-sequence.  And yes, you'll need to implement
your own READ-BYTE-LINE.
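
A rough sketch of that approach, under stated assumptions: MAP-BYTE-LINES
is a hypothetical function, lines are terminated by octet 10 (linefeed),
and no line is longer than the buffer.  Rather than returning each line,
it hands the callback the shared buffer plus start and end indices, so
nothing is allocated per line.

```lisp
;; Hypothetical helper: read octets in large chunks with READ-SEQUENCE and
;; call FN with (BUFFER START END) for every line found in the chunk.
(defun map-byte-lines (fn stream &key (buffer-size 65536))
  "Call FN on each line of the (UNSIGNED-BYTE 8) input STREAM."
  (let ((buffer (make-array buffer-size :element-type '(unsigned-byte 8)))
        (filled 0))                     ; octets carried over from last chunk
    (loop
      (let ((end (read-sequence buffer stream :start filled)))
        (when (= end filled)            ; nothing new was read: end of file
          (when (plusp filled)          ; final line without a linefeed
            (funcall fn buffer 0 filled))
          (return))
        (let ((line-start 0))
          (loop for nl = (position 10 buffer :start line-start :end end)
                while nl
                do (funcall fn buffer line-start nl)
                   (setf line-start (1+ nl)))
          ;; Move the incomplete trailing line to the front of the buffer.
          (replace buffer buffer :start1 0 :start2 line-start :end2 end)
          (setf filled (- end line-start))
          (when (= filled buffer-size)
            (error "Line longer than the ~D-octet buffer" buffer-size)))))))
```

The callback must finish with the octets before it returns, because the
buffer is reused for the next chunk.  The stream would be opened with
:element-type '(unsigned-byte 8).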

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

"You can tell the Lisp programmers.  They have pockets full of punch
 cards with close parentheses on them." --> http://tinyurl.com/8ubpf