From: Valeriy E. Ushakov
Subject: Why there's no unread-byte?
Date: 
Message-ID: <7c71vd$h3v$1@news.ptc.spbu.ru>
What is the reason behind having PEEK-CHAR and UNREAD-CHAR in the
standard, but not PEEK-BYTE and UNREAD-BYTE?  If I want to read a file
with possibly mixed CR, LF, and CRLF end-of-lines in binary mode
(meaning C-speak "binary", i.e. stream element type (unsigned-byte 8)),
how can I peek at the possible LF after the CR?
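The obvious workaround seems to be simulating the pushback by hand with
a one-byte buffer around the stream -- something like the sketch below
(all names made up) -- but is there a more idiomatic way?

```lisp
;; Hand-rolled one-byte pushback, in lieu of the missing UNREAD-BYTE.
(defstruct peekable
  stream
  (pushback nil))                    ; NIL, or the one un-read byte

(defun get-byte (p)
  "READ-BYTE with pushback support; returns NIL at end of file."
  (or (shiftf (peekable-pushback p) nil)
      (read-byte (peekable-stream p) nil nil)))

(defun unget-byte (p byte)
  "Push BYTE back so the next GET-BYTE returns it again."
  (assert (null (peekable-pushback p)))
  (setf (peekable-pushback p) byte))

;; Consuming a CR, LF, or CRLF line ending:
(defun skip-eol (p)
  (let ((b (get-byte p)))
    (cond ((eql b 13)                ; CR: peek for a following LF
           (let ((next (get-byte p)))
             (unless (eql next 10)   ; not LF -- put it back
               (when next (unget-byte p next)))))
          ((eql b 10))               ; bare LF
          (t (when b (unget-byte p b))))))
```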

I've been reading the HyperSpec, but I can't put the various pieces
together in a consistent way.  Maybe I'm just missing something obvious?

Thank you.

SY, Uwe
-- 
···@ptc.spbu.ru                         |       Zu Grunde kommen
http://www.ptc.spbu.ru/~uwe/            |       Ist zu Grunde gehen

From: Erik Naggum
Subject: Re: Why there's no unread-byte?
Date: 
Message-ID: <3130103987436896@naggum.no>
* "Valeriy E. Ushakov" <···@ptc.spbu.ru>
| What is the reason behind having PEEK-CHAR and UNREAD-CHAR in the
| standard, but not PEEK-BYTE and UNREAD-BYTE.  If I want to read a file
| with possibly mixed CR, LF and CRLF end-of-lines in binary mode (meaning
| C-speak "binary", i.e. stream element type (unsigned-byte 8)), how can I
| peek at the possible LF after the CR?

  you read the entire file (or a reasonable fraction thereof) into a vector
  with READ-SEQUENCE first.  (I hope we get memory-mapped file I/O soon.)
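
  For instance, a sketch along these lines (helper names invented for
  illustration); once the bytes are in a vector, "peeking" at the LF
  after a CR is just index arithmetic:

```lisp
;; Slurp a binary file (or a chunk of it) into a vector with
;; READ-SEQUENCE, then scan it with AREF lookahead.
(defun read-file-bytes (pathname)
  (with-open-file (in pathname :element-type '(unsigned-byte 8))
    (let ((buf (make-array (file-length in)
                           :element-type '(unsigned-byte 8))))
      (read-sequence buf in)
      buf)))

(defun map-lines (function bytes)
  "Call FUNCTION with the start and end index of each line in BYTES,
treating CR, LF, and CRLF as line terminators."
  (let ((start 0) (i 0) (len (length bytes)))
    (loop while (< i len)
          do (let ((b (aref bytes i)))
               (cond ((= b 10)           ; bare LF
                      (funcall function start i)
                      (incf i)
                      (setf start i))
                     ((= b 13)           ; CR: peek ahead for LF
                      (funcall function start i)
                      (incf i)
                      (when (and (< i len) (= (aref bytes i) 10))
                        (incf i))        ; consume the LF of CRLF
                      (setf start i))
                     (t (incf i))))
          finally (when (< start len)    ; unterminated last line
                    (funcall function start len)))))
```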

#:Erik
From: Valeriy E. Ushakov
Subject: Re: Why there's no unread-byte?
Date: 
Message-ID: <7c789s$jom$1@news.ptc.spbu.ru>
Erik Naggum <····@naggum.no> wrote:
> * "Valeriy E. Ushakov" <···@ptc.spbu.ru>
> | What is the reason behind having PEEK-CHAR and UNREAD-CHAR in the
> | standard, but not PEEK-BYTE and UNREAD-BYTE.  If I want to read a file
> | with possibly mixed CR, LF and CRLF end-of-lines in binary mode (meaning
> | C-speak "binary", i.e. stream element type (unsigned-byte 8)), how can I
> | peek at the possible LF after the CR?

>   you read the entire file (or a reasonable fraction thereof) into a vector
>   with READ-SEQUENCE first.

Thanks for the confirmation.

What really puzzles me is that character streams can be sufficiently
"complex" because of EOL conventions, variable-length byte encodings
(Far-East charsets, UTF-8), additional character attributes (as I
guess from Lisp Machine stories and the discussion of characters in
the HyperSpec) and, perhaps, other factors.  So unread/peek for
character streams is potentially more complex than unread/peek for
binary streams, and I'd guess that implementation complexity was not
the deciding argument.  Was it because binary streams were a new
development at the time, as the existence of the
READ-AND-WRITE-BYTES:NEW-FUNCTIONS cleanup issue suggests?

[The old saying about the determined programmer who can write FORTRAN
programs in any language summarizes, in proverbial form, the fact that
every language (not necessarily a _programming_ language) has its own
ways and its own mindset, which is probably what Chomsky called
"language intuition".  What I'm trying to figure out is the Lisp way.]


>   (I hope we get memory-mapped file I/O soon.)

For what value(s) of "we"?  mmap-sequence?  Or do you mean
implementations doing mmapped I/O behind the scenes?


Thanks.

SY, Uwe
-- 
···@ptc.spbu.ru                         |       Zu Grunde kommen
http://www.ptc.spbu.ru/~uwe/            |       Ist zu Grunde gehen
From: Kent M Pitman
Subject: Re: Why there's no unread-byte?
Date: 
Message-ID: <sfwd82gbu5q.fsf@world.std.com>
"Valeriy E. Ushakov" <···@ptc.spbu.ru> writes:

> What is the reason behind having PEEK-CHAR and UNREAD-CHAR in the
> standard, but not PEEK-BYTE and UNREAD-BYTE.  If I want to read a file
> with possibly mixed CR, LF and CRLF end-of-lines in binary mode
> (meaning C-speak "binary", i.e. stream element type (unsigned-byte 8)),
> how can I peek at the possible LF after the CR?

I suspect it's an accident of history: the people who were dealing with
READ-BYTE mostly wrote binary formats designed so that you didn't have
to do much put-back.  E.g., in the fasl formats I'm familiar with, you
grab an "opcode" and then it tells you definitively what to do next.
You're right that if you go to implement the other (character) stuff,
you need this--but I think this (binary) stuff was probably thought to
be subprimitive.  Also, it's become more popular (for want of a better
word) to re-implement "character I/O" now that character formats have
evolved in ways the spec has not explicitly kept pace with, and
sometimes you have to resort to binary to really make sure you're not
going to lose with the character stuff, so the problem is probably
amplified now over when the original design was done.

I think it's a reasonable criticism that at least one of these two isn't
provided.  Though hopefully by the time we fix them, we'll also
generalize the stream protocol so that providing the equivalent behavior
as user-defined extensions wouldn't interfere with using the self-same
object by tools that didn't need or know how to get the extra
functionality.
 
As Erik points out, READ-SEQUENCE is a compensating force you can use now
to ameliorate the effect of the problem.  It wasn't there at the time the
problem you mention came up, so it's more just "lucky" it's there now.

Still, you may be able to seriously outrace some byte/character-at-a-time
I/O by using READ-SEQUENCE, letting it do I/O in blocks, and then
simulating your character I/O using essentially AREF instead, which, if
you arrange things right via macros or inlined functions, is going to fly
much faster than anything you get out-of-the-box from vendors... at
least, that's my guess.  I've done it this way in at least two
implementations in the past, and I don't have any recent data to say
whether it's still an issue with those, so I won't name which
implementations.  (Partly it may be a constraint of the CL specification
that makes it hard to be totally optimal, since CL says READ-CHAR and
friends are functions, not macros or inlined functions, so privately you
can allow yourself better tools than the vendors have the option of
using.)
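
A sketch of what I mean (all names here are invented, not any vendor's
API): refill a buffer in blocks with READ-SEQUENCE and hand out bytes
with an inlinable accessor.  Peeking then costs one index decrement.

```lisp
;; Block-buffered byte source built on READ-SEQUENCE.
(defstruct (byte-source (:constructor %make-byte-source))
  stream
  (buffer (make-array 8192 :element-type '(unsigned-byte 8)))
  (pos 0)      ; next index to hand out
  (end 0))     ; number of valid bytes in BUFFER

(defun make-byte-source (stream)
  (%make-byte-source :stream stream))

(declaim (inline next-byte))
(defun next-byte (source)
  "Return the next byte from SOURCE, or NIL at end of file."
  (when (>= (byte-source-pos source) (byte-source-end source))
    ;; Buffer exhausted: refill with one block read.
    (setf (byte-source-end source)
          (read-sequence (byte-source-buffer source)
                         (byte-source-stream source))
          (byte-source-pos source) 0)
    (when (zerop (byte-source-end source))
      (return-from next-byte nil)))
  (prog1 (aref (byte-source-buffer source) (byte-source-pos source))
    (incf (byte-source-pos source))))

(defun peek-next-byte (source)
  "Peek at the next byte without consuming it; NIL at end of file."
  (let ((b (next-byte source)))
    ;; NEXT-BYTE leaves POS > 0 within the buffer, so backing up is safe.
    (when b (decf (byte-source-pos source)))
    b))
```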

> I've been reading the HyperSpec, but can't put various pieces together in
> a consistent way.  Maybe I'm just missing something obvious?

I don't think so.  Sounds like you're reading things as they are.
From: David Combs
Subject: Re: Why there's no unread-byte?
Date: 
Message-ID: <dkcombsF8nBvG.89G@netcom.com>
In article <···············@world.std.com>,
Kent M Pitman  <······@world.std.com> wrote:
>"Valeriy E. Ushakov" <···@ptc.spbu.ru> writes:
>
>> What is the reason behind having PEEK-CHAR and UNREAD-CHAR in the
>> standard, but not PEEK-BYTE and UNREAD-BYTE.  If I want to read a file
>> with possibly mixed CR, LF and CRLF end-of-lines in binary mode
>> (meaning C-speak "binary", i.e. stream element type (unsigned-byte 8)),
>> how can I peek at the possible LF after the CR?
>
><snip>
>Still, you may be able to seriously outrace some byte/character-at-a-time
>I/O by using READ-SEQUENCE and letting it do I/O in blocks and then simulating
>your character I/O using essentially AREF instead, which if you arrange
>things right via macros or inlined functions is going to fly much faster
>than anything you get out-of-the-box from vendors... at least, that's my
>guess. I've done it at least in at least two implementations this way 
>at times in the past, and I don't have any recent data to say whether it's
>still an issue with those so I won't name which implementations.  (Partly
>it may be a constraint of the CL specification that makes it hard to be
>totally optimal since CL says READ-CHAR and friends are functions, not macros
>or inlined functions, so privately you can allow yourself better tools
>than the vendors have the option of using.)
>
<snip>

Sample code would be REALLY nice.  Annotated.

Thanks!
From: Tim Bradshaw
Subject: Re: Why there's no unread-byte?
Date: 
Message-ID: <nkjvhg2tz4p.fsf@tfeb.org>
Kent M Pitman <······@world.std.com> writes:

> (Partly
> it may be a constraint of the CL specification that makes it hard to be
> totally optimal since CL says READ-CHAR and friends are functions, not macros
> or inlined functions, so privately you can allow yourself better tools
> than the vendors have the option of using.)

This must be wrong, mustn't it?  As far as I can see, READ-CHAR *can*
be inlined because it's in the CL package, so you aren't allowed to do
anything to it anyway (but perhaps I'm missing something -- I thought
the constraints on what you were allowed to do to things in the CL
package were designed to allow compilers to do various clever
transformations, like inlining).

But apart from that, I can't see any reason other than laziness (or
other things being seen as more important, which is a better reason!)
why CL implementations don't provide good I/O performance using
ordinary streams and READ-CHAR.  Even if you can't inline it, you're
still talking about function-call cost vs moving-disk-head cost, and
my experience of some implementations has definitely been that there
was something a lot worse than that wrong with their I/O; I think it
was usually a dismal failure to buffer streams effectively.

To give them their due (and avoid naming them!) I tried several
systems recently and found that they were really acceptably fast, even
without any tricks like using READ-SEQUENCE & then AREF.

(But does anyone implement READ-SEQUENCE using mmap yet ...).

--tim
From: Kent M Pitman
Subject: Re: Why there's no unread-byte?
Date: 
Message-ID: <sfwzp5e1rc5.fsf@world.std.com>
Tim Bradshaw <···@tfeb.org> writes:

> Kent M Pitman <······@world.std.com> writes:
> 
> > (Partly
> > it may be a constraint of theCL specification that makes it hard to be
> > totally optimal since CL says READ-CHAR and friends are functions, not macros
> > or inlined functions, so privately you can allow yourself better tools
> > than the vendors have the option of using.)
> 
> This must be wrong, mustn't it?.  As far as I can see, READ-CHAR *can*
> be inlined because it's in the CL package, so you aren't allowed to do
> anything to it anyway (but perhaps I'm missing something -- I thought
> the constraints on what you were allowed to do to things in the CL
> package were designed to allow compilers to do various clever
> transformations, like inlining).

You're probably right.  (I'd have to look more carefully at the specific
words to have a stronger sense than "probably", but I'm swayed by your
argument.  For some reason, this particular issue gets unwired from my
brain periodically and someone has to remind me.  There is a competing
clause about how you can't optimize a link on a function call to not
check the function cell, but that's to promote redefinition, which the
system isn't constrained by in the case of redefinition being explicitly
prohibited for system functions.  Of course, in fairness to the vendors,
they are not required to optimize out the links, and systems are much
harder to patch if they do inline this kind of thing--user code would all
have to be recompiled to accept any change to the I/O system.  So it
might not be something systems want to do.  But you're probably right
that they could ...)

> But apart from that, I can't see any reason other than laziness (or
> other thinge being seen as more important, which is a better reason!)
> why CL implementations don't provide good I/O performance using
> ordinary streams and READ-CHAR.  Even if you can't inline it, you're
> still talking about function-call cost vs moving disk head cost, and
> my experience of some implementations has definitely been that there
> was something a lot worse than that wrong with their I/O, and I think
> it was usually dismal failure to buffer streams effectively.

This has been my experience as well.  It has always surprised me, too.
But there are only so many things one has time to track down.  This is a
strong argument for open-sourcing because open-sourcing DOES tend to expose
problems of this particular kind.  (There are disadvantages to open
sourcing discussed separately, so I don't mean to say this is the only
matter.  But it's worth citing this case in context.)

> To give them their due (and avoid naming them!) I tried several
> systems recently and found that they were really acceptably fast, even
> without any tricks like using READ-SEQUENCE & then AREF.

That's good.