From: Kaz Kylheku
Subject: CLISP + Araneida issues.
Date: 
Message-ID: <1167428214.984390.97880@h40g2000cwb.googlegroups.com>
There is some buggy behavior in the interaction between Araneida and
CLISP, whose ultimate consequence is the failure of proper page flow
due to aborted TCP connections.

One way to sweep the buggy behavior under the carpet is to turn on the
use of buffered streams.

Bug 1: unbuffered I/O (Araneida issue)

For some strange reason, the clisp-compat.lisp module in Araneida
requests CLISP to make a :BUFFERED NIL socket stream, which is crazy:
all the reads are one byte at a time! Sucking bits through a straw.
Okay, this is more of a performance problem than an actual bug. But
thanks to this, other bugs are exposed.

Bug 2: line endings (CLISP issue)

Araneida also requests CLISP to make the stream with an ISO-8859-1
encoding, for 8 bit clean transport, and in the encoding it requests a
:UNIX handling for line endings.

However, this :UNIX mode does not work the way Araneida expects. It
expects to receive the individual CR and LF characters. The CLISP
socket stream still does some bogus CR-LF recognition in spite
of the UNIX mode. When READ-CHAR sees a CR, it returns a newline
character to the caller and leaves the following linefeed unread in the
socket.

A strange behavior is seen when the body of a HTTP POST is received
according to the Content-Length. Because CLISP has not read the newline
before the body, the READ-SEQUENCE results in an off-by-one read. The
read() request is performed for the exact Content-Length. But it does
not begin at the start of the body, but rather at the newline which
precedes the body! Somehow the unbuffered CLISP stream realizes this,
and performs an extra one-byte-read to fetch the last byte of the body,
so the string which holds the entire body is filled correctly.

Even though the body is fetched properly, the way it's done is just too
strange. CLISP should just have the proper support for the requested
line endings and not play any silly games. So I regard this as a bug.
In a :UNIX line ending mode, CLISP should just return any carriage
return as a carriage return character to the caller.
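
The sequence of reads described above can be modelled in a few lines. This is only a toy sketch, written in Python for brevity and assuming the stream remembers a pending LF after each CR; the class LazyCrLfReader and its methods are hypothetical names, not anything taken from CLISP.

```python
import io


class LazyCrLfReader:
    """Toy model of the behaviour described above; not CLISP's actual code."""

    def __init__(self, raw: bytes):
        self.raw = io.BytesIO(raw)   # stands in for the unbuffered socket
        self.skip_lf = False         # True after a CR was reported as newline
        self.read_sizes = []         # size of every simulated read() call

    def _read(self, n: int) -> bytes:
        self.read_sizes.append(n)
        return self.raw.read(n)

    def read_char(self) -> str:
        b = self._read(1)
        if self.skip_lf:
            self.skip_lf = False
            if b == b"\n":           # the LF left over from the last CR-LF
                b = self._read(1)
        if b == b"\r":
            self.skip_lf = True      # report newline now, drop the LF later
            return "\n"
        return b.decode("latin-1")

    def read_line(self) -> str:
        chars = []
        while True:
            ch = self.read_char()
            if ch in ("", "\n"):
                return "".join(chars)
            chars.append(ch)

    def read_body(self, n: int) -> bytes:
        data = self._read(n)         # starts at the pending LF, not the body
        if self.skip_lf:
            self.skip_lf = False
            if data[:1] == b"\n":
                data = data[1:] + self._read(1)   # the extra one-byte read
        return data


request = b"POST /f HTTP/1.0\r\nContent-Length: 7\r\n\r\na=1&b=2"
reader = LazyCrLfReader(request)
headers = []
while True:
    line = reader.read_line()
    if line == "":
        break
    headers.append(line)
body = reader.read_body(7)
```

In this model the body comes out intact, but the last two entries of reader.read_sizes are 7 and 1: a full-length read that begins one byte early, followed by a one-byte read to make up the difference.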

Bug 3: Extra data from a browser is not handled properly. (Browser,
Araneida)

I see cases when a browser writes an extra newline after the HTTP POST
(in direct contravention of the HTTP RFCs!) That is to say, the body
is there to the exact content length, and then there is an extra
linefeed. Because the CLISP stream is in an unbuffered mode, it ignores
this newline, even though it probably arrived in the same TCP segment
as the previous bytes.

So what happens is that since Araneida is a HTTP 1.0 implementation
with no persistent connection support, it replies to the request and
then closes the connection. But because it has not read that one last
byte from the offending HTTP client, the underlying TCP stack turns the
close into an abort. A RST segment is sent to the browser, and the
ensuing behavior depends on race conditions. I find that over local
loopback TCP, the page is usually read properly. Between two machines,
the response page usually won't load. The browser either hangs and
times out, or reports an error.

If you turn on buffering for the CLISP stream, the problem goes away.
Or at least its probability becomes much smaller. The CLISP socket
snarfs all of the available data, up to the size of its 4096 byte
buffer.  For the problem to reproduce with buffering, the extra byte
would have to arrive late, in its own TCP segment, so as not to be
included in the buffer fill operation.

There is no easy way to fix this, because a receiver has no control
over the timing of extra erroneous data on the connection. Checking for
it results in a race condition. Still, the server should at least try.
Before doing the close, poll the socket for any pending unreceived
data. Also, doing a half-close wouldn't be a bad idea: just shut down
the socket in the writing direction, and then read data from the client
until /it/ closes, or some reasonable time limit is exceeded. A
half-close won't abort the connection even if there is unread data. I'm
going to try implementing this fix.
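
As a rough illustration of the half-close idea (not the Araneida fix itself), here is a sketch in Python using the ordinary BSD socket calls. The function name close_gracefully and the default time limit are assumptions made for the example.

```python
import socket
import time


def close_gracefully(conn: socket.socket, timeout: float = 2.0) -> int:
    """Half-close, then discard input until the peer closes or time runs out.

    Returns the number of stray bytes that were read and thrown away.
    Hypothetical helper for illustration; the timeout value is arbitrary.
    """
    drained = 0
    deadline = time.monotonic() + timeout
    try:
        conn.shutdown(socket.SHUT_WR)        # send FIN; reading still works
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break                        # time limit exceeded
            conn.settimeout(remaining)
            chunk = conn.recv(4096)
            if not chunk:                    # the peer closed its side
                break
            drained += len(chunk)
    except OSError:                          # includes socket.timeout
        pass
    finally:
        conn.close()
    return drained
```

The race mentioned above is still there: if the client keeps sending after the deadline, the final close can still turn into an abort. The sketch only narrows the window.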

I also started hacking Araneida to do HTTP 1.1 properly, with chunked
transfer-coding and persistent connections, so this won't be an issue
anyway. (If I finish this work, I will test it on various
implementations, with and without threads).

From: ·······@gmail.com
Subject: Re: CLISP + Araneida issues.
Date: 
Message-ID: <1167435437.400376.107080@k21g2000cwa.googlegroups.com>
Kaz Kylheku wrote:

> I also started hacking Araneida to do HTTP 1.1 properly, with chunked
> transfer-coding and persistent connections, so this won't be an issue
> anyway. (If I finish this work, I will test it on various
> implementations, with and without threads).

You might want to look at Chunga for chunked encoding:

  http://weitz.de/chunga/

It is used in Drakma and Hunchentoot.

Cheers,
Edi.
From: Sam Steingold
Subject: Re: CLISP + Araneida issues.
Date: 
Message-ID: <m31wmggj7i.fsf@loiso.podval.org>
> * Kaz Kylheku <········@tznvy.pbz> [2006-12-29 13:36:55 -0800]:
>
> Bug 2: line endings (CLISP issue)
>
> Araneida also requests CLISP to make the stream with an ISO-8859-1
> encoding, for 8 bit clean transport, and in the encoding it requests a
> :UNIX handling for line endings.
>
> However, this :UNIX mode does not work the way Araneida expects. It
> expects to receive the individual CR and LF characters. The CLISP
> socket stream stream still does some bogus CR-LF recognition in spite
> of the UNIX mode. When READ-CHAR sees a CR, it returns a newline
> character to the caller and leaves the following linefeed unread in the
> socket.

this is in line with the Unicode recommendation.

http://clisp.cons.org/impnotes/clhs-newline.html
http://www.unicode.org/reports/tr13/tr13-9.html

> Even though the body is fetched properly, the way it's done is just too
> strange. CLISP should just have the proper support for the requested
> line endings and not play any silly games. So I regard this as a bug.
> In a :UNIX line ending mode, CLISP should just return any carriage
> return as a carriage return character to the caller.

if you get what you want, why is it a bug?
if the compiled code works correctly, but the compilation proceeds along
a path you do not like, would you also report it as a bug?

see also
https://sourceforge.net/tracker/index.php?func=detail&aid=731952&group_id=1355&atid=351355

-- 
Sam Steingold (http://sds.podval.org/) on Fedora Core release 6 (Zod)
http://memri.org http://camera.org http://israelunderattack.slide.com
http://dhimmi.com http://honestreporting.com http://truepeace.org
Profanity is the one language all programmers know best.
From: Kaz Kylheku
Subject: Re: CLISP + Araneida issues.
Date: 
Message-ID: <1167591974.362062.251530@42g2000cwt.googlegroups.com>
Sam Steingold wrote:
> > * Kaz Kylheku <········@tznvy.pbz> [2006-12-29 13:36:55 -0800]:
> >
> > Bug 2: line endings (CLISP issue)
> >
> > Araneida also requests CLISP to make the stream with an ISO-8859-1
> > encoding, for 8 bit clean transport, and in the encoding it requests a
> > :UNIX handling for line endings.
> >
> > However, this :UNIX mode does not work the way Araneida expects. It
> > expects to receive the individual CR and LF characters. The CLISP
> > socket stream stream still does some bogus CR-LF recognition in spite
> > of the UNIX mode. When READ-CHAR sees a CR, it returns a newline
> > character to the caller and leaves the following linefeed unread in the
> > socket.
>
> this is in line with the Unicode recommendation.
>
> http://clisp.cons.org/impnotes/clhs-newline.html

This doesn't talk about the :UNIX stream mode. When programmers request
a :UNIX stream, they expect it to be byte-for-byte. Linefeed characters
terminate lines and are returned as a newline (where #\newline is the
same as #\linefeed), and everything else is passed through.

It's quite obvious that the intent in Araneida is to obtain this kind
of transparent stream, similar to doing an fdopen(socket, "rb") in
POSIX C.

It wants something like a binary stream, but where characters come out
instead of bytes.

Arguably, the app should perhaps use a binary stream, and use
READ-BYTE, but that introduces all kinds of inconvenient extra
processing in a protocol that is largely textual.

> http://www.unicode.org/reports/tr13/tr13-9.html

I don't see where this document recommends that a programming language
library should ignore the request for Unix semantics on a stream. :)

> > Even though the body is fetched properly, the way it's done is just too
> > strange. CLISP should just have the proper support for the requested
> > line endings and not play any silly games. So I regard this as a bug.
> > In a :UNIX line ending mode, CLISP should just return any carriage
> > return as a carriage return character to the caller.
>
> if you get what you want, why is it a bug?

I only get what I want because Araneida's HTTP header parsing is coded
defensively against missing carriage returns in the protocol. If that
weren't the case, nothing would work. The application would be correct
in, for instance, continuing to accumulate a HTTP header line until it
sees a carriage return, causing it to read the entire header as one big
line.
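
To make the point concrete, here is a small Python sketch of the two parsing styles applied to header text whose CR-LF pairs have already been collapsed to bare newlines. The function names and the sample header are made up for the illustration; this is not Araneida's parser.

```python
def tolerant_lines(text: str) -> list:
    # Defensive style: split on LF and drop a trailing CR if one is present.
    return [line.rstrip("\r") for line in text.split("\n")]


def strict_lines(text: str) -> list:
    # Strict style: a header line ends only at a real CR-LF pair.
    return text.split("\r\n")


# What the application sees after the stream has eaten every CR.
munged = "Host: example\nContent-Length: 7\n\n"

tolerant = tolerant_lines(munged)
strict = strict_lines(munged)
```

Here tolerant starts with the two header lines, while strict holds a single element containing the whole header block, which is the "one big line" failure.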

And, also, note that in the body that is being POST'ed, there are no
embedded carriage returns. Would it still work right if it was
arbitrary binary content, rather than a set of text-based parameters
from a form submission?

> if the compiled code works correctly, but the compilation proceeds along
> a path you do not like, would you also report it as a bug?

The operating system calls which an application makes are externally
visible behavior. If an application requests an unbuffered stream, it
means that it wants some control over these system calls.

Changes to externally visible behavior are improper optimization.[1]

But this is moot, because nobody in his right mind would use an
unbuffered stream to try to nail down the exact reads and writes, /and/
also request line ending conversion on that same stream! Who cares what
happens in an unbuffered stream if DOS-like line ending treatment is
requested.

So I'm not saying that CLISP should have some different algorithm for
dealing with DOS line endings in an unbuffered stream, just that it
shouldn't be doing that at all if :UNIX was requested.

---
1. The semantics of the Unix read and write system calls can vary quite
drastically among devices. For instance on a raw tape device, the
behavior might be that reads which are smaller than the block length
are truncated, longer reads are short, and each read advances to the
next block.
From: Pascal Bourguignon
Subject: Re: CLISP + Araneida issues.
Date: 
Message-ID: <87k607lsd2.fsf@thalassa.informatimago.com>
"Kaz Kylheku" <········@gmail.com> writes:

> Sam Steingold wrote:
>> > * Kaz Kylheku <········@tznvy.pbz> [2006-12-29 13:36:55 -0800]:
>> >
>> > Bug 2: line endings (CLISP issue)
>> >
>> > Araneida also requests CLISP to make the stream with an ISO-8859-1
>> > encoding, for 8 bit clean transport, and in the encoding it requests a
>> > :UNIX handling for line endings.
>> >
>> > However, this :UNIX mode does not work the way Araneida expects. It
>> > expects to receive the individual CR and LF characters. The CLISP
>> > socket stream stream still does some bogus CR-LF recognition in spite
>> > of the UNIX mode. When READ-CHAR sees a CR, it returns a newline
>> > character to the caller and leaves the following linefeed unread in the
>> > socket.
>>
>> this is in line with the Unicode recommendation.
>>
>> http://clisp.cons.org/impnotes/clhs-newline.html
>
> This doesn't talk about the :UNIX stream mode. When programmers request
> a :UNIX stream, they expect it to be byte-for-byte. Linefeed characters
> terminate lines and are returned as a newline (where #\newline is the
> same as #\linefeed), and everything else is passed through.
>
> It's quite obvious that the intent in Araneida is to obtain this kind
> of transparent stream, similar to doing an fdopen(socket, "rb") in
> POSIX C.

And what does "rb" mean please?

Yes,   :external-format '(unsigned-byte 8)  !


> It wants something like a binary stream, but where characters come out
> instead of bytes.

What character?  There are no characters in C!

Try:

    char c=48;
    char d=1;
    printf("%d\n",c+d);

and observe how all you have is 1-byte integers.

If you need read macros to implement C ' and C ", you can easily write
them:

(defun string-encode (string external-format)
  ;; CLISP converts with the given encoding; elsewhere fall back to a
  ;; plain ASCII table and ignore EXTERNAL-FORMAT.
  #-clisp (declare (ignorable external-format))
  #+clisp (ext:convert-string-to-bytes string external-format)
  #-clisp (map 'vector
               (lambda (ch)
                 (if (char= #\newline ch)
                     10
                     (let ((code (position ch " !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~")))
                       (if code
                           (+ 32 code)
                           (error "Character ~C is not a standard character, I cannot encode it in ASCII." ch)))))
               string))

(defun monobyte-character-encode (ch external-format)
  (let ((bytes (string-encode (string ch) external-format)))
     (unless (= 1 (length bytes))
        (error "Cannot encode multi-byte characters like ~C" ch))
     (aref bytes 0)))


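;; #\$ must be made a dispatching macro character before sub-characters
;; can be attached to it.  (This takes over $ in the current readtable.)
(make-dispatch-macro-character #\$)

(set-dispatch-macro-character #\$ #\'
    (lambda (stream subchar arg)
       (declare (ignore subchar arg))
       (monobyte-character-encode (read-char stream t nil t)
                         (stream-external-format stream))))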

(set-dispatch-macro-character #\$ #\"
    (lambda (stream subchar arg)
       (declare (ignore arg))           ; SUBCHAR is used below
       (unread-char subchar stream)     ; put the " back so READ sees a string
       (string-encode (read stream t nil t)
                      (stream-external-format stream))))

Now you can write:

 (defconstant +cr+ 13)
 (defconstant +lf+ 10)

 (read-sequence buffer socket)
 (if (equalp $"GET " (subseq buffer 0 4))   ; EQUAL compares non-string vectors with EQ
    (write-sequence $"200 SUCCESS" socket)
    (write-sequence $"404 Not here" socket))
 (write-byte +cr+ socket) (write-byte +lf+ socket)


> Arguably, the app should perhaps use a binary stream, and use
> READ-BYTE, but that introduces all kinds of inconvenient extra
> processing in a protocol that is largely textual.

The protocol is not textual.  


-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

READ THIS BEFORE OPENING PACKAGE: According to certain suggested
versions of the Grand Unified Theory, the primary particles
constituting this product may decay to nothingness within the next
four hundred million years.
From: Joerg Hoehle
Subject: Re: CLISP + Araneida issues.
Date: 
Message-ID: <u1wln5afb.fsf@users.sourceforge.net>
"Kaz Kylheku" <········@gmail.com> writes:
> This doesn't talk about the :UNIX stream mode. When programmers request
> a :UNIX stream, they expect it to be byte-for-byte.
Where does it (what it?) say so?

IMHO, the discussion summarizes as follows:

o Binary that is mostly ASCII is useful in computer practice.
  That's why C is successful at mixing integers and characters.

o Problems arise because CLISP converts CRLF into #\Newline even when
  the programmer wishes for or the idiom requires an exact binary octet
  count (e.g. Content-Length headers).

I'd object against changing CLISP text file I/O semantics.  Automatic
conversion of line-endings into platform-independent single #\Newline
is really a very sane thing to do.  It has avoided many a subtle bug,
in many programming languages.  Common Lisp got this right.

Maybe a fourth I/O mode is needed?
  - but to what character would codes >128 translate to?
  - what about multibyte characters?
Or introduce bivalent streams?

Can someone propose reasonable semantics for such streams?

I think the problem arises because CLISP's behaviour may differ from
the other Lisp implementations.  Which does not mean that one is wrong.

>And, also, note that in the body that is being POST'ed, there are no
>embedded carriage returns. Would it still work right if it was
>arbitrary binary content, rather than a set of text-based parameters
If you would submit an attached file, there likely would be more than
one character shift between Content-Length and actual content.
One shift per CRLF. Sometimes, I've trouble with my old Netscape on
MS-Windows with files from the web for similar reasons.

Thinking a little bit about this, there would be a need for "faithful"
input socket streams but :CRLF output streams (at least for the HTTP
header). This dissociation of input and output features is currently
not present in CLISP.

Regards,
	Jorg Hohle
Telekom/T-Systems Technology Center
From: Kaz Kylheku
Subject: Re: CLISP + Araneida issues.
Date: 
Message-ID: <1169517095.150783.323190@v45g2000cwv.googlegroups.com>
Joerg Hoehle wrote:
> "Kaz Kylheku" <········@gmail.com> writes:
> > This doesn't talk about the :UNIX stream mode. When programmers request
> > a :UNIX stream, they expect it to be byte-for-byte.
> Where does it (what it?) say so?

It's in the Unix culture. Also, codified here:

http://www.opengroup.org/onlinepubs/007908799/xsh/fopen.html

``The character b has no effect, but is allowed for ISO C standard
conformance.''

> I'd object against changing CLISP text file I/O semantics.  Automatic
> conversion of line-endings into platform-independent single #\Newline
> is really a very sane thing to do.

True, but not in cases when the programmer /explicitly/ requests :UNIX
mode via a CLISP-specific extension!

The sane thing to do is to assume that the programmer knows what he or
she is doing, when something is explicitly requested.

> It has avoided many a subtle bug, in many programming languages.
> Common Lisp got this right.

Common Lisp doesn't specify anything about a :UNIX flag that can be
specified as a property in an encoding argument to the :EXTERNAL-FORMAT
of OPEN.

I don't care if CLISP's text streams do munge both LF and CR-LF on
input, by default. Heck, throw in the recognition of VMS records for
good measure.

But if it is specified that a particular operating system's
representation is to be assumed, then /only/ that operating system's
representation should be assumed.

> Maybe a fourth I/O mode is needed?
>   - but to what character would codes >128 translate to?
>   - what about multibyte characters?

CLISP already has encodings to handle these just fine. It's an
orthogonal issue. If I'm reading UTF-8 text and I specified :UNIX mode,
I don't want it to munge CR-LF, even though I do want it to process the
multi-byte encodings and return proper characters.

> I think the problem arises because CLISP's behaviour may differ from
> the other Lisp implementations.  Which does not mean that one is wrong.

Different from others is often as good as wrong, in practice.

> Thinking a little bit about this, there would be a need for "faithful"
> input socket streams but :CRLF output streams (at least for the HTTP
> header). This dissociation of input and output features is currently
> not present in CLISP.

You can always bind two separate streams on the same operating system
object. CLISP exposes all the low-level pieces to do this.
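
The same arrangement is easy to show outside Lisp. As a sketch only, here is the idea in Python: one socket, a byte-faithful input stream and a separate output stream that writes CR-LF line endings. The helper name http_streams is invented for the example; socket.makefile and its newline argument are standard library features.

```python
import socket


def http_streams(sock: socket.socket):
    """Return (reader, writer) bound to the same socket.

    reader: binary and byte-for-byte, so no line-ending conversion on input.
    writer: text with ISO-8859-1 encoding, "\n" written out as CR-LF.
    Hypothetical helper for illustration.
    """
    reader = sock.makefile("rb")
    writer = sock.makefile("w", encoding="latin-1", newline="\r\n")
    return reader, writer
```

The writer is buffered, so it has to be flushed before the response is expected on the wire.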
From: Andrew Reilly
Subject: Re: CLISP + Araneida issues.
Date: 
Message-ID: <pan.2007.01.22.22.24.14.289698@areilly.bpc-users.org>
On Mon, 22 Jan 2007 17:49:12 +0100, Joerg Hoehle wrote:

> Maybe a fourth I/O mode is needed?
>   - but to what character would codes >128 translate to?

values between 128 and 255?

>   - what about multibyte characters?

There are no multi-byte octets.

> Or introduce bivalent streams?

Or nice, simple mechanisms to parse chunks of octets in ways that you
might like.

> Can someone propose reasonable semantics for such streams?

vector of integer [0..255]

I was quite impressed with the way Python handles the cr/lf issue: when
you ask for characters, or read a whole file or some specific number
of bytes into one string, you get octet-for-octet.  Any functions
that overlay "line" semantics on top of that are end-of-line agnostic: any
of the conventions will work. Text output uses the end-of-line convention
of the local platform, but that can be overridden with an optional
argument. (From memory, it's a while since I bothered to investigate the
details.)
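
For what it's worth, a quick sketch of how this looks in current Python 3 (which may differ in detail from the version recalled above): binary reads are octet-for-octet, line splitting accepts any of the conventions, and text output can be given an explicit line ending. The sample data is made up.

```python
import io

data = b"one\r\ntwo\nthree\rfour"

# Binary read: every octet comes back unchanged.
raw = io.BytesIO(data).read()

# Line semantics layered on top: CR-LF, LF and CR all end a line.
lines = data.splitlines()

# Text input with universal newlines: each convention becomes "\n".
text = io.TextIOWrapper(io.BytesIO(data), encoding="latin-1", newline=None).read()

# Text output with the line ending overridden to CR-LF.
sink = io.BytesIO()
out = io.TextIOWrapper(sink, encoding="latin-1", newline="\r\n")
out.write("a\nb\n")
out.flush()
written = sink.getvalue()
```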

Cheers,

-- 
Andrew
From: Pascal Bourguignon
Subject: Re: CLISP + Araneida issues.
Date: 
Message-ID: <877iver8gc.fsf@thalassa.informatimago.com>
Andrew Reilly <···············@areilly.bpc-users.org> writes:

> On Mon, 22 Jan 2007 17:49:12 +0100, Joerg Hoehle wrote:
>
>> Maybe a fourth I/O mode is needed?
>>   - but to what character would codes >128 translate to?
>
> values between 128 and 255?

"Values" between 128 and 255 ARE NOT CHARACTERS.


>>   - what about multibyte characters?
>
> There are no multi-byte octets.

Yes, there are.  On 4-bit processors.  But anyways, we're discussing
characters here, not octets.


>> Or introduce bivalent streams?
>
> Or nice, simple mechanisms to parse chunks of octets in ways that you
> might like.
>
>> Can someone propose reasonable semantics for such streams?
>
> vector of integer [0..255]

Yep.


And in lisp, it's trivial to work with vectors of bytes instead of
strings of characters: most of the functions are the same, since
they're sequence or vector functions.  The only functions you need to
rewrite to use vectors of bytes instead of strings of characters are:


Accessor CHAR, SCHAR ; not really, we have AREF and SVREF.

Function STRING      ; not really, we have VECTOR, but you could
                     ; write a BYTE-VECTOR function to convert
                     ; from character, byte, string and symbol
                     ; to byte vectors.


Function STRING-UPCASE, STRING-DOWNCASE, STRING-CAPITALIZE,
NSTRING-UPCASE, NSTRING-DOWNCASE, NSTRING-CAPITALIZE

Function STRING-TRIM, STRING-LEFT-TRIM, STRING-RIGHT-TRIM

Function STRING=, STRING/=, STRING<, STRING>, STRING<=, STRING>=,
STRING-EQUAL, STRING-NOT-EQUAL, STRING-LESSP, STRING-GREATERP,
STRING-NOT-GREATERP, STRING-NOT-LESSP

Function STRINGP     ; There's VECTORP, 
                     ; you could add a BYTE-VECTOR-P

Function MAKE-STRING ; We could have a MAKE-BYTE-VECTOR

A reader macro to use instead of " to read bytes.
eg. #"Hello" --> #(72 101 108 108 111)


and you'll probably want a BYTE-FORMAT function to encapsulate FORMAT.




-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

PUBLIC NOTICE AS REQUIRED BY LAW: Any use of this product, in any
manner whatsoever, will increase the amount of disorder in the
universe. Although no liability is implied herein, the consumer is
warned that this process will ultimately lead to the heat death of
the universe.
From: Pascal Bourguignon
Subject: Re: CLISP + Araneida issues.
Date: 
Message-ID: <878xgom1z1.fsf@thalassa.informatimago.com>
Sam Steingold <···@gnu.org> writes:

>> * Kaz Kylheku <········@tznvy.pbz> [2006-12-29 13:36:55 -0800]:
>>
>> Bug 2: line endings (CLISP issue)
>>
>> Araneida also requests CLISP to make the stream with an ISO-8859-1
>> encoding, for 8 bit clean transport, and in the encoding it requests a
>> :UNIX handling for line endings.
>>
>> However, this :UNIX mode does not work the way Araneida expects. It
>> expects to receive the individual CR and LF characters. The CLISP
>> socket stream stream still does some bogus CR-LF recognition in spite
>> of the UNIX mode. When READ-CHAR sees a CR, it returns a newline
>> character to the caller and leaves the following linefeed unread in the
>> socket.
>
> this is in line with the Unicode recommendation.
>
> http://clisp.cons.org/impnotes/clhs-newline.html
> http://www.unicode.org/reports/tr13/tr13-9.html

Which is all right with text stream.  The error of the OP (the
libraries he uses) is that he uses a text stream to implement a binary
protocol such as HTTP.

HTTP sockets must be opened as binary streams!


-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

"Debugging?  Klingons do not debug! Our software does not coddle the
weak."
From: Kaz Kylheku
Subject: Re: CLISP + Araneida issues.
Date: 
Message-ID: <1167595648.876655.262060@k21g2000cwa.googlegroups.com>
Pascal Bourguignon wrote:
> > http://clisp.cons.org/impnotes/clhs-newline.html
> > http://www.unicode.org/reports/tr13/tr13-9.html
>
> Which is all right with text stream.

It's not all right to be doing DOS line ending conversion on a text
stream for which UNIX mode was requested.

Since that request is made using special CLISP extensions, any
reasoning about standard Common Lisp text streams does not apply.

> The error of the OP (the
> libraries he uses) is that he uses a text stream to implement a binary
> protocol such as HTTP.

HTTP is a hybrid protocol. Most of the HTTP grammar generates strings
which are text. However, some grammar productions make use of a
syntactic unit called OCTET.

See ``2.2. Basic Rules'' in the HTTP 1.1 RFC.

> HTTP sockets must be opened as binary streams!

Because HTTP is largely text based, with binary data being the
exception, the most convenient tool for dealing with HTTP is a clean
stream of 8 bit characters, which can be called upon to serve as octets
when interpreting or generating binary passages.

So, the design decision in Araneida is sensible.
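
A small sketch of what "8 bit characters serving as octets" means in practice, shown in Python because ISO-8859-1 maps each of the 256 octet values to the character with the same code. The sample message is invented; nothing here is Araneida code.

```python
# Every octet survives a trip through ISO-8859-1 characters and back.
all_octets = bytes(range(256))
as_chars = all_octets.decode("latin-1")
round_trip = as_chars.encode("latin-1")

# Textual header parsed as characters, binary body recovered as octets.
raw = b"Content-Length: 3\r\n\r\n\x00\xff\r"
head, _, body = raw.decode("latin-1").partition("\r\n\r\n")
length = int(head.split(": ")[1])
body_octets = body.encode("latin-1")[:length]
```

This only works if the stream passes CR and LF through untouched, which is the whole point of asking for :UNIX mode.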
From: Pascal Bourguignon
Subject: Re: CLISP + Araneida issues.
Date: 
Message-ID: <87bqljlrsi.fsf@thalassa.informatimago.com>
"Kaz Kylheku" <········@gmail.com> writes:
> Because HTTP is largely text based, with binary data being the
> exception, the most convenient tool for dealing with HTTP is a clean
> stream of 8 bit characters, which can be called upon to serve as octets
> when interpreting or generating binary passages.

See?  You're asking for something that doesn't exist!  
"8 bit character" is an oxymoron.
Either it's a character, or it's an octet.

Write it as: "The most convenient tool for dealing with HTTP is a
clean stream of 8 bit octets."  That is, a binary stream.



-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
Our enemies are innovative and resourceful, and so are we. They never
stop thinking about new ways to harm our country and our people, and
neither do we. -- Georges W. Bush