AServe + LispWorks bivalent-socket Problem solved

From: Jochen Schmidt
Subject: AServe + LispWorks bivalent-socket Problem solved
Date: Thu, 05 Apr 2001 00:32:30 +0000
Message-ID: <9age4r$51cns$1@ID-22205.news.dfncis.de>

Just this minute AllegroServe's web server ran for the first time and showed me
images as it should.
I've rewritten parts of it (e.g. htmlgen.cl and main.cl) to use binary
transfers for everything. Text is converted into binary values and then
transmitted as binary data.
I'll now run some further tests, and if all goes well we will soon have a
complete AllegroServe for LispWorks.
As I've read on the SourceForge pages for AllegroServe, John Foderaro
wants ports to other Lisps to reside on other web pages and to be
managed separately from the main AllegroServe. As the newer versions seem
to include more ACL-specific stuff, I may offer my port on my web page,
http://www.dataheaven.de (I'll make an announcement here when it's out).
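
A minimal sketch of the text-to-binary conversion described above, assuming
every character has a code below 256 (ASCII or Latin-1). The names
STRING-TO-OCTETS and WRITE-TEXT-AS-BINARY are made up for illustration and
are not taken from the port:

```lisp
(defun string-to-octets (string)
  "Return a fresh (unsigned-byte 8) vector holding the char codes of STRING.
Assumption: every character code is below 256."
  (map '(vector (unsigned-byte 8))
       (lambda (ch)
         (let ((code (char-code ch)))
           (assert (< code 256) () "Character ~S does not fit in one octet." ch)
           code))
       string))

(defun write-text-as-binary (string stream)
  "Write STRING to the binary output STREAM as a sequence of octets."
  (write-sequence (string-to-octets string) stream))
```

With this, a single binary stream can carry both the headers and an image
body, so no bivalent stream is needed.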

I've also rewritten Christopher Double's NET.URI module. It now uses a
more RFC-compatible URI parser and has a nearly complete function set.

Additionally, I'm about to write a new version of my (as yet unreleased) POP3
module. I dislike the way it is implemented now. The new version will allow
the use of pop-scheme URIs (via NET.URI, which is why I had to rewrite the
parser) like ·········@mail.st-peter.stw.uni-erlangen.de:123
The POP3 module will allow you to open a mailbox and use it with the usual POP3
commands. It will be possible to fetch mails easily; they are parsed
and stored in a CLOS object. The headers are written into a hash table in the
object and the body is stored as is. It should be possible to extend this
with MIME handling and attachments ...
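
A rough sketch of the message object described above: headers in a hash
table, body stored as is. The module is unreleased, so the class and function
names here (MAIL-MESSAGE, PARSE-MESSAGE) are purely hypothetical. The sketch
also assumes that lines are separated by #\Newline and ignores folded
(continued) header lines:

```lisp
(defclass mail-message ()
  ((headers :initform (make-hash-table :test #'equalp) ; case-insensitive keys
            :reader message-headers)
   (body :initform "" :accessor message-body)))

(defun parse-message (text)
  "Parse TEXT into a MAIL-MESSAGE.  Hypothetical helper, not the real module."
  (let* ((message (make-instance 'mail-message))
         ;; The first empty line separates the headers from the body.
         (blank (search (format nil "~%~%") text))
         (header-text (subseq text 0 (or blank (length text)))))
    (when blank
      (setf (message-body message) (subseq text (+ blank 2))))
    (with-input-from-string (in header-text)
      (loop for line = (read-line in nil)
            while line
            do (let ((colon (position #\: line)))
                 (when colon
                   (setf (gethash (subseq line 0 colon)
                                  (message-headers message))
                         (string-trim " " (subseq line (1+ colon))))))))
    message))
```

A header is then looked up with, for example,
(gethash "subject" (message-headers message)).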

You see, I have much to do besides my usual job. Please understand that all
this stuff won't be ready tomorrow. I'll do what I can to get it ready as
fast as I can, but I don't have much time.

Regards,
Jochen

URI parsing Raymond Wiker
- Re: URI parsing Jochen Schmidt
  - Re: URI parsing Rob Warnock
    - Re: URI parsing ·····@relay.known.net
- Re: URI parsing Jochen Schmidt
  - Re: URI parsing Raymond Wiker
    - Re: URI parsing Jochen Schmidt
    - Re: URI parsing Pierre R. Mai

From: Raymond Wiker
Subject: URI parsing
Date: Thu, 05 Apr 2001 10:02:53 +0000
Message-ID: <867l0zbr5e.fsf@raw.grenland.fast.no>

Jochen Schmidt <···@dataheaven.de> writes:

        [ ... ]

> I've also rewritten Christopher Double's NET.URI module. It know
> uses a more RFC compatible URI-parser and has a nearly complete
> function set.

        I wrote a small (37 lines) function yesterday for parsing
URIs. When I compared it with NET.URI, I noticed that I didn't handle
fragments (internal anchors in html files). On the other hand, it
*does* handle usernames and passwords.

        The code is included below; I would appreciate
comments/suggestions. 

(defun split-url (url-string)
  "Splits the given URL into components representing the protocol,
user, password, host, port number, path and arguments."
  (flet ((split (sep string)
	   (let ((pos (search sep string)))
	     (if pos
		 (list (subseq string 0 pos)
		       (subseq string (+ pos (length sep))))
		 string))))
    (macrolet ((try-split (sep str (true-1-var true-2-var)
			       &optional false-var)
		 (let ((ressym (gensym)))
		   `(let ((,ressym (split ,sep ,str)))
		     (if (listp ,ressym)
			 (setf ,true-1-var (car ,ressym)
			       ,true-2-var (cadr ,ressym))
			 ,(when false-var
				`(setf ,false-var ,ressym)))))))
      (let (proto user pass host port path args)
	(try-split "://" url-string (proto host) host)
	(try-split "/" host (host path))
	(try-split "@" host (user host))
	(when user
	  (try-split ":" user (user pass)))
	(try-split ":" host (host port))
	(when path
	  (let ((argstr nil))
	    (try-split "?" path (path argstr))
	    (when argstr
	      (setq args
		    (let ((args '()) (arg nil) (end nil))
		      (loop (try-split "?" argstr (arg argstr) end)
			  (if end
			      (return-from nil
				(nreverse (cons end args)))
			      (push arg args))))))))
	(values proto user pass host port path args)))))

-- 
Raymond Wiker
·············@fast.no

From: Jochen Schmidt
Subject: Re: URI parsing
Date: Thu, 05 Apr 2001 12:48:54 +0000
Message-ID: <9ahp9s$5j1m4$1@ID-22205.news.dfncis.de>

Raymond Wiker wrote:

> Jochen Schmidt <···@dataheaven.de> writes:
> 
>         [ ... ]
> 
>> I've also rewritten Christopher Double's NET.URI module. It know
>> uses a more RFC compatible URI-parser and has a nearly complete
>> function set.
> 
>         I wrote a small (37 lines) function yesterday for parsing
> URIs. When I compared it with NET.URI, I noticed that I didn't handle
> fragments (internal anchors in html files). On the other hand, it
> *does* handle usernames and passwords.
> 
>         The code is included below; I would appreciate
> comments/suggestions.

Thanks. The new parsing function in my NET.URI handles the
scheme, authority, path, query and fragment parts. It is only 22 lines
long. I didn't think it would be a good idea to bring in more
specialized fields, as after the scheme each URI is free to define its own
syntax.
The function has only 22 lines but uses a macro, WITH-SCAN-MACROS, which is
inspired by AServe's WITH-BETTER-SCAN-MACROS. I enhanced it a bit:

*WITH-SCAN-MACROS* (buffer-entry [pos-entry]) form*
=>result*

buffer-entry::= existing-stringvar | (new-buffername source-buffer)
pos-entry::= position-var-symbol | start-position | (position-var-symbol start-position)

If you simply use a symbol as buffer-entry, then there has to be a previously
defined string buffer by that name that contains the data you want to parse.
If you use the (new-buffername source-buffer) notation, then WITH-SCAN-MACROS
emits code that COPY-SEQs an existing "source-buffer" into a new local binding
named "new-buffername".

If you use a symbol as pos-entry, a new lexical binding is created by that
name and initialized to 0. If you use a number instead, a lexical
binding named "pos" is created and initialized to the given value. If you
use the (position-var-symbol start-position) notation, then a binding named
by position-var-symbol is created and initialized to start-position.

WITH-SCAN-MACROS walks through its body and collects the names of all known
scan-macros. Only the scan-macros used in the body, and the scan-macros they
depend on, are emitted into the MACROLET that WITH-SCAN-MACROS creates.

So far there are the following scan-macros:

COLLECT-TO <char> ; collect everything from the current position up to (but not including) <char>
COLLECT-TO-CHARS <sequence-of-chars> ; like COLLECT-TO, but any char of the sequence ends the collecting
COLLECT-TO-EOL ; collect to end of line
COLLECT-TO-EOS ; collect to end of string
SKIP-N <n> ; skip n chars
SKIP-TO <char> ; skip to char
SKIP-TO-NOT <char> ; skip to first char that is not eql to <char>

This may change. I may add more scan-macros and/or remove existing ones.
A SKIP-WS (skipping whitespace) might be good.

It is possible to modify the pos and the buffer variables directly in the
body. Instead of (skip-n 5) you may write (incf pos 5).

The macros take a keyword argument named :errorp. It defaults to t,
which means that code is emitted that calls a macro named FAIL, as (fail), if
the position reaches a premature end. To get proper error handling here you
have to define this macro. Alternatively you may set errorp to nil so
that nothing happens at a premature end. Note that by setting errorp of a
collecting macro to nil you throw away the chars the macro collected before
the premature end was reached. If you set errorp to :result in a collecting
macro, the macro returns the chars it has collected so far.
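
To make the three :errorp modes concrete, here is a sketch of them as a
plain function rather than a scan-macro. COLLECT-TO-CHARS* is a made-up
name, the explicit position argument and second return value replace the
implicit pos variable, and a call to ERROR stands in for the user-defined
(fail) macro. That the position moves to the end of the string in the nil
case is an assumption, not something the real macros are known to do:

```lisp
(defun collect-to-chars* (string start chars &key (errorp t))
  "Scan STRING from START until a character in CHARS is found.
Returns two values: the collected substring (or NIL) and the new position."
  (let ((end (position-if (lambda (ch) (find ch chars)) string :start start)))
    (cond (end
           ;; Normal case: stop in front of the terminating character.
           (values (subseq string start end) end))
          ((eq errorp :result)
           ;; Premature end, but keep what was collected so far.
           (values (subseq string start) (length string)))
          (errorp
           ;; Premature end with errorp true: stand-in for (fail).
           (error "Premature end of ~S while scanning from ~D." string start))
          (t
           ;; Premature end with errorp nil: the collected chars are dropped.
           (values nil (length string))))))
```

For example, (collect-to-chars* "http://x" 0 '(#\: #\/)) stops in front of
the colon and returns "http" and 4.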

Here's my URI parsing function.

(defun %parse-uri (str)
  (with-scan-macros (str)
     (let* ((scheme (let ((result (collect-to-chars '(#\: #\/ #\? #\#)
                                                          :errorp :result)))
                      (if (eql (schar str pos) #\:)
                          (progn (incf pos) result)
                        (progn (setf pos 0) nil))))
            (authority
             (when (and (eql (schar str pos) #\/)
                        (eql (schar str (1+ pos)) #\/))
               (incf pos 2)
               (collect-to-chars '(#\/ #\? #\# #\return) :errorp :result)))
            (path
             (when (or (not scheme) (eql (schar str pos) #\/))
               (collect-to-chars '(#\? #\# #\return) :errorp :result)))
            (query 
             (when (eql (schar str pos) #\?)
               (collect-to-chars '(#\# #\return) :errorp :result)))
            (fragment
             (when (eql (schar str pos) #\#)
               (collect-to-eos :errorp :result))))
       (values scheme authority path query fragment))))
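
Since WITH-SCAN-MACROS itself is not shown here, the following is a
self-contained sketch of the same five-way split using only standard
functions. It is not the NET.URI code. SPLIT-URI and its local helpers are
made-up names, and unlike the function above it strips the "?" and "#"
delimiters and returns an empty string for an empty path:

```lisp
(defun split-uri (uri)
  "Split URI into scheme, authority, path, query and fragment.
Missing parts are NIL, except the path, which may be the empty string."
  (let ((pos 0)
        (len (length uri)))
    (flet ((take-until (chars)
             ;; Collect from POS up to, not including, the first char in CHARS.
             (let ((end (or (position-if (lambda (ch) (find ch chars))
                                         uri :start pos)
                            len)))
               (prog1 (subseq uri pos end)
                 (setf pos end))))
           (looking-at (prefix)
             ;; True if URI continues with PREFIX at POS.
             (let ((end (+ pos (length prefix))))
               (and (<= end len)
                    (string= prefix uri :start2 pos :end2 end)))))
      (let* ((scheme
               ;; A scheme exists only if ":" comes before any "/", "?" or "#".
               (let ((stop (position-if (lambda (ch) (find ch ":/?#")) uri)))
                 (when (and stop (plusp stop) (char= (char uri stop) #\:))
                   (setf pos (1+ stop))
                   (subseq uri 0 stop))))
             (authority (when (looking-at "//")
                          (incf pos 2)
                          (take-until "/?#")))
             (path (take-until "?#"))
             (query (when (looking-at "?")
                      (incf pos)
                      (take-until "#")))
             (fragment (when (looking-at "#")
                         (subseq uri (1+ pos)))))
        (values scheme authority path query fragment)))))
```

A relative reference such as "foo/index.html" comes back with scheme and
authority both NIL and the whole string as the path.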

From: Rob Warnock
Subject: Re: URI parsing
Date: Fri, 06 Apr 2001 13:23:10 +0000
Message-ID: <9akfvu$818lf$1@fido.engr.sgi.com>

Jochen Schmidt  <···@dataheaven.de> wrote:
+---------------
| Raymond Wiker wrote:
| >         I wrote a small (37 lines) function yesterday for parsing
| > URIs. When I compared it with NET.URI, I noticed that I didn't handle
| > fragments (internal anchors in html files). On the other hand, it
| > *does* handle usernames and passwords.
| 
| Thanks - my new parsing function of NET.URI it handles 
| scheme, authority, path, query and fragment parts. It is only 22 Lines 
| long. I don't thought that it would be a such good idea to bring in more 
| specialized fields, as after the scheme each URI is free to define it's
| own syntax.
+---------------

But at least for the "Common Internet Scheme Syntax" [RFC 1738 Section 3.1],
that is, anything that starts with "//" after the scheme (including the
"http:", "ftp:", & "telnet:" schemes), you definitely should parse *all*
of the elements (if present):

    //<user>:<password>@<host>:<port>/<url-path>

There have been published exploits recently that involved deceiving users
by formatting a "user" component that *looked* like a domain name but wasn't,
because of a later "@". See the following RISKS Digest articles for an
especially sneaky example:

    "Making something look hacked when it isn't"
    <URL:http://catless.ncl.ac.uk/Risks/21.16.html#subj5.1>

    "The risk of a seldom-used URL syntax"
    <URL:http://catless.ncl.ac.uk/Risks/21.16.html#subj6.1>
    <URL:http://catless.ncl.ac.uk/Risks/21.18.html#subj15.1>
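
As a sketch of what parsing all of those elements could look like, the
function below splits an authority string of the form user:password@host:port.
SPLIT-AUTHORITY is a made-up name, the input is assumed to be the text
between "//" and the next "/", and splitting at the last "@" is a defensive
choice of this sketch rather than something either RFC prescribes:

```lisp
(defun split-authority (authority)
  "Split \"user:password@host:port\" into four values; missing parts are NIL."
  (let* ((at (position #\@ authority :from-end t))
         (userinfo (and at (subseq authority 0 at)))
         (hostport (if at (subseq authority (1+ at)) authority))
         (user-colon (and userinfo (position #\: userinfo)))
         (port-colon (position #\: hostport :from-end t)))
    (values (if user-colon (subseq userinfo 0 user-colon) userinfo) ; user
            (and user-colon (subseq userinfo (1+ user-colon)))      ; password
            (subseq hostport 0 port-colon)                          ; host
            (and port-colon (subseq hostport (1+ port-colon))))))   ; port
```

Given "www.example.com@192.0.2.1:8080", the user comes back as
"www.example.com" and the host as "192.0.2.1", which is exactly the kind of
mismatch a browser or log viewer would want to make visible.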


-Rob

-----
Rob Warnock, 31-2-510		····@sgi.com
SGI Network Engineering		<URL:http://reality.sgi.com/rpw3/>
1600 Amphitheatre Pkwy.		Phone: 650-933-1673
Mountain View, CA  94043	PP-ASEL-IA

From: ·····@relay.known.net
Subject: Re: URI parsing
Date: Tue, 10 Apr 2001 17:29:17 +0000
Message-ID: <m3elv0vf2q.fsf@relay.known.net>

····@rigden.engr.sgi.com (Rob Warnock) writes:

> Jochen Schmidt  <···@dataheaven.de> wrote:
> +---------------
> | Raymond Wiker wrote:
> | >         I wrote a small (37 lines) function yesterday for parsing
> | > URIs. When I compared it with NET.URI, I noticed that I didn't handle
> | > fragments (internal anchors in html files). On the other hand, it
> | > *does* handle usernames and passwords.
> | 
> | Thanks - my new parsing function of NET.URI it handles 
> | scheme, authority, path, query and fragment parts. It is only 22 Lines 
> | long. I don't thought that it would be a such good idea to bring in more 
> | specialized fields, as after the scheme each URI is free to define it's
> | own syntax.
> +---------------
> 
> But at least for the "Common Internet Scheme Syntax" [RFC 1738 Section 3.1],

From RFC 2396:

   ... This document
   defines the generic syntax of URI, including both absolute and
   relative forms, and guidelines for their use; it revises and replaces
   the generic definitions in RFC 1738 and RFC 1808.

You should ignore RFC 1738 and look at RFC 2396 instead.

Kevin Layer

From: Jochen Schmidt
Subject: Re: URI parsing
Date: Thu, 05 Apr 2001 22:34:30 +0000
Message-ID: <9airjj$5e0jn$1@ID-22205.news.dfncis.de>

Raymond Wiker wrote:

> Jochen Schmidt <···@dataheaven.de> writes:
> 
>         [ ... ]
> 
>> I've also rewritten Christopher Double's NET.URI module. It know
>> uses a more RFC compatible URI-parser and has a nearly complete
>> function set.
> 
>         I wrote a small (37 lines) function yesterday for parsing
> URIs. When I compared it with NET.URI, I noticed that I didn't handle
> fragments (internal anchors in html files). On the other hand, it
> *does* handle usernames and passwords.
> 
>         The code is included below; I would appreciate
> comments/suggestions.
> 
> (defun split-url (url-string)
>   "Splits the given URL into components representing the protocol,
> user, password, host, port number, path and arguments."
>   (flet ((split (sep string)
>            (let ((pos (search sep string)))
>              (if pos
>                  (list (subseq string 0 pos)
>                        (subseq string (+ pos (length sep))))
>                  string))))
>     (macrolet ((try-split (sep str (true-1-var true-2-var)
>                                &optional false-var)
>                  (let ((ressym (gensym)))
>                    `(let ((,ressym (split ,sep ,str)))
>                      (if (listp ,ressym)
>                          (setf ,true-1-var (car ,ressym)
>                                ,true-2-var (cadr ,ressym))
>                          ,(when false-var
>                                 `(setf ,false-var ,ressym)))))))
>       (let (proto user pass host port path args)
>         (try-split "://" url-string (proto host) host)
>         (try-split "/" host (host path))
>         (try-split "@" host (user host))
>         (when user
>           (try-split ":" user (user pass)))
>         (try-split ":" host (host port))
>         (when path
>           (let ((argstr nil))
>             (try-split "?" path (path argstr))
>             (when argstr
>               (setq args
>                     (let ((args '()) (end nil))
>                       (loop (try-split "?" argstr (arg argstr) end)
>                           (if end
>                               (return-from nil
>                                 (nreverse (cons end args)))
>                               (push arg args))))))))
>         (values proto user pass host port path args)))))

Be careful: some URI types are not handled correctly by this code,

e.g. a relative URI without a scheme:

foo/index.html
gets "foo" as host and "index.html" as path

··········@dataheaven.de
"mailto" is taken as username and "jsc" as password

The function handles URLs but not URIs in general.
See RFC 2396 for further info on URIs.

Regards,
Jochen

From: Raymond Wiker
Subject: Re: URI parsing
Date: Fri, 06 Apr 2001 05:55:25 +0000
Message-ID: <86snjma7xu.fsf@raw.grenland.fast.no>

Jochen Schmidt <···@dataheaven.de> writes:

> Raymond Wiker wrote:
> 
> >         I wrote a small (37 lines) function yesterday for parsing
> > URIs. When I compared it with NET.URI, I noticed that I didn't handle
> > fragments (internal anchors in html files). On the other hand, it
> > *does* handle usernames and passwords.
> > 
> >         The code is included below; I would appreciate
> > comments/suggestions.

        [ snip ]

> Be carefully some URI-types are not handled correctly by this code:
> 
> e.g. a relative URI without scheme:
> 
> foo/index.html
> gets "foo" as host and index.html as path
> 
> ··········@dataheaven.de
> mailto is taken as username jsc as password
> 
> The function does handle URLs but not URIs in general.
> See RFC2396 for further info on URIs

        Will do; thanks.

        There's a further bug in my code: for arguments, the code
assumes that #\? is used to separate args. This should be #\&. Anyway,
it's probably better *not* to split up the arguments in this function.
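
A minimal sketch of splitting the arguments on #\& in a separate step, as
suggested above. SPLIT-QUERY is a made-up name, the input is assumed to be
the query string without its leading "?", and no %HH unescaping is done:

```lisp
(defun split-query (query)
  "Split QUERY on & and return a list of (name . value) conses.
VALUE is NIL when an argument contains no = sign."
  (loop with start = 0
        for amp = (position #\& query :start start)
        for pair = (subseq query start amp)
        for equals = (position #\= pair)
        collect (cons (subseq pair 0 equals)
                      (and equals (subseq pair (1+ equals))))
        while amp
        do (setf start (1+ amp))))
```

For example, (split-query "a=1&b=2&flag") yields one cons per argument,
with NIL as the value of "flag".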

        The code you posted earlier looked interesting. Is the rest of
the code available anywhere?

-- 
Raymond Wiker
·············@fast.no

From: Jochen Schmidt
Subject: Re: URI parsing
Date: Fri, 06 Apr 2001 07:17:39 +0000
Message-ID: <9ajq8f$5o151$1@ID-22205.news.dfncis.de>

Raymond Wiker wrote:

> Jochen Schmidt <···@dataheaven.de> writes:
> 
>> Raymond Wiker wrote:
>> 
>> >         I wrote a small (37 lines) function yesterday for parsing
>> > URIs. When I compared it with NET.URI, I noticed that I didn't handle
>> > fragments (internal anchors in html files). On the other hand, it
>> > *does* handle usernames and passwords.
>> > 
>> >         The code is included below; I would appreciate
>> > comments/suggestions.
> 
>         [ snip ]
> 
>> Be carefully some URI-types are not handled correctly by this code:
>> 
>> e.g. a relative URI without scheme:
>> 
>> foo/index.html
>> gets "foo" as host and index.html as path
>> 
>> ··········@dataheaven.de
>> mailto is taken as username jsc as password
>> 
>> The function does handle URLs but not URIs in general.
>> See RFC2396 for further info on URIs
> 
>         Will do; thanks.
> 
>         There's a further bug in my code: for arguments, the code
> assumes that #\? is used to separate args. This should be #\&. Anyway,
> it's probably better *not* to split up the arguments in this function.
> 
>         The code you posted earlier looked interesting. Is the rest of
> the code available anywhere?

I'll post the complete NET.URI like module on my homepage 
(http://www.dataheaven.de) today. The code changed a bit but not much.

Regards,
Jochen

From: Pierre R. Mai
Subject: Re: URI parsing
Date: Fri, 06 Apr 2001 10:39:20 +0000
Message-ID: <87k84ypb1j.fsf@orion.bln.pmsf.de>

Raymond Wiker <·············@fast.no> writes:

> > The function does handle URLs but not URIs in general.
> > See RFC2396 for further info on URIs
> 
>         Will do; thanks.
> 
>         There's a further bug in my code: for arguments, the code
> assumes that #\? is used to separate args. This should be #\&. Anyway,
> it's probably better *not* to split up the arguments in this function.

Also beware that you have to be extra careful when it comes to
unescaping URL components (i.e. converting %HH to characters, etc.),
and its interactions with argument parsing and other parts of URL
processing, especially ordering constraints.  This is definitely
non-trivial to get right, especially since common practice sometimes
deviates from relevant RFCs.  Also don't bother with URL-related RFCs
prior to RFC2396, e.g. RFC 1738 and RFC 1808.  Earlier RFCs were
completely insane in some of their requirements, whereas RFC2396 is at
least half-way sanely parseable.
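
One of those ordering constraints, sketched under simplifying assumptions:
each %HH is treated as a single character code (no multi-byte encodings),
and malformed escapes are copied through unchanged. UNESCAPE and SPLIT-ON
are made-up helper names, not code from any server mentioned here:

```lisp
(defun unescape (string)
  "Replace each well-formed %HH in STRING by the character with that code."
  (with-output-to-string (out)
    (loop with i = 0
          while (< i (length string))
          do (let ((ch (char string i)))
               (cond ((and (char= ch #\%)
                           (<= (+ i 3) (length string))
                           (digit-char-p (char string (+ i 1)) 16)
                           (digit-char-p (char string (+ i 2)) 16))
                      (write-char (code-char
                                   (parse-integer string :start (+ i 1)
                                                         :end (+ i 3)
                                                         :radix 16))
                                  out)
                      (incf i 3))
                     (t
                      (write-char ch out)
                      (incf i)))))))

(defun split-on (char string)
  "Split STRING at every occurrence of CHAR."
  (loop with start = 0
        for end = (position char string :start start)
        collect (subseq string start end)
        while end
        do (setf start (1+ end))))

;; Right order: split on & first, then unescape each piece.
;; (mapcar #'unescape (split-on #\& "q=a%26b&x=1"))
;; Wrong order: unescaping first turns the escaped %26 into a separator.
;; (split-on #\& (unescape "q=a%26b&x=1"))
```

The first form keeps the escaped ampersand inside the value of q, while the
second produces three pieces instead of two.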

From memory, further pitfalls are relative-URLs and merging semantics
and handling of parameters (the stuff in path segments separated by
#\;, not query-arguments, though only really relevant for ftp URLs
IIRC, and there only for the last path component, so you can maybe
punt on this).

The URL parsing and handling code is probably the second most arcane
stuff in our in-house server, taking nearly 1KLOC just to get HTTP
urls to a decent level (though all structured URL schemes can be added
with 3-4 lines each).  The only code that is IMHO even more intricate
is the code that handles inbound multipart entities, especially in the
presence of keep-alive connections and stream-based processing by user
code.

Regs, Pierre.

-- 
Pierre R. Mai <····@acm.org>                    http://www.pmsf.de/pmai/
 The most likely way for the world to be destroyed, most experts agree,
 is by accident. That's where we come in; we're computer professionals.
 We cause accidents.                           -- Nathaniel Borenstein