From: Erik Naggum
Subject: Revisiting split-sequence and patterns
Date: 
Message-ID: <3215908980293864@naggum.net>
  This problem occurred to me while trying to explain why comma-delimited
  data formats should not use something as simple as split-sequence (or a
  similar inline loop), so bear with me while I rehash an old issue and
  walk through some background.

  If there be patterns of syntax design, one of them would be escaping or
  quoting a character so it reverts from (potentially) special to normal
  interpretation using one special character solely for escaping purposes,
  backslash being the canonical choice for this.  (Another pattern of
  syntax design would be to use the backslash to _make_ characters special
  and thus mean something else entirely.  It is thus important to recognize
  which pattern the backslash is used to implement in a particular syntax.)
  The escaping character is discarded from the extracted token.  Call this
  the single-escape mechanism.
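
  In sketch form, the single-escape mechanism might look as follows.  This
  is only an illustration of the pattern; the name split-escaped is made
  up, and the behavior at a trailing escape is a choice of the sketch, not
  of any standard:

```lisp
;; Split STRING on DELIMITER, but let an escaped delimiter pass into
;; the token.  The escape character itself is discarded from the
;; extracted token, per the single-escape mechanism.
(defun split-escaped (string &key (delimiter #\,) (escape #\\))
  (let ((tokens '())
        (token (make-array 0 :element-type 'character
                             :adjustable t :fill-pointer t)))
    (flet ((finish ()
             (push (copy-seq token) tokens)
             (setf (fill-pointer token) 0)))
      (loop with i = 0
            while (< i (length string))
            do (let ((char (char string i)))
                 (cond ((and escape (char= char escape)
                             (< (1+ i) (length string)))
                        ;; Discard the escape, keep the next character.
                        (vector-push-extend (char string (1+ i)) token)
                        (incf i 2))
                       ((char= char delimiter)
                        (finish)
                        (incf i))
                       (t
                        (vector-push-extend char token)
                        (incf i)))))
      (finish)
      (nreverse tokens))))

;; (split-escaped "a,b\\,c,d")  =>  ("a" "b,c" "d")
```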

  Since Common Lisp adheres to this pattern of syntax design, I would have
  expected that a function that split a string into a list of tokens that
  were delimited by a special character would allow a token to contain that
  delimiter if it were escaped.

  If there be more patterns of syntax design, another one would involve
  escaping a whole sequence of characters so they revert from (potentially)
  special to normal interpretation, using a single special character at
  both ends of the sequence, and escaping that character and the escaping
  character inside the sequence with the same character that does this for
  a single character.  All the escaping characters are discarded from the
  extracted token.  Call this the multiple-escape mechanism.
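
  A sketch of the multiple-escape mechanism, on the same terms as before
  (split-quoted is a made-up name, and the handling of an unterminated
  quoted run is a choice of the sketch):

```lisp
;; Split STRING on DELIMITER, where a QUOTE-CHAR at both ends protects
;; a whole run of characters, and the single escape also works inside
;; that run.  All escaping characters are dropped from the token.
(defun split-quoted (string &key (delimiter #\,) (escape #\\)
                                 (quote-char #\"))
  (let ((tokens '())
        (token (make-array 0 :element-type 'character
                             :adjustable t :fill-pointer t))
        (quoted nil))
    (flet ((finish ()
             (push (copy-seq token) tokens)
             (setf (fill-pointer token) 0)))
      (loop with i = 0
            while (< i (length string))
            do (let ((char (char string i)))
                 (cond ((and (char= char escape)
                             (< (1+ i) (length string)))
                        (vector-push-extend (char string (1+ i)) token)
                        (incf i 2))
                       ((char= char quote-char)
                        ;; The quote character only toggles; it is not
                        ;; part of the extracted token.
                        (setf quoted (not quoted))
                        (incf i))
                       ((and (not quoted) (char= char delimiter))
                        (finish)
                        (incf i))
                       (t
                        (vector-push-extend char token)
                        (incf i)))))
      (finish)
      (nreverse tokens))))

;; (split-quoted "a,\"b,c\",d")  =>  ("a" "b,c" "d")
```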

  Since Common Lisp adheres to this pattern of syntax design, too, with
  both its string syntax and the multiple-escape syntax for symbols, I
  would also have expected that a function that split a string into a list
  of tokens that had been delimited on either end by a special character
  would allow a token to be so delimited and thus to avoid being split in
  the middle of such a token if it contained the delimiter on which to
  split the string.

  There are a whole lot of other patterns of syntax design: whether it be
  context-free or context-sensitive; how many characters (or tokens) of
  read-ahead it require; whether the start and end markers of individual
  elements be explicit or deduced.  Common Lisp has a context-sensitive
  grammar (strictly speaking, since every change to the readtable and even
  the package changes the syntax, one cannot read a Common Lisp source file
  correctly without evaluating certain top-level forms and performing any
  evaluation required by the #. reader macro), described by the readtable,
  which implements support for recursive-descent parsers that may change
  the readtable while processing "their" portion of the input stream.  (It
  is easy to describe the (static) syntax of the standard readtable that
  implements the standard language, but that is not sufficient for a real
  Common Lisp system.)  Common Lisp's syntax is in this particular way very
  peculiar and breaks the patterns that have later been established for
  context-free grammars and LALR parsers.  (Most interesting programming
  languages fail to adhere to these formal little things, anyway, but they
  are sometimes helpful in classifying things.)  Common Lisp's standard
  function to read delimited lists, read-delimited-list, expects to
  terminate on a character and cannot terminate on the end of input.

  Yet another possible pattern is that of considering whitespace completely
  ignorable, but also necessary to keep tokens apart, because almost no
  other character breaks tokens apart.  This means that whitespace differs
  from other delimiters in that repeated instances should be collapsed, but
  this is not really a feature of the delimiter, but of whitespace.

  Since Common Lisp requires explicit start and end markers for almost all
  tokens (except symbols), and the end marker is already an argument to
  read-delimited-list, I think we have precedence for this argument to a
  function that parses a string, too.

  Briefly, a "split-sequence" that adheres to these patterns would accept:

1 a single-escape character, which defaults to #\\, but the argument may be
  nil to prevent this functionality

2 a designator for a bag, i.e., character or sequence of characters, that
  are multiple-escape characters, which defaults to (#\"), but which can
  usefully be (#\" #\|) or (#\" #\').  If the bag is empty, no characters
  will be treated as multiple-escape characters.  If two multiple-escape
  characters are adjacent and thus complete a "hard empty" field, it is
  returned as an empty string.

3 a designator for a bag of characters that are to be considered whitespace
  and are thus to be ignored apart from their effect on terminating a token
  unless escaped.  E.g., (#\space #\tab).

4 an internal delimiter character that separates tokens, where tokens are
  considered "soft empty" if delimiters are adjacent (ignoring whitespace,
  if any) and if so, are returned as nil.

5 a designator for a bag of terminating delimiters that cause the parsing
  to stop and return the tokens collected.  If nil as a whole or the symbol
  :eof in a sequence, only the end of stream or string will terminate the
  parsed list of tokens, provided that it is not escaped.

  I think the name parse-delimited-list would be descriptive and fitting.
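
  In sketch form, such a parse-delimited-list might look as follows.  This
  is a simplified illustration of the five arguments above, not a real
  implementation: bags are plain sequences of characters, the :eof
  designator is elided, and whitespace is handled more crudely than the
  proposal requires.

```lisp
;; A simplified sketch of the proposed parse-delimited-list.
(defun parse-delimited-list (string &key (escape #\\)
                                         (multiple-escape '(#\"))
                                         (whitespace '(#\Space #\Tab))
                                         (delimiter #\,)
                                         (terminators nil))
  (let ((tokens '())
        (token (make-array 0 :element-type 'character
                             :adjustable t :fill-pointer t))
        (quoted nil)   ; inside a multiple-escape run?
        (hard nil))    ; did multiple-escape characters touch this field?
    (flet ((finish ()
             ;; A "soft empty" field is returned as nil; a "hard empty"
             ;; field -- adjacent multiple-escape characters -- as "".
             (push (if (and (zerop (fill-pointer token)) (not hard))
                       nil
                       (copy-seq token))
                   tokens)
             (setf (fill-pointer token) 0
                   hard nil)))
      (loop with i = 0
            while (< i (length string))
            do (let ((char (char string i)))
                 (cond ((and escape (char= char escape)
                             (< (1+ i) (length string)))
                        (vector-push-extend (char string (1+ i)) token)
                        (incf i 2))
                       ((find char multiple-escape)
                        (setf quoted (not quoted)
                              hard t)
                        (incf i))
                       (quoted
                        (vector-push-extend char token)
                        (incf i))
                       ((find char terminators)
                        (loop-finish))
                       ((char= char delimiter)
                        (finish)
                        (incf i))
                       ((find char whitespace)
                        ;; Whitespace is simply skipped here; the full
                        ;; proposal only ignores it around tokens.
                        (incf i))
                       (t
                        (vector-push-extend char token)
                        (incf i)))))
      (finish)
      (nreverse tokens))))

;; (parse-delimited-list "a, \"\" ,,b")  =>  ("a" "" nil "b")
```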

  A note on whitespace.  If some character that is usually a whitespace
  character is to be a delimiter in its own right, such as #\tab, the bag
  of whitespace characters should be empty and that character should be the
  internal delimiter -- or its presence in the whitespace bag will prevent
  adjacent delimiters from creating an empty field.

  A note on bags.  The function string-trim and its friends are the only
  functions I can find that accept a bag of anything in Common Lisp.  A
  bag can be any sequence type, but if non-characters are required, such as
  :eof, it must be a list or a vector.

  A note on designators for bags of characters.  As with make-array, I
  think it is useful to supply either a character or :eof without having to
  put it in a sequence just for the sake of this argument, so each
  designates a singleton sequence containing itself.  Note that both nil
  and a string of one character are already bags.

  A note on "hard" and "soft" empty fields.  There may already be good
  reason to supply a "default list" for fields that are empty, modeled on
  make-pathname's defaults argument, but if so, we need a way to distinguish
  a field that is supplied as empty from one that is defaulted.  In the
  presence of multiple-escape characters, it is possible to create the
  empty string explicitly ("hard"), as opposed to implicitly by omitting
  characters between delimiters ("soft").  Only a "soft empty" field would
  be eligible for defaulting from the defaults list.  Note that supplying a
  short defaults list (which includes not supplying one) will default
  elements to nil, which is what a "soft empty" field would be returned as
  in the absence of a default list, so it need not be a special case.
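
  Defaulting on these terms might look like this (default-fields is a
  made-up name; it assumes the parsed list uses nil for "soft empty" and
  "" for "hard empty" fields, as above):

```lisp
;; Default "soft empty" (nil) fields from DEFAULTS, modeled on
;; make-pathname's defaults argument.  "Hard empty" fields ("") and
;; supplied fields are kept; a short defaults list defaults the
;; remaining elements to nil.
(defun default-fields (fields defaults)
  (loop for field in fields
        for tail = defaults then (rest tail)
        collect (if (null field) (first tail) field)))

;; (default-fields '("a" nil "" nil) '("x" "y" "z"))
;;   =>  ("a" "y" "" nil)
```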

  I hope it is evident how I have picked elements from several functions
  already in Common Lisp.  I hope the result has a "Common Lisp feel",
  which is also what patterns are about to me.  I have been a little queasy
  about the split-sequence "feel", which to me has a more C- or Perl-like
  feel to it.  (I know, I posted an early version and I am partly to blame.
  But then we age.  Or something.)

///
-- 
  The past is not more important than the future, despite what your culture
  has taught you.  Your future observations, conclusions, and beliefs are
  more important to you than those in your past ever will be.  The world is
  changing so fast the balance between the past and the future has shifted.

From: Jeff Greif
Subject: Re: Revisiting split-sequence and patterns
Date: 
Message-ID: <XK_M7.142262$R9.40126147@typhoon.we.rr.com>
This basically seems like a good idea.  However, it might be worthwhile
to allow separate bags for the start and end delimiters in multiple
escapes.  Sometimes the tokens might be enclosed like this `some token'
or [other token] or {there exists this token}.  It may also be useful to
specify the start and end delimiters in matching pairs, so that
"SELECT * FROM LISPERS WHERE NAME = 'Erik Naggum'" could be treated as a
single token in some circumstances where both single and double-quote
were allowed as multiple-escape delimiters, but starting with
double-quote meant you have to end with double-quote, and other
delimiting characters appearing inside the double quotes would not be
escapes.

Jeff
From: Erik Naggum
Subject: Re: Revisiting split-sequence and patterns
Date: 
Message-ID: <3215922508819020@naggum.net>
* Jeff Greif
| Sometimes the tokens might be enclosed like this `some token' or [other
| token] or {there exists this token}.

  The reason this is out of place is that these things naturally nest.

| It may also be useful to specify the start and end delimiters in matching
| pairs, so that "SELECT * FROM LISPERS WHERE NAME = 'Erik Naggum'" could
| be treated as a single token in some circumstances where both single and
| double-quote were allowed as multiple-escape delimiters, but starting
| with double-quote meant you have to end with double-quote, and other
| delimiting characters appearing inside the double quotes would not be
| escapes.

  I already covered that.  Please have another look:

    If there be more patterns of syntax design, another one would involve
    escaping a whole sequence of characters so they revert from (potentially)
    special to normal interpretation, using a single special character at
--> BOTH ENDS of the sequence, and escaping that character and the escaping
    character inside the sequence with the same character that does this for
    a single character.  All the escaping characters are discarded from the
    extracted token.  Call this the multiple-escape mechanism.

///
From: Jeff Greif
Subject: Re: Revisiting split-sequence and patterns
Date: 
Message-ID: <RrYN7.147798$R9.41987620@typhoon.we.rr.com>
I originally sent this on the morning of 28 November, but it never made
it to the newsgroup.

Jeff

"Erik Naggum" <····@naggum.net> wrote in message
·····················@naggum.net...
> * Jeff Greif
> | Sometimes the tokens might be enclosed like this `some token' or [other
> | token] or {there exists this token}.
>
>   The reason this is out of place is that these things naturally nest.
>
According to the Apache webserver docs, this is a line from a webserver
error log (broken to fit newsgroup linewidth):

  [Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by
server configuration: /export/home/live/ap/htdocs/test


I'm not sure your reason is strong enough to allow such a line to fall
outside the usage of this new split-sequence functionality.  A
compromise might be to allow delimiter pairs, but not nesting.  Or if
they are used, only the outermost ones will be treated as escapes, and
they must balance inside.  This is somewhat ugly.

It might be useful to add one more syntax pattern -- characters that are
punctuation (delimit tokens) but also *are* tokens (unless escaped or
appearing inside the multiple escaped sequence).  Using that pattern,
the line above could be tokenized as ( "[" "Wed" "Oct" "11" ... "]" ...)
and a higher level of parsing could handle the rectangular brackets.
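
In sketch form (tokenize-with-punctuation is a made-up name, and escapes
and multiple-escape sequences are deliberately left out):

```lisp
;; Tokenize STRING so that punctuation characters both delimit tokens
;; and are returned as one-character tokens themselves; whitespace
;; merely separates and is discarded.
(defun tokenize-with-punctuation (string &key (punctuation "[]")
                                              (whitespace " "))
  (let ((tokens '())
        (token (make-array 0 :element-type 'character
                             :adjustable t :fill-pointer t)))
    (flet ((finish ()
             (when (plusp (fill-pointer token))
               (push (copy-seq token) tokens)
               (setf (fill-pointer token) 0))))
      (loop for char across string
            do (cond ((find char punctuation)
                      (finish)
                      ;; Punctuation is a token in its own right.
                      (push (string char) tokens))
                     ((find char whitespace)
                      (finish))
                     (t
                      (vector-push-extend char token))))
      (finish)
      (nreverse tokens))))

;; (tokenize-with-punctuation "[error] [client 127.0.0.1]")
;;   =>  ("[" "error" "]" "[" "client" "127.0.0.1" "]")
```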

Jeff
From: Erik Naggum
Subject: Re: Revisiting split-sequence and patterns
Date: 
Message-ID: <3216202256157250@naggum.net>
* Jeff Greif
| I'm not sure your reason is strong enough to allow such a line to fall
| outside the usage of this new split-sequence functionality.

  I am.  You need a real parser, not just a "split" function to get this
  kind of functionality.  OK, I called my function "parse-delimited-list"
  and that was probably misleading, but if we do this, it will grow into a
  full-fledged programmable parser like the Common Lisp reader, and then it
  is a better project to make the reader more powerful.  (I am working on
  that, too, in particular to get a handle on the function that decides
  when something is a symbol.)

| It might be useful to add one more syntax pattern -- characters that are
| punctuation (delimit tokens) but also *are* tokens (unless escaped or
| appearing inside the multiple escaped sequence).  Using that pattern,
| the line above could be tokenized as ( "[" "Wed" "Oct" "11" ... "]" ...)
| and a higher level of parsing could handle the rectangular brackets.

  I think this is a much better design.  There is but one drawback, which
  may be a feature, and that is that you no longer have any special rules
  inside the delimiter pairs, which I would kind of expect for a language
  that uses matching delimiters.

///
From: Jeff Greif
Subject: Re: Revisiting split-sequence and patterns
Date: 
Message-ID: <yi8O7.151050$xe.41293631@typhoon.we.rr.com>
"Erik Naggum" <····@naggum.net> wrote in message
·····················@naggum.net...
> * Jeff Greif
> | I'm not sure your reason is strong enough to allow such a line to fall
> | outside the usage of this new split-sequence functionality.
>
>   I am.  You need a real parser, not just a "split" function to get this
>   kind of functionality.  OK, I called my function "parse-delimited-list"
>   and that was probably misleading, but if we do this, it will grow into a
>   full-fledged programmable parser like the Common Lisp reader, and then it
>   is a better project to make the reader more powerful.  (I am working on
>   that, too, in particular to get a handle on the function that decides
>   when something is a symbol.)
>

Pardon me for ruminating further.  I'm not trying to insult your
intelligence by telling you what you already know.

I think the upper bound for the functionality requirement of this tool
is lexical analysis -- quite a bit less than general lexical analysis,
and definitely not full parsing.  The needed output is a sequence of
tokens that represent the 'atoms' in the grammar of the little language
that is being read.  Parsing can be left to other code.  In concrete
terms, when the Apache log file line:

  [Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by
server configuration: /export/home/live/ap/htdocs/test

is cracked, you want to get either the token

"Wed Oct 11 14:32:52 2000" or "[Wed Oct 11 14:32:52 2000]" or
a sequence of tokens "[" "Wed" "Oct" ... "]" for the first part.  You
don't want to get "[Wed" "Oct" ... and  have to look inside the token
"[Wed" to see if it contains the grouping construct '[' and then inside
all subsequent tokens until you find the one containing the matching
"]".

Unfortunately, to do lexical analysis of this sort for arbitrary little
languages, you do need something like a finite-state automaton
processor, or the Lisp reader, to support nesting, context-sensitivity,
etc.  This new function you're proposing should not attempt to handle
all of these constructs -- for only a restricted subset of little
languages will it be suitable.

One way to decide where to draw the line in the functionality of this
operation is to survey the types of repetitive file formats that people
would like to be able to process using a tool like this.  Clearly, what
you've proposed is fine for most numeric tabular data (or exported
spreadsheets), and for textual tables like address books, which might
have entries containing a lot of empty tokens, such as (with comma as
separator):

,,,Ronnie Button,,······@phenome.zit.edu,,,,,,,,,,,,,,,,,,,,

Web server logs are another group.  The Apache ones can be handled using
the 'punctuation as delimiter and token' pattern.  Presumably other
formats, such as standard configuration files for applications, should
be looked at as well.  Probably other tools would be better for XML or
other hierarchically-structured configuration files with arbitrary
nesting.  But is there a point in handling whatever sendmail uses these
days, or the windows.ini style of configuration file?  Tabular data
produced by Lisp might also be of interest -- perhaps a control string
for cl:format could be used as the specifier of the data format for
reading back.  (Only certain kinds could be handled.)

In the end, a formal description of the languages that can be handled
would be helpful to users considering the tool.

Jeff
From: Thomas F. Burdick
Subject: Re: Revisiting split-sequence and patterns
Date: 
Message-ID: <xcvitbujzjz.fsf@monsoon.OCF.Berkeley.EDU>
Erik Naggum <····@naggum.net> writes:

>   This problem occurred to me while trying to explain why comma-delimited
>   data formats should not use something as simple as split-sequence (or a
>   similar inline loop), so bear with me while I rehash an old issue and
>   walk through some background.

Oh, I'm going to have to rewrite my vector splitters with this in
mind.  I feel much better about them now.  They've been useful for
some things, particularly lines coming in from some Unix utility, but
I've been doing my own parsing when I need escaping.  Now that you've
pointed this out, I feel kind of foolish for that.

This would leave something like split-sequence as a special case of a
more general utility.  I still think that a function returning the
entire parsed output should be a special case of a more general
call-while-splitting type of function.
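
One shape such a call-while-splitting function might take (map-split is a
made-up name, and the splitting here is deliberately bare, with no escape
handling):

```lisp
;; Call FUNCTION on each token of STRING as it is found, rather than
;; collecting the whole parse up front.
(defun map-split (function string &key (delimiter #\,))
  (loop with start = 0
        for end = (position delimiter string :start start)
        do (funcall function (subseq string start end))
        while end
        do (setf start (1+ end))))

;; Returning the entire parsed output is then the special case:
(defun split (string &key (delimiter #\,))
  (let ((tokens '()))
    (map-split (lambda (token) (push token tokens))
               string :delimiter delimiter)
    (nreverse tokens)))

;; (split "a,b,,c")  =>  ("a" "b" "" "c")
```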

If I have some spare time this evening, I'll try to do this, rewrite
some code to use it, and post any insights I get from that.

-- 
           /|_     .-----------------------.                        
         ,'  .\  / | No to Imperialist war |                        
     ,--'    _,'   | Wage class war!       |                        
    /       /      `-----------------------'                        
   (   -.  |                               
   |     ) |                               
  (`-.  '--.)                              
   `. )----'