Seeking a *trivial* XML parser

From: Tim Bradshaw
Subject: Seeking a *trivial* XML parser
Date: Tue, 05 Feb 2002 16:57:23 +0000
Message-ID: <fbc0f5d1.0202050857.7e432f51@posting.google.com>

This article smells slightly of `I can't be bothered to find this out
myself', sorry.  I can be bothered but I'd also like to be guided by
any recommendations...

I am writing a system which uses a fairly elaborate (read `tiny
language') SEXPR-based config file format.  This looks like it's
unacceptable to the current users, who want something more
fashionable.  So probably this means XML of some kind, because I can
map things (albeit verbosely) into XML.

So I need an XML parser.  I have spent some time looking around for
one, and there are obviously a lot of offerings out there, many of
which are probably very good.  All the ones I've looked at look rather
more complex than I need though.  What I want is the absolute minimal
possible thing: since I get to write the DTDs I can completely control
what it will come across, and I don't need namespaces or any of the
other enormous complexities that encrust these things.  All I need,
really, is a politically-acceptable syntax for SEXPRs.

The parser can be external - I can run a program and snarf the output
if need be.

It needs to have a non-contagious license.

Can anyone recommend anything?

Thanks

--tim

From: Paul Tarvydas
Subject: Re: Seeking a *trivial* XML parser
Date: Tue, 05 Feb 2002 18:19:11 +0000
Message-ID: <Xns91AC885A4A866pt@66.185.95.104>

How deep are your sexprs?  I had a similar problem with a fact base - each 
fact was stored as a depth-one sexpr.  Simply moving the left paren over to 
the right by one atom solved the acceptance problem :-)  (kinda like 
McCarthy's top-level sexprs in the lisp 1.5 book).
pt

From: Raymond Wiker
Subject: Re: Seeking a *trivial* XML parser
Date: Tue, 05 Feb 2002 17:37:31 +0000
Message-ID: <86g04fhlt0.fsf@raw.grenland.fast.no>

··········@tfeb.org (Tim Bradshaw) writes:

> This article smells slightly of `I can't be bothered to find this out
> myself', sorry.  I can be bothered but I'd also like to be guided by
> any recommendations...
> 
> I am writing a system which uses a fairly elaborate (read `tiny
> language') SEXPR-based config file format.  This looks like it's
> unacceptable to the current users, who want something more
> fashionable.  So probably this means XML of some kind, because I can
> map things (albeit verbosely) into XML.
> 
> So I need an XML parser.  I have spent some time looking around for
> one, and there are obviously a lot of offerings out there, many of
> which are probably very good.  All the ones I've looked at look rather
> more complex than I need though.  What I want is the absolute minimal
> possible thing: since I get to write the DTDs I can completely control
> what it will come across, and I don't need namespaces or any of the
> other enormous complexities that encrust these things.  All I need,
> really, is a politically-acceptable syntax for SEXPRs.
> 
> The parser can be external - I can run a program and snarf the output
> if need be.
> 
> It needs to have a non-contagious license.
> 
> Can anyone recommend anything?

        "xsltproc" may be able to do what you want (like process an XML
file and output "real" SEXPRs). xsltproc is packaged with libxslt,
which in turn requires libxml; libxml and libxslt are both 
available from www.xmlsoft.org. The license is LGPL, for both.

        And no, I do *not* claim that this is a particularly elegant
solution :-)

-- 
Raymond Wiker                        Mail:  ·············@fast.no
Senior Software Engineer             Web:   http://www.fast.no/
Fast Search & Transfer ASA           Phone: +47 23 01 11 60
P.O. Box 1677 Vika                   Fax:   +47 35 54 87 99
NO-0120 Oslo, NORWAY                 Mob:   +47 48 01 11 60

Try FAST Search: http://alltheweb.com/

From: Pierre R. Mai
Subject: Re: Seeking a *trivial* XML parser
Date: Tue, 05 Feb 2002 21:20:07 +0000
Message-ID: <87adunk4mw.fsf@orion.bln.pmsf.de>

··········@tfeb.org (Tim Bradshaw) writes:

> So I need an XML parser.  I have spent some time looking around for
> one, and there are obviously a lot of offerings out there, many of
> which are probably very good.  All the ones I've looked at look rather
> more complex than I need though.  What I want is the absolute minimal
> possible thing: since I get to write the DTDs I can completely control
> what it will come across, and I don't need namespaces or any of the
> other enormous complexities that encrust these things.  All I need,
> really, is a politically-acceptable syntax for SEXPRs.
> 
> The parser can be external - I can run a program and snarf the output
> if need be.

FWIW, I've tended to use the expat parser library by James Clark,
which comes dual-licenced under the MPL/GPL.  For simple uses, I've
written the following wrapper for expat, that parses standard input,
and outputs a simple, lisp-readable representation of the file, with
the following features:

- Translates UTF-8 to ISO Latin-1, so that the resulting output can be
  used as-is with normal 8bit Unix lisps.  Stuff outside of ISO
  Latin-1 is silently elided (can easily be changed to dump core
  instead ;).
- PCDATA is mapped to CL strings, where as many PCDATA segments as
  possible are merged into one string.
- Elements are mapped to lists, with the first item being the start
  tag, which is mapped to a nested list, i.e.

  <element attr="value">...</element>
  is mapped to
  (("element" "attr" "value") ...)
- Processing Instructions (PIs) are mapped to a single cons, i.e.

  <?foo ...?> is mapped to ("foo" . "...")

The appended file is hereby placed into the public domain.

Regs, Pierre.

/* This is an interface program that uses expat to parse XML and
 * output a Lispified representation that can be easily parsed by the
 * normal Common Lisp reader.
 */

#include <stdio.h>
#include <string.h>   /* strlen */
#include "xmlparse.h" /* expat's header; newer expat releases ship it as <expat.h> */

/* Since XML PCDATA elements can be returned in multiple chunks by
 * expat, and we want this merged into one string for the Lisp side of
 * things, we keep track of the current inText state, i.e. whether the
 * last thing we output was text, or not.  We only write opening
 * double quotes on !inText -> inText transitions, and closing douple
 * quotes on inText -> !inText transitions.
 */

void finishText(int *inText)
{
  if (*inText)
  {
      putchar('"');
      putchar('\n');
      *inText=0;
  }
} 

void startText(int *inText)
{
    if (!(*inText))
    {
	putchar('"');
	*inText=1;
    }
}

/* Handle conversion from UTF-8 to ISO Latin-1, to which we restrict
 * our Lisp side support for the moment.  Characters outside of the
 * ISO Latin-1 8bit range will be SILENTLY elided. */

void outputText(const unsigned char* text,int len)
{
  int pos=0;
  
  while (pos<len)
    {
      if (text[pos] < 0x80) 
	{
	  /* ASCII:  Output verbatim, except for escape-chars */
	  if (text[pos] == '\\' || text[pos] == '"')
	    putchar('\\');
	  putchar(text[pos++]);
	}
      else if (text[pos] < 0xC0) 
	{
	  /* We are in the middle of a multi-byte sequence!
	   * This should never happen, so we skip it. */
	  pos++;
	}
      else if (text[pos] < 0xE0) 
	{
	  /* Two-byte sequence.  Skip if follow on char is not a
	   * valid continuation byte: */
	  /* (a continuation byte has the bit pattern 10xxxxxx) */
	  if ((pos+1>=len) || ((text[pos+1] & 0xC0) != 0x80))
	    {
	      pos++;
	      continue;
	    }
	  /* Check whether we have a valid ISO Latin-1 character: */
	  if (text[pos] < 0xC4) 
	    {
	      /* Valid, output this and next byte */
	      putchar(((text[pos] & 0x03) << 6) | (text[pos+1] & 0x3f));
	    }
	  
	  pos+=2;
	}
      else if (text[pos] < 0xF0)
	{
	  /* Three-byte sequence. Skip it. */
	  if ((pos+1>=len) || ((text[pos+1] & 0xC0) != 0x80))
	    {
	      pos++;
	      continue;
	    }
	  if ((pos+2>=len) || ((text[pos+2] & 0xC0) != 0x80))
	    {
	      pos+=2;
	      continue;
	    }
	  pos+=3;
	}
      else
	{
	  /* 4 to 6 byte sequences encode characters beyond the BMP,
	   * which cannot be ISO Latin-1.  We skip until the next non
	   * continuation character. */
	  do 
	    {
	      pos++;
	    }
	  while ((pos<len) && ((text[pos] & 0xC0) == 0x80));
	}
    }
}
	  
void outputString(const char* text)
{
  /* expat hands us plain char strings; outputText wants unsigned bytes */
  outputText((const unsigned char*)text,(int)strlen(text));
}

/* Handle Element start and stop tags */

void startElement(void *userData, const char *name, const char **atts)
{
  const char** att;
  finishText((int*)userData);
  fputs("((\"",stdout);
  outputString(name);
  putchar('"');
  for (att=atts;*att;att+=2)
    {
      fputs(" \"",stdout);
      outputString(*att);
      fputs("\" \"",stdout);
      outputString(*(att+1));
      fputs("\"",stdout);
    }
  fputs(")\n",stdout);
}

void endElement(void *userData, const char *name)
{
    finishText((int*)userData);
    fputs(")\n",stdout);
}

/* Handle PCDATA */

void charData(void* userData, const XML_Char *s,int len)
{
    startText((int*)userData);
    outputText((const unsigned char*)s,len);
}

/* Handle PIs */
void processingInstruction(void* userData,const XML_Char *target,const XML_Char *data)
{
  finishText((int*)userData);
  fputs("(\"",stdout);
  outputString(target);
  fputs("\" . \"",stdout);
  outputString(data);
  fputs("\")",stdout);
}

/* Main program */

int main()
{
  char buf[BUFSIZ];
#ifndef CL_NS_SEP
  XML_Parser parser = XML_ParserCreate(NULL);
#else
  XML_Parser parser = XML_ParserCreateNS(NULL,CL_NS_SEP);
#endif
  int done;
  int inText = 0;
  XML_SetUserData(parser, &inText);
  XML_SetElementHandler(parser, startElement, endElement);
  XML_SetCharacterDataHandler(parser, charData);
  XML_SetProcessingInstructionHandler(parser,processingInstruction);
  do {
    size_t len = fread(buf, 1, sizeof(buf), stdin);
    done = len < sizeof(buf);
    if (!XML_Parse(parser, buf, (int)len, done)) {
      fprintf(stderr,
	      "%s at line %d\n",
	      XML_ErrorString(XML_GetErrorCode(parser)),
	      (int)XML_GetCurrentLineNumber(parser));
      return 1;
    }
  } while (!done);
  XML_ParserFree(parser);
  return 0;
}
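
For anyone who wants to test the UTF-8 to ISO Latin-1 step on its own,
here is a minimal sketch of that conversion as a separate function.
The names utf8_to_latin1 and is_cont are hypothetical and not part of
the file above.  Unlike outputText() it writes into a caller-supplied
buffer instead of stdout, does no Lisp string escaping, and also drops
overlong two-byte forms (lead bytes 0xC0 and 0xC1), so treat it as an
illustration of the idea rather than a drop-in replacement.

```c
#include <assert.h>
#include <stddef.h>

/* Is b a UTF-8 continuation byte (bit pattern 10xxxxxx)? */
static int is_cont(unsigned char b)
{
  return (b & 0xC0) == 0x80;
}

/* Convert len bytes of UTF-8 at in to ISO Latin-1 at out, silently
 * dropping anything that does not fit.  out must have room for len
 * bytes.  Returns the number of bytes written. */
size_t utf8_to_latin1(const unsigned char *in, size_t len, unsigned char *out)
{
  size_t pos = 0, n = 0;

  assert(in != NULL && out != NULL);
  while (pos < len)
    {
      unsigned char c = in[pos];
      if (c < 0x80)
        {
          /* ASCII maps to itself. */
          out[n++] = c;
          pos++;
        }
      else if (c >= 0xC2 && c <= 0xC3 && pos + 1 < len && is_cont(in[pos + 1]))
        {
          /* U+0080..U+00FF: the low 2 bits of the lead byte followed
           * by the 6 payload bits of the continuation byte. */
          out[n++] = (unsigned char)(((c & 0x03) << 6) | (in[pos + 1] & 0x3F));
          pos += 2;
        }
      else
        {
          /* Stray continuation byte, overlong form, truncated
           * sequence, or a character beyond U+00FF: skip this byte
           * and any continuation bytes that follow it. */
          pos++;
          while (pos < len && is_cont(in[pos]))
            pos++;
        }
    }
  return n;
}
```

Fed the five UTF-8 bytes 63 61 66 C3 A9, this writes four Latin-1
bytes ending in E9; a Euro sign (E2 82 AC) is dropped entirely.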


-- 
Pierre R. Mai <····@acm.org>                    http://www.pmsf.de/pmai/
 The most likely way for the world to be destroyed, most experts agree,
 is by accident. That's where we come in; we're computer professionals.
 We cause accidents.                           -- Nathaniel Borenstein

From: Pierre R. Mai
Subject: Re: Seeking a *trivial* XML parser
Date: Wed, 06 Feb 2002 22:12:46 +0000
Message-ID: <87g04ew97l.fsf@orion.bln.pmsf.de>

"Pierre R. Mai" <····@acm.org> writes:

> FWIW, I've tended to use the expat parser library by James Clark,
> which comes dual-licenced under the MPL/GPL.  For simple uses, I've
> written the following wrapper for expat, that parses standard input,
> and outputs a simple, lisp-readable representation of the file, with
> the following features:

A couple of points that have cropped up in private email:

- If you don't need support for processing instructions, just comment
  out the following line in main():

>   XML_SetProcessingInstructionHandler(parser,processingInstruction);

  This will give you a simpler format on the lisp side, since you can
  now treat any cons as an element spec.

- Namespaces can be supported by defining the pre-processor symbol
  CL_NS_SEP to a character that is then used to separate the
  namespace from the identifier in relevant things like GIs...

- In XML (like in mixed content model SGML), whitespace _is_
  significant.  Only the application can decide whether to elide it in
  certain cases (e.g. elements that only contain other elements).

  Since the elements.c wrapper guarantees that PCDATA segments are
  merged as much as possible, it is easy to trim whitespace from
  PCDATA, and elide PCDATA segments which are all whitespace in a
  simple post-processing stage, e.g.

(defconstant +ws-char-bag+ '(#\Space #\Tab #\Newline)
  "Or whatever else you like to call whitespace...")

(defun post-process-xml (list)
  (assert (consp list))
  (if (and (stringp (car list)) (stringp (cdr list)))
      ;; PIs get passed-through
      list
      (list* (first list)
	     (mapcan #'(lambda (elem)
			 (if (stringp elem)
			     (let ((result (string-trim +ws-char-bag+ elem)))
			       (if (zerop (length result))
				   nil
				   (list result)))
			     (list (post-process-xml elem))))
		     (rest list)))))

With such a post-processing stage, the following XML fragment

<config>
  <attrib name="foo">value</attrib>
  <attrib name="bar">another value</attrib>
</config>

will result in this list:

(("config")
 (("attrib" "name" "foo") "value")
 (("attrib" "name" "bar") "another value"))

Regs, Pierre.

From: Jeff Sandys
Subject: Re: Seeking a *trivial* XML parser
Date: Wed, 06 Feb 2002 21:27:54 +0000
Message-ID: <3C619FDA.4816AA1B@juno.com>

Tim Bradshaw wrote:
> So I need an XML parser.
	http://www.garshol.priv.no/download/xmltools/
The CLOCC XML parser works fine, easy to use,
if you want an all lisp system.
The expat-Lisp bindings also work, if you want to compile 
expat for greater parsing speed.

Thanks,
Jeff Sandys