From: rif
Subject: Looking for a CL "web scraper"
Date: 
Message-ID: <wj08ybs2ccc.fsf@five-percent-nation.mit.edu>
I'm interested in writing a CL program that can organize data obtained
from websites.  The program needs to "pretend to be a browser", and
download dynamically generated pages (I believe HTTP "GET"s).
Ideally, I'd like a tool that will do as much as possible ---
downloading, parsing, etc., but I'll start with whatever I can get.  I
looked at the cliki web tools, but I'm seeing mostly servers and HTML
generators; the closure web browser maybe has what I need inside it,
but doesn't seem to have any docs.  Any recommendations?

Cheers,

rif

From: TLOlczyk
Subject: Re: Looking for a CL "web scraper"
Date: 
Message-ID: <b80gj0596c92pcrkjlugjio49alk8gkt42@4ax.com>
On 03 Sep 2004 00:41:39 -0400, rif <···@mit.edu> wrote:

>
>I'm interested in writing a CL program that can organize data obtained
>from websites.  The program needs to "pretend to be a browser", and
>download dynamically generated pages (I believe HTTP "GET"s).
>Ideally, I'd like a tool that will do as much as possible ---
>downloading, parsing, etc., but I'll start with whatever I can get.  I
>looked at the cliki web tools, but I'm seeing mostly servers and HTML
>generators; the closure web browser maybe has what I need inside it,
>but doesn't seem to have any docs.  Any recommendations?
>
From the way you wrote your post, I suspect that you know that
all the HTML you will spider is simple and all the HTTP requests are
simple GETs. If that is the case then you should be able to find
wrappers for libwww, expat, libxml2 and libcurl.

However, I suspect that is not the case (you just made it seem so),
in which case you are stuck. If you have to deal with arbitrary
HTML, then you have to deal with some of the really bad HTML that
IE accepts. This kind of HTML chokes almost all parsers. The
only parser I know of that can handle this kind of HTML is
Perl's parser.

I pretty much suspect that nothing like this is freely available in
Lisp. If it were, then I would think the people who wrote it
would modify it to compile to a so/dll and distribute it to the
general community. It would make for great advertising.


The reply-to email address is ··········@yahoo.com.
This is an address I ignore.
To reply via email, remove 2002 and change yahoo to
interaccess,

**
Thaddeus L. Olczyk, PhD

There is a difference between
*thinking* you know something,
and *knowing* you know something.
From: Paolo Amoroso
Subject: Re: Looking for a CL "web scraper"
Date: 
Message-ID: <87u0uf5wg3.fsf@plato.moon.paoloamoroso.it>
TLOlczyk <··········@yahoo.com> writes:

> ). In which case you are stuck. If you have to deal with arbitrary
> HTML, then you have to deal with some really bad HTML that
> IE accepts. This kind of HTML chokes almost all parsers. The
> only parser I know of that can handle this kind of HTML is
> Perl's parser. 

The paper by Gilbert Baumann I have mentioned elsewhere in this thread
also covers parsing both sane and illegal HTML.  Here is the abstract:

  "Closure - A Web Browser in Common Lisp"
  Gilbert Baumann
  Proceedings of ILC 2002

  Closure is a complete and standards-conforming web browser, which is
  written in Common Lisp. It contains, besides HTML-4.01 support, a
  correct implementation of CSS-1 and aims to be a serious competitor
  to wide-spread browsers such as Mozilla and IE.  Closure uses CLIM
  for its user interface which allows the user to easily extend the
  browser with new commands and makes it feasible to offer rich
  debugging aids like an interactive parse tree inspector which is of
  tremendous help to make Closure sufficiently bug-compatible with
  existing browsers. Finally, the choice of using Common Lisp allowed
  for a fairly correct, maintainable, and modular yet efficient
  implementation.


Paolo
-- 
Why Lisp? http://alu.cliki.net/RtL%20Highlight%20Film
Recommended Common Lisp libraries/tools (Google for info on each):
- ASDF/ASDF-INSTALL: system building/installation
- CL-PPCRE: regular expressions
- UFFI: Foreign Function Interface
From: TLOlczyk
Subject: Re: Looking for a CL "web scraper"
Date: 
Message-ID: <m9vij092e3sib4tappqq279sqf5vieqgse@4ax.com>
On Fri, 03 Sep 2004 15:11:24 +0200, Paolo Amoroso <·······@mclink.it>
wrote:

1) The abstract does not say that it can parse non-conforming HTML.
Hell, it doesn't even say that it can parse HTML/1.0.
2) The paper itself is not easy to find. The book is not in any
publicly accessible Illinois university library. It is not available
from Amazon. AFAICS it is not available on the net. It's not even in
CiteSeer.
3) Even if the paper were available, it is instructions on how to
build a parser, not a parser itself.
4) It is not clear that extracting the parser is a relatively easy
task. In fact, in most cases it is near impossible.
5) It's not clear that the parser can handle JavaScript. A lot
of spidering requires that you be able to handle JavaScript.
(Two examples: Amazon's "look inside" feature and TV Guide's
TV listings.)
6) If it is a robust solution, then why hasn't anyone extracted the
code and made an "HTML parsing" dll/so? As I pointed
out before, it would be a PR coup: everyone who wants to do
spidering in everything from C++ to Java to Eiffel to Haskell
would know that there is something Lisp can do that
no other language can.


The reply-to email address is ··········@yahoo.com.
This is an address I ignore.
To reply via email, remove 2002 and change yahoo to
interaccess,

**
Thaddeus L. Olczyk, PhD

There is a difference between
*thinking* you know something,
and *knowing* you know something.
From: Paolo Amoroso
Subject: Re: Looking for a CL "web scraper"
Date: 
Message-ID: <87brgms02b.fsf@plato.moon.paoloamoroso.it>
TLOlczyk <··········@yahoo.com> writes:

> On Fri, 03 Sep 2004 15:11:24 +0200, Paolo Amoroso <·······@mclink.it>
> wrote:

This attribution is incorrect; you probably forgot to delete it.  The
quoted text below is from your article, not mine.


[about the Closure web browser by Gilbert Baumann, and an ILC 2002
paper about it]

> 1) The abstract does not say that it can parse non-conforming HTML.

Indeed.  An abstract is usually shorter than the full paper.

But the paper itself includes a section titled "Parsing Illegal HTML".
So, there is at least some weakly suggestive evidence that Closure may
be able to parse non-conforming HTML.


> Hell it doesn't even say that it can parse HTML/1.0.

From the abstract: "Closure is a complete and standards-conforming web
browser, which is written in Common Lisp.  It contains, besides
HTML-4.01 support, a correct implementation  of CSS-1".  More in the
paper text.


> 2) The paper itself is not easy to find. The book is not in Illinois
> university libraries that are publicly accessible. It is not available
> from amazon. AFAICS it is not available on the net. It's not even in
> citeseer. 

I was able to purchase the ILC 2002 proceedings online, with a
vacuum-tube modem, without difficulties.  I just placed an order via
Franz's online store, as I implied in another post in this thread.
If I was able to do that from a province of the empire, in Italy, I
assume it's possible to get a copy from Illinois too--and probably
from most of the planet.

If you are just interested in Gilbert's paper, it is very easy to drop
him a line by email and ask for a copy.  The ILC 2002 proceedings
include a CD with machine-readable versions of the papers and slides,
so it wouldn't take him any extra work.  I would put a copy online
myself, but I would have to check with Gilbert and/or the publisher
for permission first.  And I'm no longer motivated.


> 3) Even if the paper were available, it is the instructions on how to
> build a parser, not a parser itself.

The actual parser is part of the Closure source distribution, which is
freely available online.


> 4) It is not clear that it is a relatively easy task to extract the
> parser. In fact in most cases, it is near impossible.

It probably takes just a few minutes to download the Closure code, and
check whether extracting the parser is possible/convenient or not.


> 5) It's not clear that the parser can handle javascript. A lot

Again, this is a question that the actual code, freely available
online, can answer in a matter of minutes, even without the paper.


> 6) If it is a robust solution, then why hasn't anyone extracted the
> code and made an "HTML parsing" dll/so? As I pointed 
> out before it would be a PR coup, everyone who wants to do 
> spidering in everything from C++ to Java to Eiffel to Haskell
> would know that there was something Lisp can do that
> no other language can.

Before someone decides to do that, one has to know that Closure exists
and has a useful HTML parser in the first place.  That's why I have
mentioned it.

And, incidentally, Gilbert himself has extracted some useful
functionality from Closure and made it available separately.  Is
Google accessible from Illinois?


Paolo
-- 
Why Lisp? http://alu.cliki.net/RtL%20Highlight%20Film
Recommended Common Lisp libraries/tools (Google for info on each):
- ASDF/ASDF-INSTALL: system building/installation
- CL-PPCRE: regular expressions
- UFFI: Foreign Function Interface
From: Paolo Amoroso
Subject: Re: Looking for a CL "web scraper"
Date: 
Message-ID: <87vfeuyvoi.fsf@plato.moon.paoloamoroso.it>
Ingvar <······@hexapodia.net> writes:

> Paolo Amoroso <·······@mclink.it> writes:
[...]
>> But the paper itself includes a section titled "Parsing Illegal HTML".
>> So, there is at least some weakly suggestive evidence that Closure may
>> be able to parse non-conforming HTML.
>
> Unless the state of Closure has changed drastically for the worse (I
> ran it under Allegro, way way back when, before it was CLIMified), it
> can parse illegal HTML, though it will emit diagnostics saying as much
> to the REPL window (well, I ran it in an xterm and that's where it

From the above-mentioned section in Gilbert's paper, I gather that
Closure now does something more useful with illegal HTML, and tries
to fix/render it.  But I can't check the actual program, because the
sources I have don't build even with their own version of McCLIM.
Maybe the build problem has more to do with recent versions of CMUCL
than with McCLIM.


Paolo
-- 
Why Lisp? http://alu.cliki.net/RtL%20Highlight%20Film
Recommended Common Lisp libraries/tools (Google for info on each):
- ASDF/ASDF-INSTALL: system building/installation
- CL-PPCRE: regular expressions
- UFFI: Foreign Function Interface
From: Björn Lindberg
Subject: Re: Looking for a CL "web scraper"
Date: 
Message-ID: <hcsr7pi2j84.fsf@my.nada.kth.se>
TLOlczyk <··········@yahoo.com> writes:

> On Fri, 03 Sep 2004 15:11:24 +0200, Paolo Amoroso <·······@mclink.it>
> wrote:
> 
> 1) The abstract does not say that it can parse non-conforming HTML.
> Hell it doesn't even say that it can parse HTML/1.0.
> 2) The paper itself is not easy to find. The book is not in Illinois
> university libraries that are publicly accessible. It is not available
> from amazon. AFAICS it is not available on the net. It's not even in
> citeseer. 

The ILC-02 proceedings are available online. See:

  http://lambda-the-ultimate.org/node/view/67


Björn
From: Edi Weitz
Subject: Re: Looking for a CL "web scraper"
Date: 
Message-ID: <87oekonds4.fsf@bird.agharta.de>
On 03 Sep 2004 00:41:39 -0400, rif <···@mit.edu> wrote:

> I'm interested in writing a CL program that can organize data
> obtained from websites.  The program needs to "pretend to be a
> browser", and download dynamically generated pages (I believe HTTP
> "GET"s).  Ideally, I'd like a tool that will do as much as possible
> --- downloading, parsing, etc., but I'll start with whatever I can
> get.  I looked at the cliki web tools, but I'm seeing mostly servers
> and HTML generators; the closure web browser maybe has what I need
> inside it, but doesn't seem to have any docs.  Any recommendations?

(Portable)AllegroServe includes a fully-featured web client.
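
For a simple GET it's basically a one-liner.  Untested sketch from
memory (assuming the usual NET.ASERVE.CLIENT package;
DO-HTTP-REQUEST returns the body, the response code, the headers,
and the URI as multiple values):

  (multiple-value-bind (body code headers uri)
      (net.aserve.client:do-http-request
       "http://www.example.com/listing?page=2")
    (declare (ignore headers uri))
    (when (eql code 200)
      body))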

Edi.

-- 

"Lisp doesn't look any deader than usual to me."
(David Thornley, reply to a question older than most languages)

Real email: (replace (subseq ·········@agharta.de" 5) "edi")
From: Kevin Layer
Subject: Re: Looking for a CL "web scraper"
Date: 
Message-ID: <mkd613s402.fsf@*n*o*s*p*a*m*franz.com>
rif <···@mit.edu> writes:

> I'm interested in writing a CL program that can organize data obtained
> from websites.  The program needs to "pretend to be a browser", and
> download dynamically generated pages (I believe HTTP "GET"s).
> Ideally, I'd like a tool that will do as much as possible ---
> downloading, parsing, etc., but I'll start with whatever I can get.  I
> looked at the cliki web tools, but I'm seeing mostly servers and HTML
> generators; the closure web browser maybe has what I need inside it,
> but doesn't seem to have any docs.  Any recommendations?

In addition to what Edi said, if you are an Allegro customer (even of
the Trial Edition), you can check out

   examples/checklinks/checklinks.cl

in the Allegro directory.  It's a link checker that uses the
AllegroServe client interface.

Kevin Layer
From: Marco Baringer
Subject: Re: Looking for a CL "web scraper"
Date: 
Message-ID: <m2k6vbhjsn.fsf@bese.it>
rif <···@mit.edu> writes:

> I'm interested in writing a CL program that can organize data obtained
> from websites.  The program needs to "pretend to be a browser", and
> download dynamically generated pages (I believe HTTP "GET"s).
> Ideally, I'd like a tool that will do as much as possible ---
> downloading, parsing, etc., but I'll start with whatever I can get.  I
> looked at the cliki web tools, but I'm seeing mostly servers and HTML
> generators; the closure web browser maybe has what I need inside it,
> but doesn't seem to have any docs.  Any recommendations?

aserve (and by consequence portableaserve) contains an http client;
it handles get and post request types and cookie management. there's
some code in portableaserve to deal with https, but it only works on
lispworks (afaict).
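
cookie handling is just a matter of passing a cookie-jar around,
roughly like this (untested sketch, i may be misremembering the exact
keyword arguments):

  (let ((jar (make-instance 'net.aserve.client:cookie-jar)))
    ;; the post logs in; the server's session cookie ends up in jar
    (net.aserve.client:do-http-request "http://www.example.com/login"
                                       :method :post
                                       :query '(("user" . "rif")
                                                ("pass" . "secret"))
                                       :cookies jar)
    ;; later requests with the same jar send the cookie back
    (net.aserve.client:do-http-request "http://www.example.com/data"
                                       :cookies jar))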

as far as parsing "real world" html i'd suggest franz's html parser
(in their xmlutils), [plug: which i've ported to a few
implementations http://common-lisp.net/project/bese/pxmlutils.html].
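
the parser takes a string (or a stream) and returns the page as a
list structure ("lhtml") which is easy to walk, even when the markup
is sloppy.  something like this (again from memory, so the exact
output shape may differ):

  (net.html.parser:parse-html "<p>one<p>two")
  ;; => ((:p "one") (:p "two")) or thereabouts

  ;; combined with the aserve client:
  (net.html.parser:parse-html
   (net.aserve.client:do-http-request "http://www.example.com/"))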

[thanks to franz for releasing aserve and xmlutils]

-- 
-Marco
Ring the bells that still can ring.
Forget your perfect offering.
There is a crack in everything.
That's how the light gets in.
     -Leonard Cohen
From: Reini Urban
Subject: Re: Looking for a CL "web scraper"
Date: 
Message-ID: <9fdb4c8c.0409200155.315d327@posting.google.com>
Marco Baringer wrote:
> as far as parsing "real world" html i'd suggest franz's html parser
> (in their xmlutils), [plug: which i've ported to a few
> implementations http://common-lisp.net/project/bese/pxmlutils.html].

I haven't looked at pxmlutils, but so far only Closure has turned up
for illegal HTML parsing.

Here is another one, in various Scheme dialects, which fixes illegal HTML:
  http://www.neilvandyke.org/htmlprag/
From: Thomas A. Russ
Subject: Re: Looking for a CL "web scraper"
Date: 
Message-ID: <ymi3c1zcjn7.fsf@sevak.isi.edu>
In addition to aserve, you may also want to look at CL-HTTP
from MIT.

  http://www.ai.mit.edu/projects/iiip/doc/cl-http/home-page.html  

-- 
Thomas A. Russ,  USC/Information Sciences Institute
From: Paolo Amoroso
Subject: Re: Looking for a CL "web scraper"
Date: 
Message-ID: <87wtzbh5vs.fsf@plato.moon.paoloamoroso.it>
rif <···@mit.edu> writes:

> generators; the closure web browser maybe has what I need inside it,
> but doesn't seem to have any docs.  Any recommendations?

You may check:

  "A Web Browser in Common Lisp"
  Gilbert Baumann
  Proceedings of ILC 2002

The proceedings can be ordered at the Franz online store.  Gilbert's
paper may already be available online.  If not, you might try asking
him for a copy.

Incidentally, Closure is currently broken with respect to the latest
McCLIM CVS sources.  Which is a pity, because it is a great
application.  I'm afraid my CLIM proficiency is not good enough for
fixing it.


Paolo
-- 
Why Lisp? http://alu.cliki.net/RtL%20Highlight%20Film
Recommended Common Lisp libraries/tools (Google for info on each):
- ASDF/ASDF-INSTALL: system building/installation
- CL-PPCRE: regular expressions
- UFFI: Foreign Function Interface
From: Eric Marsden
Subject: Re: Looking for a CL "web scraper"
Date: 
Message-ID: <wziacw7quew.fsf@melbourne.laas.fr>
>>>>> "pa" == Paolo Amoroso <·······@mclink.it> writes:

  pa> Incidentally, Closure is currently broken with respect to the latest
  pa> McCLIM CVS sources.  Which is a pity, because it is a great
  pa> application.  I'm afraid my CLIM proficiency is not good enough for
  pa> fixing it.

  the following patch should help (I can't currently check whether it
  works with the very latest McCLIM, but it worked a few months ago).
  I've sent it to Gilbert, but he's not really maintaining Closure any
  longer (a shame!).

     <URL:http://www.laas.fr/~emarsden/tmp/closure-200404.diff>


  With respect to the OP's question, I would like to point to
  WebScraperHelper by Neil Van Dyke (though it's for PLT Scheme). It
  generates an sexp-equivalent of XPath expressions to reach a goal in
  an XML document; for real use one would combine it with the same
  author's "pragmatic" HTML parser that produces XML.

     <URL:http://www.neilvandyke.org/webscraperhelper/>
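
  Just to illustrate the idea (this is not Neil's API, merely a
  hypothetical sketch): once the page is an s-expression such as the
  LHTML that phtml produces, an "XPath lite" is only a few lines of
  Common Lisp:

     (defun collect-tags (tag tree)
       "Return every element in TREE (LHTML-style) whose tag is TAG."
       (let ((hits '()))
         (labels ((walk (node)
                    (when (consp node)
                      (let ((head (first node)))
                        (when (or (eq head tag)
                                  (and (consp head)
                                       (eq (first head) tag)))
                          (push node hits)))
                      (mapc #'walk (rest node)))))
           (walk tree))
         (nreverse hits)))

     ;; e.g. (collect-tags :a parsed-page) => all the anchor elements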
     
-- 
Eric Marsden                          <URL:http://www.laas.fr/~emarsden/>
From: Paolo Amoroso
Subject: Re: Looking for a CL "web scraper"
Date: 
Message-ID: <87y8jrywj4.fsf@plato.moon.paoloamoroso.it>
Eric Marsden <········@laas.fr> writes:

>   the following patch should help (I can't currently check whether it
>   works with the very latest McCLIM, but it worked a few months ago).
>   I've sent it to Gilbert, but he's not really maintaining Closure any
>   longer (a shame!).
>
>      <URL:http://www.laas.fr/~emarsden/tmp/closure-200404.diff>

Which version of the Closure sources does it require?  I
currently only have the 2003-03-14 tarball, and the bauhh.dyndns.org
CVS repository is unavailable most of the time.  Are there alternate
repositories or more recent snapshots?

Thanks,


Paolo
-- 
Why Lisp? http://alu.cliki.net/RtL%20Highlight%20Film
Recommended Common Lisp libraries/tools (Google for info on each):
- ASDF/ASDF-INSTALL: system building/installation
- CL-PPCRE: regular expressions
- UFFI: Foreign Function Interface
From: Eric Marsden
Subject: Re: Looking for a CL "web scraper"
Date: 
Message-ID: <wziekletg9r.fsf@melbourne.laas.fr>
>>>>> "pa" == Paolo Amoroso <·······@mclink.it> writes:

  <URL:http://www.laas.fr/~emarsden/tmp/closure-200404.diff>

  pa> Which version of the Closure sources does it require with? I
  pa> currently only have the 2003-03-14 tarball, and the
  pa> bauhh.dyndns.org CVS repository is unavailable most of the time.
  pa> Are there alternate repositories or more recent snapshots?

  the diff is from the CVS sources as of April 2004; I can't
  regenerate a current diff because the repository is down.
  I don't know of other repositories, and I'm not very motivated to
  package up my checked out sources because they contain various
  uninteresting modifications.
    
-- 
Eric Marsden                          <URL:http://www.laas.fr/~emarsden/>
From: Paolo Amoroso
Subject: Re: Looking for a CL "web scraper"
Date: 
Message-ID: <87oeki9nkz.fsf@plato.moon.paoloamoroso.it>
Eric Marsden <········@laas.fr> writes:

>   the diff is from the CVS sources as of April 2004; I can't
>   regenerate a current diff because the repository is down.
>   I don't know of other repositories, and I'm not very motivated to
>   package up my checked out sources because they contain various

I'm not in a hurry; I'll wait for Gilbert's repository to be
accessible again.  Thanks anyway,


Paolo
-- 
Why Lisp? http://alu.cliki.net/RtL%20Highlight%20Film
Recommended Common Lisp libraries/tools (Google for info on each):
- ASDF/ASDF-INSTALL: system building/installation
- CL-PPCRE: regular expressions
- UFFI: Foreign Function Interface