From: verec
Subject: HTML scraping
Date: Wed, 9 May 2007 17:39:37 +0100
Message-ID: <4641f949$0$642$5a6aecb4@news.aaisp.net.uk>
Hi. I'm just fishing for ideas ;-)

I've read the "Writing HTML parser wasn't as hard as I thought it'd be"
thread, but didn't notice anything that would fit between the
regexp (or ad-hoc literal search) and the full DOM monster.

A Google on "html scraping" returns quite a few pages, but all the ones
I looked at only offered brittle approaches, at best.

Obviously, the "royal road" would be the full DOM parser, but that
seems really over my time budget :-(

So far I've only used the search-some-anchor / skip-to-">" /
retrieve-data approach, with copious use of trimming, but
that's not only brittle but oh, soooo ugly!
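
Concretely, it's this sort of thing (the helper name and the example
markup are made up, but you get the idea):

  ;; Brittle anchor-based extraction: find a known anchor string,
  ;; skip past the next ">", grab everything up to the next "<".
  (defun scrape-after-anchor (html anchor)
    (let ((start (search anchor html)))
      (when start
        (let* ((gt (position #\> html :start start))
               (lt (and gt (position #\< html :start (1+ gt)))))
          (when (and gt lt)
            (string-trim '(#\Space #\Tab #\Newline)
                         (subseq html (1+ gt) lt)))))))

  ;; e.g. (scrape-after-anchor page "class=\"last-price\"") => "42.17"

One extra attribute or one wrapped tag in the page and it silently
grabs the wrong thing.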

In an "ideal world" the data I'm after would be "semantically
marked", but in the real world, well, it isn't. Not even
mentionning the fact that fully standard compliant (whichever
(x)html standard you want to plug) pages are the exception
rather than the norm.

I kind of _feel_ there's a way out, but that's way too fuzzy
in my mind at this point.

Anyone have any experience with a middle-ground approach?

Also, I have noticed something interesting, which, when you come
to think of it, is only natural: some sites deliver _different_
html content depending on the user agent, which means that scraping
by using regexps on what you see as "the page source" in your browser
of choice may NOT be the actual html you receive when handling
the request yourself.

I'm using cl-http-client, whose user-agent string I can customize
at will; I just didn't expect this to happen ... Call me naive :-)

Any idea?

Many thanks.
--
JFB

From: Pascal Bourguignon
Subject: Re: HTML scraping
Date: 
Message-ID: <87wszhex8i.fsf@thalassa.lan.informatimago.com>
verec writes:

> Hi. I'm just fishing for ideas ;-)
>
> I've read the "Writing HTML parser wasn't as hard as I thought it'd be"
> thread, but didn't notice anything that would fit between the
> regexp (or ad-hoc literal search) and the full DOM monster.
>
> A Google on "html scraping" returns quite a few pages, but all the ones
> I looked at only offered brittle approaches, at best.
>
> Obviously, the "royal road" would be the full DOM parser, but that
> seems really over my time budget :-(
>
> So far I've only used the search-some-anchor / skip-to-">" /
> retrieve-data approach, with copious use of trimming, but
> that's not only brittle but oh, soooo ugly!
>
> In an "ideal world" the data I'm after would be "semantically
> marked", but in the real world, well, it isn't. Not even
> mentioning the fact that fully standards-compliant (whichever
> (x)html standard you want to pick) pages are the exception
> rather than the norm.
>
> I kind of _feel_ there's a way out, but that's way too fuzzy
> in my mind at this point.
>
> Anyone have any experience with a middle-ground approach?
>
> Also, I have noticed something interesting, which, when you come
> to think of it, is only natural: some sites deliver _different_
> html content depending on the user agent, which means that scraping
> by using regexps on what you see as "the page source" in your browser
> of choice may NOT be the actual html you receive when handling
> the request yourself.
>
> I'm using cl-http-client, whose user-agent string I can customize
> at will; I just didn't expect this to happen ... Call me naive :-)
>
> Any idea?

When there is no semantic markup, you can still, for automatically
generated pages, infer the semantics from the physical presentation
markup, thanks to its regularity.
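
For example, assuming the page has been parsed into the list
representation that cl-html-parse produces (a rough, untested sketch),
you can harvest a machine-generated table by position:

  (defun lhtml-tag (node)
    (let ((head (first node)))
      (if (consp head) (first head) head)))

  (defun lhtml-text (node)
    "Concatenate all the strings below NODE."
    (cond ((stringp node) node)
          ((consp node) (apply #'concatenate 'string
                               (mapcar #'lhtml-text (rest node))))
          (t "")))

  (defun table-rows (node)
    "Return each :TR below NODE as a list of its :TD cell texts."
    (cond ((not (consp node)) '())
          ((eq (lhtml-tag node) :tr)
           (list (mapcar #'lhtml-text
                         (remove-if-not
                          (lambda (child)
                            (and (consp child)
                                 (eq (lhtml-tag child) :td)))
                          (rest node)))))
          (t (mapcan #'table-rows (rest node)))))

Since the generator emits the same row structure every time, column N
means the same thing in every row, and the semantics come back for free.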


What would go between "regexps" and a full parser (with ad-hoc error
corrections)?  I only see some AI that would "understand" the source
and build the semantic net from it.  An alternative would be to use a
browser, and have the AI "read" the bitmap built by the browser.  This
solution would be the sturdiest, since it would keep working whatever
changed in the underlying standards (I've not counted, but it looks
like there's at least one new "technology" every year, be it a new
markup language, a new scripting language, or whatever).


-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

"By filing this bug report you have challenged the honor of my
family. Prepare to die!"
From: Alex Mizrahi
Subject: Re: HTML scraping
Date: 
Message-ID: <46420346$0$90273$14726298@news.sunsite.dk>
(message (Hello 'verec)
(you :wrote  :on '(Wed, 9 May 2007 17:39:37 +0100))
(

 v> Obviously, the "royal road" would be the full DOM parser, but that
 v> seems really over my time budget :-(

hmm, what prevents you from using some ready-made html parser, like
cl-html-parse?
possibly i've missed something in your posting, too many letters for me..
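
usage is a one-liner, if i remember the names right:

  (net.html.parser:parse-html "<p>one<p>two")
  ;; => ((:P "one") (:P "two")) -- a plain list, no DOM in sight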

)
(With-best-regards '(Alex Mizrahi) :aka 'killer_storm)
"I am everything you want and I am everything you need") 
From: David Lichteblau
Subject: Re: HTML scraping
Date: 
Message-ID: <slrnf460eq.gti.usenet-2006@babayaga.math.fu-berlin.de>
On 2007-05-09, verec <verec> wrote:
> I've read the "Writing HTML parser wasn't as hard as I thought it'd be"
> thread, but didn't notice anything that would fit between the
> regexp (or ad-hoc literal search) and the full DOM monster.

Completeness of the parser is unrelated to the in-memory representation.
If you dislike DOM (which is very understandable), choose a simpler
model instead.

For example, Closure is one of the Lisp programs with an HTML parser
that also understands broken HTML.

Although its XML parser supports DOM Core, Closure does not at this
point have an implementation of the HTML DOM, and uses its own tree
representation instead.
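
To illustrate (a sketch only, not Closure's actual representation):
even a trivial model like this is enough for scraping, provided the
parser that fills it in tolerates broken input.

  (defstruct node
    (tag nil)           ; a keyword such as :table
    (attributes '())    ; plist, e.g. (:href "http://...")
    (children '()))     ; NODE structs and bare strings

  (defun find-nodes (tag root)
    "Depth-first list of every node with TAG at or below ROOT."
    (when (node-p root)
      (let ((below (mapcan (lambda (child) (find-nodes tag child))
                           (node-children root))))
        (if (eq (node-tag root) tag)
            (cons root below)
            below))))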


d.
From: Rainer Joswig
Subject: Re: HTML scraping
Date: Thu, 10 May 2007 19:10:52 +0100
Message-ID: <joswig-BD3138.20105110052007@news-europe.giganews.com>
In article <·······················@news.aaisp.net.uk>, verec wrote:

> Hi. I'm just fishing for ideas ;-)
>
> I've read the "Writing HTML parser wasn't as hard as I thought it'd be"
> thread, but didn't notice anything that would fit between the
> regexp (or ad-hoc literal search) and the full DOM monster.
>
> A Google on "html scraping" returns quite a few pages, but all the ones
> I looked at only offered brittle approaches, at best.
>
> Obviously, the "royal road" would be the full DOM parser, but that
> seems really over my time budget :-(
>
> So far I've only used the search-some-anchor / skip-to-">" /
> retrieve-data approach, with copious use of trimming, but
> that's not only brittle but oh, soooo ugly!
>
> In an "ideal world" the data I'm after would be "semantically
> marked", but in the real world, well, it isn't. Not even
> mentioning the fact that fully standards-compliant (whichever
> (x)html standard you want to pick) pages are the exception
> rather than the norm.
>
> I kind of _feel_ there's a way out, but that's way too fuzzy
> in my mind at this point.
>
> Anyone have any experience with a middle-ground approach?
>
> Also, I have noticed something interesting, which, when you come
> to think of it, is only natural: some sites deliver _different_
> html content depending on the user agent, which means that scraping
> by using regexps on what you see as "the page source" in your browser
> of choice may NOT be the actual html you receive when handling
> the request yourself.
>
> I'm using cl-http-client, whose user-agent string I can customize
> at will; I just didn't expect this to happen ... Call me naive :-)

Isn't it wonderful if something obscure, but advertised,
works - even unexpectedly?

As a sidenote, IIRC, many years ago somebody wanted to write an
application with CL-HTTP-CLIENT for harvesting emails
from websites - collecting email addresses for spamming...

-- 
http://lispm.dyndns.org
From: verec
Subject: Re: HTML scraping
Date: 
Message-ID: <46438836$0$638$5a6aecb4@news.aaisp.net.uk>
On 2007-05-10 19:10:52 +0100, Rainer Joswig <······@lisp.de> said:

>> I'm using cl-http-client, whose user-agent string I can customize
>> at will; I just didn't expect this to happen ... Call me naive :-)
> 
> Isn't it wonderful if something obscure, but advertised,
> works - even unexpectedly?

:-)

> As a sidenote, IIRC, many years ago somebody wanted to write an
> application with CL-HTTP-CLIENT

My mistake. I meant to say:

s-http-client http://homepage.mac.com/svc/s-http-client/

> for harvesting emails from websites - collecting email
> addresses for spamming...

And no, I have no such devilish purpose in mind: I'm just
scraping option chains from financial sites, looking for
arbitrage opportunities :)
--
JFB
From: GP lisper
Subject: Re: HTML scraping
Date: Fri, 11 May 2007 15:30:51 -0700
Message-ID: <slrnf49rkr.mqu.spambait@phoenix.clouddancer.com>
On Wed, 9 May 2007 17:39:37 +0100, <verec> wrote:
>
> I kind of _feel_ there's a way out, but that's way too fuzzy
> in my mind at this point.
>
> Anyone have any experience with a middle-ground approach?

Don't use lisp.  When lisp has tools better than XML::Twig and
LWP::Simple, then it could be employed.


> Also, I have noticed something interesting, which, when you come
> to think of it, is only natural: some sites deliver _different_
> html content depending on the user agent, which means that scraping
> by using regexps on what you see as "the page source" in your browser
> of choice may NOT be the actual html you receive when handling
> the request yourself.

geeze, if you see this as a problem, abandon hope.  This is so trivial
to beat.


> I'm using cl-http-client, whose user-agent string I can customize

For the last few years I've been running a number of these dataminers.
I've found the most efficient way is to use perl on the front end,
dump into SQL, and then run lisp on the analysis end.  This combination
simply flies, playing to the strengths of each tool.
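
The lisp end is then trivial; a sketch only, assuming CLSQL and a
QUOTES table that the perl front end has already filled in:

  (clsql:connect '("localhost" "scrape" "user" "password")
                 :database-type :mysql)
  ;; each row comes back as a list; pick out the crossed markets
  (loop for (symbol bid ask) in
          (clsql:query "select symbol, bid, ask from quotes")
        when (and (numberp bid) (numberp ask) (> bid ask))
          collect symbol)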


-- 
There are no average Common Lisp programmers
Reply-To: email is ignored.

From: Edi Weitz
Subject: Re: HTML scraping
Date: Sat, 12 May 2007 01:23:31 +0200
Message-ID: <uy7jvndvw.fsf@agharta.de>
On Fri, 11 May 2007 15:30:51 -0700, GP lisper <········@CloudDancer.com> wrote:

> When lisp has tools better than XML::Twig and LWP::Simple, then it
> could be employed.

So, what's so cool about them?  For example, from this annotation

  http://www.annocpan.org/~GAAS/libwww-perl-5.805/lib/LWP/Simple.pm

I gather that LWP::Simple doesn't even get UTF-8 right.  That doesn't
really convince me...

And from a quick glance at the XML::Twig website I can see that the
author itself boasts that it's a "very cool module", but it escapes me
what it has to offer that one couldn't hack with CXML in a couple of
minutes.
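
To be concrete, here's the kind of five-minute hack I mean (untested,
names from memory, the file name made up):

  ;; Parse into a DOM and collect the text of every <title>
  ;; that sits inside an <item>.
  (let* ((doc (cxml:parse-file "feed.xml" (cxml-dom:make-dom-builder)))
         (items (dom:get-elements-by-tag-name doc "item")))
    (loop for i below (dom:length items)
          for item = (dom:item items i)
          for title = (dom:item
                       (dom:get-elements-by-tag-name item "title") 0)
          collect (dom:data (dom:first-child title))))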

-- 

Lisp is not dead, it just smells funny.

Real email: (replace (subseq ·········@agharta.de" 5) "edi")
From: Edi Weitz
Subject: Re: HTML scraping
Date: 
Message-ID: <utzujndrl.fsf@agharta.de>
On Sat, 12 May 2007 01:23:31 +0200, Edi Weitz <········@agharta.de> wrote:

> And from a quick glance at the XML::Twig website I can see that the
> author itself boasts that it's a "very cool module"
------------^

Himself, of course.  Time to go to bed... :)

-- 

Lisp is not dead, it just smells funny.

Real email: (replace (subseq ·········@agharta.de" 5) "edi")
From: GP lisper
Subject: Re: HTML scraping
Date: Sat, 12 May 2007 06:04:14 -0700
Message-ID: <slrnf4beqe.oht.spambait@phoenix.clouddancer.com>
On Sat, 12 May 2007 01:23:31 +0200, <········@agharta.de> wrote:
> On Fri, 11 May 2007 15:30:51 -0700, GP lisper <········@CloudDancer.com> wrote:
>
> And from a quick glance at the XML::Twig website I can see that the
> author itself boasts that it's a "very cool module", but it escapes me
> what it has to offer that one couldn't hack with CXML in a couple of
> minutes.

Gee Edi, wonder why no one has done that.  Perhaps because more than
hot air is required from a lisper.  Bet that you cannot do it either
in your 'few minutes', and that you come back with some snappy comment
as to why your few minutes are more important to spend elsewhere.  Not
to mention that the lisp package you cite is so incomplete that it
requires further work.  Lisp sucks at this problem.  XML::Twig was
providing a useful dataflow in minutes, your sour grapes
notwithstanding.

-- 
There are no average Common Lisp programmers
Reply-To: email is ignored.

-- 
Posted via a free Usenet account from http://www.teranews.com
From: Zach Beane
Subject: Re: HTML scraping
Date: 
Message-ID: <m34pmi85x9.fsf@unnamed.xach.com>
GP lisper <········@CloudDancer.com> writes:

> Not to mention that the lisp package you cite is so incomplete that
> it requires further work.  Lisp sucks at this problem.

I'm sorry CXML doesn't fit your needs. What do you need it to do that
needs more work?

I've been very happy using CXML for a wide variety of XML processing
tasks in Lisp. I looked into XML::Twig, but it has the huge downside
that you have to throw away the nice interactive style of Lisp to work
with it. CXML has the advantage of being not only very complete,
documented, and useful, but also, since it's Lisp, usable in a
very supportive development environment.

Zach
From: David Lichteblau
Subject: Re: HTML scraping
Date: Sat, 12 May 2007 16:38:14 +0200
Message-ID: <slrnf4bkc7.91m.usenet-2006@babayaga.math.fu-berlin.de>
On 2007-05-12, GP lisper <········@CloudDancer.com> wrote:
>> And from a quick glance at the XML::Twig website I can see that the
>> author itself boasts that it's a "very cool module", but it escapes me
>> what it has to offer that one couldn't hack with CXML in a couple of
>> minutes.
>
> Gee Edi, wonder why no one has done that.  Perhaps because more than
> hot air is required from a lisper.  Bet that you cannot do it either
> in your 'few minutes', and that you come back with some snappy comment
> as to why your few minutes are more important to spend elsewhere.  Not
> to mention that the lisp package you cite is so incomplete that it
> requires further work.  Lisp sucks at this problem.  XML::Twig was
> providing a useful dataflow in minutes, your sour grapes
> notwithstanding.

I realize that you are just trolling, but for the benefit of those
readers who might not be aware of that, I would like to point out that
CXML is in fact not incomplete at all.

It is a conforming XML processor, with a validating parser, an
implementation of DOM 2 Core (the only one written in Lisp that I am
aware of), catalog support, rich APIs including both a push-based and
recently also a pull-based parser.
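
As a taste of the pull-based interface (klacks), a rough sketch
(the element name is invented):

  ;; Walk the event stream, collecting the character data that
  ;; appears directly inside each <price> element.
  (klacks:with-open-source (s (cxml:make-source #p"chain.xml"))
    (let ((prices '()) (in-price nil))
      (loop
        (multiple-value-bind (key a b) (klacks:consume s)
          (case key
            (:start-element (setf in-price (equal b "price")))
            (:characters    (when in-price (push a prices)))
            (:end-element   (setf in-price nil))
            (:end-document  (return (nreverse prices))))))))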

Of course, some other specifications are not implemented yet.  For
example, I am currently working on an implementation of Relax NG.  XPath
support is also missing, and is on my "to do" list.  (CL-XML has an
XPath library, which might be easy to port.)

I am also committed to good documentation.  If anyone is interested in
using CXML but has trouble finding out how to work with it based on the
existing documentation, please point out the areas in which you would
like to see improvements.


Now, I do not know whether Twig is designed as a conforming XML
processor or as an error-tolerant parser for broken XML.  If it falls
into the latter category, it serves a very different purpose than CXML.
(There are, of course, other libraries available in Lisp for the purpose
of parsing broken HTML or XML, and they could be used together with
CXML's higher-level APIs without too much work.)
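
A sketch of the glue (PARSE-LENIENT-HTML and HTTP-GET are placeholders
for whichever lenient parser and HTTP client you pick; the only
assumption is that the parser can feed SAX events to a handler):

  ;; Hand the lenient parser's SAX events to CXML's DOM builder,
  ;; then work on the resulting DOM as if the input had been XML.
  (defun scrape-tables (url)
    (let ((doc (parse-lenient-html (http-get url)
                                   (cxml-dom:make-dom-builder))))
      (dom:get-elements-by-tag-name doc "table")))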


d.
From: John Thingstad
Subject: Re: HTML scraping
Date: 
Message-ID: <op.tr7813agpqzri1@pandora.upc.no>
On Sat, 12 May 2007 16:38:14 +0200, David Lichteblau  
<···········@lichteblau.com> wrote:

>
> Of course, some other specifications are not implemented yet.  For
> example, I am currently working on an implementation of Relax NG.  XPath
> support is also missing, and is on my "to do" list.  (CL-XML has an
> XPath library, which might be easy to port.)
>

I like what I am hearing. RelaxNG is way better than XSD.
XPath makes finding random data much easier.
If you look into XPath, perhaps you could take a sideways glance at
SimpleXML in PHP.
Keep up the good work! :)

-- 
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
From: Holger Schauer
Subject: Re: HTML scraping
Date: Sun, 13 May 2007
Message-ID: <yxzps54bxig.fsf@gmx.de>
On 5001 September 1993, David Lichteblau wrote:
> It is a conforming XML processor, with a validating parser, an
> implementation of DOM 2 Core (the only one written in Lisp that I am
> aware of), catalog support, rich APIs including both a push-based and
> recently also a pull-based parser.

FWIW, while cxml has catalog support, it supports XML catalogs
only. Things being as they are, I need to handle SGML catalogs. 
Tough luck for me. :-)

> I am also committed to good documentation.  If anyone is interested in
> using CXML but has trouble finding out how to work with it based on the
> existing documentation, please point out the areas in which you would
> like to see improvements.

Well, if the implementation of the current catalog handling were
documented, I would probably give handling SGML catalogs a try ...
No, that's not a serious complaint; I know how to use the source, and
the current documentation of the API is well done.

Holger

-- 
---          http://hillview.bugwriter.net/            ---
"Welche Linux-Distribution soll ich nehmen?"
               -- Aus "Wie starte ich einen Endlosthread ?", Teil 1171
From: David Lichteblau
Subject: Re: HTML scraping
Date: 
Message-ID: <slrnf4eb7q.li2.usenet-2006@babayaga.math.fu-berlin.de>
On 2007-05-13, Holger Schauer <··············@gmx.de> wrote:
> Well, if the implementation of the current catalog handling would be
> documented, I would probably give handling SGML catalogs a try ...
> No, that's not a serious complaint, I know how to use the source and
> the current documentation of the API is well done.

Hmm.  It is a legitimate complaint, I think.  There are two mechanisms
allowing users to redirect specific public or system IDs to a different
location.

One is the `entity-resolver' keyword argument, which is documented and
does whatever the user wants.  The other is catalogs, which do exactly
one thing and are hooked into the parser directly.

So what's the difference?  Entity resolvers return streams, not URIs.
Because of that, catalogs are compatible with DTD caching, and entity
resolvers are not.

It would probably be best to allow entity resolvers to return a file
URI instead of a stream, so that catalogs (of any variety) could be
implemented on top of them, outside of the parser.
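
For reference, the documented mechanism as it stands today (the DTD
path is invented for the example):

  ;; An entity resolver is a function of PUBLIC-ID and SYSTEM-ID
  ;; returning an octet stream, or NIL to fall back to the default.
  (cxml:parse-file
   "document.xml"
   (cxml-dom:make-dom-builder)
   :entity-resolver
   (lambda (public-id system-id)
     (declare (ignore system-id))
     (when (equal public-id "-//W3C//DTD XHTML 1.0 Strict//EN")
       (open #p"/usr/local/share/dtds/xhtml1-strict.dtd"
             :element-type '(unsigned-byte 8)))))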


d.
From: Edi Weitz
Subject: Re: HTML scraping
Date: 
Message-ID: <u7ireru1c.fsf@agharta.de>
On Sat, 12 May 2007 06:04:14 -0700, GP lisper <········@CloudDancer.com> wrote:

> Gee Edi, wonder why no one has done that.

How do you know that?  Not everything that has been done has been
released so that you can use it for free.

> Perhaps because more than hot air is required from a lisper.

I think that, as far as Lisp open source software is concerned, I have
provided a bit more than hot air in recent years.  Have you?
(Except for your cool nickname, of course...)

> Bet that you cannot do it either in your 'few minutes', and that you
> come back with some snappy comment as to why your few minutes are
> more important to spend elsewhere.  Not to mention that the lisp
> package you cite is so incomplete that it requires further work.

That's bullshit.  I've used CXML in various commercial projects.  It
worked fine and was easy to use.

> Lisp sucks at this problem.  XML::Twig was providing a useful
> dataflow in minutes, your sour grapes notwithstanding.

I note that you haven't even tried to answer my question, so I'm not
interested in continuing this thread anymore.

-- 

Lisp is not dead, it just smells funny.

Real email: (replace (subseq ·········@agharta.de" 5) "edi")
From: Alain Picard
Subject: Re: HTML scraping
Date: 
Message-ID: <87k5vd8rbq.fsf@memetrics.com>
GP lisper <········@CloudDancer.com> writes:

> Gee Edi, wonder why no one has done that.  Perhaps because more than
> hot air is required from a lisper.  Bet that you cannot do it either
> in your 'few minutes', and that you come back with some snappy comment
> as to why your few minutes are more important to spend elsewhere.  

Have you even the _slightest_ idea to whom you are responding?
Edi has put his money where his mouth is.  I can download
and use his packages (and I do) and they are of uniformly _exceptional_
quality.

I can't even figure out what _your_ name is, much less evaluate
what your contribution has been.

Let's say that, given this baseline, I'll take his technical evaluation
of the difficulty of a problem over yours any day...


[ Thx for all the great work Edi.  And don't let the morons get you down. ]
From: mirod
Subject: Re: HTML scraping
Date: 
Message-ID: <46481f92$0$7744$5fc30a8@news.tiscali.it>
Edi Weitz wrote:
> On Fri, 11 May 2007 15:30:51 -0700, GP lisper <········@CloudDancer.com> wrote:
> 
>> When lisp has tools better than XML::Twig and LWP::Simple, then it
>> could be employed.
> 
> So, what's so cool about them?  For example, from this annotation
> 
>   http://www.annocpan.org/~GAAS/libwww-perl-5.805/lib/LWP/Simple.pm
> 
> I gather that LWP::Simple doesn't even get UTF-8 right.  That doesn't
> really convince me...
> 
> And from a quick glance at the XML::Twig website I can see that the
> author itself boasts that it's a "very cool module", but it escapes me
> what it has to offer that one couldn't hack with CXML in a couple of
> minutes.
> 

It escapes said author too ;--)  Especially as his last foray into lisp
dates back 20 years, to an expert system with a Minitel (!)
interface.

Thanks for pointing out the silly "very cool module" sentence, BTW. I
probably wrote it in the proud-but-exhausted phase that comes right
after a major release, a few years ago. I have fixed it now.

As for the question of whether XML::Twig processes XML or HTML, the
answer would be... both.  Or maybe neither.  It delegates those tasks to
other modules: XML::Parser (which in turn wraps expat) for straight
XML, and HTML::TreeBuilder to convert HTML into XHTML, which is then
processed as XML.

I hope that answers some of your questions about that probably very
non-lispy library.

-- 
mirod