Extracting news from HTML pages

From: Roland
Subject: Extracting news from HTML pages
Date: Wed, 03 Dec 2003 21:02:36 +0000
Message-ID: <63566228.0312031302.913ad7c@posting.google.com>

Hello,

I want to extract information from HTML news pages and convert it to
RSS format. The hard part is separating the news from the rest of the
page. How could this best be accomplished? There is no semantic
information to rely on; you have to guess what is the title and what
is the content from the context and from the tags used. I think what I
need is some kind of pattern-matching language based on tree structure
(HTML in this case).

Can you give me any pointers where something like this has been done
already?

Thanks,

Roland

From: Raymond Wiker
Subject: Re: Extracting news from HTML pages
Date: Wed, 03 Dec 2003 21:35:54 +0000
Message-ID: <86d6b5futh.fsf@raw.grenland.fast.no>

·····@gmx.net (Roland) writes:

> Hello,
>
> I want to extract information from HTML news pages. I want to convert
> them to RSS format. The hard part is filtering out the news. How could
> this best be accomplished? There is no semantic information to rely
> on, you have to guess what is Title and content from the context and
> from the tags used. I think what I need is some kind of pattern
> matching language based on tree structure (HTML in this case).
>
> Can you give me any pointers where something like this has been done
> already?

        XSLT is probably a good match for this, assuming that you can
use XSL stylesheets specialised for individual sites. Of course,
XSL is not really Lisp-related, but you may be able to build something
XSL-like in Common Lisp or Scheme (or possibly use DSSSL).
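
A minimal sketch of the "one stylesheet per site" idea, written in
Python rather than XSLT or Lisp. Everything named here (SITE_RULES,
the "example-news" site, the div class "story", extract_items, to_rss)
is an assumption made up for illustration, and the page is assumed to
be well-formed XHTML; real tag soup would need a permissive parser
first.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-site rules: each entry plays the role of a
# site-specific stylesheet, expressed as ElementTree path expressions.
SITE_RULES = {
    "example-news": {
        "item": ".//div[@class='story']",  # one match per news item
        "title": "h2/a",                   # relative to the item node
        "body": "p",
    },
}

def extract_items(xhtml, site):
    """Return a list of {title, link, description} dicts for one page."""
    rules = SITE_RULES[site]
    root = ET.fromstring(xhtml)
    items = []
    for node in root.findall(rules["item"]):
        anchor = node.find(rules["title"])
        body = node.find(rules["body"])
        if anchor is None:
            continue  # block matched the item pattern but has no headline
        items.append({
            "title": "".join(anchor.itertext()).strip(),
            "link": anchor.get("href", ""),
            "description": "".join(body.itertext()).strip() if body is not None else "",
        })
    return items

def to_rss(items, channel_title, channel_link):
    """Serialise extracted items as a minimal RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = channel_title
    for item in items:
        entry = ET.SubElement(channel, "item")
        for key in ("title", "link", "description"):
            ET.SubElement(entry, key).text = item[key]
    return ET.tostring(rss, encoding="unicode")

# Made-up sample page: one navigation block and two story blocks.
SAMPLE = """<html><body>
<div class="nav"><a href="/">Home</a></div>
<div class="story"><h2><a href="/a1">First headline</a></h2><p>First body.</p></div>
<div class="story"><h2><a href="/a2">Second headline</a></h2><p>Second body.</p></div>
</body></html>"""

if __name__ == "__main__":
    print(to_rss(extract_items(SAMPLE, "example-news"),
                 "Example news", "http://example.invalid/"))
```

Supporting another site would then mean adding another SITE_RULES
entry rather than writing new code, which is roughly the appeal of
the stylesheet approach.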

-- 
Raymond Wiker                        Mail:  ·············@fast.no
Senior Software Engineer             Web:   http://www.fast.no/
Fast Search & Transfer ASA           Phone: +47 23 01 11 60
P.O. Box 1677 Vika                   Fax:   +47 35 54 87 99
NO-0120 Oslo, NORWAY                 Mob:   +47 48 01 11 60

Try FAST Search: http://alltheweb.com/

From: ····@pobox.com
Subject: Re: Extracting news from HTML pages
Date: Thu, 04 Dec 2003 22:26:46 +0000
Message-ID: <7eb8ac3e.0312041426.29307a0f@posting.google.com>

·····@gmx.net (Roland) wrote in message news:<···························@posting.google.com>..
> I want to extract information from HTML news pages.

Perhaps you might find HtmlPrag useful then.

Quoted from http://www.neilvandyke.org/htmlprag/:

	"HtmlPrag provides permissive HTML parsing capability to Scheme
programs, which is useful for software agent extraction of information
from Web pages, for programmatically transforming HTML files, and for
implementing interactive Web browsers. HtmlPrag emits [SXML], so that
conventional HTML may be processed with XML tools such as [SXPath] and
[SXML-Tools]. Like [SSAX-HTML], HtmlPrag provides a permissive
tokenizer, but also attempts to recover structure."
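
To illustrate what "permissive parsing that recovers structure" can
mean, here is a rough sketch in Python using the standard html.parser
module. It is not HtmlPrag and does not reproduce HtmlPrag's output;
the nested-list shape is only loosely modelled on SXML, and the names
TreeBuilder, parse, find_all and text_of are invented for this example.

```python
from html.parser import HTMLParser

# Elements that never have content, so they are not pushed on the stack.
VOID = {"br", "hr", "img", "input", "link", "meta"}

class TreeBuilder(HTMLParser):
    """Build nested lists [tag, attrs, child, ...] from tag soup."""

    def __init__(self):
        super().__init__()
        self.root = ["*top*", {}]
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = [tag, dict(attrs)]
        self.stack[-1].append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Recover structure: close back to the nearest matching open
        # element and silently ignore end tags that were never opened.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].append(data.strip())

def parse(html):
    """Parse possibly malformed HTML into a nested-list tree."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def find_all(node, tag):
    """Yield every element named tag, depth first."""
    for child in node[2:]:
        if isinstance(child, list):
            if child[0] == tag:
                yield child
            yield from find_all(child, tag)

def text_of(node):
    """Concatenate the text found anywhere below node."""
    parts = []
    for child in node[2:]:
        parts.append(text_of(child) if isinstance(child, list) else child)
    return " ".join(p for p in parts if p)

if __name__ == "__main__":
    # Unclosed <b>, stray </span>, and a final <p> that is never closed.
    soup = "<div><h2>Top story</h2><p>First <b>bold text</p></div><p>Second</span>"
    tree = parse(soup)
    for p in find_all(tree, "p"):
        print(text_of(p))
```

Once the page is a tree, picking out headlines and bodies becomes a
matter of tree queries, which is the job SXPath does on the Scheme
side.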

From: Drew McDermott
Subject: Re: Extracting news from HTML pages
Date: Thu, 04 Dec 2003 22:29:26 +0000
Message-ID: <bqocg6$arj$1@news.wss.yale.edu>

Roland wrote:
> Hello,
> 
> I want to extract information from HTML news pages. I want to convert
> them to RSS format. The hard part is filtering out the news. How could
> this best be accomplished? There is no semantic information to rely
> on, you have to guess what is Title and content from the context and
> from the tags used. I think what I need is some kind of pattern
> matching language based on tree structure (HTML in this case).
> 
> Can you give me any pointers where something like this has been done
> already?
> 
> Thanks,
> 
> Roland

This is a huge research area.  Do a Google search for

                information extraction web pages

and explore from there.

    -- Drew McDermott

From: Paolo Amoroso
Subject: Re: Extracting news from HTML pages
Date: Thu, 04 Dec 2003 19:54:15 +0000
Message-ID: <87llpsz7dk.fsf@plato.moon.paoloamoroso.it>

[following up to comp.lang.lisp only]

Roland writes:

> I want to extract information from HTML news pages. I want to convert
> them to RSS format. The hard part is filtering out the news. How could
> this best be accomplished? There is no semantic information to rely
> on, you have to guess what is Title and content from the context and

"Semantic" information in the title? Huh, this is the easy part: the
title of most of the HTML documents I have seen is "(no title)".
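
A sketch of one way to cope with that, in Python: treat a missing or
placeholder <title> as absent and fall back to the first heading. The
PLACEHOLDERS set, the choice of h1 and h2 as fallbacks, and the names
HeadingCollector and guess_title are all assumptions for illustration,
not an established heuristic.

```python
from html.parser import HTMLParser

# Titles that carry no information (hypothetical list, lower-cased).
PLACEHOLDERS = {"", "(no title)", "untitled", "untitled document"}

class HeadingCollector(HTMLParser):
    """Record the text of <title> and of each <h1>/<h2> in document order."""

    WANTED = ("title", "h1", "h2")  # also the order of preference

    def __init__(self):
        super().__init__()
        self.found = []      # list of (tag, text) pairs
        self.current = None  # tag currently being collected, if any
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.WANTED and self.current is None:
            self.current, self.buffer = tag, []

    def handle_endtag(self, tag):
        if tag == self.current:
            # Collapse runs of whitespace inside the heading text.
            text = " ".join("".join(self.buffer).split())
            self.found.append((tag, text))
            self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.buffer.append(data)

def guess_title(html):
    """Return the best title candidate, or None if nothing usable exists."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    for wanted in HeadingCollector.WANTED:
        for tag, text in collector.found:
            if tag == wanted and text.lower() not in PLACEHOLDERS:
                return text
    return None

if __name__ == "__main__":
    page = ("<html><head><title>(no title)</title></head>"
            "<body><h1>Budget  vote delayed</h1></body></html>")
    print(guess_title(page))
```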


Paolo
-- 
Why Lisp? http://alu.cliki.net/RtL%20Highlight%20Film