Hello,
I want to extract information from HTML news pages and convert it
to RSS format. The hard part is filtering out the news. How could
this best be accomplished? There is no semantic information to rely
on; you have to guess what is title and what is content from the
context and from the tags used. I think what I need is some kind of
pattern-matching language based on tree structure (HTML in this case).
Can you give me any pointers where something like this has been done
already?
Thanks,
Roland
·····@gmx.net (Roland) writes:
> Hello,
>
> I want to extract information from HTML news pages. I want to convert
> them to RSS format. The hard part is filtering out the news. How could
> this best be accomplished? There is no semantic information to rely
> on; you have to guess what is title and what is content from the
> context and from the tags used. I think what I need is some kind of
> pattern-matching language based on tree structure (HTML in this case).
>
> Can you give me any pointers where something like this has been done
> already?
XSLT is probably a good match for this, assuming that you can
use XSL stylesheets specialised for individual sites. Of course,
XSL is not really Lisp-related, but you may be able to build something
XSL-like in Common Lisp or Scheme (or possibly use DSSSL).
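
To make the "one stylesheet per site" idea concrete, here is a rough
sketch in Python rather than XSLT or Lisp. It assumes the page is
already well-formed XHTML (real pages often are not), and the rule
table, the site name "example-news" and the class name "story" are
made up for illustration. Each site gets a small set of path
expressions that say where the items, titles and bodies live.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-site rules: ElementTree path expressions for each field.
SITE_RULES = {
    "example-news": {
        "item": ".//div[@class='story']",  # one node per news item
        "title": "h2",                     # relative to the item node
        "body": "p",                       # relative to the item node
    },
}

def node_text(node):
    """Concatenate all text inside a node, or return '' if it is missing."""
    return "".join(node.itertext()).strip() if node is not None else ""

def extract_items(xhtml_text, rules):
    """Apply one site's rules to a well-formed XHTML page."""
    root = ET.fromstring(xhtml_text)
    return [
        {
            "title": node_text(node.find(rules["title"])),
            "description": node_text(node.find(rules["body"])),
        }
        for node in root.findall(rules["item"])
    ]

def to_rss(channel_title, channel_link, items):
    """Serialize the extracted items as a minimal RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = channel_title
    for it in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = it["title"]
        ET.SubElement(item, "description").text = it["description"]
    return ET.tostring(rss, encoding="unicode")

# Made-up sample page for the hypothetical "example-news" site.
SAMPLE = """<html><body>
<div class="nav">Home</div>
<div class="story"><h2>First <b>headline</b></h2><p>Body one.</p></div>
<div class="story"><h2>Second headline</h2><p>Body two.</p></div>
</body></html>"""

items = extract_items(SAMPLE, SITE_RULES["example-news"])
feed = to_rss("Example News", "http://example.com/", items)
```

The point of keeping the rules in a table is the same as with
site-specific stylesheets: when a site changes its layout, only its
entry needs updating, not the extraction code.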
--
Raymond Wiker Mail: ·············@fast.no
Senior Software Engineer Web: http://www.fast.no/
Fast Search & Transfer ASA Phone: +47 23 01 11 60
P.O. Box 1677 Vika Fax: +47 35 54 87 99
NO-0120 Oslo, NORWAY Mob: +47 48 01 11 60
Try FAST Search: http://alltheweb.com/
·····@gmx.net (Roland) wrote in message news:<···························@posting.google.com>..
> I want to extract information from HTML news pages.
Perhaps you might find HtmlPrag useful then.
Quoted from http://www.neilvandyke.org/htmlprag/:
"HtmlPrag provides permissive HTML parsing capability to Scheme
programs, which is useful for software agent extraction of information
from Web pages, for programmatically transforming HTML files, and for
implementing interactive Web browsers. HtmlPrag emits [SXML], so that
conventional HTML may be processed with XML tools such as [SXPath] and
[SXML-Tools]. Like [SSAX-HTML], HtmlPrag provides a permissive
tokenizer, but also attempts to recover structure."
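
For readers who have not seen this style of tool, here is a small
Python sketch of the general idea: parse sloppy HTML permissively into
a nested-list tree (loosely in the spirit of SXML) and then query the
tree. This is not HtmlPrag's algorithm or API; the class and function
names are invented for the example, and the recovery rules are
deliberately minimal (only unclosed <p> and <li> are handled).

```python
from html.parser import HTMLParser

VOID = {"br", "hr", "img", "input", "link", "meta"}  # elements with no content
AUTO_CLOSE = {"p", "li"}  # a new <p> or <li> closes an open one of the same kind

class TreeBuilder(HTMLParser):
    """Build a nested-list tree: [tag, child, child, ...], text as strings."""

    def __init__(self):
        super().__init__()
        self.root = ["*TOP*"]
        self.stack = [self.root]  # chain of currently open elements

    def handle_starttag(self, tag, attrs):
        if tag in AUTO_CLOSE and self.stack[-1][0] == tag:
            self.stack.pop()  # implied end tag
        node = [tag]
        self.stack[-1].append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                return

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].append(data.strip())

def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def find_all(node, tag):
    """Return every descendant element with the given tag, in document order."""
    found = []
    for child in node[1:]:
        if isinstance(child, list):
            if child[0] == tag:
                found.append(child)
            found.extend(find_all(child, tag))
    return found

def text_of(node):
    """Join all text below a node with single spaces."""
    return " ".join(text_of(c) if isinstance(c, list) else c for c in node[1:])

# Sloppy input: neither <p> is closed, and <h1> is never closed either.
tree = parse("<html><body><h1>News<p>First item<p>Second <b>item</b></body></html>")
paragraphs = [text_of(p) for p in find_all(tree, "p")]
```

Once the page is a tree, guessing at titles and content becomes a
matter of writing queries over it (find_all here, SXPath in the Scheme
case) instead of matching on raw text.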
Roland wrote:
> Hello,
>
> I want to extract information from HTML news pages. I want to convert
> them to RSS format. The hard part is filtering out the news. How could
> this best be accomplished? There is no semantic information to rely
> on; you have to guess what is title and what is content from the
> context and from the tags used. I think what I need is some kind of
> pattern-matching language based on tree structure (HTML in this case).
>
> Can you give me any pointers where something like this has been done
> already?
>
> Thanks,
>
> Roland
This is a huge research area. Do a Google search for
information extraction web pages
and explore from there.
-- Drew McDermott
[following up to comp.lang.lisp only]
Roland writes:
> I want to extract information from HTML news pages. I want to convert
> them to RSS format. The hard part is filtering out the news. How could
> this best be accomplished? There is no semantic information to rely
> on, you have to guess what is Title and content from the context and
"Semantic" information in the title? Huh, this is the easy part: the
title of most of the HTML documents I have seen is "(no title)".
Paolo
--
Why Lisp? http://alu.cliki.net/RtL%20Highlight%20Film