From: ·······@Yahoo.Com
Subject: What RobertMaas has been programming recently (1)
Date: 
Message-ID: <REM-2004aug15-001@Yahoo.Com>
This past Wednesday (2004.Aug.11) I came up with a nifty new idea
(hack) for improving on a hack I created, which I mentioned here:
http://www.google.com/groups?selm=REM-2004aug11-003%40Yahoo.Com
Here's a little more about the idea, comparing it to the very crufty
hack I had previously (2002.Mar-May). The old hack ran Lynx as a
sub-process under CMUCL, like this:
    (ext:run-program "/usr/local/bin/lynx"
         (list ... "file://localhost/home/users/rem/Safe/Active.html")
      :WAIT NIL :PTY NIL :INPUT :STREAM :OUTPUT :STREAM)
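From there the controlling code just talks to Lynx through the two
streams of that process object, roughly like this (a simplified sketch,
not the actual 2002 code; SEND-KEY and READ-PENDING-OUTPUT are names
made up here just for illustration):

(defvar *lynx* nil
  "Process object returned by EXT:RUN-PROGRAM for the Lynx sub-process.")

(defun send-key (string)
  "Send one keyboard command (one or more characters) to Lynx."
  (write-string string (ext:process-input *lynx*))
  (force-output (ext:process-input *lynx*)))

(defun read-pending-output ()
  "Collect whatever VT100 output Lynx has emitted so far, without blocking."
  (with-output-to-string (out)
    (loop while (listen (ext:process-output *lynx*))
          do (write-char (read-char (ext:process-output *lynx*)) out))))
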
It traversed that local WebPage to where I have a Yahoo! Mail login
form (with https changed to http so it works with Lynx), filled out
that form and submitted it, dealt with Yahoo wanting to send three
kinds of cookies, finally getting into the toplevel Yahoo! Mail WebPage
for whatever account it logged into. It then used Lynx keyboard
commands to move around in that WebPage, and to click on links to
retrieve other WebPages from Yahoo! Mail. It watched the rendered-HTML
VT100 text emitted by Lynx and compared it with what it expected to see,
to judge whether the latest command had finished, so that it would then
be safe to send the next keyboard command to Lynx.

So how did it know when to use uparrow or downarrow to move the cursor,
and when to use rightarrow (or CarriageReturn) to go into a link, i.e.
how did it "see" what was on-screen and "see" where the cursor was
within the screen, so it could generate the appropriate keyboard
commands at any moment? I used a hack I call "Window Watcher", which
collects the VT100 output from Lynx, compares it with what it saw
before, and uses that to estimate what's really going on. The key trick
was how to handle scrolling (with
ctrl-N or ctrl-P commands that scroll by two lines each way
respectively) and searching (with / command, which may find a target on
the same screen and put the cursor exactly there, or may find a target
on a different screen and put the cursor at the start of the line where
the search string was found). It compared what had been displayed
before with what was being displayed now, finding the longest matching
segment of text, then from what remained finding the next-longest
matching segment (required to be in the same order as the original, i.e.
both-before or both-after that first segment), etc., until it had a
mapping between all long-enough (3-or-so characters each) segments in
old and new text. It then knew which text had scrolled off one edge of
the screen and which text had scrolled onto the other edge.
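In outline, that matching step amounts to something like this (a
much-simplified sketch of the idea, not the real Window Watcher code,
which also has to track the cursor and cope with the raw VT100 escape
sequences):

(defun longest-common-segment (old new)
  "Return (old-start new-start length) for the longest substring
that OLD and NEW have in common."
  (let ((best (list 0 0 0)))
    (dotimes (i (length old) best)
      (loop for len from (- (length old) i) downto 1
            for pos = (search old new :start1 i :end1 (+ i len))
            when pos
              do (when (> len (third best))
                   (setf best (list i pos len)))
                 (return)))))

(defun match-segments (old new &key (min-length 3)
                                    (old-offset 0) (new-offset 0))
  "Return a list of (old-start new-start length) triples for matching
segments of OLD and NEW, with all matches in the same relative order
in both strings."
  (destructuring-bind (i j len) (longest-common-segment old new)
    (when (>= len min-length)
      (append
       ;; match whatever lies before the segment in both strings
       (match-segments (subseq old 0 i) (subseq new 0 j)
                       :min-length min-length
                       :old-offset old-offset :new-offset new-offset)
       (list (list (+ old-offset i) (+ new-offset j) len))
       ;; match whatever lies after the segment in both strings
       (match-segments (subseq old (+ i len)) (subseq new (+ j len))
                       :min-length min-length
                       :old-offset (+ old-offset i len)
                       :new-offset (+ new-offset j len))))))
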
So for example, to collect a list of Subject fields of messages in a
folder, it'd first search for the title that appears at the top or
bottom of the listing of messages, then scroll repeatedly until it saw
the title that appears at the other end. In between it'd observe the
VT100-rendered text that comes out, with a [ ] checkbox at the start of
each message entry, and parse the rendered entry that appears between
those checkboxes. Once it had collected a list of Subject fields in
sequence, it could then go to any of them and click on it to retrieve
the actual message, or it could click on messages as it was originally
scrolling through the listing.

Most of the time this worked fine, but
often there would be five or ten consecutive spams all with the same
Subject field and sender, which caused the segment-matching process in
Window Watcher to be confused as to what in the old screen matched with
what in the newly scrolled screen. So I'd have to constantly watch the
output from the program to see when it was getting confused. As you can
see from the dates I worked on it, I finally gave up on this method
after about two months of developing it, more than two years ago, and
didn't resume work on it until I came up with this new idea just this
past week:

The new idea is: Instead of scrolling the Lynx window and collecting
the VT100 rendered output and trying to make sense of all the
single-window pieces of rendered output, my new program would download
the HTML source for the whole WebPage, then parse it all at one time.
HTML source has a lot more semantic information than the VT100
rendering of it, and getting it all in one piece is better, so I can
flush that flakey Window Watcher.

But wait: once you are in a particular WebPage, Lynx doesn't allow any
way to download it. You have to be one level higher, looking at a link
(anchor) pointing to that WebPage, and issue the 'd' command to
download it. And if you're on
some higher-up WebPage, my software still would need to navigate its
way around that WebPage to find the link to the desired WebPage it
wants to download. My new idea this past Wednesday was a way around
this problem: Instead of downloading via a link from the higher-up
WebPage, I build a local file that has only that single link in it,
load that local file into Lynx, then issue the 'd' command immediately,
knowing I'm at the correct link, the only one in that file. But how
would my program know what URL to put in that single-link file? By
parsing the downloaded higher-level WebPage to extract the relevant
URL.

But how do I download the very top level WebPage of Yahoo! Mail
after logging in, so I have a starting point for this process of
parsing one WebPage to get a URL to build into single-link file for
downloading another WebPage? After filling out the Yahoo! Mail login
form, clicking on the submit (login) link, and accepting three kinds of
cookies, Lynx is already *in* the toplevel Yahoo! Mail WebPage, so
there's no way to download it! That's the last part of my new idea:
Create another local single-link WebPage that has a single link to just
http://mail.yahoo.com. So after Lynx successfully logs in to Yahoo!
Mail and sees the toplevel Yahoo! Mail WebPage, it then just abandons
that and switches to the local single-link WebPage, then uses the 'd'
command to download through that single link to http://mail.yahoo.com.
But because the login cookies are already present, that link
redirects to the actual toplevel Yahoo! Mail WebPage, so that page
rather than the login form is what gets downloaded. Now we're cooking
with fire! For reference, here's that toplevel single-link WebPage:
<html>
<head><title>Single link to mail.yahoo.com</title></head>
<body>
START-TOP<br>
<a href = "http://mail.yahoo.com">link-top</a><br>
VERY_END
</body>
</html>
and here's an example of the lower-level single-link WebPage that gets
written automatically by my program:
<html>
<head><title>Single link from YM one page to another</title></head>
<body>
START-TOP<br>
<a href = "http://us.f113.mail.yahoo.com/ym/ShowFolder?Search=&Npos=4&next=1&YY=74625&inc=200&order=down&sort=date&pos=3&view=&head=&box=BulkAside">link to folder[BulkAside]+4</a><br>
VERY_END
</body>
</html>
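
For concreteness, the glue that produces a file like that one needs only
two small pieces: extract the URL base (scheme plus host) from the
current URL, and write base-plus-relative-URL into the local single-link
file. Here's a rough sketch of both; the function names and the local
file name are just for illustration, though the second one does roughly
what url-write-y1link-file (mentioned below) does:

(defun url-base (url)
  "Return the scheme-plus-host prefix of URL, e.g.
http://us.f113.mail.yahoo.com out of a full Yahoo! Mail URL.
Assumes URL is well-formed (contains ://)."
  (let* ((scheme-end (+ (search "://" url) 3))
         (slash (position #\/ url :start scheme-end)))
    (if slash (subseq url 0 slash) url)))

(defun write-single-link-file (full-url label file)
  "Write FULL-URL into a local single-link WebPage in the format shown above."
  (with-open-file (out file :direction :output :if-exists :supersede)
    (write-line "<html>" out)
    (write-line "<head><title>Single link from YM one page to another</title></head>" out)
    (write-line "<body>" out)
    (write-line "START-TOP<br>" out)
    (format out "<a href = \"~A\">~A</a><br>~%" full-url label)
    (write-line "VERY_END" out)
    (write-line "</body>" out)
    (write-line "</html>" out)))

;; e.g. (write-single-link-file
;;        (concatenate 'string (url-base current-url) relative-url)
;;        "link to folder[BulkAside]+4"
;;        "/home/users/rem/Safe/y1link.html")  ; file name made up here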

I've jumped ahead a little. Those files didn't exist until Saturday. So
here's what I actually did, after thinking of this hack (described
above) on Wednesday:

Thursday (2004.Aug.12): Wrote two semi-parsers. (Terminology: A parser
is a program that takes as input a file that is written per some formal
grammar, and converts that entire file to a parse tree that expresses
everything important about the meaning of that file per that grammar. A
semi-parser merely scans a file looking for key-strings that locate
important parts of the file, and crudely parses and/or abstracts just
those particular parts, ignoring all the rest of the file, and even
those abstracted parts of the file don't necessarily get fully parsed.)

Semi-parser for WebPage that contains single message of Yahoo! Mail.
It returns the following values:
- The BSD mail0 line (the extra line before the RFC822 header).
- An assoc list of the header lines, with each value exactly as Yahoo!
Mail showed it (not yet condensed into a single long line), except that
when there's no leading whitespace (as Yahoo! Mail does for just the
From: and one other header line), one space is inserted to produce
consistent results.
- A string containing the entire body of the message as best I can
recover it from the mess of HTML that Yahoo! Mail generates. For this
first draft I programmed it only for messages that have <pre>...</pre>
around the body. See Saturday below for the enhancement that handles
another message format.
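
To give the flavor of what "semi-parse" means in practice: the
<pre>...</pre> case comes down to grabbing the text between two key
strings, something like this (sketch only; EXTRACT-BETWEEN is an
illustrative name, not the actual function):

(defun extract-between (html start-tag end-tag)
  "Return the text between the first START-TAG and the END-TAG that
follows it, or NIL if either is missing."
  (let* ((start (search start-tag html))
         (end (and start
                   (search end-tag html
                           :start2 (+ start (length start-tag))))))
    (when end
      (subseq html (+ start (length start-tag)) end))))

;; e.g. (extract-between page-source "<pre>" "</pre>") for the message body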

Semi-parser for WebPage that contains listing of up to 200 entries,
each linking to one message. It returns the following values:
- An assoc-list showing the side-links (First, Previous, Next, Last)
which appear at the top of the list of messages.
- A list of the URLs of all the listed messages. Note these are
relative URLs, missing the servername.
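
Again just to show the style: collecting those relative URLs is
essentially a scan for href attributes, roughly like this (sketch only,
assuming double-quoted href attributes; the real semi-parser keys off
strings specific to Yahoo! Mail's markup to pick out just the
per-message links):

(defun collect-hrefs (html)
  "Return every href=\"...\" attribute value in HTML, in order."
  (let ((key "href=\"")
        (hrefs '())
        (pos 0))
    (loop
      (let ((start (search key html :start2 pos)))
        (unless start (return (nreverse hrefs)))
        (let* ((vstart (+ start (length key)))
               (end (position #\" html :start vstart)))
          (unless end (return (nreverse hrefs)))
          (push (subseq html vstart end) hrefs)
          (setf pos (1+ end)))))))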

Friday: I was busy with other activities.

Saturday (2004.Aug.14): Wrote two more semi-parsers, and a lot more:

Semi-parser for toplevel WebPage that appears shortly after
successfully logging into Yahoo! Mail (after dealing with yes I will
accept an invalid cookie domain, accept All cookies in a domain, yes I
will accept yet another invalid cookie domain). This is the same
WebPage gotten if at any later time (before logging out) one goes to
http://mail.yahoo.com. It returns the following value:
- URL of the Folders [Edit] link, to the WebPage that lists all the
folders with statistics for each. Note this is a relative URL.

Semi-parser for the Folders-Edit WebPage, the target of the above link.
It returns the following value:
- An assoc list, each element of form (foldername . URL), where each
URL is relative.

Searched for hours and finally found where I had stashed that abandoned
pty-lynx and Window-Watcher etc. code from 2002.May, so that I could
modify it to use the new hack instead of the old one. Adapted it to do
the following:

After performing login as described earlier, same as in 2002.May:
- Call pty-lynx-atlink-getinfo (from 2002.May) to issue the Lynx '='
command and parse the output to obtain the current URL and the URL of
the current link.
- New code: Semi-parse the current URL to obtain the URL base used for
all the relative URLs that actually occur in Yahoo! Mail WebPages. (Why
don't I just parse the HTML source of one of these WebPages to find the
place where the base is defined? Because I need the base before I can
create an absolute URL in my single-link file to download the WebPage
so that I can then parse it! Catch-22, right?)
- Call pty-lynx-go-url to make Lynx go to the toplevel single-link
local WebPage.
- Call pty-lynx-download to download the toplevel after-login Yahoo!
Mail WebPage.
- Call new (that morning) parser to find URL for Folders [Edit] link.
- Append URL base and that local URL to get full URL for Folders page.
- Call url-write-y1link-file (written just now at this point in
development) to write that full URL into the non-toplevel single-link
local file.
- Call pty-lynx-go-url to go to that newly-written single-link local
file.
- Call pty-lynx-download to download through that link, thereby
downloading the WebPage containing a listing of folders with
statistics.
- Call new (that morning) parser to collect list of folders and URLs.

Wrote new code to browse from that point (details omitted):
-- Given link to folder, fetch that WebPage and parse it and get list
of URLs for first (most recent) 200 individual messages.
-- Given folder WebPage, follow links to Next section to get a listing
of another 200 messages, and continue following Next link until there
is no more Next link because we're already at the last (oldest) section
of messages (a sketch of this loop appears just after this list).
-- Pick a particular message, by sequence number 1 to 200, and retrieve
that particular message, and parse it.
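
In outline, that Next-following loop is just the following (sketch only;
DOWNLOAD-PAGE and PARSE-FOLDER-PAGE are hypothetical wrappers for the
single-link-file-plus-'d' download and for Thursday's listing
semi-parser, and whether the side-link keys are strings like "Next" is
also an assumption):

;; Accumulate the relative URLs of every message in a folder by walking
;; its sections via the Next side-link until there is no Next link.
(defun collect-all-message-urls (first-folder-url url-base)
  (let ((url first-folder-url)
        (all-urls '()))
    (loop
      (multiple-value-bind (side-links message-urls)
          (parse-folder-page (download-page url))
        (setf all-urls (append all-urls message-urls))
        (let ((next (cdr (assoc "Next" side-links :test #'string=))))
          (unless next
            (return all-urls))
          (setf url (concatenate 'string url-base next)))))))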

In the course of the above, I discovered a deficiency in my semi-parser for
individual messages: The Thursday semi-parser assumed each message had
<pre>...</pre> around the message body, but on Saturday I encountered
the first message that didn't, so I had to modify my semi-parser to
handle both cases. Someday if I get curious I'll have to check whether
there's any correlation between this difference in the way Yahoo! Mail
presents a message body, and the MIME type declared in the header.

Also along the way, I discovered a deficiency in my semi-parser for
200-messages-in-folder listings: The Thursday semi-parser assumed there
were exactly 7 <td>...</td> within each row, and picked the specific
elements (sender, subject&link, date, size) from among those
accordingly. But Saturday it blew up when it found entries with only 5
<td>...</td> per row, so I had to parameterize the semi-parser to
handle both cases.
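
The parameterization amounts to something like this (sketch only, with
made-up cell positions; the real positions depend on Yahoo! Mail's
markup): split one <tr> row into its <td> cells, then pick the fields by
position according to how many cells the row has.

(defun row-cells (row-html)
  "Return the contents of each <td>...</td> in ROW-HTML, in order."
  (let ((cells '()) (pos 0))
    (loop
      (let ((start (search "<td" row-html :start2 pos)))
        (unless start (return (nreverse cells)))
        (let ((open-end (position #\> row-html :start start)))
          (unless open-end (return (nreverse cells)))
          (let ((close (search "</td>" row-html :start2 (1+ open-end))))
            (unless close (return (nreverse cells)))
            (push (subseq row-html (1+ open-end) close) cells)
            (setf pos (+ close (length "</td>")))))))))

(defun row-fields (cells)
  "Pick (sender subject date size) out of CELLS; the positions here are
illustrative, not the real ones."
  (ecase (length cells)
    (7 (mapcar (lambda (n) (nth n cells)) '(2 3 4 5)))
    (5 (mapcar (lambda (n) (nth n cells)) '(1 2 3 4)))))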

So basically I have all the tools I need to write various kinds of
spiders for Yahoo! Mail. Some ideas I might work on next:
- For any given account, follow Next links collecting URLs for 200
messages in each listing, and for each such message download and parse
the message to collect the mail0 lines, which can be used to recognize
when I've seen the same message before. Build a database showing these
mail0 lines associated with what's visible on the message-in-folder
listing (the Subject field and the date written and the sender's name).
Later, quickly follow Next links again but fetch and parse only the new
messages that weren't processed before, and fold those new entries into
the database. Bleep me on the terminal whenever new e-mail arrives.
- For each spam listed in above database, download and parse message to
get Received lines, trace internal Yahoo forwarding backwards to
discover last-relay IP number, look up the CTW (Complain To Whom) info,
auto-complain, and flag that message in database as already complained.
- Collect just the list of last-relay IP numbers of spam on Yahoo! Mail
to add to my database of ISPs that have spammed me, so that I can
refuse CGI services to their users just the same as I would if they
spammed me at my Unix ISP address.
- Set up a spam-retrieval service whereby if somebody wants to know why
I have some particular IP number listed as a source of spam, my
software could look that number up in my database to see which Yahoo! Mail
account got spam from there and which particular message there was the
spam, auto-download that spam message, parse, and reconstruct the
original form of the spam.

With just a little more code, namely semi-parsers for various WebPages
presented by Yahoo! Groups, I could write a spider that checked many
such groups to see which have new messages I haven't yet seen, and
notify me, presenting me with a nice menu of what messages have
appeared in which groups, and letting me click on any entry to
auto-retrieve that message and let me view it.

** Why should anybody care? If you read this far, here's your reward, a
quick summary of part of why I'm posting this: In just two workdays
(Thursday and Saturday), I wrote all the code described above and got
it working, building on what I had written 2002.Mar-May. This shows the
wonderful productivity available when programming in lisp (specifically
CL, in this case CMUCL), and also my individual productivity. I really
need some money before I become homeless. Please somebody hire me!!
You'll get a lot of good productivity for your money!!