From: Christopher Browne
Subject: Statistics for comp.lang.lisp
Date: 
Message-ID: <apgkl2$18ltb$1@ID-125932.news.dfncis.de>
Following is a summary of articles spanning a 7 day period,
beginning at 20 Oct 2002 11:56:53 GMT and ending at
27 Oct 2002 02:30:55 GMT.

Notes
=====

    - A line in the body of a post is considered to be original if it
      does *not* match the regular expression /^\s{0,3}(?:>|:|\S+>|\+\+|\|\s+|\*\s)/.
    - All text after the last cut line (/^-- $/) in the body is
      considered to be the author's signature.
    - The scanner prefers the Reply-To: header over the From: header
      in determining the "real" e-mail address and name.
    - Original Content Rating is the ratio of the original content volume
      to the total body volume.
    - Please send all comments to Christopher Browne <········@acm.org>
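    As a concrete illustration of the rules above, here is a rough sketch
    in Common Lisp, using CL-PPCRE for the Perl-style regular expressions
    quoted in the notes.  It is not the scanner's actual code; only the
    two regular expressions come from the report, and the function names
    are invented for the example.

      ;; Illustrative sketch only -- not the actual scanner.
      (defun original-line-p (line)
        "A body line is original unless it looks quoted (>, :, foo>, ++, | , * )."
        (not (cl-ppcre:scan "^\\s{0,3}(?:>|:|\\S+>|\\+\\+|\\|\\s+|\\*\\s)" line)))

      (defun cut-line-p (line)
        "A cut line is exactly \"-- \" (dash, dash, space)."
        (cl-ppcre:scan "^-- $" line))

      (defun split-signature (body-lines)
        "Split BODY-LINES at the last cut line; everything after it is signature."
        (let ((cut (position-if #'cut-line-p body-lines :from-end t)))
          (if cut
              (values (subseq body-lines 0 cut) (subseq body-lines (1+ cut)))
              (values body-lines '()))))

      (defun original-content-rating (body-lines)
        "Original Content Rating: original volume divided by total body volume."
        (let ((total (reduce #'+ body-lines :key #'length))
              (orig  (reduce #'+ (remove-if-not #'original-line-p body-lines)
                             :key #'length)))
          (if (zerop total) 0.0 (float (/ orig total)))))

    For example, (original-line-p "> quoted text") is NIL, while
    (original-line-p "some fresh text") is true.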

Excluded Posters
================

····················@mox\.perl\.com

Totals
======

Posters:  142
Articles: 543 (270 with cutlined signatures)
Threads:  44
Volume generated: 1321.1 kb
    - headers:    585.7 kb (9,248 lines)
    - bodies:     694.7 kb (17,268 lines)
    - original:   435.6 kb (11,560 lines)
    - signatures: 40.1 kb (999 lines)

Original Content Rating: 0.627

Averages
========

Posts per poster: 3.8
    median: 2.0 posts
    mode:   1 post - 67 posters
    s:      6.9 posts
Posts per thread: 12.3
    median: 3.0 posts
    mode:   1 post - 12 threads
    s:      26.8 posts
Message size: 2491.3 bytes
    - header:     1104.6 bytes (17.0 lines)
    - body:       1310.2 bytes (31.8 lines)
    - original:   821.5 bytes (21.3 lines)
    - signature:  75.6 bytes (1.8 lines)

Top 10 Posters by Number of Posts
=================================

         (kb)   (kb)  (kb)  (kb)
Posts  Volume (  hdr/ body/ orig)  Address
-----  --------------------------  -------

   57   153.9 ( 64.6/ 80.3/ 63.5)  Erik Naggum <····@naggum.no>
   27    66.6 ( 28.9/ 37.6/ 28.8)  Tim Bradshaw <···@cley.com>
   23    82.0 ( 29.9/ 50.5/ 14.4)  arien <·············@getlost.invalid>
   21    46.8 ( 22.1/ 23.8/ 23.8)  Vassil Nikolov <········@poboxes.com>
   19    47.2 ( 21.8/ 21.8/ 13.0)  Pascal Costanza <········@web.de>
   19    36.9 ( 23.6/ 13.2/  8.0)  Will Deakin <···········@hotmail.com>
   16    35.7 ( 17.9/ 14.5/  7.0)  Barry Margolin <······@genuity.net>
   15    35.5 ( 17.7/ 17.8/ 10.1)  Joe Marshall <···@ccs.neu.edu>
   11    26.8 ( 10.5/ 16.3/  3.8)  "Vlastimil Adamovsky" <·····@ambrasoft.com>
   11    27.2 (  8.0/ 18.3/ 10.4)  Nils Goesche <······@cartan.de>

These posters accounted for 40.3% of all articles.

Top 10 Posters by Volume
========================

  (kb)   (kb)  (kb)  (kb)
Volume (  hdr/ body/ orig)  Posts  Address
--------------------------  -----  -------

 153.9 ( 64.6/ 80.3/ 63.5)     57  Erik Naggum <····@naggum.no>
  82.0 ( 29.9/ 50.5/ 14.4)     23  arien <·············@getlost.invalid>
  66.6 ( 28.9/ 37.6/ 28.8)     27  Tim Bradshaw <···@cley.com>
  47.2 ( 21.8/ 21.8/ 13.0)     19  Pascal Costanza <········@web.de>
  46.8 ( 22.1/ 23.8/ 23.8)     21  Vassil Nikolov <········@poboxes.com>
  36.9 ( 23.6/ 13.2/  8.0)     19  Will Deakin <···········@hotmail.com>
  35.7 ( 17.9/ 14.5/  7.0)     16  Barry Margolin <······@genuity.net>
  35.5 ( 17.7/ 17.8/ 10.1)     15  Joe Marshall <···@ccs.neu.edu>
  34.5 ( 13.3/ 19.3/  8.5)      9  Duane Rettig <·····@franz.com>
  27.2 (  8.0/ 18.3/ 10.4)     11  Nils Goesche <······@cartan.de>

These posters accounted for 42.9% of the total volume.

Top 10 Posters by OCR (minimum of five posts)
==============================================

         (kb)    (kb)
OCR      orig /  body  Posts  Address
-----  --------------  -----  -------

1.000  ( 23.8 / 23.8)     21  Vassil Nikolov <········@poboxes.com>
0.837  (  6.8 /  8.2)      5  ······@qnci.net (William D Clinger)
0.819  ( 13.0 / 15.9)      7  Christopher Browne <········@acm.org>
0.815  (  5.9 /  7.3)      8  Advance Australia Dear <····················@yahoo.com>
0.791  ( 63.5 / 80.3)     57  Erik Naggum <····@naggum.no>
0.765  ( 28.8 / 37.6)     27  Tim Bradshaw <···@cley.com>
0.758  (  2.3 /  3.0)      5  ·········@netscape.net (Jules F. Grosse)
0.716  (  7.0 /  9.8)      5  ···@ashi.footprints.net (Kaz Kylheku)
0.705  (  6.5 /  9.3)      7  "Wade Humeniuk" <····@nospam.nowhere>
0.705  (  2.1 /  3.0)      5  ozan s yigit <··@blue.cs.yorku.ca>

Bottom 10 Posters by OCR (minimum of five posts)
=================================================

         (kb)    (kb)
OCR      orig /  body  Posts  Address
-----  --------------  -----  -------

0.558  (  1.1 /  2.0)      6  Kalle Olavi Niemitalo <···@iki.fi>
0.514  (  1.9 /  3.7)      6  Paolo Amoroso <·······@mclink.it>
0.487  (  2.0 /  4.0)      8  Chris Beggy <······@kippona.com>
0.485  (  7.0 / 14.5)     16  Barry Margolin <······@genuity.net>
0.454  (  5.9 / 12.9)      7  ·······@bcect.com (Michael Sullivan)
0.441  (  5.8 / 13.2)      8  Nils Goesche <···@cartan.de>
0.439  (  8.5 / 19.3)      9  Duane Rettig <·····@franz.com>
0.391  (  3.5 /  8.9)     10  Marc Spitzer <········@optonline.net>
0.285  ( 14.4 / 50.5)     23  arien <·············@getlost.invalid>
0.235  (  3.8 / 16.3)     11  "Vlastimil Adamovsky" <·····@ambrasoft.com>

Top 10 Threads by Number of Posts
=================================

Posts  Subject
-----  -------

  152  Re: Difference between LISP and C++
   67  Midfunction Recursion
   56  A strange question...
   47  Getting the PID in CLISP
   41  Best combination of {hardware / lisp implementation / operating system}
   23  Lisp options on Mac OS X (Was: Best combination of {hardware / lisp implementation / operating system})
   19  Re: "Well, I want to switch over to replace EMACS LISP with Guile."
   16  Lisp advocacy misadventures
   12  iteration vs recursion Performance viewpoint
   11  Re: Franz Liszt & Farewell my Dijkstra

Top 10 Threads by Volume
========================

  (kb)   (kb)  (kb)  (kb)
Volume (  hdr/ body/ orig)  Posts  Subject
--------------------------  -----  -------

 442.1 (209.4/220.3/121.2)    152  Re: Difference between LISP and C++
 142.0 ( 58.9/ 77.3/ 47.6)     67  Midfunction Recursion
 135.4 ( 54.4/ 77.2/ 44.6)     56  A strange question...
  97.1 ( 40.3/ 54.4/ 32.6)     41  Best combination of {hardware / lisp implementation / operating system}
  78.0 ( 44.0/ 30.0/ 19.4)     47  Getting the PID in CLISP
  55.3 ( 23.4/ 30.2/ 18.4)     23  Lisp options on Mac OS X (Was: Best combination of {hardware / lisp implementation / operating system})
  50.6 ( 26.2/ 22.3/ 16.6)     19  Re: "Well, I want to switch over to replace EMACS LISP with Guile."
  45.1 ( 13.8/ 29.8/ 23.1)     16  Lisp advocacy misadventures
  28.3 ( 10.6/ 17.0/ 12.5)     12  iteration vs recursion Performance viewpoint
  23.6 ( 11.7/ 11.7/  6.7)     10  Re: Best combination of {hardware / lisp implementation / operating system}

Top 10 Threads by OCR (minimum of three posts)
==============================================

         (kb)    (kb)
OCR      orig /  body  Posts  Subject
-----  --------------  -----  -------

0.945  (  6.4/   6.8)      3  Stalin's optimisations: Can they be used outside Scheme ?
0.898  ( 10.7/  11.9)      9  CMUCL's PCL Code Walker
0.859  ( 10.3/  12.0)      5  Naggum's got some good points!
0.794  (  3.0/   3.8)      5  Lisp compiler
0.773  ( 23.1/  29.8)     16  Lisp advocacy misadventures
0.744  ( 16.6/  22.3)     19  Re: "Well, I want to switch over to replace EMACS LISP with Guile."
0.735  ( 12.5/  17.0)     12  iteration vs recursion Performance viewpoint
0.718  (  2.0/   2.8)      3  setf-like forms on VALUE places
0.666  (  3.4/   5.1)      3  Re: M-Expressions and early Lisp (was Re: Lisp's unique feature:  compiler available at run-time)
0.646  ( 19.4/  30.0)     47  Getting the PID in CLISP

Bottom 10 Threads by OCR (minimum of three posts)
=================================================

         (kb)    (kb)
OCR      orig /  body  Posts  Subject
-----  --------------  -----  -------

0.616  ( 47.6 / 77.3)     67  Midfunction Recursion
0.608  ( 18.4 / 30.2)     23  Lisp options on Mac OS X (Was: Best combination of {hardware / lisp implementation / operating system})
0.599  ( 32.6 / 54.4)     41  Best combination of {hardware / lisp implementation / operating system}
0.578  ( 44.6 / 77.2)     56  A strange question...
0.572  (  6.7 / 11.7)     10  Re: Best combination of {hardware / lisp implementation / operating system}
0.566  (  3.7 /  6.6)      7  Bounding Indices in Sequence Functions
0.552  (  2.4 /  4.4)      5  FFI Concept, way over my head.
0.552  (  1.8 /  3.3)      6  Re: How much use of CLOS?
0.550  (121.2 /220.3)    152  Re: Difference between LISP and C++
0.495  (  2.2 /  4.5)      3  Re: Lisp options on Mac OS X (Was: Best combination of {hardware / lisp implementation / operating system})

Top 10 Targets for Crossposts
=============================

Articles  Newsgroup
--------  ---------

      49  comp.lang.scheme
      46  comp.lang.smalltalk
       5  comp.lang.smalltalk.advocacy
       2  comp.sys.xerox
       1  comp.text.tex

Top 10 Crossposters
===================

Articles  Address
--------  -------

      20  "Vlastimil Adamovsky" <·····@ambrasoft.com>
       8  Marc Spitzer <········@optonline.net>
       6  Vassil Nikolov <········@poboxes.com>
       6  "Boris Popov" <···········@shaw.ca>
       4  ··········@msn.com (Rich Demers)
       4  "Adam Warner" <······@consulting.net.nz>
       4  David Rush <····@bellsouth.net>
       4  Pascal Costanza <········@web.de>
       3  ····@emf.emf.net (Tom Lord)
       3  panu <····@fcc.net>

From: Paul Wallich
Subject: Re: Statistics for comp.lang.lisp
Date: 
Message-ID: <pw-C36D0C.09155327102002@reader1.panix.com>
In article <··············@ID-125932.news.dfncis.de>,
 Christopher Browne <········@acm.org> wrote:

>Following is a summary of articles spanning a 7 day period,
>beginning at 20 Oct 2002 11:56:53 GMT and ending at
>27 Oct 2002 02:30:55 GMT.
[snip]

The results appear to be an almost perfect object lesson in why 
statistical code metrics are useless without a deeper understanding of 
what's going on. Both the tops and bottoms of the volume, "original 
content" and thread rankings contain solid representations from the most 
and least informative posters and threads of the past week.

At the level these statistics capture, there appears to be little 
difference between succinct explanations and one-liners, or between 
detailed technical responses and rants, and a complex discussion that 
requires keeping a lot of context around looks very much like a 
pedantic-mode exchange of flames. Beats KLOC and function points all 
hollow though.

I'm not immediately sure how deep a parse you would have to do to make 
completely reliable distinctions for this kind of thing.

paul
From: Michael Sullivan
Subject: Re: Statistics for comp.lang.lisp
Date: 
Message-ID: <1fkpr2q.fbr5fi3j95ipN%mes@panix.com>
Paul Wallich <··@panix.com> wrote:

> At the level these statistics capture, there appears to be little 
> difference between succinct explanations and one-liners, or between 
> detailed technical responses and rants, and a complex discussion that
> requires keeping a lot of context around looks very much like a 
> pedantic-mode exchange of flames. Beats KLOC and function points all 
> hollow though.

> I'm not immediately sure how deep a parse you would have to do to make
> completely reliable distinctions for this kind of thing.

I'd be really interested to see what a naive Bayesian approach would do
after I'd developed a database of a few thousand articles or so.

I'm thinking 3 categories:  good, off-topic but interesting, crap.

The really interesting question is whether it could distinguish good
from bad within a similar content style, e.g. distinguishing witty
flamers from whining teenagers.

On the backburner of "things I will do in my copious free time" is a
newsreader that uses this approach to write a 'probfile', as opposed to
a scorefile or kill/tagfile.  The basic idea is to have a way to tell the
program "This article was sorted incorrectly -- it should have been
here:" to update the database, then let it just work on the fly.


Michael
From: Erik Naggum
Subject: Re: Statistics for comp.lang.lisp
Date: 
Message-ID: <3244750632510418@naggum.no>
* Paul Wallich
| The results appear to be an almost perfect object lesson in why
| statistical code metrics are useless without a deeper understanding of
| what's going on.  Both the tops and bottoms of the volume, "original
| content" and thread rankings contain solid representations from the most
| and least informative posters and threads of the past week.

  Well, what are the metrics you have used to determine your conclusions?

| I'm not immediately sure how deep a parse you would have to do to make
| completely reliable distinctions for this kind of thing.

  Readers would have to rate news articles.  For the past few months, I
  have been working on a system to do this with the Norwegian newsgroup
  hierarchy.  I may decide to repeat the experiment with other newsgroups.

-- 
Erik Naggum, Oslo, Norway

Act from reason, and failure makes you rethink and study harder.
Act from faith, and failure makes you blame someone and push harder.
From: Vassil Nikolov
Subject: Re: Statistics for comp.lang.lisp
Date: 
Message-ID: <uu1j7z4o9.fsf@poboxes.com>
    On 27 Oct 2002 23:37:12 +0000, Erik Naggum <····@naggum.no> said:

    [...]
    EN>   Readers would have to rate news articles.

Human readers, rather than news readers, I suppose?

---Vassil.

-- 
For an M-person job assigned to an N-person team, only rarely M=N.
From: Vassil Nikolov
Subject: Re: Statistics for comp.lang.lisp
Date: 
Message-ID: <uof9fz1wq.fsf@poboxes.com>
    On 27 Oct 2002 21:19:02 -0500, Vassil Nikolov <········@poboxes.com> said:

    On 27 Oct 2002 23:37:12 +0000, Erik Naggum <····@naggum.no> said:
    [...]
    EN>   Readers would have to rate news articles.

    VN> Human readers, rather than news readers, I suppose?

Actually, I didn't mean that to sound sarcastic.  I was just
thinking that with all those developments in AI I haven't followed,
I couldn't be sure what a news reader might be able to do...

---Vassil.

-- 
For an M-person job assigned to an N-person team, only rarely M=N.
From: Erik Naggum
Subject: Re: Statistics for comp.lang.lisp
Date: 
Message-ID: <3244804608347600@naggum.no>
* Vassil Nikolov
| Human readers, rather than news readers, I suppose?

  Well, I meant human, but the support for rating has to exist in both the
  client and the server software.

-- 
Erik Naggum, Oslo, Norway

Act from reason, and failure makes you rethink and study harder.
Act from faith, and failure makes you blame someone and push harder.
From: Paul Wallich
Subject: Re: Statistics for comp.lang.lisp
Date: 
Message-ID: <pw-89F8B1.22403227102002@reader1.panix.com>
In article <················@naggum.no>, Erik Naggum <····@naggum.no> 
wrote:

>* Paul Wallich
>| The results appear to be an almost perfect object lesson in why
>| statistical code metrics are useless without a deeper understanding of
>| what's going on.  Both the tops and bottoms of the volume, "original
>| content" and thread rankings contain solid representations from the most
>| and least informative posters and threads of the past week.
>
>  Well, what are the metrics you have used to determine your conclusions?

Purely subjective, based on five years or so of regular reading and name 
recognition, with a sense of what posts are interesting or informative 
to me and what posts appear ditto to others. I expect that someone else 
would have different personal metrics, but think that most of them would 
show a similar spread with respect to the stats given.

>| I'm not immediately sure how deep a parse you would have to do to make
>| completely reliable distinctions for this kind of thing.
>
>  Readers would have to rate news articles.  For the past few months, I
>  have been working on a system to do this with the Norwegian newsgroup
>  hierarchy.  I may decide to repeat the experiment with other newsgroups.

Does such a system integrate reasonably with common newsreaders? 

(On reflection, I think that one could probably distinguish between good 
and bad in threads consisting mostly of long posts with low "original 
content" by looking at posting interval and total thread length. Longer 
intervals and shorter ultimate length for "useful" threads because of 
the time and effort required for cogent replies, but unfortunately the 
measure would be mostly retrospective.)
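A back-of-the-envelope version of that retrospective measure, assuming
the posting times of each thread are available as universal times
(names invented for the example):

  ;; Illustrative only: posts per thread and mean hours between posts.
  (defun thread-pace (post-times)
    "POST-TIMES is a list of universal times for one thread, in any order.
  Returns (values number-of-posts mean-hours-between-posts)."
    (let* ((times (sort (copy-list post-times) #'<))
           (n (length times)))
      (values n
              (when (> n 1)
                (/ (- (car (last times)) (first times))
                   3600.0 (1- n))))))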

paul
From: Christopher Browne
Subject: Re: Statistics for comp.lang.lisp
Date: 
Message-ID: <apidpq$1son6$2@ID-125932.news.dfncis.de>
Centuries ago, Nostradamus foresaw when Paul Wallich <··@panix.com> would write:
> In article <················@naggum.no>, Erik Naggum <····@naggum.no> 
> wrote:
>>* Paul Wallich
>>| The results appear to be an almost perfect object lesson in why
>>| statistical code metrics are useless without a deeper
>>| understanding of what's going on.  Both the tops and bottoms of
>>| the volume, "original content" and thread rankings contain solid
>>| representations from the most and least informative posters and
>>| threads of the past week.
>>
>>  Well, what are the metrics you have used to determine your
>>  conclusions?
>
> Purely subjective, based on five years or so of regular reading and
> name recognition, with a sense of what posts are interesting or
> informative to me and what posts appear ditto to others. I expect
> that someone else would have different personal metrics, but think
> that most of them would show a similar spread with respect to the
> stats given.
>
>>| I'm not immediately sure how deep a parse you would have to do to
>>| make completely reliable distinctions for this kind of thing.
>>
>>  Readers would have to rate news articles.  For the past few
>>  months, I have been working on a system to do this with the
>>  Norwegian newsgroup hierarchy.  I may decide to repeat the
>>  experiment with other newsgroups.
>
> Does such a system integrate reasonably with common newsreaders? 
>
> (On reflection, I think that one could probably distinguish between
> good and bad in threads consisting mostly of long posts with low
> "original content" by looking at posting interval and total thread
> length. Longer intervals and shorter ultimate length for "useful"
> threads because of the time and effort required for cogent replies,
> but unfortunately the measure would be mostly retrospective.)

<http://quimby.gnus.org/gnus/manual/gnus_237.html>

"GroupLens (http://www.cs.umn.edu/Research/GroupLens/) is a
collaborative filtering system that helps you work together with other
people to find the quality news articles out of the huge volume of
news articles generated every day.

To accomplish this the GroupLens system combines your opinions about
articles you have already read with the opinions of others who have
done likewise and gives you a personalized prediction for each unread
news article. Think of GroupLens as a matchmaker. GroupLens watches
how you rate articles, and finds other people that rate articles the
same way. Once it has found some people you agree with it tells you,
in the form of a prediction, what they thought of the article. You can
use this prediction to help you decide whether or not you want to read
the article.

NOTE: Unfortunately the GroupLens system seems to have shut down, so
this section is mostly of historical interest."
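To make the quoted description concrete, a GroupLens-style prediction
boils down to something like the following toy sketch, where a
reader's ratings are an alist of (article-id . score).  The names and
the weighting are invented for the example; the real GroupLens used a
more careful correlation-based weighting.

  ;; Toy sketch of the idea only.
  (defun agreement (mine theirs)
    "Closeness of two rating sets: minus the mean absolute difference over
  articles both have rated, or NIL if they have none in common."
    (let ((diffs (loop for (id . score) in mine
                       for other = (assoc id theirs :test #'equal)
                       when other collect (abs (- score (cdr other))))))
      (when diffs
        (- (/ (reduce #'+ diffs) (length diffs))))))

  (defun predict (article-id mine other-readers)
    "Predict my score for ARTICLE-ID as an agreement-weighted average of
  what OTHER-READERS (a list of rating alists) gave it."
    (let ((num 0) (den 0))
      (dolist (theirs other-readers)
        (let ((a (agreement mine theirs))
              (r (assoc article-id theirs :test #'equal)))
          (when (and a r)
            (let ((w (exp a)))        ; closer agreement => larger weight
              (incf num (* w (cdr r)))
              (incf den w)))))
      (when (plusp den)
        (/ num den))))

Collecting everybody's ratings in one place is of course the hard
part, which is what the server side of GroupLens was for.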

You could presumably build a protocol to share Gnus "score" files with
others who have similar interests, which could also help.

A third approach would be to use something like Paul Graham's
statistical filtering scheme, or something more sophisticated such as
IFile, to classify articles as "good" or "bad", perhaps sharing a
corpus of "good" and "bad" material with others.

I think the ideal way of handling this would probably involve using a
GroupLens-like approach to allow people to share "scoring" information
on articles, which would then be used to define allocations of messages
to corpuses.

Those allocations would then be used to do IFile-like evaluations of
messages, which would mean that /everyone/ would get improved scoring.
When articles were scored wrongly, the feedback would be used to
improve the corpus...

Note that the statistics are intended as much for amusement purposes
as for any serious analysis.  I certainly agree that the value is
pretty dubious; you take them seriously at your own risk...
-- 
(concatenate 'string "cbbrowne" ·@cbbrowne.com")
http://www3.sympatico.ca/cbbrowne/internet.html
"... While programs written for Sun machines won't run unmodified on
Intel-based computers, Sun said the two packages will be completely
compatible and that software companies can convert a program from one
system to the other through a fairly straightforward and automated
process known as ``recompiling.''" -- San Jose Mercury News