From: Emre Sevinc
Subject: Code review: CL for SNA (Social Network Analysis)
Date: 
Message-ID: <87d5jrou6v.fsf@ileriseviye.org>
A fellow Lisper of mine started to use CL for his budding SNA
work.

First thing he did was to write some CL code that grabs e-mail
archive and parses it.

We tried the code on a few lists, not very bad but not very
fast, too.

So I thought maybe the wise people here would like to review
and criticize the code below (example usage is at the very
end of the list), in terms of style, performance, optimization,
being idiomatic, etc.:


;; should be different on your system.
(pushnew "/home/vst/vstDocs/usr/lib/common-lisp/system/" asdf:*central-registry* :test #'equal)
(asdf:oos 'asdf:load-op :trivial-http)
(asdf:oos 'asdf:load-op :cl-ppcre)

(defun character-stream-to-string (input-stream)
  "Convert stream to string. For using with trivial-http responses."
  (let ((strlist nil))
	(do ((line (read-line input-stream nil nil)
			   (read-line input-stream nil nil)))
		((null line) (apply #'concatenate 'string (reverse strlist)))
	  (push line strlist))))

(defun get-gzipped-mails-list (index-content url-prefix)
  "Extract mothly mbox archive links from html page"
  (let ((result nil))
	(cl-ppcre:do-register-groups (first)
		("<A href=\"([^\"]*\.txt)" index-content)
	  (push first result))
	(let ((res2 nil))
	  (loop for i in result do
			(if (cl-ppcre:scan "^http:" i)
				(push i res2)
				(push (concatenate 'string url-prefix i) res2))) res2)))

(defun get-the-list-of-mboxs (url)
  "Returns a list of available mboxes on the pipermail archive index."
  (reverse (get-gzipped-mails-list
			(character-stream-to-string (caddr (thttp:http-get url)))
			url)))

(defun prepare-full-raw-archive (url-list file-path)
  "Gets all the mboxes and concatenates them into one mbox on local file path"
  (with-open-file (output-stream file-path :direction :OUTPUT)
	(loop for rfile in url-list do
		  (let ((content-stream (caddr (thttp:http-get rfile))))
			(do ((line (read-line content-stream nil nil)
					   (read-line content-stream nil nil)))
				((null line) t)
			  (format output-stream "~A~%" line))))))

(defun from-line? (line)
  "Returns true, if the line looks like a FROM line. Note that we use the regular expression *if* really needed."
  (if (and (> (length line) 11) (equal (subseq line 0 5) "From "))
	  (ppcre:scan "^From [^ ]+ at [^ ]+" line)
	  nil))

(defun date-line? (line)
  "Check for DATE line"
  (and (> (length line) 6) (equal (subseq line 0 6) "Date: ")))

(defun subject-line? (line)
  "Check for SUBJECT line"
  (and (>= (length line) 9) (equal (subseq line 0 9) "Subject: ")))

(defun empty-line? (line)
  "Check for an EMPTY line"
  (equal (length line) 0))

(defun msgid-line? (line)
  "Check for a MESSAGE-ID line"
  (and (> (length line) 13) (equal (subseq line 0 12) "Message-ID: ")))

(defun reply-to-line? (line)
  "Check for REPLY-TO line"
  (and (> (length line) 14) (equal (subseq line 0 13) "In-Reply-To: ")))

;; TODO: the function below looks horrible. do something...
(defun parse-mails-from-mbox (mbox-pathname)
  "Parse the mbox and return a list of associated lists, where each assoc. list corresponds to an email"
  (with-open-file (content-stream mbox-pathname)
	(let ((maillist nil)
		  (mail nil)
		  (seek-for-from t))
	  (do ((line (read-line content-stream nil nil)
				 (read-line content-stream nil nil)))
		  ((null line) (setf maillist (reverse maillist)))
		;; from - header - body - from - header - body - ...
		(cond
		  ((eq seek-for-from t)
		   (cond
			 ((from-line? line)
			  (progn
				(setf maillist (cons (reverse mail) maillist))
				(setf mail nil)
				(setf mail (cons (cons ':FROM line) mail))
				(setf seek-for-from nil)))
			 (t (cond 
				  ((null (assoc ':CONTENT mail)) (setf mail (cons (cons ':CONTENT line) mail)))
				  (t
				   (setf (cdr (assoc ':CONTENT mail))
						 (concatenate 'string
									  (cdr (assoc ':CONTENT mail))
									  (string #\Newline)
									  line)))))))
		  (t (cond
			   ((date-line? line) (setf mail (cons (cons ':DATE line) mail)))
			   ((subject-line? line) (setf mail (cons (cons ':SUBJECT line) mail)))
			   ((msgid-line? line) (setf mail (cons (cons ':MESSAGE-ID line) mail)))
			   ((reply-to-line? line) (setf mail (cons (cons ':REPLY-TO-ID line) mail)))
			   ((empty-line? line) (setf seek-for-from t))))))maillist)))


;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Examples
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

(setf *url-list* (get-the-list-of-mboxs "http://cs.bilgi.edu.tr/pipermail/cs-lisp/"))
(prepare-full-raw-archive *url-list* "/tmp/archive.mbox")
(setf *mails* (parse-mails-from-mbox "/tmp/archive.mbox"))




-- 
Emre Sevinc

eMBA Software Developer         Actively engaged in:
http:www.bilgi.edu.tr           http://ileriseviye.org
http://www.bilgi.edu.tr         http://fazlamesai.net
Cognitive Science Student       http://cazci.com
http://www.cogsci.boun.edu.tr

From: Rob Warnock
Subject: Re: Code review: CL for SNA (Social Network Analysis)
Date: 
Message-ID: <a5ednZyDKJpemTTeRVn-tQ@speakeasy.net>
Emre Sevinc  <·····@bilgi.edu.tr> wrote:
+---------------
| (defun date-line? (line)
|   "Check for DATE line"
|   (and (> (length line) 6) (equal (subseq line 0 6) "Date: ")))
+---------------

IMHO, all of the ones of this pattern would be better written as follows
[notice the CHAR-EQUAL, since RFC 2822 specifies case-insensitive header
field names]:

   (defun date-line-p (line)
     (not (mismatch "Date: " line :end2 6 :test #'char-equal)))

Even better, fold all those into one general case:

   (defun prefix-match-p (prefix line)
     (not (mismatch prefix line :end2 (length prefix) :test #'char-equal)))

which you can use as (PREFIX-MATCH-P "date: " LINE), etc.

Note that RFC 2822 does *NOT* actually require a space after the colon
[though it's not likely you'll see that case], so you'd really be safer
with just (PREFIX-MATCH-P "date:" LINE), etc.

+---------------
| ;; TODO: the function below looks horrible. do something...
+---------------

Yup, the (CONS (CONS ...)) stuff is pretty ugly. But if you replace
the DO loop with a LOOP, you can using COLLECT INTO multiple variables
to vastly simplify the building of the multiple result lists.


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: Christophe Rhodes
Subject: Re: Code review: CL for SNA (Social Network Analysis)
Date: 
Message-ID: <sqoe3an99x.fsf@cam.ac.uk>
····@rpw3.org (Rob Warnock) writes:

> IMHO, all of the ones of this pattern would be better written as follows
> [notice the CHAR-EQUAL, since RFC 2822 specifies case-insensitive header
> field names]:
>
>    (defun date-line-p (line)
>      (not (mismatch "Date: " line :end2 6 :test #'char-equal)))
>
> Even better, fold all those into one general case:
>
>    (defun prefix-match-p (prefix line)
>      (not (mismatch prefix line :end2 (length prefix) :test #'char-equal)))
>
> which you can use as (PREFIX-MATCH-P "date: " LINE), etc.

I think you need
  (defun prefix-match-p (prefix line)
    (and (>= (length line) (length prefix))
         (not (mismatch prefix line :end2 (length prefix) :test #'char-equal))))
otherwise you run the risk of signalling an error, which isn't what
you want, when the argument to END2 is larger than the length of the line.

That can also be written
  (defun prefix-match-p (prefix line)
    (let ((mismatch (mismatch prefix line :test #'char-equal)))
      (or (null mismatch) (= mismatch (length prefix)))))

Cheers,

Christophe
From: Rob Warnock
Subject: Re: Code review: CL for SNA (Social Network Analysis)
Date: 
Message-ID: <tbudncQrY7VUrzTeRVn-uw@speakeasy.net>
Christophe Rhodes  <·····@cam.ac.uk> wrote:
+---------------
| ····@rpw3.org (Rob Warnock) writes:
| > (defun prefix-match-p (prefix line)
| >   (not (mismatch prefix line :end2 (length prefix) :test #'char-equal)))
| 
| I think you need
| (defun prefix-match-p (prefix line)
|   (and (>= (length line) (length prefix))
|        (not (mismatch prefix line :end2 (length prefix) :test #'char-equal))))
| otherwise you run the risk of signalling an error, which isn't what
| you want, when the argument to END2 is larger than the length of the line.
+---------------

Oops! Correct, good point.

+---------------
| That can also be written
|   (defun prefix-match-p (prefix line)
|     (let ((mismatch (mismatch prefix line :test #'char-equal)))
|       (or (null mismatch) (= mismatch (length prefix)))))
+---------------

O.k., but to my taste this is slightly simpler:

    (defun prefix-match-p (prefix line)
      (not (mismatch prefix line :end2 (min (length prefix) (length line))
				 :test #'char-equal)))


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: Thomas A. Russ
Subject: Re: Code review: CL for SNA (Social Network Analysis)
Date: 
Message-ID: <ymivexiidnr.fsf@sevak.isi.edu>
Emre Sevinc <·····@bilgi.edu.tr> writes:

> 
> 
> A fellow Lisper of mine started to use CL for his budding SNA
> work.
> 
> First thing he did was to write some CL code that grabs e-mail
> archive and parses it.
> 
> We tried the code on a few lists, not very bad but not very
> fast, too.

Well, that calls for doing some profiling.  You can either use the tools
(if any) that came with your lisp implementation or else you can use a
more general tool like metering.cl by Mark Kantrowitz to figure out
where you are really spending your time.  Then you can focus
optimization effort on the part that's really important.

http://www-cgi.cs.cmu.edu/afs/cs/project/ai-repository/ai/lang/lisp/code/tools/metering/metering.cl 


-- 
Thomas A. Russ,  USC/Information Sciences Institute
From: Tayssir John Gabbour
Subject: Re: Code review: CL for SNA (Social Network Analysis)
Date: 
Message-ID: <1135187478.446414.289690@z14g2000cwz.googlegroups.com>
Emre Sevinc wrote:
> A fellow Lisper of mine started to use CL for his budding SNA
> work.
>
> First thing he did was to write some CL code that grabs e-mail
> archive and parses it.
>
> We tried the code on a few lists, not very bad but not very
> fast, too.
>
> So I thought maybe the wise people here would like to review
> and criticize the code below (example usage is at the very
> end of the list), in terms of style, performance, optimization,
> being idiomatic, etc.:
>
> (defun get-gzipped-mails-list (index-content url-prefix)
>   "Extract mothly mbox archive links from html page"
>   (let ((result nil))
> 	(cl-ppcre:do-register-groups (first)
> 		("<A href=\"([^\"]*\.txt)" index-content)
> 	  (push first result))
> 	(let ((res2 nil))
> 	  (loop for i in result do
> 			(if (cl-ppcre:scan "^http:" i)
> 				(push i res2)
> 				(push (concatenate 'string url-prefix i) res2))) res2)))


I'm curious about something; this isn't big deal but I'm curious. Why
is this preferred

;; orig
(let ((res2 nil))
  (loop for i in result do
       (if (cl-ppcre:scan "^http:" i)
           (push i res2)
           (push (concatenate 'string url-prefix i) res2))) res2)

over something like this (untested)?

;; new
(loop for i in result
      if (cl-ppcre:scan "^http:" i) collect i
      else collect (concatenate 'string url-prefix i))


Does the original style lead to better debuggability in case of errors,
as the LOOP form is less complex?


Tayssir

 --
"Let's talk about the question of why people are wealthy. There is a
myth that it's a function of enormous personal attributes... the
individual wealth which is generated in this economy is, in my
judgment, and I doubt that there is much that anyone could disagree
with about this, is a function of the innovative businesses which are
created as a result of federal research. But you understand that the
people who benefit from that research get it free... It starts from
this incredible research activity which is going on with federal
money."
-- Bill Gates Sr., 2003
From: Lars Brinkhoff
Subject: Re: Code review: CL for SNA (Social Network Analysis)
Date: 
Message-ID: <85ek46spaz.fsf@junk.nocrew.org>
Emre Sevinc <·····@bilgi.edu.tr> writes:
> ((date-line? line) (setf mail (cons (cons ':DATE line) mail)))
> ((subject-line? line) (setf mail (cons (cons ':SUBJECT line) mail)))
> ((msgid-line? line) (setf mail (cons (cons ':MESSAGE-ID line) mail)))
> ((reply-to-line? line) (setf mail (cons (cons ':REPLY-TO-ID line) mail)))
> ((empty-line? line) (setf seek-for-from t))))))maillist)))

Maybe (push (cons ...) mail) or (setf mail (acons ... mail))?
From: John Wiseman
Subject: Re: Code review: CL for SNA (Social Network Analysis)
Date: 
Message-ID: <m2vexihr6r.fsf@Johns-Computer.local>
First, of course, profile the code.

Emre Sevinc <·····@bilgi.edu.tr> writes:

> (defun character-stream-to-string (input-stream)
>   "Convert stream to string. For using with trivial-http responses."
>   (let ((strlist nil))
> 	(do ((line (read-line input-stream nil nil)
> 			   (read-line input-stream nil nil)))
> 		((null line) (apply #'concatenate 'string (reverse strlist)))
> 	  (push line strlist))))

This is probably faster, though it may not make much difference:

(defun stream-contents (stream)
  "Returns a string with the entire contents of the specified stream."
  (with-output-to-string (contents)
    (let* ((buffer-size 4096)
           (buffer (make-string buffer-size)))
      (labels ((read-chunks ()
                 (let ((size (read-sequence buffer stream)))
                   (if (< size buffer-size)
                       (princ (subseq buffer 0 size) contents)
                       (progn
                         (princ buffer contents)
                         (read-chunks))))))
        (read-chunks)))))


> (defun get-gzipped-mails-list (index-content url-prefix)
>   "Extract mothly mbox archive links from html page"
>   (let ((result nil))
> 	(cl-ppcre:do-register-groups (first)
> 		("<A href=\"([^\"]*\.txt)" index-content)
> 	  (push first result))
> 	(let ((res2 nil))
> 	  (loop for i in result do
> 			(if (cl-ppcre:scan "^http:" i)
> 				(push i res2)
> 				(push (concatenate 'string url-prefix i) res2))) res2)))

You should make sure that the regular expressions you pass to CL-PPCRE
are being precompiled, because compiling them at runtime can be very,
very slow.

CL-PPCRE has compiler macros that are supposed to precompile constant
regexp strings, but they didn't work for me for some reason.  In any
case, doing something like this sped things up greatly:

  (defparameter *txt-re* (cl-ppcre:create-scanner "<A href=\"([^\"]*\.txt)"))

  (cl-ppcre:scan *txt-re* ...)


> (defun prepare-full-raw-archive (url-list file-path)
>   "Gets all the mboxes and concatenates them into one mbox on local file path"
>   (with-open-file (output-stream file-path :direction :OUTPUT)
> 	(loop for rfile in url-list do
> 		  (let ((content-stream (caddr (thttp:http-get rfile))))
> 			(do ((line (read-line content-stream nil nil)
> 					   (read-line content-stream nil nil)))
> 				((null line) t)
> 			  (format output-stream "~A~%" line))))))

Using read-sequence will help here, too.


> (defun date-line? (line)
>   "Check for DATE line"
>   (and (> (length line) 6) (equal (subseq line 0 6) "Date: ")))
>
> (defun subject-line? (line)
>   "Check for SUBJECT line"
>   (and (>= (length line) 9) (equal (subseq line 0 9) "Subject: ")))

All this subseq'ing can generate a lot of garbage.  And equal is not
the fastest way to compare strings.  This should avoid garbage and be
a faster comparison:

  (defun date-line? (line)
    "Check for DATE line"
    (and (> (length line) 6) (string= line "Date: " :start1 0 :end1 6)))


> ((date-line? line) (setf mail (cons (cons ':DATE line) mail)))
> ((subject-line? line) (setf mail (cons (cons ':SUBJECT line) mail)))
> ((msgid-line? line) (setf mail (cons (cons ':MESSAGE-ID line) mail)))
> ((reply-to-line? line) (setf mail (cons (cons ':REPLY-TO-ID line) mail)))
> ((empty-line? line) (setf seek-for-from t))))))maillist)))

Style-wise, anything that looks like (setf foo (cons something foo))
should probably be (push something foo).


John
From: Timofei Shatrov
Subject: Re: Code review: CL for SNA (Social Network Analysis)
Date: 
Message-ID: <43aa7254.2245978@news.readfreenews.net>
On Thu, 22 Dec 2005 01:12:21 GMT, John Wiseman
<·······@Johns-Computer.local> tried to confuse everyone with this
message:

>
>> (defun get-gzipped-mails-list (index-content url-prefix)
>>   "Extract mothly mbox archive links from html page"
>>   (let ((result nil))
>> 	(cl-ppcre:do-register-groups (first)
>> 		("<A href=\"([^\"]*\.txt)" index-content)
>> 	  (push first result))
>> 	(let ((res2 nil))
>> 	  (loop for i in result do
>> 			(if (cl-ppcre:scan "^http:" i)
>> 				(push i res2)
>> 				(push (concatenate 'string url-prefix i) res2))) res2)))

I think using more advanced ppcre function, like all-matches-as-strings
would be better here. Not only it makes the code cleaner, it also
ensures that the closure that searches for regexs is generated only
once.

-- 
|a\o/r|,-------------.,---------- Timofei Shatrov aka Grue ------------.
| m"a ||FC AMKAR PERM|| mail: grue at mail.ru  http://grue3.tripod.com |
|  k  ||  PWNZ J00   || Kingdom of Loathing: Grue3 lvl 18 Seal Clubber |
`-----'`-------------'`-------------------------------------------[4*72]
From: Edi Weitz
Subject: Re: Code review: CL for SNA (Social Network Analysis)
Date: 
Message-ID: <u4q51tnyn.fsf@agharta.de>
On Thu, 22 Dec 2005 01:12:21 GMT, John Wiseman <·······@Johns-Computer.local> wrote:

> You should make sure that the regular expressions you pass to
> CL-PPCRE are being precompiled, because compiling them at runtime
> can be very, very slow.
>
> CL-PPCRE has compiler macros that are supposed to precompile
> constant regexp strings, but they didn't work for me for some
> reason.

Do you have an example?  Which Lisp, which CL-PPCRE version?  I
thought I had all cases covered so I'd like to investigate this.

Of course, if you have something that's supposed to work and it
doesn't then it's usually a good idea to report this as a bug to the
maintainer(s) instead of posting workarounds in a public forum...

Cheers,
Edi.

-- 

Lisp is not dead, it just smells funny.

Real email: (replace (subseq ·········@agharta.de" 5) "edi")