"Read stuff from a file and chop it up to do stuff" code advice wanted.

From: landspeedrecord
Subject: "Read stuff from a file and chop it up to do stuff" code advice wanted.
Date: Thu, 25 Oct 2007 00:57:44 +0000
Message-ID: <1193273864.563512.123540@v23g2000prn.googlegroups.com>

I am writing code that reinvents the "read stuff from a file and chop
it up to do stuff" code wheel (of pain conan).  If you know what I
mean...  Why am I doing this?

a) to learn lisp and also how to program...
b) because I am too stupid (or it is too hard) to find code that does
the insanely simple crap I want to do as a newbie.  I tried with
google... really.  Plus I don't know how to use packages yet.

Anyway... I am seeking some advice... any advice about how to write
code better or with better style.  Especially advice on how to write
more beautiful and/or simple code.  Am I over documenting?  Under
using useful functions that I am too stupid/ignorant to use?  I assume
that there is a library that already does what my code is doing... but
I couldn't find it so I wrote my own. If so let me know what the
library is called.  P.S.  I am on a PC so path names are of the form
C:\blah.text.

If anyone finds the code useful, please let me know!

HERE IS MY CODE (that was so painful to create!  aaaarrrgghhh!)

;; NOT-TEXT?
;; INPUT: a character
;; OUTPUT: T or NIL
;;
;; This function returns TRUE if any character
;; passed to it is below #\!... which is where
;; all the ascii (unicode as well?) control characters
;; are.  Also below #\! is #\space etc...

(defun not-text? (char)
      (if (char< char #\!)
	  T
	  NIL))



;; ALPHA-CHAR?
;; INPUT: a character
;; OUTPUT: T or NIL
;;
;; Only works with ASCII??? (I am NOT SURE.)
;; Returns TRUE if any character passed to it
;; is alphabetic.  Problematically, Unicode
;; characters (I think it is Unicode at least! - not sure)
;; 91. #\[ 92. #\\ 93. #\] 94. #\^ 95. #\_ 96. #\`
;; are between 65. #\A and 122. #\z. and so they
;; would ALSO make the function return TRUE.
;; I don't know how to make Lisp change between different
;; character code sets so I can't fix/figure out how
;; to solve this issue.

(defun alpha-char? (char)
      (if (AND (char-not-lessp char #\A)(char-not-greaterp char #\z))
	  T
	  NIL))



;; GET-TEXTCHUNK-FROM-STREAM
;; INPUT:  a stream and an array to hold characters in temp memory.
;; OUTPUT: a string or NIL if at the absolute end of file and
;;         the temp memory array is empty.
;;
;; This function grabs characters from a stream until it has
;; a chunk of text surrounded by white space
;; or linefeeds or carriage returns and returns the
;; resulting string.  It works via recursion... IS THIS
;; INEFFICIENT or slower than a loop??? Don't know.
;;
;; Some notes about the steps of the "COND" part of the function:
;; 1 If new-char is nil - then end of file/stream! but if the
;;   array contains characters then return the array so the final
;;   characters in the stream aren't lost!
;; 2 new-char = NIL & the array has no characters.
;;   end of file/stream!  Return NIL.
;; 3 Any character below "!" is a control character like
;;   #\newline or #\space etc... Text-chunk complete!
;;   Throw out the control character and return the chunk.
;; 4 Still reading in legitimate characters.  Push new-char
;;   onto the array and then keep going
;;   via recursion (passing in the updated char-array...)

(defun get-textchunk-from-stream (stream char-array)
      ;;the below NILs are important! for endoffile error avoidance.
      (let ((new-char (read-char stream nil)))
              ;1
        (cond ((AND (eq nil new-char) (> (length char-array) 0))
               char-array)
              ;2
              ((eq nil new-char) nil)
              ;3
	      ((not-text? new-char) char-array)
              ;4
	      (T (progn (vector-push-extend new-char char-array)
		        (get-textchunk-from-stream stream char-array))))))



;; GET-TEXTCHUNK
;; INPUT: a stream
;; OUTPUT: a string
;;
;; This function is just a helper function that sets up
;; GET-TEXTCHUNK-FROM-STREAM to begin its recursive process
;; properly. I could probably have done without it but
;; I couldn't figure out how to make GET-TEXTCHUNK-FROM-STREAM
;; self contained.

(defun get-textchunk (stream)
      (let ((char-catcher (make-array 0 :element-type 'character
                                        :fill-pointer 0
                                        :adjustable t)))
	(get-textchunk-from-stream stream char-catcher)))



;; GET-ALL-TEXTCHUNKS
;; INPUT:  a stream.
;; OUTPUT: a list of strings.
;;
;; This function loops GET-TEXTCHUNK over and over again
;; until the stream ends and GET-TEXTCHUNK returns NIL finally,
;; whereupon GET-ALL-TEXTCHUNKS
;; returns a list of strings.

(defun get-all-textchunks (stream)
      (loop for word = (get-textchunk stream)
	  while word collect word))



;; SLURP-STREAM5
;; INPUT: a stream
;; OUTPUT: a very long string?
;;
;; Holy Crap this grabs all the text from a stream so fast!
;; I got this off the web at:
;; http://www.emmett.ca/~sabetts/slurp.html
;; My older code was grabbing one word from the file stream at a time.
;; It is MUCH faster to use this code to read in all the text at once
;; and then run my code on the text string that this code creates.

(defun slurp-stream5 (stream)
  (let ((seq (make-array (file-length stream)
               :element-type 'character
               :fill-pointer t)))
    (setf (fill-pointer seq) (read-sequence seq stream))
    seq))



;; SUPER-TEXT-SLURP
;; INPUT: a file path
;; OUTPUT: the output of slurp-stream5, i.e. a single long string.
;;
;; The function uses slurp-stream5 to read a stream and
;; return all the text as a single long string.  SUPER-TEXT-SLURP is
;; just some time saving code that saves me from having
;; to open and close streams just to read in text.
;; It also has the advantage of using WITH-OPEN-FILE
;; so the closing of the file is done automatically.

(defun super-text-slurp (file-location)
  (with-open-file (temp-var file-location
			:direction :input
			:if-does-not-exist :error)
	(slurp-stream5 temp-var)))



;; *TEMP-TEXT-HOLDER*
;;
;; This is just a global variable to temporarily hold
;; all the text being read in via SUPER-TEXT-SLURP
;; This also serves as an example call to SUPER-TEXT-SLURP.
;; "Test.txt" is just a giant ASCII text file
;; that I got off of projectgutenburg.org.
;;
;; NOTE:
;; !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
;; Put in the path to any text file on your hard drive
;; to get the other code that follows this function to work.
;; I am using a windows file path here... watch out linux people!

(defparameter *temp-text-holder* (super-text-slurp "C:\\test.txt"))



;; *TEXT-ARRAY*
;;
;; This is an empty adjustable character array.
;; SHOULD I USE DEFVAR INSTEAD of DEFPARAMETER???
;; What is the difference between them anyway???

(defparameter *text-array* (make-array 0 :element-type 'character
                                        :fill-pointer 0
                                        :adjustable t))


;; This next bit is just some necessary code to make it all run...
;; I suppose I should embed it in a master function called
;; main... whateva.
;;
;; *TEXT-ARRAY* is an empty adjustable character array.
;; *temp-text-holder* is a gigantic book length string.
;; "streamz" is a stream created by with-input-from-string.
;; This is necessary because GET-ALL-TEXTCHUNKS only
;; accepts a stream.
;; GET-ALL-TEXTCHUNKS chops it all up into strings.
;; Each string is one word.
;; The contents of *text-array* is a list of strings.

(setq *text-array*
	  (with-input-from-string (streamz *temp-text-holder*)
	    (get-all-textchunks streamz)))


;; FIRST-ALPHACHAR-POS
;; INPUT: a string
;; OUTPUT: a non-negative integer
;;
;; This function returns the position
;; of the first alphabetic character in a string;
;; i.e. an "a" through a "z" irrespective
;; of case.  Zero is position 1, 1 is position
;; 2 etc... I hate how Lisp does that but
;; whateva.  Don't fight the system mang.
;; That is "\\\zing" would return 3 NOT 4.

(defun first-alphachar-pos (string)
      (loop for i from 0 to (1- (length string))
	    do (if (alpha-char? (elt string i))
		   (return i))
	    finally (return i)))


;; LAST-ALPHACHAR-POS
;; INPUT: a string
;; OUTPUT: a non-negative integer or -1.
;;
;; This function returns the position
;; of the last alphabetic character in a string;
;; Zero is position 1, 1 is position
;; 2 etc... That is "shlorp&^%" would return 5 NOT 6.

(defun last-alphachar-pos (string)
      (loop for i from (1- (length string)) downto 0
	    do (if (alpha-char? (elt string i))
		   (return i))
	    finally (return i)))



;; STRING-LIST-NONCHAR-FIXER (this is way too big of a function!)
;; INPUT:  Nothing + a global variable included
;;         as part of the function's internal code.  This is
;;         because I could never figure out how to
;;         properly pass a global variable in Lisp.
;; OUTPUT: A list of properly parsed word and non-word
;;         strings that retains the proper order from the
;;         initial text (the strings stored in the global variable
;;         must have been stored as a single list of strings
;;         or this function will not work!)
;;
;; This function grabs a string from a global variable's list of
;; strings (the GV is called "*text-array*" in the code right now but
;; this can change later if need be) and then cuts up the string into
;; non-alphabetic and alphabetic string chunks.  It then pushes the
;; chunk(s) in the proper order onto a new list and grabs the
;; next string.  When all the strings are processed the function
;; returns the new list in toto.
;;
;; The purpose of this function is to clean
;; up junk AND PUNCTUATION that gets included when "words"
;; are grabbed from a stream based on being surrounded by
;; spaces.  For instance "---bob!" would get chunked up
;; into "---", "bob", and "!".  This allows for easy removal
;; of non-words and makes handling punctuation easier in the
;; future.  Of course all this assumes that the best way to deal
;; with a large corpus of text is to separate it into a list of
;; strings... maybe a hash table or an array is more efficient.
;; This function will not fix several problems:
;; 1) Words with non-alpha typos embedded in them or two words
;;    stuck together with a non-alphabetic character with no space.
;; 2) Hacker speak type stuff like "7am3r" or "$hit".
;; 3) It doesn't separate numbers from junk!  I bet I have to modify
;;    it later to fix this!

(defun string-list-nonchar-fixer ()
  (let ((newlist nil)
        (listlen (1- (length *text-array*))))
    (loop for i from listlen downto 0
          for j = (nth i *text-array*)
                    ;position of first alpha character.
          do (let* ((firstcut (first-alphachar-pos j))
                    ;position of first ending-group non-alpha character.
                    (lastcut (1+ (last-alphachar-pos j)))
                    ;total length of the string.
                    (stringlen (length j))
                    ;only alphabetic characters in string?
                    (all-alpha (if (and (= firstcut 0) (= lastcut stringlen))
                                   T NIL)))
               ;This cond form pushes all the bits onto "newlist".
               ;1 - only alphabetic characters in string.
               (cond (all-alpha (push j newlist))
                     ;2 - non-alphabetic string - push it along...
                     ;    perhaps (= lastcut 0) would be more efficient?
                     ((> firstcut lastcut) (push j newlist))
                     ;3 - junk before but not after.
                     ((= lastcut stringlen)
                      (progn (push (subseq j firstcut stringlen) newlist)
                             (push (subseq j 0 firstcut) newlist)))
                     ;4 - junk after but not before.
                     ((= firstcut 0)
                      (progn (push (subseq j lastcut stringlen) newlist)
                             (push (subseq j 0 lastcut) newlist)))
                     ;5 - junk at start and end.
                     (T (progn (push (subseq j lastcut stringlen) newlist)
                               (push (subseq j firstcut lastcut) newlist)
                               (push (subseq j 0 firstcut) newlist))))))
    newlist))

;; SAVE-TO-FILE
;; INPUT:  a variable to write to file and a filepath with a filename
;;         included.
;; OUTPUT: a text file on your computer's hard drive!  YAY!
;;
;; This is a handy function for saving text into a file.
;; I don't know if it works with multi-byte characters or not.

(defun SAVE-TO-FILE (variable filename)
     (with-open-file (out filename
                      :direction :output
                      :if-exists :supersede)
     (with-standard-io-syntax
        (print variable out))))


(string-list-nonchar-fixer)

From: Thomas A. Russ
Subject: Re: "Read stuff from a file and chop it up to do stuff" code advice wanted.
Date: Thu, 25 Oct 2007 19:46:57 +0000
Message-ID: <ymisl3z9du6.fsf@blackcat.isi.edu>

landspeedrecord <···············@gmail.com> writes:

.> I am writing code that reinvents the "read stuff from a file and chop
.> it up to do stuff" code wheel (of pain conan).  If you know what I
.> mean...  Why am I doing this?
.> 
.> a) to learn lisp and also how to program...

Well, in general this looks pretty good.  There are a number of built-in
functions that would make things a bit simpler, though.

.> b) because I am too stupid (or it is too hard) to find code that does
.> the insanely simple crap I want to do as a newbie.  I tried with
.> google... really.  Plus I don't know how to use packages yet.

Well, perhaps a Lisp textbook (Practical Common Lisp, available
on-line) would be a good start.

Then as a reference (it won't teach you how to program), you should
consider the Lisp HyperSpec:

   <http://www.lisp.org/HyperSpec/FrontMatter/index.html>

.> Am I over documenting?

No.  This is really hard to do.  The more documentation, the easier it
is for us to understand what you want to do, and also the easier it is
for you to remember this when you look at the code 6 months from now to
re-use it in another project.

.> Under using useful functions that I am too stupid/ignorant to use?

There are a large number of useful functions that could make your life
easier.  On the other hand, doing some of the simple functions can be a
useful learning experience.

.> If anyone finds the code useful, please let me know!
.> 
.> HERE IS MY CODE (that was so painful to create!  aaaarrrgghhh!)
.> 
.> ;; NOT-TEXT?
.> ;; INPUT: a character
.> ;; OUTPUT: T or NIL
.> ;;
.> ;; This function returns TRUE if any character
.> ;; passed to it is below #\!... which is where
.> ;; all the ascii (unicode as well?) control characters
.> ;; are.  Also below #\! is #\space etc...
.> 
.> (defun not-text? (char)
.>       (if (char< char #\!)
.> 	  T
.> 	  NIL))

OK, this one is tough, since it does rely on encoding issues.  If you
had a character set without control characters, it would likely fail.

I would also tend to want to use CHAR-CODE, since you don't want to get
into collation issues if you use a lisp that handles other character
sets like Unicode.  I'm not sure I could predict what CHAR< does in that
case.

And if you are limiting yourself to ASCII or supersets of it, then
CHAR-CODE won't hurt.
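
Two ways to write such a test are sketched below.  The names
NOT-TEXT-BY-CODE-P and WHITESPACE-CHAR-P are made up for illustration;
they are not standard functions.  The first assumes an ASCII-compatible
character encoding, the second avoids ordering altogether by listing
the separator characters explicitly.

```lisp
;; Variant 1: compare character codes instead of characters.
;; Assumes the implementation's codes are ASCII-compatible, so that
;; everything below #\! is a control character or the space.
(defun not-text-by-code-p (char)
  (< (char-code char) (char-code #\!)))

;; Variant 2: test membership in an explicit set of separators.
;; #\Space and #\Newline are standard character names; #\Tab, #\Return,
;; #\Linefeed and #\Page are semi-standard but widely supported.
(defun whitespace-char-p (char)
  (and (member char '(#\Space #\Newline #\Tab #\Return #\Linefeed #\Page)
               :test #'char=)
       t))
```

The second variant does not treat other control characters as
separators, which may or may not be what is wanted here.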


.> ;; ALPHA-CHAR?
.> ;; INPUT: a character
.> ;; OUTPUT: T or NIL
.> ;;
.> ;; Only works with ASCII??? (I am NOT SURE.)
.> ;; Returns TRUE if any character passed to it
.> ;; is alphabetic.  Problematically, Unicode
.> ;; characters (I think it is Unicode at least! - not sure)
.> ;; 91. #\[ 92. #\\ 93. #\] 94. #\^ 95. #\_ 96. #\`
.> ;; are between 65. #\A and 122. #\z. and so they
.> ;; would ALSO make the function return TRUE.
.> ;; I don't know how to make Lisp change between different
.> ;; character code sets so I can't fix/figure out how
.> ;; to solve this issue.
.> 
.> (defun alpha-char? (char)
.>       (if (AND (char-not-lessp char #\A)(char-not-greaterp char #\z))
.> 	  T
.> 	  NIL))

As you note, these will only work for ASCII, and the Common Lisp
standard doesn't mandate any particular character encoding.  This is
turning out to be nice now, since it allows extension to Unicode while
remaining within the standard.

You should use the existing functions:
  alpha-char-p
  digit-char-p

To figure out alphabetic and numeric characters.  Note in particular
that digit-char-p depends on the radix:

   (digit-char-p #\8 2)  => NIL
   (digit-char-p #\z 36) => 35


.> ;; GET-TEXTCHUNK-FROM-STREAM
.> ;; INPUT:  a stream and an array to hold characters in temp memory.
.> ;; OUTPUT: a string or NIL if at the absolute end of file and
.> ;;         the temp memory array is empty.
.> ;;
.> ;; This function grabs characters from a stream until it has
.> ;; a chunk of text surrounded by white space
.> ;; or linefeeds or carriage returns and returns the
.> ;; resulting string.  It works via recursion... IS THIS
.> ;; INEFFICIENT or slower than a loop??? Don't know.

Possibly.  Because of various overhead items, doing char-by-char input
is relatively slow in Lisp.  For faster input you need to read larger
chunks in at a time, for example, by using READ-SEQUENCE.  But this
won't do the character-by-character parsing that you do below.

.> ;; Some notes about the steps of the "COND" part of the function:
.> ;; 1 If new-char is nil - then end of file/stream! but if the
.> ;;   array contains characters then return the array so the final
.> ;;   characters in the stream aren't lost!
.> ;; 2 new-char = NIL & the array has no characters.
.> ;;   end of file/stream!  Return NIL.
.> ;; 3 Any character below "!" is a control character like
.> ;;   #\newline or #\space etc... Text-chunk complete!
.> ;;   Throw out the control character and return the chunk.
.> ;; 4 Still reading in legitimate characters.  Push new-char
.> ;;   onto the array and then keep going
.> ;;   via recursion (passing in the updated char-array...)
.> 
.> (defun get-textchunk-from-stream (stream char-array)
.>       ;;the below NILs are important! for endoffile error avoidance.
.>       (let ((new-char (read-char stream nil)))
.>               ;1
.>         (cond ((AND (eq nil new-char) (> (length char-array) 0))
.>                char-array)
.>               ;2
.>               ((eq nil new-char) nil)
.>               ;3
.> 	      ((not-text? new-char) char-array)
.>               ;4
.> 	      (T (progn (vector-push-extend new-char char-array)
.> 		        (get-textchunk-from-stream stream char-array))))))

NULL is the standard test for NIL, instead of the EQ ... NIL form.  I
also generally don't like to repeat tests, so I would test for
end-of-file once and put the check on the array length in an inner
conditional.

Also, there is an implicit PROGN in the COND clauses (one of the ways it
differs from IF), so you don't need to have an explicit one.

You might also want to consider whether or not you want to consume
multiple white-space characters in your clause 3.
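
Put together, those suggestions might look like the sketch below.  It
is not a drop-in replacement: READ-TEXTCHUNK and SEPARATOR-P are
hypothetical names, the recursion is swapped for a LOOP, and runs of
separators are skipped, so (unlike the original) it never returns an
empty string.

```lisp
;; Same test as the original NOT-TEXT?; assumes an ASCII-compatible
;; ordering of characters.
(defun separator-p (char)
  (char< char #\!))

(defun read-textchunk (stream)
  "Return the next separator-delimited chunk from STREAM, or NIL at EOF."
  (let ((chunk (make-array 0 :element-type 'character
                             :fill-pointer 0
                             :adjustable t)))
    (loop for char = (read-char stream nil)
          do (cond ((null char)
                    ;; End of file is tested once; the inner IF decides
                    ;; whether there is a final chunk to hand back.
                    (return (if (plusp (length chunk)) chunk nil)))
                   ((separator-p char)
                    ;; Leading separators are skipped; the first one
                    ;; after some text ends the chunk.
                    (when (plusp (length chunk))
                      (return chunk)))
                   (t
                    (vector-push-extend char chunk))))))
```

Because a fresh array is made on every call, the separate helper that
only sets up the array is no longer needed in this version.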

.> ;; GET-TEXTCHUNK
.> ;; INPUT: a stream
.> ;; OUTPUT: a string
.> ;;
.> ;; This function is just a helper function that sets up
.> ;; GET-TEXTCHUNK-FROM-STREAM to begin its recursive process
.> ;; properly. I could probably have done without it but
.> ;; I couldn't figure out how to make GET-TEXTCHUNK-FROM-STREAM
.> ;; self contained.
.> 
.> (defun get-textchunk (stream)
.>       (let ((char-catcher (make-array 0 :element-type 'character
.>                                         :fill-pointer 0
.>                                         :adjustable t)))
.> 	(get-textchunk-from-stream stream char-catcher)))

Well, generally Lisp style uses more nested function calls and the
like.  So, unless CHAR-CATCHER were used in more than one place, I would
be tempted to just make the array as part of the call to
get-textchunk-from-stream.

Thinking a bit harder on this, since the array construction is a bit
involved with all the keyword arguments, I would be tempted to isolate
that into my own function:

  (defun make-char-catcher () 
    (make-array 0 :element-type 'character
                  :fill-pointer 0
                  :adjustable t))

which would then make calling it as an argument to the main function
even simpler.  One could then dispense with the get-textchunk function
entirely.
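
As a rough sketch of how that reads, with the caller building the
array directly.  The reader function here is a condensed stand-in for
the original GET-TEXTCHUNK-FROM-STREAM, included only so the example
runs on its own.

```lisp
(defun make-char-catcher ()
  (make-array 0 :element-type 'character
                :fill-pointer 0
                :adjustable t))

;; Condensed stand-in for the original function (same four cases).
(defun get-textchunk-from-stream (stream char-array)
  (let ((new-char (read-char stream nil)))
    (cond ((null new-char)
           (if (plusp (length char-array)) char-array nil))
          ((char< new-char #\!) char-array)
          (t (vector-push-extend new-char char-array)
             (get-textchunk-from-stream stream char-array)))))

;; GET-TEXTCHUNK is gone; the array is made right in the call.
(defun get-all-textchunks (stream)
  (loop for word = (get-textchunk-from-stream stream (make-char-catcher))
        while word collect word))
```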

.> ;; GET-ALL-TEXTCHUNKS
.> ;; INPUT:  a stream.
.> ;; OUTPUT: a list of strings.
.> ;;
.> ;; This function loops GET-TEXTCHUNK over and over again
.> ;; until the stream ends and GET-TEXTCHUNK returns NIL finally,
.> ;; whereupon GET-ALL-TEXTCHUNKS
.> ;; returns a list of strings.
.> 
.> (defun get-all-textchunks (stream)
.>       (loop for word = (get-textchunk stream)
.> 	  while word collect word))
.> 
.> 
.> 
.> ;; SLURP-STREAM5
.> ;; INPUT: a stream
.> ;; OUTPUT: a very long string?
.> ;;
.> ;; Holy Crap this grabs all the text from a stream so fast!
.> ;; I got this off the web at:
.> ;; http://www.emmett.ca/~sabetts/slurp.html
.> ;; My older code was grabbing one word from the file stream at a time.
.> ;; It is MUCH faster to use this code to read in all the text at once
.> ;; and then run my code on the text string that this code creates.
.> 
.> (defun slurp-stream5 (stream)
.>   (let ((seq (make-array (file-length stream)
.>                :element-type 'character
.>                :fill-pointer t)))
.>     (setf (fill-pointer seq) (read-sequence seq stream))
.>     seq))

Yes.  That is the difference between READ-CHAR and READ-SEQUENCE.

.> ;; SUPER-TEXT-SLURP
.> ;; INPUT: a file path
.> ;; OUTPUT: the output of slurp-stream5, i.e. a single long string.
.> ;;
.> ;; The function uses slurp-stream5 to read a stream and
.> ;; return all the text as a single long string.  SUPER-TEXT-SLURP is
.> ;; just some time saving code that saves me from having
.> ;; to open and close streams just to read in text.
.> ;; It also has the advantage of using WITH-OPEN-FILE
.> ;; so the closing of the file is done automatically.
.> 
.> (defun super-text-slurp (file-location)
.>   (with-open-file (temp-var file-location
.> 			:direction :input
.> 			:if-does-not-exist :error)
.> 	(slurp-stream5 temp-var)))
.> 
.> 
.> 
.> ;; *TEMP-TEXT-HOLDER*
.> ;;
.> ;; This is just a global variable to temporarily hold
.> ;; all the text being read in via SUPER-TEXT-SLURP
.> ;; This also serves as an example call to SUPER-TEXT-SLURP.
.> ;; "Test.txt" is just a giant ASCII text file
.> ;; that I got off of projectgutenburg.org.
.> ;;
.> ;; NOTE:
.> ;; !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
.> ;; Put in the path to any text file on your hard drive
.> ;; to get the other code that follows this function to work.
.> ;; I am using a windows file path here... watch out linux people!
.> 
.> (defparameter *temp-text-holder* (super-text-slurp "C:\\test.txt"))
.> 
.> 
.> 
.> ;; *TEXT-ARRAY*
.> ;;
.> ;; This is an empty adjustable character array.
.> ;; SHOULD I USE DEFVAR INSTEAD of DEFPARAMETER???

Depends.  In this case probably.

.> ;; What is the difference between them anyway???

The difference only manifests itself when you reload a file that
contains the form.  DEFVAR will only initialize an UNBOUND variable,
whereas DEFPARAMETER will use the initial value all the time.  This
makes a difference if you are keeping state that you need to preserve
across new code loads.  In that case, DEFVAR is necessary.  The general
practice is to use DEFVAR unless you actually want to make sure the
value is reset each time you reload the file.
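
The difference can be seen by evaluating the defining forms twice,
which is what reloading a file does.  The variable names below are
invented for the example.

```lisp
(defvar *chunk-count* 0)          ; unbound so far, so it is set to 0
(defparameter *chunk-limit* 10)   ; set to 10

;; Some state accumulates while the program runs.
(setf *chunk-count* 5
      *chunk-limit* 99)

;; "Reloading the file": the same forms are evaluated again.
(defvar *chunk-count* 0)          ; already bound, so the 5 is kept
(defparameter *chunk-limit* 10)   ; always re-initialized to 10

(list *chunk-count* *chunk-limit*) ; => (5 10)
```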

.> (defparameter *text-array* (make-array 0 :element-type 'character
.>                                         :fill-pointer 0
.>                                         :adjustable t))
.> 
.> 
.> ;; This next bit is just some necessary code to make it all run...
.> ;; I suppose I should embed it in a master function called
.> ;; main... whateva.
.> ;;
.> ;; *TEXT-ARRAY* is an empty adjustable character array.
.> ;; *temp-text-holder* is a gigantic book length string.
.> ;; "streamz" is a stream created by with-input-from-string.
.> ;; This is necessary because GET-ALL-TEXTCHUNKS only
.> ;; accepts a stream.
.> ;; GET-ALL-TEXTCHUNKS chops it all up into strings.
.> ;; Each string is one word.
.> ;; The contents of *text-array* is a list of strings.
.> 
.> (setq *text-array*
.> 	  (with-input-from-string (streamz *temp-text-holder*)
.> 	    (get-all-textchunks streamz)))

At this point, you might want to consider re-coding get-all-textchunks
to work directly on the buffer, instead of introducing a new stream
here.  The stream version does give you more flexibility, though, so
it's really a design call rather than something that can be absolutely
determined.
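
A buffer-based version could be sketched roughly as follows, using
POSITION-IF and POSITION-IF-NOT to find chunk boundaries in the string.
SPLIT-ON-SEPARATORS and SEPARATOR-P are hypothetical names, and the
separator test is the same ASCII-dependent one as in NOT-TEXT?.

```lisp
(defun separator-p (char)
  (char< char #\!))

(defun split-on-separators (string)
  "Return a list of the runs of non-separator characters in STRING."
  (let ((chunks '())
        ;; START is the index of the next chunk, or NIL when none is left.
        (start (position-if-not #'separator-p string)))
    (loop while start
          do (let ((stop (or (position-if #'separator-p string :start start)
                             (length string))))
               (push (subseq string start stop) chunks)
               (setf start
                     (position-if-not #'separator-p string :start stop))))
    (nreverse chunks)))

(split-on-separators " foo  bar ")   ; => ("foo" "bar")
```

Note that this skips runs of separators, so no empty strings appear in
the result.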

.> ;; FIRST-ALPHACHAR-POS
.> ;; INPUT: a string
.> ;; OUTPUT: a non-negative integer
.> ;;
.> ;; This function returns the position
.> ;; of the first alphabetic character in a string;
.> ;; i.e. an "a" through a "z" irrespective
.> ;; of case.  Zero is position 1, 1 is position
.> ;; 2 etc... I hate how Lisp does that but
.> ;; whateva.  Don't fight the system mang.
.> ;; That is "\\\zing" would return 3 NOT 4.

Well, almost all programming languages work the same way, with 0-based
indexing of vectors and arrays.  It is at least consistent, so that you
can use the returned value in ELT or CHAR to access the characters.

.> (defun first-alphachar-pos (string)
.>       (loop for i from 0 to (1- (length string))
.> 	    do (if (alpha-char? (elt string i))
.> 		   (return i))
.> 	    finally (return i)))

The easiest solution is to use the built-in function:

 (position-if #'alpha-char-p string)

Note that this returns NIL, rather than an index, if no alphabetic
characters are found.

There are also some minor things that can make the loop a little
simpler:

    (loop for i from 0 below (length string)
          when (alpha-char? (elt string i))
          return i
          finally (return i))

.> ;; LAST-ALPHACHAR-POS
.> ;; INPUT: a string
.> ;; OUTPUT: a non-negative integer or -1.
.> ;;
.> ;; This function returns the position
.> ;; of the last alphabetic character in a string;
.> ;; Zero is position 1, 1 is position
.> ;; 2 etc... That is "shlorp&^%" would return 5 NOT 6.
.> 
.> (defun last-alphachar-pos (string)
.>       (loop for i from (1- (length string)) downto 0
.> 	    do (if (alpha-char? (elt string i))
.> 		   (return i))
.> 	    finally (return i)))

  (position-if #'alpha-char-p string :from-end t)
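
For instance, wrapped in two small functions (the names FIRST-ALPHA-POS
and LAST-ALPHA-POS are only for illustration):

```lisp
(defun first-alpha-pos (string)
  (position-if #'alpha-char-p string))

(defun last-alpha-pos (string)
  (position-if #'alpha-char-p string :from-end t))

(first-alpha-pos "---bob!")   ; => 3
(last-alpha-pos "---bob!")    ; => 5
(first-alpha-pos "---")       ; => NIL, no alphabetic character at all
```

Callers have to be prepared for the NIL case before doing arithmetic
such as 1+ on the result.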

.> ;; STRING-LIST-NONCHAR-FIXER (this is way too big of a function!)
.> ;; INPUT:  Nothing + a global variable included
.> ;;         as part of the function's internal code.  This is
.> ;;         because I could never figure out how to
.> ;;         properly pass a global variable in Lisp.
.> ;; OUTPUT: A list of properly parsed word and non-word
.> ;;         strings that retains the proper order from the
.> ;;         initial text (the strings stored in the global variable
.> ;;         must have been stored as a single list of strings
.> ;;         or this function will not work!)
.> ;;
.> ;; This function grabs a string from a global variable's list of
.> ;; strings (the GV is called "*text-array*" in the code right now but
.> ;; this can change later if need be) and then cuts up the string into
.> ;; non-alphabetic and alphabetic string chunks.  It then pushes the
.> ;; chunk(s) in the proper order onto a new list and grabs the
.> ;; next string.  When all the strings are processed the function
.> ;; returns the new list in toto.
.> ;;
.> ;; The purpose of this function is to clean
.> ;; up junk AND PUNCTUATION that gets included when "words"
.> ;; are grabbed from a stream based on being surrounded by
.> ;; spaces.  For instance "---bob!" would get chunked up
.> ;; into "---", "bob", and "!".  This allows for easy removal
.> ;; of non-words and makes handling punctuation easier in the
.> ;; future.  Of course all this assumes that the best way to deal
.> ;; with a large corpus of text is to separate it into a list of
.> ;; strings... maybe a hash table or an array is more efficient.
.> ;; This function will not fix several problems:
.> ;; 1) Words with non-alpha typos embedded in them or two words
.> ;;    stuck together with a non-alphabetic character with no space.
.> ;; 2) Hacker speak type stuff like "7am3r" or "$hit".
.> ;; 3) It doesn't separate numbers from junk!  I bet I have to modify
.> ;;    it later to fix this!
.> 
.> (defun string-list-nonchar-fixer ()
.>   (let ((newlist nil)
.>         (listlen (1- (length *text-array*))))
.>     (loop for i from listlen downto 0
.>           for j = (nth i *text-array*)
.>                     ;position of first alpha character.
.>           do (let* ((firstcut (first-alphachar-pos j))
.>                     ;position of first ending-group non-alpha character.
.>                     (lastcut (1+ (last-alphachar-pos j)))
.>                     ;total length of the string.
.>                     (stringlen (length j))
.>                     ;only alphabetic characters in string?
.>                     (all-alpha (if (and (= firstcut 0) (= lastcut stringlen))
.>                                    T NIL)))
.>                ;This cond form pushes all the bits onto "newlist".
.>                ;1 - only alphabetic characters in string.
.>                (cond (all-alpha (push j newlist))
.>                      ;2 - non-alphabetic string - push it along...
.>                      ;    perhaps (= lastcut 0) would be more efficient?
.>                      ((> firstcut lastcut) (push j newlist))
.>                      ;3 - junk before but not after.
.>                      ((= lastcut stringlen)
.>                       (progn (push (subseq j firstcut stringlen) newlist)
.>                              (push (subseq j 0 firstcut) newlist)))
.>                      ;4 - junk after but not before.
.>                      ((= firstcut 0)
.>                       (progn (push (subseq j lastcut stringlen) newlist)
.>                              (push (subseq j 0 lastcut) newlist)))
.>                      ;5 - junk at start and end.
.>                      (T (progn (push (subseq j lastcut stringlen) newlist)
.>                                (push (subseq j firstcut lastcut) newlist)
.>                                (push (subseq j 0 firstcut) newlist))))))
.>     newlist))

Well, first of all, NTH is very inefficient for lists.  Lists are
sequentially accessible datastructures, not random-access ones, so this
requires a lot of traversals of the list.  An improvement would be to
use the standard list traversal constructs in Loop.

   (loop for item in list ...)

Since you want to do this backwards, you should reverse the list first.
This will be much more efficient than repeatedly traversing the list.
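
A small sketch of that pattern, with a made-up function name: walking
the reversed list once and PUSHing each item leaves the result in the
original order, without any calls to NTH.

```lisp
(defun push-all-in-order (items)
  "Return a fresh list with the elements of ITEMS in the same order."
  (let ((result '()))
    ;; REVERSE walks the list once; the LOOP walks it once more.
    ;; Indexing with NTH instead would restart from the head each time.
    (loop for item in (reverse items)
          do (push item result))
    result))

(push-all-in-order '("a" "b" "c"))   ; => ("a" "b" "c")
```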

Also the construct (if (...) T NIL) is redundant.  Just use the (...)
test itself.
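
Side by side, using the ALL-ALPHA test from the posted code (the
function names are only for illustration):

```lisp
;; Redundant: the IF only turns true into T and false into NIL.
(defun all-alpha-verbose-p (firstcut lastcut stringlen)
  (if (and (= firstcut 0) (= lastcut stringlen))
      t
      nil))

;; Equivalent: AND over = comparisons already yields T or NIL.
(defun all-alpha-p (firstcut lastcut stringlen)
  (and (= firstcut 0) (= lastcut stringlen)))
```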

It would also perhaps be useful to have your secondary splitting code
isolated into its own function.  You may even be able to reuse that same
function to do the splitting based on whitespace.  You might then need
to implement a white-space-p or white-space? predicate.

The top-level loop would then simply become the following.  Note that I
replaced the global variable with a parameter.  You can just pass
*text-array* as a parameter when you call it.  Cleaner this way.
 
   (defun string-list-nonchar-fixer (string-list)
     (let ((newlist nil))
       (loop for chunk in (reverse string-list)
             do (setq newlist (split-chunk chunk newlist)))
       newlist))

with a separate function to separate the chunks that looks like

  ;; Splits the chunk string into alphabetic and non-alphabetic parts.
  ;; PUSH them onto accumulator and then return the augmented value.
  ;; We pass in the accumulator to allow for more efficient processing,
  ;; although we could instead have used multiple-value return and
  ;; handled the marshalling at the caller site.
  (defun split-chunk (chunk accumulator)
     ... insert your code here ...)

I would be tempted to generalize this to take a criterion function as
its input and then use it for splitting, similar to the calling sequence
for find-if, position-if, etc.

  ;; Split string based on FUNCTION into prefix, main, and suffix.
  ;;
  ;; TO DO:  Extend this to handle more than just a single transition,
  ;; so that, for example, "??Fred#%$Didn't??" will be handled
  ;; correctly.  Note that going that route will require abandoning the
  ;; multiple-values return approach and going to lists.

  (defun split-if (function string)
    (let ((firstpos (position-if function string))
          (lastpos  (position-if function string :from-end t)))
      ;; If we get a last position, we need to increment it to use it
      ;; as an index to just beyond that position.
      (when lastpos (incf lastpos))
      (values (if firstpos
                  (subseq string 0 firstpos)
                  string)
              (if firstpos
                  (subseq string firstpos lastpos)
                  nil)
              (if lastpos
                  (subseq string lastpos)
                  nil))))

Then your splitting code would look like:

   (defun string-list-nonchar-fixer (string-list)
     (let ((newlist nil))
       (loop for chunk in (reverse string-list)
             do (multiple-value-bind (prefix main suffix)
                    (split-if #'alpha-char-p chunk)
                  ;; SPLIT-IF returns "" rather than NIL for an empty prefix
                  ;; or suffix, so test the length to skip empty pieces.
                  (when (plusp (length suffix)) (push suffix newlist))
                  (when main (push main newlist))
                  (when (plusp (length prefix)) (push prefix newlist))))
       newlist))
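
For the multiple-transition case, one way it might be done is to
return a list of runs instead of three values.  This is only a sketch;
SPLIT-RUNS is a made-up name, and it splits wherever the predicate's
answer changes from one character to the next.

```lisp
;; Hypothetical helper: split STRING into maximal runs of characters
;; that all satisfy PREDICATE or all fail it.  Returns a list of strings.
(defun split-runs (predicate string)
  (let ((runs '())
        (start 0)
        (end (length string)))
    (loop for i from 1 to end
          ;; Close the current run at the end of the string, or where the
          ;; predicate's truth value differs between neighbouring characters.
          when (or (= i end)
                   (not (eq (not (funcall predicate (char string i)))
                            (not (funcall predicate (char string (1- i)))))))
            do (push (subseq string start i) runs)
               (setf start i))
    (nreverse runs)))
```

With #'alpha-char-p, "??Fred#%$Didn't??" comes back as
("??" "Fred" "#%$" "Didn" "'" "t" "??"), and an empty string gives NIL.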


.> ;; SAVE-TO-FILE
.> ;; INPUT:  a variable to write to file and a filepath with a filename
.> ;;         included.
.> ;; OUTPUT: a textfile on your computers harddrive!  YAY!
.> ;;
.> ;; This is a handy function for saving text into a file.
.> ;; I don't know if it works with multi-byte characters or not.
.> 
.> (defun SAVE-TO-FILE (variable filename)
.>      (with-open-file (out filename
.>                       :direction :output
.>                       :if-exists :supersede)
.>      (with-standard-io-syntax
.>         (print variable out))))


-- 
Thomas A. Russ,  USC/Information Sciences Institute

From: Alan Crowe
Subject: Re: "Read stuff from a file and chop it up to do stuff" code advice wanted.
Date: Thu, 25 Oct 2007 14:28:15 +0000
Message-ID: <86ve8v46bk.fsf@cawtech.freeserve.co.uk>

landspeedrecord <···············@gmail.com> writes:

> I am writing code that reinvents the "read stuff from a file and chop
> it up to do stuff" code wheel (of pain conan).  If you know what I
> mean...  Why am I doing this?
> 
> a) to learn lisp and also how to program...
> b) because I am too stupid (or it is too hard) to find code that does
> the insanely simple crap I want to do as a newbie.  I tried with
> google... really.  Plus I don't know how to use packages yet.
> 
> Anyway... I am seeking some advice... any advice about how to write
> code better or with better style.  Especially advice on how to write
> more beautiful and/or simple code.  

> Am I over documenting? 

You are asking people to comment on your code. Suppose,
hypothetically, that it is hard to follow. Then readers will
not be able to work out what it is supposed to do, and will be
unable to suggest better ways of writing it. If you want
replies you need to offer a separate, plain English, written
account of what the code is supposed to do. At first glance
you seem to be providing the necessary level of
documentation for a post to c.l.l. asking for advice. This
is very different from the documentation appropriate to a
source file. 

The key idea in writing comments is that modern languages
permit lengthy descriptive variable names, pseudo-english
constructs such as loop, and many ways of writing code, with
the intention that you should write self-documenting
code. So the comments should mainly be about stuff that is
not present in the code. For example, if you have not
written the code in the obvious way, the code itself
explains the way that you actually wrote it, but omits both
the obvious way and why that doesn't actually work. So the
comment should provide what is missing, sketching the
obvious way and explaining the problem with it.


> Under using useful functions that I am too stupid/ignorant
> to use?

Err, I think you mean "Under using useful functions because
I have only just started on my wonderful adventure"

> I assume
> that there is a library that already does what my code is doing... but
> I couldn't find it so I wrote my own. 

The Perl way of building a system of interacting programs is
that each program writes its outputs in formats that the
programmer makes up as he goes along. Later when an output
is needed as input to a program that is written later, the
programmer attempts to discover the grammar of the language
he has invented and tries to write an ad hoc parser for it
using regular expressions. Perl is a powerful language with
convenient regular expressions so this approach is almost
successful.

The Lisp way is to just use WRITE and READ. So things are a
bit ugly (configuring the pretty printer can help). But
it is reliable and requires no coding. 
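
A minimal sketch of that round trip, using a string instead of a file
so it can be tried at the REPL.  ROUND-TRIP is a made-up name; the same
WRITE and READ calls work on a file stream opened with WITH-OPEN-FILE.

```lisp
;; Hypothetical example: print OBJECT readably, then read it back.
(defun round-trip (object)
  (with-standard-io-syntax
    (let ((text (with-output-to-string (out)
                  (write object :stream out))))
      ;; READ-FROM-STRING also returns a position; keep only the object.
      (values (read-from-string text)))))
```

For example, (round-trip '("Fred" 42 ("a" "b"))) returns a fresh list
that is EQUAL to the original.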

> ;; NOT-TEXT?
> ;; INPUT: a character
> ;; OUTPUT: T or NIL
> ;;
> ;; This function returns TRUE if any character
> ;; passed to it is below #\!... which is where
> ;; all the ascii (unicode as well?) control characters
> ;; are.  Also below #\! is #\space etc...
> 
> (defun not-text? (char)
>       (if (char< char #\!)
> 	  T
> 	  NIL))
> 

It is traditional to hold #\? in reserve for use as a
macro-character. One might arrange that ?x gets read as
#s(match-variable name x). So tradition would have you name
the function not-text-p.

You can use predicates directly, not-text-p is a one-liner

(defun not-text-p (char)
  (char< char #\!))

> ;; ALPHA-CHAR?
> ;; INPUT: a character
> ;; OUTPUT: T or NIL
> ;;
> ;; Only works with ASCII??? (I am NOT SURE.)
> ;; Returns TRUE if any character passed to it
> ;; is alphabetic.  Problematically, Unicode
> ;; characters (I think it is Unicode at least! - not sure)
> ;; 91. #\[ 92. #\\ 93. #\] 94. #\^ 95. #\_ 96. #\`
> ;; are between 65. #\A and 122. #\z. and so they
> ;; would ALSO make the function return TRUE.
> ;; I don't know how to make Lisp change between different
> ;; character code sets so I can't fix/figure out how
> ;; to solve this issue.

The problem here is with original ASCII, I don't think that
changing character sets will help.

> 
> (defun alpha-char? (char)
>       (if (AND (char-not-lessp char #\A)(char-not-greaterp char #\z))
> 	  T
> 	  NIL))

No need for if. Notice that many CL ordering predicates
permit more than two arguments, which makes writing range
checks straightforward.

You don't have to write 

(and (<= lower-limit x)
     (<= x upper-limit))

you can simply say

(<= lower-limit x upper-limit)

This applies here,

CL-USER> (defun alpha-filter (char)
           (char-not-greaterp #\A char #\Z))

CL-USER> (loop for code from 0 below 256
               when (alpha-filter (code-char code))
               collect (code-char code))

(#\A #\B #\C #\D #\E #\F #\G #\H #\I #\J #\K #\L #\M #\N #\O #\P #\Q #\R #\S
 #\T #\U #\V #\W #\X #\Y #\Z #\a #\b #\c #\d #\e #\f #\g #\h #\i #\j #\k #\l
 #\m #\n #\o #\p #\q #\r #\s #\t #\u #\v #\w #\x #\y #\z)
 
though now that I have given the game away by saying that
predicates end in #\p not #\? you are soon going to discover
the built-in functions alpha-char-p and alphanumericp.
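
A small sketch of those two standard predicates in use.  The wrapper
function names are made up for illustration.

```lisp
;; Hypothetical helpers built on the standard predicates.
(defun keep-letters (string)
  (remove-if-not #'alpha-char-p string))

(defun keep-letters-and-digits (string)
  (remove-if-not #'alphanumericp string))
```

(keep-letters "??Fred42!") returns "Fred" and
(keep-letters-and-digits "??Fred42!") returns "Fred42".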

> 
> 
> 
> ;; GET-TEXTCHUNK-FROM-STREAM
> ;; INPUT:  a stream and an array to hold characters in temp memory.
> ;; OUTPUT: a string or NIL if at the absolute end of file and
> ;;         the temp memory array is empty.
> ;;
> ;; This function grabs characters from a stream until it has
> ;; a chunk of text surrounded by white space
> ;; or linefeeds or carriage returns and returns the
> ;; resulting string.  It works via recursion... IS THIS
> ;; INEFFECIENT or slower than a loop??? Don't know.
> ;;
> ;; Some notes about the steps of the "COND" part of the fuction:
> ;; 1 If new-char is nil - then end of file/stream! but if the
> ;;   array contains characters then return the array so the final
> ;;   characters in the stream aren't lost!
> ;; 2 new-char = NIL & the array has no characters.
> ;;   end of file/stream!  Return NIL.
> ;; 3 Any character below "!" is a control character like
> ;;   #\newline or #\space etc... Text-chunk complete!
> ;;   Throw out the control character and return the chunk.
> ;; 4 Still reading in legitimate characters.  Push new-char
> ;;   onto the array and then keep going
> ;;   via recursion (passing in the updated char-array...)
> 
> (defun get-textchunk-from-stream (stream char-array)
>       ;;the below NILs are important! for endoffile error avoidance.
>       (let ((new-char (read-char stream nil)))
>               ;1
>         (cond ((AND (eq nil new-char) (> (length char-array) 0))
>                char-array)
>               ;2
>               ((eq nil new-char) nil)
>               ;3
> 	      ((not-text? new-char) char-array)
>               ;4
> 	      (T (progn (vector-push-extend new-char char-array)
> 		        (get-textchunk-from-stream stream char-array))))))
> 

(and (eq nil new-char) (> (length char-array) 0))

could be

(and (not new-char)(plusp (length char-array)))

The progn is redundant. Cutting and pasting from the
hyperspec

Macro COND 

Syntax:

cond {clause}* => result*

clause::= (test-form form*) 
                         ^
                         |
   This crucial asterisk indicates 0,1,2, or more forms

They are evaluated in an implicit progn. Think of cond as

(cond ((test data)(do-this)(do-that)(compute-value))
      ((further-test data) (do-something-else data)(different-value-computation)))

COND looks a bit strange if you are used to C with its if-then-else but
it is used a lot because it does both
if-then-elseif-then-else and packages up multiple statements.
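
Putting those two changes together, the function might end up looking
something like this.  It is only a sketch of the original with the
PROGN dropped and the first test rewritten; it assumes the one-line
not-text-p, repeated here so the example stands alone.

```lisp
(defun not-text-p (char)
  (char< char #\!))

(defun get-textchunk-from-stream (stream char-array)
  (let ((new-char (read-char stream nil)))
    (cond ((and (not new-char) (plusp (length char-array))) char-array)
          ((not new-char) nil)
          ((not-text-p new-char) char-array)
          ;; Two forms in one clause, no PROGN needed.
          (t (vector-push-extend new-char char-array)
             (get-textchunk-from-stream stream char-array)))))
```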


> 
> 
> ;; GET-TEXTCHUNK
> ;; INPUT: a stream
> ;; OUTPUT: a string
> ;;
> ;; This function is just a helper function that sets up
> ;; GET-TEXTCHUNK-FROM-STREAM to begin its recursive process
> ;; properly. I could probably have done without it but
> ;; I couldn't figure out how to make GET-TEXTCHUNK-FROM-STREAM
> ;; self contained.
> 
> (defun get-textchunk (stream)
>       (let ((char-catcher (make-array 0 :element-type 'character
>                                         :fill-pointer 0
>                                         :adjustable t)))
> 	(get-textchunk-from-stream stream char-catcher)))
> 
> 
> 
> ;; GET-ALL-TEXTCHUNKS
> ;; INPUT:  a stream.
> ;; OUTPUT: a list of strings.
> ;;
> ;; This function loops GET-TEXTCHUNK over and over again
> ;; until the stream ends and GET-TEXTCHUNK returns NIL finally,
> ;; whereupon GET-ALL-TEXTCHUNKS
> ;; returns a list of strings.
> 
> (defun get-all-textchunks (stream)
>       (loop for word = (get-textchunk stream)
> 	  while word collect word))
> 
> 
> 
> ;; SLURP-STREAM5
> ;; INPUT: a stream
> ;; OUTPUT: a very long string?
> ;;
> ;; Holy Crap this grabs all the text from a stream so fast!
> ;; I got this off the web at:
> ;; http://www.emmett.ca/~sabetts/slurp.html
> ;; My older code was grabbing one word from the file stream at a time.
> ;; It is MUCH faster to use this code to read in all the text at once
> ;; and then run my code on the text string that this code creates.
> 
> (defun slurp-stream5 (stream)
>   (let ((seq (make-array (file-length stream)
>                :element-type 'character
>                :fill-pointer t)))
>     (setf (fill-pointer seq) (read-sequence seq stream))
>     seq))
> 
slurp-character-stream might be a better name, reflecting
your commitment to a specific element type.

> 
> 
> ;; SUPER-TEXT-SLURP
> ;; INPUT: a file path
> ;; OUTPUT: the output of slurp-string5, i.e. a single long string.
> ;;
> ;; The functin uses slurp-stream5 to open a stream and
> ;; return all the text as a single long string.  SUPER-TEXT-SLURP is
> ;; just some time saving code that saves me from having
> ;; to open and close strings just to read in text.
> ;; It also has the advantage of using WITH-OPEN-FILE
> ;; so the closing of the file is done automatically.
> 
> (defun super-text-slurp (file-location)
>   (with-open-file (temp-var file-location
> 			:direction :input
> 			:if-does-not-exist :error)
> 	(slurp-stream5 temp-var)))

Using the word "super" to denote a variant of a function
always ends in tears. You need something less general,
perhaps

slurp-text-file

temp-var is super-vague, which will annoy you when you
re-read your code in 6 months' time. The hyperspec says

macro WITH-OPEN-FILE 

Syntax:

with-open-file (stream filespec options*) declaration* form*

so you can steal the names from there and write

(defun slurp-text-file (filespec)
  (with-open-file (stream filespec
                          :direction :input
                          :if-does-not-exist :error)
    (slurp-character-stream stream)))

Whoops, I am so out of time and I haven't got to the
interesting stuff yet.

Alan Crowe
Edinburgh
Scotland

From: Mark Tarver
Subject: Re: "Read stuff from a file and chop it up to do stuff" code advice wanted.
Date: Thu, 25 Oct 2007 15:10:54 +0000
Message-ID: <1193325054.772610.148020@k79g2000hse.googlegroups.com>

I haven't read all your code but in general for this class of problem
there are 2 approaches

1.  Read in the stuff from the file into a list and then process the
list.
2.  Read and process the file incrementally as it is being read.

In general, method 1 results in code that is cleaner and easier to
read and declarative in spirit.  2. is suitable if the input is very
large and you're looking to retain only a fragment of the original.

If you're a newbie I'd advise going for 1.  In Qi there is an inbuilt
function read-file-as-charlist which reads the contents of a file as a
list of characters.  If you download the source from www.lambdassociates.org
you can find the Lisp source for this.
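
To give the flavour of approach 1 in plain Common Lisp, here is a
rough recursive sketch.  It is not the Qi source, and the function name
is made up.  The recursion is not a tail call, so a very large input
could exhaust the stack.

```lisp
;; Hypothetical example: read every remaining character of STREAM
;; into a list, recursively.
(defun read-stream-as-charlist (stream)
  (let ((char (read-char stream nil)))
    (if char
        (cons char (read-stream-as-charlist stream))
        '())))
```

For example, reading from a string stream over "abc" gives
(#\a #\b #\c).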

For now stay away from loop and other procedural constructions if
possible - they delay your evolution into thinking like a functional
programmer.

Mark

From: Mark Tarver
Subject: Re: "Read stuff from a file and chop it up to do stuff" code advice wanted.
Date: Thu, 25 Oct 2007 15:15:35 +0000
Message-ID: <1193325335.966244.216090@22g2000hsm.googlegroups.com>

My response is rather difficult to read because my Google reader
hides it.  Here it is again.

QUOTE
I haven't read all your code but in general for this class of problem
there are 2 approaches

1.  Read in the stuff from the file into a list and then process the
list.
2.  Read and process the file incrementally as it is being read.

In general, method 1 results in code that is cleaner and easier to
read and declarative in spirit.  2. is suitable if the input is very
large and you're looking to retain only a fragment of the original.

If you're a newbie I'd advise going for 1.  In Qi there is an inbuilt
function read-file-as-charlist which reads the contents of a file as a
list of characters.  If you download the source from www.lambdassociates.org
you can find the Lisp source for this.

For now stay away from loop and other procedural constructions if
possible - they delay your evolution into thinking like a functional
programmer.
UNQUOTE

Mark