Can not find older posting: Reading files (fast)

From: Bernd Schmitt
Subject: Can not find older posting: Reading files (fast)
Date: Sat, 27 Aug 2005 14:14:51 +0000
Message-ID: <431073ba$0$17575$9b622d9e@news.freenet.de>

Hello,

I am learning Lisp and have been lurking here for some time now.

There was an interesting post (for a novice like me) about loading a
file quickly, with each line appended to a list. Actually this was
the wrong way: somebody pointed out that appending makes the running
time grow quadratically with the file size, so the solution was to
put new lines at the beginning and reverse the list after finishing.

(I am wondering whether one of those list-copying functions would
be an alternative, but to find out:)
I would like to reread those articles, but I cannot find them via
Google Groups. Does somebody remember the name of the thread?



Thanks,
Bernd


P.S.: Vacation is over, the sun starts shining again, and I am still not
finished with my Lisp book - grrr


-- 
https://gna.org/projects/mipisti - (microscope) picture stitching
          T_a_k_e__c_a_r_e__o_f__y_o_u_r__R_I_G_H_T_S.
            P_r_e_v_e_n_t__L_O_G_I_C--P_A_T_E_N_T_S
     http://www.ffii.org, http://www.nosoftwarepatents.org

From: Pascal Bourguignon
Subject: Re: Can not find older posting: Reading files (fast)
Date: Sat, 27 Aug 2005 16:20:47 +0000
Message-ID: <87psrze4gw.fsf@thalassa.informatimago.com>

Bernd Schmitt <··················@gmx.net> writes:

> Hello,
>
> I am learning lisp and so I am lurking here for some time now.
>
> There was an interesting post (for a novice like me) about fast
> loading a file, so that each line would be appended to a
> list. Actually this was the wrong way, somebody pointed out that this
> would lead to quadratic time consumption if file size would doubled,
> so the solution had been to put new lines at the beginning and reverse
> the list after finishing.

Note that instead of using append or nconc, you could write this:

(defstruct (managed-list (:conc-name ml-)) head tail)

(defun ml-enqueue (ml item)
  (if (null (ml-head ml))
      (setf (ml-head ml) (list item)
            (ml-tail ml) (ml-head ml))
      (setf (cdr (ml-tail ml)) (list item)
            (ml-tail ml) (cdr (ml-tail ml))))
  ml)

(let ((l (make-managed-list)))
  (print (ml-enqueue l :a))
  (print (ml-enqueue l :b))
  (print (ml-enqueue l :c))
  (ml-head l))

#S(MANAGED-LIST :HEAD (:A) :TAIL (:A)) 
#S(MANAGED-LIST :HEAD (:A :B) :TAIL (:B)) 
#S(MANAGED-LIST :HEAD (:A :B :C) :TAIL (:C)) 

--> (:A :B :C)


This temporarily uses a little more memory than the push/nreverse
scheme (O(1): a constant of two extra slots), and takes the same
O(n) time.  If the list is bigger than the cache, it might be slightly
more efficient, however, since nreverse would have to reload the
cache several times.
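
For comparison, here is a minimal sketch of the two schemes being
discussed.  COLLECT-BY-APPEND and COLLECT-BY-PUSH are hypothetical
names used only for illustration; they take a list rather than a file
so they can be tried at the REPL:

```lisp
;; Illustration only: these helper names are not from the thread.
(defun collect-by-append (items)
  "Quadratic: APPEND copies the accumulated list on every step."
  (let ((result '()))
    (dolist (item items result)
      (setf result (append result (list item))))))

(defun collect-by-push (items)
  "Linear: PUSH is O(1) per item, followed by one NREVERSE pass."
  (let ((result '()))
    (dolist (item items)
      (push item result))
    (nreverse result)))
```

Both return the items in their original order; only the amount of
copying differs.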


> (I am wondering, if a use of one of those list-copying functions would
> be an alternative, but therefore:)
> I would like to reread those articles, but I can not find it via
> google-groups. Does somebody remember the articles thread name?

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
The rule for today:
Touch my tail, I shred your hand.
New rule tomorrow.

From: Rob Warnock
Subject: Re: Can not find older posting: Reading files (fast)
Date: Sun, 28 Aug 2005 02:23:17 +0000
Message-ID: <7ZKdnVpy7JiIvYzeRVn-pw@speakeasy.net>

Bernd Schmitt  <··················@gmx.net> wrote:
+---------------
| There was an interesting post (for a novice like me) about fast loading 
| a file, so that each line would be appended to a list. Actually this was 
| the wrong way, somebody pointed out that this would lead to quadratic 
| time consumption if file size would doubled, so the solution had been to 
| put new lines at the beginning and reverse the list after finishing.
+---------------

If you're willing to use a simple LOOP without necessarily
understanding it completely at this point in your learning,  ;-}
the following [from my "random utilities" junkbox] has linear
behavior in most CL implementations:

    (defun file-lines (path)
      "Sucks up an entire file from PATH into a list of freshly-allocated
      strings, returning two values: the list of strings and the number of
      lines read."
      (with-open-file (s path)
        (loop for line = (read-line s nil nil)
              while line
          collect line into lines
          counting t into line-count
          finally (return (values lines line-count)))))

With CMUCL-19a on FreeBSD 4.10 on a 1.855 GHz Mobile Athlon, the
above takes a hair over one second to read in an 11637220-byte file
of 266478 lines.

And if, instead of a list of lines, you suck up the whole thing into
a single string [for later manipulation with CHAR or SUBSEQ]:

    (defun file-string (path)
      "Sucks up an entire file from PATH into a freshly-allocated string,
      returning two values: the string and the number of bytes read."
      (with-open-file (s path)
        (let* ((len (file-length s))
               (data (make-string len)))
          (values data (read-sequence data s)))))

then on the same machine/file/etc. as above, it goes ~20 times as fast
(0.05 seconds of real time).


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607

From: Bernd Schmitt
Subject: Re: Can not find older posting: Reading files (fast)
Date: Sun, 28 Aug 2005 08:57:57 +0000
Message-ID: <43117af2$0$9014$9b622d9e@news.freenet.de>

Rob Warnock wrote:
> Bernd Schmitt  <··················@gmx.net> wrote:


> +---------------
> | There was an interesting post (for a novice like me) about fast loading 
> | a file, so that each line would be appended to a list. Actually this was 
> | the wrong way, somebody pointed out that this would lead to quadratic 
> | time consumption if file size would doubled, so the solution had been to 
> | put new lines at the beginning and reverse the list after finishing.
> +---------------

I finally found it: "lists of lists (newbie)" by Stephen Ramsay,
http://groups.google.com/group/comp.lang.lisp/browse_frm/thread/1f4af3e1b03b3cc7/ecc5cb97415e7a09?tvc=1&q=with-open-file#ecc5cb97415e7a09


> If you're willing to use a simple LOOP without necessarily
> understanding it completely at this point in your learning,  ;-}
You are right, so I would like to ask some novice questions about
your code.


> the following [from my "random utilities" junkbox] has linear
> behavior in most CL implementations:
> 
>     (defun file-lines (path)
>       "Sucks up an entire file from PATH into a list of freshly-allocated
>       strings, returning two values: the list of strings and the number of
>       lines read."
>       (with-open-file (s path)
> 	(loop for line = (read-line s nil nil)
why is this indented like this?


> 	      while line
> 	  collect line into lines
> 	  counting t into line-count
> 	  finally (return (values lines line-count)))))
Is this a kind of pseudo-code? (Otherwise I would expect more parens.)

If this is too novice-like, please excuse me and just give me the 
keyword to google/cliki/pcl/...



thank you
Bernd






From: Rob Warnock
Subject: Re: Can not find older posting: Reading files (fast)
Date: Sun, 28 Aug 2005 10:00:34 +0000
Message-ID: <ksydnTsoL9nfFozeRVn-sw@speakeasy.net>

Bernd Schmitt  <··················@gmx.net> wrote:
+---------------
| Rob Warnock wrote:
| > the following [from my "random utilities" junkbox] has linear
| > behavior in most CL implementations:
| > 
| >     (defun file-lines (path)
| >       "Sucks up an entire file from PATH into a list of freshly-allocated
| >       strings, returning two values: the list of strings and the number of
| >       lines read."
| >       (with-open-file (s path)
| > 	(loop for line = (read-line s nil nil)
| why is this indented like this?
+---------------

Well, it *wasn't*, when I posted it!  ;-}  ;-}

It was indented like this, with the LOOP indented two spaces inside
the WITH-OPEN-FILE (which was in turn two spaces inside the DEFUN):

    (defun file-lines (path)
      "Sucks up an entire file from PATH into a list of freshly-allocated
      strings, returning two values: the list of strings and the number of
      lines read."
      (with-open-file (s path)
        (loop for line = (read-line s nil nil)
              while line
          collect line into lines
          counting t into line-count
          finally (return (values lines line-count)))))

Customarily, the bodies of forms which define things or establish
bindings are indented two spaces in from the beginning of the enclosing
form. The above follows this for the bodies of DEFUN and WITH-OPEN-FILE.

LOOP indenting is slightly different, since it has a more COBOL-like
syntax which uses unevaluated symbols [but not necessarily from the
KEYWORD package!] as syntax markers or "LOOP keywords". In the above
LOOP, the syntax markers used are FOR, =, WHILE, COLLECT, INTO, COUNTING,
and FINALLY. [It is this non-Lispy embedded syntax that made me include
in my previous posting the caveat: "If you're willing to use a simple
LOOP without necessarily understanding it completely..."]

Opinions differ on proper style for LOOP indenting. A LOOP form may
contain binding/initializing/stepping sub-forms as well as "action"/
collecting/counting/summing/terminating sub-forms. Some people [and
the editors they use] like to indent the entire body of the LOOP by
five spaces, to line up with (typically) the first binding/initializing/
stepping sub-form, like this:

    (loop for line = (read-line s nil nil)
          while line
          collect line into lines
          counting t into line-count
          finally (return (values lines line-count)))

Others [yours truly among them] prefer to [and have configured their
editors to automate] indenting the binding/initializing/stepping
sub-forms five spaces, but then backing out to a two-space indent
for the "action"/collecting/counting/summing/terminating sub-forms,
like so:

    (loop for line = (read-line s nil nil)
          while line
      collect line into lines
      counting t into line-count
      finally (return (values lines line-count)))

This causes the indenting to resemble that of other binding and
iteration forms in the language.

Whatever style you choose, try to stick to it [though, where it
makes sense to do so, not slavishly] so the readers of your code
don't wonder why some of your code looks one way and some another.
[Otherwise, they may try to read meaning that isn't there into the
entrails of your LOOPs!]

+---------------
| > 	      while line
| > 	  collect line into lines
| > 	  counting t into line-count
| > 	  finally (return (values lines line-count)))))
| is this a kind of pseudo-code (otherwise i would miss some parens)?
+---------------

Yes, one might well say that!!  ;-}  ;-}

It is widely agreed by both those who love the Common Lisp LOOP
macro and those who hate it that LOOP is one of the *least* "Lispy"
parts of the CL standard -- it is truly an example of using a macro
to embed a different, application-specific language inside normal CL.
Nevertheless, it provides concise, powerful expressiveness for a
large percentage of the iteration tasks one runs into in practice.
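
To see what the LOOP keywords are doing, here is a rough sketch of the
same iteration written with DO and an explicit push/nreverse.
STREAM-LINES is a hypothetical name, and it takes an already-open
stream instead of a path so that it can be tried on a string stream:

```lisp
;; Illustration only: a DO-based equivalent of the LOOP in FILE-LINES.
(defun stream-lines (stream)
  "Return two values: the list of lines read from STREAM and their count."
  (do ((line (read-line stream nil nil) (read-line stream nil nil))
       (lines '() (cons line lines))          ; plays the role of COLLECT
       (line-count 0 (1+ line-count)))        ; plays the role of COUNTING
      ((null line)                            ; plays the role of WHILE
       (values (nreverse lines) line-count)))) ; plays the role of FINALLY
```

For example, (with-input-from-string (s "a") (stream-lines s)) reads a
single line from the string "a".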

Some resources for when you're ready to dig into it further:

  Tutorial:
    http://gigamonkeys.com/book/macros-standard-control-constructs.html
      [the section near the bottom called "The Mighty Loop"]
    http://gigamonkeys.com/book/loop-for-black-belts.html
      [the whole chapter]

  The Standard [well, the CLHS, based on the ANSI CL Standard]:
    http://www.lispworks.com/documentation/HyperSpec/Body/06_a.htm
      "6.1 The LOOP Facility"
    http://www.lispworks.com/documentation/HyperSpec/Body/06_aab.htm
      "6.1.1.2 Loop Keywords"
    http://www.lispworks.com/documentation/HyperSpec/Body/m_loop.htm
      "Macro LOOP"


-Rob


From: Bernd Schmitt
Subject: Re: Can not find older posting: Reading files (fast)
Date: Sun, 28 Aug 2005 14:31:46 +0000
Message-ID: <4311c92f$0$9038$9b622d9e@news.freenet.de>

Rob Warnock wrote:
> Bernd Schmitt  <··················@gmx.net> wrote:



[indentation quoting error]
> Well, it *wasn't*, when I posted it!  ;-}  ;-}

So it was Mozilla's error - I really should learn how to use Emacs for
Usenet ;)



> Some resources for when you're ready to dig into it further:
> 
>   Tutorial:
>     http://gigamonkeys.com/book/macros-standard-control-constructs.html
>       [the section near the bottom called "The Mighty Loop"]
>     http://gigamonkeys.com/book/loop-for-black-belts.html
>       [the whole chapter]
> 
>   The Standard [well, the CLHS, based on the ANSI CL Standard]:
>     http://www.lispworks.com/documentation/HyperSpec/Body/06_a.htm
>       "6.1 The LOOP Facility"
>     http://www.lispworks.com/documentation/HyperSpec/Body/06_aab.htm
>       "6.1.1.2 Loop Keywords"
>     http://www.lispworks.com/documentation/HyperSpec/Body/m_loop.htm
>       "Macro LOOP"

Many thanks, I will do so. I just got past destructive functions in the
Lisp book I am currently reading (nconc, ...), so LOOP is another 100
pages away :). I should focus on reading before posting next time...




Thank you,
Bernd




From: Edi Weitz
Subject: Re: Can not find older posting: Reading files (fast)
Date: Sun, 28 Aug 2005 15:25:03 +0000
Message-ID: <uy86mozhs.fsf@agharta.de>

On Sun, 28 Aug 2005 05:00:34 -0500, ····@rpw3.org (Rob Warnock) wrote:

> Well, it *wasn't*, when I posted it!  ;-}  ;-}

You used tabs which very often leads to this kind of confusion.  IMHO
literal tabs should be avoided in source code.

  (setq-default indent-tabs-mode nil)

Cheers,
Edi.

-- 

Lisp is not dead, it just smells funny.

Real email: (replace (subseq ·········@agharta.de" 5) "edi")

From: Rob Warnock
Subject: Re: Can not find older posting: Reading files (fast)
Date: Mon, 29 Aug 2005 02:00:05 +0000
Message-ID: <zJednRCUTJq48Y_eRVn-iQ@speakeasy.net>

Edi Weitz  <········@agharta.de> wrote:
+---------------
| ····@rpw3.org (Rob Warnock) wrote:
| > Well, it *wasn't*, when I posted it!  ;-}  ;-}
| 
| You used tabs which very often leads to this kind of confusion.
+---------------

Ouch!! (*blush!*)  I didn't know that, but looking at the file copy
of it, I see you're quite correct. Thanks for the catch!

+---------------
| IMHO literal tabs should be avoided in source code.
|   (setq-default indent-tabs-mode nil)
+---------------

I use "vi" for composing news articles, and while I am generally
careful to never type tabs in source code, occasionally "vi" will
"helpfully" insert them anyway if I do a ">%" ["shift right to
matching bracket"] and some of the shifted text ends up being on or
past a tab stop. ·@·····@^!# (*grumph!*)

I'll see if I can find a setting that stops that...

Thanks,


-Rob


From: Christopher C. Stacy
Subject: Re: Can not find older posting: Reading files (fast)
Date: Mon, 29 Aug 2005 06:36:16 +0000
Message-ID: <uzmr19rmo.fsf@news.dtpq.com>

····@rpw3.org (Rob Warnock) writes:

> Edi Weitz  <········@agharta.de> wrote:
> +---------------
> | ····@rpw3.org (Rob Warnock) wrote:
> | > Well, it *wasn't*, when I posted it!  ;-}  ;-}
> | 
> | You used tabs which very often leads to this kind of confusion.
> +---------------
> 
> Ouch!! (*blush!*)  I didn't know that, but looking at the file copy
> of it, I see you're quite correct. Thanks for the catch!
> 
> +---------------
> | IMHO literal tabs should be avoided in source code.
> |   (setq-default indent-tabs-mode nil)
> +---------------
> 
> I use "vi" for composing news articles, and while I am generally
> careful to never type tabs in source code, occasionally "vi" will
> "helpfully" insert them anyway if I do a ">%" ["shift right to
> matching bracket"] and some of the shifted text ends up being on or
> past a tab stop. ·@·····@^!# (*grumph!*)
> 
> I'll see if I can find a setting that stops that...

I use Emacs, and I try to remember to run the 
command [m-x untabify] over the code region.

From: Ivan Boldyrev
Subject: Re: Can not find older posting: Reading files (fast)
Date: Tue, 30 Aug 2005 17:42:16 +0000
Message-ID: <ocieu2-as6.ln1@ibhome.cgitftp.uiggm.nsc.ru>

On 9216 day of my life Christopher C. Stacy wrote:
> I use Emacs, and I try to remember to run the 
> command [m-x untabify] over the code region.

What about adding untabify to message-send-hook?

(add-hook
 'message-send-hook
 (lambda ()
   ;; Convert tabs to spaces in the message body before sending.
   (save-excursion
     (message-goto-body)
     (untabify (point) (point-max)))))

-- 
Ivan Boldyrev

        Outlook has performed an illegal operation and will be shut down.
        If the problem persists, contact the program vendor.

From: Robert Uhl
Subject: Re: Can not find older posting: Reading files (fast)
Date: Mon, 29 Aug 2005 20:53:46 +0000
Message-ID: <m3acj031np.fsf@4dv.net>

····@rpw3.org (Rob Warnock) writes:
>
> I use "vi" for composing news articles, and while I am generally
> careful to never type tabs in source code, occasionally "vi" will
> "helpfully" insert them anyway if I do a ">%" ["shift right to
> matching bracket"] and some of the shifted text ends up being on or
> past a tab stop. ·@·····@^!# (*grumph!*)
>
> I'll see if I can find a setting that stops that...

tmp=`which vi`; rm $tmp; ln -s `which emacs` $tmp

*grin*

-- 
Robert Uhl <http://public.xdi.org/=ruhl>
All words have not a single meaning but a swarm of them, like bees
around a hive.  And like that swarm, changing its position
ever-so-slightly with each wingbeat, the word's meanings change a little
with each use upon the tongue or the page.             --Maureen O'Brien

From: drewc
Subject: Re: Can not find older posting: Reading files (fast)
Date: Mon, 29 Aug 2005 00:30:14 +0000
Message-ID: <qGsQe.327443$s54.301252@pd7tw2no>

Rob Warnock wrote:
> 
> And if, instead of a list of lines, you suck up the whole thing into
> a single string [for later manipulation with CHAR or SUBSEQ]:
> 
>     (defun file-string (path)
>       "Sucks up an entire file from PATH into a freshly-allocated string,
>       returning two values: the string and the number of bytes read."
>       (with-open-file (s path)
> 	(let* ((len (file-length s))
> 	       (data (make-string len)))
> 	  (values data (read-sequence data s)))))
> 
> then on the same machine/file/etc. as above, it goes ~20 times as fast
> (0.05 seconds of real time).

According to "~& ~:[?~;~:*~S~]: ~:[?~;~:*~S~] -> ~:[?~;~:*~S~]~%" [1], 
this function is not portable:

"But this almost certainly will not work reliably. file-length will 
almost certainly tell you the length of the file in octets, not 
characters, and if the encoding is not trivial, this will mean that the 
string allocated will be the wrong length (typically it will be too 
long). To see why this is likely to be true, consider how you would make 
things work `right' on a Unix-like system: since the file is actually 
just a sequence of octets - there is no useful metadata - then in order 
for file-length to calculate the character length of the file, it would 
have to read the whole file, decoding it into characters. So in order to 
work, this code has to read the whole file twice."

The following solution is offered:


(defun snarf-file (file)
   ;; encoding-resistant file reader.  You can't use FILE-LENGTH
   ;; because in the presence of variable-length encodings (and DOS
   ;; linefeed conventions) the length of a file can bear little resemblance
   ;; to the length of the string it corresponds to.  Reading each line
   ;; like this wastes a bunch of space but does solve the encoding
   ;; issues.
   (with-open-file (in file :direction :input)
     (loop for read = (read-line in nil nil)
           while read
           for i upfrom 1
           collect read into lines
           sum (length read) into len
           finally (return
                    (let ((huge (make-string (+ len i))))
                      (loop with pos = 0
                            for line in lines
                            for len = (length line)
                            do (setf (subseq huge pos) line
                                     (aref huge (+ pos len)) #\Newline
                                     pos (+ pos len 1))
                            finally (return huge)))))))

Hope that helps.

[1] http://www.tfeb.org/lisp/obscurities.html

drewc



-- 
Drew Crampsie
drewc at tech dot coop
"Never mind the bollocks -- here's the sexp's tools."
	-- Karl A. Krueger on comp.lang.lisp

From: Rob Warnock
Subject: Re: Can not find older posting: Reading files (fast)
Date: Mon, 29 Aug 2005 03:46:18 +0000
Message-ID: <XsudnVwdGZGXGI_eRVn-gw@speakeasy.net>

drewc  <·····@rift.com> wrote:
+---------------
| Rob Warnock wrote:
| >     (defun file-string (path)
| >       "Sucks up an entire file from PATH into a freshly-allocated string,
| >       returning two values: the string and the number of bytes read."
| >       (with-open-file (s path)
| > 	(let* ((len (file-length s))
| > 	       (data (make-string len)))
| > 	  (values data (read-sequence data s)))))
| 
| According to [ <http://www.tfeb.org/lisp/obscurities.html> ] ...
+---------------

Thanks for the ref!

+---------------
| ...this function is not portable :
| 
| "But this almost certainly will not work reliably. file-length will 
| almost certainly tell you the length of the file in octets, not 
| characters...
+---------------

Hmmm... O.k., I'll agree with the non-portability in general, but
it *might* be slightly more portable than Tim's page suggests.  ;-}
According to the CLHS:

    FILE-LENGTH returns the length of stream, or NIL if the length
    cannot be determined.

    For a binary file, the length is measured in units of the
    element type of the stream.

and refers one to OPEN, which says:

    element-type---a type specifier for recognizable subtype of
    CHARACTER; or a type specifier for a finite recognizable subtype
    of INTEGER; or one of the symbols SIGNED-BYTE, UNSIGNED-BYTE, or
    :DEFAULT. The default is CHARACTER.

And 13.1.4.1 "Graphic Characters" says that:

    #\Backspace, #\Tab, #\Rubout, #\Linefeed, #\Return, and #\Page,
    if they are supported by the implementation, are non-graphic.

But 2.1.3 "Standard Characters" only requires that the non-graphic
characters #\Space and #\Newline be supported.

So I guess it really boils down to whether in a given implementation
#\Return exists as a CHARACTER, and what happens when you READ-CHAR
a stream containing one, since READ-SEQUENCE is defined that way:

    READ-SEQUENCE is identical in effect to iterating over the
    indicated subsequence and reading one element at a time from
    stream and storing it into sequence, but may be more efficient
    than the equivalent loop. An efficient implementation is more
    likely to exist for the case where the sequence is a vector with
    the same element type as the stream.

Note that this is *not* the same as asking whether:

    (= (length (file-string "foo"))
       (with-open-file (s "foo")
         (loop for line = (read-line s nil nil)
               while line
           sum (1+ (length line)))))
     ==> T

This clearly might be false on platforms where #\Newline is externally
represented as <CR><LF>, but if #\Return is a (non-graphic) CHARACTER
on those machines, then the following might still be true even if the
above is false:

    (= (length (file-string "foo"))
       (with-open-file (s "foo")
         (loop for char = (read-char s nil nil)
               while char
           count t)))

Note that the former returns NIL on CMUCL under Unix when given a file
containing ASCII NULs (a .tar.gz! ;-} ) but the latter still returns T.
It would be interesting to know whether the latter also returns T on
MS/DOS or Windows platforms, and for which CL implementations.
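
For anyone who wants to run that experiment, a small sketch that
reports both numbers side by side may be handy.  LENGTH-VS-CHAR-COUNT
is a hypothetical name, and the sketch assumes FILE-LENGTH returns a
number rather than NIL for the file in question:

```lisp
;; Illustration only: compares what FILE-LENGTH reports with what
;; READ-CHAR actually delivers for the same character stream.
(defun length-vs-char-count (path)
  "Return two values for the file at PATH: the FILE-LENGTH of the
character stream, and the number of characters READ-CHAR delivers."
  (with-open-file (s path)
    (let ((reported (file-length s)))
      (values reported
              (loop while (read-char s nil nil)
                    count t)))))
```

If the two values differ on a given platform or encoding, FILE-STRING
will allocate a string of the wrong size there.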


-Rob


From: drewc
Subject: Re: Can not find older posting: Reading files (fast)
Date: Mon, 29 Aug 2005 07:53:21 +0000
Message-ID: <R9zQe.330701$s54.110830@pd7tw2no>

Rob Warnock wrote:

> drewc  <·····@rift.com> wrote:
> +---------------
> | Rob Warnock wrote:
> | >     (defun file-string (path)
> | >       "Sucks up an entire file from PATH into a freshly-allocated string,
> | >       returning two values: the string and the number of bytes read."
> | >       (with-open-file (s path)
> | > 	(let* ((len (file-length s))
> | > 	       (data (make-string len)))
> | > 	  (values data (read-sequence data s)))))
> | 
> | According to [ <http://www.tfeb.org/lisp/obscurities.html> ] ...
> +---------------
> 
> Thanks for the ref!
> 
> +---------------
> | ...this function is not portable :
> | 
> | "But this almost certainly will not work reliably. file-length will 
> | almost certainly tell you the length of the file in octets, not 
> | characters...
> +---------------
> 
> Hmmm... O.k., I'll agree with the non-portability in general, but
> it *might* be slightly more portable than Tim's page suggests.  ;-}

The real complaint he has is that, in the presence of multibyte
encodings, your string is going to be longer than the number of
characters it contains. This could cause some subtle bugs.

The following code serves to illustrate the problem, in a
Unicode-enabled SBCL, reading a UTF-8 encoded file:

CL-USER>
(defun slurp-stream (file-stream)
   (with-output-to-string (datum)
     (let ((buffer (make-array 4096 :element-type 'character)))
       (loop for bytes-read = (read-sequence buffer file-stream)
             do (write-sequence buffer datum :start 0 :end bytes-read)
             while (= bytes-read 4096)))))
SLURP-STREAM

CL-USER> (with-open-file (s "/home/drewc/utf8_sample.html"
                            :external-format :utf8)
           (values (file-length s) (length (slurp-stream s))))
11085
10279


FWIW, I use the following function from Marco Baringer's Arnesi library,
which is somewhere between your function and the one Tim suggests in
speed and memory usage, and to my eyes seems quite portable (note that
WITH-INPUT-FROM-FILE is not standard CL; it comes with the library):

(defun read-string-from-file (pathname &key (buffer-size 4096)
                                             (element-type 'character))
   "Return the contents of @var{pathname} as a string."
   (with-input-from-file (file-stream pathname)
     (with-output-to-string (datum)
       (let ((buffer (make-array buffer-size :element-type element-type)))
         (loop for bytes-read = (read-sequence buffer file-stream)
               do (write-sequence buffer datum :start 0 :end bytes-read)
               while (= bytes-read buffer-size))))))


hth,

drewc









From: Edi Weitz
Subject: Re: Can not find older posting: Reading files (fast)
Date: Mon, 29 Aug 2005 08:03:10 +0000
Message-ID: <uwtm5b269.fsf@agharta.de>

On Sun, 28 Aug 2005 22:46:18 -0500, ····@rpw3.org (Rob Warnock) wrote:

> Hmmm... O.k., I'll agree with the non-portability in general, but it
> *might* be slightly more portable than Tim's page suggests.  ;-}
>
> [analysis snipped]

You're most likely aware of this but just for the record: The real
problem in this case is the fact that the file in question could be
encoded in UTF-8 or whatever.  No matter how #\Return is treated by
the implementation it's still impossible to determine the string
length of the file from the octet length without reading through the
whole file.  (Doesn't matter for CMUCL where characters are 8-bit but
it does for LispWorks, AllegroCL, SBCL, CLISP and probably others.)
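
One way to live with this, as a minimal sketch: treat FILE-LENGTH only
as an upper bound and trim the string to the count READ-SEQUENCE
returns.  FILE-STRING-TRIMMED is a hypothetical name, and the sketch
assumes that FILE-LENGTH returns a number (not NIL) that is at least
the number of characters in the file, which is typical for
octet-counting implementations but not something the standard promises:

```lisp
;; Illustration only: a variant of FILE-STRING that does not trust
;; FILE-LENGTH to equal the number of characters.
(defun file-string-trimmed (path)
  "Read the file at PATH into a string, trimmed to the characters read."
  (with-open-file (s path)
    (let* ((data (make-string (file-length s)))
           (end (read-sequence data s)))
      ;; END is the number of characters READ-SEQUENCE actually stored;
      ;; anything beyond it in DATA is unused padding.
      (if (< end (length data))
          (subseq data 0 end)
          data))))
```

This still reads the file only once, at the cost of one extra copy
whenever the file turns out to hold fewer characters than octets.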

Cheers,
Edi.
