FILE-LENGTH and external formats

From: Edi Weitz
Subject: FILE-LENGTH and external formats
Date: Mon, 08 Dec 2003 01:27:17 +0000
Message-ID: <8765gs6qve.fsf@bird.agharta.de>

The dictionary entry for FILE-LENGTH says that "for a binary file, the
length is measured in units of the element type of the stream." For
character streams, I couldn't find a definition other than that the
function "returns the length of [the] stream."

If I create a little UTF8-encoded file which consists of the string
"Äpfel" (note the umlaut) and use OPEN on this file with an element
type of CHARACTER and the correct external format then the
Unicode-aware Lisps I have on my machine (LispWorks, AllegroCL, and
CLISP) all report a FILE-LENGTH of 6 (which is the length of the file
in octets) although they read its contents as a string of length 5.

I understand that a "correct" behaviour of FILE-LENGTH - if you define
"correct" behaviour to mean that the function should return the number
of characters and not the number of octets - might have severe
performance impacts. (With a variable-length encoding like UTF8 you'll
have to read the whole file until you know its length in characters.)
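
To make the octet/character difference concrete, here is a minimal
sketch that uses only standard Common Lisp and avoids any
implementation-specific external format. The helper names
WRITE-OCTETS and UTF-8-CHARACTER-COUNT and the scratch file name are
made up for illustration, and the counting assumes well-formed UTF-8.
It opens the file as a binary stream, where FILE-LENGTH is defined to
count elements (here octets), and counts characters by scanning every
octet, which is exactly the cost mentioned above.

```lisp
;; Hypothetical helpers for illustration; not part of any library.
(defun write-octets (pathname octets)
  "Write the list OCTETS to PATHNAME as raw 8-bit bytes."
  (with-open-file (out pathname :direction :output
                                :element-type '(unsigned-byte 8)
                                :if-exists :supersede)
    (write-sequence octets out))
  pathname)

(defun utf-8-character-count (pathname)
  "Count characters in a well-formed UTF-8 file by scanning every octet.
Returns two values: the character count and the length in octets."
  (with-open-file (in pathname :element-type '(unsigned-byte 8))
    (let ((characters 0))
      (loop for octet = (read-byte in nil nil)
            while octet
            ;; Continuation octets have the bit pattern 10xxxxxx;
            ;; every other octet starts a new character.
            unless (= (ldb (byte 2 6) octet) #b10)
              do (incf characters))
      (values characters (file-length in)))))

;; "Äpfel" in UTF-8: the umlaut takes two octets (C3 84), the rest one each.
(let ((path (write-octets "aepfel-demo.txt"
                          '(#xC3 #x84 #x70 #x66 #x65 #x6C))))
  (utf-8-character-count path))   ; => 5, 6
```

What FILE-LENGTH returns when the same file is opened with an element
type of CHARACTER is the open question here; the sketch only shows
why a character count cannot be had without reading the file.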

Was it for this reason that the meaning of "length of the stream" was
left vague in the standard? Would it be conforming behaviour if
FILE-LENGTH returned 5 instead of 6 in this case? Have there been any
Lisps which didn't measure file lengths in units of octets? Would it
be correct to say that the return value of FILE-LENGTH for character
streams is implementation-dependent?

Just curious...

Thanks,
Edi.

From: Kent M Pitman
Subject: Re: FILE-LENGTH and external formats
Date: Mon, 08 Dec 2003 03:04:27 +0000
Message-ID: <sfwy8tohux0.fsf@shell01.TheWorld.com>

Edi Weitz <···@agharta.de> writes:

> Was it for this reason that the meaning of "length of the stream" was
> left vague in the standard? Would it be conforming behaviour if
> FILE-LENGTH returned 5 instead of 6 in this case? Have there been any
> Lisps which didn't measure file lengths in units of octets? Would it
> be correct to say that the return value of FILE-LENGTH for character
> streams is implementation-dependent?

I think it's possible, but not very useful, for it to return 5.  The
problem is that you often want FILE-POSITION to work to move you to a
randomly chosen position, and if FILE-LENGTH returns 5, there's
pressure for FILE-POSITION to not count the fact that some characters
are wider than another, which makes it impossible to do random access
into the file.

From: Rahul Jain
Subject: Re: FILE-LENGTH and external formats
Date: Mon, 08 Dec 2003 03:54:17 +0000
Message-ID: <87n0a4t15i.fsf@nyct.net>

Kent M Pitman <······@nhplace.com> writes:

> I think it's possible, but not very useful, for it to return 5.  The
> problem is that you often want FILE-POSITION to work to move you to a
> randomly chosen position, and if FILE-LENGTH returns 5, there's
> pressure for FILE-POSITION to not count the fact that some characters
> are wider than another, which makes it impossible to do random access
> into the file.

What if that causes you to move into the middle of a character?

What if you're seeking according to a fixed-number-of-characters format
and not according to a byte offset?

IMO, seeking by the number of actual characters is the only way it makes
sense to seek in a file you're treating as a stream of characters. The
fact that C probably does the wrong thing in this case is no excuse, as
it doesn't have characters, just "chars". :)

--
Rahul Jain

From: Barry Margolin
Subject: Re: FILE-LENGTH and external formats
Date: Mon, 08 Dec 2003 05:52:41 +0000
Message-ID: <barmar-DB3FE2.00524208122003@netnews.attbi.com>

In article <··············@nyct.net>, Rahul Jain <·····@nyct.net> 
wrote:

> Kent M Pitman <······@nhplace.com> writes:
> 
> > I think it's possible, but not very useful, for it to return 5.  The
> > problem is that you often want FILE-POSITION to work to move you to a
> > randomly chosen position, and if FILE-LENGTH returns 5, there's
> > pressure for FILE-POSITION to not count the fact that some characters
> > are wider than another, which makes it impossible to do random access
> > into the file.
> 
> What if that causes you to move into the middle of a character?
> 
> What if you're seeking according to a fixed-number-of-characters format
> and not according to a byte offset?
> 
> IMO, seeking by the number of actual characters is the only way it makes
> sense to seek in a file you're treating as a stream of characters. The
> fact that C probably does the wrong thing in this case is no excuse, as
> it doesn't have characters, just "chars". :)

Actually, I think C deals with this similarly to Lisp.  In a character 
file, you're only allowed to seek to the beginning, end, or a position 
that was previously returned by ftell().  Even ignoring things like 
multi-byte character sets, you need to do it this way to handle all the 
different newline conventions (e.g. writing "\n" to a text file on 
Windows actually writes CR and LF).

-- 
Barry Margolin, ······@alum.mit.edu
Woburn, MA

From: Madhu
Subject: Re: FILE-LENGTH and external formats
Date: Sun, 14 Dec 2003 17:57:03 +0000
Message-ID: <xfesmjn9tao.fsf@rhea.cs.unm.edu>

Helu

[Sorry for so late a follow-up]

* Kent M Pitman in <···············@shell01.TheWorld.com> :
| Edi Weitz <···@agharta.de> writes:
|
|> Was it for this reason that the meaning of "length of the stream" was
|> left vague in the standard? Would it be conforming behaviour if
|> FILE-LENGTH returned 5 instead of 6 in this case? Have there been any
|> Lisps which didn't measure file lengths in units of octets? Would it
|> be correct to say that the return value of FILE-LENGTH for character
|> streams is implementation-dependent?
|
| I think it's possible, but not very useful, for it to return 5.  The
| problem is that you often want FILE-POSITION to work to move you to a
| randomly chosen position, and if FILE-LENGTH returns 5, there's
| pressure for FILE-POSITION to not count the fact that some characters
| are wider than another, which makes it impossible to do random access
| into the file.


OTOH I've seen a lot of people write snippets like the following to
slurp files, (perhaps following a common pattern from other
languages):

(with-open-file (input pathname) 
  (let* ((n (file-length input))
         (buf (make-array n :element-type 'character)))
    (dotimes (i n)
      (setf (aref buf i) (read-char input)) ; or read-sequence here
      ... ))) 

Of course this fails often. Even more remarkably when the file has
CRLFs.  FILE-LENGTH returns a value that is always greater than the
number of characters that can be read.

I was wondering what the best solution to this problem was.

A suggestion not to do it? 

Or maybe a suggestion to use `buf' with :fill-pointer t?  and then
setting the fill-pointer to the actual number of characters read (via
read-char or by read-sequence)?

--
Regards
Madhu

From: Thomas F. Burdick
Subject: Re: FILE-LENGTH and external formats
Date: Sun, 14 Dec 2003 18:22:51 +0000
Message-ID: <xcvu14345tw.fsf@famine.OCF.Berkeley.EDU>

Madhu <·····@cs.unm.edu> writes:

> Helu
> 
> [Sorry for so late a follow-up]
> 
> * Kent M Pitman in <···············@shell01.TheWorld.com> :
> | Edi Weitz <···@agharta.de> writes:
> |
> |> Was it for this reason that the meaning of "length of the stream" was
> |> left vague in the standard? Would it be conforming behaviour if
> |> FILE-LENGTH returned 5 instead of 6 in this case? Have there been any
> |> Lisps which didn't measure file lengths in units of octets? Would it
> |> be correct to say that the return value of FILE-LENGTH for character
> |> streams is implementation-dependent?
> |
> | I think it's possible, but not very useful, for it to return 5.  The
> | problem is that you often want FILE-POSITION to work to move you to a
> | randomly chosen position, and if FILE-LENGTH returns 5, there's
> | pressure for FILE-POSITION to not count the fact that some characters
> | are wider than another, which makes it impossible to do random access
> | into the file.
> 
> 
> OTOH I've seen a lot of people write snippets like the following to
> slurp files, (perhaps following a common pattern from other
> languages):
> 
> (with-open-file (input pathname) 
>   (let* ((n (file-length input))
>          (buf (make-array n :element-type 'character)))
>     (dotimes (i n)
>       (setf (aref buf i) (read-char input)) ; or read-sequence here
>       ... ))) 
> 
> Of course this fails often.

Of course it does, and it fails in other languages, too.  I don't
think I've ever seen this in Lisp.  What I *have* seen, and what I use
myself is:

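  (defun slurp-file (filename)
    (with-open-file (in filename)
      ;; FILE-LENGTH is only an upper bound on the number of characters.
      (let* ((length (file-length in))
             (vector (make-array length :element-type 'character :fill-pointer t))
             (read (read-sequence vector in)))
        ;; Shrink the string to the characters actually read.
        (setf (fill-pointer vector) read)
        vector)))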

There can be no more characters in a file than the file's FILE-LENGTH,
but of course there can be fewer (I remember using DOS!).
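
For comparison, a sketch of a variant that never consults FILE-LENGTH
and so does not depend on how the implementation measures it. The
name SLURP-FILE-CHUNKED and the CHUNK-SIZE parameter are hypothetical,
and the file is opened with the implementation's default external
format, as in the function above.

```lisp
(defun slurp-file-chunked (filename &key (chunk-size 4096))
  "Return the contents of FILENAME as a string, reading CHUNK-SIZE
characters at a time instead of sizing a buffer from FILE-LENGTH."
  (with-open-file (in filename)
    (with-output-to-string (out)
      (let ((buffer (make-string chunk-size)))
        ;; READ-SEQUENCE returns the index of the first element it did
        ;; not fill, so 0 means end of file.
        (loop for end = (read-sequence buffer in)
              while (plusp end)
              do (write-string buffer out :end end))))))
```

The trade-off is some extra copying through the string output stream
in exchange for not allocating a buffer that may be larger than the
text.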

From: james anderson
Subject: Re: FILE-LENGTH and external formats
Date: Mon, 08 Dec 2003 12:03:25 +0000
Message-ID: <3FD4676B.E271C568@setf.de>

Edi Weitz wrote:
> 
> The dictionary entry for FILE-LENGTH says that "for a binary file, the
> length is measured in units of the element type of the stream." For
> character streams, I couldn't find a definition other than that the
> function "returns the length of [the] stream."
> 
> If I create a little UTF8-encoded file which consists of the string
> "Äpfel" (note the umlaut) and use OPEN on this file with an element
> type of CHARACTER and the correct external format then the
> Unicode-aware Lisps I have on my machine (LispWorks, AllegroCL, and
> CLISP) all report a FILE-LENGTH of 6 (which is the length of the file
> in octets) although they read its contents as a string of length 5.

what do they supply as the name component of the pathname for a file which has
characters in its name which require multi-byte sequences in utf-8 encoding? i
would presume the length reflects a string in which each character corresponds
to a 16-bit unicode value. otherwise the underlying os is not conformant.

any data which passes through an interface to an unicode stream must reflect a
similar constraint: it can neither accept nor return isolated 8-bit units
which are constituents of a multi-byte sequence. if it does, then the codec is
non-conformant. which means that an interface which permitted one to address
the middle of a multi-byte encoded sequence was not operating as a
unicode-conformant interface. (the conformance requirements are not as
explicit wrt surrogate pairs, but the implication is implicit.) i suppose an
interface could examine the stream content and back up as required prior to
actually performing the next read, but that still does not render the original
position meaningful. it would be, quite literally, random access.

the analogy to ftell/fseek is pertinent.

> 
> I understand that a "correct" behaviour of FILE-LENGTH - if you define
> "correct" behaviour to mean that the function should return the number
> of characters and not the number of octets - might have severe
> performance impacts. (With a variable-length encoding like UTF8 you'll
> have to read the whole file until you know its length in characters.)

perhaps either one must distinguish file length from stream length and file
position from stream position, or one must add the constraint, that the only
values which may be supplied as the second argument file-position for a given
stream are those which have been returned by it when applied to the same stream.

...

From: Kent M Pitman
Subject: Re: FILE-LENGTH and external formats
Date: Mon, 08 Dec 2003 17:59:15 +0000
Message-ID: <sfw65grw5qk.fsf@shell01.TheWorld.com>

james anderson <··············@setf.de> writes:

> perhaps either one must distinguish file length from stream length
> and file position from stream position,

FILE-LENGTH is certainly badly named.  I would change it if I could to
STREAM-LENGTH.  Nonetheless, I don't think a practical problem results,
and maybe even some good results from people having to think about this
issue.

> or one must add [the] constraint, that the only values which may be
> supplied as the second argument file-position for a given stream are
> those which have been returned by it when applied to the same
> stream.

implementations often do add such restrictions, which i agree are probably
the right thing, given the complexity of the circumstance.  in effect, 
one might say that FILE-POSITION gives one access to the stream at a 
'sequence break' where the stream model is internally consistent and hence
replayable, even though in general, as an implementation issue, it might
not be.
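
A minimal sketch of that discipline, assuming a file with at least
two lines and an implementation whose FILE-POSITION can report and
restore positions on character file streams (the standard allows it
to return NIL). The function name REREAD-FROM-MARK is made up for
illustration. The only value ever passed as the second argument is
one the same stream returned earlier.

```lisp
(defun reread-from-mark (pathname)
  "Skip the first line, remember the position, read the second line,
then seek back to the remembered position and read it again.
Returns the two readings of the second line as two values."
  (with-open-file (in pathname)
    (read-line in)                        ; skip the first line
    (let* ((mark (file-position in))      ; a position the stream itself reported
           (first-reading (read-line in nil "")))
      ;; FILE-POSITION returns true when the reposition succeeded.
      (unless (and mark (file-position in mark))
        (error "Could not reposition ~S" in))
      (values first-reading (read-line in nil "")))))
```

Whether MARK counts octets, characters, or something else does not
matter to this code, which is the point of treating positions as
opaque.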