Hello CLL! I have recently decided to start learning Lisp so I would
have more programming language experience than just C and C++. As
practice, I am working problem sets from my C school text, "Pointers on
C", but because of the fundamental differences between C and Lisp, I
find myself stuck on problem 1.3, which is as follows:
Write a program that reads characters from stdin and writes them to
stdout. Also calculate a checksum by summing the input and ignoring
overflow.
This is extremely simple in C, but I have no idea how to accomplish this
in Lisp. The best solution I can think of involves writing the input to
a file so I can open it in binary and read byte by byte. Is there a
better, more direct solution in Lisp?
-- MJF
M Jared Finder <·······@hpalace.com> writes:
> Hello CLL! I have recently decided to start learning Lisp so I would
> have more programming language experience then just C and C++. As
> practice, I am working problem sets from my C school text, "Pointers
> on C", but because of the fundamental differences between C and Lisp,
> I find myself stuck on problem 1.3, which is as follows:
>
> Write a program that reads characters from stdin and writes them to
> stdout. Also calculate a checksum by summing the input and ignoring
> overflow.
(loop for ch = (read-char t nil nil)   ;; T designates *TERMINAL-IO*
      while ch
      do (write-char ch)   ;; WRITE prints the readable form #\a; WRITE-CHAR (or PRINC) emits the character itself
      sum (char-code ch) into sum
      finally (return (mod sum 256)))  ;; integers are true integers in Lisp, so reduce mod 256 explicitly
> This is extremely simple in C, but I have no idea how to accomplish
> this in Lisp. The best solution I can think of involves writing the
> input to a file so I can open it in binary and read byte by byte. Is
> there a better, more direct solution in Lisp?
Your problem says: read characters, not read bytes!
--
__Pascal Bourguignon__ http://www.informatimago.com/
Our enemies are innovative and resourceful, and so are we. They never
stop thinking about new ways to harm our country and our people, and
neither do we.
M Jared Finder <·······@hpalace.com> writes:
> Hello CLL! I have recently decided to start learning Lisp so I would
> have more programming language experience then just C and C++. As
> practice, I am working problem sets from my C school text, "Pointers
> on C", but because of the fundamental differences between C and Lisp,
> I find myself stuck on problem 1.3, which is as follows:
>
> Write a program that reads characters from stdin and writes them to
> stdout. Also calculate a checksum by summing the input and ignoring
> overflow.
By the way, I don't see what this has to do with pointers???
#include <stdio.h>
int main(void){
    int sum=0;
    int ch;
    while(EOF!=(ch=getchar())){
        printf("%c",ch);
        sum+=ch;
    }
    printf("sum=%d\n",sum%256);
    return(0);
}
> This is extremely simple in C, but I have no idea how to accomplish
> this in Lisp. The best solution I can think of involves writing the
> input to a file so I can open it in binary and read byte by byte. Is
> there a better, more direct solution in Lisp?
>
> -- MJF
--
__Pascal Bourguignon__ http://www.informatimago.com/
Our enemies are innovative and resourceful, and so are we. They never
stop thinking about new ways to harm our country and our people, and
neither do we.
Pascal Bourguignon <····@mouse-potato.com> wrote in message news:<··············@thalassa.informatimago.com>...
> M Jared Finder <·······@hpalace.com> writes:
>
> > Hello CLL! I have recently decided to start learning Lisp so I would
> > have more programming language experience then just C and C++. As
> > practice, I am working problem sets from my C school text, "Pointers
> > on C", but because of the fundamental differences between C and Lisp,
> > I find myself stuck on problem 1.3, which is as follows:
> >
> > Write a program that reads characters from stdin and writes them to
> > stdout. Also calculate a checksum by summing the input and ignoring
> > overflow.
>
> By the way, I don't see what this has to do with pointers???
>
> #include <stdio.h>
> int main(void){
>     int sum=0;
>     int ch;
>     while(EOF!=(ch=getchar())){
>         printf("%c",ch);
>         sum+=ch;
>     }
>     printf("sum=%d\n",sum%256);
>     return(0);
> }
This may be nice code for a newbie textbook on C programming, but it
doesn't actually solve any real-world problem, like meeting the
requirements for a *portable* checksumming utility: one that you
could, for instance, compile for both Linux and Windows and use to
show that files correctly transferred from one system to the other
have the same checksum.
I already mentioned the various issues in my other article in this
thread, but here is one more specific to the above code. It's a
nitpick, in that it won't happen except on some rare C programming
platforms.
On some embedded platforms, the type int is one byte wide. That is to
say, sizeof (int) is 1. Bytes are 32 bits wide. If you use getchar()
to read from a binary stream, you get values in the range 0 to
UCHAR_MAX, converted to type int. The value EOF is negative, so it
normally does not interfere in this range. However, if sizeof (int) is
1, then the range 0 to UCHAR_MAX doesn't fit into that type. This
means that some legal byte values are mapped to negative quantities,
and one of those negative quantities will have the same value as EOF.
So, to be maximally portable, you can't just check for EOF and assume
the data has ended. Rather, whenever your loop encounters EOF, it must
check the state of the stream to see whether there is an end-of-file
or error indication.
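That defensive loop can be sketched like this (a hedged illustration, not a canonical solution; `checksum_stream` is a name invented here, and the mod-256 sum follows the exercise):

```c
#include <stdio.h>

/* Echo a stream to `out` (if non-NULL) and sum its bytes mod 256.
   When getc() returns EOF, consult the stream state instead of
   trusting the sentinel alone: on a platform where sizeof(int) == 1,
   EOF can collide with a legal byte value, and only feof()/ferror()
   can tell the difference. */
unsigned checksum_stream(FILE *in, FILE *out)
{
    unsigned sum = 0;
    for (;;) {
        int ch = getc(in);
        if (ch == EOF && (feof(in) || ferror(in)))
            break;
        if (out)
            putc(ch, out);
        sum = (sum + (unsigned char)ch) % 256;
    }
    return sum;
}
```

A main() would then just print checksum_stream(stdin, stdout).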
C is great for writing useless little textbook examples that are good
for selling the language to junior programmers, but when you are doing
actual engineering with real-world requirements, particularly ones
that include any kind of cross-platform portability, nothing is easy
any longer.
When we compare it to Lisp, we find that all the high level stuff is
easier in Lisp, and *correctly done* low-level stuff is *no harder*.
Where C shines is when you abuse the language, screw portability and
basically treat the compiler as an assembler. You write some code that
is undefined by ANSI, pass it through the compiler and verify that you
got the nice machine code you were looking for. Then when you upgrade
the compiler---or even invoke the same one in a different
configuration---you go through the same verification to check that
your undefined code still produces, by dumb luck, the right machine
code. When such an undefined hack, by dumb luck, produces working
code on two platforms, C is hailed as a ``portable assembler'',
usually accompanied by a round of denouncements of everything else
that produces inferior machine code.
M Jared Finder <·······@hpalace.com> wrote in message news:<··············@uni-berlin.de>...
> Hello CLL! I have recently decided to start learning Lisp so I would
> have more programming language experience then just C and C++. As
> practice, I am working problem sets from my C school text, "Pointers on
> C", but because of the fundamental differences between C and Lisp, I
> find myself stuck on problem 1.3, which is as follows:
>
> Write a program that reads characters from stdin and writes them to
> stdout. Also calculate a checksum by summing the input and ignoring
> overflow.
>
> This is extremely simple in C, but I have no idea how to accomplish this
> in Lisp.
*IS* it really simple in C? Consider that stdin is a text stream.
Standard C text streams may munge the input in various ways. If the
intent is to get a proper binary checksum on the original input octets
coming from the environment into stdin, then the problem is impossible
to solve. Sure you can write the program, but its answer will differ
from platform to platform, even if all the platforms use the ASCII
character set. On DOS and Windows, the CR LF sequence will read as a
single \n character, whereas on Unix-like systems, a CR and LF in the
input will read as \r\n in the C world. There may be other
transformations. Text streams can munge control characters and even
swallow trailing whitespace in a line!
You can use the freopen() function to bind a different file to the
existing stdin stream, but there is no portable way to take the
underlying standard input device and bind it to a new stream which has
the right properties.
On a POSIX platform you can use fdopen(STDIN_FILENO, "rb"), but then
on most POSIX platforms you don't have to since text and binary
streams are the same. The exception might be something like Cygwin
where the user can configure text files to be DOS-like.
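That POSIX escape hatch looks like this (a sketch; POSIX-only, not ANSI C, and on most POSIX systems the "b" is a no-op anyway):

```c
#include <stdio.h>
#include <unistd.h>   /* POSIX: STDIN_FILENO */

/* Rebind the standard input file descriptor to a fresh stream opened
   in binary mode, bypassing any DOS-style text translation (as on
   Cygwin configured for DOS-like text files).  Returns NULL on
   failure. */
FILE *binary_stdin(void)
{
    return fdopen(STDIN_FILENO, "rb");
}
```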
···@ashi.footprints.net (Kaz Kylheku) writes:
> [...]
>> Write a program that reads characters from stdin and writes them to
>> stdout. Also calculate a checksum by summing the input and ignoring
>> overflow.
>>
>> This is extremely simple in C, but I have no idea how to accomplish this
>> in Lisp.
>
> *IS* it really simple in C? Consider that stdin is a text streams.
> Standard C text streams may munge the input in various ways. If the
> intent is to get a proper binary checksum on the original input octets
> coming from the environment into stdin, then the problem is impossible
> to solve. Sure you can write the program, but its answer will differ
> from platform to platform, even if all the platforms use the ASCII
> character set. On DOS and Windows, the CR LF sequence will read as a
> single \n character, whereas on Unix-like systems, a CR and LF in the
> input will read as \r\n in the C world. There may be other
> transformations. Text streams can munge control characters and even
> swallow trailing whitespace in a line!
Still, a character-based checksum (as opposed to a checksum of the
binary bytes) isn't meaningless, as it will not depend on the
encoding, and will produce the same results whether the text is in
ASCII or EBCDIC, ISO-8859-1 or UTF-8, etc., etc.
Of course, implementing that isn't really extremely simple in C,
either...
---Vassil.
--
Vassil Nikolov <········@poboxes.com>
Hollerith's Law of Docstrings: Everything can be summarized in 72 bytes.
Vassil Nikolov <········@poboxes.com> writes:
> Still, a character-based checksum (as opposed to a checksum of the
> binary bytes) isn't meaningless, as it will not depend on the
> encoding, and will produce the same results whether the text is in
> ASCII or EBCDIC, ISO-8859-1 or UTF-8, etc., etc.
This is ludicrous nonsense. What is capital A plus lowercase u? You
don't add characters, you add numbers. So, no, it IS meaningless. You
need a mapping from character to number, and lo and behold, that's what
an encoding is. Actually, UTF-8 is an encoding of an encoding. Please
learn what it is you're trying to "prove" before you say that something
is "difficult" and imply that it is "impossible" due to some imagined
"deficiency". 'A' on an ASCII C is not equal to 'A' on an EBCDIC C. See
also: CHAR-CODE, if you want to get the wrong answer, but some answer
nevertheless.
--
Rahul Jain
·····@nyct.net
Professional Software Developer, Amateur Quantum Mechanicist
Rahul Jain <·····@nyct.net> writes:
> Vassil Nikolov <········@poboxes.com> writes:
>
> > Still, a character-based checksum (as opposed to a checksum of the
> > binary bytes) isn't meaningless, as it will not depend on the
> > encoding, and will produce the same results whether the text is in
> > ASCII or EBCDIC, ISO-8859-1 or UTF-8, etc., etc.
>
> This is ludicrous nonsense. What is capital a plus lowercase u? You
> don't add characters, you add numbers. So, no, it IS meaningless. You
> need a mapping from character to number, and lo and behold, that's what
> an encoding is. Actually, UTF-8 is an encoding of an encoding. Please
> learn what it is you're trying to "prove" before you say that something
> is "difficult" and implying that it is "impossible" due to some imagined
> "deficiency". 'A' on an ASCII C is not equal to 'A' on an EBCDIC C. See
> also: CHAR-CODE, if you want to get the wrong answer, but some answer
> nevertheless.
No, you misunderstood what he meant by "character-based checksum".
Of course, you don't sum 'A' and 'c', you have to choose an encoding,
depending on the properties you want for your checksum. (for example:
Are spaces and new-lines significant? if not, set their encoding value
to 0; or must it be case insensitive? in which case encoding('A') ==
encoding('a')). Actually, once you've "encoded" the text, you'd have
to forward it to something like MD5 rather than just checksumming,
because "ON" is quite different from "NO", and probably you don't want
them to checksum.
(Besides, in C, 'A'+'u' == 65+117 == 182 == '¶' (ISO-8859-1))
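The property-driven encoding described above can be sketched in C (`canonical_value` is a hypothetical helper, assuming an ASCII-compatible execution character set): case is folded and whitespace maps to 0, so 'A' and 'a', or a space and a newline, contribute identically to a checksum.

```c
#include <ctype.h>

/* One possible canonicalization of the kind described above:
   whitespace is insignificant (maps to 0) and comparison is
   case-insensitive.  The choice of properties, not the table,
   is the real design decision. */
int canonical_value(int ch)
{
    if (isspace(ch))
        return 0;
    return tolower(ch);
}
```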
--
__Pascal Bourguignon__ http://www.informatimago.com/
Our enemies are innovative and resourceful, and so are we. They never
stop thinking about new ways to harm our country and our people, and
neither do we.
Rahul Jain <·····@nyct.net> writes:
> [...]
> 'A' on an ASCII C is not equal to 'A' on an EBCDIC C.
That obviously depends on the notion of equality that applies.
Of course, I may be mistaken in thinking that 'A' stands for the
first letter of the Latin alphabet.
---Vassil.
--
Vassil Nikolov <········@poboxes.com>
Hollerith's Law of Docstrings: Everything can be summarized in 72 bytes.
>>>>> On Thu, 26 Aug 2004 20:52:57 -0400, Vassil Nikolov ("Vassil") writes:
Vassil> Rahul Jain <·····@nyct.net> writes:
>> [...]
>> 'A' on an ASCII C is not equal to 'A' on an EBCDIC C.
Vassil> That obviously depends on the notion of equality that applies.
Vassil> Of course, I may be mistaken in thinking that 'A' stands for the
Vassil> first letter of the Latin alphabet.
I thought the whole point was that they were discussing character-set
indices.
Vassil Nikolov <········@poboxes.com> writes:
> Rahul Jain <·····@nyct.net> writes:
>
>> [...]
>> 'A' on an ASCII C is not equal to 'A' on an EBCDIC C.
>
>
> That obviously depends on the notion of equality that applies.
The one given by C would be the easiest one to discuss in this context.
> Of course, I may be mistaken in thinking that 'A' stands for the
> first letter of the Latin alphabet.
Of course, that's what I mean. But with different character encodings,
you get different numbers.
--
Rahul Jain
·····@nyct.net
Professional Software Developer, Amateur Quantum Mechanicist
Rahul Jain <·····@nyct.net> writes:
> Vassil Nikolov <········@poboxes.com> writes:
>
>> Rahul Jain <·····@nyct.net> writes:
>>
>>> [...]
>>> 'A' on an ASCII C is not equal to 'A' on an EBCDIC C.
>>
>>
>> That obviously depends on the notion of equality that applies.
>
> The one given by C would be the easiest one to discuss in this context.
>
>> Of course, I may be mistaken in thinking that 'A' stands for the
>> first letter of the Latin alphabet.
>
> Of course, that's what I mean. But with different character encodings,
> you get different numbers.
...for the binary encoded values, yes. But I (may) want the numbers
to depend only on the characters, not on the encoding.
I should perhaps have written this in the first place:
Let's consider a program that does the following: the input consists
of a plain text file and a specification of its encoding, and the
output is a checksum (say, a simple CRC checksum) of the contents of
the file such that it only depends on the characters in it, but not
on the encoding used. (So, for example, if the file contains the
English text "the quick brown fox jumps over the lazy dog", this
checksum will be the same whether the encoding is, say, ISO-8859-1,
EBCDIC, or UTF-16LE.)
I claim that it makes sense to have such a program, and that
implementing it in ANSI C is not as simple as it could be (it
becomes quite simple if we allow platform-specific libraries,
though). It's true that ANSI Common Lisp does not guarantee a
simple implementation, but at least it makes one possible, with the
:EXTERNAL-FORMAT argument to OPEN, given a Common Lisp
implementation that supports a large enough set of encodings (not
too unreasonable to expect).
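As the parenthetical says, the program becomes quite simple once a platform-specific library is allowed. A sketch with POSIX iconv(3) (the helper name `codepoint_checksum` and the mod-256 sum, rather than a CRC, are illustrative choices, not the specification above): decode the bytes from their declared encoding to Unicode code points, then checksum the code points, so the result depends only on the characters.

```c
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>

/* Decode `len` bytes from the named encoding into UTF-32LE code
   points and sum the code points mod 256.  The result depends only
   on the characters, not on the input encoding.  Returns -1 on
   error. */
long codepoint_checksum(const char *bytes, size_t len, const char *encoding)
{
    iconv_t cd = iconv_open("UTF-32LE", encoding);
    if (cd == (iconv_t)-1)
        return -1;
    unsigned sum = 0;
    char *inp = (char *)bytes;
    size_t inleft = len;
    while (inleft > 0) {
        uint32_t out[64];
        char *outp = (char *)out;
        size_t outleft = sizeof out;
        size_t r = iconv(cd, &inp, &inleft, &outp, &outleft);
        size_t produced = (sizeof out - outleft) / sizeof out[0];
        for (size_t i = 0; i < produced; i++)
            sum = (sum + out[i]) % 256;
        if (r == (size_t)-1 && produced == 0) {  /* no progress: bad input */
            iconv_close(cd);
            return -1;
        }
    }
    iconv_close(cd);
    return (long)sum;
}
```

The same text checksums identically whether it arrives as ISO-8859-1 or as UTF-16LE bytes.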
---Vassil.
--
Vassil Nikolov <········@poboxes.com>
Hollerith's Law of Docstrings: Everything can be summarized in 72 bytes.
Vassil Nikolov <········@poboxes.com> writes:
> Let's consider a program that does the following: the input consists
> of a plain text file and a specification of its encoding,
How will you describe the numeric values of all possible characters in
each of these encoding files?
What do you do where the encodings partially overlap? Do you require
that checksums of strings containing certain characters must be
consistent across all encodings that happen to contain that set of
characters?
How will you make sure of this for all possible encodings? What if
there's an encoding that contains all the lowercase Greek letters,
all the uppercase Roman letters, and uppercase calligraphic letters?
How do you make sure that you've allowed your algorithm to cope with
such an encoding?
How on earth is that file that describes the encoding supposed to look?
Will it contain names? What if one description of an encoding uses
"dieresis" and another uses "umlaut"? Those two groups of characters
should affect checksum the same way.
Please spend some time thinking about the question that you're asking
instead of asking for a theoretical answer to a practical impossibility.
Then again, if someone does come up with such an algorithm, I will
consider him/her to be a mathematical genius.
All you've said so far is that Lisp won't let you implement an algorithm
that you can't describe, but you want this because C can do it?
--
Rahul Jain
·····@nyct.net
Professional Software Developer, Amateur Quantum Mechanicist
Rahul Jain <·····@nyct.net> writes:
> Vassil Nikolov <········@poboxes.com> writes:
>
>> Let's consider a program that does the following: the input consists
>> of a plain text file and a specification of its encoding,
>
> How will you describe the numeric values of all poosible characters in
> each of these encoding files?
A "specification of its encoding" in the sense of "something that
identifies its encoding", not "something that defines its encoding".
I made a mistake using "specification" here, sorry.
> [...]
> Please spend some time thinking about the question that you're asking
> instead of asking for a theoretical answer to a practical impossibility.
I have not asked any questions in this thread, I have only made some
statements.
> [...]
> All you've said so far is that Lisp won't let you implement an algorithm
> that you can't describe, but you want this because C can do it?
I don't see how I have said that.
---Vassil.
--
Vassil Nikolov <········@poboxes.com>
Hollerith's Law of Docstrings: Everything can be summarized in 72 bytes.
>> Let's consider a program that does the following: the input consists
>> of a plain text file and a specification of its encoding,
> How will you describe the numeric values of all poosible characters in
> each of these encoding files?
[etc...]
Well, you could start by using Emacs's MULE (now part of Emacs 21),
which seems to handle the issue pretty well.
joelh
Vassil Nikolov <········@poboxes.com> writes:
> Rahul Jain <·····@nyct.net> writes:
>
> > Vassil Nikolov <········@poboxes.com> writes:
> >
> >> Rahul Jain <·····@nyct.net> writes:
> >>
> >>> [...]
> >>> 'A' on an ASCII C is not equal to 'A' on an EBCDIC C.
> >>
> >>
> >> That obviously depends on the notion of equality that applies.
> >
> > The one given by C would be the easiest one to discuss in this context.
> >
> >> Of course, I may be mistaken in thinking that 'A' stands for the
> >> first letter of the Latin alphabet.
> >
> > Of course, that's what I mean. But with different character encodings,
> > you get different numbers.
>
>
> ...for the binary encoded values, yes. But I (may) want the numbers
> to depend only on the characters, but not on the encoding.
>
> I should perhaps have written this in the first place:
>
> Let's consider a program that does the following: the input consists
> of a plain text file and a specification of its encoding, and the
> output is a checksum (say, a simple CRC checksum) of the contents of
> the file such that it only depends on the characters in it, but not
> on the encoding used. (So, for example, if the file contains the
> English text "the quick brown fox jumps over the lazy dog", this
> checksum will be the same whether the encoding is, say, ISO-8859-1,
> EBCDIC, or UTF-16LE.)
>
> I claim that it makes sense to have such a program, and that
> implementing it in ANSI C is not as simple as it could be (it
> becomes quite simple if we allow platform-specific libraries,
> though). It's true that ANSI Common Lisp does not guarantee a
> simple implementation, but at least it makes one possible, with the
> :EXTERNAL-FORMAT argument to OPEN, given a Common Lisp
> implementation that supports a large enough set of encodings (not
> too unreasonable to expect).
It's exactly the same in common-lisp and in C. In all languages, you
have to do something like:
(with-open-file (in path :direction :input :if-does-not-exist :error)
  (loop for ch = (read-char in nil nil)
        while ch
        collect (position ch " !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~") into bytes
        finally (return (md5sum (delete nil bytes)))))
(with more processing to clean up the text and ignore non significant spaces).
At least, in Common-Lisp, you have a minimum set of characters you can
count on, but in all languages, if you don't want to rely on the
implementation encoding, you have to know in advance the _characters_.
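The same fixed-numbering trick can be played in C (a sketch; `char_index` is a name made up here, and the table assumes only that these characters exist in the execution character set, not which codes they have):

```c
#include <string.h>

/* C analogue of the POSITION call above: map a character to its
   index in a fixed table of the standard characters, independent of
   the codes the execution character set assigns them.  Returns -1
   for characters outside the table. */
static const char table[] =
    " !\"#$%&'()*+,-./0123456789:;<=>?@"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`"
    "abcdefghijklmnopqrstuvwxyz{|}~";

int char_index(char ch)
{
    const char *p = (ch != '\0') ? strchr(table, ch) : NULL;
    return p ? (int)(p - table) : -1;
}
```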
--
__Pascal Bourguignon__ http://www.informatimago.com/
Our enemies are innovative and resourceful, and so are we. They never
stop thinking about new ways to harm our country and our people, and
neither do we.
Pascal Bourguignon <····@mouse-potato.com> writes:
> Vassil Nikolov <········@poboxes.com> writes:
> [...]
>> Let's consider a program that does the following: the input consists
>> of a plain text file and a specification of its encoding, and the
>> output is a checksum (say, a simple CRC checksum) of the contents of
>> the file such that it only depends on the characters in it, but not
>> on the encoding used. (So, for example, if the file contains the
>> English text "the quick brown fox jumps over the lazy dog", this
>> checksum will be the same whether the encoding is, say, ISO-8859-1,
>> EBCDIC, or UTF-16LE.)
>>
>> I claim that it makes sense to have such a program, and that
>> implementing it in ANSI C is not as simple as it could be (it
>> becomes quite simple if we allow platform-specific libraries,
>> though). It's true that ANSI Common Lisp does not guarantee a
>> simple implementation, but at least it makes one possible, with the
>> :EXTERNAL-FORMAT argument to OPEN, given a Common Lisp
>> implementation that supports a large enough set of encodings (not
>> too unreasonable to expect).
>
> It's exactly the same in common-lisp and in C. In all languages, you
> have to do something like:
>
> (with-open-file (in path :direction :input :if-does-not-exist :error)
>   (loop for ch = (read-char in nil nil)
>         while ch
>         collect (position ch " !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~") into bytes
>         finally (return (md5sum (delete nil bytes)))))
This would work only if the encoding of the file can be
automatically recognized, and I don't think it is reasonable to
expect that, so it is necessary to supply :EXTERNAL-FORMAT. If the
implementation handles it (and I believe it is reasonable to expect
_that_), then CHAR-CODE would suffice; it won't be necessary to
enumerate the characters.
> [...]
> At least, in Common-Lisp, you have a minimum set of characters you can
> count on, but in all languages, if you don't want to rely on the
> implementation encoding, you have to know in advance the _characters_.
But I am willing to rely on the implementation having the necessary
translation tables (since many implementations do), rather than
enumerate the characters.
---Vassil.
--
Vassil Nikolov <········@poboxes.com>
Hollerith's Law of Docstrings: Everything can be summarized in 72 bytes.
Vassil Nikolov <········@poboxes.com> writes:
> Pascal Bourguignon <····@mouse-potato.com> writes:
> > It's exactly the same in common-lisp and in C. In all languages, you
> > have to do something like:
> >
> > (with-open-file (in path :direction :input :if-does-not-exist :error)
> >   (loop for ch = (read-char in nil nil)
> >         while ch
> >         collect (position ch " !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~") into bytes
> >         finally (return (md5sum (delete nil bytes)))))
>
>
> This would work only if the encoding of the file can be
> automatically recognized, and I don't think it is reasonable to
> expect that, so it is necessary to supply :EXTERNAL-FORMAT.
And how do you know what external format to use?
This is an orthogonal question.
> If the implementation handles it (and I believe it is reasonable to expect
> _that_), then CHAR-CODE would suffice, it won't be necessary to
> enumerate the characters.
Of course not. The implementation would deliver a number according
to the encoding IT uses. You want a fixed universal numbering to
compute the signature. Therefore you have to explicitly convert from
characters to numbers (POSITION does that).
> > [...]
> > At least, in Common-Lisp, you have a minimum set of characters you can
> > count on, but in all languages, if you don't want to rely on the
> > implementation encoding, you have to know in advance the _characters_.
>
>
> But I am willing to rely on the implementation having the necessary
> translation tables (since many implementations do), rather than
> enumerate the characters.
That's another matter. We're not decoding bytes from files, we're
encoding characters in memory!
--
__Pascal Bourguignon__ http://www.informatimago.com/
Our enemies are innovative and resourceful, and so are we. They never
stop thinking about new ways to harm our country and our people, and
neither do we.
Pascal Bourguignon <····@mouse-potato.com> writes:
> Vassil Nikolov <········@poboxes.com> writes:
>> [...]
>> This would work only if the encoding of the file can be
>> automatically recognized, and I don't think it is reasonable to
>> expect that, so it is necessary to supply :EXTERNAL-FORMAT.
>
> And how do you know what external format to use?
>
> This is an orthogonal question.
I don't understand the question. If a plain text file is to be
useful, one has to know its encoding (well, sometimes one can just
guess). With MIME headers, for example, this information is available
in the Content-Type header. But I am not sure this is what the
question is about...
>> If the implementation handles it (and I believe it is reasonable to expect
>> _that_), then CHAR-CODE would suffice, it won't be necessary to
>> enumerate the characters.
>
> Of course not. The implementation would deliver a number accordingly
> to the encoding IT uses. You want a fixed universal numbering to
> compute the signature. Therefore you have to explicitely convert from
> characters to numbers (POSITION does that).
In principle, yes, but in practice, I may well be content with the
code points that the implementation uses internally, especially if
it uses Unicode code points, as many do.
>> > [...]
>> > At least, in Common-Lisp, you have a minimum set of characters you can
>> > count on, but in all languages, if you don't want to rely on the
>> > implementation encoding, you have to know in advance the _characters_.
>>
>>
>> But I am willing to rely on the implementation having the necessary
>> translation tables (since many implementations do), rather than
>> enumerate the characters.
>
> That's another matter. We're not decoding bytes from files, we're
> encoding characters in memory!
Well, I am letting the implementation decode the bytes from a file
(telling it, with :EXTERNAL-FORMAT, what decoding to apply), and
then I am using whatever code points it uses internally for those
characters.
---Vassil.
--
Vassil Nikolov <········@poboxes.com>
Hollerith's Law of Docstrings: Everything can be summarized in 72 bytes.
Vassil Nikolov <········@poboxes.com> writes:
> Pascal Bourguignon <····@mouse-potato.com> writes:
>
> > Vassil Nikolov <········@poboxes.com> writes:
> >> [...]
> >> This would work only if the encoding of the file can be
> >> automatically recognized, and I don't think it is reasonable to
> >> expect that, so it is necessary to supply :EXTERNAL-FORMAT.
> >
> > And how do you know what external format to use?
> >
> > This is an orthogonal question.
>
>
> I don't understand the question. If a plain text file is to be
> useful, one has to know that (well, sometimes one can just guess).
> With MIME headers, for example, this information is available in the
> Content-Type header. But I am not sure this is what the question is
> about...
About the "you". That may be the user, the programmer, or the program itself.
How can the program "just guess" what encoding is used in a file?
But first note that my expression works equally well in all
Common-Lisp implementations:
> > (with-open-file (in path :direction :input :if-does-not-exist :error)
> >   (loop for ch = (read-char in nil nil)
> >         while ch
> >         collect (position ch " !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~") into bytes
> >         finally (return (md5sum (delete nil bytes)))))
That's because the implementation will do the right thing, and because
I consider only the standard character set (not an encoding; I leave it
up to the implementation and the OS to decide what encoding they use
to store TEXT files on their system).
Finally, assume that you're on an ASCII system, that you happen to
have an EBCDIC file, and that your implementation is able to read
EBCDIC files. All right, then you could write the _implementation_
_specific_ expression:
(with-open-file (in path :direction :input :if-does-not-exist :error
                         :external-format :ebcdic)
  (loop for ch = (read-char in nil nil)
        while ch
        collect (position ch " !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~") into bytes
        finally (return (md5sum (delete nil bytes)))))
So I guess I'm conceding you this point, but I would rather implement
my own encoding/decoding because the use of :external-format is not
portable: you could not use the same expression on another
implementation and expect it to work.
> > That's another matter. We're not decoding bytes from files, we're
> > encoding characters in memory!
>
>
> Well, I am letting the implementation decode the bytes from a file
> (telling it, with :EXTERNAL-FORMAT, what decoding to apply), and
> then I am using whatever code points it uses internally for those
> characters.
If your purpose is to compute a signature of the text that will be the
same on all copies of this text on any system with any encoding, then
it won't work if you use the implementation's code points. You have to
do your own encoding (which I crudely do with POSITION).
--
__Pascal Bourguignon__ http://www.informatimago.com/
Our enemies are innovative and resourceful, and so are we. They never
stop thinking about new ways to harm our country and our people, and
neither do we.
Pascal Bourguignon <····@mouse-potato.com> writes:
> Vassil Nikolov <········@poboxes.com> writes:
>
>> Pascal Bourguignon <····@mouse-potato.com> writes:
>>
>> > Vassil Nikolov <········@poboxes.com> writes:
>> >> [...]
>> >> This would work only if the encoding of the file can be
>> >> automatically recognized, and I don't think it is reasonable to
>> >> expect that, so it is necessary to supply :EXTERNAL-FORMAT.
>> >
>> > And how do you know what external format to use?
>> >
>> > This is an orthogonal question.
>>
>>
>> I don't understand the question. If a plain text file is to be
>> useful, one has to know that (well, sometimes one can just guess).
>> With MIME headers, for example, this information is available in the
>> Content-Type header. But I am not sure this is what the question is
>> about...
>
> About the "you". That may be the user, the programmer or the program it self.
> How can the program "just guess" what encoding is used in a file?
I meant the user; but since you mention it, some heuristics can be
done programmatically, too (to aid the user in selecting an
encoding). (I am thinking about character frequencies, or things
like "if every second octet is zero, then it is encoded in UTF-16",
but there may be others as well.)
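One of those heuristics, sketched (the 40% threshold is an arbitrary illustrative choice, not a standard):

```c
#include <stddef.h>

/* Toy detector of the kind mentioned above: in UTF-16-encoded
   ASCII-range text, every second octet is zero, so a large share of
   zero octets suggests UTF-16 rather than an 8-bit encoding. */
int looks_like_utf16(const unsigned char *buf, size_t len)
{
    size_t zeros = 0;
    for (size_t i = 0; i < len; i++)
        if (buf[i] == 0)
            zeros++;
    return len > 0 && zeros * 10 >= len * 4;
}
```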
> But first note that my expression works equally well in all
> Common-Lisp implementations:
>
>> > (with-open-file (in path :direction :input :if-does-not-exist :error)
>> >   (loop for ch = (read-char in nil nil)
>> >         while ch
>> >         collect (position ch " !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~") into bytes
>> >         finally (return (md5sum (delete nil bytes)))))
>
> That's because the implementation will do the right thing, and because
> I consider only the standard character set (not encoding, I leave that
> up to the implementation and the OS to decide what encoding they use
> to store TEXT file on this system).
I don't agree.
Even if only the standard character set is considered (and that's a
big if), and even if we assume that we don't have to worry about
EBCDIC files on ASCII systems, there are a few different encodings
that need to be distinguished: besides ASCII and the
ASCII-compatible 8-bit encodings, there are UTF-16 and UTF-7, for
example, not to mention multi-byte encodings such as ISO-2022-JP
that would also yield spurious ASCII characters if not decoded
properly. So even in that case I can't really leave that to the
implementation and OS (I don't expect them to be able to recognize
the encoding of a plain text file, and unlike an interactive
utility, I wouldn't like to let them guess).
> Finally, assuming that you're on an ASCII system and you happen to
> have an EBCDIC file, and an implementation that is able to read EBCDIC
> files. All right, then you could write the _implementation_ _specific_
> expression:
>
> (with-open-file (in path :direction :input :if-does-not-exist :error
>                          :external-format :ebcdic)
>   (loop for ch = (read-char in nil nil)
>         while ch
>         collect (position ch " !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~") into bytes
>         finally (return (md5sum (delete nil bytes)))))
>
> So I guess I'm conceding you this point, but I would rather implement
> my own encoding/decoding because the use of :external-format is not
> portable: you could not use the same expression on another
> implementation and expect it to work.
Of course; but in practice, porting a program that uses
:EXTERNAL-FORMAT will involve less work (I believe much less work)
than implementing the whole encoding/decoding business.
>> > [...]
>> Well, I am letting the implementation decode the bytes from a file
>> (telling it, with :EXTERNAL-FORMAT, what decoding to apply), and
>> then I am using whatever code points it uses internally for those
>> characters.
>
> If your purpose is to compute a signature of the text that will be the
> same on all copies of this text on any system with any encoding then
> it won't work if you use the implementations code points. You have to
> do your own encoding (which I crudely do with POSITION).
But my purpose is not necessarily that ambitious. I might be content
with using only systems where the same Common Lisp implementation is
available, or where a Common Lisp implementation that uses Unicode
code points internally is available.
---Vassil.
--
Vassil Nikolov <········@poboxes.com>
Hollerith's Law of Docstrings: Everything can be summarized in 72 bytes.