dangers of read-sequence

From: André Thieme
Subject: dangers of read-sequence
Date: Sat, 02 Jul 2005 20:45:25 +0000
Message-ID: <da6u87$hdj$1@ulric.tng.de>

I tried to count the number of lines in a file.

(dotimes (i 10)
  (setf buffer (make-string (* 1024 (1+ i))))
  (with-open-file (in "C:/somefile.txt" :direction :input)
    (let ((lines 0))
      (loop for bytes = (read-sequence buffer in)
            while (plusp bytes) do (incf lines (count #\Newline buffer)))
      (format t "run ~a: ~a lines in file~%" (1+ i) lines))))

run 1: 78606 lines in file
run 2: 39314 lines in file
run 3: 26202 lines in file
run 4: 19668 lines in file
run 5: 15730 lines in file
run 6: 13112 lines in file
run 7: 11242 lines in file
run 8: 9834 lines in file
run 9: 8734 lines in file
run 10: 7876 lines in file
NIL


read-sequence seems not to be the way to go.
What happened?

(I tried it with many different files. Most did not give results that
diverged this much, but every file had at least one run that reported a
different number of lines than the other runs.)


André
--


From: Peter Seibel
Subject: Re: dangers of read-sequence
Date: Sat, 02 Jul 2005 21:18:32 +0000
Message-ID: <m28y0oewk8.fsf@beagle.local>

André Thieme <······························@justmail.de> writes:

> I tried to count the number of lines in a file.
>
> (dotimes (i 10)
>   (setf buffer (make-string (* 1024 (1+ i))))
>   (with-open-file (in "C:/somefile.txt" :direction :input)
>     (let ((lines 0))
>       (loop for bytes = (read-sequence buffer in)
>             while (plusp bytes) do (incf lines (count #\Newline buffer)))
>       (format t "run ~a: ~a lines in file~%" (1+ i) lines))))
>
> run 1: 78606 lines in file
> run 2: 39314 lines in file
> run 3: 26202 lines in file
> run 4: 19668 lines in file
> run 5: 15730 lines in file
> run 6: 13112 lines in file
> run 7: 11242 lines in file
> run 8: 9834 lines in file
> run 9: 8734 lines in file
> run 10: 7876 lines in file
> NIL
>
>
> read-sequence seems not to be the way to go.
> What happened?

What happens if you change:

  (incf lines (count #\Newline buffer))

to:

  (incf lines (count #\Newline buffer :end bytes))

I can't quite figure out how to explain the actual results you saw but
this change does fix one bug--as it stands now, if the last read reads
less than a full buffer you end up counting any new lines left over in
the part of the buffer not updated by the last read twice.
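
A minimal sketch of the loop with that change applied, written as a
function over any character input stream. The names COUNT-NEWLINES and
BUFFER-SIZE are illustrative and do not come from the original post; the
usage example reads from a string stream so that no file is needed.

```lisp
;; Sketch: count #\Newline characters by reading a stream in chunks.
;; COUNT-NEWLINES and BUFFER-SIZE are hypothetical names for illustration.
(defun count-newlines (stream &key (buffer-size 1024))
  (let ((buffer (make-string buffer-size))
        (lines 0))
    ;; READ-SEQUENCE returns the index of the first element it did not
    ;; update, so only the range [0, END) holds fresh data.
    (loop for end = (read-sequence buffer stream)
          while (plusp end)
          do (incf lines (count #\Newline buffer :end end)))
    lines))

;; Six characters, three newlines, read with a four-character buffer.
(with-input-from-string (in (format nil "a~%b~%c~%"))
  (count-newlines in :buffer-size 4))
;; => 3
```

Without the :end argument the second pass would also count the stale
newline left over from the first read and report 4.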

-Peter

-- 
Peter Seibel           * ·····@gigamonkeys.com
Gigamonkeys Consulting * http://www.gigamonkeys.com/
Practical Common Lisp  * http://www.gigamonkeys.com/book/

From: Edi Weitz
Subject: Re: dangers of read-sequence
Date: Sat, 02 Jul 2005 21:22:46 +0000
Message-ID: <umzp4ucm1.fsf@agharta.de>

On Sat, 02 Jul 2005 21:18:32 GMT, Peter Seibel <·····@gigamonkeys.com> wrote:

> I can't quite figure out how to explain the actual results you saw

The results look so strange because he changes the size of the buffer
each time he runs through the inner loop.

> but this change does fix one bug--as it stands now, if the last read
> reads less than a full buffer you end up counting any new lines left
> over in the part of the buffer not updated by the last read twice.

Right.  I think that was the only bug.

Cheers,
Edi.

-- 

Lisp is not dead, it just smells funny.

Real email: (replace (subseq ·········@agharta.de" 5) "edi")

From: Peter Seibel
Subject: Re: dangers of read-sequence
Date: Sat, 02 Jul 2005 21:46:38 +0000
Message-ID: <m24qbcev9e.fsf@beagle.local>

Edi Weitz <········@agharta.de> writes:

> On Sat, 02 Jul 2005 21:18:32 GMT, Peter Seibel <·····@gigamonkeys.com> wrote:
>
>> I can't quite figure out how to explain the actual results you saw
>
> The results look so strange because he changes the size of the buffer
> each time he runs through the inner loop.

Yeah, I noticed that. But that doesn't explain the numbers he got--the
bug I fixed only accounts for some number of extra newlines caused by
counting some newlines from the penultimate READ-SEQUENCE twice. So
the larger the buffer the larger the number of extra newlines you
could get (up to the point where the buffer is large enough to hold
the whole file, when a related bug creeps in:
because he didn't specify an :initial-element in the call to
MAKE-STRING, who knows what junk the string was initially filled
with.) But he was going from smaller to larger buffers so the number
of newlines counted should have been going up, not down.
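
To make that reasoning concrete, here is a small sketch that counts the
same text with several buffer sizes, with and without the :end bound. The
function name COUNT-NEWLINES-IN-CHUNKS and its HONOR-END keyword are
made up for this illustration, and the buffer is filled with spaces up
front so that uninitialized contents play no part.

```lisp
;; Sketch: compare the buggy count (whole buffer) with the bounded count.
;; COUNT-NEWLINES-IN-CHUNKS and HONOR-END are hypothetical names.
(defun count-newlines-in-chunks (text buffer-size &key honor-end)
  (let ((buffer (make-string buffer-size :initial-element #\Space))
        (lines 0))
    (with-input-from-string (in text)
      (loop for end = (read-sequence buffer in)
            while (plusp end)
            do (incf lines (if honor-end
                               (count #\Newline buffer :end end)
                               (count #\Newline buffer)))))
    lines))

;; Ten characters, five newlines.
(let ((text (format nil "a~%b~%c~%d~%e~%")))
  (loop for size in '(2 4 6 10)
        collect (list size
                      (count-newlines-in-chunks text size)
                      (count-newlines-in-chunks text size :honor-end t))))
;; => ((2 5 5) (4 6 5) (6 6 5) (10 5 5))
```

The unbounded count is only ever too high, and only when the last read
is a partial one, which fits the small upward scatter rather than the
large downward trend in the first set of numbers.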

>> but this change does fix one bug--as it stands now, if the last read
>> reads less than a full buffer you end up counting any new lines left
>> over in the part of the buffer not updated by the last read twice.
>
> Right.  I think that was the only bug.

I'm not so sure. But who knows.

-Peter

-- 
Peter Seibel           * ·····@gigamonkeys.com
Gigamonkeys Consulting * http://www.gigamonkeys.com/
Practical Common Lisp  * http://www.gigamonkeys.com/book/

From: André Thieme
Subject: Re: dangers of read-sequence
Date: Sat, 02 Jul 2005 22:13:46 +0000
Message-ID: <da73dr$lkq$1@ulric.tng.de>

Peter Seibel schrieb:

> Yeah, I noticed that. But that doesn't explain the numbers he got--the
> bug I fixed only accounts for some number of extra newlines caused by
> counting some newlines from the penultimate READ-SEQUENCE twice.

After restarting my lisp the same code produced these results:

run 1: 8604 lines in file
run 2: 8610 lines in file
run 3: 8604 lines in file
run 4: 8611 lines in file
run 5: 8607 lines in file
run 6: 8608 lines in file
run 7: 8609 lines in file
run 8: 8612 lines in file
run 9: 8605 lines in file
run 10: 8619 lines in file

So this extremely large divergence probably was some strange bug.
Anyway, thank you two.


André
--

From: Geoffrey Summerhayes
Subject: Re: dangers of read-sequence
Date: Tue, 05 Jul 2005 18:12:14 +0000
Message-ID: <k4Aye.4906$Ud.579758@news20.bellglobal.com>

"Peter Seibel" <·····@gigamonkeys.com> wrote in message 
···················@beagle.local...
> Edi Weitz <········@agharta.de> writes:
>
>> On Sat, 02 Jul 2005 21:18:32 GMT, Peter Seibel <·····@gigamonkeys.com> 
>> wrote:
>>
>>> I can't quite figure out how to explain the actual results you saw
>>
>> The results look so strange because he changes the size of the buffer
>> each time he runs through the inner loop.
>
> Yeah, I noticed that. But that doesn't explain the numbers he got--the
> bug I fixed only accounts for some number of extra newlines caused by
> counting some newlines from the penultimate READ-SEQUENCE twice. So
> the larger the buffer the larger the number of extra newlines you
> could get (up to the point where the buffer is large enough to hold
> the whole file, when a related bug creeps in:
> because he didn't specify an :initial-element in the call to
> MAKE-STRING, who knows what junk the string was initially filled
> with.) But he was going from smaller to larger buffers so the number
> of newlines counted should have been going up, not down.
>
>>> but this change does fix one bug--as it stands now, if the last read
>>> reads less than a full buffer you end up counting any new lines left
>>> over in the part of the buffer not updated by the last read twice.
>>
>> Right.  I think that was the only bug.
>
> I'm not so sure. But who knows.
>
> -Peter

Wouldn't the buffer size determine whether a CR-LF pair is broken between
two READ-SEQUENCE calls? It may be that the newline isn't recognized.

--
Geoff

From: Peter Seibel
Subject: Re: dangers of read-sequence
Date: Tue, 05 Jul 2005 18:34:51 +0000
Message-ID: <m24qb9cda0.fsf@beagle.local>

"Geoffrey Summerhayes" <·············@hotmail.com> writes:

> "Peter Seibel" <·····@gigamonkeys.com> wrote in message 
> ···················@beagle.local...
>> Edi Weitz <········@agharta.de> writes:
>>
>>> On Sat, 02 Jul 2005 21:18:32 GMT, Peter Seibel <·····@gigamonkeys.com> 
>>> wrote:
>>>
>>>> I can't quite figure out how to explain the actual results you saw
>>>
>>> The results look so strange because he changes the size of the buffer
>>> each time he runs through the inner loop.
>>
>> Yeah, I noticed that. But that doesn't explain the numbers he got--the
>> bug I fixed only accounts for some number of extra newlines caused by
>> counting some newlines from the penultimate READ-SEQUENCE twice. So
>> the larger the buffer the larger the number of extra newlines you
>> could get (up to the point where the buffer is large enough to hold
>> the whole file, when a related bug creeps in:
>> because he didn't specify an :initial-element in the call to
>> MAKE-STRING, who knows what junk the string was initially filled
>> with.) But he was going from smaller to larger buffers so the number
>> of newlines counted should have been going up, not down.
>>
>>>> but this change does fix one bug--as it stands now, if the last read
>>>> reads less than a full buffer you end up counting any new lines left
>>>> over in the part of the buffer not updated by the last read twice.
>>>
>>> Right.  I think that was the only bug.
>>
>> I'm not so sure. But who knows.
>>
>> -Peter
>
> Wouldn't the buffer size determine if a CR-LF pair is broken between
> two READ-SEQUENCE calls, it may be that the newline isn't recognized.

But then you'd expect smaller buffer sizes to miss more CRLFs and thus
yield a lower count. That's the opposite of the numbers the OP
reported.

-Peter

-- 
Peter Seibel           * ·····@gigamonkeys.com
Gigamonkeys Consulting * http://www.gigamonkeys.com/
Practical Common Lisp  * http://www.gigamonkeys.com/book/

From: Pascal Bourguignon
Subject: Re: dangers of read-sequence
Date: Wed, 06 Jul 2005 02:02:05 +0000
Message-ID: <87d5pw3d5u.fsf@thalassa.informatimago.com>

"Geoffrey Summerhayes" <·············@hotmail.com> writes:
> Wouldn't the buffer size determine if a CR-LF pair is broken between
> two READ-SEQUENCE calls, it may be that the newline isn't recognized.

That would be a bug in the implementation.  Either it should look up
one more byte to build a last new-line character, or it should leave
the last CR byte to read the new-line character next.

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
Small brave carnivores
Kill pine cones and mosquitoes
Fear vacuum cleaner

From: Edi Weitz
Subject: Dangers of not reading the manual (Was: dangers of read-sequence)
Date: Sat, 02 Jul 2005 21:19:34 +0000
Message-ID: <ur7egucrd.fsf@agharta.de>

On Sat, 02 Jul 2005 22:45:25 +0200, André Thieme <······························@justmail.de> wrote:

> I tried to count the number of lines in a file.
>
> (dotimes (i 10)
>   (setf buffer (make-string (* 1024 (1+ i))))
>   (with-open-file (in "C:/somefile.txt" :direction :input)
>     (let ((lines 0))
>       (loop for bytes = (read-sequence buffer in)
>             while (plusp bytes) do (incf lines (count #\Newline buffer)))

Try this instead:

              while (plusp bytes) do (incf lines (count #\Newline buffer :end bytes)))

>       (format t "run ~a: ~a lines in file~%" (1+ i) lines))))
>
> run 1: 78606 lines in file
> run 2: 39314 lines in file
> run 3: 26202 lines in file
> run 4: 19668 lines in file
> run 5: 15730 lines in file
> run 6: 13112 lines in file
> run 7: 11242 lines in file
> run 8: 9834 lines in file
> run 9: 8734 lines in file
> run 10: 7876 lines in file
> NIL
>
> read-sequence seems not to be the way to go.

Most things are not "the way to go" if one doesn't use them
correctly...

> What happened?

According to the CLHS READ-SEQUENCE returns the position of the first
index into BUFFER that was /not/ updated.  You've ignored this return
value (except for testing whether it's positive) and counted until the
end of the string.
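
A short sketch of that return value in action, using a string stream in
place of a file. READ-TWICE is a hypothetical helper written for this
illustration; it returns the result of two consecutive READ-SEQUENCE
calls together with a snapshot of the buffer after each.

```lisp
;; Sketch: show what READ-SEQUENCE returns and what it leaves in the buffer.
;; READ-TWICE is a hypothetical helper, not part of any library.
(defun read-twice (text buffer-size)
  (let ((buffer (make-string buffer-size :initial-element #\-)))
    (with-input-from-string (in text)
      (list (read-sequence buffer in) (copy-seq buffer)
            (read-sequence buffer in) (copy-seq buffer)))))

(read-twice "abcdef" 4)
;; => (4 "abcd" 2 "efcd")
```

The second call updates only the first two elements and returns 2; the
trailing "cd" is left over from the first call, which is the part that
gets counted again when the whole buffer is scanned.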

Cheers,
Edi.

-- 

Lisp is not dead, it just smells funny.

Real email: (replace (subseq ·········@agharta.de" 5) "edi")