CMUCL + db-sockets: random Bad Address (EFAULT) errors

From: Kenny Tilton
Subject: CMUCL + db-sockets: random Bad Address (EFAULT) errors
Date: Fri, 29 Aug 2003 16:57:46 +0000
Message-ID: <3F4F86E3.3040303@nyc.rr.com>

OK, db-sockets and cmucl and robocells are getting along swimmingly now, 
exchanging thousands of messages before going kersplat at random:

    Socket error in "recvfrom": 14 (Bad address)

I gotta think GC interaction, but I see in the db-sockets source:

    (sb-sys:without-gcing...

That sb-sys gave me a scare since I am using cmu, but a little package 
sleuthing reveals that it translates to system, i.e., I think I am 
looking at the right source.

I am going to try trapping the error and simply re-reading with the same 
or even a brandy new buffer (I have also tried letting socket-receive 
allocate the buffer--no dice, maybe worse), but any insights will be 
greatly appreciated.

-- 

  kenny tilton
  clinisys, inc
  http://www.tilton-technology.com/
  ---------------------------------------------------------------
"Career highlights? I had two. I got an intentional walk from
Sandy Koufax and I got out of a rundown against the Mets."
                                                  -- Bob Uecker

From: Kenny Tilton
Subject: Re: CMUCL + db-sockets: random Bad Address (EFAULT) errors
Date: Fri, 29 Aug 2003 18:16:19 +0000
Message-ID: <3F4F994D.2090604@nyc.rr.com>

Kenny Tilton wrote:
> I am going to try trapping the error and simply re-reading with the same 
> or even a brandy new buffer (I have also tried letting socket-receive 
> allocate the buffer--no dice, maybe worse), but any insights will be 
> greatly appreciated.

Retrying worked, often with the same buffer (all are allocated by my 
app, not socket-receive). But I think that is risky business: when I 
told it to try the same buffer more than once, I managed to take out the 
server. Coincidence, maybe. But I am tempted to go straight to a new 
buffer as soon as I get an EFAULT (Bad address) back from socket-receive 
(from recvfrom). I certainly will if I seem to be dying on retrying the 
same buffer.

So... assuming this is a GC issue, is there a better way to safeguard 
FFI calls than WITHOUT-GCING under cmu? Or could there be a flaw in the 
db-sockets FFI for recvfrom that would provoke this behavior? Of course, 
it could always be something else entirely...
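
A minimal sketch of the trap-and-retry workaround described above, in
portable Common Lisp. The condition BAD-ADDRESS-ERROR and the function
RECEIVE-WITH-RETRY are hypothetical names invented for illustration; they
are not part of db-sockets, and whatever condition the library really
signals for errno 14 would have to be substituted. RECEIVE-FN stands in
for a call such as socket-receive on a caller-supplied buffer.

```lisp
;; Hypothetical condition: a stand-in for whatever the socket library
;; signals when recvfrom fails with errno 14 (EFAULT).
(define-condition bad-address-error (error) ())

(defun receive-with-retry (receive-fn buffer-size &key (max-retries 3))
  "Call RECEIVE-FN on a freshly allocated octet buffer.  If it signals
BAD-ADDRESS-ERROR, allocate a brand-new buffer and try again, up to
MAX-RETRIES extra attempts.  Returns whatever RECEIVE-FN returns."
  (loop repeat (1+ max-retries)
        do (let ((buffer (make-array buffer-size
                                     :element-type '(unsigned-byte 8)
                                     :initial-element 0)))
             (handler-case
                 (return (funcall receive-fn buffer))
               ;; Swallow the error; the next iteration gets a new buffer.
               (bad-address-error () nil)))
        finally (error 'bad-address-error)))
```

The sketch never reuses a buffer that has already produced an error, which
is the more cautious of the two variants discussed. It does nothing to
recover a message that the failed call may have consumed.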


From: Daniel Barlow
Subject: Re: CMUCL + db-sockets: random Bad Address (EFAULT) errors
Date: Sun, 31 Aug 2003 13:11:45 +0000
Message-ID: <87n0dqug32.fsf@noetbook.telent.net>

Kenny Tilton <·······@nyc.rr.com> writes:

>> OK, db-sockets and cmucl and robocells are getting along swimmingly
>> now, exchanging thousands of messages before going kersplat at
>> random:
>>    Socket error in "recvfrom": 14 (Bad address)

I've hit the same problem in SBCL and sb-bsd-sockets (which is
basically the same code).  I tried to email you about it, but 
your ISP bounced my mail for being sent from a cable modem.

(Which I think is wrong; if it had just refused the connection
altogether, my machine would have retried via its backup smarthost,
which is a real internet site(tm), and you'd have got the mail.  Yay
Roadrunner.)

>> I gotta think GC interaction, but I see in the db-sockets source:
>>    (sb-sys:without-gcing...

Today's new information (I found this out five minutes ago):
sb-sys:without-gcing doesn't always work in sbcl.  Now, we did a fair
amount of GC rearrangement in order to support native threads, but the
same problem may exist in cmucl as well.  Try setting

    ;; this makes sure you get a gc notification from the actual C code
    ;; whenever a collection takes place
    (alien:def-alien-variable "gencgc_verbose" alien:unsigned)
    (setf gencgc-verbose 2)

and see if the errors are correlated in any way with the time of gc.

> so... assuming this is a GC issue, is there a better way to safeguard
> FFI calls than WITHOUT-GCING under cmu? Or could there be a flaw in
> the db-sockets FFI for recvfrom that would provoke this behavior? of
> course it could always be something else entirely...

So, anyway, I'm seeing what looks like the same problem here, and
working on it.

-dan

-- 

   http://www.cliki.net/ - Link farm for free CL-on-Unix resources

From: Kenny Tilton
Subject: Re: CMUCL + db-sockets: random Bad Address (EFAULT) errors
Date: Sun, 31 Aug 2003 17:43:08 +0000
Message-ID: <3F52335E.6000002@nyc.rr.com>

[sent to c.l.l. as well]

Daniel Barlow wrote:
> Kenny Tilton <·······@nyc.rr.com> writes:
> 
> 
>>>OK, db-sockets and cmucl and robocells are getting along swimmingly
>>>now, exchanging thousands of messages before going kersplat at
>>>random:
>>>   Socket error in "recvfrom": 14 (Bad address)
>>
> 
> I've hit the same problem in SBCL and sb-bsd-sockets (which is
> basically the same code).  I tried to email you about it, but 
> your isp bounced my mail for being sent from a cablemodem

Try embedding your actual message in an advertisement for Viaggravating 
or anatomy enlargement, all those get thru. :)

> Today's new information (I found this out five minutes ago):
> sb-sys:without-gcing doesn't always work in sbcl. 

Thanks for confirming this. Did you see my workaround? Trap the error, 
allocate a new buffer and retry. It hasn't failed yet. I have not looked to 
see if I am missing one message each time this comes up--I could check 
that the sense-see-sense-see-sense pattern is not broken. Anyway, at 
this point I am thrilled to be thinking about soccer and machine 
learning and the Robocells framework instead of sockets.

But I would definitely be interested in a true fix. Please keep me posted.


From: Johan Ur Riise
Subject: Re: CMUCL + db-sockets: random Bad Address (EFAULT) errors
Date: Thu, 18 Sep 2003 05:03:30 +0000
Message-ID: <87isnqy9gd.fsf@egg.riise-data.net>

Kenny Tilton <·······@nyc.rr.com> writes:

> Thanks for confirming this. Did you see my workaround? Trap the error,
> allocate a new buffer and retry. hasn't failed yet. I have not looked
> to see if I am missing one message each time this comes up--I could
> check that the sense-see-sense-see-sense pattern is not
> broken. Anyway, at this point I am thrilled to be thinking about
> soccer and Machine learning and the Robocells framework instead of
> sockets.
> 
> But I would definitely be interested in a true fix. Please keep me posted.

Has this problem been resolved? 

I get the same error sometimes in CMUCL and SBCL. The trick of retrying
the receive with a new buffer lets the application continue, but each
time one packet is lost.  I have tried this on two machines with Linux
2.4.18 and 2.4.20, both dual-processor machines. SBCL is a CVS build,
version 0.8.3.72, with sb-thread compiled in. CMUCL is 3.1.2 18d+.

-- 
Regards
Johan Ur Riise

From: Johan Ur Riise
Subject: Re: CMUCL + db-sockets: random Bad Address (EFAULT) errors
Date: Fri, 19 Sep 2003 21:16:25 +0000
Message-ID: <87k784xyvq.fsf@egg.riise-data.net>

Johan Ur Riise <·····@riise-data.no> writes:

> Has this problem been resolved? 
> 
> I get the same error sometimes in cmucl and sbcl. The trick with a new
> buffer, then receive again, lets the application continue, but each
> time one packet is lost.  I have tried this on two machines with linux
> 2.4.18 and 2.4.20, both dual processor machines. Sbcl is in a cvs
> version 0.8.3.72 with sb-thread compiled in. cmucl is 3.1.2 18d+.
> 

Someone emailed me about this, saying

> It might be related to CMUCL using VM protection for write (or read?)
> barriers. Segfaults are normally handled by the runtime, except those
> that happen inside syscalls. Touching a byte on each page before the
> actual call seems to help

and supplied example code to touch (by writing an octet to) each 4K page
of the buffer just before the read. I adapted this code for sockets.lisp
in sb-bsd-sockets, and it worked. The problem is that when I wanted to
verify the solution by going back to the unpatched code, that seemed to
work too. So now I can read a hundred million datagrams from a socket and
get no read errors. The only difference in the system now, I think, is
that I killed a few dozen xterms just to clean up my environment.
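
The code that was emailed is not reproduced in this thread. What follows
is only a rough sketch of the page-touching idea in portable Common Lisp.
TOUCH-PAGES is a hypothetical name, the 4096-octet default simply mirrors
the 4K pages mentioned above, and the sketch assumes the compiler does not
optimize away a store of a value back into the same element.

```lisp
(defun touch-pages (buffer &optional (page-size 4096))
  "Write one octet back into BUFFER at every PAGE-SIZE stride, and at its
last element, so that every page the buffer spans has been written to
from Lisp code.  The contents are left unchanged.  Returns BUFFER."
  (let ((n (length buffer)))
    (when (plusp n)
      (loop for i from 0 below n by page-size
            do (setf (aref buffer i) (aref buffer i)))
      ;; The buffer need not start on a page boundary, so its tail can
      ;; spill onto one more page than the strides above reach.
      (setf (aref buffer (1- n)) (aref buffer (1- n)))))
  buffer)
```

The intent is to call this immediately before handing the buffer to the
receive call, so that any write-protection fault is taken in Lisp code,
where the runtime can handle it, rather than inside the system call.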

I shall come back to this if I find out more.

-- 
Regards
Johan Ur Riise

From: Daniel Barlow
Subject: Re: CMUCL + db-sockets: random Bad Address (EFAULT) errors
Date: Fri, 19 Sep 2003 22:53:51 +0000
Message-ID: <87fzis75kw.fsf@noetbook.telent.net>

Johan Ur Riise <·····@riise-data.no> writes:

> Someone emailed me about this, saying
>
>> It might be related to CMUCL using VM protection for write (or read?)
>> barriers. Segfaults are normally handled by the runtime, except those
>> that happen inside syscalls. Touching a byte on each page before the
>> actual call seems to help

It's the write barrier, yes.  GC write-protects all pages for the
oldest generation when it's done collecting: if the page is still 
protected by the time of the following GC, we know it contains no
pointers into younger generations.

> and supplied example code to touch (by writing an octet on) each 4K page
> of the buffer just before the read. I adjusted this code for sb-bsd-sockets
> sockets.lisp, and it worked. The problem is that, when I wanted to verify the

That would help, but it wouldn't completely do the trick.  If another
thread triggers GC while your thread is blocked in a foreign call,
touching the pages beforehand still won't stop the buffer from
potentially being moved (if it's in the generation being collected) or
write-protected again (if it's moved into an older generation which
should be write-barricaded).

The good news is that in SBCL CVS this is all fixed by some small
changes to the way the garbage collector works: a new
WITH-PINNED-OBJECTS macro takes zero or more objects and puts pointers
to them on the C stack so that GC won't move them.  So, if you're
using SBCL, either get the CVS version and rebuild (remember, this is
SBCL!  rebuilding is easy!  Plug finishes...) or wait for 0.8.4, which
should happen sometime between now and the end of the month.
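
A small sketch of how the macro might be used, assuming an SBCL in which
SB-SYS:WITH-PINNED-OBJECTS and SB-SYS:VECTOR-SAP are available.
CALL-WITH-PINNED-BUFFER is a hypothetical helper written for illustration
and is not taken from sb-bsd-sockets. The lambda in the usage example
stands in for the foreign call that would receive the address.

```lisp
;; SBCL-specific sketch; will not load on CMUCL.
(defun call-with-pinned-buffer (buffer fn)
  "Pin BUFFER (a simple specialized vector) so the collector will not
move it, then call FN with a system-area pointer to its data.  The
pointer must not be used after FN returns."
  (sb-sys:with-pinned-objects (buffer)
    (funcall fn (sb-sys:vector-sap buffer))))

;; Usage: write through the raw address, then observe it from Lisp.
(defun pinned-buffer-demo ()
  (let ((buf (make-array 16 :element-type '(unsigned-byte 8)
                            :initial-element 0)))
    (call-with-pinned-buffer
     buf
     (lambda (sap) (setf (sb-sys:sap-ref-8 sap 0) 42)))
    (aref buf 0)))
```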

How to make it work for CMUCL, I don't know.  Touching each page would
do nicely unless you're using threads, in which case you're still
stuck.  The simplest answer (though involving tedious mucking around)
might just be to have db-sockets allocate foreign memory for sockaddrs
instead of lisp memory.  I don't use CMUCL much these days, though.
"We're taking patches"

-dan
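
A sketch of the foreign-memory idea, written against SBCL's SB-ALIEN
package because that is the only place it has been thought through.
CMUCL's ALIEN package is believed to offer operators with the same names,
but that is an assumption here. Both function names are hypothetical and
nothing below comes from db-sockets.

```lisp
;; SBCL-specific sketch: memory from MAKE-ALIEN lives outside the Lisp
;; heap, so the collector neither moves nor write-protects it.
(defun call-with-foreign-buffer (size fn)
  "Allocate SIZE octets of foreign memory, call FN with the alien pointer
and SIZE, and free the memory afterwards, even on a non-local exit."
  (let ((buf (sb-alien:make-alien (sb-alien:unsigned 8) size)))
    (unwind-protect (funcall fn buf size)
      (sb-alien:free-alien buf))))

(defun foreign-buffer-to-vector (buf count)
  "Copy the first COUNT octets of foreign buffer BUF into a Lisp vector."
  (let ((v (make-array count :element-type '(unsigned-byte 8))))
    (dotimes (i count v)
      (setf (aref v i) (sb-alien:deref buf i)))))
```

The cost is one extra copy per call and manual memory management, which
is presumably the "tedious mucking around" referred to above.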


From: Johan Ur Riise
Subject: Re: CMUCL + db-sockets: random Bad Address (EFAULT) errors
Date: Sat, 20 Sep 2003 00:05:01 +0000
Message-ID: <87fzisxr2q.fsf@egg.riise-data.net>

Daniel Barlow <···@telent.net> writes:

> 
> The good news is that in SBCL CVS this is all fixed by some small
> changes to the way the garbage collector works: a new
> WITH-PINNED-OBJECTS macro takes zero or more objects and puts pointers
> to them on the C stack so that GC won't move them.  So, if you're
> using SBCL, either get the CVS version and rebuild (remember, this is
> SBCL!  rebuilding is easy!  Plug finishes...) or wait for 0.8.4, which
> should happen sometime between now and the end of the month.

Great, I will update my CVS version. If this was done in version 1.17
of x86/macros.lisp, one explanation for the disappearance of the error
could be that I restarted my Lisp, and the old SBCL I had been running
predated 1.17.

-- 
Regards
Johan Ur Riise