Lispworks: CORBA+FLI=SEGV?

From: Sunil Mishra
Subject: Lispworks: CORBA+FLI=SEGV?
Date: Tue, 14 Nov 2000 00:00:00 +0000
Message-ID: <3A11E2A4.9030002@everest.com>

In rare circumstances, we have found that our XML parser when operating 
in a multithreaded lispworks corba server produces a segmentation fault. 
I'm going to try and elaborate, in the hopes that someone here will be 
able to help us locate/work around our problem.

We set up a little test client/server pair that does the following:

1. The server creates a pair of CORBA receiver objects. They are 
attached to a multi-threaded POA instance (not the root POA). Their 
responsibility is to parse and store the XML document they receive.

2. The client constructs a proxy for each of the servers. It starts up a 
thread for each proxy, where the thread waits for a short random 
interval (< 1 sec) and transfers an XML string to the corresponding 
receiver servant. Then the thread repeats the wait-and-send.

Each return trip takes longer than a second, so more often than not the 
server is working with two incoming messages simultaneously. The XML 
parser has a process lock for now, so there should be no issues of 
thread safety here.
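
A minimal sketch of the test setup described above, written in Python rather than Lisp so that it runs on its own. Everything here is an assumption made for illustration: plain function calls stand in for the CORBA invocations, `threading.Lock` stands in for the Lisp process lock, and the names `receive`, `sender` and `parser_lock` are hypothetical. The waits are scaled down from "< 1 sec" so the sketch finishes quickly.

```python
import random
import threading
import time
import xml.parsers.expat

parser_lock = threading.Lock()    # stands in for the Lisp process lock
results = []                      # (receiver_id, element_count) per message
results_lock = threading.Lock()   # protects the results list


def receive(receiver_id, xml_text):
    """Hypothetical receiver: parse under the lock, then store a summary."""
    count = 0

    def on_start(name, attrs):
        nonlocal count
        count += 1

    # Only one thread at a time is allowed inside the parser.
    with parser_lock:
        parser = xml.parsers.expat.ParserCreate()
        parser.StartElementHandler = on_start
        parser.Parse(xml_text, True)

    with results_lock:
        results.append((receiver_id, count))


def sender(receiver_id, xml_text, cycles):
    """Hypothetical client thread: wait a short random interval, then send."""
    for _ in range(cycles):
        time.sleep(random.uniform(0.0, 0.01))
        receive(receiver_id, xml_text)


document = "<doc>" + "<item>x</item>" * 1000 + "</doc>"
threads = [
    threading.Thread(target=sender, args=(i, document, 5)) for i in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With a working lock, every message is parsed completely no matter how the two senders interleave, which is the behavior the original test expected to see.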

The XML string in question is 280K in size, and the XML parser is a set 
of FLI bindings that call James Clark's expat parser. (The parser itself 
has been open sourced, and also has ACL bindings.) The FLI layer is 
moderately complex, in that it requires both calls into the C code and 
callbacks from C to Lisp. Virtually no interesting data translation 
takes place. We declare the input to the parser as a 
:REFERENCE-PASS :EF-MB-STRING argument, and LispWorks handles the 
details of the translation. Callbacks invariably have short string 
values coming back into Lisp, which are translated to Lisp strings using 
FLI:CONVERT-FROM-FOREIGN-STRING. Nothing fancy, so far as I can tell.
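
For readers unfamiliar with expat's callback style, here is a rough sketch using Python's standard `xml.parsers.expat` binding to the same C library. It is not the FLI layer described above; the function name `parse_document` and the sample document are made up for illustration. It only shows the general shape: the whole input goes in through one call, and short strings come back through callbacks that store them immediately.

```python
import xml.parsers.expat


def parse_document(xml_text):
    """Parse xml_text with expat and return (element_names, text_chunks)."""
    element_names = []
    text_chunks = []

    parser = xml.parsers.expat.ParserCreate()
    # Each callback receives a short string and stores it right away.
    parser.StartElementHandler = lambda name, attrs: element_names.append(name)
    parser.CharacterDataHandler = lambda data: text_chunks.append(data)
    # The second argument marks this as the final (here, only) chunk of input.
    parser.Parse(xml_text, True)
    return element_names, text_chunks


names, chunks = parse_document("<doc><item>alpha</item><item>beta</item></doc>")
```

Note that expat may deliver the character data of one element in several pieces, so callers normally join the chunks rather than assume one callback per text node.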

The client/server pair gets through about half a dozen send-and-parse 
cycles. Then the server dies with a segmentation fault. All we have 
been able to tell so far is that the segmentation fault occurs in one of 
the callbacks, with the foreign string pointer pointing at a bad 
address. I'm posting here hoping to get a solution, or at least some 
suggestions that might let me dig deeper into the FLI layer and find out 
what to blame, LispWorks or expat.

Thanks,

Sunil

From: Joe Marshall
Subject: Re: Lispworks: CORBA+FLI=SEGV?
Date: Wed, 15 Nov 2000 00:00:00 +0000
Message-ID: <d7fx2p41.fsf@content-integrity.com>

Sunil Mishra <············@everest.com> writes:

> In rare circumstances, we have found that our XML parser when operating 
> in a multithreaded lispworks corba server produces a segmentation fault. 

Is it the case that running your XML parser in a single threaded
environment works ok?  (or is that not possible to try?)

> Each return trip takes longer than a second, so more often than not the 
> server is working with two incoming messages simultaneously. The XML 
> parser has a process lock for now, so there should be no issues of 
> thread safety here.

What sort of process lock?  Is the lock provided by the OS, or is this
a lisp process-lock?  When is this lock acquired and released?  When
the segfault happens, has a GC occurred recently?

-----= Posted via Newsfeeds.Com, Uncensored Usenet News =-----
http://www.newsfeeds.com - The #1 Newsgroup Service in the World!
-----==  Over 80,000 Newsgroups - 16 Different Servers! =-----

From: Sunil Mishra
Subject: Re: Lispworks: CORBA+FLI=SEGV?
Date: Wed, 15 Nov 2000 00:00:00 +0000
Message-ID: <3A12EFB9.7070204@everest.com>

Joe Marshall wrote:

> Sunil Mishra <············@everest.com> writes:
> 
> 
>> In rare circumstances, we have found that our XML parser when operating 
>> in a multithreaded lispworks corba server produces a segmentation fault. 
> 
> 
> Is it the case that running your XML parser in a single threaded
> environment works ok?  (or is that not possible to try?)

It works fine in a single threaded environment. It also works fine in a 
multi-threaded environment with only one thread active. It only tends to 
break when there are multiple threads active over both corba and the xml 
parser. I can't think of anything that might cause this type of an 
interaction, not knowing the FLI or ORB implementation, so I'm fishing 
for tips.

>> Each return trip takes longer than a second, so more often than not the 
>> server is working with two incoming messages simultaneously. The XML 
>> parser has a process lock for now, so there should be no issues of 
>> thread safety here.
> 
> 
> What sort of process lock?  Is the lock provided by the OS, or is this
> a lisp process-lock?  When is this lock acquired and released?  When
> the segfault happens, has a GC occurred recently?

It's a user level lisp process lock. (Lispworks on linux does not have 
support for kernel threads.) I haven't been able to correlate the segv 
with a gc, that might be something worth looking into. But if you are 
thinking the cause might be that a lisp gc has moved the data that the C 
code is working with, I believe that is an unlikely scenario. Lispworks 
(fortunately or unfortunately) appears to take the paranoid approach and 
make a copy of all strings (presumably into static heap) before passing 
them to foreign code.

Sunil

From: Joe Marshall
Subject: Re: Lispworks: CORBA+FLI=SEGV?
Date: Wed, 15 Nov 2000 00:00:00 +0000
Message-ID: <k8a5t0ue.fsf@content-integrity.com>

Sunil Mishra <············@everest.com> writes:

> Joe Marshall wrote:
> 
> > Sunil Mishra <············@everest.com> writes:
> > 
> > 
> >> In rare circumstances, we have found that our XML parser when operating 
> >> in a multithreaded lispworks corba server produces a segmentation fault. 
> > 
> > 
> > Is it the case that running your XML parser in a single threaded
> > environment works ok?  (or is that not possible to try?)
> 
> It works fine in a single threaded environment. It also works fine in a 
> multi-threaded environment with only one thread active. It only tends to 
> break when there are multiple threads active over both corba and the xml 
> parser. I can't think of anything that might cause this type of an 
> interaction, not knowing the FLI or ORB implementation, so I'm fishing 
> for tips.

That sure seems like it would indicate a threading issue.

> > What sort of process lock?  Is the lock provided by the OS, or is this
> > a lisp process-lock?  When is this lock acquired and released?  When
> > the segfault happens, has a GC occurred recently?
> 
> It's a user level lisp process lock. (Lispworks on linux does not have 
> support for kernel threads.) 

So there is but a single OS thread that is being multiplexed among the
several Lisp threads?

I have never used Lispworks, but my experience in dealing with
the multithreading implementation of 3 other major lisp vendors is
that it is not unheard of to have subtle errors within the
synchronization functions.  Can you trace the entry and exit of a
thread into the XML parser to determine if the lock might be failing?
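
One way to do such tracing, sketched in Python as an analogy rather than in LispWorks itself: wrap the suspect lock so that every entry and exit is logged and the number of threads inside the protected section is counted. The class name `TracedSection` and the worker setup are hypothetical. A second, separate lock guards the bookkeeping, so the counters stay trustworthy even if the lock under test misbehaves.

```python
import threading


class TracedSection:
    """Context manager that records how many threads are inside at once."""

    def __init__(self, lock):
        self._lock = lock               # the lock under suspicion
        self._meta = threading.Lock()   # protects the counters below
        self._inside = 0
        self.max_inside = 0
        self.events = []                # ("enter" or "exit", thread name)

    def __enter__(self):
        self._lock.acquire()
        with self._meta:
            self._inside += 1
            self.max_inside = max(self.max_inside, self._inside)
            self.events.append(("enter", threading.current_thread().name))
        return self

    def __exit__(self, exc_type, exc, tb):
        with self._meta:
            self._inside -= 1
            self.events.append(("exit", threading.current_thread().name))
        self._lock.release()
        return False


section = TracedSection(threading.Lock())


def worker():
    for _ in range(100):
        with section:
            pass  # the call into the parser would go here


threads = [threading.Thread(target=worker, name=f"worker-{i}") for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

If `section.max_inside` ever exceeds 1, two threads were in the parser together and the lock is failing; the `events` list then shows which threads overlapped.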

> I haven't been able to correlate the segv with a gc, that might be
> something worth looking into. But if you are thinking the cause
> might be that a lisp gc has moved the data that the C code is
> working with, I believe that is an unlikely scenario.

I agree that it is highly unlikely that the GC is deliberately mucking
around with the data that C is working with.  However, there are other
things that the GC might be doing that could interact with the C
code.  I have seen an implementation of Lisp that did not check for
memory contention when growing the heap.  It was possible for the heap
to then grow into the static region that C was using, or into the data
segment of a dynamically loaded library.

I have also seen an implementation that did not synchronize the
garbage collector with the other Lisp threads.  A GC could enter a
critical section that was still in use by another Lisp thread.

But I'd only start poking around at that if there were a strong
coincidence between the GC running and the parser crashing.
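
As an illustration of checking for that kind of coincidence, here is a small Python sketch. It uses CPython's `gc.callbacks` hook, which is a Python facility and says nothing about how LispWorks exposes its collector. The helper name `last_gc_before` is hypothetical. The idea is to timestamp every collection, then look up the most recent one before a suspect event such as a crash or a bad callback.

```python
import gc
import time

gc_log = []  # (timestamp, phase, generation) for every collector event


def record_gc(phase, info):
    # CPython calls this with phase "start" or "stop" around each collection.
    gc_log.append((time.monotonic(), phase, info["generation"]))


gc.callbacks.append(record_gc)


def last_gc_before(timestamp):
    """Return the most recent finished collection at or before timestamp."""
    stops = [e for e in gc_log if e[1] == "stop" and e[0] <= timestamp]
    return stops[-1] if stops else None


gc.collect()                         # force a collection for the demonstration
suspect_time = time.monotonic()      # stands in for the moment of the failure
recent = last_gc_before(suspect_time)
gc.callbacks.remove(record_gc)
```

A consistently small gap between `recent[0]` and the failure time across many runs would be the strong coincidence worth investigating; a scattered gap would suggest looking elsewhere.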

> Lispworks (fortunately or unfortunately) appears to take the
> paranoid approach and make a copy of all strings (presumably into
> static heap) before passing them to foreign code.

Since you are getting a segfault on the return value, I would imagine
that the address that was being used is nowhere near *anything*
reasonable.

Once the foreign code returns, I assume you have to copy the strings
back out of wherever into the Lisp heap.  How are the return values
managed?  Do you have to pre-allocate some storage for them, or do you
have to manually free them?

From: Sunil Mishra
Subject: Re: Lispworks: CORBA+FLI=SEGV?
Date: Wed, 15 Nov 2000 00:00:00 +0000
Message-ID: <3A135F98.3040802@everest.com>

Joe Marshall wrote:

> Sunil Mishra <············@everest.com> writes:
> 
> 
>> Joe Marshall wrote:
>> 
>> 
>>> Sunil Mishra <············@everest.com> writes:
>>> 
>>> 
>>> 
>>>> In rare circumstances, we have found that our XML parser when operating 
>>>> in a multithreaded lispworks corba server produces a segmentation fault. 
>>> 
>>> 
>>> Is it the case that running your XML parser in a single threaded
>>> environment works ok?  (or is that not possible to try?)
>> 
>> It works fine in a single threaded environment. It also works fine in a 
>> multi-threaded environment with only one thread active. It only tends to 
>> break when there are multiple threads active over both corba and the xml 
>> parser. I can't think of anything that might cause this type of an 
>> interaction, not knowing the FLI or ORB implementation, so I'm fishing 
>> for tips.
> 
> 
> That sure seems like it would indicate a threading issue.

It appears to be more than just a threading issue. I have tested the 
behavior of the system with XML+threads, CORBA+threads and CORBA+XML. 
Everything works under these combinations. However, with 
XML+threads+CORBA, a crash is only a matter of time. I have replicated 
this behavior with fairly small XML documents as well.

Our guess is that when there are two threads simultaneously sending data 
through the CORBA interface, there is some kind of buffering problem. 
This is just speculation right now.


>>> What sort of process lock?  Is the lock provided by the OS, or is this
>>> a lisp process-lock?  When is this lock acquired and released?  When
>>> the segfault happens, has a GC occurred recently?
>> 
>> It's a user level lisp process lock. (Lispworks on linux does not have 
>> support for kernel threads.) 
> 
> 
> So there is but a single OS thread that is being multiplexed among the
> several Lisp threads?
> 
> I have never used Lispworks, but my experience in dealing with
> the multithreading implementation of 3 other major lisp vendors is
> that it is not unheard of to have subtle errors within the
> synchronization functions.  Can you trace the entry and exit of a
> thread into the XML parser to determine if the lock might be failing?

I have not done this rigorously, but under the tests that I have run I 
have never seen any indication of the lock failing. There is only one 
thread at a time that ever seems to be in the parser.


>> I haven't been able to correlate the segv with a gc, that might be
>> something worth looking into. But if you are thinking the cause
>> might be that a lisp gc has moved the data that the C code is
>> working with, I believe that is an unlikely scenario.
> 
> 
> I agree that it is highly unlikely that the GC is deliberately mucking
> around with the data that C is working with.  However, there are other
> things that the GC might be doing that could interact with the C
> code.  I have seen an implementation of Lisp that did not check for
> memory contention when growing the heap.  It was possible for the heap
> to then grow into the static region that C was using, or into the data
> segment of a dynamically loaded library.
> 
> I have also seen an implementation that did not synchronize the
> garbage collector with the other Lisp threads.  A GC could enter a
> critical section that was still in use by another Lisp thread.
> 
> But I'd only start poking around at that if there were a strong
> coincidence between the GC running and the parser crashing.

This is definitely a possibility. I'm trying to obtain more information 
about GC monitoring in LispWorks.

>> Lispworks (fortunately or unfortunately) appears to take the
>> paranoid approach and make a copy of all strings (presumably into
>> static heap) before passing them to foreign code.
> 
> 
> Since you are getting a segfault on the return value, I would imagine
> that the address that was being used is nowhere near *anything*
> reasonable.

It isn't :-)

> Once the foreign code returns, I assume you have to copy the strings
> back out of wherever into the Lisp heap.  How are the return values
> managed?  Do you have to pre-allocate some storage for them, or do you
> have to manually free them?

I believe the allocation and de-allocation of storage is handled by the 
expat parser for the values in the callbacks. Once in lisp space, I copy 
them into a lisp string and then do interesting things with them.

Thanks for your help, I have gotten some good ideas for debugging our 
system :-)

Sunil

From: Joe Marshall
Subject: Re: Lispworks: CORBA+FLI=SEGV?
Date: Thu, 16 Nov 2000 00:00:00 +0000
Message-ID: <wve3svhd.fsf@content-integrity.com>

Sunil Mishra <············@everest.com> writes:

> Thanks for your help, I have gotten some good ideas for debugging our 
> system :-)

You're welcome.  Let me know if you need any more suggestions.  I'd
also be interested in knowing what the problem was when you find it.

From: Sunil Mishra
Subject: Re: Lispworks: CORBA+FLI=SEGV?
Date: Thu, 16 Nov 2000 00:00:00 +0000
Message-ID: <3A1470A8.80108@everest.com>

Joe Marshall wrote:

> Sunil Mishra <············@everest.com> writes:
> 
> 
>> Thanks for your help, I have gotten some good ideas for debugging our 
>> system :-)
> 
> 
> You're welcome.  Let me know if you need any more suggestions.  I'd
> also be interested in knowing what the problem was when you find it.

Received a patch from Xanalys today :-) It appears the problem was a 
pointer being mangled in the FLI when an interrupt arrived under 
certain circumstances. I don't have details beyond that.

Thanks again for the pointers!

Sunil

From: Kaelin Colclasure
Subject: Re: Lispworks: CORBA+FLI=SEGV?
Date: Thu, 16 Nov 2000 00:00:00 +0000
Message-ID: <wu1ywblmqu.fsf@vanguard.arslogica.lan>

Sunil Mishra <············@everest.com> writes:

> > Once the foreign code returns, I assume you have to copy the strings
> > back out of wherever into the Lisp heap.  How are the return values
> > managed?  Do you have to pre-allocate some storage for them, or do you
> > have to manually free them?
> 
> I believe the allocation and de-allocation of storage is handled by
> the expat parser for the values in the callbacks. Once in lisp space,
> I copy them into a lisp string and then do interesting things with
> them.

The storage for these strings is indeed handled by expat -- and the
locations addressed are valid for the duration of the callback only.
So you are correct in copying them into Lisp space in the callback
routine.
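
The lifetime rule can be illustrated with a small Python `ctypes` sketch. This is a simulation and not expat itself: `library_buffer` is a hypothetical stand-in for a C library's internal buffer that gets reused between callbacks, and `on_text` and `library_emit` are made-up names. It contrasts copying the bytes during the callback with keeping only the address for later.

```python
import ctypes

# Hypothetical stand-in for a C library's internal, reused text buffer.
library_buffer = ctypes.create_string_buffer(16)

copied = []     # bytes copied during the callback (safe)
retained = []   # (address, length) pairs kept past the callback (unsafe)


def on_text(address, length):
    # Copy the bytes out while the library still guarantees their contents.
    copied.append(ctypes.string_at(address, length))
    # Keeping only the address defers the read until after the guarantee ends.
    retained.append((address, length))


def library_emit(text):
    # The "library" overwrites its buffer, then invokes the callback.
    library_buffer.value = text
    on_text(ctypes.addressof(library_buffer), len(text))


library_emit(b"first")
library_emit(b"second")

# Reading through the retained addresses now sees whatever came last.
stale = [ctypes.string_at(address, length) for address, length in retained]
```

Here the copies still read `first` and `second`, while the first retained address now yields the start of `second`. In this simulation the buffer stays allocated, so the stale read returns wrong data; with a real library that frees or moves its storage, the same mistake could fault instead.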

I'm wondering if all this foreign string manipulation within the
dynamic context of a CORBA method invocation is key to where the
problem lies? Perhaps Lispworks CORBA does something with some static
buffer space which is also used by the string conversion internals?

-- Kaelin

From: Sunil Mishra
Subject: Re: Lispworks: CORBA+FLI=SEGV?
Date: Thu, 16 Nov 2000 00:00:00 +0000
Message-ID: <3A1442E3.4010207@everest.com>

Kaelin Colclasure wrote:

> Sunil Mishra <············@everest.com> writes:
> 
> 
>>> Once the foreign code returns, I assume you have to copy the strings
>>> back out of wherever into the Lisp heap.  How are the return values
>>> managed?  Do you have to pre-allocate some storage for them, or do you
>>> have to manually free them?
>> 
>> I believe the allocation and de-allocation of storage is handled by
>> the expat parser for the values in the callbacks. Once in lisp space,
>> I copy them into a lisp string and then do interesting things with
>> them.
> 
> 
> The storage for these strings is indeed handled by expat -- and the
> locations addressed are valid for the duration of the callback only.
> So you are correct in copying them into Lisp space in the callback
> routine.
> 
> I'm wondering if all this foreign string manipulation within the
> dynamic context of a CORBA method invocation is key to where the
> problem lies? Perhaps Lispworks CORBA does something with some static
> buffer space which is also used by the string conversion internals?
> 
> -- Kaelin

One would hope that the LispWorks ORB correctly allocates the space it 
needs, so that there are no buffer overflow problems. My current best 
guess is that, since the LispWorks ORB as shipped does not support a 
multithreaded policy (we had to get a patch for this), there might be 
assumptions made in buffer allocation that are perfectly suitable for 
single-threaded applications but not for multithreaded ones. My 
conclusion rests on the source of the segv that expat reported: a 
character data pointer pointing into never-never land. This may come 
from the expat memory getting corrupted by an overflow from elsewhere, 
most likely a thread-related POA buffer.

Sunil