hello
i've done a little research into how well CL implementations
deal with out-of-memory conditions:
http://blo.udoidio.info/2008/10/out-of-memory-sad-case.html
and, unfortunately, among those tested the only one which handled
it correctly was ABCL, and only because it is JVM-based.
i really wonder why people (implementation authors) do not
consider this important -- it is pretty nasty
when a process crashes or hangs due to an out-of-memory condition
and users cannot deal with it programmatically, except via an
external watchdog process.
my research missed commercial implementations like ACL and
LispWorks, and some of the versions tested are quite outdated. so
if you think the situation has changed, please test it and send
me the results.
for your convenience, here is the test part of the article (on
the blog you can also find an introduction explaining why i consider
out-of-memory handling important):
for my testing i've used a function like this:
(defun testmem (size)
  (ignore-errors
    (loop for i upfrom 1 collect (make-array size))))
the idea is to try different allocation sizes -- as described above,
allocating bigger chunks is less problematic than smaller ones (though in a
real-world situation it is mostly smaller ones, like conses, that get
allocated). ignore-errors is there to avoid breaking into the debugger, as
that can be problematic in a low-memory situation.
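(a side note: the portable hook for this is the standard STORAGE-CONDITION
type, which is a SERIOUS-CONDITION but not an ERROR -- which is exactly why
ignore-errors can miss it. a variant of the test using it might look like the
sketch below; whether an implementation actually signals storage-condition on
heap exhaustion varies, as the results show.)

```lisp
;; sketch: handle the standard STORAGE-CONDITION instead of relying on
;; IGNORE-ERRORS (which only catches subtypes of ERROR). whether heap
;; exhaustion is actually signalled as STORAGE-CONDITION is
;; implementation-dependent.
(defun testmem/storage (size)
  (handler-case
      (loop for i upfrom 1 collect (make-array size))
    (storage-condition (c)
      (format *error-output* "~&allocation failed: ~a~%" c)
      nil)))
```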
SBCL (1.0.13, on Debian GNU/Linux 4.0, x86)
i've limited the heap to 100MB (sbcl --dynamic-space-size 100) and ran the
test with a size of 10000. a real surprise is that ignore-errors did not catch
the OOM condition; SBCL reported it like this:
Heap exhausted during allocation: 114688 bytes available, 400008 requested.
debugger invoked on a SB-KERNEL::HEAP-EXHAUSTED-ERROR in thread...
SBCL does not consider SB-KERNEL::HEAP-EXHAUSTED-ERROR to be a CL:ERROR, but
it is possible to catch it with a (handler-case ...
(SB-KERNEL::HEAP-EXHAUSTED-ERROR ())). with this change to the test code,
it was indeed able to catch the error, but it crashed later during GC:
* (testmem 10000)
Heap exhausted during allocation: 8192 bytes available, 40008 requested.
NIL
* (room)
Dynamic space usage is: 102,380,208 bytes.
Read-only space usage is: 1,912 bytes.
...
Breakdown for dynamic space:
79,359,744 bytes for 61,350 simple-vector objects.
...
* (sb-ext:gc :full t)
Heap exhausted during garbage collection: 0 bytes available, 984 requested.
...
fatal error encountered in SBCL pid 5634(tid 3085342400):
Heap exhausted, game over.
Welcome to LDB, a low-level debugger for the Lisp runtime environment.
ldb>
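for reference, the modified test function described above might look like this
sketch (reconstructed from the text; sb-kernel::heap-exhausted-error is
internal and unexported, so this is fragile across SBCL versions):

```lisp
;; sketch: SBCL-specific variant that catches the internal
;; heap-exhausted condition named in the transcript above
(defun testmem/sbcl (size)
  (handler-case
      (loop for i upfrom 1 collect (make-array size))
    (sb-kernel::heap-exhausted-error ()
      nil)))
```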
for a size of 100000 it worked fine (it successfully lived through a few
cycles), but that is not really inspiring. on the other hand, a size of 1
produced an instant crash:
* (testmem 1)
Heap exhausted during garbage collection: 0 bytes available, 16 requested.
that sucks..
CMUCL (19d, on Debian GNU/Linux 4.0 x86)
in CMUCL the situation was pretty similar to SBCL:
* (testmem 10000)
; [GC threshold exceeded with 12,054,816 bytes in use. Commencing GC.]
; [GC completed with 11,952,672 bytes retained and 102,144 bytes freed.]
...
; [GC will next occur when at least 108,280,736 bytes are in use.]
*A2 gc_alloc_large failed, nbytes=40008.
CMUCL has run out of dynamic heap space (100 MB).
You can control heap size with the -dynamic-space-size commandline option.
Imminent dynamic space overflow has occurred:
Only a small amount of dynamic space is available now. Please note that you
will be returned to the Top-Level without warning if you run out of space
while debugging.
Heap (dynamic space) overflow
[Condition of type KERNEL:HEAP-OVERFLOW]
Restarts:
0: [ABORT] Return to Top-Level.
with handler-case it returned from the function, but then crashed in (room).
testing with a size of 100000 revealed a pretty interesting thing:
; [GC completed with 24,070,808 bytes retained and -2,008 bytes freed.]
; [GC will next occur when at least 36,070,808 bytes are in use.]
that is, GC freed a negative amount of memory -- probably that's why it
crashes during GC. even though it returned from the function, further
operations were botched in CMUCL (even with a size of 1000000), and i gave up.
with a size of 1, CMUCL hung.
Scieneer CL (1.3.8.1, on same Debian)
maybe the commercial offspring of CMUCL does better? let's see.
(the -dynamic-space-size option is not documented, but it seems to work as in
CMUCL)
$ /opt/scl/bin/lisp -dynamic-space-size 100
...
* (testmem 100000)
Scieneer Common Lisp: Internal error, quitting.
Scieneer Common Lisp: Debug dump written to: /tmp/scl-debug-dump-5733.txt
$ cat /tmp/scl-debug-dump-5733.txt
Error event log:
Thread 0xB7C5BBB0: GC alloc_large failed, nbytes=400008.
Thread 0xB7C5BBB0: Lisp memory allocation failure.
...
i like how it created a really nice and detailed dump, but this still sucks.
a size of 1 produced the same result. at least it did not hang..
Clozure Common Lisp (Version 1.2-r10552 on Ubuntu 8.04 x86-64)
Clozure CL has a --heap-reserve parameter, but it works in a really weird way.
for example, with a reserve of 100 000 000, (room) says "-976.359 MB reserved
for heap expansion" -- that is, a negative amount. and it is not joking --
testmem quickly signaled some meaningless internal error when that value was
used. so i guess i have to adjust it somehow for some internal reservation.
experimenting with the parameter, i found that with a reserve of 1 300 000 000
it says "132.375 MB reserved for heap expansion". ok, now let's test it..
nothing new -- with 100000 (and more) it enters the debugger and is able
to recover (it did not hint at how to catch the error programmatically,
though); with 10000 (and less) it hangs without making any progress (probably
an endless GC loop).
CLISP (2.41 on Debian x86)
CLISP does not have a parameter to limit memory allocation, so i could only
test OS mechanisms. in the default configuration this went as expected -- the
oom-killer first killed a nearby SBCL process that was not allocating any
memory at the time, and only then killed the CLISP process that was allocating.
with ulimit, i could limit the maximum virtual memory to 100MB, and CLISP was
able to detect the error:
[2]> (testmem 100000)
map memory to address 0x25a55000 .
[spvw_mmap.d:359] errno = ENOMEM: Not enough memory.
Trying to make room through a GC...
Cannot map memory to address 0x25a54000 .
*** - No more room for LISP objects
Break 1 [3]>
trying to use the debugger hung CLISP, leaving it constantly "trying to make
room through a GC", but if i aborted immediately, it recovered fine.
with small allocations (size of 1) it went into constant GC without
first giving me a chance in the debugger..
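for completeness, the ulimit setup used here is roughly the following (a
sketch: the 100MB value matches the text, and starting clisp is illustrative):

```shell
# sketch: cap virtual memory for this shell and its children.
# ulimit -v takes kilobytes, so 102400 KB is ~100 MB.
ulimit -v 102400
ulimit -v            # confirm the cap: prints 102400
# clisp -q           # (illustrative) mmap past the cap now fails with ENOMEM
```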
ECL (0.9i on Debian x86)
the situation with ECL is similar to CLISP -- it handled a size of 10000 just
fine (and ignore-errors worked in ECL!), but a size of 1 knocked it out:
> (testmem 1)
GC Warning: Out of Memory! Returning NIL!
GC Warning: Out of Memory! Returning NIL!
Segmentation fault
ABCL (0.0.10.5 on Sun Java 1.5.0 on Windows XP x86)
JVM-based ABCL successfully caught the out-of-memory condition, though it
ignored the ignore-errors/handler-case constructs:
> java.exe -Xloggc:gclog.txt -server -Xmx100M -jar abcl.jar
CL-USER(2): (testmem 1)
Debugger invoked on condition of type ERROR:
Out of memory.
Restarts:
0: TOP-LEVEL Return to top level.
what's interesting, it handled allocations of size 1 without problems, and the
"stagnation" period only lasted 12 seconds, about 60% of the total run time.
stagnation becomes a problem with a larger heap -- for example, it took a
really long time with 300 MB -- but in this case an alternative GC algorithm
comes to the rescue: the parallel GC "throws an out-of-memory exception if an
excessive amount of time is being spent collecting a small amount of the
heap". it has configurable parameters, but the defaults were good enough for me.
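for the record, the parallel collector and its GC-overhead limit mentioned
above are selected with HotSpot flags roughly like this (a sketch; the values
shown for -XX:GCTimeLimit and -XX:GCHeapFreeLimit are the documented defaults,
and abcl.jar is the jar from the invocation above):

```shell
# sketch: run ABCL under the parallel GC; HotSpot then throws
# OutOfMemoryError when more than GCTimeLimit% of time is spent in GC
# while recovering less than GCHeapFreeLimit% of the heap
java -Xmx300M -XX:+UseParallelGC \
     -XX:GCTimeLimit=98 -XX:GCHeapFreeLimit=2 \
     -jar abcl.jar
```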
On 2008-10-23, Alex Mizrahi <········@users.sourceforge.net> wrote:
> hello
>
> i've made a little research on how good CL implementations
> deal with out of memory conditions:
> http://blo.udoidio.info/2008/10/out-of-memory-sad-case.html
>
> and, unfortunately, among tested the only one which handled
> it correctly was ABCL, and only because it is JVM-based.
On what platform? If you're running on, say, Linux, the OS itself won't be
graceful about this. Firstly, in default configurations it will allow the
memory mapper to over-commit allocations. This means that memory allocations
will fail at page-fault time. I.e. when a process touches an "allocated" page
for the first time, and triggers a page fault, if the frame cannot be
allocated, the process will be given a fatal signal.
You can turn on strict overcommit accounting, but there are still bugs with
that; the kernel can allocate memory for itself which will create a de-facto
overcommit situation anyway. When the system is fully committed, nothing
prevents the kernel from allocating more pages. At best you can play with
the ratio. Say, allow total virtual memory to be no larger than 80% of
RAM + swap, and then hope (pray) the kernel doesn't ever need more than the
remaining 20%. You may have to tune various network buffers via sysctl, etc.
Basically, dealing gracefully with OOM at the application level on an OS like
Linux is a waste of time.
First, you need an OS with deterministic out of memory behavior. All else
can only follow.
> i've limited heap to 100MB (sbcl --dynamic-space-size 100) and runned test
I.e. you set up a simulated OOM condition, not a real one at the system level.
> First, you need an OS with deterministic out of memory behavior. All else
> can only follow.
Any examples of what such an OS could be (or compile options/config
options?)
On 2008-10-23, Dimiter "malkia" Stanev <······@mac.com> wrote:
>
>> First, you need an OS with deterministic out of memory behavior. All else
>> can only follow.
>
> Any examples of what such an OS could be (or compile options/config
> options?)
On Linux, you could approximate it by turning on strict overcommit,
with a conservative ratio. E.g.
# /etc/sysctl.conf
vm.overcommit_memory = 2
vm.overcommit_ratio = 80
This means if you have a gigabyte of (RAM plus swap), the total size of all
(accountable) mappings allocated by mmap cannot exceed about 800 megs in
virtual size.
Under these conditions, mmap actually fails cleanly before you run out of
memory. Unless the kernel goes crazy and eats up more than the allotted
reserve, causing existing mappings to become overcommitted.
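The resulting limit can be inspected on a live Linux system -- with strict
accounting the kernel publishes it as CommitLimit in /proc/meminfo (a sketch;
assumes a Linux /proc):

```shell
# sketch: inspect the kernel's overcommit settings and the commit
# limit computed from vm.overcommit_ratio (swap + ratio% of RAM)
cat /proc/sys/vm/overcommit_memory        # 2 = strict accounting
grep -E 'CommitLimit|Committed_AS' /proc/meminfo
```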
??>> First, you need an OS with deterministic out of memory behavior. All
??>> else can only follow.
DmS> Any examples of what such an OS could be (or compile options/config
DmS> options?)
Solaris does not have overcommit enabled by default (iirc).
or you can disable overcommit in Linux this way:
echo 2 > /proc/sys/vm/overcommit_memory
or you can just make a large swap -- it will deterministically go uber-slow
due to thrashing, but won't randomly kill processes..
??>> i've limited heap to 100MB (sbcl --dynamic-space-size 100) and runned
??>> test
KK> I.e. you set up a simulated OOM condition, not a real one at the system
KK> level.
can you, um, RTFA? there is a discussion of exactly this stuff there
On Oct 23, 1:35 pm, "Alex Mizrahi" <········@users.sourceforge.net>
wrote:
> i've made a little research on how good CL implementations
> deal with out of memory conditions:
> [rest of article snipped]
It's worth noting that LispWorks does do the correct thing.
On Ubuntu 8.04 with ulimit -v 51200
CL-USER> (princ (nth-value 1 (testmem 10000))) ;; or (testmem 1)
<!> Failed to enlarge memory
<!> Failed to enlarge memory
<!> Failed to enlarge memory
<!> Failed to enlarge memory
<**> Failed to allocate object of size 9c48
Failed to allocate object of size 40008 bytes ;; <- that's our error
=> #<SIMPLE-ERROR 200906AB>
OSX seemed to ignore ulimit (or I'm just doing something wrong), but the
results were much the same: LW just gobbled up the remaining 2.5GB of RAM
before printing the error.
So it's not all doom & gloom.
The next release of ECL will also feature detection and handling of
out of memory conditions, using the facilities provided by the
Boehm-Demers-Weiser garbage collector. The code is currently in an unstable
branch of our Git repository, but seems to work just fine.
$ ecl -norc
ECL (Embeddable Common-Lisp) 8.10.0
Copyright (C) 1984 Taiichi Yuasa and Masami Hagiya
Copyright (C) 1993 Giuseppe Attardi
Copyright (C) 2000 Juan J. Garcia-Ripoll
ECL is free software, and you are welcome to redistribute it
under certain conditions; see file 'Copyright' for details.
Type :h for Help. Top level.
> (defun testmem (size) (ignore-errors (loop for i upfrom 1 collect (make-array size))))
TESTMEM
> (testmem 1)
GC Warning: Out of Memory! Returning NIL!
Memory limit reached. Please jump to an outer point or quit program.
Broken at SI:BYTECODES.No restarts available.
Broken at TESTMEM.
>> :q
Top level.
>