From: Greg Menke
Subject: Is Greenspun enough?
Date: 
Message-ID: <m3k6elq1or.fsf@athena.pienet>
Just got hold of some C++ at work, while I've not discovered a homebrew
GC yet I have discovered a buggy homebrew implementation of sprintf.
Naturally it's been rendered Politically Correct and thus is an even
greater PITA because it was implemented as snprintf.

I asked the developer why it was there and couldn't we please just use
the local C library version- the answer was that the reimplementation
was required because the developer wanted to be able to print objects
usefully.  Sigh.. the usual proof of what happens when we pretend that
low level languages are high level.

So this app is clearly headed towards demonstrating Greenspun yet again
but it's not quite there.  Could some corollary be derived to express the
wholesale, buggy reimplementation of C library elements in a C++ app
just to facilitate a minor (and entirely nonstandard) enhancement?

Regards,

Gregm

From: Kaz Kylheku
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <1133727729.726908.218580@g49g2000cwa.googlegroups.com>
Greg Menke wrote:
> Just got hold of some C++ at work, while I've not discovered a homebrew
> GC yet I have discovered a buggy homebrew implementation of sprintf.
> Naturally it's been rendered Politically Correct and thus is an even
> greater PITA because it was implemented as snprintf.

This is basically stupidity, because they are working in C++, yet
extending an outdated, fragile interface to handle objects.

In C++, you can write an object to do the job.

I once made one.  It was a class which you'd construct, passing a
format string into the constructor.

Then, member function calls were used to shove in the arguments.

The format string was type generic; no such thing as a string
conversion specifier accidentally being matched by an integer.

Moreover, positional numbers in the specifiers meant the arguments
could easily be rendered in other than their order of appearance.

You never actually had to name the object; it was good enough to write
a constructor expression to create a temporary, something like:

   str = (format("%2 %1 %3"), arg, arg, arg);

The format class had an overloaded comma operator, and an operator to
convert to a string.  Of course there was syntax for field width,
precision and all that.
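The description above can be sketched in a few lines.  This is a
hypothetical reconstruction, not the original class; it assumes the
arguments are anything streamable and supports only bare %1..%9
specifiers (no field width or precision, which the real one had):

```cpp
#include <sstream>
#include <string>
#include <vector>

class format {
    std::string fmt;                     // e.g. "%2 %1 %3"
    std::vector<std::string> args;
public:
    explicit format(const std::string &f) : fmt(f) {}

    // The comma operator "shoves in" arguments.  It is type generic:
    // any streamable value is rendered, so an integer can never be
    // mismatched against a string conversion specifier.
    template <typename T>
    format &operator,(const T &arg) {
        std::ostringstream os;
        os << arg;
        args.push_back(os.str());
        return *this;
    }

    // Conversion to string substitutes %1..%9 by position, so the
    // arguments may be rendered out of their order of appearance.
    operator std::string() const {
        std::string out;
        for (std::string::size_type i = 0; i < fmt.size(); ++i) {
            if (fmt[i] == '%' && i + 1 < fmt.size()
                && fmt[i + 1] >= '1' && fmt[i + 1] <= '9') {
                std::string::size_type n = fmt[++i] - '1';
                if (n < args.size())
                    out += args[n];
            } else {
                out += fmt[i];
            }
        }
        return out;
    }
};
```

With that, `str = (format("%2 %1 %3"), arg, arg, arg);` works as
described: the temporary collects the arguments via the comma operator
and converts on assignment.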
From: Bruce Hoult
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <bruce-86A090.14024904122005@news.clear.net.nz>
In article <··············@athena.pienet>,
 Greg Menke <············@toadmail.com> wrote:

> Just got hold of some C++ at work, while I've not discovered a homebrew
> GC yet I have discovered a buggy homebrew implementation of sprintf.
> Naturally it's been rendered Politically Correct and thus is an even
> greater PITA because it was implemented as snprintf.
> 
> I asked the developer why it was there and couldn't we please just use
> the local C library version- the answer was the reimplementation was
> required because the developer wanted to be able to print objects
> usefully.  Sigh.. the usual proof that this is what happens when we
> pretend that low level languages are high level.
> 
> So this app is clearly headed towards demonstrating Greenspun yet again
> but it's not quite there.  Could some corollary be derived to express the
> wholesale, buggy reimplementation of C library elements in a C++ app
> just to facilitate a minor (and entirely nonstandard) enhancement?

But C++ streams do that already!  And as nicely as you *can* do it (with 
type checking) for arbitrary objects in C++.

Hmm .. couldn't you use regular printf if your objects had to_string() 
methods that returned a string class (to free storage when it goes out 
of scope) with an implicit conversion to char* available?
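A sketch of that idea, with hypothetical names throughout.  One
wrinkle: a varargs function like printf() never applies user-defined
conversions, so the char* conversion has to be forced with a cast; the
temporary string object then lives to the end of the full expression,
so the pointer stays valid for the duration of the call and the
storage is freed afterwards:

```cpp
#include <cstdio>
#include <string>

// String class whose storage goes away when it leaves scope, with an
// implicit conversion to const char*.
class tmpstr {
    std::string s;
public:
    explicit tmpstr(const std::string &str) : s(str) {}
    operator const char *() const { return s.c_str(); }
};

// Example object with a to_string() method.
struct point {
    int x, y;
    tmpstr to_string() const {
        char buf[64];
        std::snprintf(buf, sizeof buf, "(%d,%d)", x, y);
        return tmpstr(buf);
    }
};
```

Usage would then be `std::printf("at %s\n", (const char *)p.to_string());`
where the cast is required to get past the ellipsis.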

-- 
Bruce |  41.1670S | \  spoken |          -+-
Hoult | 174.8263E | /\ here.  | ----------O----------
From: Greg Menke
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <m34q5psj6t.fsf@athena.pienet>
Bruce Hoult <·····@hoult.org> writes:

> > So this app is clearly headed towards demonstrating Greenspun yet again
> > but it's not quite there.  Could some corollary be derived to express the
> > wholesale, buggy reimplementation of C library elements in a C++ app
> > just to facilitate a minor (and entirely nonstandard) enhancement?
> 
> But C++ streams do that already!  And as nicely as you *can* do it (with 
> type checking) for arbitrary objects in C++.
> 
> Hmm .. couldn't you use regular printf if your objects had to_string() 
> methods that returned a string class (to free storage when it goes out 
> of scope) with an implicit conversion to char* available?

The homebrew snprintf is part of a suite of classes and functions
forming an event logging subsystem, the C oriented stuff forms the
bottom few layers- object methods call out to the reimplemented C
routines.  I imagine it could be redesigned but my assignment is to use
(and presumably debug) the app, not fix it.

Since I don't care about the additional sprintf functionality at this
point I think my approach is to rip out the homebrew stuff and call the
C library stuff instead...

Gregm
From: Björn Lindberg
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <9mpek4rj24h.fsf@muvclx01.cadence.com>
Greg Menke <············@toadmail.com> writes:

> Bruce Hoult <·····@hoult.org> writes:
> 
> > > So this app is clearly headed towards demonstrating Greenspun yet again
> > > but it's not quite there.  Could some corollary be derived to express the
> > > wholesale, buggy reimplementation of C library elements in a C++ app
> > > just to facilitate a minor (and entirely nonstandard) enhancement?
> > 
> > But C++ streams do that already!  And as nicely as you *can* do it (with 
> > type checking) for arbitrary objects in C++.
> > 
> > Hmm .. couldn't you use regular printf if your objects had to_string() 
> > methods that returned a string class (to free storage when it goes out 
> > of scope) with an implicit conversion to char* available?
> 
> The homebrew snprintf is part of a suite of classes and functions
> forming an event logging subsystem, the C oriented stuff forms the
> bottom few layers- object methods call out to the reimplemented C
> routines.  I imagine it could be redesigned but my assignment is to use
> (and presumably debug) the app, not fix it.
> 
> Since I don't care about the additional sprintf functionality at this
> point I think my approach is to rip out the homebrew stuff and call the
> C library stuff instead...

There is a library called, believe it or not, Format in Boost that
seems to do what you need.


Björn
From: ·········@gmail.com
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <1133722697.111716.186550@g44g2000cwa.googlegroups.com>
Also, there is "serialize" in Boost.  That stuff is AWESOME!  (If you
have no choice but to use C++, that is.)
From: George Neuner
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <9me8p1l40htcd78ic2o9iafiflvhl3dvfh@4ax.com>
On 03 Dec 2005 17:02:28 -0500, Greg Menke <············@toadmail.com>
wrote:

>
>Just got hold of some C++ at work, while I've not discovered a homebrew
>GC yet I have discovered a buggy homebrew implementation of sprintf.
> :
>I asked the developer why it was there and couldn't we please just use
>the local C library version- the answer was the reimplementation was
>required because the developer wanted to be able to print objects
>usefully.

That is not unusual for a small embedded application.

Suppose you need formatted printing but you know, for example, that
you will never print floating point values and memory is tight so you
don't want to include any unnecessary printf() helper functions in the
executable.   You copy the source to printf() - which is included in
your cross-compiler library - and remove all the floating point stuff.
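As a rough illustration of the technique (hypothetical code, not from
any real cross-compiler library): a formatted printer stripped down to
%d, %s and %%, so no floating-point conversion code can be pulled into
the executable.

```cpp
#include <cstdarg>
#include <cstddef>

// Trimmed snprintf work-alike: handles only %d, %s and %%.
// 'size' must be at least 1; output is silently truncated.
std::size_t tiny_snprintf(char *out, std::size_t size, const char *fmt, ...)
{
    std::va_list ap;
    va_start(ap, fmt);
    std::size_t n = 0;
    for (; *fmt; ++fmt) {
        if (*fmt != '%') {
            if (n + 1 < size) out[n++] = *fmt;
            continue;
        }
        ++fmt;
        if (*fmt == 'd') {                      // decimal integer
            int v = va_arg(ap, int);
            unsigned long u = v < 0 ? (unsigned long)-(long)v
                                    : (unsigned long)v;
            char tmp[12];
            int len = 0;
            do { tmp[len++] = (char)('0' + u % 10); u /= 10; } while (u);
            if (v < 0 && n + 1 < size) out[n++] = '-';
            while (len > 0 && n + 1 < size) out[n++] = tmp[--len];
        } else if (*fmt == 's') {               // string
            const char *s = va_arg(ap, const char *);
            while (*s && n + 1 < size) out[n++] = *s++;
        } else if (*fmt == '%') {               // literal percent
            if (n + 1 < size) out[n++] = '%';
        } else if (*fmt == '\0') {
            break;                              // stray '%' at end
        }                                       // unknown specifiers dropped
    }
    va_end(ap);
    out[n] = '\0';
    return n;
}
```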

In fact it's pretty normal to _have_ to modify the library startup
functions to set up heap and so forth, but I've never seen a custom
implementation of a standard library routine done for any other
purpose than embedded.  

Certainly the reason given sounds ridiculous.

George
--
for email reply remove "/" from address
From: George Neuner
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <2jg8p1p5u9vbgl90jln4nvcjc4lmf64fo8@4ax.com>
On Mon, 05 Dec 2005 08:31:03 -0500, George Neuner
<·········@comcast.net> wrote:

>In fact it's pretty normal to _have_ to modify the library startup
>functions to set up heap and so forth, but I've never seen a custom
>implementation of a standard library routine done for any other
>purpose than embedded.  

Should have added "other than alloc/free and their ilk" which are
reimplemented over and over by C programmers all the time.

George
--
for email reply remove "/" from address
From: Ulrich Hobelmann
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <3vjb92F165mh8U1@individual.net>
George Neuner wrote:
> On Mon, 05 Dec 2005 08:31:03 -0500, George Neuner
> <·········@comcast.net> wrote:
> 
>> In fact it's pretty normal to _have_ to modify the library startup
>> functions to set up heap and so forth, but I've never seen a custom
>> implementation of a standard library routine done for any other
>> purpose than embedded.  
> 
> Should have added "other than alloc/free and their ilk" which are
> reimplemented over and over by C programmers all the time.

I usually only layer on top of them for convenience.  But should I ever 
like a single malloc, I'd probably want to write a faster version (well, 
also layering on malloc, which I'd use to get maybe 100k blocks).

I assume a substantial fraction of memory waste (pardon, usage) and 
runtime originates from mallocing stuff, filling values in, moving them 
around, and then maybe deleting stuff again (but most programs don't 
seem to bother, be it in Java, C++, or even Aquamacs; their mem usage 
increases until I restart them once they reach 100-200 MB).  The part of 
an application that actually *does* something must be a fraction.

-- 
Majority, n.: That quality that distinguishes a crime from a law.
From: George Neuner
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <dih9p1p801clulh8qpfh9lbs9k2obtgc38@4ax.com>
On Mon, 05 Dec 2005 18:21:37 +0100, Ulrich Hobelmann
<···········@web.de> wrote:

>George Neuner wrote:
>> On Mon, 05 Dec 2005 08:31:03 -0500, George Neuner
>> <·········@comcast.net> wrote:
>> 
>>> In fact it's pretty normal to _have_ to modify the library startup
>>> functions to set up heap and so forth, but I've never seen a custom
>>> implementation of a standard library routine done for any other
>>> purpose than embedded.  
>> 
>> Should have added "other than alloc/free and their ilk" which are
>> reimplemented over and over by C programmers all the time.
>
>I usually only layer on top of them for convenience.  But should I ever 
>like a single malloc, I'd probably want to write a faster version (well, 
>also layering on malloc, which I'd use to get maybe 100k blocks).

I've written several custom allocators from scratch for realtime use
on various platforms, including a cute 2 dimensional tiled allocator
for a proprietary image processor, but for desktop/server apps the
most I have ever needed to do is preallocate arrays of objects and do
ring, stack or list management on the arrays [a driver might need the
array memory to be non-pageable].  But I would only bother if the
situation could not be handled by the default language or OS
allocator(s).
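For illustration, the preallocate-and-recycle pattern might look like
this minimal sketch (names invented; a realtime version would add
locking or per-thread pools):

```cpp
#include <cstddef>

// Fixed pool: N objects carved out up front, recycled through a stack
// of free slots, so nothing touches the system allocator after startup.
template <typename T, std::size_t N>
class FixedPool {
    T slots[N];
    T *free_stack[N];
    std::size_t top;
public:
    FixedPool() : top(N) {
        for (std::size_t i = 0; i < N; ++i)
            free_stack[i] = &slots[i];
    }
    T *acquire() {                      // O(1); returns 0 when exhausted
        return top ? free_stack[--top] : 0;
    }
    void release(T *p) {                // p must come from acquire()
        free_stack[top++] = p;
    }
    std::size_t available() const { return top; }
};
```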


>I assume a substantial fraction of memory waste (pardon, usage) and 
>runtime originates from mallocing stuff, filling values in, moving them 
>around, and then maybe deleting stuff again (but most programs don't 
>seem to bother, be it in Java, C++, or even Aquamacs; their mem usage 
>increases until I restart them once they reach 100-200 MB).  The part of 
>an application that actually *does* something must be a fraction.

Applications tend to monotonically increase their range of reserved
virtual addresses ... precious few ever bother to release unused pages
back to the system.  Unless the application tracks and reports on its
true memory usage [e.g., logging GC], there is usually no way to tell
how much of the system reported reserved range is really in use.

George 
--
for email reply remove "/" from address
From: Ulrich Hobelmann
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <3vkpa7F15npkiU1@individual.net>
George Neuner wrote:
> I've written several custom allocators from scratch for realtime use
> on various platforms, including a cute 2 dimensional tiled allocator
> for a proprietary image processor, but for desktop/server apps the
> most I have ever needed to do is preallocate arrays of objects and do
> ring, stack or list management on the arrays [a driver might need the
> array memory to be non-pageable].  But I would only bother if the
> situation could not be handled by the default language or OS
> allocator(s).

Yeah, usually I don't bother, but for instance I sometimes allocate 
objects with identical lifetimes in one big chunk, so I can free it all 
at once ("regions").  Malloc wasn't built for that.  In fact I wonder 
why everybody uses it as if it were general-purpose; it's just one way 
to de-/allocate memory, and certainly not suited for small objects, 
fast allocation, and low space overhead (in the implementations I know of).
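A bare-bones sketch of such a region (hypothetical, and ignoring
alignment subtleties beyond 8 bytes): grab ~100k blocks from malloc,
bump-allocate out of them, and free every block at once in the
destructor.

```cpp
#include <cstdlib>
#include <cstddef>
#include <new>

class Region {
    struct Block { Block *next; };
    Block *blocks;                  // chain of malloc'd chunks
    char *ptr, *end;                // bump window in the newest chunk
    std::size_t block_size;
public:
    explicit Region(std::size_t bs = 100 * 1024)
        : blocks(0), ptr(0), end(0), block_size(bs) {}

    void *alloc(std::size_t n) {
        n = (n + 7) & ~std::size_t(7);              // 8-byte alignment
        if (std::size_t(end - ptr) < n) {           // need a new chunk
            std::size_t payload = n > block_size ? n : block_size;
            Block *b = static_cast<Block *>(
                std::malloc(sizeof(Block) + payload));
            if (!b) throw std::bad_alloc();
            b->next = blocks;
            blocks = b;
            ptr = reinterpret_cast<char *>(b + 1);
            end = ptr + payload;
        }
        void *p = ptr;
        ptr += n;
        return p;
    }

    ~Region() {                                     // free it all at once
        while (blocks) {
            Block *next = blocks->next;
            std::free(blocks);
            blocks = next;
        }
    }
private:
    Region(const Region &);                         // non-copyable
    Region &operator=(const Region &);
};
```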

>> I assume a substantial fraction of memory waste (pardon, usage) and 
>> runtime originates from mallocing stuff, filling values in, moving them 
>> around, and then maybe deleting stuff again (but most programs don't 
>> seem to bother, be it in Java, C++, or even Aquamacs; their mem usage 
>> increases until I restart them once they reach 100-200 MB).  The part of 
>> an application that actually *does* something must be a fraction.
> 
> Applications tend to monotonically increase their range of reserved
> virtual addresses ... precious few ever bother to release unused pages
> back to the system.  Unless the application tracks and reports on its
> true memory usage [e.g., logging GC], there is usually no way to tell
> how much of the system reported reserved range is really in use.

And that means after a while you'll page out stuff that'd be in fact 
unused.  It's like user applications blocking memory, by caching things. 
  You'd think that caching should be unified in the OS itself, because 
then the OS can decide when to un-cache (instead of taking up all 
memory) to free up memory.

Often even the OS cache is stupid, and after having watched a movie at 
your computer, you notice that everything else was paged out in the 
process, just so the whole stupid movie could fit into the file cache 
(as if you ever wanted to cache sequential stream-like files, for using 
them more than once).

Oh well, people have been throwing more CPU and more RAM at the problem 
for decades, and today's computers are almost as fast as they were in 
the late '80s.  No need to worry, or to develop something decent. :)

-- 
Majority, n.: That quality that distinguishes a crime from a law.
From: George Neuner
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <icscp11h6j1atgbog4fk3bbpdg7beim75h@4ax.com>
On Tue, 06 Dec 2005 07:27:18 +0100, Ulrich Hobelmann
<···········@web.de> wrote:

>George Neuner wrote:
>> 
>> Applications tend to monotonically increase their range of reserved
>> virtual addresses ... precious few ever bother to release unused pages
>> back to the system.  Unless the application tracks and reports on its
>> true memory usage [e.g., logging GC], there is usually no way to tell
>> how much of the system reported reserved range is really in use.
>
>And that means after a while you'll page out stuff that'd be in fact 
>unused.  It's like user applications blocking memory, by caching things. 
>  You'd think that caching should be unified in the OS itself, because 
>then the OS can decide when to un-cache (instead of taking up all 
>memory) to free up memory.

Unused code/data doesn't waste RAM - only virtual addresses and disk
swap space.  Most OSes memory map executables directly from the file
system so code doesn't pollute the file cache or swap space.


>Often even the OS cache is stupid, and after having watched a movie at 
>your computer, you notice that everything else was paged out in the 
>process, just so the whole stupid movie could fit into the file cache 
>(as if you ever wanted to cache sequential stream-like files, for using 
>them more than once).

I don't think it's possible in most current OSes to bypass the file
cache when reading using standard I/O calls.  Even worse, sequential
read optimization uses more cache space than random read of the same
volume of data.  I would say the failing was the app developer who
should have known to map the file and do manual buffering instead of
reading it.


IMO, there are currently many "professional" software developers who
would be doing the world a big favor by leaving the profession.  I
don't even want to think about what kind of crap will be produced by
the next generation of developers who will cut their teeth on 64-bit
machines with gigabytes of RAM and terabytes of disk storage ... the
term "limited resource" won't even have meaning for them.

George
--
for email reply remove "/" from address
From: Ulrich Hobelmann
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <3vnlb2F171j6kU1@individual.net>
George Neuner wrote:
> Unused code/data doesn't waste RAM - only virtual addresses and disk
> swap space.  Most OSes memory map executables directly from the file
> system so code doesn't pollute the file cache or swap space.

But they cache them, and the swapping-out becomes noticeable as memory 
needs to be freed to store, say, heap objects.

> I don't think it's possible in most current OSes to bypass the file
> cache when reading using standard I/O calls.  Even worse, sequential
> read optimization uses more cache space than random read of the same
> volume of data.  I would say the failing was the app developer who
> should have known to map the file and do manual buffering instead of
> reading it.

mmap()ed files are also cached, no?

I'd like to be able to tell the OS not to cache whatever file I'm 
reading sequentially (or just the "current" couple of 100k).  Or to 
cache executable contents over big data file chunks, so I don't have to 
wait so much for applications to swap back in.

It's like people telling their Squid web cache not to cache everything, 
because that'd clearly be counterproductive.  Yet in the OS world we're 
not quite there yet.  (Now if only I had the time to do a decent OS...)
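For what it's worth, POSIX does expose a hint along these lines:
posix_fadvise(2), from POSIX.1-2001.  It is advisory only, and a
kernel is free to ignore it.  A sketch of streaming a file without
leaving it parked in the page cache, with error handling trimmed:

```cpp
#include <fcntl.h>
#include <unistd.h>

// Read 'path' front to back; returns bytes read or -1 on error.
long stream_file(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    // Access will be sequential: read ahead aggressively.
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char buf[64 * 1024];
    long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        total += n;

    // The data won't be wanted again: drop it from the page cache.
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    close(fd);
    return n < 0 ? -1 : total;
}
```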

> IMO, there are currently many "professional" software developers who
> would be doing the world a big favor by leaving the profession.  I
> don't even want to think about what kind of crap will be produced by
> the next generation of developers who will cut their teeth on 64-bit
> machines with gigabytes of RAM and terabytes of disk storage ... the
> term "limited resource" won't even have meaning for them.

True.  Fortunately we don't have to use all their crap.  Maybe someday 
we'll build alternatives.

-- 
Majority, n.: That quality that distinguishes a crime from a law.
From: George Neuner
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <dg1ep19dgdn42vb24uoscjv04oplc18o2j@4ax.com>
On Wed, 07 Dec 2005 09:37:53 +0100, Ulrich Hobelmann
<···········@web.de> wrote:

>George Neuner wrote:
>> Unused code/data doesn't waste RAM - only virtual addresses and disk
>> swap space.  Most OSes memory map executables directly from the file
>> system so code doesn't pollute the file cache or swap space.
>
>But they cache them, and the swapping-out becomes noticeable as memory 
>needs to be freed to store, say, heap objects.

Interesting ... I don't know many people who consider virtual memory
to be a "cache" ... at least not unless the discussion involves SVM
coherence.

Anyway, "swapping" implies data movement which is not necessarily the
case.  Paging in always involves a read from disk, but when paging out
only writable pages whose contents have changed need to be stored back
to disk.   Read-only pages, and writable data pages that have been
through at least one out-in cycle and still remain unmodified, will
simply be overwritten because their contents can be reread from the
disk.


>> I don't think it's possible in most current OSes to bypass the file
>> cache when reading using standard I/O calls.  Even worse, sequential
>> read optimization uses more cache space than random read of the same
>> volume of data.  I would say the failing was the app developer who
>> should have known to map the file and do manual buffering instead of
>> reading it.
>
>mmap()ed files are also cached, no?

No.  Mapped files are handled by the virtual memory system and all
modern systems DMA pages directly to/from disk with no buffering.
Obviously there is a mechanism to locate the disk block, so there is
indexing information for open mapped files and swap devices kept in
RAM, but the contents of the files are not cached.

If you read the documentation regarding mmap(), or the Windows
equivalent CreateFileMapping(), you will see scary warnings telling
you not to use normal I/O calls on a file while it is mapped.  The
result of doing so is unspecified and can be, um ... bad!

Mapping data files is straightforward, but executables have a twist.
Modern compilers/linkers produce relocatable code with metadata which
allows the OS loader to rebase the code for a different load address.
COFF, ELF and PE files contain relocation metadata as well as code.

The compiler's default base address works fine for programs ... you
only load 1 program into the process address space ... but trying to
use multiple DLLs at the same base address won't work - only the first
library loaded can use the address and all the others must be
relocated.

I haven't investigated exactly *how* current OS loaders handle code
relocation.  It used to be done by making a modified copy on the swap
device and running it from there.  I suspect that now it is done on
demand by the page fault handler according to whatever process/mapping
combination currently controls the page ... that would seem to make
sense even if it slows page handling a bit because it conserves swap
space for data pages and allows different code mappings to share the
same virtual pages.


>I'd like to be able to tell the OS not to cache whatever file I'm 
>reading sequentially (or just the "current" couple of 100k).  

I think for that you'll have to get friendly with mmap() and/or
CreateFileMapping().


>Or to cache executable contents over big data file chunks, so I 
>don't have to wait so much for applications to swap back in.

Not exactly sure what you mean here, but modern systems don't swap
processes ... just pages.  The code you want to execute or the data it
needs might be on the _only_ page in the whole application that is not
currently in RAM [unlikely but possible].

Some older paging Un*x systems tracked the process working set and
tried to keep it together.  When the process was scheduled, the system
checked the current working set and brought in any missing pages
before restarting the process.  Perhaps this is what you meant.

WS swapping was popular for a while, but it was abandoned by the time
Unix System III arrived.  It was found to cause thrashing in busy
systems and proved to be difficult to tune effectively for different
hardware configurations.  If RAM is overcommitted, trying to bring in
all the WS pages for one process just pushes out needed WS pages for
other processes.  Great disk workout ensues.

Ultimately OS developers settled on the current system - single page
demand replacement - as the best compromise solution.  It made sense
to abandon WS swapping when memory was tight.  Now with huge memories
becoming the norm it might make sense to revisit it as a performance
boost. 

George
--
for email reply remove "/" from address
From: Ulrich Hobelmann
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <3vqdcdF168q69U1@individual.net>
George Neuner wrote:
> On Wed, 07 Dec 2005 09:37:53 +0100, Ulrich Hobelmann
> <···········@web.de> wrote:
> 
>> George Neuner wrote:
>>> Unused code/data doesn't waste RAM - only virtual addresses and disk
>>> swap space.  Most OSes memory map executables directly from the file
>>> system so code doesn't pollute the file cache or swap space.
>> But they cache them, and the swapping-out becomes noticeable as memory 
>> needs to be freed to store, say, heap objects.
> 
> Interesting ... I don't know many people who consider virtual memory
> to be a "cache" ... at least not unless the discussion involves SVM
> coherence.

No, virtual memory is just VM.  But stuff that is actually populated 
needs physical memory (or disk swap) to live in, and as soon as a file 
is accessed most OSes swap everything out (or just clear the caches for 
executable files and other resources those programs might be using) to 
cache the stupid huge file.

Later you want to work with the program again, and every file it uses, 
maybe even the dynamic libraries, has to be paged in again.  That's 
often much slower than starting a program from scratch after boot-up.  I 
suppose file-systems have better read locality, but swap space isn't 
organized very well (sequentially).

> Anyway, "swapping" implies data movement which is not necessarily the
> case.  Paging in always involves a read from disk, but when paging out
> only writable pages whose contents have changed need to be stored back
> to disk.   Read-only pages, and writable data pages that have been
> through at least one out-in cycle and still remain unmodified, will
> simply be overwritten because their contents can be reread from the
> disk.

Yes, but that takes time too, and why cache a file at all that's just 
used once (a media stream) and then probably never accessed again?

>> mmap()ed files are also cached, no?
> 
> No.  Mapped files are handled by the virtual memory system and all
> modern systems DMA pages directly to/from disk with no buffering.

Seriously?  I thought every common OS would buffer/cache most pages in 
those files, so it doesn't need to access disk every time something is 
read or written (just with normal file reads/writes).

> Obviously there is a mechanism to locate the disk block, so there is
> indexing information for open mapped files and swap devices kept in
> RAM, but the contents of the files are not cached.

Yeah, swap isn't. :)

> If you read the documentation regarding mmap(), or the Windows
> equivalent CreateFileMapping(), you will see scary warnings telling
> you not to use normal I/O calls on a file while it is mapped.  The
> result of doing so is unspecified and can be, um ... bad!
> 
> Mapping data files is straightforward, but executables have a twist.

I wager most libraries and maybe even executables on Unix systems are 
mmap()ed, so I really can't believe those files aren't ever cached, but 
reread from disk on every access...

> Modern compilers/linkers produce relocatable code with metadata which
> allows the OS loader to rebase the code for a different load address.
> COFF, ELF and PE files contain relocation metadata as well as code.
> 
> The compiler's default base address works fine for programs ... you
> only load 1 program into the process address space ... but trying to
> use multiple DLLs at the same base address won't work - only the first
> library loaded can use the address and all the others must be
> relocated.

Yep.

> I haven't investigated exactly *how* current OS loaders handle code
> relocation.  It used to be done by making a modified copy on the swap
> device and running it from there.  I suspect that now it is done on
> demand by the page fault handler according to whatever process/mapping
> combination currently controls the page ... that would seem to make
> sense even if it slows page handling a bit because it conserves swap
> space for data pages and allows different code mappings to share the
> same virtual pages.

I think most functions are accessed only indirectly through a table; 
yes, sounds awfully slow, but I too haven't exactly understood the 
badly-documented machine-level stuff that happens inside a typical 
dynamic loader. :(

The dll itself (i.e. the code) is probably just mmap()ed and so 
demand-paged.

>> I'd like to be able to tell the OS not to cache whatever file I'm 
>> reading sequentially (or just the "current" couple of 100k).  
> 
> I think for that you'll have to get friendly with mmap() and/or
> CreateFileMapping().

No, I'm just talking about the way memory is handled on most current 
systems.  If I had time to do my own system, it'd include a file-open 
flag called don't-cache-this-file, so the file would be cached with 
lowest priority (dropped from cache as soon as anything else needed 
memory).  I'm not sure most current systems even have memory priorities 
(like they do have scheduling priorities), only some basic LRU.

>> Or to cache executable contents over big data file chunks, so I 
>> don't have to wait so much for applications to swap back in.
> 
> Not exactly sure what you mean here, but modern systems don't swap
> processes ... just pages.  The code you want to execute or the data it
> needs might be on the _only_ page in the whole application that is not
> currently in RAM [unlikely but possible].

When applications are swapped in, it usually takes LONG, so I actually 
close them whenever I don't have any state open in them.  Ok, now with 1GB 
I'm not that picky anymore, but still I notice that when music or video 
files fill up my cache, there's a small slowdown when re-starting apps, 
because their files aren't in the cache anymore.

> Some older paging Un*x systems tracked the process working set and
> tried to keep it together.  When the process was scheduled, the system
> checked the current working set and brought in any missing pages
> before restarting the process.  Perhaps this is what you meant.

No, systems do that (with a global working-set AFAIK), but they allow 
some random big file to take up all cache memory and have application 
files (and executables) either swapped out or just dropped from physical 
memory (for read-only files).  That's because they only have basic LRU 
from the '70s, with no priorities attached to memory blocks at all.

> WS swapping was popular for a while, but it was abandoned by the time
> Unix System III arrived.  It was found to cause thrashing in busy
> systems and proved to be difficult to tune effectively for different
> hardware configurations.  If RAM is overcommitted, trying to bring in
> all the WS pages for one process just pushes out needed WS pages for
> other processes.  Great disk workout ensues.

Yes, that's the problem with local instead of global policies.  Since 
it's the OS's memory, the OS can best manage it itself.

> Ultimately OS developers settled on the current system - single page
> demand replacement - as the best compromise solution.  It made sense
> to abandon WS swapping when memory was tight.  Now with huge memories
> becoming the norm it might make sense to revisit it as a performance
> boost. 

No, I only want priorities for memory blocks.  I'd leave the paging 
inside the kernel, but users could tell what priorities they have (like: 
I won't ever need this mp3 file again, but please cache my images and 
executable file with priority "normal").  Something like BeOS did with 
threads, so user-perceived performance (or latency) improves.

-- 
Majority, n.: That quality that distinguishes a crime from a law.
From: George Neuner
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <9stgp1hf1spjpsiio6rjook9lvg0kl4q6h@4ax.com>
On Thu, 08 Dec 2005 10:40:28 +0100, Ulrich Hobelmann
<···········@web.de> wrote:

>George Neuner wrote:
>> On Wed, 07 Dec 2005 09:37:53 +0100, Ulrich Hobelmann
>> <···········@web.de> wrote:
>> 
>>> George Neuner wrote:
>
>>> mmap()ed files are also cached, no?
>> 
>> No.  Mapped files are handled by the virtual memory system and all
>> modern systems DMA pages directly to/from disk with no buffering.
>
>Seriously?  I thought every common OS would buffer/cache most pages in 
>those files, so it doesn't need to access disk every time something is 
>read or written (just with normal file reads/writes).

Seriously, you need to read up on MMUs, virtual memory and demand
paging systems.

The VMM accesses the disk only when necessary ... if a page isn't
currently in RAM when it is needed, it will be read in from the disk.
The page then remains accessible in RAM until it is forced out due to
overcommitment according to the replacement policy, whereupon the page
contents are written back to the disk to preserve them.  When the last
process unmaps the page and returns it to the OS, the contents are
discarded.

mmap() tells the operating system to use a particular file/device as
backing store for a particular range of virtual addresses in the
current process.  When the map is first established the pages
underlying the range are not valid, so touching a page causes a page
fault and suspends the process.  The VMM immediately reads the page
data from the corresponding range in the backing file and then allows
the process to continue.  The page is then valid memory and can be
accessed normally until either the process unmaps it or until the VMM
needs the physical page slot for some other purpose - usually because
RAM is overcommitted.  If the VMM steals the page slot away, the page
data is preserved by writing it to its corresponding range in the
backing file.  When the process goes to access the page again, the
page will be invalid and the fault handler will read it in again.
Finally munmap() forces all the pages in the range to be written to
the backing store file, then breaks the correspondence between the
address range and the file and unmaps the pages from the process.

[IMO, the control offered by mmap() et al. is too coarse grained.
Windows file mapping API is easier to use and more flexible.
Application level control of the process's virtual memory is one of
the few areas where Windows really outshines Unix/Linux.]


>I wager most libraries and maybe even executables on Unix systems are 
>mmap()ed, so I really can't believe those files aren't ever cached, but 
>reread from disk on every access...

See above.


>> I haven't investigated exactly *how* current OS loaders handle code
>> relocation. 
>
>I think most functions are accessed only indirectly through a table; 
>yes, sounds awfully slow, but I too haven't exactly understood the 
>badly-documented machine-level stuff that happens inside a typical 
>dynamic loader. :(

My point was that code relocation requires adding an offset to every
non-relative internal address reference so that it works at its new
location.  The executable file contains metadata for each code segment
indicating where these addresses are to be found.

Older systems used to handle relocation by copying the code into swap,
patching up the address references and then executing the modified
code directly from the swap.  

I think current systems cache the fix up address map and patch the
code on demand, one page at a time on-the-fly as needed.


>> Some older paging Un*x systems tracked the process working set and
>> tried to keep it together.  When the process was scheduled, the system
>> checked the current working set and brought in any missing pages
>> before restarting the process.  Perhaps this is what you meant.
>
>No, systems do that (with a global working-set AFAIK), but they allow 
>some random big file to take up all cache memory and have application 
>files (and executables) either swapped out or just dropped from physical 
>memory (for read-only files).  That's because they only have basic LRU 
>from the '70s, with no priorities attached to memory blocks at all.

I think I understand what you're getting at ... you're talking about
letting applications control some aspects of their own file caching
without having to handle it explicitly.  I agree that this could be a
useful thing.

George
--
for email reply remove "/" from address
From: Ulrich Hobelmann
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <3vrm48F17iuj8U1@individual.net>
George Neuner wrote:
>>>> mmap()ed files are also cached, no?
>>> No.  Mapped files are handled by the virtual memory system and all
>>> modern systems DMA pages directly to/from disk with no buffering.
>> Seriously?  I thought every common OS would buffer/cache most pages in 
>> those files, so it doesn't need to access disk every time something is 
>> read or written (just with normal file reads/writes).
> 
> Seriously, you need to read up on MMUs, virtual memory and demand
> paging systems.

Been there, done that.  What I (think I) know about modern OSes is that 
they cache everything from disk, I suppose by associating file blocks 
with in-memory-blocks.  Surely I could be wrong, but I'd really like to 
know where you got your information that mmap()ed data isn't cached at 
all, because I don't believe it.

> The VMM accesses the disk only when necessary ... if a page isn't
> currently in RAM when it is needed, it will be read in from the disk.

Yep.

> The page then remains accessible in RAM until it is forced out due to
> overcommitment according to the replacement policy, whereupon the page
> contents are written back to the disk to preserve them.  When the last
> process unmaps the page and returns it to the OS, the contents are
> discarded.

Yep.  Most dirty pages are probably flushed out in advance, so paging-in 
is faster when the need arises.

[... how mmap works; no problems here ...]
> [IMO, the control offered by mmap() et al. is too coarse grained.
> Windows file mapping API is easier to use and more flexible.
> Application level control of the process's virtual memory is one of
> the few areas where Windows really outshines Unix/Linux.]

Hm, haven't ever looked at the Windows API.  I suppose it's both more 
modern in some ways, and more bloated in other ways, than Unix.

>> I wager most libraries and maybe even executables on Unix systems are 
>> mmap()ed, so I really can't believe those files aren't ever cached, but 
>> reread from disk on every access...
> 
> See above.

Which didn't in any way differ from my opinion or understanding of 
mmap().  AFAIK mmapped stuff is written to disk when you modify it, but 
until then it is cached in memory, until the VM system decides to drop 
the pages.

> I think current systems cache the fix up address map and patch the
> code on demand, one page at a time on-the-fly as needed.

I think FreeBSD did or does something like that.

-- 
Majority, n.: That quality that distinguishes a crime from a law.
From: Duane Rettig
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <o03bl2enj1.fsf@franz.com>
Ulrich Hobelmann <···········@web.de> writes:

> George Neuner wrote:
>>>>> mmap()ed files are also cached, no?
>>>> No.  Mapped files are handled by the virtual memory system and all
>>>> modern systems DMA pages directly to/from disk with no buffering.
>>> Seriously?  I thought every common OS would buffer/cache most pages
>>> in those files, so it doesn't need to access disk every time
>>> something is read or written (just with normal file reads/writes).
>> Seriously, you need to read up on MMUs, virtual memory and demand
>> paging systems.
>
> Been there, done that.  What I (think I) know about modern OSes is
> that they cache everything from disk, I suppose by associating file
> blocks with in-memory-blocks.  Surely I could be wrong, but I'd really
> like to know where you got your information that mmap()ed data isn't
> cached at all, because I don't believe it.

The caching sometimes happens at first access, and not at mmap time.  It
depends on whether there is backing for the mmap, or if it is
"MAP_NORESERVE", which means that it is using "virtual swap" (which is
different than virtual memory).  There are also copy-on-write options
which sometimes create new pages but which don't write them back out - this
allows many processes to map the same file without disturbing the original,
and yet without incurring the extra virtual page cost of a private mapping;
any pages which have not been written to can use the same actual page
across process boundaries.

>
> [... how mmap works; no problems here ...]
>> [IMO, the control offered by mmap() et al. is too coarse grained.
>> Windows file mapping API is easier to use and more flexible.
>> Application level control of the process's virtual memory is one of
>> the few areas where Windows really outshines Unix/Linux.]
>
> Hm, haven't ever looked at the Windows API.  I suppose it's both more
> modern in some ways, and more bloated in other ways, than Unix.

I'll speak directly to George about this one; I do have a beef with
one of MS's arbitrary restrictions....

>>> I wager most libraries and maybe even executables on Unix systems
>>> are mmap()ed, so I really can't believe those files aren't ever
>>> cached, but reread from disk on every access...
>> See above.
>
> Which didn't in any way differ from my opinion or understanding of
> mmap().  AFAIK mmapped stuff is written to disk when you modify it,
> but until then it is cached in memory, until the VM system decides to
> drop the pages.

This is also different on different operating systems.  We implemented
a memory-mapped file device in our simple-streams implementation which
works for all architectures; you just specify :mapped t as a keyword
pair to cl:open.  It keeps mapped data in memory, but the decision as to
when to flush it is made by the mapping strategy; some systems flush
aggressively, and others require a flushing operation (on unix it's an
msync() call, and on windows it's a FlushViewOfFile() call). You can see
the difference in behavior by looking at the file in a shell while
having it open in the lisp - if the file contents change when writing is
done, then the flushing strategy is aggressive.


-- 
Duane Rettig    ·····@franz.com    Franz Inc.  http://www.franz.com/
555 12th St., Suite 1450               http://www.555citycenter.com/
Oakland, Ca. 94607        Phone: (510) 452-2000; Fax: (510) 452-0182   
From: Kostik Belousov
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <87ek4mz9tg.fsf@deviant.kiev.zoral.com.ua>
Duane Rettig <·····@franz.com> writes:

> This is also different on different operating systems.  We implemented
> a memory-mapped file device in our simple-streams implementation which
> works for all architectures; you just specify :mapped t as a keyword
> pair to cl:open.  It keeps mapped data in memory, but the decision as to
> when to flush it is made by the mapping strategy; some systems flush
> aggressively, and others require a flushing operation (on unix it's an
> msync() call, and on windows it's a FlushViewOfFile() call). You can see
> the difference in behavior by looking at the file in a shell while
> having it open in the lisp - if the file contents change when writing is
> done, then the flushing strategy is aggressive.
> 
Just my 2 cents:

what you called "aggressive flushing" seems to have the well-established
name of "coherent virtual page/buffer caches", and seems to have been a
feature of all unixes for at least 10 years (except, maybe, HP-UX, which
definitely had separate caches at 10.20; not sure about 11x).
From: Duane Rettig
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <o0pso6av90.fsf@franz.com>
Kostik Belousov <···················@zoral.com.ua> writes:

> Duane Rettig <·····@franz.com> writes:
>
>> This is also different on different operating systems.  We implemented
>> a memory-mapped file device in our simple-streams implementation which
>> works for all architectures; you just specify :mapped t as a keyword
>> pair to cl:open.  It keeps mapped data in memory, but the decision as to
>> when to flush it is made by the mapping strategy; some systems flush
>> aggressively, and others require a flushing operation (on unix it's an
>> msync() call, and on windows it's a FlushViewOfFile() call). You can see
>> the difference in behavior by looking at the file in a shell while
>> having it open in the lisp - if the file contents change when writing is
>> done, then the flushing strategy is aggressive.
>> 
> Just a 2 cents:
>
> what you called "aggressive flushing" seems to have the well-established
> name of "coherent virtual page/buffer caches", and seems to have been a
> feature of all unixes for at least 10 years (except, maybe, HP-UX, which
> definitely had separate caches at 10.20; not sure about 11x).

That's part of it.  But I've seen some interesting anomalies where
this coherency is not obeyed, as well.  Also, the coherency didn't
always follow intuition - I can't recall the specifics, but when we
first got our AIX machine mid-90's, it was interesting to build a
small file with some characters in it, and then edit that file using
vi (I think) - while the vi session was active, and the buffer was
being changed, you could go out to another shell and cat the file again,
and you'd get the changed contents of the file so far.  Then, after
doing a :q! another cat would show the original file.  Perhaps it was
just a bug in vi (or whatever it was that was making the changes);
it was definitely not expected.

I don't recall which os requires explicit flushing.  I think you're
right about HP/UX, but AIX may also be one (they may have taken a
pendulum swing in the opposite direction) - it was quite a while
since I performed those experiments and I have no time to repeat
them just now...

-- 
Duane Rettig    ·····@franz.com    Franz Inc.  http://www.franz.com/
555 12th St., Suite 1450               http://www.555citycenter.com/
Oakland, Ca. 94607        Phone: (510) 452-2000; Fax: (510) 452-0182   
From: Brian Downing
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <3Qgmf.391274$084.239467@attbi_s22>
In article <··············@deviant.kiev.zoral.com.ua>,
Kostik Belousov  <···················@zoral.com.ua> wrote:
> Duane Rettig <·····@franz.com> writes:
> > This is also different on different operating systems.  We implemented
> > a memory-mapped file device in our simple-streams implementation which
> > works for all architectures; you just specify :mapped t as a keyword
> > pair to cl:open.  It keeps mapped data in memory, but the decision as to
> > when to flush it is made by the mapping strategy; some systems flush
> > aggressively, and others require a flushing operation (on unix it's an
> > msync() call, and on windows it's a FlushViewOfFile() call). You can see
> > the difference in behavior by looking at the file in a shell while
> > having it open in the lisp - if the file contents change when writing is
> > done, then the flushing strategy is aggressive.
> 
> what you called "aggressive flushing" seems to have the well-established
> name of "coherent virtual page/buffer caches", and seems to have been a
> feature of all unixes for at least 10 years (except, maybe, HP-UX, which
> definitely had separate caches at 10.20; not sure about 11x).

I have to agree with this.  I work on an embedded system running Linux
(a 2.6 kernel), and we have a memory-mapped file that is not explicitly
msync'd by us.  We also access this file through separate perl processes
using normal file I/O.  The perl processes always see a consistent view
of the file as it is at the present time, but if we lose power to the
system without the batteries installed, usually on the next boot the
file will look exactly as it did before any modification (as we don't
have a lot of memory pressure on this system).

-bcd
-- 
*** Brian Downing <bdowning at lavos dot net> 
From: Duane Rettig
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <o0irtxc2y6.fsf@franz.com>
Brian Downing <·············@lavos.net> writes:

> In article <··············@deviant.kiev.zoral.com.ua>,
> Kostik Belousov  <···················@zoral.com.ua> wrote:
>> Duane Rettig <·····@franz.com> writes:
>> > This is also different on different operating systems.  We implemented
>> > a memory-mapped file device in our simple-streams implementation which
>> > works for all architectures; you just specify :mapped t as a keyword
>> > pair to cl:open.  It keeps mapped data in memory, but the decision as to
>> > when to flush it is made by the mapping strategy; some systems flush
>> > aggressively, and others require a flushing operation (on unix it's an
>> > msync() call, and on windows it's a FlushViewOfFile() call). You can see
>> > the difference in behavior by looking at the file in a shell while
>> > having it open in the lisp - if the file contents change when writing is
>> > done, then the flushing strategy is aggressive.
>> 
>> what you called "aggressive flushing" seems to have the well-established
>> name of "coherent virtual page/buffer caches", and seems to have been a
>> feature of all unixes for at least 10 years (except, maybe, HP-UX, which
>> definitely had separate caches at 10.20; not sure about 11x).
>
> I have to agree with this.  I work on an embedded system running Linux
> (a 2.6 kernel), and we have a memory-mapped file that is not explicitly
> msync'd by us.  We also access this file through separate perl processes
> using normal file I/O.  The perl processes always see a consistent view
> of the file as it is at the present time, but if we lose power to the
> system without the batteries installed, usually on the next boot the
> file will look exactly as it did before any modification (as we don't
> have a lot of memory pressure on this system).

I agree this is a Good Thing for data integrity.  All I'm saying is that
not every operating system does it by default.

-- 
Duane Rettig    ·····@franz.com    Franz Inc.  http://www.franz.com/
555 12th St., Suite 1450               http://www.555citycenter.com/
Oakland, Ca. 94607        Phone: (510) 452-2000; Fax: (510) 452-0182   
From: Brian Downing
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <ySumf.635835$xm3.228506@attbi_s21>
In article <··············@franz.com>, Duane Rettig  <·····@franz.com> wrote:
> Brian Downing <·············@lavos.net> writes:
> > I have to agree with this.  I work on an embedded system running Linux
> > (a 2.6 kernel), and we have a memory-mapped file that is not explicitly
> > msync'd by us.  We also access this file through separate perl processes
> > using normal file I/O.  The perl processes always see a consistent view
> > of the file as it is at the present time, but if we lose power to the
> > system without the batteries installed, usually on the next boot the
> > file will look exactly as it did before any modification (as we don't
> > have a lot of memory pressure on this system).
> 
> I agree this is a Good Thing for data integrity.  All I'm saying is that
> not every operating system does it by default.

Wasn't denying that.  :-)  I was just putting in two cents on a platform
where it does work.

-bcd
-- 
*** Brian Downing <bdowning at lavos dot net> 
From: Rob Warnock
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <eLadnfJhwdxJFgbenZ2dnUVZ_sKdnZ2d@speakeasy.net>
Duane Rettig  <·····@franz.com> wrote:
+---------------
| Ulrich Hobelmann <···········@web.de> writes:
| > George Neuner wrote:
| >>>>> mmap()ed files are also cached, no?
| >>>> No.  Mapped files are handled by the virtual memory system and all
| >>>> modern systems DMA pages directly to/from disk with no buffering.
| >>> Seriously?  I thought every common OS would buffer/cache most pages...
| >> Seriously, you need to read up on MMUs, virtual memory and demand
| >> paging systems.
| >
| > What I (think I) know about modern OSes is that they cache everything
| > from disk, I suppose by associating file blocks with in-memory-blocks.
| > Surely I could be wrong...
+---------------

Ulrich isn't wrong, George. The "global page cache" has pretty much
completely replaced the "file block cache" in recent versions of the
VM systems of most decent operating systems [e.g., Irix, *BSD, Linux].
Pretty much *everything* is cached using the same mechanism [which
does have its downsides sometimes, to be sure].

+---------------
| > like to know where you got your information that mmap()ed data isn't
| > cached at all, because I don't believe it.
| 
| The caching sometimes happens at first access, and not at mmap time.
+---------------

True. In particular, these days executable files are simply mmap'd
into memory, and pages only fault in from the ELF (say) file as they
are first accessed.

+---------------
| It depends on whether there is backing for the mmap, or if it is
| "MAP_NORESERVE", which means that it is using "virtual swap" (which is
| different than virtual memory).  There is also copy-on-write options
| which sometimes create new pages but which don't write them back out -
| this allows many processes to map the same file without disturbing the
| original, and yet without incurring the extra virtual page cost of a
| private mapping; any pages which have not been written to can use the
| same actual page across process boundaries.
+---------------

Quite true! And in fact, the page caching in both FreeBSD & Linux is
so good that even though CMUCL's typical image file is *huge* compared
to CLISP's, on the second (and subsequent) executions, CMUCL starts up
slightly *faster* than CLISP!!  [Aside: Sam, I haven't tested this on
the latest version of CLISP, so my apologies if it's no longer true.]
The following was done with a laptop running FreeBSD-4.10 on a 1.8 GHz
Athlon with 1 GiB of RAM [but a *slow* disk]:

    $ cat test_clisp.lisp
    #!/usr/local/bin/clisp -q
    (format t "hello world!~%")
    $ time-hist ./test_clisp.lisp
    Timing 100 runs of: ./test_clisp.lisp
       3 0.019
      32 0.020
      65 0.021
    $ cat test_cmucl.lisp
    #!/usr/local/bin/cmucl -script
    (format t "hello world!~%")
    $ time-hist ./test_cmucl.lisp
    Timing 100 runs of: ./test_cmucl.lisp
      59 0.016
      41 0.017
    $ 

The point is *not* that one is a few milliseconds faster than the
other, but that on such a system "bigger" is not necessarily "slower". 

[A secondary point is that CMUCL is perfectly acceptable for simple
"scripting", *including* low-traffic CGI scripting! For high-traffic
sites, of course, one would use compiled code in a persistent Lisp
application server.]


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: George Neuner
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <810sp19ona2ce5um3a8lacilup983j5m68@4ax.com>
On Sat, 10 Dec 2005 20:15:48 -0600, ····@rpw3.org (Rob Warnock) wrote:

>Duane Rettig  <·····@franz.com> wrote:
>+---------------
>| Ulrich Hobelmann <···········@web.de> writes:
>| > George Neuner wrote:
>| >>>>> mmap()ed files are also cached, no?
>| >>>> No.  Mapped files are handled by the virtual memory system and all
>| >>>> modern systems DMA pages directly to/from disk with no buffering.
>| >>> Seriously?  I thought every common OS would buffer/cache most pages...
>| >> Seriously, you need to read up on MMUs, virtual memory and demand
>| >> paging systems.
>| >
>| > What I (think I) know about modern OSes is that they cache everything
>| > from disk, I suppose by associating file blocks with in-memory-blocks.
>| > Surely I could be wrong...
>+---------------
>
>Ulrich isn't wrong, George. The "global page cache" has pretty much
>completely replaced the "file block cache" in recent versions of the
>VM systems of most decent operating systems [e.g., Irix, *BSD, Linux].
>Pretty much *everything* is cached using the same mechanism [which
>does have its downsides sometimes, to be sure].


I think this is a debate about terminology.  You are arguing that the
page in RAM is a "cache" of the disk data and I am arguing that it is
not - at least not in general.  

The traditional software notion of cache is that of a copy kept for
some reason ... the operative term being "copy".  It is applicable to
mapped executables, but not to swappable data because the page in RAM
is not, in general, a copy of the disk data.


WRT "global page cache", AFAICT it is nothing more than a new and
somewhat misleading name for what used to be called the "page frame
table".  


I think we've wasted enough bandwidth ... time to agree to disagree.

George
--
for email reply remove "/" from address
From: George Neuner
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <hn6ip1tqjgq2iqvumaae0h0tva3hmks919@4ax.com>
On Thu, 08 Dec 2005 22:15:52 +0100, Ulrich Hobelmann
<···········@web.de> wrote:

>What I (think I) know about modern OSes is that 
>they cache everything from disk, I suppose by associating file blocks 
>with in-memory-blocks.  

If you're talking about the file system you are correct.  Mapped files
are not handled by the file system, they are handled by the VMM.


>I'd really like to know where you got your information that
>mmap()ed data isn't cached at all, because I don't believe it.

Several sources including close examination of the VMM code for some
older versions of BSD and Linux.  I freely admit that I'm not up to
date on what the latest kernels do.

---
You can find overviews of the Linux kernel online through the kernel
programming link at http://www.linuxhq.com or at the following links

The Linux Kernel - http://www.tldp.org/LDP/tlk/tlk.html  See the
memory management section in chapter  3.

Linux Kernel 2.4 Internals -
http://www.moses.uklinux.net/patches/lki.html  See the page cache
section in chapter 4.

More in depth information on Linux's VMM operation is at
http://www.linux-mm.org.  In particular look at the sections on fault
handling and page cache lookup.

Pay careful attention when they start talking about the "page cache"
and the "swap cache".  If you look closely, you will see that both
really refer to VMM management structures rather than to any buffered
collections of page data.  


The best source of information is just to read the code.  Rapidly
scanning the online source for Linux 2.6.1 ... 

To map a file, mmap() verifies the existence of the file and checks
that the file's current size will fully contain the requested array.
If so it allocates virtual pages for the array, gets the file inode
(opening the file if necessary) , creates a set of structures which
asociates each virtual page with the inode and byte offset within the
file, and updates the VMM's global page table and the current
process's page table.

The invalid page fault handler looks up the entry in the page table
and calls the page swapping code. 

The page swapping code looks up the file inode and byte offset in the
page table entry and queues a read to the disk block device driver.

I didn't look at the latest disk device drivers - they are
architecture dependent and too messy to scan quickly, but previous
versions I have examined did not cache anything ... that was left to
higher levels.  



AFAICR, last I looked the BSD 4.2 code was quite similar.


---
Also, if you are interested, have a look at these:

Linux Kernel Development (1st ed.), Robert Love, 2003, ISBN 0672325128
[Rambles a bit but has a lot of low level detail.  There is a newer
version available but I haven't seen it yet.]

The Design and Implementation of the 4.4BSD UNIX Operating System,
Marshall Kirk McKusick, et al, Addison-Wesley, 1996, ISBN
0-201-54979-4
[Very good book in the style of "Design of the Unix Operating System".
Doesn't show a lot of examples but has lots of discussion of low level
operation and principles.]

George
--
for email reply remove "/" from address
From: Ulrich Hobelmann
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <3vt13kF17sh9gU1@individual.net>
George Neuner wrote:
>> I'd really like to know where you got your information that
>> mmap()ed data isn't cached at all, because I don't believe it.
> 
> Several sources including close examination of the VMM code for some
> older versions of BSD and Linux.  I freely admit that I'm not up to
> date on what the latest kernels do.

Ok.

> To map a file, mmap() verifies the existence of the file and checks
> that the file's current size will fully contain the requested array.
> If so it allocates virtual pages for the array, gets the file inode
> (opening the file if necessary), creates a set of structures which
> associates each virtual page with the inode and byte offset within the
> file, and updates the VMM's global page table and the current
> process's page table.
> 
> The invalid page fault handler looks up the entry in the page table
> and calls the page swapping code. 
> 
> The page swapping code looks up the file inode and byte offset in the
> page table entry and queues a read to the disk block device driver.

Yep, but I'd suspect that once a page is read in from disk, it stays in 
memory unless the file receives a write (i.e. changes); i.e. it doesn't 
reread the page from disk on every instruction that accesses that memory 
page.  I also (perhaps wrongly) call that caching.

Sorry for the confusion.

> The Design and Implementation of the 4.4BSD UNIX Operating System,
> Marshall Kirk McKusick, et al, Addison-Wesley, 1996, ISBN
> 0-201-54979-4

I read this one and some other OS books some time ago.  Most of them 
didn't yet really go into depth on stuff like mmap(), since the old 
systems (that is BSD; Mach started the whole memory mapping stuff I 
think) didn't have it yet.

-- 
Majority, n.: That quality that distinguishes a crime from a law.
From: Duane Rettig
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <o0y82ud7g8.fsf@franz.com>
[just a couple of points here ...]

George Neuner <·········@comcast.net> writes:

> [IMO, the control offered by mmap() et al. is too coarse grained.
> Windows file mapping API is easier to use and more flexible.
> Application level control of the process's virtual memory is one of
> the few areas where Windows really outshines Unix/Linux.]

I agree.  Microsoft did this (mostly) right.

What they got wrong was disallowing the PAGE_EXECUTE_WRITECOPY
option for a CreateFileMapping.  When a system has DEP (data execution
protection) disabled, we can create a copy-on-write mapping for what
we call a .dxl file (a heap image, essentially) by using a PAGE_WRITECOPY
access.  But as soon as one turns on DEP for a machine with the hardware
for it, no data pages can be executed unless some kind of EXECUTE
permission is given to the page.  In general this is a good thing, and
I would think that stacks should never be granted execute access, for
security reasons, but the whole idea of a dynamic implementation of a
mapping is thwarted by not allowing CreateFileMapping to obey the
PAGE_EXECUTE_WRITECOPY protection rights.  In these situations, we're
forced to VirtualAlloc the whole memory and copy the whole heap in.  It's
still fast for small images, but it defeats the scalability and sharing
bonuses that mapping provides.

>>> I haven't investigated exactly *how* current OS loaders handle code
>>> relocation. 
>>
>>I think most functions are accessed only indirectly through a table; 
>>yes, sounds awfully slow, but I too haven't exactly understood the 
>>badly-documented machine-level stuff that happens inside a typical 
>>dynamic loader. :(
>
> My point was that code relocation requires adding an offset to every
> non-relative internal address reference so that it works at its new
> location.  The executable file contains metadata for each code segment
> indicating where these addresses are to be found.
>
> Older systems used to handle relocation by copying the code into swap,
> patching up the address references and then executing the modified
> code directly from the swap.  
>
> I think current systems cache the fix up address map and patch the
> code on demand, one page at a time on-the-fly as needed.

But of course this all assumes C/C++/ld() style binding of functions
and calls.  In a lisp code situation, it is almost never necessary
to patch any code, or to relocate it when moving it.  Proof of existence
for this concept is Allegro CL, where the _only_ architecture ever to
have required code-fixup after relocation was the Cray, which did not
provide pc-relative branches (every branch was through a general register
or to a hard address, offset only by the program base address).  But on
all other architectures, which have pc-relative branching, code vectors
can be completely non-relocation sensitive.  The data which would normally
be relocated (which usually has to go through register/displacement
addressing anyway) is simply contained in the function object, and since
the function register is part of the environment as well as the global
register, there is no problem accessing any data at all with direct
memory references, and no indirections.
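A toy C model of that arrangement (a sketch of the concept, not Allegro
CL's actual representation): the code embeds no absolute data addresses,
so it is position-free, and everything it needs rides in the function
object it is handed through the function register:

```c
#include <assert.h>

/* Toy function object: a code pointer plus the constant vector that
   would otherwise need relocation.  The code addresses its data only
   by displacement off the "function register" (here, the self arg). */
struct fnobj {
    long (*code)(struct fnobj *self, long x);
    long constants[4];
};

/* Position-independent "code vector": no absolute references, so it
   never needs patching, no matter where it or its data lives. */
static long add_k(struct fnobj *self, long x) {
    return x + self->constants[0];   /* displacement off self, no fixup */
}
```

Because add_k contains no hard addresses, any number of function objects
can point at the same code, wherever they sit in memory.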

-- 
Duane Rettig    ·····@franz.com    Franz Inc.  http://www.franz.com/
555 12th St., Suite 1450               http://www.555citycenter.com/
Oakland, Ca. 94607        Phone: (510) 452-2000; Fax: (510) 452-0182   
From: Ulrich Hobelmann
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <3vt1tnF16k5imU1@individual.net>
Duane Rettig wrote:
> But of course this all assumes C/C++/ld() style binding of functions
> and calls.  In a lisp code situation, it is almost never necessary
> to patch any code, or to relocate it when moving it.  Proof of existence

Does this mean that all function calls go through a function-symbol 
lookup table first?

> for this concept is Allegro CL, where the _only_ architecture ever to
> have required code-fixup after relocation was the Cray, which did not
> provide pc-relative branches (every branch was through a general register

I think I don't really understand you here.  PC-relative branches to 
call functions sound like a problem to me, since garbage collection may 
relocate functions, which would need to change the relative offset...

> or to a hard address, offset only by the program base address).  But on
> all other architectures, which have pc-relative branching, code vectors
> can be completely non-relocation sensitive.  The data which would normally
> be relocated (which usually has to go through register/displacement
> addressing anyway) is simply contained in the function object, and since
> the function register is part of the environment as well as the global
> register, there is no problem accessing any data at all with direct
> memory references, and no indirections.

Hm, this sounds like code and data are quite close, which to me sounds 
problematic, since code and data use different CPU caches.  Or does the 
code cache simply ignore it when you write data through the data cache?

Sorry for the confusion, but I'm not quite familiar with dynamic loading 
in Lisp runtimes.

-- 
Majority, n.: That quality that distinguishes a crime from a law.
From: Duane Rettig
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <o0u0diaz9q.fsf@franz.com>
Ulrich Hobelmann <···········@web.de> writes:

> Duane Rettig wrote:
>> But of course this all assumes C/C++/ld() style binding of functions
>> and calls.  In a lisp code situation, it is almost never necessary
>> to patch any code, or to relocate it when moving it.  Proof of existence
>
> Does this mean that all function calls go through a function-symbol
> lookup table first?

What table?

Lisp needs no table to link functions to symbols.  Note in the following
transcript that the symbol BAR (even though it's not yet defined) is
directly referenced by #'FOO - there's no reason why it should be
placed into a table and indirected from there:

CL-USER(1): (compile (defun foo (x y) (declare (optimize speed (safety 0))) (bar x y)))
Warning: While compiling these undefined functions were referenced: BAR.
FOO
NIL
NIL
CL-USER(2): (disassemble 'foo)
;; disassembly of #<Function FOO>
;; formals: X Y
;; constant vector:
0: BAR

;; code start: #x1065c124:
   0: 55          pushl	ebp
   1: 8b ec       movl	ebp,esp
   3: 56          pushl	esi
   4: 83 ec 24   subl	esp,$36
   7: 8b 5e 12   movl	ebx,[esi+18]  ; BAR
  10: ff 57 27   call	*[edi+39]     ; SYS::TRAMP-TWO
  13: c9          leave
  14: 8b 75 fc   movl	esi,[ebp-4]
  17: c3          ret
CL-USER(3): (inspect #'foo)
A NEW #<Function FOO>
  lambda-list: (X Y)
   0 excl-type ----> Bit field: #x08
   1 flags --------> Bit field: #x88
   2 start --------> Bit field: #x1065c124
   3 hash ---------> Bit field: #x000052a8
   4 symdef -------> The symbol FOO
   5 code ---------> short simple CODE vector (13) = #(35669 22252 60547 ...)
   6 formals ------> (X Y), a proper list with 2 elements
   7 cframe-size --> fixnum 0 [#x00000000]
   8 immed-args ---> fixnum 0 [#x00000000]
   9 locals -------> fixnum 0 [#x00000000]
  10 <constant> ---> The symbol BAR
[1i] CL-USER(4): 

While FOO is executing, a register always holds #'FOO in it so BAR
is always right there, as I said in my previous post, with no need
for indirection.

Note also, on a subject I had mentioned earlier: if I capture the
code vector for FOO and then define another function with similar
structure but names changed (to protect the innocent? :-):

[1i] CL-USER(4): :i 5
A NEW short simple CODE vector (13) @ #x1065c132
   0-> The field #x8b55
   1-> The field #x56ec
   2-> The field #xec83
   3-> The field #x8b24
   4-> The field #x125e
   5-> The field #x57ff
   6-> The field #xc927
   7-> The field #x758b
   8-> The field #xc3fc
   9-> The field #x0000
  10-> The field #x0000
  11-> The field #x0000
  12-> The field #x0000
[1i] CL-USER(5): (setq code1 *)
#(35669 22252 60547 35620 4702 22527 51495 30091 50172 0 ...)
[1i] CL-USER(6): :i q
#<Function FOO>
CL-USER(7): (compile (defun bas (z w) (declare (optimize speed (safety 0))) (bam z w)))
Warning: While compiling these undefined functions were referenced: BAM.
BAS
NIL
NIL
CL-USER(8): (inspect #'bas)
A NEW #<Function BAS>
  lambda-list: (Z W)
   0 excl-type ----> Bit field: #x08
   1 flags --------> Bit field: #x88
   2 start --------> Bit field: #x1066a574
   3 hash ---------> Bit field: #x000052aa
   4 symdef -------> The symbol BAS
   5 code ---------> short simple CODE vector (13) = #(35669 22252 60547 ...)
   6 formals ------> (Z W), a proper list with 2 elements
   7 cframe-size --> fixnum 0 [#x00000000]
   8 immed-args ---> fixnum 0 [#x00000000]
   9 locals -------> fixnum 0 [#x00000000]
  10 <constant> ---> The symbol BAM
[1i] CL-USER(9): :i 5
A NEW short simple CODE vector (13) @ #x1066a582
   0-> The field #x8b55
   1-> The field #x56ec
   2-> The field #xec83
   3-> The field #x8b24
   4-> The field #x125e
   5-> The field #x57ff
   6-> The field #xc927
   7-> The field #x758b
   8-> The field #xc3fc
   9-> The field #x0000
  10-> The field #x0000
  11-> The field #x0000
  12-> The field #x0000
[1i] CL-USER(10): (setq code2 *)
#(35669 22252 60547 35620 4702 22527 51495 30091 50172 0 ...)
[1i] CL-USER(11): (eq code1 code2)
NIL
[1i] CL-USER(12): (equalp code1 code2)
T
[1i] CL-USER(13): 

These two code vectors are not the same object, but they have the
same bit patterns, so they could actually be merged and shared as
one code vector, presumably in pure (non-writable) space.
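In other words, a loader could deduplicate: when two code vectors
compare bit-for-bit equal, keep one copy and point both function objects
at it.  A minimal sketch of that check in C (a hypothetical helper, not
a Franz interface):

```c
#include <string.h>
#include <stddef.h>
#include <assert.h>

/* If candidate's bits match canon's, share canon (and let candidate's
   storage be reclaimed); otherwise candidate stays its own vector. */
const unsigned short *share_code_vector(const unsigned short *canon,
                                        const unsigned short *candidate,
                                        size_t nwords) {
    if (memcmp(canon, candidate, nwords * sizeof *canon) == 0)
        return canon;          /* EQUALP but not EQ: merge them */
    return candidate;
}
```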

>> for this concept is Allegro CL, where the _only_ architecture ever to
>> have required code-fixup after relocation was the Cray, which did not
>> provide pc-relative branches (every branch was through a general register
>
> I think I don't really understand you here.  PC-relative branches to
> call functions sound like a problem to me, since garbage collection
> may relocate functions, which would need to change the relative
> offset...

I'm not talking about jumps and calls from one function to another; those
can always be accomplished via a jump through a register.  What I'm talking
about is internal pc-relative branches.  If you have a construct

 (if foo (bar) (bas))

then in the basic unoptimized compilation there tends to be a test of foo,
a jump-if-false instruction to the basic block that has the call to bas,
and also after the call to bar there is a jmp to the code immediately after
bas.  These two branches are completely relocatable if they can be implemented
as pc-relative jumps, but since the Cray didn't have these, it was necessary
to relocate the jumps when relocating the code-vector.  Fortunately, the
amount of change to the jumps directly corresponded to the change in location
of the code vector, so we never had to carry "relocation bits" around with
the code-vector; we just had to do the relocation _at_ the _time_ it was
moved, so that the branch targets could all be adjusted by the same amount.
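The Cray case can be modeled in a few lines of C (a toy, not the actual
implementation): with absolute branch targets, a move by some delta just
means adding that same delta to every branch target, once, at move time,
so no per-branch relocation records need to be carried around:

```c
#include <stddef.h>
#include <assert.h>

/* Toy instruction: branches carry absolute targets (no pc-relative
   form, as on the Cray); everything else is unaffected by a move. */
struct insn {
    int is_branch;
    long target;      /* absolute address when is_branch is set */
};

/* Relocate a code vector that moved by `delta`: every internal branch
   target shifts by the same amount, applied at the time of the move. */
void relocate(struct insn *code, size_t n, long delta) {
    for (size_t i = 0; i < n; i++)
        if (code[i].is_branch)
            code[i].target += delta;
}
```

On an architecture with pc-relative branches, relocate would be a no-op,
which is the point being made above.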

>> or to a hard address, offset only by the program base address).  But on
>> all other architectures, which have pc-relative branching, code vectors
>> can be completely non-relocation sensitive.  The data which would normally
>> be relocated (which usually has to go through register/displacement
>> addressing anyway) is simply contained in the function object, and since
>> the function register is part of the environment as well as the global
>> register, there is no problem accessing any data at all with direct
>> memory references, and no indirections.
>
> Hm, this sounds like code and data are quite close, which to me sounds
> problematic, since code and data use different CPU caches.  Or does
> the code cache simply ignore it when you write data through the data
> cache?

This is precisely the point I was making - it's the cache-line, not
the page!  Whatever size a cache-line is, if code is on one line and
data on the other, never the twain shall meet, even if they are on the
same page.  The _only_ thing that is necessary on a page basis is to be
sure all appropriate access permissions are set, so that on pages which
are expected to be both writable and executable, they can be.

Now, at the cache-line level, if a code vector is written (say, because
it was just created or moved from a different location) then it is a good
idea to flush that line (which can be done in most architectures, or on
x86-based architectures and some others it is done automatically) so that
the instruction cache will see the updated data when it goes to execute that
code.  Usually this is a very inexpensive single instruction; if the line
which had just been written to was not in the instruction cache (a good bet,
since when we move a code vector we want fresh memory to move to) then
nothing actually happens, and the first execution of an instruction here
will go through the normal faulting process, which will retrieve the
memory at that location.  Since some architectures have prefetching
instructions, it is important to do the flush before any of these
prefetches could happen.  So if the data being written was previously
being executed, then the cache-line flush causes the i-cache line to
become invalid, and so the cycle is similar to a fresh read from memory,
except that if the d-cache hasn't yet finished its flush, the i-cache
either has to wait or take it directly from the d-cache line.  All of this
happens within a hundred or so (my gross estimation) cycles, but since
the turnaround in Lisp code from moving a code vector to executing it is
going to be in the thousands of cycles, there is not much chance of
any pipeline stalling due to this issue, at least.

> Sorry for the confusion, but I'm not quite familiar with dynamic
> loading in Lisp runtimes.

No problem - very few people are.  Even some of the brilliant hardware
designers I talked to over the past 15 years, who know their architecture
but who see it only in the light of how it will be used for C/ld()-style
linking, would scratch their heads and say something
like "interesting use of these features".  I don't hear this nearly
as often, though, since the popularization of Java and of other dynamic
and gc'd languages...

-- 
Duane Rettig    ·····@franz.com    Franz Inc.  http://www.franz.com/
555 12th St., Suite 1450               http://www.555citycenter.com/
Oakland, Ca. 94607        Phone: (510) 452-2000; Fax: (510) 452-0182   
From: Waldek Hebisch
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <dndopu$b4n$1@panorama.wcss.wroc.pl>
Duane Rettig <·····@franz.com> wrote:
> Lisp needs no table to link functions to symbols.  Note in the following
> transcript that the symbol BAR (even though it's not yet defined) is
> directly referenced by #'FOO - there's no reason why it should be
> placed into a table and indirected from there:
> 
> CL-USER(1): (compile (defun foo (x y) (declare (optimize speed (safety 0))) (bar x y)))
> Warning: While compiling these undefined functions were referenced: BAR.
> FOO
> NIL
> NIL
> CL-USER(2): (disassemble 'foo)
> ;; disassembly of #<Function FOO>
> ;; formals: X Y
> ;; constant vector:
> 0: BAR
> 
> ;; code start: #x1065c124:
>    0: 55          pushl ebp
>    1: 8b ec       movl  ebp,esp
>    3: 56          pushl esi
>    4: 83 ec 24   subl   esp,$36
>    7: 8b 5e 12   movl   ebx,[esi+18]  ; BAR
>   10: ff 57 27   call   *[edi+39]     ; SYS::TRAMP-TWO
>   13: c9          leave
>   14: 8b 75 fc   movl   esi,[ebp-4]
>   17: c3          ret

I do not understand your language here: you load the address of the symbol
BAR into ebx and then call SYS::TRAMP-TWO. Presumably SYS::TRAMP-TWO
will fetch the address of the function from the symbol and then fetch the
address of the code vector from #'BAR. Only then can the actual call
take place. So you chase through three pointers (#'FOO, BAR, #'BAR) and
use a helper function. Looks very much like an indirect call to me.


> CL-USER(3): (inspect #'foo)
> A NEW #<Function FOO>
>   lambda-list: (X Y)
>    0 excl-type ----> Bit field: #x08
>    1 flags --------> Bit field: #x88
>    2 start --------> Bit field: #x1065c124
>    3 hash ---------> Bit field: #x000052a8
>    4 symdef -------> The symbol FOO
>    5 code ---------> short simple CODE vector (13) = #(35669 22252 60547 ...)
>    6 formals ------> (X Y), a proper list with 2 elements
>    7 cframe-size --> fixnum 0 [#x00000000]
>    8 immed-args ---> fixnum 0 [#x00000000]
>    9 locals -------> fixnum 0 [#x00000000]
>   10 <constant> ---> The symbol BAR
> [1i] CL-USER(4): 
> 
> While FOO is executing, a register always holds #'FOO in it so BAR
> is always right there, as I said in my previous post, with no need
> for indirection.
> 

Linux ELF (for AMD64) code looks like this:

      call bar_slot


and separately:

bar_fixup:
      push bar_no
      jmp do_fixup
bar_slot:
      jmpq   ····@bar_no(%rip)

So you jump to the address contained in the global offset table in
the corresponding slot. Before the first call the slot contains the
address of bar_fixup, so the fixup routine is called to fill the GOT
slot with the address of bar (the push gives an identifying argument to
the fixup routine so that it knows which slot should be fixed). Note
that code is read-only, only data slots are modified. Less flexible
than Lisp, but I would say that it uses less indirection.

The i386 version is similar, but the address of the global offset table
has to be in ebx. BTW, it seems that the extra indirection is not that
expensive; at least C benchmarks (SPEC) show that the main cost is tying
up an extra register on an already register-starved architecture.
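The lazy-binding dance can be simulated with an ordinary function
pointer standing in for the GOT slot (a sketch of the mechanism, not
glibc's actual do_fixup):

```c
#include <assert.h>

typedef long (*fn)(long);

static long bar_impl(long x) { return x * 2; }   /* the real "bar" */

static fn got_bar;          /* the GOT slot for bar (writable data) */
static int lookups;         /* how often the expensive resolver ran */

/* Stand-in for do_fixup: perform the (slow) symbol lookup once, then
   patch the data slot so later calls go straight to bar_impl.  The
   code at the call sites never changes - it always jumps through the
   slot - so the text stays read-only and shareable. */
static long bar_resolver(long x) {
    lookups++;
    got_bar = bar_impl;     /* snap the linkage */
    return got_bar(x);
}

static void reset_got(void) { got_bar = bar_resolver; lookups = 0; }
```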

-- 
                              Waldek Hebisch
·······@math.uni.wroc.pl 
From: Duane Rettig
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <o0vexwix73.fsf@franz.com>
Waldek Hebisch <·······@math.uni.wroc.pl> writes:

> Duane Rettig <·····@franz.com> wrote:
>> Lisp needs no table to link functions to symbols.  Note in the following
>> transcript that the symbol BAR (even though it's not yet defined) is
>> directly referenced by #'FOO - there's no reason why it should be
>> placed into a table and indirected from there:
>> 
>> CL-USER(1): (compile (defun foo (x y) (declare (optimize speed (safety 0))) (bar x y)))
>> Warning: While compiling these undefined functions were referenced: BAR.
>> FOO
>> NIL
>> NIL
>> CL-USER(2): (disassemble 'foo)
>> ;; disassembly of #<Function FOO>
>> ;; formals: X Y
>> ;; constant vector:
>> 0: BAR
>> 
>> ;; code start: #x1065c124:
>>    0: 55          pushl ebp
>>    1: 8b ec       movl  ebp,esp
>>    3: 56          pushl esi
>>    4: 83 ec 24   subl   esp,$36
>>    7: 8b 5e 12   movl   ebx,[esi+18]  ; BAR
>>   10: ff 57 27   call   *[edi+39]     ; SYS::TRAMP-TWO
>>   13: c9          leave
>>   14: 8b 75 fc   movl   esi,[ebp-4]
>>   17: c3          ret
>
> I do not understand your language here: you load the address of the symbol
> BAR into ebx and then call SYS::TRAMP-TWO. Presumably SYS::TRAMP-TWO
> will fetch the address of the function from the symbol and then fetch the
> address of the code vector from #'BAR. Only then can the actual call
> take place. So you chase through three pointers (#'FOO, BAR, #'BAR) and
> use a helper function. Looks very much like an indirect call to me.

I never said it wasn't an indirect call (I looked back over my responses,
and I did say the words "no indirections", but I meant that in the context
of table lookups - sorry about that).

Obviously, there must be some sort of indirection in order to keep the code
position-independent.  In this case the rest of the calling sequence is
at a location called tramp-two, which finishes the call with the count register
set to 2.  The extra jump doesn't figure into the cost of the call, because
memory interlocks would be stalling the pipeline anyway, due to the dynamicity
of the function call.  If we were doing something like
(funcall (lambda ...) ...) then the constant would in fact be a function object,
and we would not be going through the symbol-value.  In essence, the symbol _is_
the lookup table that is usually in the GOT or TOC on shared-libraries.


>> CL-USER(3): (inspect #'foo)
>> A NEW #<Function FOO>
>>   lambda-list: (X Y)
>>    0 excl-type ----> Bit field: #x08
>>    1 flags --------> Bit field: #x88
>>    2 start --------> Bit field: #x1065c124
>>    3 hash ---------> Bit field: #x000052a8
>>    4 symdef -------> The symbol FOO
>>    5 code ---------> short simple CODE vector (13) = #(35669 22252 60547 ...)
>>    6 formals ------> (X Y), a proper list with 2 elements
>>    7 cframe-size --> fixnum 0 [#x00000000]
>>    8 immed-args ---> fixnum 0 [#x00000000]
>>    9 locals -------> fixnum 0 [#x00000000]
>>   10 <constant> ---> The symbol BAR
>> [1i] CL-USER(4): 
>> 
>> While FOO is executing, a register always holds #'FOO in it so BAR
>> is always right there, as I said in my previous post, with no need
>> for indirection.
>> 
>
> Linux ELF (for AMD64) code looks like this:
>
>       call bar_slot
>
>
> and separately:
>
> bar_fixup:
>       push bar_no
>       jmp do_fixup
> bar_slot:
>       jmpq   ····@bar_no(%rip)
>
> So you jump to the address contained in the global offset table in
> the corresponding slot. Before the first call the slot contains the
> address of bar_fixup, so the fixup routine is called to fill the GOT
> slot with the address of bar (the push gives an identifying argument to
> the fixup routine so that it knows which slot should be fixed). Note
> that code is read-only, only data slots are modified. Less flexible
> than Lisp, but I would say that it uses less indirection.

It's interesting to see how close it comes to actually being dynamic.
If only it were to _not_ fix up the address, it would be truly dynamic,
and redefinition would be possible.  Unfortunately, though there is
indeed one less memory indirection in this position-independent (but
non-relocatable) call, the initial lookup is very expensive (have you
ever actually single-stepped through the do_fixup code?).  Static
languages get away with this because it is a one-shot deal per function
or per call (depending on the style of fixup used), and not likely to be
noticed by users.

I find it interesting, in looking at MS's runtime architecture for
XP/64 for x86-64, that they have a concept of "hot-patching" - I haven't
looked closely at it yet, but it looks like it might be a way to unsnap
already-snapped linkages like the one you are describing, so that
functions can be redefined "hot" (i.e. while the program is running).
Hmm, yet another Lisp concept absorbed into the industry...
Anyway, the unsnapping doesn't fix the problem that the symbolic lookup
itself is table-driven at run-time, and is thus extremely slow.  (Lisp
looks up symbols at read time and load time, and places them as
constants into function objects at COMPILE time (and at load time for
COMPILE-FILE), so there is no run-time lookup involved.)

> The i386 version is similar, but the address of the global offset table
> has to be in ebx. BTW, it seems that the extra indirection is not that
> expensive; at least C benchmarks (SPEC) show that the main cost is tying
> up an extra register on an already register-starved architecture.

Right.  Indirection is not a Bad Thing, if used wisely.  This is the
same kind of indirection as is done in Allegro CL, so I don't even
think of it as indirection, per se.  [the example I showed was a symbol
established in the function object, but we also have a global table,
which is centered around NIL (always in a register), so if it is
a common symbol being compiled, it is just taken out of the global table
rather than the function object]:

CL-USER(1): (compile (defun foo (x y z)
                       (declare (optimize speed (safety 0) (debug 0)))
                       (list x y z)))
FOO
NIL
NIL
CL-USER(2): (disassemble *)
;; disassembly of #<Function FOO>
;; formals: X Y Z

;; code start: #x10683634:
   0: 33 c9       xorl	ecx,ecx
   2: b1 03       movb	cl,$3
   4: ff 67 2f   jmp	*[edi+47]     ; LIST
   7: 90          nop
CL-USER(3): 

-- 
Duane Rettig    ·····@franz.com    Franz Inc.  http://www.franz.com/
555 12th St., Suite 1450               http://www.555citycenter.com/
Oakland, Ca. 94607        Phone: (510) 452-2000; Fax: (510) 452-0182   
From: Ulrich Hobelmann
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <3vvst9F16ohfrU1@individual.net>
Waldek Hebisch wrote:
> Linux ELF (for AMD64) code looks like this:
> 
>       call bar_slot
> 
> 
> and separately:
> 
> bar_fixup:
>       push bar_no
>       jmp do_fixup
> bar_slot:
>       jmpq   ····@bar_no(%rip)

Ah yes, Mac OS code for PPC (in Mach-O format) looks similar.  To me it's 
very hard to read and understand, though, since the asm syntax is quite 
complicated and I never know exactly what is actually known at 
compile-time and what happens at run-time.  Disassembling is not an option, 
since most information isn't known (shown as 0) at pre-load time.

The interesting thing, though, is that I don't remember seeing anything 
like that on Linux or NetBSD/x86.  There were no fixup-functions and 
stuff like that when I looked at compiled C output.  Are the linkers for 
Mach-O and ELF/64bit linkages fundamentally different from ELF/32bit?

> So you jump to the address contained in the global offset table in
> the corresponding slot. Before the first call the slot contains the
> address of bar_fixup, so the fixup routine is called to fill the GOT
> slot with the address of bar (the push gives an identifying argument to
> the fixup routine so that it knows which slot should be fixed). Note
> that code is read-only, only data slots are modified. Less flexible
> than Lisp, but I would say that it uses less indirection.

Sounds ok, yes.

> The i386 version is similar, but the address of the global offset table
> has to be in ebx. BTW, it seems that the extra indirection is not that
> expensive; at least C benchmarks (SPEC) show that the main cost is tying
> up an extra register on an already register-starved architecture.

:)  Oh well, next year even Macs will be Intel, when one of my reasons 
for going there in the first place was to leave the ugly x86 thing for 
some/any RISC arch.  Hm, at least Intels are fast.

-- 
If you have to ask what jazz is, you'll never know.
	Louis Armstrong
From: Duane Rettig
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <o0r78kiw8h.fsf@franz.com>
Ulrich Hobelmann <···········@web.de> writes:

> Waldek Hebisch wrote:
>> Linux ELF (for AMD64) code looks like this:
>>       call bar_slot
>> and separately:
>> bar_fixup:
>>       push bar_no
>>       jmp do_fixup
>> bar_slot:
>>       jmpq   ····@bar_no(%rip)
>
> Ah yes, Mac OS code for PPC (in Mach-O format) looks similar.  To me
> it's very hard to read and understand, though, since the asm syntax is
> quite complicated and I never know exactly what is actually known at
> compile-time and what happens at run-time.  Disassembling is no
> option, since most information isn't known (shown as 0) at pre-load
> time.
>
> The interesting thing, though, is that I don't remember seeing
> anything like that on Linux or NetBSD/x86.  There were no
> fixup-functions and stuff like that when I looked at compiled C
> output.  Are the linkers for Mach-O and ELF/64bit linkages
> fundamentally different from ELF/32bit?

Yes, because x86-genre architectures allow for full-word loads of
constants, which can thus be relocated.  Even the x86-64 architecture
allows for a 64-bit constant to be loaded into a register, and it can
thus be relocated.  However, it has no instructions for jumping to
64-bit locations, so jumping is limited to 32-bit values.  But even the
32-bit jumps are relatively long instructions, taking up as many as
8 bytes (off the top of my head) of instruction space, so i-cache
pressure is increased for these direct jumps.  And if you load a 64-bit
value into a register and jump to it, well, that is two instructions, and
the first might be 9 or 10 bytes.  Talk about i-cache pressure...

RISC architectures, on the other hand, limit instruction size to 4
bytes, period.  Even the 64-bit architectures use 32-bit instructions.
So register/displacement "indirections" are almost always necessary,
except for very high and very low addresses (which the kernel usually
confiscates for itself anyway).  Constants can be generated inline
(and thus relocated) but they can't be done in one instruction, so
the relocation has to be broken into at least two pieces.  Check out,
for example, what it takes to generate a relocatable constant in 64-bit
sparc:

	sethi	%hh(cons),%o4
	sethi	%lm(cons),%g5
	or	%o4,%hm(cons),%o4
	or	%g5,%lo(cons),%g5
	sllx	%o4,32,%o4
	or	%o4,%g5,%o4

I'd much prefer the single indirection of a register/displacement
PIC style access, myself, wouldn't you?
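The six-instruction sequence splits the 64-bit constant into four fields
and reassembles them; here is the same assembly rendered in C (field
widths as in sparcv9: %hh 22 bits, %hm 10, %lm 22, %lo 10):

```c
#include <stdint.h>
#include <assert.h>

/* Mirror of the sparcv9 sethi/or/sllx/or sequence: synthesize a
   relocatable 64-bit constant from four inline fields. */
uint64_t build_const(uint64_t c) {
    uint32_t hh = (uint32_t)(c >> 42);               /* sethi %hh: bits 63..42 */
    uint32_t hm = (uint32_t)(c >> 32) & 0x3ff;       /* or    %hm: bits 41..32 */
    uint32_t lm = (uint32_t)(c & 0xffffffff) >> 10;  /* sethi %lm: bits 31..10 */
    uint32_t lo = (uint32_t)c & 0x3ff;               /* or    %lo: bits  9..0  */
    uint64_t hi = ((uint64_t)hh << 10) | hm;         /* upper word assembled   */
    uint64_t lw = ((uint64_t)lm << 10) | lo;         /* lower word assembled   */
    return (hi << 32) | lw;                          /* sllx %o4,32 ; or       */
}
```

Each field would carry its own relocation, which is why the relocation
has to be broken into pieces on such an architecture.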

-- 
Duane Rettig    ·····@franz.com    Franz Inc.  http://www.franz.com/
555 12th St., Suite 1450               http://www.555citycenter.com/
Oakland, Ca. 94607        Phone: (510) 452-2000; Fax: (510) 452-0182   
From: Ulrich Hobelmann
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <4016nkF185t2fU1@individual.net>
Duane Rettig wrote:
> RISC architectures, on the other hand, limit instruction size to 4
> bytes, period.  Even the 64-bit architectures use 32-bit instructions.
> So register/displacement "indirections" are almost always necessary,
> except for very high and very low addresses (which the kernel usually
> confiscates for itself anyway).  Constants can be generated inline
> (and thus relocated) but they can't be done in one instruction, so
> the relocation has to be broken into at least two pieces.  Check out,
> for example, what it takes to generate a relocatable constant in 64-bit
> sparc:
> 
> 	sethi	%hh(cons),%o4
> 	sethi	%lm(cons),%g5
> 	or	%o4,%hm(cons),%o4
> 	or	%g5,%lo(cons),%g5
> 	sllx	%o4,32,%o4
> 	or	%o4,%g5,%o4
> 
> I'd much prefer the single indirection of a register/displacement
> PIC style access, myself, wouldn't you?

Yes, makes sense.  I've only noticed that accessing global variables 
(with full 32bit addresses) takes several instructions on PPC, but I've 
never suspected the architecture to be at fault (just the Mach-O 
format).  Seems I was wrong there.

-- 
If you have to ask what jazz is, you'll never know.
	Louis Armstrong
From: Marcin 'Qrczak' Kowalczyk
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <874q5hp23d.fsf@qrnik.zagroda>
Ulrich Hobelmann <···········@web.de> writes:

> The interesting thing, though, is that I don't remember seeing
> anything like that on Linux or NetBSD/x86.  There were no
> fixup-functions and stuff like that when I looked at compiled C
> output.

There are if you compile with -fPIC which is recommended for shared
libraries. Although on x86 code which is not position-independent
can still be put in a shared library, at the cost of lost sharing
of memory I think.

-- 
   __("<         Marcin Kowalczyk
   \__/       ······@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
From: Ulrich Hobelmann
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <4007arF182thkU1@individual.net>
Marcin 'Qrczak' Kowalczyk wrote:
> Ulrich Hobelmann <···········@web.de> writes:
> 
>> The interesting thing, though, is that I don't remember seeing
>> anything like that on Linux or NetBSD/x86.  There were no
>> fixup-functions and stuff like that when I looked at compiled C
>> output.
> 
> There are if you compile with -fPIC which is recommended for shared
> libraries. Although on x86 code which is not position-independent
> can still be put in a shared library, at the cost of lost sharing
> of memory I think.

Ah, ok, I never looked at PIC output.  The Mac gcc compiler seems to do 
that by default (but eliminates the unneeded fix-up segments during 
linking (for non-libraries at least)).

-- 
If you have to ask what jazz is, you'll never know.
	Louis Armstrong
From: Ulrich Hobelmann
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <3vvs1gF17c6g6U1@individual.net>
Duane Rettig wrote:
> CL-USER(2): (disassemble 'foo)
> ;; disassembly of #<Function FOO>
> ;; formals: X Y
> ;; constant vector:
> 0: BAR
> 
> ;; code start: #x1065c124:
>    0: 55          pushl	ebp
>    1: 8b ec       movl	ebp,esp
>    3: 56          pushl	esi
>    4: 83 ec 24   subl	esp,$36
>    7: 8b 5e 12   movl	ebx,[esi+18]  ; BAR
>   10: ff 57 27   call	*[edi+39]     ; SYS::TRAMP-TWO

Ah ok, so that does an indirect function call, ok.  I think on most 
Intels this is slower than a direct inline-address call, but probably 
still faster than everything Java or ObjC (for instance) do.  Similar to 
C++ calls.

I think it would be cool to use direct call instructions, and patch 
their target address on relocation, but that's probably much more 
complicated to do (and wouldn't allow sharing code between processes, or 
between similar functions as in your example below).

>   13: c9          leave
>   14: 8b 75 fc   movl	esi,[ebp-4]
>   17: c3          ret
> CL-USER(3): (inspect #'foo)
> A NEW #<Function FOO>
>   lambda-list: (X Y)
>    0 excl-type ----> Bit field: #x08
>    1 flags --------> Bit field: #x88
>    2 start --------> Bit field: #x1065c124
>    3 hash ---------> Bit field: #x000052a8
>    4 symdef -------> The symbol FOO
>    5 code ---------> short simple CODE vector (13) = #(35669 22252 60547 ...)
>    6 formals ------> (X Y), a proper list with 2 elements
>    7 cframe-size --> fixnum 0 [#x00000000]
>    8 immed-args ---> fixnum 0 [#x00000000]
>    9 locals -------> fixnum 0 [#x00000000]
>   10 <constant> ---> The symbol BAR
> [1i] CL-USER(4): 
> 
> While FOO is executing, a register always holds #'FOO in it so BAR
> is always right there, as I said in my previous post, with no need
> for indirection.

Ok, I thought that symbols had to be looked up, when they really are 
interned before everything else happens.

Thanks for the enlightenment.

-- 
Majority, n.: That quality that distinguishes a crime from a law.
From: George Neuner
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <p4lmp1p31j8o8gsbn84cuii60tm1jb56qt@4ax.com>
On Fri, 09 Dec 2005 00:02:15 -0800, Duane Rettig <·····@franz.com>
wrote:

>
>[just a couple of points here ...]
>
>George Neuner <·········@comcast.net> writes:
>
>> [IMO, the control offered by mmap() et al. is too coarse grained.
>> Windows file mapping API is easier to use and more flexible.
>> Application level control of the process's virtual memory is one of
>> the few areas where Windows really outshines Unix/Linux.]
>
>I agree.  Microsoft did this (mostly) right.
>
>What they got wrong was disallowing the PAGE_EXECUTE_WRITECOPY
>option for a CreateFileMapping.  When a system has DEP (data execution
>protection) disabled, we can create a copy-on-write mapping for what
>we call a .dxl file (a heap image, essentially) by using a PAGE_WRITECOPY
>access.  But as soon as one turns on DEP for a machine with the hardware
>for it, no data pages can be executed unless some kind of EXECUTE
>permission is given to the page. 

Which is why I always liked Intel's two level segment:page protection
scheme.  Tying execute permission to segment rather than page allows
discrimination of different types of data.

Protected mode segment handling had performance issues vs. simple
paging, but given the obvious benefits I could never understand why
essentially _no one_ ever used it.  Instead of fixing the performance
problems, Intel looked around, saw no one using segments, and made the
situation worse by deprecating their use and removing some of the
support hardware.  

Now in IA64 the MMU no longer supports segmentation.


>> I think current systems cache the fix up address map and patch the
>> code on demand, one page at a time on-the-fly as needed.
>
>But of course this all assumes C/C++/ld() style binding of functions
>and calls.  In a lisp code situation, it is almost never necessary
>to patch any code, or to relocate it when moving it.  Proof of existence
>for this concept is Allegro CL, where the _only_ architecture ever to
>have required code-fixup after relocation was the Cray, which did not
>provide pc-relative branches (every branch was through a general register
>or to a hard address, offset only by the program base address).  But on
>all other architectures, which have pc-relative branching, code vectors
>can be completely non-relocation sensitive.  The data which would normally
>be relocated (which usually has to go through register/displacement
>addressing anyway) is simply contained in the function object, and since
>the function register is part of the environment as well as the global
>register, there is no problem accessing any data at all with direct
>memory references, and no indirections.

Having the hardware base register obviated the need to patch position-
sensitive code.  Flat mode IA32 isn't really hostile to position-
independent code, but it doesn't do any favors either, and most
compilers just don't bother, assuming that the OS will handle
relocation of the based code if necessary.

George
--
for email reply remove "/" from address
From: Waldek Hebisch
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <dndmg0$ak6$1@panorama.wcss.wroc.pl>
Ulrich Hobelmann <···········@web.de> wrote:
 
> No, I'm just talking about the way memory is handled on most current 
> systems.  If I had time to do my own system, it'd include a file-open 
> flag called don't-cache-this-file, so the file would be cached with 
> lowest priority (dropped from cache as soon as anything else needed 
> memory).  I'm not sure most current system even have memory priorities 
> (like they do have scheduling priorities), only some basic LRU.
> 

from man 2 open on Linux:

       O_DIRECT
              Try to minimize cache effects of the I/O to and from this  file.
              In  general  this  will degrade performance, but it is useful in
              special situations, such  as  when  applications  do  their  own
              caching.   File I/O is done directly to/from user space buffers.
              The I/O is synchronous, i.e., at the completion of  the  read(2)
              or  write(2) system call, data is guaranteed to have been trans-
              ferred.  Under Linux 2.4 transfer sizes, and  the  alignment  of
              user buffer and file offset must all be multiples of the logical
              block size of the file system.  Under  Linux  2.6  alignment  to
              512-byte boundaries suffices.
              A  semantically similar interface for block devices is described
              in raw(8).


Not exactly what you want, but with a little extra userspace code you
get equivalent effect.

-- 
                              Waldek Hebisch
·······@math.uni.wroc.pl 
From: Ulrich Hobelmann
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <3vvt53F16ohfrU2@individual.net>
Waldek Hebisch wrote:
> from man 2 open on Linux:
> 
>        O_DIRECT
[...]
> Not exactly what you want, but with a little extra userspace code you
> get equivalent effect.

Not bad, but seems to be a Linuxism, so I guess not many are using it 
(especially those who use fopen() and friends).

-- 
If you have to ask what jazz is, you'll never know.
	Louis Armstrong
From: Kaz Kylheku
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <1134144709.603306.246190@g49g2000cwa.googlegroups.com>
George Neuner wrote:
> only writable pages whose contents have changed need to be stored back
> to disk.

Any page which isn't mapped to mass storage and whose RAM storage is
being replaced has to be written somewhere so that it can be later
restored.  It doesn't matter whether or not it has been modified since
the last time it happened.

A page can be written out more than once without ever being modified in
between those accesses. It depends on the design, or perhaps
configuration, of the VM system.

When a read access occurs to a not-present page, it is brought in from
swap. At this point, the swap copy can be liberated. So the next time
the page is evacuated, new storage has to be found for it in the swap
space where it can be written out.  In such a system, a page does not
exist in both places at once. Either it's in the swap space, or it is
in memory. This means that whatever swap space you configure nicely
adds to your total virtual memory size. If you have a gigabyte of RAM,
and add a gigabyte of swap, you have a two gigabyte VM.

Then there is the design whereby swap storage is not liberated when
swapping a page in. Over its lifetime, once a page acquires a mapping
into swap, it keeps that one (and in fact, you can assign it one when
allocating that page) and so the only time it is written out during
replacement is when it's dirty. RAM then basically becomes just a cache
over that swap space. The size of your virtual memory is the size of
that swap space. If that swap space is only as big as physical memory,
then it makes no sense to have it enabled at all. If you want a two
gigabyte VM with one gigabyte of RAM, you need two gigabytes of disk
space rather than just one.

> >I'd like to be able to tell the OS not to cache whatever file I'm
> >reading sequentially (or just the "current" couple of 100k).
>
> I think for that you'll have to get friendly with mmap() and/or
> CreateFileMapping().

It would be useful to be able, at a high level, to tell the machine
that you won't be referencing some memory for quite some time. That
hint would be processed at multiple levels of the memory hierarchy:
anywhere where it is relevant, influencing cache replacement choices at
all levels from the CPU down to virtual memory.
From: George Neuner
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <66nmp1lmnjiaparhkc3o91pkr7m27o7e5n@4ax.com>
On 9 Dec 2005 08:11:49 -0800, "Kaz Kylheku" <········@gmail.com>
wrote:

>George Neuner wrote:
>> only writable pages whose contents have changed need to be stored back
>> to disk.
>
>Any page which isn't mapped to mass storage and whose RAM storage is
>being replaced has to be written somewhere so that it can be later
>restored.  It doesn't matter whether or not it has been modified since
>the last time it happened.

We're discussing VMMs ... non-mapped pages are beyond the scope.


>A page can be written out more than once without ever being modified in
>between those accesses. It depends on the desig, or perhaps
>configuration, of the VM system.
>
>When a read access occurs to a not-present page, it is brought in from
>swap. At this point, the swap copy can be liberated. So the next time
>the page is evacuated, new storage has to be found for it in the swap
>space where it can be written out.  In such a system, a page does not
>exist in both places at once. Either it's in the swap space, or it is
>in memory. This means that whatever swap space you configure nicely
>adds to your total virtual memory size. If you have a gigabyte of RAM,
>and add a gigabyte of swap, you have a two gigabyte VM.

I have only seen that design in books -  _never_ in practice.  I would
not be surprised to learn that some mainframe OSes used it but, AFAIK,
no micro OS ever has.  When I started in computing, Unix v7 and VMS 2
were current - both mapped pages 1:1 with backing store, as has
every other VMM system I've encountered since.


>Then there is the design whereby swap storage is not liberated when
>swapping a page in. Over its lifetime, once a page acquires a mapping
>into swap, it keeps that one (and in fact, you can assign it one when
>allocating that page) and so the only time it is written out during
>replacement is when it's dirty. RAM then basically becomes just a cache
>over that swap space. The size of your virtual memory is the size of
>that swap space. If that swap space is only as big as physical memory,
>then it makes no sense to have it enabled at all. If you want a two
>gigabyte VM with one gigabyte of RAM, you need two gigabytes of disk
>space rather than just one.

Which is the design of all current OSes.  The benefit is that
unmodified pages do not need to be swapped out and can simply be
overwritten - a huge performance boost at the expense of some disk
space.

George
--
for email reply remove "/" from address
From: Kaz Kylheku
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <1134353213.599251.165540@g49g2000cwa.googlegroups.com>
George Neuner wrote:
> On 9 Dec 2005 08:11:49 -0800, "Kaz Kylheku" <········@gmail.com>
> wrote:
>
> >George Neuner wrote:
> >> only writable pages whose contents have changed need to be stored back
> >> to disk.
> >
> >Any page which isn't mapped to mass storage and whose RAM storage is
> >being replaced has to be written somewhere so that it can be later
> >restored.  It doesn't matter whether or not it has been modified since
> >the last time it happened.
>
> We're discussing VMMs ... non-mapped pages are beyond the scope.

I mean something like anonymous mappings, not unmapped pages. For
instance heaps and stacks are not file mappings. Pages from these don't
have to have any assigned space on disk until it's time to swap them
out.

> >in memory. This means that whatever swap space you configure nicely
> >adds to your total virtual memory size. If you have a gigabyte of RAM,
> >and add a gigabyte of swap, you have a two gigabyte VM.
>
> I have only seen that design in books -  _never_ in practice.

Linux.
From: Rob Warnock
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <teednb1W-sOvyADenZ2dnUVZ_sWdnZ2d@speakeasy.net>
Kaz Kylheku <········@gmail.com> wrote:
+---------------
| George Neuner wrote:
| > >This means that whatever swap space you configure nicely
| > >adds to your total virtual memory size. If you have a gigabyte
| > > of RAM,and add a gigabyte of swap, you have a two gigabyte VM.
| >
| > I have only seen that design in books -  _never_ in practice.
| 
| Linux.
+---------------

And Irix & FreeBSD. There are probably more...


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: Rob Warnock
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <YsCdnY1-SdWs_AfenZ2dnUVZ_v-dnZ2d@speakeasy.net>
Ulrich Hobelmann  <···········@web.de> wrote:
+---------------
| mmap()ed files are also cached, no?
+---------------

Not always. See the O_SYNC to "open()" on some OSes, e.g., Linux:

  O_SYNC The file is opened  for  synchronous  I/O.  Any  writes  on  the
	 resulting  file  descriptor will block the calling process until
	 the data has been physically written to the underlying hardware.
	 See RESTRICTIONS below, though.
  ...
  RESTRICTIONS
     There are many infelicities in the protocol underlying  NFS,  affecting
     amongst others O_SYNC and O_NDELAY.

     POSIX provides for three different variants of synchronised I/O, corre-
     sponding to the flags O_SYNC, O_DSYNC and O_RSYNC.  Currently (2.1.130)
     these are all synonymous under Linux.

+---------------
| I'd like to be able to tell the OS not to cache whatever file I'm 
| reading sequentially (or just the "current" couple of 100k).
+---------------

See my previous reply re O_DIRECT.


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: Ulrich Hobelmann
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <3vvt9bF16ohfrU3@individual.net>
Rob Warnock wrote:
> Ulrich Hobelmann  <···········@web.de> wrote:
> +---------------
> | mmap()ed files are also cached, no?
> +---------------
> 
> Not always. See the O_SYNC to "open()" on some OSes, e.g., Linux:
> 
>   O_SYNC The file is opened  for  synchronous  I/O.  Any  writes  on  the
> 	 resulting  file  descriptor will block the calling process until
> 	 the data has been physically written to the underlying hardware.
> 	 See RESTRICTIONS below, though.

But this only guarantees that any writes you did so far are made 
persistent.  It says nothing about the pages still being in cache memory 
after you munmap() the region.

-- 
If you have to ask what jazz is, you'll never know.
	Louis Armstrong
From: Rob Warnock
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <Luadnbtr8OgKkwHeRVn-qQ@speakeasy.net>
Ulrich Hobelmann  <···········@web.de> wrote:
+---------------
| Rob Warnock wrote:
| > Ulrich Hobelmann  <···········@web.de> wrote:
| > +---------------
| > | mmap()ed files are also cached, no?
| > +---------------
| > 
| > Not always. See the O_SYNC to "open()" on some OSes, e.g., Linux:
...
| But this only guarantees that any writes you did so far are made 
| persistent.  It says nothing about the pages still being in cache
| memory after you munmap() the region.
+---------------

Oh, sorry, you're correct. When opened O_SYNC and mmap'd MAP_SHARED,
the pages -- written or not -- *will* be cached [if there's sufficient
free memory]. One main case when this is not true is when a file has
been mmap'd MAP_PRIVATE. Then I think the modified pages would get tossed
immediately when the munmap() occurs, since they are no longer
accessible. [Unmodified pages would still be cached, of course.]


-Rob

p.s. Another exception: If you open("/dev/mem", O_RDWR | O_SYNC)
and then mmap() it MAP_SHARED, you'll actually get uncached loads
and stores to the underlying bus space. [I use this a lot for
hardware debugging!]

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: Thomas A. Russ
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <ymiy82wojop.fsf@sevak.isi.edu>
George Neuner <·········@comcast.net> writes:

> Unused code/data doesn't waste RAM - only virtual addresses and disk
> swap space.  Most OSes memory map executables directly from the file
> system so code doesn't pollute the file cache or swap space.

This would only be true if you could arrange for the unused code or data
to occupy its own set of memory pages.  If it is mixed in with used code
or data, then it will take up RAM when the other items on its memory
page are referenced.

-- 
Thomas A. Russ,  USC/Information Sciences Institute
From: George Neuner
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <i8ffp1hk1sl3hhdalpl80qv659jgkjb6oe@4ax.com>
On 07 Dec 2005 10:17:58 -0800, ···@sevak.isi.edu (Thomas A. Russ)
wrote:

>George Neuner <·········@comcast.net> writes:
>
>> Unused code/data doesn't waste RAM - only virtual addresses and disk
>> swap space.  Most OSes memory map executables directly from the file
>> system so code doesn't pollute the file cache or swap space.
>
>This would only be true if you could arrange for the unused code or data
>to occupy its own set of memory pages.  If it is mixed in with used code
>or data, then it will take up RAM when the other items on its memory
>page are referenced.

On most CPUs today the cost of executing code from a dirty page is
higher than from a clean page and substantially higher if any of the
code addresses fall within a dirty cache line.

I can't think of _any_ reason to mix code and unrelated data, writable
or not, on the same page and I can't think of a *good* reason to mix
code with writable related data.

Even on-the-fly code generation doesn't pass muster as far as I'm
concerned.  While instruction bytes are being generated they are data
... once finished they are code.  Out of line data referenced by the
instructions may as well be on a separate page.  It's somewhat easier
to deal with them separately and generally better for performance.

George
--
for email reply remove "/" from address
From: Duane Rettig
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <o0lkyvqy7w.fsf@franz.com>
George Neuner <·········@comcast.net> writes:

> On 07 Dec 2005 10:17:58 -0800, ···@sevak.isi.edu (Thomas A. Russ)
> wrote:
>
>>George Neuner <·········@comcast.net> writes:
>>
>>> Unused code/data doesn't waste RAM - only virtual addresses and disk
>>> swap space.  Most OSes memory map executables directly from the file
>>> system so code doesn't pollute the file cache or swap space.
>>
>>This would only be true if you could arrange for the unused code or data
>>to occupy its own set of memory pages.  If it is mixed in with used code
>>or data, then it will take up RAM when the other items on its memory
>>page are referenced.
>
> On most CPUs today the cost of executing code from a dirty page is
> higher than from a clean page and substantially higher if any of the
> code addresses fall within a dirty cache line.

The last part of your statement is definitely true.  I'm not so sure
about the first part; if by "higher cost" you mean that on those
operating systems which exclude execute access on writable pages,
then of course the answer is yes; it is higher cost because it
can't be done.  But on architectures that I know of which provide
i-cache flushing automatically, or which provide the user with
instructions to flush d-cache and/or i-cache on a per-line basis,
the only cost associated with a dirty page is in the paging
itself; it is the same cost which is present on a page which
isn't being executed.  If you consider a decent LRU algorithm
on a system, the fact that a dirty page is being flushed is
either a sign that the page isn't being used (including execution)
or it is a sign that the whole machine is thrashing so badly that
any extra pipeline flush overhead associated with the flush is
the least of the machine's problems...

> I can't think of _any_ reason to mix code and unrelated data, writable
> or not, on the same page and I can't think of a *good* reason to mix
> code with writable related data.

In Allegro CL, at least, compiled lisp code is housed in lisp data
called "code vectors".  Its formal type is (simple-array excl::code (*))
[actually, it's short-simple-vector in 7.0 and beyond, but that is
a different matter] and it is just data to the allocator and to the
garbage-collector.  Generally, code vectors are allocated immediately
before or after function objects, which in Allegro CL have no writable
data to speak of (there are times when start addresses are replaced
for tracing, etc, but that is only for debug purposes).  So there is
little chance for such overlap of code and data within the same cache
line.

> Even on the fly code generation doesn't pass muster as far as I'm
> concerned.  While instruction bytes are being generated they are data
> ... once finished they are code.

Not necessarily.  The GC views any code vectors it sees as data,
and if any are located in recent generations, they tend to get
moved.  There is a cost associated with this (i.e. the copying itself
plus cache-flushing during gc) but the ability for the application to
scale up without running into memory fragmentation, whether the app
is code intensive or data intensive, far outweighs the cost at gc time.

>  Out of line data referenced by the
> instructions may as well be on a separate page.  It's somewhat easier
> to deal with them separately and generally better for performance.

I am much more sympathetic to this less-absolute comment, and in
fact I agree with it.  For various reasons we provide the ability to
create a ".pll" file (which stands for "pure lisp library") which
contains code vectors and strings.  Once a pll is established, code
vectors and constant strings are matched against those in the pll and
are used there instead of where they would have been in the heap,
thus allowing the gc to work much less hard at managing its heap objects.
And besides not having to be concerned about cache-flushing, many
code vectors are bit-for-bit identical, so they can be merged and
this causes a reduction in working set size.  There is also a dramatic
reduction in heap size, because code vectors and strings are no longer
in the heap, so the gc has less to deal with.

-- 
Duane Rettig    ·····@franz.com    Franz Inc.  http://www.franz.com/
555 12th St., Suite 1450               http://www.555citycenter.com/
Oakland, Ca. 94607        Phone: (510) 452-2000; Fax: (510) 452-0182   
From: George Neuner
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <2ubhp1hc24ahe34d99h7u1vdd1n1eagura@4ax.com>
On Thu, 08 Dec 2005 09:45:39 -0800, Duane Rettig <·····@franz.com>
wrote:

>George Neuner <·········@comcast.net> writes:
>
>>Even on the fly code generation doesn't pass muster as far as I'm
>> concerned.  While instruction bytes are being generated they are data
>> ... once finished they are code.
>
>Not necessarily.  The GC views any code vectors it sees as data,
>and if any are located in recent generations, they tend to get
>moved.  There is a cost associated with this (i.e. the copying itself
>plus cache-flushing during gc) but the ability for the application to
>scale up without running into memory fragmentation, whether the app
>is code intensive or data intensive, far outweighs the cost at gc time.

I get your point, but I was really responding to Thomas's comment
about mixing code and data on the same page.  Treating the entire page
at different times as code or as data for management purposes doesn't
bother me.  You only take a hit on the flip and the hit is only severe
if a code access directly follows a data access.

George
--
for email reply remove "/" from address
From: Duane Rettig
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <o07jaeeoft.fsf@franz.com>
George Neuner <·········@comcast.net> writes:

> On Thu, 08 Dec 2005 09:45:39 -0800, Duane Rettig <·····@franz.com>
> wrote:
>
>>George Neuner <·········@comcast.net> writes:
>>
>>>Even on the fly code generation doesn't pass muster as far as I'm
>>> concerned.  While instruction bytes are being generated they are data
>>> ... once finished they are code.
>>
>>Not necessarily.  The GC views any code vectors it sees as data,
>>and if any are located in recent generations, they tend to get
>>moved.  There is a cost associated with this (i.e. the copying itself
>>plus cache-flushing during gc) but the ability for the application to
>>scale up without running into memory fragmentation, whether the app
>>is code intensive or data intensive, far outweighs the cost at gc time.
>
> I get your point, but I was really responding to Thomas's comment
> about mixing code and data on the same page.  Treating the entire page
> at different times as code or as data for management purposes doesn't
> bother me.  You only take a hit on the flip and the hit is only severe
> if a code access directly follows a data access.

Only on a cache-line basis, not on a page basis.  My argument is with
the granularity of your assertion.  Now, if you're talking about actually
changing access permissions in the mmu between write and execute, then
of course that is a problem.  But in a system which is going to execute
code which it has just built as data, the page it lives on must already
have read, write, and execute permissions in order for performance not
to go down the tubes.

-- 
Duane Rettig    ·····@franz.com    Franz Inc.  http://www.franz.com/
555 12th St., Suite 1450               http://www.555citycenter.com/
Oakland, Ca. 94607        Phone: (510) 452-2000; Fax: (510) 452-0182   
From: George Neuner
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <ojjmp1hftkdu3qbok4mjote1e72qaho85r@4ax.com>
On Thu, 08 Dec 2005 23:09:58 -0800, Duane Rettig <·····@franz.com>
wrote:

>George Neuner <·········@comcast.net> writes:
>
>> On Thu, 08 Dec 2005 09:45:39 -0800, Duane Rettig <·····@franz.com>
>> wrote:
>>
>>>George Neuner <·········@comcast.net> writes:
>>>
>>>>Even on the fly code generation doesn't pass muster as far as I'm
>>>> concerned.  While instruction bytes are being generated they are data
>>>> ... once finished they are code.
>>>
>>>Not necessarily.  The GC views any code vectors it sees as data,
>>>and if any are located in recent generations, they tend to get
>>>moved.  There is a cost associated with this (i.e. the copying itself
>>>plus cache-flushing during gc) but the ability for the application to
>>>scale up without running into memory fragmentation, whether the app
>>>is code intensive or data intensive, far outweighs the cost at gc time.
>>
>> I get your point, but I was really responding to Thomas's comment
>> about mixing code and data on the same page.  Treating the entire page
>> at different times as code or as data for management purposes doesn't
>> bother me.  You only take a hit on the flip and the hit is only severe
>> if a code access directly follows a data access.
>
>Only on a cache-line basis, not on a page basis.  My argument is with
>the granularity of your assertion.  Now, if you're talking about actually
>changing access permissions in the mmu between write and execute, then
>of course that is a problem.  But in a system which is going to execute
>code which it has just built as data, the page it lives on must already
>have read, write, and execute permissions in order for performance not
>to go down the tubes.

I'm not certain about this.  I don't have the extensive cross platform
experience that you apparently do, but my understanding of the Intel
design was that the 8-way associative cache's time to report an address
clean was variable because of the address decoding and lookup logic -
the best case was no lines at all from the page, and the worst was that
the address was in a dirty line that had to be written back.

The variable response time mattered because the IF unit bypasses the
d-cache and reads RAM directly, but isn't permitted to fetch
instructions from a dirty address.  My understanding was that the IF
unit starts the fetch and cache query in parallel, but delays delivery
to the decode queue until the cache responds, aborts if the cache
reports the address is dirty and then waits for the address to be
clean before trying again.

It's been a while since I read the manuals but I remember the cache
seemed to have some conditions under which it could take extra cycles
to answer, stalling the pipeline even though the address was clean.

Of course I could be mis-remembering everything.

George
--
for email reply remove "/" from address
From: Rob Warnock
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <YsCdnZJ-SdXb_QfenZ2dnUVZ_v-dnZ2d@speakeasy.net>
George Neuner  <·········@comcast.net> wrote:
+---------------
| I don't think it's possible in most current OSes to bypass the file
| cache when reading using standard I/O calls.
+---------------

But note that many current operating systems, especially those that
support high-performance filesystems like XFS (which includes Irix
and Linux, at a minimum), provide an O_DIRECT mode bit in "open()"
which *does* bypass the cache. From "man 2 open" on Linux:

    O_DIRECT
       Try to minimize cache effects of the I/O to and from this  file.
       In  general  this  will degrade performance, but it is useful in
       special situations, such  as  when  applications  do  their  own
       caching.   File I/O is done directly to/from user space buffers.
       The I/O is synchronous, i.e., at the completion of  the  read(2)
       or  write(2) system call, data is guaranteed to have been trans-
       ferred.  Under Linux 2.4 transfer sizes, and  the  alignment  of
       user buffer and file offset must all be multiples of the logical
       block size of the file system.  Under  Linux  2.6  alignment  to
       512-byte boundaries suffices.
       A  semantically similar interface for block devices is described
       in raw(8).


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: George Neuner
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <8rkmp155thsc1cobj5q0miqion0e56jqhd@4ax.com>
On Fri, 09 Dec 2005 22:57:42 -0600, ····@rpw3.org (Rob Warnock) wrote:

>George Neuner  <·········@comcast.net> wrote:
>+---------------
>| I don't think it's possible in most current OSes to bypass the file
>| cache when reading using standard I/O calls.
>+---------------
>
>But note that many current operating systems, especially those that
>support high-performance filesystems like XFS (which includes Irix
>and Linux, at a minimum), provide an O_DIRECT mode bit in "open()"
>which *does* bypass the cache. From "man 2 open" on Linux:
>
>    O_DIRECT
>       Try to minimize cache effects of the I/O to and from this  file.
>       In  general  this  will degrade performance, but it is useful in
>       special situations, such  as  when  applications  do  their  own
>       caching.   File I/O is done directly to/from user space buffers.
>       The I/O is synchronous, i.e., at the completion of  the  read(2)
>       or  write(2) system call, data is guaranteed to have been trans-
>       ferred.  Under Linux 2.4 transfer sizes, and  the  alignment  of
>       user buffer and file offset must all be multiples of the logical
>       block size of the file system.  Under  Linux  2.6  alignment  to
>       512-byte boundaries suffices.
>       A  semantically similar interface for block devices is described
>       in raw(8).

You're right, it was introduced in Linux 2.4.  I have been working in
the Windows world for a while so I missed it.

Thanks,
George
--
for email reply remove "/" from address
From: Carl Shapiro
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <ouyhd9gk0j6.fsf@panix3.panix.com>
George Neuner <·········@comcast.net> writes:

> You're right, it was introduced in Linux 2.4.  I have been working in
> the Windows world for a while so I missed it.

In the Win32 world, pass FILE_FLAG_NO_BUFFERING to CreateFile for
bypassing the buffer cache.  OpenVMS also allows the user to toggle
buffering, which is often off by default.  POSIX 1003.1b introduced
the O_SYNC and O_RSYNC flags to open(2) which can control write and
read buffering.  I don't know how widely (or correctly) implemented
this feature is on today's UNIX systems.
From: Greg Menke
Subject: Re: Is Greenspun enough?
Date: 
Message-ID: <m3ek4rwrom.fsf@athena.pienet>
George Neuner <·········@comcast.net> writes:

> On 03 Dec 2005 17:02:28 -0500, Greg Menke <············@toadmail.com>
> wrote:
> 
> >
> >Just got hold of some C++ at work, while I've not discovered a homebrew
> >GC yet I have discovered a buggy homebrew implementation of sprintf.
> > :
> >I asked the developer why it was there and couldn't we please just use
> >the local C library version- the answer was the reimplementation was
> >required because the developer wanted to be able to print objects
> >usefully.
> 
> That is not unusual for a small embedded application.
> 
> Suppose you need formatted printing but you know, for example, that
> you will never print floating point values and memory is tight so you
> don't want to include any unnecessary printf() helper functions in the
> executable.   You copy the source to printf() - which is included in
> your cross-compiler library - and remove all the floating point stuff.
> 
> In fact it's pretty normal to _have_ to modify the library startup
> functions to set up heap and so forth, but I've never seen a custom
> implementation of a standard library routine done for any other
> purpose than embedded.  
> 
> Certainly the reason given sounds ridiculous.

If it were a small embedded app without something like newlib, I'd buy
it- but this is an app to run on Linux on a moderate-scale machine.  

The app follows the general pattern of one written by a C developer who
moved up to C++, bought the Stroustrup book, and tries to use some OO
stuff but still mostly thinks in C.  The thing is loaded with templates
and generates an 18 megabyte!!! executable from a 4 meg source tree.

Gregm