From: John W.F. McClain
Subject: Re: Ideas for new LISP Machines
Date: 
Message-ID: <1992Apr1.194523.9898@athena.mit.edu>
In article <·················@ibis.cs.umass.edu> ····@cs.umass.edu writes:

>For the most part, though, I think LISP, Smalltalk, etc. hardware is dead for
>the economic and market reasons touched on above. Adding some minor features
>to general purpose architectures seems to be the farthest one should
>rationally go in the face of market forces today and in the foreseeable
>future.

Actually this gets at one of my sub-questions.  What "minor" features
could one add to an architecture that would make it more LISP
friendly?  Would these features have a negative impact on other
languages?  To put it another way, how would one provide features in a
RISC architecture to make it more LISP friendly in a way consistent with
the RISC design philosophy?  Is this possible?

One suggestion made in comp.lang.lisp is tag bits.  But how would one
add tag bits to an architecture while maintaining a RISC style?  Are
there any other ideas?

John W.F. McClain
····@athena.mit.edu

From: Paul Wilson
Subject: Re: Ideas for new LISP Machines
Date: 
Message-ID: <ktkt4iINN65g@boogie.cs.utexas.edu>
In article <····················@athena.mit.edu> ····@athena.mit.edu (John W.F. McClain) writes:
>In article <·················@ibis.cs.umass.edu> ····@cs.umass.edu writes:
>
>>For the most part, though, I think LISP, Smalltalk, etc. hardware is dead for
>>the economic and market reasons touched on above. Adding some minor features
>>to general purpose architectures seems to be the farthest one should
>>rationally go in the face of market forces today and in the foreseeable
>>future.
>
>Actually this gets at one of my sub-questions.  What "minor" features
>could one add to an architecture that would make it more LISP
>friendly?  Would these features have a negative impact on other
>languages?  To put it another way, how would one provide features in a
>RISC architecture to make it more LISP friendly in a way consistent with
>the RISC design philosophy?  Is this possible?
>
>One suggestion made in comp.lang.lisp is tag bits.  But how would one
>add tag bits to an architecture while maintaining a RISC style?  Are
>there any other ideas?
>

The main thing you probably want is a very large second-level cache,
big enough to hold the whole youngest generation of a generational
gc.  Then you want a nonblocking prefetching (software-controlled is
OK) first-level cache so that you can prefetch the memory you're about
to allocate.  Otherwise you'll miss your cache A LOT, because a
high-performance Lisp system typically allocates (hence touches,
dirties, and writes back) a lot of memory between garbage collections. 
(This is inherent in the way efficient gc's work, whether they're
copying, mark-sweep, or whatever.  You trade some space for time,
by gc'ing seldom enough that most objects are born and die between
collections.)
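
To make the prefetching point concrete, here is a rough C sketch of a
bump allocator that prefetches ahead of its allocation pointer (the
allocator layout, the prefetch distance, and the GCC-style
__builtin_prefetch standing in for a software prefetch instruction are
assumptions for illustration only):

    #include <stddef.h>

    #define LINE   64            /* assumed cache-line size */
    #define AHEAD  (8 * LINE)    /* how far ahead to prefetch (a guess) */

    static char *alloc_ptr;      /* next free byte in the nursery */
    static char *alloc_limit;    /* end of the nursery */

    /* Bump-pointer allocation that prefetches (for writing) the memory
       it is about to hand out, so the initializing stores into the
       fresh object don't miss the cache. */
    static void *alloc(size_t nbytes)
    {
        nbytes = (nbytes + 7) & ~(size_t)7;        /* 8-byte alignment */
        if (alloc_ptr + nbytes > alloc_limit)
            return NULL;                           /* caller triggers a GC */
        __builtin_prefetch(alloc_ptr + AHEAD, 1);  /* 1 = prefetch for write */
        void *obj = alloc_ptr;
        alloc_ptr += nbytes;
        return obj;
    }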

Have a look at the SPARC's tagged arithmetic instructions.  All they
really do---if I recall correctly---is check to see if the low two
bits are zeroes, and make it easy (by setting condition codes?) to
do a quick branch if they're not.  Steenkiste and Hennessy's paper
on tags explains how to exploit conventional hardware's address
decode logic (and object alignment constraints) to get some of the
other benefits of tags, and how careful choice of tag values allows
neat hacks so that they don't cost much.  
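
In software terms that check amounts to roughly the following C sketch
(assuming the usual low-tag convention where fixnums carry 00 in their
two low bits; the names are invented, and the real instruction also
flags signed overflow, which is omitted here):

    #include <stdint.h>

    #define TAG_MASK 0x3    /* low two bits hold the type tag */

    /* What a SPARC-style tagged add boils down to: if either operand's
       low two bits are nonzero, branch off to the fully general path. */
    static intptr_t tagged_add(intptr_t x, intptr_t y, int *slow)
    {
        if (((x | y) & TAG_MASK) != 0) {   /* not both fixnums */
            *slow = 1;
            return 0;
        }
        *slow = 0;
        return x + y;   /* the sum of two tag-00 fixnums is still tag 00 */
    }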

The SPARC is the intellectual descendent of the SOAR chip from Berkeley.
Dave Ungar's 1986 thesis talks about the cost/benefit analysis of tagging
and other features for supporting dynamically typed languages like
Lisp and Smalltalk.

Ungar argues that most features don't really have an important impact
on performance, or actually DEGRADE performance by affecting the CPU's
basic cycle time and/or by making the chip harder to design and get to
market.  (A 6 month late chip is slower just because the competition
has gotten significantly faster in 6 months.)  And the LAST thing you
want to do is interfere with the basic pipeline of a hot chip.

Of the few features the SOAR folk DID decide were worth the effort and
chip area, some are of questionable value now, because people have come
up with better software techniques.  Register windows are an example---if
you have a good register allocator in your compiler, and do snazzy inlining
and aggressive optimizations, the register windows become much less
useful.  (Ungar, Chambers, and Hoelzle have done wonderful things with
compiling away the costs of dynamic dispatching... see their papers in
the OOPSLA and SIGPLAN conferences of the last few years.)

Another example is hardware support for incremental and generational gc's.
MIT, Symbolics, and TI Lisp machines had invisible forwarding pointers
in hardware, so that a copying collector could move objects around while
the program was running.  That way, it didn't have to stop the running
program to collect garbage.  Without hardware support, that would be
very expensive.  But now Henry's invented a non-copying version of his
incremental scheme, which avoids the problem entirely.

Soar and SPUR (the Lisp-oriented RISC that followed SOAR) incorporated
hardware support for the detection of intergenerational pointers. (That is,
pointers from old data, which don't get gc'd often, into new data,
which do.)  My 1989 OOPSLA paper shows why that's probably not
the right thing to do anyway, and you're better off doing "card marking",
where you just keep dirty bits.  Ungar has since improved on my scheme
by using bytes as dirty bits, because byte stores are cheap and fast on
most architectures.  (We just modified my GC to use Ungar's byte maps, too.)
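
A card-marking write barrier with byte marks is just a shift and a
byte store on every pointer store into the heap.  A minimal C sketch
(the card size, heap_base, and the table layout are assumptions for
illustration):

    #include <stdint.h>

    #define CARD_SHIFT 9               /* assumed 512-byte cards */

    static char    *heap_base;         /* start of the collected heap */
    static uint8_t *card_table;        /* one byte per card, 0 = clean */

    /* Every store of a potential pointer into the heap goes through
       this: do the store, then dirty the card covering the updated
       slot with a plain byte store. */
    static void write_barrier(void **slot, void *value)
    {
        *slot = value;
        uintptr_t offset = (uintptr_t)((char *)slot - heap_base);
        card_table[offset >> CARD_SHIFT] = 1;
    }

At collection time the collector scans only the objects on dirty cards
for pointers into the young generation, then clears the marks.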

Fast atomic byte stores are handy.  Atomic bit stores would be nice, too.
You want them atomic because setting a byte or bit in a word shouldn't
interfere with another thread's doing the same thing to a different
byte or bit in the same word.

I think that one nice hardware feature would be bit addressing, because
then we could steal 10 low-order bits instead of just 2, and use more
patterns for tags.  That's not likely, though, because the world pretty
much assumes C pointers-to-character these days.  And 2 bits actually
is adequate, if not ideal---it lets you distinguish between pointers,
immediate integers, and a couple of other things.  For further 
discriminations, you can consult object headers.


I think it's probably better to focus on better OS support and software
techniques than on specialized hardware.   For example, you can do
wonderful things with virtual memory protections, and take extra advantage
of the parallel checking hardware you've already got---the TLB.

In that respect, conventional architectures are fine, if the pages aren't
too large.  The current trend toward huge pages is very annoying.  The new
ARM 600 takes a more reasonable approach, having independently protectable
1KB chunks within 4KB pages.  You at least want small units of protection,
but there's no reason they should be the same as the unit of address
translation.
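
As one example of the virtual-memory trick, a generational collector
can write-protect the pages of the old generation and let the first
store into each page fault, recording the page as dirty.  A rough
POSIX sketch (error handling is elided, remember_dirty_page is a
hypothetical GC routine, and old_gen is assumed to be page-aligned):

    #include <signal.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static long page_size;

    extern void remember_dirty_page(void *page);    /* GC-specific, assumed */

    static void fault_handler(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        void *page = (void *)((uintptr_t)si->si_addr &
                              ~(uintptr_t)(page_size - 1));
        remember_dirty_page(page);               /* log it for the collector */
        mprotect(page, page_size, PROT_READ | PROT_WRITE);  /* retry the store */
    }

    void install_vm_write_barrier(void *old_gen, size_t len)
    {
        page_size = sysconf(_SC_PAGESIZE);
        struct sigaction sa;
        sa.sa_sigaction = fault_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);
        mprotect(old_gen, len, PROT_READ);   /* trap the first write per page */
    }

The smaller the protectable unit, the less the collector has to rescan
per dirtied page, which is why the protection granularity matters more
than the translation granularity.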

   -- PRW

----------------------------------------
Some references.  (My papers have lots of references to other people's papers.):

@book{Unga86,
author = "Ungar, David",
title = "Design and Evaluation of a High-Performance {S}malltalk System",
publisher = "MIT Press",
address = "Cambridge, Massachusetts",
year = 1986 }

@phdthesis{Stee87,
author = "Steenkiste, Peter",
title = "Lisp on a Reduced-Instruction-Set Processor: Characterization and Optimization",
school = "Stanford University",
note = "Also appears as Technical Report CSL-TR-87-324, Stanford University Computer System Laboratory, Palo Alto, California, 1987",
month = mar,
year = 1987 }

@inproceedings{GCandMH,
author = "Wilson, Paul R.",
title = "Some Issues and Strategies in Heap Management and Memory Hierarchies",
booktitle = "{OOPSLA/ECOOP} '90 Workshop on Garbage Collection in Object-Oriented Systems",
note = "Also in {\em SIGPLAN Notices 23}(1):45--52, January 1991.",
month = oct,
year = 1990 }

@inproceedings{OSsupport,
author = "Wilson, Paul R.",
title = "Operating System Support for Small Objects",
booktitle = "Proceedings of the {IEEE} Int'l Workshop on Object Orientation in Operating Systems ({IWOOOS-II})",
note = "Palo Alto, California",
month = oct,
year = 1991 }

@article{Bake92,
author = "Baker, Jr., Henry G.",
title = "The Treadmill: Real-Time Garbage Collection Without Motion Sickness",
journal = "{ACM SIGPLAN} Notices",
volume = 27,
number = 3,
month = mar,
year = 1992,
pages = "66--70" }

-----------------------------------------

-- 
| Paul R. Wilson, runtime environmentalist          ······@cs.utexas.edu      |
| U. of Texas Computer Sciences Dept.               voice: (512) 471-9555     |
| Taylor Hall 2.124, Austin, TX 78712-1188          fax:   (512) 471-8885     |
| "Inertia makes the world go 'round."              DoD M.C. # e  (Sz GS450T) |
From: Tom Knight
Subject: Re: Ideas for new LISP Machines
Date: 
Message-ID: <TK.92Apr6120907@wheat-chex.ai.mit.edu>
In article <············@boogie.cs.utexas.edu> ······@cs.utexas.edu (Paul Wilson) writes:

   The main thing you probably want is a very large second-level cache,
   big enough to hold the whole youngest generation of a generational
   gc.

In large parallel machines, large caches containing unused data are
not benign.  There is a cost to having data in your cache (the
requirement that it be updated) whether you use the data or not.  It
is not obvious that extremely large second level caches are desirable
in such parallel machines.

   Then you want a nonblocking prefetching (software-controlled is
   OK) first-level cache so that you can prefetch the memory you're about
   to allocate.  Otherwise you'll miss your cache A LOT, because a
   high-performance Lisp system typically allocates (hence touches,
   dirties, and writes back) a lot of memory between garbage collections. 
 
While I certainly concur that you want a load/don't block instruction,
if you can guarantee that the data is freshly allocated, then you
don't have to prefetch the location into your cache.  You can simply
write it.  No one else can have a pointer to that memory (by
assumption, since you are freshly allocating it).  This is one of the
ways advanced languages can help hardware.  Don't throw the benefits
away.

   Have a look at the SPARC's tagged arithmetic instructions.  All they
   really do---if I recall correctly---is check to see if the low two
   bits are zeroes, and make it easy (by setting condition codes?) to
   do a quick branch if they're not.  Steenkiste and Hennessy's paper
   on tags explains how to exploit conventional hardware's address
   decode logic (and object alignment constraints) to get some of the
   other benefits of tags, and how careful choice of tag values allows
   neat hacks so that they don't cost much.  

I think everyone has looked at these.  They are cute, but the version
in the SPARC, especially, is little more than a joke, since the
checking is missing from almost everything except the add instruction.
We used to call these marketing tags at Symbolics, and my opinion of
them hasn't changed much.  While we might be able to do data typing
this way, the technique falls flat on its face for doing presence bits
for supporting futures or dataflow features in parallel
implementations.  Lucid found this out to their pain in the Alliant
QLisp implementation, for example.  You don't get good performance for
floats, of course, unless you are willing to butcher the precision of
your mantissas.  But hey, it's Unix.  It doesn't have to be right, as
long as it boots fast.

   The SPARC is the intellectual descendent of the SOAR chip from Berkeley.
   Dave Ungar's 1986 thesis talks about the cost/benefit analysis of tagging
   and other features for supporting dynamically typed languages like
   Lisp and Smalltalk.

   Ungar argues that most features don't really have an important impact
   on performance, or actually DEGRADE performance by affecting the CPU's
   basic cycle time and/or by making the chip harder to design and get to
   market.  (A 6 month late chip is slower just because the competition
   has gotten significantly faster in 6 months.)  And the LAST thing you
   want to do is interfere with the basic pipeline of a hot chip.

This argument is pure rubbish.  The Alpha people are desperately
trying to figure out what profitable things to do with the 3 million
extra transistors available on the next-generation die.  If you or
someone else can show me how tagging impacts the critical path of the
pipe, I might pay some attention.  Anyone who looks seriously at it
can see that the tagging paths are parallel to, shorter than, and
easier to implement than, for example, the add path.  The real
difficulties, which I don't want to minimize, have to do with fitting
the tags into power-of-two words when the IEEE standard requires 64-bit
precision.  Non-power-of-two implementations face incredible
difficulties in interfacing with the rest of the computing world.  The
most profitable approaches probably involve coding non-floats as
various types of NaNs in the double-float format.
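
For the record, the NaN-coding idea is that IEEE doubles have an
enormous number of distinct NaN bit patterns, so every non-float value
can be parked in the payload of a quiet NaN and genuine doubles need
no tag at all.  A rough C sketch of such a representation (the type
codes and the 48-bit pointer assumption are illustrative, not a
proposal):

    #include <stdint.h>
    #include <string.h>

    typedef uint64_t value;    /* every value is one 64-bit word */

    #define QNAN     UINT64_C(0x7ff8000000000000)   /* quiet-NaN prefix */
    #define TAG_PTR  UINT64_C(0x0001000000000000)   /* type codes: assumed */
    #define TAG_INT  UINT64_C(0x0002000000000000)
    #define PAYLOAD  UINT64_C(0x0000ffffffffffff)   /* low 48 bits */

    /* Anything that is not a quiet NaN is an ordinary double.  Real NaNs
       produced by arithmetic need special handling (e.g. canonicalizing
       them to one reserved pattern); that detail is omitted here. */
    static int is_double(value v) { return (v & QNAN) != QNAN; }

    static value box_double(double d)   { value v; memcpy(&v, &d, 8); return v; }
    static double unbox_double(value v) { double d; memcpy(&d, &v, 8); return d; }

    static value box_ptr(void *p)
    {
        return QNAN | TAG_PTR | ((uint64_t)(uintptr_t)p & PAYLOAD);
    }
    static void *unbox_ptr(value v)
    {
        return (void *)(uintptr_t)(v & PAYLOAD);
    }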

   Of the few features the SOAR folk DID decide were worth the effort and
   chip area, some are of questionable value now, because people have come
   up with better software techniques.  Register windows are an example---if
   you have a good register allocator in your compiler, and do snazzy inlining
   and aggressive optimizations, the register windows become much less
   useful.  (Ungar, Chambers, and Hoelzle have done wonderful things with
   compiling away the costs of dynamic dispatching... see their papers in
   the OOPSLA and SIGPLAN conferences of the last few years.)

Register windows were a bad idea, but had little to do with supporting
Lisp and Smalltalk.

   Another example is hardware support for incremental and generational gc's.
   MIT, Symbolics, and TI Lisp machines had invisible forwarding pointers
   in hardware, so that a copying collector could move objects around while
   the program was running.  That way, it didn't have to stop the running
   program to collect garbage.  Without hardware support, that would be
   very expensive.  But now Henry's invented a non-copying version of his
   incremental scheme, which avoids the problem entirely.

These techniques are very clever, but don't entirely solve the gc problems.
Write barriers are still very useful, low cost, and desirable, in my
opinion.

   Soar and SPUR (the Lisp-oriented RISC that followed SOAR) incorporated
   hardware support for the detection of intergenerational pointers. (That is,
   pointers from old data, which don't get gc'd often, into new data,
   which do.)  My 1989 OOPSLA paper shows why that's probably not
   the right thing to do anyway, and you're better off doing "card marking",
   where you just keep dirty bits.  Ungar has since improved on my scheme
   by using bytes as dirty bits, because byte stores are cheap and fast on
   most architectures.  (We just modified my GC to use Ungar's byte maps, too.)

Byte writes are a plague on the world.  Alpha does not support them,
for extremely good reasons having to do with the need to error-correct
main memory contents on a per-byte basis if you allow byte writes
-- or else do read-pause-writes for each write.  It's bad enough that
they felt it necessary to support 32-bit word writes.  I predict that
byte writes will not be "cheap and fast on most architectures" from
now on.

   Fast atomic byte stores are handy.  Atomic bit stores would be nice, too.
   You want them atomic because setting a byte or bit in a word shouldn't
   interfere with another thread's doing the same thing to a different
   byte or bit in the same word.

   I think that one nice hardware feature would be bit addressing, because
   then we could steal 10 low-order bits instead of just 2, and use more
   patterns for tags.  That's not likely, though, because the world pretty
   much assumes C pointers-to-character these days.  And 2 bits actually
   is adequate, if not ideal---it lets you distinguish between pointers,
   immediate integers, and a couple of other things.  For further 
   discriminations, you can consult object headers.

Bitwise writes would be even worse, for the same reasons.

   I think it's probably better to focus on better OS support and software
   techniques than on specialized hardware.   For example, you can do
   wonderful things with virtual memory protections, and take extra advantage
   of the parallel checking hardware you've already got---the TLB.

   In that respect, conventional architectures are fine, if the pages aren't
   too large.  The current trend toward huge pages is very annoying.  The new
   ARM 600 takes a more reasonable approach, having independently protectable
   1KB chunks within 4KB pages.  You at least want small units of protection,
   but there's no reason they should be the same as the unit of address
   translation.

The trend to large pages makes a great deal of sense, given the growth
in main memory sizes (thanks to low cost) and the large size of disk
cylinders.  It also allows relatively larger physically mapped
first-level caches, where the page translation and the cache lookup
happen in parallel.  What you are really after is not smaller page sizes, but
smaller units of control for write and read barriers.  These are not
the same, and it is a mistake to confuse them.  The ARM approach
sounds 1/2 right, probably for the wrong reasons, but now give me
write barriers.
From: David V. Wallace
Subject: Ideas for new LISP Machines
Date: 
Message-ID: <GUMBY.92Apr14021204@Cygnus.COM>
   Date: 10 Apr 92 21:56:05 GMT
   From: ······@cs.utexas.edu (Paul Wilson)

   In article <··················@Cygnus.COM> ·····@Cygnus.COM (David V. Wallace) writes:
   >
   >   Date: 7 Apr 92 19:54:48 GMT
   >   From: ······@cs.utexas.edu (Paul Wilson)
   >                                                     (I think most
   >   of the costs of floats are the compiler's problem, not the hardware's;
   >   with good inlining, dataflow analysis, etc., most floats can be raw
   >   untagged values on the stack or in registers.
   >
   >This makes scavenging the stack painful if you're doing realtime gc in
   >another execution thread.

   The easy way to do this is to partition the register set into pointer
   registers and nonpointer registers, as is done in Yale's T compiler.

This doesn't work on current architectures (and I apologize that I had
forgotten that this discussion is on _future_ architectures).  The two
problems are:
 o - on a register-poor architecture (such as the sparc) the cost of
     untagging may still be lower than the cost of all the reloads
     you'll have to do because you don't have enough free registers of
     the proper type.  Judicious tag choice and bibop-ing is needed to
     make this true, of course.
 o - You can't intercall with functions written in other languages
     unless you inhibit the GC.

Current microprocessors do make some concessions in this direction,
often with dedicated fp and sp regs.

One implementation that used this technique to great advantage was
MACLISP.  In fact, MacLisp maintained a separate PDL for flonums.
(This was made easier by the fact that on the 10 any reg -- or any
word, in fact -- can be a stack pointer.)  Its numeric performance was
claimed to be better than the 10's FORTRAN compiler's, though I never
verified this myself.

   The ``right'' way to do it is to have the compiler record what's where,
   and in what format, at what point in the code.  (For compactness, that
   might involve defaults so that you only record the (relatively few)
   exceptions explicitly.)

Would you keep this info close to the function or in a separate area?
Which has better performance?
From: Paul Wilson
Subject: Re: Ideas for new LISP Machines
Date: 
Message-ID: <kumaeoINNpkv@estoril.cs.utexas.edu>
In article <···················@Cygnus.COM> ·····@Cygnus.COM (David V. Wallace) writes:
>   Date: 10 Apr 92 21:56:05 GMT
>   From: ······@cs.utexas.edu (Paul Wilson)
>
>   In article <··················@Cygnus.COM> ·····@Cygnus.COM (David V. Wallace) writes:
>   >
>   >   Date: 7 Apr 92 19:54:48 GMT
>   >   From: ······@cs.utexas.edu (Paul Wilson)
>   >                                                     (I think most
>   >   of the costs of floats are the compiler's problem, not the hardware's;
>   >   with good inlining, dataflow analysis, etc., most floats can be raw
>   >   untagged values on the stack or in registers.
>   >
>   >This makes scavenging the stack painful if you're doing realtime gc in
>   >another execution thread.
>
>   The easy way to do this is to partition the register set into pointer
>   registers and nonpointer registers, as is done in Yale's T compiler.
>
>This doesn't work on current architectures (and I apologize that I had
>forgotten that this discussion is on _future_ architectures).  The two
>problems are:
> o - on a register-poor architecture (such as the sparc) the cost of
>     untagging may still be lower than the cost of all the reloads
>     you'll have to do because you don't have enough free registers of
>     the proper type.  Judicious tag choice and bibop-ing is needed to
>     make this true, of course.

Umm... I didn't say this was the ``right'' way to do it---just the
easy way.  (In fact, I said the opposite.)  I agree it's not as attractive
on a register-poor architecture.

I'm not sure if the SPARC qualifies as a register-poor architecture
in this respect.  It does in another respect, though.  It's got
tons o' registers, but very few GLOBAL registers you can play with
in an application-specific way.  In particular, I'd like special
global registers for values I need to reference often in my
system: a current-environment pointer, a base-pointer to my
card-marking bitmap (used for gc purposes, but used by any
code that stores potential pointers into the heap, not just
at gc time), and a coupla pervasive constants like false and null.
I'd rather have fewer registers in my windows and more in my
bank of globals.  (Registers, registers, everywhere, but...)

And if I recall correctly, the SPARC ABI has either zero or
close to zero registers you can use in this way---the global
registers are either reserved for particular uses or reserved
to the system for whatever it wants, so you can't absolutely
count on them not getting clobbered by system calls, etc.  Yuck.

(I think this may be one of the reasons that the GNU C compiler
still doesn't define a portable way to allocate a global register :-(.)

>   The ``right'' way to do it is to have the compiler record what's where,
>   and in what format, at what point in the code.  (For compactness, that
>   might involve defaults so that you only record the (relatively few)
>   exceptions explicitly.)
>
>Would you keep this info close to the function or in a separate area?
>Which has better performance?

Good question.  I'd guess locality would enter into this in funny
ways. (On timescales smaller than the gc interval, you don't want
the info grouped with the function.  On timescales larger than
that, you may.)  On the other hand, I believe Moss et al.'s results
show that the space cost is less than 20%, so maybe it's no big deal
just to group it with the code. 
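
For concreteness, the kind of compiler-emitted record in question is
something like a table keyed on code addresses, each entry saying
which frame slots hold tagged pointers at that point (a toy C sketch;
the layout is invented, and a real compiler would emit something far
more compact):

    #include <stdint.h>

    /* One entry per GC point: at code address pc, the frame is
       frame_words long, and the bits set in pointer_mask mark the
       slots that hold tagged pointers; anything else (e.g. a raw,
       untagged float) is skipped by the scavenger. */
    struct stack_map_entry {
        uintptr_t pc;
        uint16_t  frame_words;
        uint32_t  pointer_mask;
    };

    extern const struct stack_map_entry stack_maps[];  /* sorted by pc */
    extern const int n_stack_maps;

    /* Binary search used by the collector as it walks a thread's stack. */
    static const struct stack_map_entry *find_map(uintptr_t pc)
    {
        int lo = 0, hi = n_stack_maps - 1;
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            if      (stack_maps[mid].pc < pc) lo = mid + 1;
            else if (stack_maps[mid].pc > pc) hi = mid - 1;
            else    return &stack_maps[mid];
        }
        return 0;   /* not a recorded point */
    }

Whether those entries live next to the code or in a separate table is
exactly the locality question above.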

-- 
| Paul R. Wilson, runtime environmentalist          ······@cs.utexas.edu      |
| U. of Texas Computer Sciences Dept.               voice: (512) 471-9555     |
| Taylor Hall 2.124, Austin, TX 78712-1188          fax:   (512) 471-8885     |
| "Inertia makes the world go 'round."              DoD M.C. # e  (Sz GS450T) |
From: Wayne Schlitt
Subject: Re: Ideas for new LISP Machines
Date: 
Message-ID: <WAYNE.92Apr14080035@backbone.uucp>
In article <···················@Cygnus.COM> ·····@Cygnus.COM (David V. Wallace) writes:
>  o - on a register-poor architecture (such as the sparc) the cost of
>      untagging may still be lower than the cost of all the reloads
>      you'll have to do because you don't have enough free registers of
>      the proper type. [ ... ]

umm, pardon my ignorance, but in what way is the sparc a
"register-poor" architecture?  I thought it had 32 registers, split
between calling, current and called functions.  Isn't this enough?


-wayne
From: David V. Wallace
Subject: Ideas for new LISP Machines
Date: 
Message-ID: <GUMBY.92Apr16121814@Cygnus.COM>
   Date: 14 Apr 92 14:00:35 GMT
   From: ·····@backbone.uucp (Wayne Schlitt)

   In article <···················@Cygnus.COM> ·····@Cygnus.COM (David V. Wallace) writes:
   >  o - on a register-poor architecture (such as the sparc) the cost of
   >      untagging may still be lower than the cost of all the reloads
   >      you'll have to do because you don't have enough free registers of
   >      the proper type. [ ... ]

   umm, pardon my ignorance, but in what way is the sparc a
   "register-poor" architecture?  I thought it had 32 registers, split
   between calling, current and called functions.  Isn't this enough?

You're excused.  If you look at the calling convention you have only 8
general-purpose local registers.  The others are eaten up by the
calling convention.  If you start partitioning them by type you
quickly run out.

If you change the calling convention you can use a lot more, of
course, but then you can't intercall with functions written in other
languages.
From: Preston Briggs
Subject: Re: Ideas for new LISP Machines
Date: 
Message-ID: <1992Apr16.222355.25419@rice.edu>
>   umm, pardon my ignorance, but in what way is the sparc a
>   "register-poor" architecture?  I thought it had 32 registers, split
>   between calling, current and called functions.  Isn't this enough?


·····@Cygnus.COM (David V. Wallace) writes:
>You're excused.  If you look at the calling convention you have only 8
>general-purpose local registers.  The others are eaten up by the
>calling convention.  If you start partitioning them by type you
>quickly run out.

The Sparc has plenty of registers for each routine.
It's absurd to imagine that you've only got 8 for local use.
The incoming parameter registers may be reused once you're done with
the parameters (or spill the parameters if necessary).  How many
routines have 8 parameters anyway?  Further, the outgoing parameter
registers can be used for temporaries in between procedure calls.

You may have to spill a little now and then, but you always have to
spill on real programs (except for the simplest leaf routines).  But
if you can keep most of the spills out of inner loops, then you've got
enough registers.

Preston Briggs
From: Herman Rubin
Subject: Re: Ideas for new LISP Machines
Date: 
Message-ID: <45512@mentor.cc.purdue.edu>
In article <···················@Cygnus.COM> ·····@Cygnus.COM (David V. Wallace) writes:

			........................

>   >                                                     (I think most
>   >   of the costs of floats are the compiler's problem, not the hardware's;
>   >   with good inlining, dataflow analysis, etc., most floats can be raw
>   >   untagged values on the stack or in registers.
>   >
>   >This makes scavenging the stack painful if you're doing realtime gc in
>   >another execution thread.

>   The easy way to do this is to partition the register set into pointer
>   registers and nonpointer registers, as is done in Yale's T compiler.

>This doesn't work on current architectures (and I apologize that I had
>forgotten that this discussion is on _future_ architectures).  The two
>problems are:
> o - on a register-poor architecture (such as the sparc) the cost of
>     untagging may still be lower than the cost of all the reloads
>     you'll have to do because you don't have enough free registers of
>     the proper type.  Judicious tag choice and bibop-ing is needed to
>     make this true, of course.
> o - You can't intercall with functions written in other languages
>     unless you inhibit the GC.

Most architectures are register-poor.  Partitioning the registers into pointer
registers and non-pointer registers, or having separate floating-point registers,
makes for problems of all kinds.  

The idea of making things easy for the compiler rather than efficient is a
clear indication that the language/compiler combination in use is bad, since
it does not allow a programmer to do intelligent things.  This then leads to
the present approach of making sure that the programmer does not know about
the machine's capabilities, after which the machine designer leaves them out
because "nobody" uses them, and so on.
                                                                                
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
······@pop.stat.purdue.edu (Internet, bitnet)  
{purdue,pur-ee}!pop.stat!hrubin(UUCP)
From: Andy Glew
Subject: Re: Ideas for new LISP Machines
Date: 
Message-ID: <GLEW.92Apr1220416@pdx007.intel.com>
    One suggestion made in comp.lang.lisp is tag bits.  But how would one
    add tag bits to an architecture while maintaining a RISC style?  Are
    there any other ideas?

By now, people are pretty good at placing tags in the upper bits of a
pointer, and/or in the lower bits of a pointer where misaligned
address traps can be used to do some tag checking.

SUN has some tagged instructions in SPARC.

One idea I've always been fond of is to provide an architectural mask,
part of process context, that is ANDed with all virtual addresses, so
software doesn't need to mask off the tag before addressing.
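
For comparison, without such a mask every dereference of a high-tagged
pointer pays an explicit strip-the-tag step, roughly as follows (a C
caricature; the tag width is an assumption):

    #include <stdint.h>

    #define TAG_BITS  3                             /* assumed tag width */
    #define ADDR_MASK (~(uintptr_t)0 >> TAG_BITS)   /* what the hardware would AND in */

    /* The software masking that an architectural address mask makes free. */
    static uintptr_t load_word(uintptr_t tagged_ptr)
    {
        return *(uintptr_t *)(tagged_ptr & ADDR_MASK);
    }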
--

Andy Glew, ····@ichips.intel.com
Intel Corp., M/S JF1-19, 5200 NE Elam Young Pkwy, 
Hillsboro, Oregon 97124-6497

This is a private posting; it does not indicate opinions or positions
of Intel Corp.

Intel Inside (tm)
From: david carlton
Subject: Re: Ideas for new LISP Machines
Date: 
Message-ID: <CARLTON.92Apr2164722@husc8.harvard.edu>
In article <·················@pdx007.intel.com>, ····@pdx007.intel.com (Andy Glew) writes:

> One idea I've always been fond of is to provide an architectural mask,
> part of process context, that is ANDed with all virtual addresses, so
> software doesn't need to mask off the tag before addressing.

You don't need this if you tag the low bits and have cheap load with
offset.  (Or if you would be doing a load with offset anyways.)  If a
register contains a cons box that is tagged with 1, say, you can get
at the car by loading -1(register) and the cdr by loading 3(register).
(Assuming some obvious things about how the cons box is stored.)  So I
don't think that the above is really necessary.
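
In C terms the trick looks like this (purely illustrative, following
the tag value 1 and the 32-bit word layout assumed above; the load's
displacement absorbs the tag, so no masking instruction is needed):

    #include <stdint.h>

    typedef uintptr_t value;      /* one tagged machine word */
    #define PAIR_TAG 1            /* cons pointers carry a 1 in the low bits */

    /* A tagged cons pointer is the address of the car word plus 1, so
       the car sits at displacement -1 and the cdr at displacement
       word-size-minus-1 (i.e. +3 with 32-bit words, as above). */
    static value car(value p) { return *(value *)(p - PAIR_TAG); }
    static value cdr(value p) { return *(value *)(p + sizeof(value) - PAIR_TAG); }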

david carlton
·······@husc.harvard.edu

       GOOD-NIGHT, everybody..  Now I have to go administer FIRST-AID
       to my pet LEISURE SUIT!!
From: Michael Meissner
Subject: Re: Ideas for new LISP Machines
Date: 
Message-ID: <MEISSNER.92Apr3104846@tiktok.osf.org>
In article <····················@husc8.harvard.edu>
·······@husc8.harvard.edu (david carlton) writes:

| You don't need this if you tag the low bits and have cheap load with
| offset.  (Or if you would be doing a load with offset anyways.)  If a
| register contains a cons box that is tagged with 1, say, you can get
| at the car by loading -1(register) and the cdr by loading 3(register).
| (Assuming some obvious things about how the cons box is stored.)  So I
| don't think that the above is really necessary.

Though of course, the above would not work on the 88k, since offsets
are unsigned on that architecture....
--
Michael Meissner	email: ········@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142

You are in a twisty little passage of standards, all conflicting.
From: Preston Briggs
Subject: Re: Ideas for new LISP Machines
Date: 
Message-ID: <1992Apr3.165625.2503@rice.edu>
>·······@husc8.harvard.edu (david carlton) writes:
>| You don't need this if you tag the low bits and have cheap load with
>| offset.  (Or if you would be doing a load with offset anyways.)  If a
>| register contains a cons box that is tagged with 1, say, you can get
>| at the car by loading -1(register) and the cdr by loading 3(register).
>| (Assuming some obvious things about how the cons box is stored.)  So I
>| don't think that the above is really necessary.

········@osf.org (Michael Meissner) writes:
>Though of course, the above would not work on the 88k, since offsets
>are unsigned on that architecture....

Just point in front of the cons cell, so 3(reg) gets the car
and 7(reg) gets the cdr.

Preston Briggs
From: Andy Glew
Subject: Re: Ideas for new LISP Machines
Date: 
Message-ID: <GLEW.92Apr3172834@pdx007.intel.com>
    | ·······@husc8.harvard.edu (david carlton) writes:
    | You don't need this if you tag the low bits and have cheap load with
    | offset.  (Or if you would be doing a load with offset anyways.)  If a
    | register contains a cons box that is tagged with 1, say, you can get
    | at the car by loading -1(register) and the cdr by loading 3(register).
    | (Assuming some obvious things about how the cons box is stored.)  So I
    | don't think that the above is really necessary.

    [Michael Meissner	email: ········@osf.org]
    Though of course, the above would not work on the 88k, since offsets
    are unsigned on that architecture....

I believe that the MC88110 has added signed offsets back to the 88K architecture.

Yep, here it is, from the COMPCON paper p. 157:

    "An option was added to allow 16-bit immediate address offsets and
    literal constants to be treated as signed numbers (the original
    88100 treated them as unsigned)."

I think that unsigned offsets in addresses are pretty much recognized
as a nice thing for hardware, but not so nice for software.  Lacking
evidence (beyond the very first architectural study I did, way back in
1983, on the distribution of constants), I believe that unsigned
immediates in general are nice for hardware, but not necessarily the
best thing to do for software.

--

Andy Glew, ····@ichips.intel.com
Intel Corp., M/S JF1-19, 5200 NE Elam Young Pkwy, 
Hillsboro, Oregon 97124-6497

This is a private posting; it does not indicate opinions or positions
of Intel Corp.

Intel Inside (tm)
From: Michael Meissner
Subject: Re: Ideas for new LISP Machines
Date: 
Message-ID: <MEISSNER.92Apr6102821@tiktok.osf.org>
In article <·················@pdx007.intel.com> ····@pdx007.intel.com
(Andy Glew) writes:

| I believe that the MC88110 has added signed offsets back to the 88K architecture.
| 
| Yep, here it is, from the COMPCON paper p. 157:
| 
|     "An option was added to allow 16-bit immediate address offsets and
|     literal constants to be treated as signed numbers (the original
|     88100 treated them as unsigned)."

I don't know how much of the 88110 architecture has been released or
not, but my memory of the signed offsets is that it's a mode bit in
the PSW, which makes it an all-or-nothing approach.

| I think that unsigned offsets in addresses are pretty much recognized
| as a nice thing for hardware, but not so nice for software.  Lacking
| evidence (beyond the very first architectural study I did, way back in
| 1983, on the distribution of constants), I believe that unsigned
| immediates in general are nice for hardware, but not necessarily the
| best thing to do for software.

For integer constants and stack-frame addresses, I agree that signed
offsets are generally more desirable than unsigned offsets.  However,
for things like loading up a global/static variable, they are less
desirable.  They force you to add special relocations that deal with
the sign extension of bit 16 when the address is broken into two
instructions (load upper immediate followed by the memory instruction).
Furthermore, they prevent (or at least hinder) splitting up these two
instructions for scheduling purposes.

Finally, with regard to global addresses, I notice that both the R4000
and the Alpha do not deal at all well with global/static variable
addresses that are more than 32 bits.  I think there is an implicit
assumption that linkers won't put things above the 4G limit.
--
Michael Meissner	email: ········@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142

You are in a twisty little passage of standards, all conflicting.
From: Stephen L Nicoud
Subject: Re: Ideas for new LISP Machines
Date: 
Message-ID: <70470@bcsaic.UUCP>
In article <·················@pdx007.intel.com> ····@pdx007.intel.com (Andy Glew) writes:
>
>    One suggestion made in comp.lang.lisp is tag bits.  But how would one
>    add tag bits to an architecture while maintaining a RISC style?  Are
>    there any other ideas?
>
>By now, people are pretty good at placing tags in the upper bits of a
>pointer, and/or in the lower bits of a pointer where misaligned
>address traps can be used to do some tag checking.
>
>SUN has some tagged instructions in SPARC.
>
>One idea I've always been fond of is to provide an architectural mask,
>part of process context, that is ANDed with all virtual addresses, so
>software doesn't need to mask off the tag before addressing.

Anyone remember SPUR?  SPUR was the "Symbolic Processing Using RISCs"
VLSI Multiprocessor Workstation being researched and developed at
UC-Berkeley during the 1980s.

Here's an excerpt from the 1985 UCB/CSD report:

   SPUR (Symbolic Processing Using RISCs) is a workstation for
   conducting parallel processing research.  ... SPUR features include
   a large virtually-tagged cache, address translation without a
   translation buffer, LISP support with datatype tags but without
   microcode, multiple cache consistency in hardware, and an IEEE
   floating-point coprocessor without microcode.

   -- from

       SPUR: A VLSI Multiprocessor Workstation
       Hill et al.
       Report No. UCB/CSD 86/273
       December 1985

       Computer Science Division (EECS)
       University of California
       Berkeley, California 94720


   -- also

       SPUR Lisp: Design and Implementation
       Benjamin Zorn, Paul Hilfinger, Kinson Ho, James Larus
       Report No. UCB/CSD 87/373
       September 1987

       SPUR Memory System Architecture
       David A. Wood, Susan Eggers and Garth Gibson
       Report No. UCB/CSD 87/394
       December 1987

       Features for Multiprocessing in SPUR Lisp
       Benjamin Zorn, Paul Hilfinger, Kinson Ho, James Larus, Luigi Semenzato
       Report No. UCB/CSD 88/406
       March 8, 1988

Stephen
--
Neither The Boeing Company nor any of their employees, makes any
warranty, whatsoever, implied, or assumes any legal liability or
responsibility regarding any information, disclosed, or represents
that its use would not infringe privately owned rights.  No specific
reference constitutes or implies endorsement, recommendation, or
favoring by the The Boeing Company.  The views and opinions expressed
herein do not necessarily reflect those of the The Boeing Company, and
shall not be used for advertising or product endorsement purposes.
-- 
Stephen L Nicoud  <·······@boeing.com>  uw-beaver!bcsaic!stephen
Boeing Computer Services Research and Technology, Computer Science
Bellevue, Washington  USA
"I ask unananimous consent to revise and extend my remarks."
From: Guillermo J. Rozas
Subject: Re: Ideas for new LISP Machines
Date: 
Message-ID: <JINX.92Apr6120427@chamarti.ai.mit.edu>
In article <·················@pdx007.intel.com> ····@pdx007.intel.com (Andy Glew) writes:

*     One suggestion made in comp.lang.lisp is tag bits.  But how would one
*     add tag bits to an architecture while maintaining a RISC style?  Are
*     there any other ideas?

* By now, people are pretty good at placing tags in the upper bits of a
* pointer, and/or in the lower bits of a pointer where misaligned
* address traps can be used to do some tag checking.

* SUN has some tagged instructions in SPARC.

* One idea I've always been fond of is to provide an architectural mask,
* part of process context, that is ANDed with all virtual addresses, so
* software doesn't need to mask off the tag before addressing.
* --

There is a better hack for high-tagged pointers, namely XOR, and it
need not be part of a process context, but part of the instruction.
Loads and Stores currently have many bits of offset (typically 16),
and it would be easy to reuse some of them.

Consider two new instructions

    load-tagged		tag,offset(base),destination
    store-tagged	tag,offset(base),source

Where base, source, and destination are registers, and tag and offset
are constants.

The operation of load-tagged is equivalent to

    add		#offset,base,temp
    xor		#tag<<datum-length,temp,temp
    load	0(temp),destination

which is similar to the operation of a normal load, except for the xor
in the high bits.

Similarly for store-tagged.

The advantage of XOR over AND is that if you allocate your address
space correctly, you will get a trap when you are using the wrong type
(e.g. CAR of an ARRAY), and since the trap handler can XOR the tag
right back, it can even recover trivially what object was being
incorrectly manipulated.

Although this would be nice, support for generic arithmetic is even
more important.  The Sparc's tagged add and subtract are a step in the
right direction, but condition codes are considered harmful nowadays.

Consider a new pair of instructions:

    add-ok	addend,augend,condition
    sub-ok	minuend,subtrahend,condition

Where all operands are registers, and condition gets a 0 if the
operands are both fixnums and the operation does not overflow, and a 1
otherwise.  The instructions would work differently depending on where
the tags lived, but the concept works for either scheme (for high
tags, fixnums would be tagged with 0 and -1, depending on their sign).

A generic add would then become

    add-ok	addend,augend,condition
    add		addend,augend,result
    jne		condition,do_generic_case

Of course, this only improves the performance of fixnums, and not of
flonums, but it is pretty RISC-y.
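
In compiler-output terms the add-ok/add/jne sequence corresponds to
something like the following C (low tags with fixnum tag 00 assumed;
__builtin_add_overflow is a GCC/Clang builtin standing in for the
hardware overflow check, and generic_add is the hypothetical slow
path):

    #include <stdint.h>

    #define FIXNUM_TAG_MASK 0x3    /* fixnums carry 00 in the two low bits */

    extern intptr_t generic_add(intptr_t a, intptr_t b);  /* bignums, flonums... */

    static intptr_t plus(intptr_t a, intptr_t b)
    {
        intptr_t result;
        /* "add-ok": both operands fixnums, and the add doesn't overflow? */
        if (((a | b) & FIXNUM_TAG_MASK) == 0 &&
            !__builtin_add_overflow(a, b, &result))
            return result;           /* still a correctly tagged fixnum */
        return generic_add(a, b);    /* dispatch on the real types */
    }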
From: Mark Tillotson
Subject: Re: Ideas for new LISP Machines
Date: 
Message-ID: <MARKT.92Apr21173216@wundt.harlqn.co.uk>
[peering out from flame-retardant clothing]

Well, it serves me right for cross-posting without thinking, but I was
only suggesting an enhancement that would help with garbage-collected
languages, not saying that this was what RISCs are currently lacking!

OK, so I'll slightly rephrase my suggestion:

  And finally I would put in a plea for a flexible addressing mode on
  RISCs, of the form:
  [(Rb XOR d) + Ri<<n]
                                  
  The key point is that the scaling should be by a power of 2 that is
  specified in the instruction (here n), NOT implied by the size of the
  operand.  This means that you can represent integers with low-order
  zero tag bits and still use them directly to address vectors of bytes,
  words, or doubles...  Of course n is signed, so the shift can go
  either way.

  The d field allows removal of any low-bit tags from the pointer
  and need only be 4 bits or so.  This is also very handy for
  big/little endian conversion, note...
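
The point of putting the scale in the instruction is that a fixnum
index whose two low tag bits are zero can then address elements of any
width without a separate untagging step.  In C terms, the address
arithmetic being asked for is roughly (tag layout and types are
assumptions):

    #include <stdint.h>

    typedef intptr_t fixnum;    /* integer i is represented as i << 2 (tag 00) */

    /* With the shift count coming from the instruction rather than the
       operand size, the same tagged index serves every element width:
         bytes        : n = -2  (shift the tag back out)
         32-bit words : n =  0  (the tag itself is the scale-by-4)
         doubles      : n = +1                                        */
    static uint8_t  byte_ref(uint8_t *base, fixnum i)  { return base[i >> 2]; }
    static uint32_t word_ref(uint32_t *base, fixnum i) { return *(uint32_t *)((char *)base + i); }
    static double   dbl_ref(double *base, fixnum i)    { return *(double *)((char *)base + (i << 1)); }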
--

------------------------------------------------------
|\  /|          | ,  M. Tillotson       Harlequin Ltd. \
| \/ |  /\| |/\ |<   ·····@uk.co.harlqn  Barrington Hall,\
|    |  \_| |   | \  +44 223 872522       Barrington,      \
I came, I saw, I core-dumped...            Cambridge CB2 5RG \
My opinions, like my teeth, are all my own (and full of holes?)\
From: Bruce Holmer
Subject: Re: Ideas for new LISP Machines
Date: 
Message-ID: <HOLMER.92Apr3185747@rose.eecs.nwu.edu>
Andy> By now, people are pretty good at placing tags in the upper bits of a
Andy> pointer, and/or in the lower bits of a pointer where misaligned
Andy> address traps can be used to do some tag checking.

Andy> One idea I've always been fond of is to provide an architectural mask,
Andy> part of process context, that is ANDed with all virtual addresses, so
Andy> software doesn't need to mask off the tag before addressing.

David> You don't need this if you tag the low bits and have cheap load with
David> offset.  (Or if you would be doing a load with offset anyways.)  If a
David> register contains a cons box that is tagged with 1, say, you can get
David> at the car by loading -1(register) and the cdr by loading 3(register).
David> (Assuming some obvious things about how the cons box is stored.)  So I
David> don't think that the above is really necessary.


Although our research was for Prolog, you might be interested in the
following article:

"Fast Prolog with an Extended General Purpose Architecture," B. K.
Holmer, et. al., Proceedings of the 17th Int'l Symp. on Computer
Architecture, May 1990, pp 282-291 (also Sigarch Comp. Arch. News, v.
18, n. 2, June 90).

To summarize, we dedicated about 11% of our chip area to special
features to support Prolog, and these features yielded a 70%
performance gain.

For a combination of reasons, we chose to place the tag bits in the
most significant portion of the 32-bit word.  All arithmetic and logic
operations, however, used the full 32 bits (see below).  Addresses
were first passed through a segment mapping table which translated the
most significant 6 bits into a 12-bit value.  This procedure is very
similar to that used by the IBM ROMP processor.  Our use was for a
generalized masking--either you zero out the tag bits, leave them
alone (as you would do for C), or do a combination.  The processor was
word addressed, so the "loss" in address space is not as severe as for
a byte-addressed machine.  If I had to do it over again, I might choose
to put the tag bits in the least-significant bits and use byte
addresses (after the article was written, our group gained lots of
experience with porting our Prolog system to general purpose
machines--see Van Roy and Despain's article in the Jan. 1992
Computer).

The other features that we added included hardware to support fast
branching on tag bits and single-cycle double-word load and store
capability.

Feature                     Performance Benefit (individually)

Fast tag branch logic            19%
Double word memory port          17%
Segment mapping                  10%


There were some other features that you can read about in the paper if
you're interested.  In all, they gave a cumulative effect of 70%
performance benefit.  I would guess that placing the tags in the
least-significant bits of the word would give a 50% performance
benefit (since some features are no longer needed--for example, the
segment mapping).


--Bruce Holmer
······@eecs.nwu.edu