From: Adam Warner
Subject: Followup to "Effect of multiple-processors on memory allocation"
Date: 
Message-ID: <4s4e26Ftus83U1@mid.individual.net>
Hi all,

In October Frode Vatvedt Fjeld had an interesting exchange with William D
Clinger regarding sequential consistency (in a comp.lang.lisp thread
initiated by Rob Thorpe). The message ID of the parent thread is
<·······················@h48g2000cwc.googlegroups.com>. William D Clinger
made these claims w.r.t. the x86 architecture at message ID
<························@m7g2000cwm.googlegroups.com>, archived at
<http://groups.google.com/group/comp.lang.lisp/msg/efed7a2d691d4ca3>:

   > I think most systems today provide sequential consistency, meaning
   > that the problem you describe is a non-issue for lisp or other
   > software-level systems.

   Not so.  The Pentium 4, Xeon, and P6 family provide processor
   order.  The Sparc family provides total store order (and two even
   weaker orders).  None of these provide strong ordering, which is
   probably what you think sequential consistency means. 

Frode Vatvedt Fjeld responded that he was not thinking of "strict
consistency", leading William D Clinger to make some further claims:

   If you want sequential consistency in a multiprocessor system,
   with most current hardware, you need to follow certain rules when
   writing your program, and you need a compiler that inserts memory
   barriers according to certain rules, and also limits its optimizations
   to conform to certain rules.  On x86 systems, memory barriers are
   typically implemented by MFENCE instructions.  On Sparcs, memory
   barriers are typically implemented by membar instructions.

Taken in context, William D Clinger appears to imply that one must, at a
minimum, use memory barriers on multiprocessor Pentium 4, Xeon, and P6
systems to ensure sequential consistency.

While this appears to be a highly contagious meme, I haven't found any
evidence for it. Yet it takes only one counterexample to prove that the x86
shared-memory multiprocessor architecture is not sequentially consistent.
I suspect there is no counterexample that isn't a processor or system
erratum. I've posted a request for a counterexample to x86 sequential
consistency in comp.arch and comp.programming.threads (message ID
<··············@mid.individual.net>, archived at
<http://groups.google.com/group/comp.arch/msg/6fb1ae78bfe4f386>).
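
For concreteness, a counterexample would take the shape of the classic
store-buffering litmus test. Here is a minimal sketch in C with POSIX
threads (my own illustration, not from any cited post; volatile is used
only to keep the compiler from caching or reordering the accesses, and
inserts no hardware barriers, which is precisely the property under
test). Under sequential consistency the outcome r1 == 0 and r2 == 0 is
impossible; a run that produced it would be exactly the counterexample
I'm asking for:

   #include <pthread.h>
   #include <stdio.h>

   volatile int x = 0, y = 0;   /* shared flags    */
   volatile int r1, r2;         /* observed values */

   static void *thread_a(void *arg) { x = 1; r1 = y; return NULL; }
   static void *thread_b(void *arg) { y = 1; r2 = x; return NULL; }

   int main(void)
   {
       long i;
       for (i = 0; i < 1000000; i++) {
           pthread_t a, b;
           x = y = 0;
           pthread_create(&a, NULL, thread_a, NULL);
           pthread_create(&b, NULL, thread_b, NULL);
           pthread_join(a, NULL);
           pthread_join(b, NULL);
           if (r1 == 0 && r2 == 0)
               printf("reordering observed on iteration %ld\n", i);
       }
       return 0;
   }

In practice one would pin the two threads to different processors and
synchronize their start to raise the odds of catching a reordering, but
the shape of the test is the point.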

If the x86 architecture is sequentially consistent then the multiprocessing
benefits are significant, as explained in 1999 by Andy Glew at message ID
<············@news.doit.wisc.edu> and archived at
<http://groups.google.com/group/comp.arch/msg/db53f79ece143334>:

   Sequential consistency obtained via snooping is global:
   it'll work for any ordering model you want. It doesn't
   require complicated fence operations that specify exactly
   what gets ordered and what not.

   Yet, for all this, it is simple and high performance.

   E.g. a system that maintains sequential consistency
   via snooping never needs to stall its pipeline for
   multiprocessor coherency.  Synchronization operations
   are cheap, and can be done in the cache.

   Weakly ordered systems need some sort of external visibility
   of fence operations. That can quickly become the bottleneck.

   So: sequential consistency via snooping is
   a) simple
   b) faster than weakly ordered systems with externally visible fences
   c) semantically more flexible.

What's the alternative? Here's Andy Glew from the same thread (message ID
<·············@news.doit.wisc.edu>, archived at
<http://groups.google.com/group/comp.arch/msg/b9221274acfc4097>):

   Memory barriers are pretty damned costly, if they are implemented
   via pipeline flushes.

   All memory barriers require some external visibility,
   which is itself complicated (unless you have an unordered processor
   and an ordered interconnect).

   If not implemented via flushes, memory barriers require markings
   on transactions, and add a whole slew of boundary cases.

   ===

   I turn your contention on its head:
   I'm a big advocate of weak ordering.

   But, I have seen just how much faster sequential consistency
   implemented by snooping can be.  Faster, less hardware,
   fewer validation conditions.

   I think the real question is, how can we build a weakly ordered
   memory system that is as fast as the sequential one?

I'd like to exploit the benefits of x86-32 and x86-64 sequential
consistency to communicate between multiple processors via POSIX threads
without using explicit synchronization primitives (synchronization is
handled cheaply and implicitly by the hardware caches of the x86
architecture). If the shared-memory x86 multiprocessor architecture is
not sequentially consistent then my software may generate corrupt data
that is impossible to reproduce consistently (presuming I'm even lucky
enough to detect the corruption).

These are high stakes for anyone implicitly relying upon x86 sequential
consistency. There can be practical resolution of this issue. Let's see
evidence for claims that current x86 shared-memory multiprocessor
architectures are not sequentially consistent.

Regards,
Adam

From: William D Clinger
Subject: Re: Followup to "Effect of multiple-processors on memory allocation"
Date: 
Message-ID: <1163735756.991668.210760@j44g2000cwa.googlegroups.com>
Adam Warner wrote:
> These are high stakes for anyone implicitly relying upon x86 sequential
> consistency. There can be practical resolution of this issue. Let's see
> evidence for claims that current x86 shared-memory multiprocessor
> architectures are not sequentially consistent.

You might start by reading section 7.2 of the IA-32 Intel Architecture
Software Developer's Manual, Volume 3: System Programming Guide [2].
To understand the practical implications of the admittedly technical
prose in that manual, see Hans Boehm's PLDI 2005 paper [1].  Doug
Lea's cookbook for compiler writers gives practical advice on how to
deal with the x86 family [3].

Will

[1] Hans-J Boehm.  Threads cannot be implemented as a library.
    PLDI 2005, pages 261-268.
    http://portal.acm.org/citation.cfm?doid=1065010.1065042
    TR version at:
    http://www.hpl.hp.com/techreports/2004/HPL-2004-209.html
    Weblog discussion at:
    http://lambda-the-ultimate.org/node/950
[2] IA-32 Intel Architecture Software Developer's Manual
    Volume 3: System Programming Guide
    http://download.intel.com/design/Pentium4/manuals/25366820.pdf
[3] Doug Lea.  The JSR-133 Cookbook for Compiler Writers.
    http://gee.cs.oswego.edu/dl/jmm/cookbook.html
From: Adam Warner
Subject: Re: Followup to "Effect of multiple-processors on memory allocation"
Date: 
Message-ID: <4s596tFu1ms2U1@mid.individual.net>
On Thu, 16 Nov 2006 19:55:57 -0800, William D Clinger wrote:

> Adam Warner wrote:
>> These are high stakes for anyone implicitly relying upon x86 sequential
>> consistency. There can be practical resolution of this issue. Let's see
>> evidence for claims that current x86 shared-memory multiprocessor
>> architectures are not sequentially consistent.
> 
> You might start by reading section 7.2 of the IA-32 Intel Architecture
> Software Developer's Manual, Volume 3: System Programming Guide [2].

I can see how you got the impression that current IA-32 microprocessor
architectures are not sequentially consistent, yet the document actually
indicates the opposite.

First off, quoting Andy Glew again to keep our definitions in sync (BTW 
"Andy Glew's only real claim to fame is his participation in the Intel P6
Microarchitecture 1990-1995, Intel's first dynamic execution, out-of-order,
processor design")
<http://www.stanford.edu/class/ee380/Abstracts/011205.html>
<http://groups.google.com/group/comp.arch/msg/db53f79ece143334>:

   BTW, when I say "sequential consistency maintained via snooping", I
   include both processor ordering, and other equivalent implementation
   techniques.

Quoting Intel's misdirection first:
<http://www.intel.com/design/processor/manuals/253668.pdf>

   Software intended to operate correctly in processor-ordered processors
   (such as the Pentium 4, Intel Xeon, and P6 family processors) should
   not depend on the relatively strong ordering of the Pentium or Intel486
   processors. Instead, it should insure that accesses to shared variables
   that are intended to control concurrent execution among processors are
   explicitly required to obey program ordering through the use of
   appropriate locking or serializing operations (see Section 7.2.4,
   "Strengthening or Weakening the Memory Ordering Model").

Intel has been telling people that they should not depend upon this
relatively strong ordering for over a decade. Reality escapes in 7.2.4:

   Despite the fact that Pentium 4, Intel Xeon, and P6 family processors
   support processor ordering, Intel does not guarantee that future
   processors will support this model. To make software portable to future
   processors, it is recommended that operating systems provide critical
   region and resource control constructs and API's (application program
   interfaces) based on I/O, locking, and/or serializing instructions be
   used to synchronize access to shared areas of memory in
   multiple-processor systems. Also, software should not depend on
   processor ordering in situations where the system hardware does not
   support this memory-ordering model.

These processors support the same orderings. Intel _recommends_ that
operating systems provide synchronization constructs based on locking or
serializing instructions for access to shared memory in multiple-processor
systems, since Intel does not guarantee that future processors will
support this memory model.

Gharachorloo, K. Memory consistency models for shared-memory
multiprocessors. Tech. Rep. CSL-TR-95-685, Computer Systems Laboratory,
Departments of Electrical Engineering and Computer Science, Stanford
University, Stanford, CA, December 1995. Available:
<www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-9.pdf>

Page 298:

   Other companies such as Hewlett-Packard (HP), Intel, and Silicon
   Graphics (SGI) have chosen to support sequential consistency in their
   current architectures. The next generation processors from these
   companies exploit aggressive implementation techniques in conjunction
   with dynamically scheduled processors to efficiently support sequential
   consistency (mentioned in the previous paragraph). However, there is
   still a dependence on relaxed models for enabling compiler
   optimizations for explicitly parallel programs. Furthermore, companies
   such as Intel still maintain the option to move to relaxed models by
   recommending that programmers use special serializing and locking
   operations for future compatibility.

The P6 is one of the "next generation processors ... [to] ... exploit
aggressive implementation techniques in conjunction with dynamically
scheduled processors to efficiently support sequential consistency". Intel
still maintains the option to move to relaxed models by recommending that
programmers use special serializing and locking operations for future
compatibility. The status quo hasn't changed since 1995.

Each of these generations is 100% backwards compatible with binaries of
the previous generation compiled without special serializing and locking
operations. There would have been chaos if the P4 had started corrupting an
enormous worldwide installed base of P6 binaries for multiprocessor systems.

> To understand the practical implications of the admittedly technical
> prose in that manual, see Hans Boehm's PLDI 2005 paper [1].  Doug Lea's
> cookbook for compiler writers gives practical advice on how to deal with
> the x86 family [3].
> 
> Will
> 
> [1] Hans-J Boehm.  Threads cannot be implemented as a library.
>     PLDI 2005, pages 261-268.
>     http://portal.acm.org/citation.cfm?doid=1065010.1065042 TR version
>     at:
>     http://www.hpl.hp.com/techreports/2004/HPL-2004-209.html

Page 7:

   These techniques generally rely on the ability to access shared
   variables with ordinary load and store instructions. In practice, some
   control over reordering memory references is needed as well, but much
   of the performance of these techniques is attributable to minimizing
   such restrictions on reordering.

> [3] Doug Lea.  The JSR-133 Cookbook for Compiler Writers.
>     http://gee.cs.oswego.edu/dl/jmm/cookbook.html

Where's the claim that the P4 is not sequentially consistent? Everything
except StoreLoad is a no-op. "StoreLoad barriers protect against a
subsequent load incorrectly using Store1's data value rather than that
from a more recent store to the same location performed by a different
processor."

If a different processor is still seeing the older data value it can still
be sequentially consistent. This barrier ensures that a different
processor sees the new value immediately. My algorithm polls for the new
value. Once it sees the new value it can deduce that the processor can
also see all previous writes by the other processor due to sequential
consistency:

   CPU A makes a number of memory writes before writing to a memory
   location that acts as a flag. When CPU B reads the changed flag (it's
   theoretically irrelevant whether the change takes a nanosecond or a day
   to propagate to CPU B) all the writes that were made by CPU A prior to
   CPU A writing to the memory flag must be correctly readable by CPU B. A
   sequentially consistent architecture satisfies this property.

I don't need a StoreLoad barrier. If the hardware implementation is so
poor that it takes a day for the writes to propagate to CPU B then CPU B
will sit there happily polling for a day waiting for the flag to change.
But once that flag changes I must be able to rely upon all previous writes
having propagated as well.
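
In C with POSIX threads the pattern I have in mind looks something like
this sketch (mine; the names are made up, and volatile again restrains
only the compiler, not the hardware):

   #include <pthread.h>
   #include <stdio.h>

   #define N 1024

   static int data[N];            /* payload written before the flag */
   static volatile int flag = 0;  /* 0 = not ready, 1 = ready        */

   static void *producer(void *arg)
   {
       int i;
       for (i = 0; i < N; i++)
           data[i] = i;           /* writes that must become visible */
       flag = 1;                  /* ... before this store is seen   */
       return NULL;
   }

   static void *consumer(void *arg)
   {
       while (flag == 0)
           ;                      /* poll; no barrier, by design */
       /* If stores become visible in program order, every data[i]
          is now correctly readable.                               */
       printf("data[%d] = %d\n", N - 1, data[N - 1]);
       return NULL;
   }

   int main(void)
   {
       pthread_t p, c;
       pthread_create(&c, NULL, consumer, NULL);
       pthread_create(&p, NULL, producer, NULL);
       pthread_join(p, NULL);
       pthread_join(c, NULL);
       return 0;
   }

Note that this pattern requires only that stores become visible in the
order they were issued and that loads are not reordered with each other;
at no point does it require a store to be ordered before a later load
from a different location, which is what a StoreLoad barrier enforces.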

Regards,
Adam
From: Rob Thorpe
Subject: Re: Followup to "Effect of multiple-processors on memory allocation"
Date: 
Message-ID: <1163760928.670305.118160@h48g2000cwc.googlegroups.com>
Adam Warner wrote:
> Hi all,
>
> In October Frode Vatvedt Fjeld had an interesting exchange with William D
> Clinger regarding sequential consistency (in a comp.lang.lisp thread
> initiated by Rob Thorpe). The message ID of the parent thread is
> <·······················@h48g2000cwc.googlegroups.com>. William D Clinger
> made these claims w.r.t. the x86 architecture at message ID
> <························@m7g2000cwm.googlegroups.com>, archived at
> <http://groups.google.com/group/comp.lang.lisp/msg/efed7a2d691d4ca3>:

Although I started the thread, I did so to answer a question presented
to me in another forum.
That thread was:-
http://www.realworldtech.com/forums/?action=detail&id=73680&threadid=73581&roomid=11

The list of people who have debated this issue reads like a who's who of
computing (not that I consider myself a who).  The entire thread is
worth reading, if you've got a spare two weeks!

I'll comment on what you've said after I've had time to digest it.
From: Adam Warner
Subject: Re: Followup to "Effect of multiple-processors on memory allocation"
Date: 
Message-ID: <4s5ptrFu5ul1U2@mid.individual.net>
On Fri, 17 Nov 2006 02:55:28 -0800, Rob Thorpe wrote:

> Although I started the thread, I did so to answer a question presented
> to me in another forum.
> That thread was:-
> http://www.realworldtech.com/forums/?action=detail&id=73680&threadid=73581&roomid=11
> 
> The list of people who have debated this issue reads like a who's who of
> computing (not that I consider myself a who).  The entire thread is
> worth reading, if you've got a spare two weeks!
> 
> I'll comment on what you've said after I've had time to digest it.

Great! I've skimmed parts of the thread. I appear to be echoing Linus
Torvalds' observations (he was able to talk to an actual x86 architect):
<http://www.realworldtech.com/forums/?action=detail&id=73984&threadid=73581&roomid=11>

   I actually think that the most acceptable situation
   is to make both loads and stores "appear" totally ordered,
   and just execute (both) totally out-of-order.

   In fact, despite all the documentation otherwise, at least
   according to one x86 architect I've talked to, that's
   what some (and he actually claimed "all") implementations
   of the x86 architecture really do. And yes, the reason
   is exactly to make sure you don't break software by mistake.

   So it's actually possible that read barriers on x86 are
   basically just expensive no-ops. I don't know how to test
   that well, and it's not something I'd like to depend on
   anyway (since the architecture documentation clearly says
   that you should not depend on it), but if it were
   to be true, I'd actually be happier.

   Exactly because - as you say - that's obviously even
   more robust, and less surprising to the software
   side. This is not an issue of "totally robust" versus
   "totally fragile" - it's very much a matter of degrees.

   ...

   And as mentioned, x86 actually is ordered even wrt
   loads, at least for the dependent ones. This is something
   that people depend on, not just in Java VM's. It's possible
   that x86's are actually a lot more ordered than that too,
   despite the language in the architecture documentation.

   ...

   The architect I talked to seemed to imply that there is
   a lot of crap code out there, and that it's simply easier
   for them in the end to guarantee "total ordering" than
   to spend all their time validating that nobody depends on
   it.

This is a stronger memory model than I require!

BTW it has just clicked that my description of writing to a block of memory
and then setting a flag mirrors initialising an object and then setting a
memory location to the address of the object. Reads will indirect
through this pointer, so sequential consistency is sufficient to ensure
that the object will be viewed as correctly initialised.

I need no stronger memory guarantees because I'm polling for the changed
flag/pointer value. AMD guarantees that any x86-64 processor will see the
latest value "for software to operate correctly."

Page 93 of "AMD64 Architecture Programmer's Manual Volume 1: Application Programming"
<http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24592.pdf>:

   Reads can be reordered ahead of writes, except that a read cannot be
   reordered ahead of a prior write if the read is from the same location as
   the prior write. In this case, the read instruction stalls until the write
   instruction is committed. This is because the result of the write
   instruction is required by the read instruction for software to operate
   correctly.

1. Start with a pointer value of NULL.
2. Initialise an object.
3. Write the address of the object into the pointer value.
4. A subsequent read of the pointer value will stall until the write to
   the same location is committed. No CPU will ever see a NULL pointer
   value.
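
In C those four steps might look like the following sketch (mine; the
volatile qualifier on the shared pointer merely stops the compiler from
hoisting the polling load out of the loop):

   #include <stdlib.h>

   struct obj { int a, b; };
   static struct obj *volatile shared = NULL;  /* step 1: starts NULL */

   void publish(void)
   {
       struct obj *p = malloc(sizeof *p);      /* assume success here */
       p->a = 1;                   /* step 2: initialise the object   */
       p->b = 2;
       shared = p;                 /* step 3: publish its address     */
   }

   int consume(void)
   {
       struct obj *p;
       while ((p = shared) == NULL)
           ;                       /* step 4: poll until non-NULL     */
       return p->a + p->b;         /* valid if stores are seen in
                                      the order they were issued      */
   }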

I've just received a great pointer to the qprof library that includes
atomic operations for Linux programs:
Message ID <··············@mid.individual.net>
<http://groups.google.com/group/comp.programming.threads/msg/f675f5d6a5b29644>

The qprof README includes a comment that HP doesn't follow documented x86
behaviour because it is so weakly specified that it is impractical to use.
"Current real implementations appear to be much better behaved."

Regards,
Adam
From: William D Clinger
Subject: Re: Followup to "Effect of multiple-processors on memory allocation"
Date: 
Message-ID: <1163775482.142788.257590@b28g2000cwb.googlegroups.com>
Adam Warner quoting Andy Glew:
>   BTW, when I say "sequential consistency maintained via snooping", I
>   include both processor ordering, and other equivalent implementation
>   techniques.

Maybe he's using a weird definition of "snooping".
With the usual definition, snooping is irrelevant
to the basic problem, which is that the processor
issuing a write may read from shared memory before
the write it issued has appeared on the bus where
it can be snooped.

> Intel has been telling people that they should not depend upon this
> relatively strong ordering for over a decade. Reality escapes in 7.2.4:
>
>    Despite the fact that Pentium 4, Intel Xeon, and P6 family processors
>    support processor ordering, Intel does not guarantee that future
>    processors will support this model.

You are quoting this as though you think Intel's
definition of processor ordering implies sequential
consistency.  It doesn't.

>    Other companies such as Hewlett-Packard (HP), Intel, and Silicon
>    Graphics (SGI) have chosen to support sequential consistency in their
>    current architectures.

They did that through a combination of hardware
and software techniques.  In particular, support
for sequential consistency can involve compilers
that emit memory barriers as needed, and are
careful not to perform certain optimizations
that would have been transparent on older
architectures.

You should read Boehm's paper.  He used to work
at SGI, now works at HP, and is quite familiar
with their architectures and compilers.

> Where's the claim that the P4 is not sequentially consistent? Everything
> except StoreLoad is a no-op. "StoreLoad barriers protect against a
> subsequent load incorrectly using Store1's data value rather than that
> from a more recent store to the same location performed by a different
> processor."
>
> If a different processor is still seeing the older data value it can still
> be sequentially consistent.

Indeed it can be.  It is not guaranteed to be.

> If a different processor is still seeing the older data value it can still
> be sequentially consistent. This barrier ensures that a different
> processor sees the new value immediately. My algorithm polls for the new
> value. Once it sees the new value it can deduce that the processor can
> also see all previous writes by the other processor due to sequential
> consistency:
>
>    CPU A makes a number of memory writes before writing to a memory
>    location that acts as a flag. When CPU B reads the changed flag (it's
>    theoretically irrelevant whether the change takes a nanosecond or a day
>    to propagate to CPU B) all the writes that were made by CPU A prior to
>    CPU A writing to the memory flag must be correctly readable by CPU B. A
>    sequentially consistent architecture satisfies this property.
>
> I don't need a StoreLoad barrier. If the hardware implementation is so
> poor that it takes a day for the writes to propagate to CPU B then CPU B
> will sit there happily polling for a day waiting for the flag to change.
> But once that flag changes I must be able to rely upon all previous writes
> having propagated as well.

For this particular algorithm, I believe your claim
that x86 multiprocessor architectures will behave as
though they were sequentially consistent.

> Page 93 of "AMD64 Architecture Programmer's Manual Volume 1:
> Application Programming" ...:
>
>    Reads can be reordered ahead of writes, except that a read cannot be
>    reordered ahead of a prior write if the read is from the same location as
>    the prior write. In this case, the read instruction stalls until the write
>    instruction is committed. This is because the result of the write
>    instruction is required by the read instruction for software to operate
>    correctly.

How can a read instruction on processor B stall until a write
instruction on processor A has been committed, except through
communication between the two processors?  The text you quoted
is a typical guarantee for the interaction of writes and reads
issued by the same processor.  The problems we're talking about
crop up in multiprocessor systems precisely because cleverness
of that kind is no longer transparent.

Will
From: Rob Warnock
Subject: Re: Followup to "Effect of multiple-processors on memory allocation"
Date: 
Message-ID: <htidnXCXTYSiGMPYnZ2dnUVZ_u2dnZ2d@speakeasy.net>
William D Clinger <········@yahoo.com> wrote:
+---------------
| Adam Warner quoting Andy Glew:
| > Other companies such as Hewlett-Packard (HP), Intel, and Silicon
| > Graphics (SGI) have chosen to support sequential consistency in
| > their current architectures.
| 
| They did that through a combination of hardware and software
| techniques.  In particular, support for sequential consistency
| can involve compilers that emit memory barriers as needed, and are
| careful not to perform certain optimizations that would have been
| transparent on older architectures.
+---------------

I can't speak for H-P & Intel, but the SGI "Origin" (MIPS/Irix) and
"Altix" (IPF/Linux) ccNUMA systems [with many hundreds of CPUs!!]
*do* provide sequential consistency in the *hardware* -- no software
support needed[1], and *without* any global snooping[2] -- using
DASH-style directory-based cache coherency [where the global state
of a given cache line is held-in/owned-by the *memory* subsystem(s),
not the CPU(s)].

[1] After boot time, that is. At boot time the O/S is of course
    required to set up the cache system and the global memory
    switching fabric to "do the right thing" later, at runtime.

[2] There is of course "local" snooping in the front-side busses
    of each CPU by the cache-directory system, to turn local cache
    misses (reads) and writes into global cache-line reads and
    cache-line write-invalidates, respectively. [To the CPUs the
    cache-directory system appears to be "just another CPU" on
    the same front-side bus.]

+---------------
| You should read Boehm's paper. He used to work at SGI, now works
| at HP, and is quite familiar with their architectures and compilers.
+---------------

Yes, but AFAIK the only compiler tweaks [other than for pure
performance] were to fix bugs in the CPUs [e.g., the famous
"branch from the last word of a page" brokenness in R4000].

This *has* to be the case, actually, since with "Origin" & "Altix"
all of the DMA devices also participate in the global cache coherency,
and most of them are hard-coded silicon, with no "compiler tweaks"
even possible.

Though note that for the highest-performance DMA devices -- ones
with multiple connections [either physical paths or just multiple
"virtual channels"] into the memory switching fabric -- special
ordering considerations *were* necessary in a very few cases
where one needed to coerce sequential consistency into a total
order. [And, unfortunately, for a few devices interrupts and
DMA traffic went into the memory switching fabric over different
"virtual channels", and there again...]

Ob. credentials: I worked for ~13 years at SGI designing/debugging
very high performance networking and cluster-interconnect adapters,
and had to deal with this stuff on pretty much a daily basis.


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: Adam Warner
Subject: Re: Followup to "Effect of multiple-processors on memory allocation"
Date: 
Message-ID: <4s7oemFu5rn2U1@mid.individual.net>
On Fri, 17 Nov 2006 21:59:59 -0600, Rob Warnock wrote:

> William D Clinger <········@yahoo.com> wrote:
> +---------------
> | Adam Warner quoting Andy Glew:

BTW that was Adam Warner quoting the highly regarded thesis of Kourosh
Gharachorloo, "Memory consistency models for shared-memory
multiprocessors." Tech. Rep. CSL-TR-95-685, Computer Systems Laboratory,
Departments of Electrical Engineering and Computer Science, Stanford
University, Stanford, CA, December 1995. Available from:
<www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-9.pdf>

Page 298:

   Other companies such as Hewlett-Packard (HP), Intel, and Silicon
   Graphics (SGI) have chosen to support sequential consistency in their
   current architectures. The next generation processors from these
   companies exploit aggressive implementation techniques in conjunction
   with dynamically scheduled processors to efficiently support sequential
   consistency (mentioned in the previous paragraph). However, there is
   still a dependence on relaxed models for enabling compiler
   optimizations for explicitly parallel programs. Furthermore, companies
   such as Intel still maintain the option to move to relaxed models by
   recommending that programmers use special serializing and locking
   operations for future compatibility.

> | > Other companies such as Hewlett-Packard (HP), Intel, and Silicon
> | > Graphics (SGI) have chosen to support sequential consistency in
> | > their current architectures.
> | 
> | They did that through a combination of hardware and software
> | techniques.  In particular, support for sequential consistency
> | can involve compilers that emit memory barriers as needed, and are
> | careful not to perform certain optimizations that would have been
> | transparent on older architectures.
> +---------------
> 
> I can't speak for H-P & Intel, but the SGI "Origin" (MIPS/Irix) and
> "Altix" (IPF/Linux) ccNUMA systems [with many hundreds of CPUs!!]
> *do* provide sequential consistency in the *hardware* -- no software
> support needed[1], and *without* any global snooping[2] -- using
> DASH-style directory-based cache coherency [where the global state
> of a given cache line is held-in/owned-by the *memory* subsystem(s),
> not the CPU(s)].

Very informative. Any hardware architecture that requires software memory
barriers to achieve sequential consistency is, by definition, hardware
that is not sequentially consistent. I note that you pointed this out to
William in the October thread:
Message ID: <································@speakeasy.net>
<http://groups.google.com/group/comp.lang.lisp/msg/0e9fa68d6b794856>

   +---------------
   | If you want sequential consistency in a multiprocessor system,
   | with most current hardware, you need to follow certain rules when
   | writing your program, and you need a compiler that inserts memory
   | barriers according to certain rules, and also limits its optimizations
   | to conform to certain rules.
   +---------------

   If that is necessary, then I would say that such a multiprocessor
   system does *not* provide sequential consistency in the hardware.
   SGI's "Origin" and "Altix" do (and presumably a number of other
   non-x86 platforms as well).


I'm delighted that William chose to directly comment upon the soundness of
my synchronization approach.

Regards,
Adam
From: William D Clinger
Subject: Re: Followup to "Effect of multiple-processors on memory allocation"
Date: 
Message-ID: <1163861959.099560.47010@f16g2000cwb.googlegroups.com>
Rob Warnock provided this information:
> I can't speak for H-P & Intel, but the SGI "Origin" (MIPS/Irix) and
> "Altix" (IPF/Linux) ccNUMA systems [with many hundreds of CPUs!!]
> *do* provide sequential consistency in the *hardware* -- no software
> support needed[1], and *without* any global snooping[2] -- using
> DASH-style directory-based cache coherency [where the global state
> of a given cache line is held-in/owned-by the *memory* subsystem(s),
> not the CPU(s)].

IIRC, none of the MIPS processors require memory barriers.
I'm surprised the Itanium didn't, but what do I know?

The compiler still gets involved in sequential consistency
because you have to disable certain optimizations.  The
classic example is common subexpression elimination, which
cannot be performed on expressions that involve a shared
variable that might be updated by a concurrent thread.
Boehm's paper describes some lesser-known optimizations
that have to be disabled.
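
A minimal illustration of the hazard (my sketch, not taken from Boehm's
paper):

   extern int shared;   /* may be updated at any time by another thread */

   int f(void)
   {
       int a, b;
       a = shared + 1;
       /* ... unrelated work ... */
       b = shared + 1;   /* CSE would fold this into a reuse of a's
                            value, which is sound only if shared cannot
                            change between the two reads               */
       return b - a;     /* may legitimately be nonzero */
   }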

With x86 processors that use Processor Ordering, sequential
consistency also requires the compiler to emit StoreLoad
barriers; as several people have said in the various threads
and papers that have been cited, this barrier commonly takes
the form of a locked addl no-op.
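
For the curious, that no-op commonly looks something like the following
in GCC inline assembly for IA-32 (a sketch along the lines of Lea's
cookbook; the locked add of zero to a dummy stack location forces the
store buffer to drain before any later load executes):

   static inline void storeload_barrier(void)
   {
       /* lock-prefixed read-modify-write of a dummy location */
       __asm__ __volatile__ ("lock; addl $0,0(%%esp)"
                             : : : "memory", "cc");
   }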

For x86 multiprocessing of Common Lisp and Scheme, we
have at least three choices:

(1) emit enough memory barriers to ensure sequential
    consistency;
(2) emit barriers only for accesses to volatile
    variables, whatever they may be;
(3) forget about the problem, and just say it's the
    programmer's responsibility to avoid races.

With current x86 processors, (1) is likely to be slow.

(2) is basically what Java requires, and is probably
what most C and C++ programmers expect, but it would
require extensions to Common Lisp and Scheme that
identify the volatile variables, or reinterpretations of
existing constructs to imply volatile or non-volatile
semantics.

(3) is the approach taken by pthreads, SRFI 18, and
SRFI 21.  As Boehm observes, the definition of a race
depends on the language's semantics, which depends in
turn upon a memory model for concurrent execution,
which C, C++, Common Lisp, and Scheme don't officially
have, but approach (3) works pretty well in practice
regardless.

I'd like to hear from anyone else who is interested in
developing a memory model for shared-memory
multiprocessor execution of Scheme.

Will
From: Rob Thorpe
Subject: Re: Followup to "Effect of multiple-processors on memory allocation"
Date: 
Message-ID: <1163778293.494503.188690@k70g2000cwa.googlegroups.com>
Adam Warner wrote:
> On Fri, 17 Nov 2006 02:55:28 -0800, Rob Thorpe wrote:
>
> > Although I started the thread, I did so to answer a question presented
> > to me in another forum.
> > That thread was:-
> > http://www.realworldtech.com/forums/?action=detail&id=73680&threadid=73581&roomid=11
> >
> > The list of people who have debated this issue reads like a who's who of
> > computing (not that I consider myself a who).  The entire thread is
> > worth reading, if you've got a spare two weeks!
> >
> > I'll comment on what you've said after I've had time to digest it.
>
> Great! I've skimmed parts of the thread. I appear to be echoing Linus
> Torvalds' observations (he was able to talk to an actual x86 architect):
> <http://www.realworldtech.com/forums/?action=detail&id=73984&threadid=73581&roomid=11>
>
>    I actually think that the most acceptable situation
>    is to make both loads and stores "appear" totally ordered,
>    and just execute (both) totally out-of-order.
>
>    In fact, despite all the documentation otherwise, at least
>    according to one x86 architect I've talked to, that's
>    what some (and he actually claimed "all") implementations
>    of the x86 architecture really do. And yes, the reason
>    is exactly to make sure you don't break software by mistake.

I'm not sure if this is true, FWIW.  Linus merely suggested it.
I think he may be wrong in cases where interrupts occur.  I haven't got
time to check right now.
From: Rob Thorpe
Subject: Re: Followup to "Effect of multiple-processors on memory allocation"
Date: 
Message-ID: <1164116011.247598.95050@k70g2000cwa.googlegroups.com>
Rob Thorpe wrote:
> Adam Warner wrote:
> > On Fri, 17 Nov 2006 02:55:28 -0800, Rob Thorpe wrote:
> >
> > > Although I started the thread, I did so to answer a question presented
> > > to me in another forum.
> > > That thread was:-
> > > http://www.realworldtech.com/forums/?action=detail&id=73680&threadid=73581&roomid=11
> > >
> > > The list of people who have debated this issue reads like a who's who of
> > > computing (not that I consider myself a who).  The entire thread is
> > > worth reading, if you've got a spare two weeks!
> > >
> > > I'll comment on what you've said after I've had time to digest it.
> >
> > Great! I've skimmed parts of the thread. I appear to be echoing Linus
> > Torvalds' observations (he was able to talk to an actual x86 architect):
> > <http://www.realworldtech.com/forums/?action=detail&id=73984&threadid=73581&roomid=11>
> >
> >    I actually think that the most acceptable situation
> >    is to make both loads and stores "appear" totally ordered,
> >    and just execute (both) totally out-of-order.
> >
> >    In fact, despite all the documentation otherwise, at least
> >    according to one x86 architect I've talked to, that's
> >    what some (and he actually claimed "all") implementations
> >    of the x86 architecture really do. And yes, the reason
> >    is exactly to make sure you don't break software by mistake.
>
> I'm not sure if this is true, FWIW.  Linus merely suggested it.
> I think he may be wrong in cases where interrupts occur.  I haven't got
> time to check right now.

Just to mention a little more.
This article is quite interesting:-
http://www.linuxjournal.com/article/8212
Especially in the point that x86 is different from AMD64.

I'm pretty sure machines don't order reads precisely.  Doing so would
be difficult in general: many operations perform both reads and writes
and must be executed out of order.  Probably, though, 99.9% of the time
reads come out ordered.
From: William D Clinger
Subject: Re: Followup to "Effect of multiple-processors on memory allocation"
Date: 
Message-ID: <1164129994.076749.12390@b28g2000cwb.googlegroups.com>
Rob Thorpe wrote:
> This article is quite interesting:-
> http://www.linuxjournal.com/article/8212
> Especially in the point that x86 is different from AMD64.

"Interesting" is a sly way to put it.  The article says:
> Although AMD64 is compatible with x86, it offers a slightly
> stronger memory-consistency model, in that it does not
> reorder a store ahead of a load....
> Although it is possible in theory to create a parallel
> program that works on some x86 CPUs but fails on AMD64
> due to this difference in memory-consistency model, in
> practice this difference has little effect on porting code
> from x86 to AMD64.

I don't know whether the AMD64 memory model is actually
stronger than x86 processor ordering, as the article alleges.
If that is true, however, then the conclusion of the paragraph
above is backwards: it would be possible in theory to create
programs that work with AMD64 but fail with x86 processor
ordering.

Obviously that would have no effect on porting code from
x86 to AMD64.  What does have an effect, however, is that
most x86 multiprocessors were sequentially consistent up
until about five years ago (as shown by Adam Warner's
quotations from ten-year-old papers).  Thus old x86
multiprocessor programs may not port to AMD64, just as
they may not port to current x86 multiprocessors.

Will
From: Rob Thorpe
Subject: Re: Followup to "Effect of multiple-processors on memory allocation"
Date: 
Message-ID: <1164132270.341158.178710@m73g2000cwd.googlegroups.com>
William D Clinger wrote:
> Rob Thorpe wrote:
> > This article is quite interesting:-
> > http://www.linuxjournal.com/article/8212
> > Especially in the point that x86 is different from AMD64.
>
> "Interesting" is a sly way to put it.  The article says:
> > Although AMD64 is compatible with x86, it offers a slightly
> > stronger memory-consistency model, in that it does not
> > reorder a store ahead of a load....
> > Although it is possible in theory to create a parallel
> > program that works on some x86 CPUs but fails on AMD64
> > due to this difference in memory-consistency model, in
> > practice this difference has little effect on porting code
> > from x86 to AMD64.
>
> I don't know whether the AMD64 memory model is actually
> stronger than x86 processor ordering, as the article alleges.
> If that is true, however, then the conclusion of the paragraph
> above is backwards: it would be possible in theory to create
> programs that work with AMD64 but fail with x86 processor
> ordering.

Indeed.

> Obviously that would have no effect on porting code from
> x86 to AMD64.  What does have an effect, however, is that
> most x86 multiprocessors were sequentially consistent up
> until about five years ago (as shown by Adam Warner's
> quotations from ten-year-old papers).  Thus old x86
> multiprocessor programs may not port to AMD64, just as
> they may not port to current x86 multiprocessors.

Yes. I agree.  I don't know if Intel have adopted this aspect of AMD64
either.  There are some details of it they have not adopted.
From: Alex Mizrahi
Subject: Re: Followup to "Effect of multiple-processors on memory allocation"
Date: 
Message-ID: <455f1425$0$49199$14726298@news.sunsite.dk>
(message (Hello 'Adam)
(you :wrote  :on '(17 Nov 2006 12:57:32 GMT))
(

 AW> This is a stronger memory model than I require!

well, but what is this then:

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4827353

they say they have a reproducible Java VM crash when membar is made a no-op.
maybe i'm missing something, but why does membar cause problems? what do
they use it for?

)
(With-best-regards '(Alex Mizrahi) :aka 'killer_storm)
"People who lust for the Feel of keys on their fingertips (c) Inity")