Single-threaded implementation of RMA for shared memory
------------------------------------------------------------------------

Basic Assumptions

* All of the local windows associated with the specified window object
  are located in and accessible through shared memory.

* All processors involved in the communicator are homogeneous.

* Only basic datatypes are supported.

------------------------------------------------------------------------

General Notes

------------------------------------------------------------------------

Data Structures

------------------------------------------------------------------------

MPID_shm_Win_create

* If the shared memory is not cache coherent, then initialize the
  preceding put flag

  If the local window is located in non-cache-coherent shared memory,
  then we need to track put operations to the local window which (might)
  have occurred since the last fence.  This tracking is required so that
  cache lines associated with the local window can be invalidated,
  ensuring that the local process sees the changes.

  Q: Can puts happen before the first fence?  In other words, is an
  exposure epoch implicitly opened as part of the window creation
  process?

* Initialize the inter-process (shared memory) mutex

  Mutexes are required in order to ensure that accumulate operations on
  any given element (basic datatype) in the local window are atomic.

  NOTE: multiple mutexes may be needed if the local window is broken
  into multiple regions.  For details, see the discussion in
  MPID_shm_Accumulate().

------------------------------------------------------------------------

MPID_shm_Win_fence

* If the shared memory is not cache coherent, flush the cache and/or
  write buffer as necessary

  If the shared memory is not cache coherent and stores were performed
  to the local window, then (depending on the architecture specifics and
  the RMA implementation) we might need to perform the following
  operations:

  1) if the system is using a write-back caching strategy, flush the
     cache

  2) flush the write buffer

  NOTE: It may be possible to defer these operations when NOSUCCEED is
  also supplied.  It is currently unclear whether this would be
  beneficial.

* Barrier

  We need a barrier to ensure that all remote puts and local stores to
  the local window have completed so the results are available to
  operations performed after the fence operation.  We also need to
  ensure that any remote gets and local loads from the local window are
  complete before any future remote puts or local stores are allowed to
  affect the local window.

* If the shared memory is not cache coherent

  * Invalidate the cache

    If the shared memory is not cache coherent and RMA puts were
    performed to the local window, then (depending on the architecture
    specifics and the RMA implementation) we might need to invalidate
    any cache lines associated with the shared memory bound to this
    window.

  * Set (or clear) the preceding put flag based on the assertions

  NOTE: To reduce unnecessary cache and write buffer flushes, the
  barrier (above) could be replaced with an all-to-all gather of the
  operations occurring between node pairs.  Using this information, we
  could eliminate flushes except when an operation actually affected the
  local window.

------------------------------------------------------------------------

MPID_shm_Get

* Copy data directly from the target buffer (located in shared memory)
  to the origin buffer.  A sketch of this direct copy follows below.
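  A minimal sketch of the direct-copy get, assuming the contiguous,
  basic-datatype case described in the assumptions.  The MPID_shm_Win
  layout shown here (base[] and disp_unit[] per rank) is hypothetical
  and does not reflect the real window object; only MPI_Type_size() and
  memcpy() are real calls.

    #include <string.h>
    #include <mpi.h>

    /* Hypothetical window descriptor: base[i] is the start of rank i's
       local window in shared memory, disp_unit[i] is its displacement
       unit. */
    typedef struct MPID_shm_Win {
        void **base;
        int   *disp_unit;
    } MPID_shm_Win;

    int MPID_shm_Get(void *origin_addr, int origin_count,
                     MPI_Datatype origin_datatype,
                     int target_rank, MPI_Aint target_disp,
                     int target_count, MPI_Datatype target_datatype,
                     MPID_shm_Win *win)
    {
        int type_size;
        char *target_addr;

        /* With basic datatypes the origin signature matches the target
           signature, so only the target side is used here. */
        (void) origin_count; (void) origin_datatype;

        /* Basic datatypes only, so both buffers are contiguous and the
           element size fully describes the transfer. */
        MPI_Type_size(target_datatype, &type_size);

        /* The target's local window is directly addressable because it
           lives in shared memory. */
        target_addr = (char *) win->base[target_rank]
                      + (MPI_Aint) win->disp_unit[target_rank] * target_disp;

        /* Copy directly from the target buffer (shared memory) to the
           origin buffer. */
        memcpy(origin_addr, target_addr, (size_t) type_size * target_count);

        return MPI_SUCCESS;
    }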
------------------------------------------------------------------------

MPID_shm_Put

* Copy data directly from the origin buffer to the target buffer
  (located in shared memory).

------------------------------------------------------------------------

MPID_shm_Accumulate

* Lock target local window

  The standard says that operations on elements (basic datatypes) need
  to be atomic, but the entire accumulate operation need not be atomic
  with respect to other accumulate operations.  The simple solution is
  to lock the whole window when performing an operation; however, this
  serializes the operations, which will seriously hurt performance when
  multiple processes/threads are attempting to accumulate data into a
  single window (or even a single large buffer in that window).  A
  sketch of this simple whole-window-lock approach appears after this
  section.

  TODO: Develop an algorithm for performing the operations when the
  local window is broken into multiple regions, with a mutex per region.
  Care must be taken to ensure that if an element spans two regions,
  then the mutexes for both regions are locked before the operation is
  performed on that element.  Performing these lock operations is likely
  to be somewhat expensive, so we will want a tunable parameter for
  specifying the minimum size of a region.

  Q: Do inter-process mutexes also ensure mutual exclusion for threads
  within the same process?  If not, then we need to acquire both a
  thread lock and a process lock.  We probably want to acquire the
  thread lock first to minimize contention at the process lock.

* Perform requested accumulation

  We need an algorithm for performing accumulations when the datatypes
  are non-contiguous.  Ideally, the two dataloops and the accumulation
  operations could be processed without requiring any extra copying,
  packing, or temporary buffers.

  NOTE: While it may be possible to write a function to perform the
  requested operations, it is likely that such functionality will need
  to be inlined so that appropriate locking of local window regions
  occurs as data is being processed.  Also, the dataloops will need to
  be optimized so that it is not necessary to acquire a region's mutex
  more than once per request.

* Unlock target local window

------------------------------------------------------------------------
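A minimal sketch of the simple "lock the whole window" accumulate
discussed above, assuming MPI_SUM on MPI_INT, matching origin/target
signatures, and a single process-shared pthread mutex stored with the
window (initialized with PTHREAD_PROCESS_SHARED during window creation).
The MPID_shm_Win layout is hypothetical; per-element atomicity follows
simply from holding the window mutex for the duration of the loop, at
the cost of serializing all accumulates into the window.

    #include <pthread.h>
    #include <mpi.h>

    /* Hypothetical window descriptor; mutex lives in shared memory and
       is assumed to be initialized as process-shared in
       MPID_shm_Win_create. */
    typedef struct MPID_shm_Win {
        void            **base;       /* base[i]: rank i's local window */
        int              *disp_unit;  /* displacement unit per rank */
        pthread_mutex_t  *mutex;      /* inter-process mutex */
    } MPID_shm_Win;

    int MPID_shm_Accumulate(const void *origin_addr, int origin_count,
                            MPI_Datatype origin_datatype,
                            int target_rank, MPI_Aint target_disp,
                            int target_count, MPI_Datatype target_datatype,
                            MPI_Op op, MPID_shm_Win *win)
    {
        /* Sketch handles only the MPI_SUM / MPI_INT case with matching
           origin and target signatures. */
        const int *src = (const int *) origin_addr;
        int *dst = (int *) ((char *) win->base[target_rank]
                   + (MPI_Aint) win->disp_unit[target_rank] * target_disp);
        int i;

        (void) origin_count; (void) origin_datatype;
        (void) target_datatype; (void) op;

        /* Lock target local window: serializes all accumulates into
           this window, which trivially guarantees per-element atomicity
           (see the region-based TODO above for a finer-grained plan). */
        pthread_mutex_lock(win->mutex);

        for (i = 0; i < target_count; i++)
            dst[i] += src[i];

        /* Unlock target local window. */
        pthread_mutex_unlock(win->mutex);

        return MPI_SUCCESS;
    }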