Important issues with passive target accumulate

* Simple accumulations should combine the lock/unlock with the operation when latency is high (see the first sketch following these lists).

* Separate accumulations into non-overlapping buffers (within the same local window) should be processed with as much parallelism as possible. For single-threaded MPI runtime systems, it is reasonable to restrict such optimizations to accumulations issued from separate processes, since multiple accumulations from a single process are likely to be serialized by the networking device anyway. The trick to parallelizing multiple requests is identifying that their target buffers do not overlap. Detecting this seems extremely difficult; however, an assertion at the time the lock is acquired could be used to inform the runtime system that no overlapping accumulations will be issued by other processes during this epoch (the first sketch below illustrates such an assertion).

  * Two-sided
    Unless multiple threads (and processors) are present on the target system to process accumulation operations, serialization is inevitable, and the best optimization is to keep multi-threading machinery off the critical path, where it would only slow things down. This suggests that two implementations may be necessary: one that uses threads and one that does not. If multiple threads and processors are available on the target system, then separate threads could be used to process accumulate requests from different processes. Multiple threads might also help overlap communication and computation for a single request, if that request is large enough.

  * Remote memory
    The same issues that apply to the two-sided case apply here. The only real difference is that we might use get/put to fill buffers instead of send/recv.

  * Shared memory

  * Multi-method

* Large accumulations into the same buffer should be pipelined to achieve maximum parallelism. If it were possible to detect that the same target buffer was being used by a set of processes, and that no other overlapping target buffers were simultaneously being operated upon, then mutexes could be associated with areas of progress within that buffer rather than with particular regions of the target's local window. Defining the areas of progress may happen naturally as a result of limited buffer space necessitating the use of segments. Atomicity would then need to be guaranteed for the operations associated with a particular segment of the buffer rather than for a particular region of the local window. In practice, detecting that the conditions for this form of atomicity have been met may prove too difficult to obtain reasonable performance.

  * Two-sided
    If multiple threads are available to process multiple data streams, then it should be possible to pipeline the processing. It is critical, however, that the data be organized and sent in such a way as to maximize pipelining (see the second sketch following these lists).

  * Remote memory

  * Shared memory
    For shared memory, this can be accomplished by dividing local windows into regions and providing a separate mutex for each region (see the third sketch following these lists).

  * Multi-method
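First sketch: the origin-side pattern that the first two items describe, namely a passive-target epoch containing a single accumulate that a latency-sensitive runtime could bundle into one network transaction. MPI_MODE_NOCHECK is the only standard lock assertion; MPIX_MODE_NO_OVERLAP is a hypothetical assertion (not in the MPI standard) of the kind proposed above, shown only in a comment.

    #include <mpi.h>

    /* One-operation passive-target epoch.  When latency is high, the
     * runtime can transmit the lock request, the accumulate, and the
     * unlock as a single message.  MPIX_MODE_NO_OVERLAP is HYPOTHETICAL:
     * it would promise that no other process issues overlapping
     * accumulations during this epoch, allowing the target to process
     * non-overlapping accumulations in parallel. */
    void simple_accumulate(MPI_Win win, int target, MPI_Aint disp,
                           const double *local, int count)
    {
        MPI_Win_lock(MPI_LOCK_SHARED, target,
                     MPI_MODE_NOCHECK /* | MPIX_MODE_NO_OVERLAP */, win);
        MPI_Accumulate(local, count, MPI_DOUBLE,
                       target, disp, count, MPI_DOUBLE, MPI_SUM, win);
        MPI_Win_unlock(target, win);
    }

Because the epoch contains exactly one operation, nothing is lost by deferring the lock request until the accumulate is issued; the three calls collapse into one request/response exchange.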
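Second sketch: one way the two-sided case might pipeline a single large accumulate stream, double-buffering segments so the next receive overlaps the reduction of the previous one. The segment size, tag, and protocol (ordered same-source, same-tag messages) are illustrative assumptions, not part of any existing implementation.

    #include <mpi.h>

    #define SEG_DOUBLES 8192   /* assumed segment size */
    #define ACC_TAG     77     /* assumed protocol tag  */

    /* Target-side pipeline: receive segment s+1 while reducing segment s.
     * MPI's ordering guarantee for a given (source, tag, comm) ensures
     * the two posted receives match arriving segments in order. */
    void pipelined_accumulate_target(double *window, int origin, int nsegs,
                                     MPI_Comm comm)
    {
        static double seg[2][SEG_DOUBLES];
        MPI_Request req[2];
        MPI_Irecv(seg[0], SEG_DOUBLES, MPI_DOUBLE, origin, ACC_TAG,
                  comm, &req[0]);                 /* prime the pipeline */
        for (int s = 0; s < nsegs; s++) {
            int cur = s & 1, nxt = cur ^ 1;
            if (s + 1 < nsegs)                    /* overlap: post next recv */
                MPI_Irecv(seg[nxt], SEG_DOUBLES, MPI_DOUBLE, origin,
                          ACC_TAG, comm, &req[nxt]);
            MPI_Wait(&req[cur], MPI_STATUS_IGNORE);
            for (int i = 0; i < SEG_DOUBLES; i++) /* reduce into window */
                window[s * SEG_DOUBLES + i] += seg[cur][i];
        }
    }

This is the sense in which the data must be "organized and sent so as to maximize pipelining": the origin must emit fixed-size segments so the target can keep a receive posted at all times.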
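Third sketch: for the shared-memory case, a hypothetical per-region mutex scheme, so accumulations into disjoint regions of the local window proceed in parallel while operations on a segment remain atomic. The region size and all names here are illustrative assumptions.

    #include <pthread.h>
    #include <stdlib.h>

    #define REGION_SIZE 4096   /* bytes per region; an assumed tuning knob */

    typedef struct {
        char            *base;     /* base address of the local window */
        size_t           nregions;
        pthread_mutex_t *locks;    /* one mutex per region */
    } win_regions_t;

    static void win_regions_init(win_regions_t *w, char *base, size_t bytes)
    {
        w->base     = base;
        w->nregions = (bytes + REGION_SIZE - 1) / REGION_SIZE;
        w->locks    = malloc(w->nregions * sizeof(pthread_mutex_t));
        for (size_t i = 0; i < w->nregions; i++)
            pthread_mutex_init(&w->locks[i], NULL);
    }

    /* Sum 'count' doubles into the window at byte offset 'disp', locking
     * only the regions touched, in ascending order to avoid deadlock.
     * Assumes 'disp' is suitably aligned for double. */
    static void region_accumulate(win_regions_t *w, size_t disp,
                                  const double *src, size_t count)
    {
        size_t first = disp / REGION_SIZE;
        size_t last  = (disp + count * sizeof(double) - 1) / REGION_SIZE;
        for (size_t r = first; r <= last; r++)
            pthread_mutex_lock(&w->locks[r]);
        double *dst = (double *)(w->base + disp);
        for (size_t i = 0; i < count; i++)
            dst[i] += src[i];
        for (size_t r = first; r <= last; r++)
            pthread_mutex_unlock(&w->locks[r]);
    }

Choosing REGION_SIZE trades lock overhead against parallelism: regions the size of the pipeline segments would make the segment the natural unit of atomicity, as suggested above.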
* Datatype caching
  [traff00:mpi-impl] and [booth00:mpi-impl] discuss the need for datatype caching by the target process when the target process must be involved in the RMA operations (i.e., when the data cannot be directly read and interpreted by the origin process). Datatype caching can be either proactive or reactive. In the proactive case, the origin process would track whether the target process has a copy of the datatype and would send the datatype to the target process when necessary. This means that each datatype must carry tracking information. Unfortunately, because of dynamic processes in MPI-2, something more complex than a simple bit vector must be used to track the processes already caching a datatype. What is the correct, high-performance structure? Alternatively, a reactive approach could be used: the origin process assumes that the target process already knows about the datatype, and if that assumption is false, the target process requests the datatype from the origin process. This simplifies the tracking on the origin but does not completely eliminate it: the origin must hold a reference count on the datatype until the next synchronization point, to ensure that the datatype is not deleted before the target process has had sufficient opportunity to request a copy of it. (A sketch of the reactive lookup follows this list.)

  * Two-sided

  * Remote memory

  * Shared memory
    For shared memory, datatype caching is unnecessary if the origin process performs the work. If the target is involved in the work, the necessary datatype information can be placed in shared memory.

  * Multi-method
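A sketch of the target-side lookup in the reactive approach described above. The table shape, eviction policy, and every name here are hypothetical; the point is only that a miss triggers a "please send the datatype" control message back to the origin, which is safe because the origin holds a reference on the datatype until the next synchronization point.

    #include <stddef.h>

    #define CACHE_SLOTS 256    /* assumed table size */

    typedef struct {
        int   origin;   /* origin process; for MPI-2 dynamic processes this
                         * would need a globally unique id, not just a rank */
        long  type_id;  /* the origin's identifier for the datatype */
        void *flat;     /* flattened datatype description; NULL = empty */
    } dtype_entry_t;

    static dtype_entry_t cache[CACHE_SLOTS];

    static size_t slot_of(int origin, long type_id)
    {
        return ((size_t)origin * 31u + (size_t)type_id) % CACHE_SLOTS;
    }

    /* Returns the cached flattened datatype, or NULL, in which case the
     * caller sends a request-datatype control message to the origin. */
    void *dtype_cache_lookup(int origin, long type_id)
    {
        dtype_entry_t *e = &cache[slot_of(origin, type_id)];
        if (e->flat && e->origin == origin && e->type_id == type_id)
            return e->flat;    /* hit: reuse the cached description */
        return NULL;           /* miss: fall back to the request protocol */
    }

    void dtype_cache_insert(int origin, long type_id, void *flat)
    {
        dtype_entry_t *e = &cache[slot_of(origin, type_id)];
        e->origin  = origin;   /* evict on collision: simplest policy */
        e->type_id = type_id;
        e->flat    = flat;
    }

Note that the open question above (the high-performance tracking structure for the proactive case) is deliberately left unanswered here; the reactive scheme sidesteps it at the cost of one extra round trip on a miss.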