/* THESE NOTES ARE OLD AND MAY NOT REFLECT THE CURRENT IMPLEMENTATION. */

Main features
-------------

* Requests for one-sided operations are inserted into the same queue as
  send-recv operations. The progress function MPID_Make_progress is
  aware of, and makes progress on, one-sided as well as send-recv
  operations. (Main rationale: (1) why not? (2) see the first example
  below.)

* An asynchronous agent kicks in periodically. The agent could be a
  thread, a process, or a signal handler. The agent calls the general
  progress function MPID_Make_progress and therefore makes progress on
  everything. (Q: If the agent is a process, how does it access the
  request queue on the main MPI process?) If it is a process, it must be
  able to access the request queues as well as the memory buffers. The
  queues can be kept in shared memory, but the user buffers for active
  target RMA cannot. Therefore, the request object must carry a flag
  indicating whether everything it references is in shared memory. If
  so, the agent can process it; otherwise it has to skip it.

* Whether an epoch has been started on the target is checked at the
  target. The origin adds "requests" to the queue at the target in a
  nonblocking fashion. For methods that support direct one-sided
  communication, there must also be a way for the origin to determine
  whether the epoch has begun on the target, so that it can access the
  remote buffer directly. It checks this by doing a remote read of the
  variable that indicates the status of the epoch. What about lock
  checks? Tricky. We either need atomic remote test-and-set operations
  or an agent on the remote side to grant locks.

* Each window object maintains the state of the lock on the local
  window, that is, whether the window has been locked by any other
  process for passive target communication. The state has three values:

      no lock
      (shared lock, ranks)    - ranks of the processes in the window
                                group that hold a shared lock
      (exclusive lock, rank)  - rank of the process in the window group
                                that holds the exclusive lock

  MPI_Win_lock and the puts/gets that follow it are nonblocking. They
  are all sent to the request queue on the target. The lock is resolved
  at the target, and the operations from the process that has been
  granted the lock are processed. Requests from other processes simply
  remain in the queue until that process is granted the lock. (A sketch
  of such a lock-state structure follows this list.)

* Datatype packing, shipping, and caching is a separate topic to be
  addressed later. For now we assume basic datatypes.

* Lock-put-unlock and fence-put-fence sequences need to be efficient:
  we should send one message instead of three.
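As a concrete illustration of the three-valued lock state above, here is
a minimal sketch of a per-window lock-state structure and the rule for
granting locks. The type and function names (MPIDI_Win_lock_state,
lock_can_be_granted, and so on) are hypothetical, not part of any
existing MPID interface.

    #include <mpi.h>    /* for MPI_LOCK_SHARED / MPI_LOCK_EXCLUSIVE */

    /* Hypothetical sketch; names are illustrative only. */
    typedef enum {
        MPIDI_LOCK_NONE,        /* no lock held on the local window    */
        MPIDI_LOCK_SHARED,      /* one or more shared locks held       */
        MPIDI_LOCK_EXCLUSIVE    /* a single exclusive lock held        */
    } MPIDI_Lock_kind;

    typedef struct {
        MPIDI_Lock_kind kind;
        int             excl_rank;    /* rank (in the window group) that
                                         holds the exclusive lock      */
        int             nshared;      /* number of shared-lock holders */
        int            *shared_ranks; /* their ranks in the win group  */
    } MPIDI_Win_lock_state;

    /* A shared lock can be granted as long as no exclusive lock is
       held; an exclusive lock only when no lock at all is held. */
    static int lock_can_be_granted(const MPIDI_Win_lock_state *s,
                                   int lock_type)
    {
        if (s->kind == MPIDI_LOCK_NONE)
            return 1;
        if (s->kind == MPIDI_LOCK_SHARED && lock_type == MPI_LOCK_SHARED)
            return 1;
        return 0;
    }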
Rambling Notes
--------------

Active target operations can be performed without an asynchronous agent
at the target, because there will always be a synchronization call
(MPI_Win_fence, MPI_Win_wait) at the target at which progress can be
made. It is not sufficient, however, to make progress only at
synchronization calls, because the following code is valid (see
Fig. 6.8 on pg. 142 of the standard):

    Process 0          Process 1
    ---------          ---------
    start              post
    put
    complete           recv
    send               wait

The recv must cause progress on the put; otherwise the code will
deadlock: process 0 cannot reach the send until complete returns, and
complete cannot return if process 1, sitting in the recv, never makes
progress on the put. If the above code is valid, then one can also
replace both the send and the recv with MPI_Barrier, and the resulting
code must also not deadlock.

Passive target operations, on the other hand, do require an asynchronous
agent at the target, because there is no guarantee that an MPI function
will otherwise be called at the target. (The NEC implementation, as
reported in their SC00 paper, relies on an MPI call being made at the
target, which is a mistake. They took that shortcut because theirs is a
single-threaded implementation.)

The asynchronous agent can cause progress (if we so choose) on active
target RMA operations and even on send-recv operations. In fact, it is
not clear that the BSP model of disjoint computation and communication
phases is best at all times. That topic is itself a subject for a paper.

It is unclear whether the asynchronous agent should be a thread, a
process, or a signal handler invoked in response to SIGIO or SIGALRM.
There are cases where each would be better than the other two. So
instead of selecting one method, we should support all three and focus
on specifying the "function" that gets called when the asynchronous
agent is invoked. In other words, what would such a function do? Can it
be just the usual "make progress"?

A put or get cannot access remote memory unless the exposure epoch has
started on the remote side. We need to be careful about this in the
shared-memory case, where the remote memory resides in shared memory:
we cannot write to it directly without checking whether the epoch has
begun on the remote side.

MPID_Win_create()
{
    All-to-all communication to exchange each process's window size and
    disp_unit. We also need to communicate whether any of the windows is
    in memory allocated with MPI_Alloc_mem. All of this is needed for
    later error checking. If the method supports direct one-sided
    operations, we also need each process's window base address.

    Note that with 8,000 processes, the window object on each process
    will have 8,000 entries containing all this info. Not scalable.
    Instead, we could avoid the all-to-all and request the info from
    the target at the first RMA call. But if we simply let the target
    discover an error later on, error reporting is hard, and the
    "Advice to Implementors" says that a high-quality implementation
    will check at put/get time whether the target memory address is
    correct.

    Dup the communicator.
}

MPID_Put()
{
    Check the correctness of the target address.
    Check the field of the local window object that records whether an
    epoch-starting function was previously called; if not, raise an
    error.
    Fill in a request object.
    Add the request to the request queue for the corresponding method.
        /* NEED API */
    Add the request to a queue of outstanding requests for this epoch;
    this is needed to check completion at the next synchronization call.
}

Communication is not much different from regular message passing. On the
remote side, an unexpected message is an active target communication for
which a fence or post has not yet been called on that window. In that
case the message has to be buffered if it is short; if it is long, the
handshake should be delayed until the fence or post is called.

MPID_Win_fence()
{
    If there is no preceding fence on this window, or if
    MPI_MODE_NOPRECEDE is asserted,
        call MPID_Make_progress()
    else {
        /* this fence completes an epoch */
        All-to-all communication to inform each process how many RMA
        calls were made to it in this epoch.
            /* can we do with a barrier instead of an all-to-all? */
        Call MPID_Make_progress() until all RMA operations have
        completed.
    }
}
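Regarding the comment inside MPID_Win_fence() above: a barrier alone is
not enough, because each target must also learn how many RMA operations
to wait for, not just that everyone has reached the fence. A
reduce-scatter over per-target counts provides both in one collective.
A minimal sketch, assuming each process keeps a (hypothetical) array
rma_counts[] of the number of RMA calls it issued to each target during
the epoch:

    #include <mpi.h>
    #include <stdlib.h>

    /* Returns the total number of RMA operations that targeted this
       process in the closing epoch, i.e., how many operations the
       fence must progress before it can return.  rma_counts[i] is the
       number of RMA calls this process made to rank i. */
    static int exchange_rma_counts(const int *rma_counts,
                                   MPI_Comm win_comm)
    {
        int i, nprocs, my_incoming = 0;
        int *recvcounts;

        MPI_Comm_size(win_comm, &nprocs);
        recvcounts = (int *) malloc(nprocs * sizeof(int));
        for (i = 0; i < nprocs; i++)
            recvcounts[i] = 1;          /* one int per process */

        /* Element-wise sum across all processes; element i of the
           result is delivered to rank i. */
        MPI_Reduce_scatter(rma_counts, &my_incoming, recvcounts,
                           MPI_INT, MPI_SUM, win_comm);
        free(recvcounts);
        return my_incoming;
    }

This also doubles as the synchronization: no process can leave the
reduce-scatter before every other process has contributed its counts.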
MPID_Win_post()
{
    Mark the local window to indicate that post has been called, and
    record the group of processes for which RMA has been enabled. No
    communication is needed here.
}

MPID_Win_start()
{
    Mark the local window to indicate that start has been called, and
    record the group of processes for which RMA has been enabled. No
    communication is needed here.
}

MPID_Win_complete()
{
    Effectively an MPI_Wait for all RMA calls this process has issued
    since the last MPID_Win_start. Need to communicate with the
    MPI_Win_waits on the other processes to inform them how many RMA
    calls have taken place.
}

MPID_Win_wait()
{
    Effectively an MPI_Wait for all RMA calls that have taken place on
    this window as target since the last MPID_Win_post(). This process
    does not know how many have happened; it must be told by the origin
    processes when they reach complete(). Call MPID_Make_progress()
    until all RMA operations are processed. This function cannot return
    until all of them have completed.
}

MPID_Win_test()
{
    The nonblocking version of MPID_Win_wait().
}

We implement MPI_Win_lock as follows. A lock is sent as a lock "request"
to the target process; no response is needed. The lock request is added
to the request queue on the target, and lock requests from multiple
processes may be queued. When the progress engine kicks in and
encounters a lock request in the queue, it checks whether the requested
lock can be granted (i.e., whether any other process holds a conflicting
lock on the window). If the lock can be granted, the lock field in the
window is updated and the request is deleted from the queue. No
acknowledgement is sent to the requester, because the lock is
nonblocking and all locks must eventually be granted.

RMA requests from other processes are added to the request queue on the
target regardless of who holds the lock. When the progress engine kicks
in, it processes the requests in order. For each request it checks the
lock status of the window. If the window is not locked, or if the
requesting process holds a shared or exclusive lock, it processes the
request just as it would process a send-recv (for accumulate, it also
performs the arithmetic operation). If the window is locked by some
other process, it simply moves on to the next request in the queue.
Eventually the window will be unlocked in response to an unlock request,
and on the next pass through the request queue some other process will
get access to the window. (A sketch of this scan appears at the end of
these notes.)

Since the lock is only a lock on the window at the target, there is no
question of deadlock or starvation: the target grants locks in the order
it encounters them in the queue. Even a barrier between the lock and the
unlock is not a problem, because processes do not block on the lock.

MPID_Win_lock()
{
    if it is a lock on a remote window
        Fill in a lock request object.
        Add the request to the request queue for the corresponding
        method (i.e., send it to the target). No reply is expected.
        This function must not block.
    else if it is a lock on the local window
        Atomically check whether the lock can be acquired, and acquire
        it if so. If it cannot, call MPID_Make_progress() and try
        again. Repeat until the lock is acquired.
    end if
}

MPID_Win_unlock()
{
    if it is an unlock on a remote window
        Call MPID_Make_progress() until all RMA operations on this
        window for this epoch have completed (at least locally).
        Fill in an unlock request object.
        Add the request to the request queue for the corresponding
        method (i.e., send it to the target).
    else if it is an unlock on the local window
        Change the lock variable on the window to unlocked.
    end if
}

New Request types
-----------------
Put
Get
Accumulate
Lock
Unlock
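To make the queue-scanning rules above concrete, here is a minimal
sketch of one pass of the progress engine over the target-side request
queue. Everything here is hypothetical and simplified: shared locks are
reduced to a single lock holder, and process_rma_request() and
dequeue() are assumed helpers, not an existing MPID API.

    #include <stddef.h>

    typedef enum { REQ_PUT, REQ_GET, REQ_ACCUMULATE,
                   REQ_LOCK, REQ_UNLOCK } req_kind_t;

    enum { LOCK_NONE = -1 };    /* no process holds the window lock */

    typedef struct window {
        int lock_holder;        /* rank holding the lock, or LOCK_NONE;
                                   shared locks omitted for brevity   */
    } window_t;

    typedef struct request {
        req_kind_t      kind;
        int             origin_rank;  /* rank in the window group     */
        window_t       *win;
        struct request *next;
    } request_t;

    /* Assumed helpers (not shown): process_rma_request() performs the
       data transfer much like a send-recv (for accumulate it also
       applies the reduction op); dequeue() unlinks a request. */
    extern void process_rma_request(request_t *r);
    extern void dequeue(request_t **queue, request_t *r);

    void progress_scan(request_t **queue)
    {
        request_t *r = *queue;
        while (r != NULL) {
            request_t *next = r->next;
            window_t  *w = r->win;
            switch (r->kind) {
            case REQ_LOCK:
                /* Grant only if no one holds the lock.  No ack is
                   sent: the origin's MPI_Win_lock is nonblocking. */
                if (w->lock_holder == LOCK_NONE) {
                    w->lock_holder = r->origin_rank;
                    dequeue(queue, r);
                }
                break;
            case REQ_UNLOCK:
                w->lock_holder = LOCK_NONE;
                dequeue(queue, r);
                break;
            default:
                /* Put/get/accumulate: process only if the window is
                   unlocked or locked by the requester itself;
                   otherwise leave it queued for a later pass. */
                if (w->lock_holder == LOCK_NONE ||
                    w->lock_holder == r->origin_rank) {
                    process_rma_request(r);
                    dequeue(queue, r);
                }
                break;
            }
            r = next;
        }
    }

Note that a skipped request stays in place, so requests from a given
origin are still handled in the order they arrived.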