MVAPICH2 Changelog ------------------ This file briefly describes the changes to the MVAPICH2 software package. The logs are arranged in the "most recent first" order. MVAPICH2 2.3.7 (03/02/2022) * Features and Enhancements (since 2.3.6): - Added support for systems with Rockport's switchless networks * Added automatic architecture detection * Optimized performance for point-to-point operations - Added support for the Cray Slingshot 10 interconnect - Enhanced support for blocking collective offload using Mellanox SHARP * Scatter and Scatterv - Enhanced support for non-blocking collective offload using Mellanox SHARP * Iallreduce, Ibarrier, Ibcast, and Ireduce * Bug Fixes (since 2.3.6): - Removed several deprecated functions - Thanks to Honggang Li @RedHat for the report - Fixed a bug where tools like CMake FindMPI would not detect MVAPICH when compiled without Hydra mpiexec - Thanks to Chris Chambreau and Adam Moody @LLNL for the report - Fixed compilation error when building with mpirun and without hydra - Thanks to James Long @University of Illinois for the report - Fixed issue with setting RoCE mode correctly without RDMA_CM. 
- Thanks to Nicolas Gagnon @Rockport Networks for the report - Fixed an issue on heterogeneous clusters where QP attributes were set incorrectly - Thanks to X-ScaleSolutions for the report and fix - Fixed a memory leak in improbe on the PSM channel - Thanks to Gregory Lee @LLNL and Beichuan Yan @University of Colorado for the report - Added retry logic for PSM connection establishment - Thanks to Gregory Lee @LLNL for the report and X-ScaleSolutions for the patch - Fixed an initialization error when using PSM and gcc's -pg option - Thanks to Gregory Lee @LLNL for the report and X-ScaleSolutions for the patch - Fixed a potential integer overflow when transferring large arrays - Thanks to Alexander Melnikov for the report and patch MVAPICH2 2.3.6 (05/11/2021) * Features and Enhancements (since 2.3.5): - Support collective offload using Mellanox's SHARP for Reduce and Bcast - Enhanced tuning framework for Reduce and Bcast using SHARP - Enhanced performance for UD-Hybrid code - Add multi-rail support for UD-Hybrid code - Enhanced performance for shared-memory collectives - Enhanced job-startup performance for flux job launcher - Add support in mpirun_rsh to use srun daemons to launch jobs - Add support in mpirun_rsh to specify processes per node using '-ppn' option - Use PMI2 by default when SLURM is selected as process manager - Add support to use aligned memory allocations for multi-threaded applications - Thanks to Evan J. Danish @OSC for the report - Architecture detection and enhanced point-to-point tuning for Oracle BM.HPC2 cloud shape - Enhanced collective tuning for Frontera@TACC and Expanse@SDSC - Add support for GCC compiler v11 - Add support for Intel IFX compiler - Update hwloc v1 code to v1.11.14 - Update hwloc v2 code to v2.4.2 * Bug Fixes (since 2.3.5): - Updates to IME support in MVAPICH2 - Thanks to Bernd Schubert and Jean-Yves Vet @DDN for the patch - Improve error reporting in dlopen code path - Thanks to Matthew W. 
Anderson @INL for the report - Fix memory leak in collectives code path - Thanks to Matthew W. Anderson @INL and the PETSc team for the report and patch - Fix issues in DPM code - Thanks to Lana Deere @D2S Inc for the report - Fix issues when using sys_siglist array - Thanks to Jorge D'Elia @Universidad Nacional Del Litoral in Santa Fe, Argentina for the report - Fix issues with GCC v11 - Thanks to Honggang Li @RedHat for the report - Fix issues in Win_shared_alloc - Thanks to Adam Moody @LLNL for the report - Fix issues with HDF5 in ROMIO code - Thanks to Mark Dixon @Durham University for the report - Fix issues with srun based launch when SLURM hostfile is specified manually - Thanks to Greg Lee @LLNL for the report - Fix an issue with external32 datatypes being converted incorrectly - Thanks to Adam Moody @LLNL for the report - Fix issues in UD-Hybrid code path - Fix issues in MPI_Win_test leading to hangs in multi-rail scenarios - Fix issues in job startup code leading to degraded startup performance - Update code to work with any number of HCAs in a graceful fashion - Fix hang in shared memory code with stencil applications - Fix segmentation fault in finalize - Fix compilation warnings, memory leaks, and spelling mistakes MVAPICH2 2.3.5 (11/30/2020) * Features and Enhancements (since 2.3.4): - Enhanced performance for MPI_Allreduce and MPI_Barrier - Support collective offload using Mellanox's SHARP for Barrier - Enhanced tuning framework for Barrier using SHARP - Remove dependency on underlying libibverbs, libibmad, libibumad, and librdmacm libraries using dlopen - Add support for Broadcom NetXtreme RoCE HCA - Enhanced inter-node point-to-point support - Support architecture detection for Fujitsu A64fx processor - Enhanced point-to-point and collective tuning for Fujitsu A64fx processor - Enhanced point-to-point and collective tuning for AMD ROME processor - Add support for process placement aware HCA selection - Add "MV2_PROCESS_PLACEMENT_AWARE_HCA_MAPPING" 
environment variable to enable process placement aware HCA mapping - Add support to select HWLOC v1 and HWLOC v2 at configure time - Select using configure time flag --with-hwloc=version - Takes options of v1 (default) and v2 - Add support to auto-detect RoCE HCAs and auto-detect GID index - Add support to use RoCE/Ethernet and InfiniBand HCAs at the same time - Add architecture-specific flags to improve performance of certain CUDA operations - Thanks to Chris Chambreau @LLNL for the report - Read MTU and maximum outstanding RDMA operations from the device - Improved performance and scalability for UD-based communication - Update maximum HCAs supported by default from 4 to 10 - Enhanced collective tuning for Frontera@TACC, Expanse@SDSC, Ookami@StonyBrook, and bb5@EPFL - Enhanced support for SHARP v2.1.0 - Generalize code for GPU support - Update hwloc v2 code to v2.3.0 * Bug Fixes (since 2.3.4): - Fix issue with mpiexec+PBS when calling MPI_Abort - Thanks to Matthew W. Anderson @INL for the report and initial patch - Fix validation failure with multi-threaded applications when InfiniBand registration cache is enabled. 
- Thanks to Alexander Melnikov for the report and initial patch - Fix issue with realloc when InfiniBand registration cache is enabled - Thanks to Si Lu @TACC and Viet-Duc Le @KISTI for reporting the issue - Fix out of tree builds for ROMIO - Thanks to Per Berg @Defense Center for Operative Oceanography, Denmark for the report - Fix integer overflow errors in the collective code path - Thanks to Kiran Ravikumar @GaTech for the report and reproducer - Fix issue with Hybrid+Spread mapping on hyper-threaded systems - Fix out-of-memory issue when allocating CUDA events - Fix issue with large message UD transfers where packets were incorrectly marked as dropped/missing - Fix spelling mistakes - Thanks to Jens.Schleusener @fossies.org for the report - Revert changes which caused dependencies on lex/yacc at configure time - Thanks to Daniel Pou @HPE for the report - Fix issues with UD Zcopy data transfers - Fix issues with handling datatypes in the collective code - Revert moving -lmpi, -lmpicxx, and -lmpifort before other LDFLAGS in compiler wrappers like mpicc, mpicxx, mpif77, and mpif90 - This was causing issues with certain legacy applications - Thanks to Nicolas Morey-Chaisemartin @SUSE for the report - Fix compilation warnings and memory leaks MVAPICH2 2.3.4 (06/01/2020) * Features and Enhancements (since 2.3.3): - Improved performance for small message collective operations - Improved performance for data transfers from/to non-contiguous buffers used by user-defined datatypes - Add custom API to identify if MVAPICH2 has in-built CUDA support - New API 'MPIX_Query_cuda_support' defined in mpi-ext.h - New macro 'MPIX_CUDA_AWARE_SUPPORT' defined in mpi-ext.h - Add support for MPI_REAL16 based reduction operations for Fortran programs - MPI_SUM, MPI_MAX, MPI_MIN, MPI_LAND, MPI_LOR, MPI_MINLOC, and MPI_MAXLOC - Thanks to Greg Lee@LLNL for the report and reproducer - Thanks to Hui Zhou@ANL for the initial patch - Add support to intercept aligned_alloc in ptmalloc - 
Thanks to Ye Luo @ANL for the report and the reproducer - Add support to enable fork safety in MVAPICH2 using environment variable "MV2_SUPPORT_FORK_SAFETY" - Add support for user to modify QKEY using environment variable "MV2_DEFAULT_QKEY" - Add multiple MPI_T PVARs and CVARs for point-to-point and collective operations - Enhanced point-to-point and collective tuning for AMD EPYC Rome, Frontera@TACC, Longhorn@TACC, Mayer@Sandia, Pitzer@OSC, Catalyst@EPCC, Summit@ORNL, Lassen@LLNL, and Sierra@LLNL systems - Give preference to CMA if LiMIC2 and CMA are enabled at the same time - Move -lmpi, -lmpicxx, and -lmpifort before other LDFLAGS in compiler wrappers like mpicc, mpicxx, mpif77, and mpif90 - Allow passing flags to nvcc compiler through environment variable NVCCFLAGS - Display more meaningful error messages for InfiniBand asynchronous events - Add support for AMD Optimizing C/C++ (AOCC) compiler v2.1.0 - Add support for GCC compiler v10.1.0 - Requires setting FFLAGS=-fallow-argument-mismatch at configure time - Update to hwloc v2.2.0 * Bug Fixes (since 2.3.3): - Fix compilation issue with IBM XLC++ compilers and CUDA 10.2 - Fix hangs with MPI_Get operations in UD-Hybrid mode - Initialize MPI3 data structures correctly to avoid random hangs caused by garbage values - Fix corner case with LiMIC2 and MPI3 one-sided operations - Add proper fallback and warning message when shared RMA window cannot be created - Fix race condition in calling mv2_get_path_rec_sl by introducing mutex - Thanks to Alexander Melnikov for reporting the issue and providing the patch - Fix mapping generation for the cases where hwloc returns zero on non-numa machines - Thanks to Honggang Li @Red Hat for the report and initial patch - Fix issues with InfiniBand registration cache and PGI20 compiler - Fix warnings raised by Coverity scans - Thanks to Honggang Li @Red Hat for the report - Fix bad baseptr address returned from MPI_Win_shared_query - Thanks to Adam Moody@LLNL for the report and 
discussion - Fix issues with HCA selection logic in heterogeneous multi-rail scenarios - Fix spelling mistake in error message - Thanks to Bill Long and Krishna Kandalla @Cray/HPE for the report - Fix compilation warnings and memory leaks MVAPICH2 2.3.3 (01/09/2020) * Features and Enhancements (since 2.3.2): - Enhanced performance for intra-node collective operations - Add support for PMIx protocol for SLURM and JSM process managers - Add support for RDMA_CM based multicast group creation - Enhance point-to-point and collective tunings for Fulhame@EPCC, Catalyst@ARM, Mayer@Sandia, and Frontera@TACC - Update default cache line size on x86_64 platforms to 64 bytes - Enhance spread mapping to use even distribution of ranks - Add multiple MPI_T PVARs and CVARs for point-to-point and collective operations - Add support for sub-communicator level MPI_T PVARs - Added architecture detection support for Marvell QEDR RoCE HCA - Add runtime parameter 'MV2_SUPPRESS_HCA_WARNINGS' to suppress HCA warnings - Update to hwloc 1.11.13 * Bug Fixes (since 2.3.2): - Fix error in reading MV2_NUM_SA_QUERY_RETRIES - Thanks to Alexander Melnikov for reporting the issue and providing the patch - Fix build issues with ch3:sock - Thanks to Georg Geiser for reporting the issue - Fix issue in invalid memory reference when accessing comm_ptr during MPID_Win_free - Thanks to Karl Schulz for reporting the issue - Fix build issues with CLANG compilers on POWER9 - Fix clang optimizing out valloc and calloc calls - Fix compilation warnings and memory leaks MVAPICH2 2.3.2 (08/09/2019) * Features and Enhancements (since 2.3.1): - Improved performance for inter-node communication - Improved performance for Gather, Reduce, and Allreduce with cyclic hostfile - Thanks to X-ScaleSolutions for the patch - Improved performance for intra-node point-to-point communication - Add support for Mellanox HDR adapters - Add support for Cascade lake systems - Add support for Microsoft Azure platform - Add support for 
new NUMA-aware hybrid binding policy - Add support for AMD EPYC Rome architecture - Improved multi-rail selection logic - Enhanced heterogeneity detection logic - Enhanced point-to-point and collective tuning for AMD EPYC Rome, Frontera@TACC, Mayer@Sandia, Pitzer@OSC, Summit@ORNL, Lassen@LLNL, and Sierra@LLNL systems - Enhanced point-to-point and collective tuning for Microsoft Azure - Enhance output of MV2_SHOW_CPU_BINDING to include binding policy - Add multiple PVARs and CVARs for point-to-point and collective operations * Bug Fixes (since 2.3.1): - Fix issue with support for DDN Infinite Memory Engine (IME) - Thanks to Judit Planas @EPFL for reporting the issue - Fix issue when compiling with PGI 19.x - Thanks to Timothy S. Carlson @PNNL for reporting the issue - Fix issue with InfiniBand build when ib_uverbs module is not loaded - Thanks to Nicolas Morey-Chaisemartin @SUSE for reporting and providing the patch - Fix issues with DPM support - Thanks to Kenneth McElvain@UC Berkeley for reporting the issues - Fix issue with handling datatype based collectives - Fix hang in Get accumulate - Fix to honor scheduler/administrator reservations for CPU binding - Fix issue with CPU binding for non-power-of-two processes - Fix HCA detection logic to select correct tuning tables for single node scenarios - Fix segfault when freeing removed duplicate communicator - Fix issues in handling very large messages with CMA - Fix issue with very large message point-to-point communication - Fix issue with registration cache on large number of nodes - Fix compilation warnings and memory leaks MVAPICH2 2.3.1 (03/01/2019) * Features and Enhancements (since 2.3): - Add support for JSM and Flux resource managers - Architecture detection, enhanced point-to-point and collective tuning for AMD EPYC system - Enhanced point-to-point and collective tuning for IBM POWER9 and ARM systems - Add support of DDN Infinite Memory Engine (IME) to ROMIO - Thanks to Sylvain Didelot @DDN for the patch - 
Optimize performance of MPI_Wait operation - Update to hwloc 1.11.11 * Bug Fixes (since 2.3): - Fix autogen error with Flang compiler on ARM systems - Thanks to Nathan Sircombe @ARM for the patch - Fix issues with shmem collectives on ARM architecture - Thanks to Pavel Shamis @ARM for the patch - Fix issues with MPI-3 shared memory windows for PSM-CH3 and PSM2-CH3 channel - Thanks to Adam Moody @LLNL for the report - Fix segfault in MPI_Reduce - Thanks to Samuel Khuvis @OSC for the report - Fix compilation issues with IBM XLC compiler - Thanks to Ken Raffenetti and Yanfei Guo @ANL for the patch - Fix issues with MPI_Mprobe/Improbe and MPI_Mrecv/Imrecv for PSM-CH3 and PSM2-CH3 channel - Thanks to Adam Moody @LLNL for the report - Fix compilation issues with PGI compilers for CUDA-enabled builds - Fix potential hangs in MPI_Finalize - Fix issues in handling very large messages with RGET protocol - Fix issues with handling GPU buffers - Fix issue with hardware multicast based Allreduce - Fix build issue with TCP/IP-CH3 channel - Fix memory leaks exposed by TotalView - Thanks to Adam Moody @LLNL for the report - Fix issues with cleaning up temporary files generated in CUDA builds - Fix compilation warnings MVAPICH2 2.3 (07/23/2018) * Features and Enhancements (since 2.3rc2): - Add point-to-point and collective tuning for IBM POWER9 CPUs - Enhanced collective tuning for IBM POWER8, Intel Skylake, Intel KNL, Intel Broadwell architectures * Bug Fixes (since 2.3rc2): - Fix issues in CH3-TCP/IP channel - Fix build and runtime issues with CUDA support - Fix error when XRC and RoCE were enabled at the same time - Fix issue with XRC connection establishment - Fix for failure at finalize seen on iWARP enabled devices - Fix issue with MPI_IN_PLACE-based communication in MPI_Reduce and MPI_Reduce_scatter - Fix issue with allocating large number of shared memory based MPI3-RMA windows - Fix failure in mpirun_rsh with large number of nodes - Fix singleton initialization issue with 
SLURM/PMI2 and PSM/Omni-Path - Thanks to Adam Moody @LLNL for the report - Fix build failure when enabling GPFS support in ROMIO - Thanks to Doug Johnson @OHTech for the report - Fix issues with architecture detection in PSM-CH3 and PSM2-CH3 channels - Fix failures with CMA read at very large message sizes - Fix failures with MV2_SHOW_HCA_BINDING on single-node jobs - Fix compilation warnings and memory leaks MVAPICH2 2.3rc2 (04/30/2018) * Features and Enhancements (since 2.3rc1): - Based on MPICH v3.2.1 - Enhanced small message performance for MPI_Alltoallv - Improve performance for host-based transfers when CUDA is enabled - Add architecture detection for IBM POWER9 CPUs - Enhance architecture detection for Intel Skylake CPUs - Enhance MPI initialization to gracefully handle RDMA_CM failures - Improve algorithm selection of several collectives - Enhance detection of number and IP addresses of IB devices - Tested with CLANG v5.0.0 * Bug Fixes (since 2.3rc1): - Fix issue in autogen step with duplicate error messages - Fix issue with XRC connection establishment - Fix build issue with SLES 15 and Perl 5.26.1 - Thanks to Matias A Cabral @Intel for the report and patch - Fix segfault when manually selecting collective algorithms - Fix cleanup of preallocated RDMA_FP regions at RDMA_CM finalize - Fix compilation warnings and memory leaks MVAPICH2 2.3rc1 (02/19/2018) * Features and Enhancements (since 2.3b): - Enhanced performance for Allreduce, Reduce_scatter_block, Allgather, Allgatherv through new algorithms - Thanks to Danielle Sikich and Adam Moody @ LLNL for the patch - Enhance support for MPI_T PVARs and CVARs - Improved job startup time for OFA-IB-CH3, PSM-CH3, and PSM2-CH3 - Support to automatically detect IP address of IB/RoCE interfaces when RDMA_CM is enabled without relying on mv2.conf file - Enhance HCA detection to handle cases where node has both IB and RoCE HCAs - Automatically detect and use maximum supported MTU by the HCA - Added logic to detect 
heterogeneous CPU/HFI configurations in PSM-CH3 and PSM2-CH3 channels - Thanks to Matias Cabral@Intel for the report - Enhanced intra-node and inter-node tuning for PSM-CH3 and PSM2-CH3 channels - Enhanced HFI selection logic for systems with multiple Omni-Path HFIs - Enhanced tuning and architecture detection for OpenPOWER, Intel Skylake and Cavium ARM (ThunderX) systems - Added 'SPREAD', 'BUNCH', and 'SCATTER' binding options for hybrid CPU binding policy - Rename MV2_THREADS_BINDING_POLICY to MV2_HYBRID_BINDING_POLICY - Added support for MV2_SHOW_CPU_BINDING to display number of OMP threads - Update to hwloc version 1.11.9 * Bug Fixes (since 2.3b): - Fix issue with RDMA_CM in multi-rail scenario - Fix issues in nullpscw RMA test. - Fix issue with reduce and allreduce algorithms for large message sizes - Fix hang issue in hydra when no SLURM environment is present - Thanks to Vaibhav Sundriyal for the report - Fix issue to test Fortran KIND with FFLAGS - Thanks to Rob Latham@mcs.anl.gov for the patch - Fix issue in parsing environment variables - Fix issue in displaying process to HCA binding - Enhance CPU binding logic to handle vendor specific core mappings - Fix compilation warnings and memory leaks MVAPICH2 2.3b (08/10/2017) * Features and Enhancements (since 2.3a): - Enhance performance of point-to-point operations for CH3-Gen2 (InfiniBand), CH3-PSM, and CH3-PSM2 (Omni-Path) channels - Improve performance for MPI-3 RMA operations - Introduce support for Cavium ARM (ThunderX) systems - Improve support for process to core mapping on many-core systems - New environment variable MV2_THREADS_BINDING_POLICY for multi-threaded MPI and MPI+OpenMP applications - Support `linear' and `compact' placement of threads - Warn user if oversubscription of core is detected - Improve launch time for large-scale jobs with mpirun_rsh - Add support for non-blocking Allreduce using Mellanox SHARP - Efficient support for different Intel Knight's Landing (KNL) models - Improve 
performance for Intra- and Inter-node communication for OpenPOWER architecture - Improve support for large processes per node and hugepages on SMP systems - Enhance collective tuning for Intel Knight's Landing and Intel Omni-Path based systems - Enhance collective tuning for Bebop@ANL, Bridges@PSC, and Stampede2@TACC systems - Enhance large message intra-node performance with CH3-IB-Gen2 channel on Intel Knight's Landing - Enhance support for MPI_T PVARs and CVARs * Bug Fixes (since 2.3a): - Fix issue with bcast algorithm selection - Fix issue with large message transfers using CMA - Fix issue in Scatter and Gather with large messages - Fix tuning tables for various collectives - Fix issue with launching single-process MPI jobs - Fix compilation error in the CH3-TCP/IP channel - Thanks to Isaac Carroll@Lightfleet for the patch - Fix issue with memory barrier instructions on ARM - Thanks to Pavel (Pasha) Shamis@ARM for reporting the issue - Fix compilation warnings and memory leaks MVAPICH2 2.3a (03/29/2017) * Features and Enhancements (since 2.2): - Based on and ABI compatible with MPICH 3.2 - Support collective offload using Mellanox's SHArP for Allreduce - Enhance tuning framework for Allreduce using SHArP - Introduce capability to run MPI jobs across multiple InfiniBand subnets - Introduce basic support for executing MPI jobs in Singularity - Enhance collective tuning for Intel Knight's Landing and Intel Omni-path - Enhance process mapping support for multi-threaded MPI applications - Introduce MV2_CPU_BINDING_POLICY=hybrid - Introduce MV2_THREADS_PER_PROCESS - On-demand connection management for PSM-CH3 and PSM2-CH3 channels - Enhance PSM-CH3 and PSM2-CH3 job startup to use non-blocking PMI calls - Enhance debugging support for PSM-CH3 and PSM2-CH3 channels - Improve performance of architecture detection - Introduce run time parameter MV2_SHOW_HCA_BINDING to show process to HCA bindings - Enhance MV2_SHOW_CPU_BINDING to enable display of CPU bindings on all 
nodes - Deprecate OFA-IB-Nemesis channel - Update to hwloc version 1.11.6 * Bug Fixes (since 2.2): - Fix issue with ring startup in multi-rail systems - Fix startup issue with SLURM and PMI-1 - Thanks to Manuel Rodriguez for the report - Fix startup issue caused by fix for bash `shellshock' bug - Fix issue with very large messages in PSM - Fix issue with singleton jobs and PMI-2 - Thanks to Adam T. Moody@LLNL for the report - Fix incorrect reporting of non-existing files with Lustre ADIO - Thanks to Wei Kang@NWU for the report - Fix hang in MPI_Probe - Thanks to John Westlund@Intel for the report - Fix issue while setting affinity with Torque Cgroups - Thanks to Doug Johnson@OSC for the report - Fix runtime errors observed when running MVAPICH2 on aarch64 platforms - Thanks to Sreenidhi Bharathkar Ramesh@Broadcom for posting the original patch - Thanks to Michal Schmidt@RedHat for reposting it - Fix failure in mv2_show_cpu_affinity with affinity disabled - Thanks to Carlos Rosales-Fernandez@TACC for the report - Fix mpirun_rsh error when running short-lived non-MPI jobs - Thanks to Kevin Manalo@OSC for the report - Fix comment and spelling mistake - Thanks to Maksym Planeta for the report - Ignore cpusets and cgroups that may have been set by resource manager - Thanks to Adam T. Moody@LLNL for the report and the patch - Fix reduce tuning table entry for 2ppn 2node - Fix compilation issues due to inline keyword with GCC 5 and newer - Fix compilation warnings and memory leaks MVAPICH2 2.2 (09/07/2016) * Features and Enhancements (since 2.2rc2): - Single node collective tuning for Bridges@PSC, Stampede@TACC and other architectures - Enable PSM builds when both PSM and PSM2 libraries are present - Thanks to Adam T. 
Moody@LLNL for the report and patch - Add support for HCAs that return result of atomics in big endian notation - Establish loopback connections by default if HCA supports atomics * Bug Fixes (since 2.2rc2): - Fix minor error in use of communicator object in collectives - Fix missing u_int64_t declaration with PGI compilers - Thanks to Adam T. Moody@LLNL for the report and patch - Fix memory leak in RMA rendezvous code path - Thanks to Min Si@ANL for the report and patch MVAPICH2 2.2rc2 (08/08/2016) * Features and Enhancements (since 2.2rc1): - Enhanced performance for MPI_Comm_split through new bitonic algorithm - Thanks to Adam T. Moody@LLNL for the patch - Enable graceful fallback to Shared Memory if LiMIC2 or CMA transfer fails - Enable support for multiple MPI initializations - Unify process affinity support in Gen2, PSM and PSM2 channels - Remove verbs dependency when building the PSM and PSM2 channels - Allow processes to request MPI_THREAD_MULTIPLE when socket or NUMA node level affinity is specified - Point-to-point and collective performance optimization for Intel Knights Landing - Automatic detection and tuning for InfiniBand EDR HCAs - Warn user to reconfigure library if rank type is not large enough to represent all ranks in job - Collective tuning for Opal@LLNL, Bridges@PSC, and Stampede-1.5@TACC - Tuning and architecture detection for Intel Broadwell processors - Add ability to avoid using --enable-new-dtags with ld - Thanks to Adam T. Moody@LLNL for the suggestion - Add LIBTVMPICH specific CFLAGS and LDFLAGS - Thanks to Adam T. Moody@LLNL for the suggestion * Bug Fixes (since 2.2rc1): - Disable optimization that removes use of calloc in ptmalloc hook detection code - Thanks to Karl W. Schulz@Intel - Fix weak alias typos (allows successful compilation with CLANG compiler) - Thanks to Min Dong@Old Dominion University for the patch - Fix issues in PSM large message gather operations - Thanks to Adam T. 
Moody@LLNL for the report - Enhance error checking in collective tuning code - Thanks to Jan Bierbaum@Technical University of Dresden for the patch - Fix issues with UD based communication in RoCE mode - Fix issues with PMI2 support in singleton mode - Fix default binding bug in hydra launcher - Fix issues with Checkpoint Restart when launched with mpirun_rsh - Fix fortran binding issues with Intel 2016 compilers - Fix issues with socket/NUMA node level binding - Disable atomics when using Connect-IB with RDMA_CM - Fix hang in MPI_Finalize when using hybrid channel - Fix memory leaks MVAPICH2 2.2rc1 (03/29/2016) * Features and Enhancements (since 2.2b): - Support for OpenPower architecture - Optimized inter-node and intra-node communication - Support for Intel Omni-Path architecture - Thanks to Intel for contributing the patch - Introduction of a new PSM2 channel for Omni-Path - Support for RoCEv2 - Architecture detection for PSC Bridges system with Omni-Path - Enhanced startup performance and reduced memory footprint for storing InfiniBand end-point information with SLURM - Support for shared memory based PMI operations - Availability of an updated patch from the MVAPICH project website with this support for SLURM installations - Optimized pt-to-pt and collective tuning for Chameleon InfiniBand systems at TACC/UoC - Enable affinity by default for TrueScale(PSM) and Omni-Path(PSM2) channels - Enhanced tuning for shared-memory based MPI_Bcast - Enhanced debugging support and error messages - Update to hwloc version 1.11.2 * Bug Fixes (since 2.2b): - Fix issue in some of the internal algorithms used for MPI_Bcast, MPI_Alltoall and MPI_Reduce - Fix hang in one of the internal algorithms used for MPI_Scatter - Thanks to Ivan Raikov@Stanford for reporting this issue - Fix issue with rdma_connect operation - Fix issue with Dynamic Process Management feature - Fix issue with de-allocating InfiniBand resources in blocking mode - Fix build errors caused due to improper 
compile time guards - Thanks to Adam Moody@LLNL for the report - Fix finalize hang when running in hybrid or UD-only mode - Thanks to Jerome Vienne@TACC for reporting this issue - Fix issue in MPI_Win_flush operation - Thanks to Nenad Vukicevic for reporting this issue - Fix out of memory issues with non-blocking collectives code - Thanks to Phanisri Pradeep Pratapa and Fang Liu@GaTech for reporting this issue - Fix fall-through bug in external32 pack - Thanks to Adam Moody@LLNL for the report and patch - Fix issue with on-demand connection establishment and blocking mode - Thanks to Maksym Planeta@TU Dresden for the report - Fix memory leaks in hardware multicast based broadcast code - Fix memory leaks in TrueScale(PSM) channel - Fix compilation warnings MVAPICH2 2.2b (11/12/2015) * Features and Enhancements (since 2.2a): - Enhanced performance for small messages - Enhanced startup performance with SLURM - Support for PMIX_Iallgather and PMIX_Ifence - Support to enable affinity with asynchronous progress thread - Enhanced support for MPIT based performance variables - Tuned VBUF size for performance - Improved startup performance for QLogic PSM-CH3 channel - Thanks to Maksym Planeta@TU Dresden for the patch * Bug Fixes (since 2.2a): - Fix issue with MPI_Get_count in QLogic PSM-CH3 channel with very large messages (>2GB) - Fix issues with shared memory collectives and checkpoint-restart - Fix hang with checkpoint-restart - Fix issue with unlinking shared memory files - Fix memory leak with MPIT - Fix minor typos and usage of inline and static keywords - Thanks to Maksym Planeta@TU Dresden for the patch and suggestions - Fix missing MPIDI_FUNC_EXIT - Thanks to Maksym Planeta@TU Dresden for the patch - Remove unused code - Thanks to Maksym Planeta@TU Dresden for the patch - Continue with warning if user asks to enable XRC when the system does not support XRC MVAPICH2 2.2a (08/17/2015) * Features and Enhancements (since 2.1 GA): - Based on MPICH 3.1.4 - Support for 
backing on-demand UD CM information with shared memory for minimizing memory footprint - Reorganized HCA-aware process mapping - Dynamic identification of maximum read/atomic operations supported by HCA - Enabling support for intra-node communications in RoCE mode without shared memory - Updated to hwloc 1.11.0 - Updated to sm_20 kernel optimizations for MPI Datatypes - Automatic detection and tuning for 24-core Haswell architecture * Bug Fixes (since 2.1 GA): - Fix for error with multi-vbuf design for GPU based communication - Fix bugs with hybrid UD/RC/XRC communications - Fix for MPICH putfence/getfence for large messages - Fix for error in collective tuning framework - Fix validation failure with Alltoall with IN_PLACE option - Thanks to Mahidhar Tatineni @SDSC for the report - Fix bug with MPI_Reduce with IN_PLACE option - Thanks to Markus Geimer for the report - Fix for compilation failures with multicast disabled - Thanks to Devesh Sharma @Emulex for the report - Fix bug with MPI_Bcast - Fix IPC selection for shared GPU mode systems - Fix for build time warnings and memory leaks - Fix issues with Dynamic Process Management - Thanks to Neil Spruit for the report - Fix bug in architecture detection code - Thanks to Adam Moody @LLNL for the report MVAPICH2-2.1 (04/03/2015) * Features and Enhancements (since 2.1rc2): - Tuning for EDR adapters - Optimization of collectives for SDSC Comet system * Bug-Fixes (since 2.1rc2): - Relocate reading environment variables in PSM - Thanks to Adam Moody@LLNL for the suggestion - Fix issue with automatic process mapping - Fix issue with checkpoint restart when full path is not given - Fix issue with Dynamic Process Management - Fix issue in CUDA IPC code path - Fix corner case in CMA runtime detection MVAPICH2-2.1rc2 (03/12/2015) * Features and Enhancements (since 2.1rc1): - Based on MPICH-3.1.4 - Enhanced startup performance with mpirun_rsh - Checkpoint-Restart Support with DMTCP (Distributed MultiThreaded CheckPointing) - 
Thanks to the DMTCP project team (http://dmtcp.sourceforge.net/) - Support for handling very large messages in RMA - Optimize size of buffer requested for control messages in large message transfer - Enhanced automatic detection of atomic support - Optimized collectives (bcast, reduce, and allreduce) for 4K processes - Introduce support to sleep for user specified period before aborting - Thanks to Adam Moody@LLNL for the suggestion - Disable PSM from setting CPU affinity - Thanks to Adam Moody@LLNL for providing the patch - Install PSM error handler to print more verbose error messages - Thanks to Adam Moody@LLNL for providing the patch - Introduce retry mechanism to perform psm_ep_open in PSM channel - Thanks to Adam Moody@LLNL for providing the patch * Bug-Fixes (since 2.1rc1): - Fix failures with shared memory collectives with checkpoint-restart - Fix failures with checkpoint-restart when using internal communication buffers of different size - Fix undeclared variable error when --disable-cxx is specified with configure - Thanks to Chris Green from FANL for the patch - Fix segfault seen during connect/accept with dynamic processes - Thanks to Neil Spruit for the fix - Fix errors with large messages pack/unpack operations in PSM channel - Fix for bcast collective tuning - Fix assertion errors in one-sided put operations in PSM channel - Fix issue with code getting stuck in infinite loop inside ptmalloc - Thanks to Adam Moody@LLNL for the suggested changes - Fix assertion error in shared memory large message transfers - Thanks to Adam Moody@LLNL for reporting the issue - Fix compilation warnings MVAPICH2-2.1rc1 (12/18/2014) * Features and Enhancements (since 2.1a): - Based on MPICH-3.1.3 - Flexibility to use internal communication buffers of different size for improved performance and memory footprint - Improve communication performance by removing locks from critical path - Enhanced communication performance for small/medium message sizes - Support for linking 
Intel Trace Analyzer and Collector - Increase the number of connect retry attempts with RDMA_CM - Automatic detection and tuning for Haswell architecture * Bug-Fixes (since 2.1a): - Fix automatic detection of support for atomics - Fix issue with void pointer arithmetic with PGI - Fix deadlock in ctxidup MPICH test in PSM channel - Fix compile warnings MVAPICH2-2.1a (09/21/2014) * Features and Enhancements (since 2.0): - Based on MPICH-3.1.2 - Support for PMI-2 based startup with SLURM - Enhanced startup performance for Gen2/UD-Hybrid channel - GPU support for MPI_Scan and MPI_Exscan collective operations - Optimize creation of 2-level communicator - Collective optimization for PSM-CH3 channel - Tuning for IvyBridge architecture - Add -export-all option to mpirun_rsh - Support for additional MPI-T performance variables (PVARs) in the CH3 channel - Link with libstdc++ when building with GPU support (required by CUDA 6.5) * Bug-Fixes (since 2.0): - Fix error in large message (>2GB) transfers in CMA code path - Fix memory leaks in OFA-IB-CH3 and OFA-IB-Nemesis channels - Fix issues with optimizations for broadcast and reduce collectives - Fix hang at finalize with Gen2-Hybrid/UD channel - Fix issues for collectives with non power-of-two process counts - Thanks to Evren Yurtesen for identifying the issue - Make ring startup use HCA selected by user - Increase counter length for shared-memory collectives MVAPICH2-2.0 (06/20/2014) * Features and Enhancements (since 2.0rc2): - Consider CMA in collective tuning framework * Bug-Fixes (since 2.0rc2): - Fix bug when disabling registration cache - Fix shared memory window bug when shared memory collectives are disabled - Fix mpirun_rsh bug when running mpmd programs with no arguments MVAPICH2-2.0rc2 (05/25/2014) * Features and Enhancements (since 2.0rc1): - CMA support is now enabled by default - Optimization of collectives with CMA support - RMA optimizations for shared memory and atomic operations - Tuning RGET and Atomics 
operations - Tuning RDMA FP-based communication - MPI-T support for additional performance and control variables - The --enable-mpit-pvars=yes configuration option will now enable only MVAPICH2 specific variables - Large message transfer support for PSM interface - Optimization of collectives for PSM interface - Updated to hwloc v1.9 * Bug-Fixes (since 2.0rc1): - Fix multicast hang when there is a single process on one node and more than one process on other nodes - Fix non-power-of-two usage of scatter-doubling-allgather algorithm - Fix for bcastzero type hang during finalize - Enhanced handling of failures in RDMA_CM based connection establishment - Fix for a hang in finalize when using RDMA_CM - Finish receive request when RDMA READ completes in RGET protocol - Always use direct RDMA when flush is used - Fix compilation error with --enable-g=all in PSM interface - Fix warnings and memory leaks MVAPICH2-2.0rc1 (03/24/2014) * Features and Enhancements (since 2.0b): - Based on MPICH-3.1 - Enhanced direct RDMA based designs for MPI_Put and MPI_Get operations in OFA-IB-CH3 channel - Optimized communication when using MPI_Win_allocate for OFA-IB-CH3 channel - MPI-3 RMA support for CH3-PSM channel - Multi-rail support for UD-Hybrid channel - Optimized and tuned blocking and non-blocking collectives for OFA-IB-CH3, OFA-IB-Nemesis, and CH3-PSM channels - Improved hierarchical job startup performance - Optimized sub-array data-type processing for GPU-to-GPU communication - Tuning for Mellanox Connect-IB adapters - Updated hwloc to version 1.8 - Added options to specify CUDA library paths - Deprecation of uDAPL-CH3 channel * Bug-Fixes (since 2.0b): - Fix issues related to MPI-3 RMA locks - Fix an issue related to MPI-3 dynamic window - Fix issues related to MPI_Win_allocate backed by shared memory - Fix issues related to large message transfers for OFA-IB-CH3 and OFA-IB-Nemesis channels - Fix warning in job launch, when using DPM - Fix an issue related to MPI atomic 
operations on HCAs without atomics support - Fixed an issue related to selection of compiler. (We prefer the GNU, Intel, PGI, and Ekopath compilers in that order). - Thanks to Uday R Bondhugula from IISc for the report - Fix an issue in message coalescing - Prevent printing out inter-node runtime parameters for pure intra-node runs - Thanks to Jerome Vienne from TACC for the report - Fix an issue related to ordering of messages for GPU-to-GPU transfers - Fix a few memory leaks and warnings MVAPICH2-2.0b (11/08/2013) * Features and Enhancements (since 2.0a): - Based on MPICH-3.1b1 - Multi-rail support for GPU communication - Non-blocking streams in asynchronous CUDA transfers for better overlap - Initialize GPU resources only when used by MPI transfer - Extended support for MPI-3 RMA in OFA-IB-CH3, OFA-IWARP-CH3, and OFA-RoCE-CH3 - Additional MPIT counters and performance variables - Updated compiler wrappers to remove application dependency on network and other extra libraries - Thanks to Adam Moody from LLNL for the suggestion - Capability to checkpoint CH3 channel using the Hydra process manager - Optimized support for broadcast, reduce and other collectives - Tuning for IvyBridge architecture - Improved launch time for large-scale mpirun_rsh jobs - Introduced retry mechanism in mpirun_rsh for socket binding - Updated hwloc to version 1.7.2 * Bug-Fixes (since 2.0a): - Consider list provided by MV2_IBA_HCA when scanning device list - Fix issues in Nemesis interface with --with-ch3-rank-bits=32 - Better cleanup of XRC files in corner cases - Initialize using better defaults for ibv_modify_qp (initial ring) - Add unconditional check and addition of pthread library - MPI_Get_library_version updated with proper MVAPICH2 branding - Thanks to Jerome Vienne from the TACC for the report MVAPICH2-2.0a (08/24/2013) * Features and Enhancements (since 1.9): - Based on MPICH-3.0.4 - Dynamic CUDA initialization. 
Support GPU device selection after MPI_Init - Support for running on heterogeneous clusters with GPU and non-GPU nodes - Supporting MPI-3 RMA atomic operations and flush operations with CH3-Gen2 interface - Exposing internal performance variables to MPI-3 Tools information interface (MPIT) - Enhanced MPI_Bcast performance - Enhanced performance for large message MPI_Scatter and MPI_Gather - Enhanced intra-node SMP performance - Tuned SMP eager threshold parameters - Reduced memory footprint - Improved job-startup performance - Warn and continue when ptmalloc fails to initialize - Enable hierarchical SSH-based startup with Checkpoint-Restart - Enable the use of Hydra launcher with Checkpoint-Restart * Bug-Fixes (since 1.9): - Fix data validation issue with MPI_Bcast - Thanks to Claudio J. Margulis from University of Iowa for the report - Fix buffer alignment for large message shared memory transfers - Fix a bug in One-Sided shared memory backed windows - Fix a flow-control bug in UD transport - Thanks to Benjamin M. 
Auer from NASA for the report - Fix bugs with MPI-3 RMA in Nemesis IB interface - Fix issue with very large message (>2GB) MPI_Bcast - Thanks to Lu Qiyue for the report - Handle case where $HOME is not set during search for MV2 user config file - Thanks to Adam Moody from LLNL for the patch - Fix a hang in connection setup with RDMA-CM MVAPICH2-1.9 (05/06/2013) * Features and Enhancements (since 1.9rc1): - Updated to hwloc v1.7 - Tuned Reduce, AllReduce, Scatter, Reduce-Scatter and Allgatherv Collectives * Bug-Fixes (since 1.9rc1): - Fix cuda context issue with async progress thread - Thanks to Osuna Escamilla Carlos from env.ethz.ch for the report - Overwrite pre-existing PSM environment variables - Thanks to Adam Moody from LLNL for the patch - Fix several warnings - Thanks to Adam Moody from LLNL for some of the patches MVAPICH2-1.9RC1 (04/16/2013) * Features and Enhancements (since 1.9b): - Based on MPICH-3.0.3 - Updated SCR to version 1.1.8 - Install utility scripts included with SCR - Support for automatic detection of path to utilities used by mpirun_rsh during configuration - Utilities supported: rsh, ssh, xterm, totalview - Support for launching jobs on heterogeneous networks with mpirun_rsh - Tuned Bcast, Reduce, Scatter Collectives - Tuned MPI performance on Kepler GPUs - Introduced MV2_RDMA_CM_CONF_FILE_PATH parameter which specifies path to mv2.conf * Bug-Fixes (since 1.9b): - Fix autoconf issue with LiMIC2 source-code - Thanks to Doug Johnson from OH-TECH for the report - Fix build errors with --enable-thread-cs=per-object and --enable-refcount=lock-free - Thanks to Marcin Zalewski from Indiana University for the report - Fix MPI_Scatter failure with MPI_IN_PLACE - Thanks to Mellanox for the report - Fix MPI_Scatter failure with cyclic host files - Fix deadlocks in PSM interface for multi-threaded jobs - Thanks to Marcin Zalewski from Indiana University for the report - Fix MPI_Bcast failures in SCALAPACK - Thanks to 
Jerome Vienne from TACC for the report - Fix build errors with newer Ekopath compiler - Fix a bug with shmem collectives in PSM interface - Fix memory corruption when more entries specified in mv2.conf than the requested number of rails - Thanks to Akihiro Nomura from Tokyo Institute of Technology for the report - Fix memory corruption with CR configuration in Nemesis interface MVAPICH2-1.9b (02/28/2013) * Features and Enhancements (since 1.9a2): - Based on MPICH-3.0.2 - Support for all MPI-3 features - Support for single copy intra-node communication using Linux supported CMA (Cross Memory Attach) - Provides flexibility for intra-node communication: shared memory, LiMIC2, and CMA - Checkpoint/Restart using LLNL's Scalable Checkpoint/Restart Library (SCR) - Support for application-level checkpointing - Support for hierarchical system-level checkpointing - Improved job startup time - Provided a new runtime variable MV2_HOMOGENEOUS_CLUSTER for optimized startup on homogeneous clusters - New version of LiMIC2 (v0.5.6) - Provides support for unlocked ioctl calls - Tuned Reduce, Allgather, Reduce_Scatter, Allgatherv collectives - Introduced option to export environment variables automatically with mpirun_rsh - Updated to HWLOC v1.6.1 - Provided option to use CUDA library call instead of CUDA driver to check buffer pointer type - Thanks to Christian Robert from Sandia for the suggestion - Improved debug messages and error reporting * Bug-Fixes (since 1.9a2): - Fix page fault with memory access violation with LiMIC2 exposed by newer Linux kernels - Thanks to Karl Schulz from TACC for the report - Fix a failure when lazy memory registration is disabled and CUDA is enabled - Thanks to Jens Glaser from University of Minnesota for the report - Fix an issue with variable initialization related to DPM support - Rename a few internal variables to avoid name conflicts with external applications - Thanks to Adam Moody from LLNL for the report - Check for libattr during 
configuration when Checkpoint/Restart and Process Migration are requested - Thanks to John Gilmore from Vastech for the report - Fix build issue with --disable-cxx - Set intra-node eager threshold correctly when configured with LiMIC2 - Fix an issue with MV2_DEFAULT_PKEY in partitioned InfiniBand network - Thanks to Jesper Larsen from FCOO for the report - Improve makefile rules to use automake macros - Thanks to Carmelo Ponti from CSCS for the report - Fix configure error with automake conditionals - Thanks to Evren Yurtesen from Abo Akademi for the report - Fix a few memory leaks and warnings - Properly cleanup shared memory files (used by XRC) when applications fail MVAPICH2-1.9a2 (11/08/2012) * Features and Enhancements (since 1.9a): - Based on MPICH2-1.5 - Initial support for MPI-3: (Available for all interfaces: OFA-IB-CH3, OFA-IWARP-CH3, OFA-RoCE-CH3, uDAPL-CH3, OFA-IB-Nemesis, PSM-CH3) - Nonblocking collective functions available as "MPIX_" functions (e.g., "MPIX_Ibcast") - Neighborhood collective routines available as "MPIX_" functions (e.g., "MPIX_Neighbor_allgather") - MPI_Comm_split_type function available as an "MPIX_" function - Support for MPIX_Type_create_hindexed_block - Nonblocking communicator duplication routine MPIX_Comm_idup (will only work for single-threaded programs) - MPIX_Comm_create_group support - Support for matched probe functionality (e.g., MPIX_Mprobe, MPIX_Improbe, MPIX_Mrecv, and MPIX_Imrecv), (Not Available for PSM) - Support for "Const" (disabled by default) - Efficient vector, hindexed datatype processing on GPU buffers - Tuned alltoall, Scatter and Allreduce collectives - Support for Mellanox Connect-IB HCA - Adaptive number of registration cache entries based on job size - Revamped Build system: - Uses automake instead of simplemake, - Allows for parallel builds ("make -j8" and similar) * Bug-Fixes (since 1.9a): - CPU frequency mismatch warning shown under debug - Fix issue with MPI_IN_PLACE buffers with CUDA - Fix ptmalloc 
initialization issue due to compiler optimization - Thanks to Kyle Sheumaker from ACT for the report - Adjustable MAX_NUM_PORTS at build time to support more than two ports - Fix issue with MPI_Allreduce with MPI_IN_PLACE send buffer - Fix memleak in MPI_Cancel with PSM interface - Thanks to Andrew Friedley from LLNL for the report MVAPICH2-1.9a (09/07/2012) * Features and Enhancements (since 1.8): - Support for InfiniBand hardware UD-multicast - UD-multicast-based designs for collectives (Bcast, Allreduce and Scatter) - Enhanced Bcast and Reduce collectives with pt-to-pt communication - LiMIC-based design for Gather collective - Improved performance for shared-memory-aware collectives - Improved intra-node communication performance with GPU buffers using pipelined design - Improved inter-node communication performance with GPU buffers with non-blocking CUDA copies - Improved small message communication performance with GPU buffers using CUDA IPC design - Improved automatic GPU device selection and CUDA context management - Optimal communication channel selection for different GPU communication modes (DD, DH and HD) in different configurations (intra-IOH and inter-IOH) - Removed libibumad dependency for building the library - Option for selecting non-default gid-index in a loss-less fabric setup in RoCE mode - Option to disable signal handler setup - Tuned thresholds for various architectures - Set DAPL-2.0 as the default version for the uDAPL interface - Updated to hwloc v1.5 - Option to use IP address as a fallback if hostname cannot be resolved - Improved error reporting * Bug-Fixes (since 1.8): - Fix issue in intra-node knomial bcast - Handle gethostbyname return values gracefully - Fix corner case issue in two-level gather code path - Fix bug in CUDA events/streams pool management - Fix ptmalloc initialization issue when MALLOC_CHECK_ is defined in the environment - Thanks to Mehmet Belgin from Georgia Institute of Technology for the report - Fix memory 
corruption and handle heterogeneous architectures in gather collective - Fix issue in detecting the correct HCA type - Fix issue in ring start-up to select correct HCA when MV2_IBA_HCA is specified - Fix SEGFAULT in MPI_Finalize when IB loop-back is used - Fix memory corruption on nodes with 64 cores - Thanks to M Xie for the report - Fix hang in MPI_Finalize with Nemesis interface when ptmalloc initialization fails - Thanks to Carson Holt from OICR for the report - Fix memory corruption in shared memory communication - Thanks to Craig Tierney from NOAA for the report and testing the patch - Fix issue in IB ring start-up selection with mpiexec.hydra - Fix issue in selecting CUDA run-time variables when running on single node in SMP only mode - Fix a few memory leaks and warnings MVAPICH2-1.8 (04/30/2012) * Features and Enhancements (since 1.8rc1): - Introduced a unified run time parameter MV2_USE_ONLY_UD to enable UD only mode - Enhanced designs for Alltoall and Allgather collective communication from GPU device buffers - Tuned collective communication from GPU device buffers - Tuned Gather collective - Introduced a run time parameter MV2_SHOW_CPU_BINDING to show current CPU bindings - Updated to hwloc v1.4.1 - Remove dependency on LEX and YACC * Bug-Fixes (since 1.8rc1): - Fix hang with multiple GPU configuration - Thanks to Jens Glaser from University of Minnesota for the report - Fix buffer alignment issues to improve intra-node performance - Fix a DPM multispawn behavior - Enhanced error reporting in DPM functionality - Quote environment variables in job startup to protect from shell - Fix hang when LIMIC is enabled - Fix hang in environments with heterogeneous HCAs - Fix issue when using multiple HCA ports in RDMA_CM mode - Thanks to Steve Wise from Open Grid Computing for the report - Fix hang during MPI_Finalize in Nemesis IB netmod - Fix for a start-up issue in Nemesis with heterogeneous architectures - Fix a few memory leaks and warnings MVAPICH2-1.8rc1 
(03/22/2012) * Features & Enhancements (since 1.8a2): - New design for intra-node communication from GPU Device buffers using CUDA IPC for better performance and correctness - Thanks to Joel Scherpelz from NVIDIA for his suggestions - Enabled shared memory communication for host transfers when CUDA is enabled - Optimized and tuned collectives for GPU device buffers - Enhanced pipelined inter-node device transfers - Enhanced shared memory design for GPU device transfers for large messages - Enhanced support for CPU binding with socket and numanode level granularity - Support suspend/resume functionality with mpirun_rsh - Exporting local rank, local size, global rank and global size through environment variables (both mpirun_rsh and hydra) - Update to hwloc v1.4 - Checkpoint-Restart support in OFA-IB-Nemesis interface - Enabling run-through stabilization support to handle process failures in OFA-IB-Nemesis interface - Enhancing OFA-IB-Nemesis interface to handle IB errors gracefully - Performance tuning on various architecture clusters - Support for Mellanox IB FDR adapter * Bug-Fixes (since 1.8a2): - Fix a hang issue on InfiniHost SDR/DDR cards - Thanks to Nirmal Seenu from Fermilab for the report - Fix an issue with runtime parameter MV2_USE_COALESCE usage - Fix an issue with LiMIC2 when CUDA is enabled - Fix an issue with intra-node communication using datatypes and GPU device buffers - Fix an issue with Dynamic Process Management when launching processes on multiple nodes - Thanks to Rutger Hofman from VU Amsterdam for the report - Fix build issue in hwloc source with mcmodel=medium flags - Thanks to Nirmal Seenu from Fermilab for the report - Fix a build issue in hwloc with --disable-shared or --disable-static options - Use portable stdout and stderr redirection - Thanks to Dr. Axel Philipp from MTU Aero Engines for the patch - Fix a build issue with PGI 12.2 - Thanks to Thomas Rothrock from U.S. 
Army SMDC for the patch - Fix an issue with send message queue in OFA-IB-Nemesis interface - Fix a process cleanup issue in Hydra when MPI_ABORT is called (upstream MPICH2 patch) - Fix an issue with non-contiguous datatypes in MPI_Gather - Fix a few memory leaks and warnings MVAPICH2-1.8a2 (02/02/2012) * Features and Enhancements (since 1.8a1p1): - Support for collective communication from GPU buffers - Non-contiguous datatype support in point-to-point and collective communication from GPU buffers - Efficient GPU-GPU transfers within a node using CUDA IPC (for CUDA 4.1) - Alternate synchronization mechanism using CUDA Events for pipelined device data transfers - Exporting processes local rank in a node through environment variable - Adjust shared-memory communication block size at runtime - Enable XRC by default at configure time - New shared memory design for enhanced intra-node small message performance - Tuned inter-node and intra-node performance on different cluster architectures - Update to hwloc v1.3.1 - Support for fallback to R3 rendezvous protocol if RGET fails - SLURM integration with mpiexec.mpirun_rsh to use SLURM allocated hosts without specifying a hostfile - Support added to automatically use PBS_NODEFILE in Torque and PBS environments - Enable signal-triggered (SIGUSR2) migration * Bug Fixes (since 1.8a1p1): - Set process affinity independently of SMP enable/disable to control the affinity in loopback mode - Report error and exit if user requests MV2_USE_CUDA=1 in non-cuda configuration - Fix for data validation error with GPU buffers - Updated WRAPPER_CPPFLAGS when using --with-cuda. 
Users should not have to explicitly specify CPPFLAGS or LDFLAGS to build applications - Fix for several compilation warnings - Report an error message if user requests MV2_USE_XRC=1 in non-XRC configuration - Remove debug prints in regular code path with MV2_USE_BLOCKING=1 - Thanks to Vaibhav Dutt for the report - Handling shared memory collective buffers in a dynamic manner to eliminate static setting of maximum CPU core count - Fix for validation issue in MPICH2 strided_get_indexed.c - Fix a bug in packetized transfers on heterogeneous clusters - Fix for deadlock between psm_ep_connect and PMGR_COLLECTIVE calls on QLogic systems - Thanks to Adam T. Moody for the patch - Fix a bug in MPI_Allocate_mem when it is called with size 0 - Thanks to Michele De Stefano for reporting this issue - Create vendor for Open64 compilers and add rpath for unknown compilers - Thanks to Martin Hilgemen from Dell Inc. for the initial patch - Fix issue due to overlapping buffers with sprintf - Thanks to Mark Debbage from QLogic for reporting this issue - Fall back to using GNU options for unknown f90 compilers - Fix hang in PMI_Barrier due to incorrect handling of the socket return values in mpirun_rsh - Unify the redundant FTB events used to initiate a migration - Fix memory leaks when mpirun_rsh reads hostfiles - Fix a bug where the library attempts to use an inactive rail in a multi-rail scenario MVAPICH2-1.8a1p1 (11/14/2011) * Bug Fixes (since 1.8a1) - Fix for a data validation issue in GPU transfers - Thanks to Massimiliano Fatica, NVIDIA, for reporting this issue - Tuned CUDA block size to 256K for better performance - Enhanced error checking for CUDA library calls - Fix for mpirun_rsh issue while launching applications on Linux Kernels (3.x) MVAPICH2-1.8a1 (11/09/2011) * Features and Enhancements (since 1.7): - Support for MPI communication from NVIDIA GPU device memory - High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU) - High 
performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU) - Communication with contiguous datatype - Reduced memory footprint of the library - Enhanced one-sided communication design with reduced memory requirement - Enhancements and tuned collectives (Bcast and Alltoallv) - Update to hwloc v1.3.0 - Flexible HCA selection with Nemesis interface - Thanks to Grigori Inozemtsev, Queens University - Support iWARP interoperability between Intel NE020 and Chelsio T4 Adapters - RoCE enable environment variable name is changed from MV2_USE_RDMAOE to MV2_USE_RoCE * Bug Fixes (since 1.7): - Fix for a bug in mpirun_rsh while doing process clean-up in abort and other error scenarios - Fixes for code compilation warnings - Fix for memory leaks in RDMA CM code path MVAPICH2-1.7 (10/14/2011) * Features and Enhancements (since 1.7rc2): - Support SHMEM collectives up to 64 cores/node - Update to hwloc v1.2.2 - Enhancement and tuned collective (GatherV) * Bug Fixes: - Fixes for code compilation warnings - Fix job clean-up issues with mpirun_rsh - Fix a hang with RDMA CM MVAPICH2-1.7rc2 (09/19/2011) * Features and Enhancements (since 1.7rc1): - Based on MPICH2-1.4.1p1 - Integrated Hybrid (UD-RC/XRC) design to get best performance on large-scale systems with reduced/constant memory footprint - Shared memory backed Windows for One-Sided Communication - Support for truly passive locking for intra-node RMA in shared memory and LIMIC based windows - Integrated with Portable Hardware Locality (hwloc v1.2.1) - Integrated with latest OSU Micro-Benchmarks (3.4) - Enhancements and tuned collectives (Allreduce and Allgatherv) - MPI_THREAD_SINGLE provided by default and MPI_THREAD_MULTIPLE as an option - Enabling Checkpoint/Restart support in pure SMP mode - Optimization for QDR cards - On-demand connection management support with IB CM (RoCE interface) - Optimization to limit number of RDMA Fast Path connections for very large clusters 
(Nemesis interface) - Multi-core-aware collective support (QLogic PSM interface) * Bug Fixes: - Fixes for code compilation warnings - Compiler preference lists reordered to avoid mixing GCC and Intel compilers if both are found by configure - Fix a bug in transferring very large messages (>2GB) - Thanks to Tibor Pausz from Univ. of Frankfurt for reporting it - Fix a hang with One-Sided Put operation - Fix a bug in ptmalloc integration - Avoid double-free crash with mpispawn - Avoid crash and print an error message in mpirun_rsh when the hostfile is empty - Checking for error codes in PMI design - Verify programs can link with LiMIC2 at runtime - Fix for compilation issue when BLCR or FTB installed in non-system paths - Fix an issue with RDMA-Migration - Fix for memory leaks - Fix an issue in supporting RoCE with second port available on HCA - Thanks to Jeffrey Konz from HP for reporting it - Fix for a hang with passive RMA tests (QLogic PSM interface) MVAPICH2-1.7rc1 (07/20/2011) * Features and Enhancements (since 1.7a2) - Based on MPICH2-1.4 - CH3 shared memory channel for standalone hosts (including laptops) without any InfiniBand adapters - HugePage support - Improved on-demand InfiniBand connection setup - Optimized Fence synchronization (with and without LIMIC2 support) - Enhanced mpirun_rsh design to avoid race conditions and support for improved debug messages - Optimized design for collectives (Bcast and Reduce) - Improved performance for medium size messages for QLogic PSM - Support for Ekopath Compiler * Bug Fixes - Fixes in Dynamic Process Management (DPM) support - Fixes in Checkpoint/Restart and Migration support - Fix Restart when using automatic checkpoint - Thanks to Alexandr for reporting this - Compilation warnings fixes - Handling very large one-sided transfers using RDMA - Fixes for memory leaks - Graceful handling of unknown HCAs - Better handling of shmem file creation errors - Fix for a hang in intra-node transfer - Fix for a build error 
with --disable-weak-symbols - Thanks to Peter Willis for reporting this issue - Fixes for one-sided communication with passive target synchronization - Proper error reporting when a program is linked with both static and shared MVAPICH2 libraries MVAPICH2-1.7a2 (06/03/2011) * Features and Enhancements (Since 1.7a) - Improved intra-node shared memory communication performance - Tuned RDMA Fast Path Buffer size to get better performance with less memory footprint (CH3 and Nemesis) - Fast process migration using RDMA - Automatic inter-node communication parameter tuning based on platform and adapter detection (Nemesis) - Automatic intra-node communication parameter tuning based on platform - Efficient connection set-up for multi-core systems - Enhancements for collectives (barrier, gather and allgather) - Compact and shorthand way to specify blocks of processes on the same host with mpirun_rsh - Support for latest stable version of HWLOC v1.2 - Improved debug message output in process management and fault tolerance functionality - Better handling of process signals and error management in mpispawn - Performance tuning for pt-to-pt and several collective operations * Bug fixes - Fixes for memory leaks - Fixes in CR/migration - Better handling of memory allocation and registration failures - Fixes for compilation warnings - Fix a bug that disallows '=' from mpirun_rsh arguments - Handling of non-contiguous transfer in Nemesis interface - Bug fix in gather collective when ranks are in cyclic order - Fix for the ignore_locks bug in MPI-IO with Lustre MVAPICH2-1.7a (04/19/2011) * Features and Enhancements - Based on MPICH2-1.3.2p1 - Integrated with Portable Hardware Locality (hwloc v1.1.1) - Supporting Large Data transfers (>2GB) - Integrated with Enhanced LiMIC2 (v0.5.5) to support Intra-node large message (>2GB) transfers - Optimized and tuned algorithm for AlltoAll - Enhanced debugging config options to generate core files and back-traces - Support for Chelsio's T4 
Adapter MVAPICH2-1.6 (03/09/2011) * Features and Enhancements (since 1.6-RC3) - Improved configure help for MVAPICH2 features - Updated Hydra launcher with MPICH2-1.3.3 Hydra process manager - Building and installation of OSU micro benchmarks during default MVAPICH2 installation - Hydra is the default mpiexec process manager * Bug fixes (since 1.6-RC3) - Fix hang issues in RMA - Fix memory leaks - Fix in RDMA_FP MVAPICH2-1.6-RC3 (02/15/2011) * Features and Enhancements - Support for 3D torus topology with appropriate SL settings - For both CH3 and Nemesis interfaces - Thanks to Jim Schutt, Marcus Epperson and John Nagle from Sandia for the initial patch - Quality of Service (QoS) support with multiple InfiniBand SL - For both CH3 and Nemesis interfaces - Configuration file support (similar to the one available in MVAPICH). Provides a convenient method for handling all runtime variables through a configuration file. - Improved job-startup performance on large-scale systems - Optimization in MPI_Finalize - Improved pt-to-pt communication performance for small and medium messages - Optimized and tuned algorithms for Gather and Scatter collective operations - Optimized thresholds for one-sided RMA operations - User-friendly configuration options to enable/disable various checkpoint/restart and migration features - Enabled ROMIO's auto detection scheme for filetypes on Lustre file system - Improved error checking for system and BLCR calls in checkpoint-restart and migration codepath - Enhanced OSU Micro-benchmarks suite (version 3.3) * Bug Fixes - Fix in aggregate ADIO alignment - Fix for an issue with LiMIC2 header - XRC connection management - Fixes in registration cache - IB card detection with MV2_IBA_HCA runtime option in multi rail design - Fix for a bug in multi-rail design while opening multiple HCAs - Fixes for multiple memory leaks - Fix for a bug in mpirun_rsh - Checks before enabling aggregation and migration - Fixing the build errors with --disable-cxx - 
Thanks to Bright Yang for reporting this issue - Fixing the build errors related to "pthread_spinlock_t" seen on RHEL systems MVAPICH2-1.6-RC2 (12/22/2010) * Features and Enhancements - Optimization and enhanced performance for clusters with nVIDIA GPU adapters (with and without GPUDirect technology) - Enhanced R3 rendezvous protocol - For both CH3 and Nemesis interfaces - Robust RDMA Fast Path setup to avoid memory allocation failures - For both CH3 and Nemesis interfaces - Multiple design enhancements for better performance of medium sized messages - Enhancements and optimizations for one sided Put and Get operations - Enhancements and tuning of Allgather for small and medium sized messages - Optimization of AllReduce - Enhancements to Multi-rail Design and features including striping of one-sided messages - Enhancements to mpirun_rsh job start-up scheme - Enhanced designs for automatic detection of various architectures and adapters * Bug fixes - Fix a bug in Post-Wait/Start-Complete path for one-sided operations - Resolving a hang in mpirun_rsh termination when CR is enabled - Fixing issue in MPI_Allreduce and Reduce when called with MPI_IN_PLACE - Thanks to the initial patch by Alexander Alekhin - Fix for an issue in rail selection for small RMA messages - Fix for threading related errors with comm_dup - Fix for alignment issues in RDMA Fast Path - Fix for extra memcpy in header caching - Fix for an issue to use correct HCA when process to rail binding scheme used in combination with XRC. 
- Fix for an RMA issue when configured with enable-g=meminit - Thanks to James Dinan of Argonne for reporting this issue - Only set FC and F77 if gfortran is executable MVAPICH2-1.6RC1 (11/12/2010) * Features and Enhancements - Using LiMIC2 for efficient intra-node RMA transfer to avoid extra memory copies - Upgraded to LiMIC2 version 0.5.4 - Removing the limitation on number of concurrent windows in RMA operations - Support for InfiniBand Quality of Service (QoS) with multiple lanes - Enhanced support for multi-threaded applications - Fast Checkpoint-Restart support with aggregation scheme - Job Pause-Migration-Restart Framework for Pro-active Fault-Tolerance - Support for new standardized Fault Tolerant Backplane (FTB) Events for Checkpoint-Restart and Job Pause-Migration-Restart Framework - Dynamic detection of multiple InfiniBand adapters and using these by default in multi-rail configurations (OFA-IB-CH3, OFA-iWARP-CH3 and OFA-RoCE-CH3 interfaces) - Support for process-to-rail binding policy (bunch, scatter and user-defined) in multi-rail configurations (OFA-IB-CH3, OFA-iWARP-CH3 and OFA-RoCE-CH3 interfaces) - Enhanced and optimized algorithms for MPI_Reduce and MPI_AllReduce operations for small and medium message sizes. - XRC support with Hydra Process Manager - Improved usability of process to CPU mapping with support of delimiters (',' , '-') in CPU listing - Thanks to Gilles Civario for the initial patch - Use of gfortran as the default F77 compiler - Support of Shared-Memory-Nemesis interface on multi-core platforms requiring intra-node communication only (SMP-only systems, laptops, etc. 
) * Bug fixes - Fix for memory leak in one-sided code with --enable-g=all --enable-error-messages=all - Fix for memory leak in getting the context of intra-communicator - Fix for shmat() return code check - Fix for issues with inter-communicator collectives in Nemesis - KNEM patch for osu_bibw issue with KNEM version 0.9.2 - Fix for osu_bibw error with Shared-memory-Nemesis interface - Fix for Win_test error for one-sided RDMA - Fix for a hang in collective when thread level is set to multiple - Fix for intel test errors with rsend, bsend and ssend operations in Nemesis - Fix for memory free issue when it is allocated by scandir - Fix for a hang in Finalize - Fix for issue with MPIU_Find_local_and_external when it is called from MPIDI_CH3I_comm_create - Fix for handling CPPFLAGS values with spaces - Dynamic Process Management to work with XRC support - Fix related to disabling CPU affinity when shared memory is turned off at run time MVAPICH2-1.5.1 (09/14/10) * Features and Enhancements - Significantly reduce memory footprint on some systems by changing the stack size setting for multi-rail configurations - Optimization to the number of RDMA Fast Path connections - Performance improvements in Scatterv and Gatherv collectives for CH3 interface (Thanks to Dan Kokran and Max Suarez of NASA for identifying the issue) - Tuning of Broadcast Collective - Support for tuning of eager thresholds based on both adapter and platform type - Environment variables for message sizes can now be expressed in short form K=Kilobytes and M=Megabytes (e.g. MV2_IBA_EAGER_THRESHOLD=12K) - Ability to selectively use some or all HCAs using colon separated lists. e.g. MV2_IBA_HCA=mlx4_0:mlx4_1 - Improved Bunch/Scatter mapping for process binding with HWLOC and SMT support (Thanks to Dr.
Bernd Kallies of ZIB for ideas and suggestions) - Update to Hydra code from MPICH2-1.3b1 - Auto-detection of various iWARP adapters - Specifying MV2_USE_IWARP=1 is no longer needed when using iWARP - Changing automatic eager threshold selection and tuning for iWARP adapters based on number of nodes in the system instead of the number of processes - PSM progress loop optimization for QLogic Adapters (Thanks to Dr. Avneesh Pant of QLogic for the patch) * Bug fixes - Fix memory leak in registration cache with --enable-g=all - Fix memory leak in operations using datatype modules - Fix for rdma_cross_connect issue for RDMA CM. The server is prevented from initiating a connection. - Don't fail during build if RDMA CM is unavailable - Various mpirun_rsh bug fixes for CH3, Nemesis and uDAPL interfaces - ROMIO panfs build fix - Update panfs for not-so-new ADIO file function pointers - Shared libraries can be generated with unknown compilers - Explicitly link against DL library to prevent build error due to DSO link change in Fedora 13 (introduced with gcc-4.4.3-5.fc13) - Fix regression that prevents the proper use of our internal HWLOC component - Remove spurious debug flags when certain options are selected at build time - Error code added for situation when received eager SMP message is larger than receive buffer - Fix for Gather and GatherV back-to-back hang problem with LiMIC2 - Fix for packetized send in Nemesis - Fix related to eager threshold in nemesis ib-netmod - Fix initialization parameter for Nemesis based on adapter type - Fix for uDAPL one sided operations (Thanks to Jakub Fedoruk from Intel for reporting this) - Fix an issue with out-of-order message handling for iWARP - Fixes for memory leak and Shared context Handling in PSM for QLogic Adapters (Thanks to Dr. 
Avneesh Pant of QLogic for the patch) MVAPICH2-1.5 (07/09/10) * Features and Enhancements (since 1.5-RC2) - SRQ turned on by default for Nemesis interface - Performance tuning - adjusted eager thresholds for variety of architectures, vbuf size based on adapter types and vbuf pool sizes - Tuning for Intel iWARP NE020 adapter, thanks to Harry Cropper of Intel - Introduction of a retry mechanism for RDMA_CM connection establishment * Bug fixes (since 1.5-RC2) - Fix in build process with hwloc (for some Distros) - Fix for memory leak (Nemesis interface) MVAPICH2-1.5-RC2 (06/21/10) * Features and Enhancements (since 1.5-RC1) - Support for hwloc library (1.0.1) for defining CPU affinity - Deprecating the PLPA support for defining CPU affinity - Efficient CPU affinity policies (bunch and scatter) to specify CPU affinity per job for modern multi-core platforms - New flag in mpirun_rsh to execute tasks with different group IDs - Enhancement to the design of Win_complete for RMA operations - Flexibility to support variable number of RMA windows - Support for Intel iWARP NE020 adapter * Bug fixes (since 1.5-RC1) - Compilation issue with the ROMIO adio-lustre driver, thanks to Adam Moody of LLNL for reporting the issue - Allowing checkpoint-restart for large-scale systems - Correcting a bug in clear_kvc function. Thanks to T J (Chris) Ward, IBM Research, for reporting and providing the resolving patch - Shared lock operations with RMA with scatter process distribution. Thanks to Pavan Balaji of Argonne for reporting this issue - Fix a bug during window creation in uDAPL - Compilation issue with --enable-alloca, Thanks to E. 
Borisch, for reporting and providing the patch - Improved error message for ibv_poll_cq failures - Fix an issue that prevents mpirun_rsh from executing programs without specifying the path from directories in PATH - Fix an issue of mpirun_rsh with Dynamic Process Management (DPM) - Fix for memory leaks (both CH3 and Nemesis interfaces) - Updatefiles correctly update LiMIC2 - Several fixes to the registration cache (CH3, Nemesis and uDAPL interfaces) - Fix to multi-rail communication - Fix to Shared Memory communication Progress Engine - Fix to all-to-all collective for large number of processes MVAPICH2-1.5-RC1 (05/04/10) * Features and Enhancements - MPI 2.2 compliant - Based on MPICH2-1.2.1p1 - OFA-IB-Nemesis interface design - OpenFabrics InfiniBand network module support for MPICH2 Nemesis modular design - Support for high-performance intra-node shared memory communication provided by the Nemesis design - Adaptive RDMA Fastpath with Polling Set for high-performance inter-node communication - Shared Receive Queue (SRQ) support with flow control, uses significantly less memory for MPI library - Header caching - Advanced AVL tree-based Resource-aware registration cache - Memory Hook Support provided by integration with ptmalloc2 library. This provides safe release of memory to the Operating System and is expected to benefit the memory usage of applications that heavily use malloc and free operations.
- Support for TotalView debugger - Shared Library Support for existing binary MPI application programs to run - ROMIO Support for MPI-IO - Support for additional features (such as hwloc, hierarchical collectives, one-sided, multithreading, etc.), as included in the MPICH2 1.2.1p1 Nemesis channel - Flexible process manager support - mpirun_rsh to work with any of the eight interfaces (CH3 and Nemesis channel-based) including OFA-IB-Nemesis, TCP/IP-CH3 and TCP/IP-Nemesis - Hydra process manager to work with any of the eight interfaces (CH3 and Nemesis channel-based) including OFA-IB-CH3, OFA-iWARP-CH3, OFA-RoCE-CH3 and TCP/IP-CH3 - MPIEXEC_TIMEOUT is honored by mpirun_rsh * Bug fixes since 1.4.1 - Fix compilation error when configured with `--enable-thread-funneled' - Fix MPE functionality, thanks to Anthony Chan for reporting and providing the resolving patch - Cleanup after a failure in the init phase is handled better by mpirun_rsh - Path determination is correctly handled by mpirun_rsh when DPM is used - Shared libraries are correctly built (again) MVAPICH2-1.4.1 * Enhancements since mvapich2-1.4 - MPMD launch capability to mpirun_rsh - Portable Hardware Locality (hwloc) support, patch suggested by Dr. Bernd Kallies - Multi-port support for iWARP - Enhanced iWARP design for scalability to higher process count - Ring based startup support for RDMAoE * Bug fixes since mvapich2-1.4 - Fixes for MPE and other profiling tools as suggested by Anthony Chan (chan@mcs.anl.gov) - Fixes for finalization issue with dynamic process management - Removed overrides to PSM_SHAREDCONTEXT, PSM_SHAREDCONTEXTS_MAX variables. Suggested by Ben Truscott. - Fixing the error check for buffer aliasing in MPI_Reduce as suggested by Dr.
Rajeev Thakur - Fix Totalview integration for RHEL5 - Update simplemake to handle build timestamp issues - Fixes for --enable-g={mem, meminit} - Improved logic to control the receive and send requests to handle the limitation of CQ Depth on iWARP - Fixing assertion failures with IMB-EXT tests - VBUF size for very small iWARP clusters bumped up to 33K - Replace internal mallocs with MPIU_Malloc uniformly for correct tracing with --enable-g=mem - Fixing multi-port for iWARP - Fix memory leaks - Shared-memory reduce fixes for MPI_Reduce invoked with MPI_IN_PLACE - Handling RDMA_CM_EVENT_TIMEWAIT_EXIT event - Fix for threaded-ctxdup mpich2 test - Detecting spawn errors, patch contributed by Dr. Bernd Kallies - IMB-EXT fixes reported by Yutaka from Cray Japan - Fix alltoall assertion error when limic is used MVAPICH2-1.4 * Enhancements since mvapich2-1.4rc2 - Efficient runtime CPU binding - Add an environment variable for controlling the use of multiple cq's for iWARP interface. - Add environmental variables to disable registration cache for All-to-All on large systems. - Performance tune for pt-to-pt Intra-node communication with LiMIC2 - Performance tune for MPI_Broadcast * Bug fixes since mvapich2-1.4rc2 - Fix the reading error in lock_get_response by adding initialization to req->mrail.protocol - Fix mpirun_rsh scalability issue with hierarchical ssh scheme when launching greater than 8K processes. - Add mvapich_ prefix to yacc functions. This can avoid some namespace issues when linking with other libraries. Thanks to Manhui Wang for contributing the patch. MVAPICH2-1.4-rc2 * Enhancements since mvapich2-1.4rc1 - Added Feature: Check-point Restart with Fault-Tolerant Backplane Support (FTB_CR) - Added Feature: Multiple CQ-based design for Chelsio iWARP - Distribute LiMIC2-0.5.2 with MVAPICH2. 
Added flexibility for selecting and using a pre-existing installation of LiMIC2 - Increase the amount of command line that mpirun_rsh can handle (Thanks for the suggestion by Bill Barth @ TACC) * Bug fixes since mvapich2-1.4rc1 - Fix for hang with packetized send using RDMA Fast path - Fix for allowing the use of user specified P_Key's (Thanks to Mike Heinz @ QLogic) - Fix for allowing mpirun_rsh to accept parameters through the parameters file (Thanks to Mike Heinz @ QLogic) - Modify the default value of shmem_bcast_leaders to 4K - Fix for one-sided with XRC support - Fix hang with XRC - Fix to always enable MVAPICH2_Sync_Checkpoint functionality - Fix build error on RHEL 4 systems (Reported by Nathan Baca and Jonathan Atencio) - Fix issue with PGI compilation for PSM interface - Fix for one-sided accumulate function with user-defined contiguous datatypes - Fix linear/hierarchical switching logic and reduce threshold for the enhanced mpirun_rsh framework. - Clean up intra-node connection management code for iWARP - Fix --enable-g=all issue with uDAPL interface - Fix one sided operation with on demand CM. - Fix VPATH build MVAPICH2-1.4-rc1 * Bugs fixed since MVAPICH2-1.2p1 - Changed parameters for iWARP for increased scalability - Fix error with derived datatypes and Put and Accumulate operations. Request was being marked complete before data transfer had actually taken place when MV_RNDV_PROTOCOL=R3 was used - Unregister stale memory registrations earlier to prevent malloc failures - Fix for compilation issues with --enable-g=mem and --enable-g=all - Change dapl_prepost_noop_extra value from 5 to 8 to prevent credit flow issues. - Re-enable RGET (RDMA Read) functionality - Fix SRQ Finalize error. Make sure that finalize does not hang when the srq_post_cond is being waited on.
- Fix a multi-rail one-sided error when multiple QPs are used - PMI Lookup name failure with SLURM - Port auto-detection failure when the 1st HCA did not have an active port - Change default small message scheduling for multirail for higher performance - MPE support for shared memory collectives now available MVAPICH2-1.2p1 (11/11/2008) * Changes since MVAPICH2-1.2 - Fix shared-memory communication issue for AMD Barcelona systems. MVAPICH2-1.2 (11/06/2008) * Bugs fixed since MVAPICH2-1.2-rc2 - Ignore the last bit of the pkey and remove the pkey_ix option since the index can be different on different machines. Thanks to Pasha@Mellanox for the patch. - Fix data types for memory allocations. Thanks to Dr. Bill Barth from TACC for the patches. - Fix a bug when MV2_NUM_HCAS is larger than the number of active HCAs. - Allow builds on architectures for which tuning parameters do not exist. * Changes related to the mpirun_rsh framework - Always build and install mpirun_rsh in addition to the process manager(s) selected through the --with-pm mechanism. - Cleaner job abort handling - Ability to detect the path to mpispawn if the Linux proc filesystem is available. - Added Totalview debugger support - Stdin is only available to rank 0. Other ranks get /dev/null. * Other miscellaneous changes - Add sequence numbers for RPUT and RGET finish packets. - Increase the number of allowed nodes for shared memory broadcast to 4K. - Use /dev/shm on Linux as the default temporary file path for shared memory communication. Thanks to Doug Johnson@OSC for the patch. - MV2_DEFAULT_MAX_WQE has been replaced with MV2_DEFAULT_MAX_SEND_WQE and MV2_DEFAULT_MAX_RECV_WQE for send and recv wqes, respectively. - Fix compilation warnings. MVAPICH2-1.2-RC2 (08/20/2008) * Following bugs are fixed in RC2 - Properly handle the scenario in shared memory broadcast code when the datatypes of different processes taking part in broadcast are different.
- Fix a bug in Checkpoint-Restart code to determine whether a connection is a shared memory connection or a network connection. - Support non-standard path for BLCR header files. - Increase the maximum heap size to avoid race condition in realloc(). - Use int32_t for rank for larger jobs with 32k processes or more. - Improve mvapich2-1.2 bandwidth to the same level as mvapich2-1.0.3. - An error handling patch for uDAPL interface. Thanks to Nilesh Awate for the patch. - Explicitly set some of the EP attributes when on demand connection is used in uDAPL interface. MVAPICH2-1.2-RC1 (07/02/08) * Following features are added for this new mvapich2-1.2 release: - Based on MPICH2 1.0.7 - Scalable and robust daemon-less job startup -- Enhanced and robust mpirun_rsh framework (non-MPD-based) to provide scalable job launching on multi-thousand core clusters -- Available for OpenFabrics (IB and iWARP) and uDAPL interfaces (including Solaris) - Adding support for intra-node shared memory communication with Checkpoint-restart -- Allows best performance and scalability with fault-tolerance support - Enhancement to software installation -- Change to full autoconf-based configuration -- Adding an application (mpiname) for querying the MVAPICH2 library version and configuration information - Enhanced processor affinity using PLPA for multi-core architectures - Allows user-defined flexible processor affinity - Enhanced scalability for RDMA-based direct one-sided communication with less communication resource - Shared memory optimized MPI_Bcast operations - Optimized and tuned MPI_Alltoall MVAPICH2-1.0.2 (02/20/08) * Change the default MV2_DAPL_PROVIDER to OpenIB-cma * Remove extraneous parameter is_blocking from the gen2 interface for MPIDI_CH3I_MRAILI_Get_next_vbuf * Explicitly name unions in struct ibv_wr_descriptor and reference the members in the code properly. * Change "inline" functions to "static inline" properly.
* Increase the maximum number of buffer allocations for communication intensive applications * Corrections for warnings from the Sun Studio 12 compiler. * If malloc hook initialization fails, then turn off registration cache * Add MV_R3_THRESHOLD and MV_R3_NOCACHE_THRESHOLD which allow R3 to be used for smaller messages instead of registering the buffer and using a zero-copy protocol. * Fixed an error in message coalescing. * Setting application initiated checkpoint as default if CR is turned on. MVAPICH2-1.0.1 (10/29/07) * Enhance udapl initialization, set all ep_attr fields properly. Thanks to Kanoj Sarcar from NetXen for the patch. * Fixing a bug that miscalculates the receive size when a complex datatype is used. Thanks to Patrice Martinez from Bull for reporting this problem. * Minor patches for fixing (i) NBO for rdma-cm ports and (ii) rank variable usage in DEBUG_PRINT in rdma-cm.c. Thanks to Steve Wise for reporting these. MVAPICH2-1.0 (09/14/07) * Following features and bug fixes are added in this new MVAPICH2-1.0 release: - Message coalescing support to enable reduction of per Queue-pair send queues for reduction in memory requirement on large scale clusters. This design also increases the small message messaging rate significantly. Available for Open Fabrics Gen2-IB. - Hot-Spot Avoidance Mechanism (HSAM) for alleviating network congestion in large scale clusters. Available for Open Fabrics Gen2-IB. - RDMA CM based on-demand connection management for large scale clusters. Available for OpenFabrics Gen2-IB and Gen2-iWARP. - uDAPL on-demand connection management for large scale clusters. Available for uDAPL interface (including Solaris IB implementation). - RDMA Read support for increased overlap of computation and communication. Available for OpenFabrics Gen2-IB and Gen2-iWARP. - Application-initiated system-level (synchronous) checkpointing in addition to the user-transparent checkpointing.
User application can now request a whole program checkpoint synchronously with BLCR by calling special functions within the application. Available for OpenFabrics Gen2-IB. - Network-Level fault tolerance with Automatic Path Migration (APM) for tolerating intermittent network failures over InfiniBand. Available for OpenFabrics Gen2-IB. - Integrated multi-rail communication support for OpenFabrics Gen2-iWARP. - Blocking mode of communication progress. Available for OpenFabrics Gen2-IB. - Based on MPICH2 1.0.5p4. * Fix for hang while using IMB with -multi option. Thanks to Pasha (Mellanox) for reporting this. * Fix for hang in memory allocations > 2^31 - 1. Thanks to Bryan Putnam (Purdue) for reporting this. * Fix for RDMA_CM finalize rdma_destroy_id failure. Added Timeout env variable for RDMA_CM ARP. Thanks to Steve Wise for suggesting these. * Fix for RDMA_CM invalid event in finalize. Thanks to Steve Wise and Sean Hefty. * Fix for shmem memory collectives related memory leaks * Updated src/mpi/romio/adio/ad_panfs/Makefile.in include path to find mpi.h. Contributed by David Gunter, Los Alamos National Laboratory. * Fixed header caching error on handling datatype messages with small vector sizes. * Change the finalization protocol for UD connection manager. * Fix for the "command line too long" problem. Contributed by Xavier Bru from Bull (http://www.bull.net/) * Change the CKPT handling to invalidate all unused registration cache. * Added ofed 1.2 interface change patch for iwarp/rdma_cm from Steve Wise. * Fix for rdma_cm_get_event err in finalize. Reported by Steve Wise. * Fix for when MV2_IBA_HCA is used. Contributed by Michael Schwind of Technical Univ. of Chemnitz (Germany). 
MVAPICH2-0.9.8 (11/10/06) * Following features are added in this new MVAPICH2-0.9.8 release: - BLCR based Checkpoint/Restart support - iWARP support: tested with Chelsio and Ammasso adapters and OpenFabrics/Gen2 stack - RDMA CM connection management support - Shared memory optimizations for collective communication operations - uDAPL support for NetEffect 10GigE adapter. MVAPICH2-0.9.6 (10/22/06) * Following features and bug fixes are added in this new MVAPICH2-0.9.6 release: - Added on demand connection management. - Enhance shared memory communication support. - Added ptmalloc memory hook support. - Runtime selection for most configuration options. MVAPICH2-0.9.5 (08/30/06) * Following features and bug fixes are added in this new MVAPICH2-0.9.5 release: - Added multi-rail support for both point to point and direct one-sided operations. - Added adaptive RDMA fast path. - Added shared receive queue support. - Added TotalView debugger support * Optimization of SMP startup information exchange for USE_MPD_RING to enhance performance for SLURM. Thanks to Don and team members from Bull and folks from LLNL for their feedback and comments. * Added uDAPL build script functionality to set DAPL_DEFAULT_PROVIDER explicitly with default suggestions. Thanks to Harvey Richardson from Sun for suggesting this feature.
MVAPICH2-0.9.3 (05/20/06) * Following features are added in this new MVAPICH2-0.9.3 release: - Multi-threading support - Integrated with MPICH2 1.0.3 stack - Advanced AVL tree-based Resource-aware registration cache - Tuning and Optimization of various collective algorithms - Processor affinity for intra-node shared memory communication - Auto-detection of InfiniBand adapters for Gen2 MVAPICH2-0.9.2 (01/15/06) * Following features are added in this new MVAPICH2-0.9.2 release: - InfiniBand support for OpenIB/Gen2 - High-performance and optimized support for many MPI-2 functionalities (one-sided, collectives, datatype) - Support for other MPI-2 functionalities (as provided by MPICH2 1.0.2p1) - High-performance and optimized support for all MPI-1 functionalities MVAPICH2-0.9.0 (11/01/05) * Following features are added in this new MVAPICH2-0.9.0 release: - Optimized two-sided operations with RDMA support - Efficient memory registration/de-registration schemes for RDMA operations - Optimized intra-node shared memory support (bus-based and NUMA) - Shared library support - ROMIO support - Support for multiple compilers (gcc, icc, and pgi) MVAPICH2-0.6.5 (07/02/05) * Following features are added in this new MVAPICH2-0.6.5 release: - uDAPL support (tested for InfiniBand, Myrinet, and Ammasso GigE) MVAPICH2-0.6.0 (11/04/04) * Following features are added in this new MVAPICH2-0.6.0 release: - MPI-2 functionalities (one-sided, collectives, datatype) - All MPI-1 functionalities - Optimized one-sided operations (Get, Put, and Accumulate) - Support for active and passive synchronization - Optimized two-sided operations - Scalable job start-up - Optimized and tuned for the above platforms and different network interfaces (PCI-X and PCI-Express) - Memory efficient scaling modes for medium and large clusters