NewMadeleine README

This document describes nmad installation and configuration.

For any question, send mail to Alexandre.Denis@inria.fr.

Prerequisites

The following development tools are required to compile NewMadeleine:

  • GNU C Compiler gcc (version 4.0 or higher) or compatible (icc, clang).
  • GNU make (version 3.81 or higher).
  • autoconf (version 2.50 or later)
  • pkg-config
  • hwloc
  • libexpat XML parser (set $EXPAT_ROOT if it cannot be found by pkg-config)
  • ibverbs/OFED for InfiniBand networks support (set $IBHOME if not installed in /usr) (optional)
  • PSM for Intel InfiniPath networks (optional)
  • PSM2 for Intel OmniPath networks (set $PSM2_DIR if not installed in system directories) (optional)
  • Portals4 for Atos BXI networks (optional)
  • libfabric (OFI) for Cray Slingshot networks (optional)
  • MX for Myrinet networks support (set $MX_DIR if not installed in /usr) (optional)
  • libpmi2 for PMI2 (slurm) integration and/or libpmix for PMIx (set $PMIX_ROOT if not installed in /usr)
  • other PM2 modules:
    • Puk: mandatory
    • PadicoTM: recommended, used as launcher
    • pioman: optional, for progression
    • these modules are built automatically when using the automated build system.

Download

NewMadeleine may be downloaded either as a tarball from https://pm2.gitlabpages.inria.fr/releases/, or from the git master branch at https://gitlab.inria.fr/pm2/pm2.

Installation

Installation of NewMadeleine is a standard sequence of ./configure ; make ; make install. However, to build and install all modules required by NewMadeleine, we provide a script that builds the modules in the right order and with the right parameters.

Automated build (recommended):

To build all modules required by nmad, we recommend using the build script located in pm2/scripts/pm2-build-packages with a provided or custom configuration file, e.g.:

 % cd pm2/scripts
 % ./pm2-build-packages ./madmpi.conf --prefix=$HOME/soft/x86_64

For a standard multi-threaded build, it is advised to use madmpi.conf. For a non-threaded build (no progression!), the configuration from madmpi-mini.conf will lead to a slightly more efficient library.

Manual build (not recommended, advanced users only):

Module nmad requires other pm2 modules: Puk, PadicoTM, PukABI (optional), pioman (optional).

For each module:

./autogen.sh
mkdir build ; cd build
../configure [your options here]
make
make install

Note: nmad purposely cannot be configured in its source directory. Please use a separate build directory.
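The per-module sequence above can be sketched as a dry-run helper that only prints the commands it would run. This is a minimal sketch under assumptions: the function name manual_build_plan is hypothetical, the module directory names and their order (dependencies first) are assumed, and --prefix is the standard autoconf installation prefix option.

```shell
# Print (do not execute) the manual build steps for each module, in order.
# $1: installation prefix; remaining arguments: module directories (assumed names).
manual_build_plan() {
  prefix=$1
  shift
  for module in "$@"; do
    # Each module is built in a separate build directory, as the note above requires.
    printf '(cd %s && ./autogen.sh && mkdir -p build && cd build && ../configure --prefix=%s && make && make install)\n' \
      "$module" "$prefix"
  done
}

# Assumed order: Puk first, nmad last.
manual_build_plan "$HOME/soft/x86_64" Puk PadicoTM nmad
```

Review the printed lines and add any module-specific configure options before running them by hand.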

Useful configure flags (see ./configure --help):

--enable-sampling       Enable network sampling
--enable-mpi            Enable builtin MPI implementation MadMPI
--with-pioman           use pioman I/O manager [default=no]
--with-ibverbs          use InfiniBand ibverbs [default=check]
--with-mx               use Myrinet MX [default=check]
--with-psm              use Intel Performance Scaled Messaging (PSM) [default=check]
--with-psm2             use Intel Performance Scaled Messaging 2 (PSM2) [default=check]

Building application code

For an MPI application using MadMPI, use the standard mpicc, mpif77 and mpif90 compiler frontends to build and link.

To build an application using native NewMadeleine interface, get the required flags through pkg-config. For CFLAGS:

% pkg-config --cflags nmad

For libraries:

% pkg-config --libs nmad

In a Makefile, you will typically need:

CFLAGS += $(shell pkg-config --cflags nmad)
LIBS += $(shell pkg-config --libs nmad)
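Before relying on these flags, it can help to check that pkg-config actually sees the nmad module. The following is a minimal sketch; the function name nmad_pkgconfig_status is hypothetical, and the hint about <prefix>/lib/pkgconfig assumes the usual autotools install layout.

```shell
# Report whether pkg-config can locate the nmad module.
# Prints "found" or "missing"; the function itself never fails.
nmad_pkgconfig_status() {
  if command -v pkg-config >/dev/null 2>&1 && pkg-config --exists nmad; then
    echo found
  else
    echo missing
  fi
}

nmad_status=$(nmad_pkgconfig_status)
echo "nmad pkg-config module: $nmad_status"
if [ "$nmad_status" = missing ]; then
  # Assumption: nmad.pc was installed under <prefix>/lib/pkgconfig.
  echo 'hint: add <prefix>/lib/pkgconfig to PKG_CONFIG_PATH' >&2
fi
```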

Launcher

For MadMPI use the standard mpirun as launcher. Please see mpirun --help for up-to-date documentation. Please note that MadMPI mpirun is a frontend to padico-launch so it accepts all options described below.

For native NewMadeleine applications, it is recommended to use padico-launch as a launcher for nmad. It accepts parameters similar to mpirun. Please see padico-launch --help for up-to-date documentation. For example:

% padico-launch -n 2 -nodelist jack0,jack1 nm_bench_sendrecv

starts program 'nm_bench_sendrecv' on hosts jack0 and jack1, using auto-detected network.

Environment variables may be set using -D parameters, e.g.:

% padico-launch -c -p -n 2 -nodelist jack0,jack1 -DNMAD_DRIVER=ib nm_bench_sendrecv

starts program 'nm_bench_sendrecv' on hosts jack0 and jack1, over InfiniBand, using one console per process.

On clusters using slurm, mpirun and padico-launch will start processes using srun. It is possible for the user to directly use srun without using mpirun at all. Make sure to enable pmi2 or pmix, e.g.:

% srun -N 2 --exclusive --mpi=pmi2 nm_bench_sendrecv

Debug

gdb

To launch each process in a gdb debugger, use argument -d in association with -c to get one console per node, e.g.:

% padico-launch -c -d -n 2 nm_sr_hello

valgrind

To launch each process in the valgrind memcheck tool, use argument --padico-valgrind, in association with -c, e.g.:

% padico-launch -c --padico-valgrind -n 2 nm_sr_hello

Detect invalid data change

NewMadeleine comes with a tool to detect whether user data is modified while a non-blocking send is still using it, which leads to data corruption. To detect such a bug in application code, please set the environment variable NMAD_ISEND_CHECK to a non-null value, e.g.:

% padico-launch -n 2 -DNMAD_ISEND_CHECK=1 nm_sr_hello

It checks whether the user buffer has been modified between nm_sr_isend and nm_sr_swait, or between MPI_Isend and MPI_Test or MPI_Wait. This feature is only available when nmad is built in debug mode.

Verbosity

By default, NewMadeleine is quiet and outputs only warnings and fatal errors. To display info about the init (network detection, addresses, drivers used), it is advised to use verbose mode with the '-v' parameter:

% padico-launch -v -n 2 nm_sr_hello

Verbose mode is the default when NewMadeleine is built in debug mode. It is possible to switch to quiet mode with parameter '-q'. A custom trace policy may be given with --trace (syntax not documented yet).

Per-node logging

To help debug code on a large number of nodes, standard output and stderr may be captured and sent to disk, with one file per node using the --log parameter, e.g.:

% padico-launch --log=${HOME}/log-$$ -n 2 nm_sr_hello

will send output to files in ${HOME}/log-$$. The directory is created if needed. File names contain the username, the session uuid, the node rank, the hostname, and the node uuid, to avoid collisions and to allow easy browsing.

Deadlocks

To help debug deadlocks in communications, NewMadeleine is able to detect stalled packets using the environment variable NMAD_PWSEND_TIMEOUT, e.g.:

% padico-launch -n 2 -DNMAD_PWSEND_TIMEOUT=1 nm_sr_hello

It checks whether a packet wrapper takes more than 30 seconds to be sent (on any track) or to be received (on the large track only). Since it relies on timers from profiling, it requires NewMadeleine to be built with profiling (--enable-profile at configure).

In addition, when built with PadicoTM (the default), this flag enables a watchdog to check how often the optimizing strategy is called.

Memory

To help diagnose OOM errors, a memory monitor is available to display the allocated memory. It may be enabled by using the MemMonitor PadicoTM module, which is loaded with the -iload-MemMonitor init flag, e.g.:

% padico-launch -n 2 -iload-MemMonitor -DPADICO_MEM_MONITOR_PERIOD=5 nm_sr_hello

By default, it periodically displays the memory usage of the whole process, as given by getrusage. The period may be tuned through the optional environment variable PADICO_MEM_MONITOR_PERIOD (in seconds); the default is 3 seconds.

In addition, when NewMadeleine profiling is enabled, it displays the amount of memory allocated directly by Puk+PadicoTM+NewMadeleine, to distinguish its memory usage from the application (total amount of memory in bytes, number of mallocs, number of frees).

PAJE traces

NewMadeleine may generate a trace of its internal state in the PAJE format. To do so, you may use the --enable-trace configuration flag. This option requires the external library GTG.

When compiled with traces, NewMadeleine will automatically generate a PAJE trace file in the current directory at the end of the execution. A single file is generated for all nodes.

The content of traces may be controlled by the NMAD_TRACE environment variable. It must contain a comma-separated list of the following items:

- core    trace state of nmad core (beware: huge traces)
- driver  trace state of packet-wrappers
- pack    trace state of pack/unpack requests
- link    generate arrows for messages
- all     all of the above
- none    no trace
- ^core   remove core state
- ^driver remove pw state
- ^pack   remove request state
- ^link   remove arrows

Operands are evaluated in order when adding/removing filters. The default when the variable is not set by user is NMAD_TRACE=all,^core.
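The in-order evaluation can be illustrated with a small shell function that mimics the documented semantics. This is only a sketch of the rule described above, not nmad's own parser; the function name nmad_trace_eval is hypothetical, and starting from an empty set when the variable is set is an assumption.

```shell
# Evaluate an NMAD_TRACE-style list left to right and print the resulting
# set of enabled categories. The body runs in a subshell, so IFS and the
# flag variables stay local to the call.
nmad_trace_eval() (
  core=0 driver=0 pack=0 link=0   # assumption: evaluation starts from an empty set
  IFS=,
  for item in $1; do
    case "$item" in
      all)       core=1 driver=1 pack=1 link=1 ;;
      none)      core=0 driver=0 pack=0 link=0 ;;
      core)      core=1 ;;
      driver)    driver=1 ;;
      pack)      pack=1 ;;
      link)      link=1 ;;
      '^core')   core=0 ;;
      '^driver') driver=0 ;;
      '^pack')   pack=0 ;;
      '^link')   link=0 ;;
      *) echo "unknown item: $item" >&2 ;;
    esac
  done
  out=""
  if [ "$core" = 1 ]; then out="$out core"; fi
  if [ "$driver" = 1 ]; then out="$out driver"; fi
  if [ "$pack" = 1 ]; then out="$out pack"; fi
  if [ "$link" = 1 ]; then out="$out link"; fi
  echo "${out# }"
)

nmad_trace_eval 'all,^core'   # the documented default: everything except core
```

Because operands are applied in order, '^core,all' ends with core enabled, while 'all,^core' ends with it disabled.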

Profiling

To enable profiling counters in NewMadeleine, please pass --enable-profile to configure.

Then it is possible to control which counters are displayed using the PUK_PROFILE environment variable, which gives a filter to be matched against the profiling variable name. By default, nothing is displayed. Use PUK_PROFILE='*' to display all variables, PUK_PROFILE=nm_drv.* to display only variables from nmad drivers, etc.

Set PUK_DISPLAY_PROFILE=yes to display the description of all profiling variables.

Memory profiling is available only if Puk, in addition to nmad, was configured with the --enable-profile option.

Advanced Tuning

Parameters

NewMadeleine is tuned through parameters that can be set through environment variables or programmatically (see Puk-opt.h). For convenience, environment variables may be set on the command line using the following syntax:

% padico-launch -DVAR=value

to set a value to environment variable VAR.

Parameters are typed (string, int, unsigned, bool). Valid values for boolean variables are: 0/1, y/n, yes/no, true/false, on/off, enabled/disabled.

To display the list of all parameters and their value, give the parameter -DPUK_DISPLAY_ENV=yes
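When wrapping launches in scripts, it can be convenient to validate boolean values before passing them with -D. The following sketch accepts exactly the spellings listed above; the function name nmad_parse_bool is hypothetical, and treating the values as case-sensitive is an assumption.

```shell
# Normalize a boolean parameter value to 1 or 0.
# Returns non-zero (and prints to stderr) for anything not in the documented list.
nmad_parse_bool() {
  case "$1" in
    1|y|yes|true|on|enabled)   echo 1 ;;
    0|n|no|false|off|disabled) echo 0 ;;
    *) echo "invalid boolean value: $1" >&2; return 1 ;;
  esac
}

nmad_parse_bool enabled
nmad_parse_bool off
```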

Strategy

The strategy used by nmad is selected using the following rules:

  1. if the environment variable NMAD_STRATEGY is set, it is used whatever the other configuration parameters are.
  2. if the variable is not set, strategy 'aggreg' is used by default.

Valid strategies are: default, aggreg, aggreg_autoextended, split_balance, prio.

The following are deprecated/unmaintained: split_all, qos

The default choice should fit most cases.

Drivers

The drivers used by nmad are selected using the following rules:

  1. if the environment variable NMAD_DRIVER is set, it is used by default. It may contain the name of a single driver for single rail, or a list of multiple drivers separated by '+' for multirail, e.g. NMAD_DRIVER=mx+ibverbs. The following driver names are recognized:
    • ibverbs for default InfiniBand drivers
    • ibrcache, iblr2, ibsrq or ibbuf to force the InfiniBand protocol
    • tcp for TCP sockets
    • psm for Infinipath
    • psm2 for Omni-Path
    • bxi or portals4 for Portals4 network (tested only with Atos BXI)
    • ucx for UCX library
    • ofi for libfabric
    • shm for shared memory on the same node
    • local for Unix domain sockets (basic driver, for debug)
    • self for intra-process loopback; it is always added by default by nmad and does not need to be given by the end user.
    • other drivers (mx, sisci, cci, dcfa) are deprecated.
  2. if nmad is launched with mpirun, srun, or padico-launch, then PadicoTM default NetSelector rules apply:
    • intra-process uses 'self'
    • intra-node inter-process uses 'shm'
    • inter-node uses 'psm2' if OmniPath is present (auto-detected)
    • inter-node uses 'psm' if InfiniPath is present (auto-detected)
    • inter-node uses 'ibverbs' if InfiniBand is present (auto-detected) and nodes are on the same IB subnet (same subnet manager). Thus it is important to configure the subnet GID prefix and not keep the factory default GID prefix in opensm.
    • inter-node uses 'mx' if Myrinet is auto-detected and nodes are on the same Myrinet network (same mapper).
    • inter-node uses 'tcp' if IP is available, and neither IB nor MX are available.
    • as a last resort, routed messages over the control channel are used if no direct connection is possible. Usual NetSelector customization is possible.
  3. if nmad is launched through the cmdline launcher, then a "-R <string>" parameter is taken as a railstring, with the same syntax as NMAD_DRIVER. Please note that the cmdline launcher is only for debugging purposes and handles only 2 nodes.
  4. if another custom launcher is used, it may set a selector using the 'session' interface.
  5. in any other case, 'self' is used for intra-process; 'tcp' for inter-process.

Strategy 'prio' limits the total number of simultaneous outgoing packets. This number may be tuned using environment variable NM_PRIO_MAX_PW. The default value is 2.

NMAD_DISPLAY_DRIVERS=1 displays strategy and drivers used by each process.

For most users, auto-detection should do the right thing and endusers are not expected to manually select a driver.
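As a reading aid, the default NetSelector rules listed in this section can be approximated by the following sketch. It is not PadicoTM's implementation: the function name netselector_sketch and the locality labels are hypothetical, the priority among psm2, psm, ibverbs and mx is assumed from the order of the list, the subnet and mapper checks are left out, and 'routed' is just a label for routed messages over the control channel.

```shell
# $1: locality of the peer (process | node | remote)
# $2: space-separated list of networks detected between the two nodes
netselector_sketch() {
  case "$1" in
    process) echo self; return 0 ;;
    node)    echo shm;  return 0 ;;
  esac
  # Inter-node: first match wins (assumed priority order).
  for drv in psm2 psm ibverbs mx tcp; do
    case " $2 " in
      *" $drv "*) echo "$drv"; return 0 ;;
    esac
  done
  echo routed   # last resort: no direct connection is possible
}

netselector_sketch remote 'tcp ibverbs'
```

To see what a real run selected, use NMAD_DISPLAY_DRIVERS=1 as described above.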

General tuning

  • NMAD_AUTO_FLUSH=1 asks nmad to flush outgoing packets after every posted send. It ensures data is sent earlier; in return, it increases contention between threads and prevents the 'aggreg' strategy from actually aggregating messages.

Shm tuning

By default, intra-node shared memory communication uses a pipelined copy. The more efficient "Cross Memory Attachment" (CMA) method may be used on systems that support it using environment variable NMAD_SHM_CMA=1.

Binding

NUIOA (Non-Uniform I/O Access) may be taken into account by NewMadeleine. To do so, use NMAD_NUIOA_ENABLE=1 to automatically bind threads to the NUMA node where the network board is attached.

This is disabled by default.

InfiniBand tuning

InfiniBand may be tuned at run time through environment variables:

  • NMAD_IBVERBS_RCACHE=1 enables the registration cache; the default choice is to use the rcache-mini backend, best used with the PukABI module. Other backends, which use other mechanisms to maintain registration cache consistency, may be selected with the following variables:
    • NMAD_RCACHE_ODP=1 memory blocks are registered for On-Demand Paging (ODP), with page fault prefetch, without cache. Performance is poor.
    • NMAD_RCACHE_IODP=1 the full process memory is registered for Implicit On-Demand Paging (IODP), with prefetch. This mode is available on ConnectX-5+ hardware. It is safe to use, compatible with any memory allocator, and gets fair performance (although a little slower than default choice).
    • NMAD_RCACHE_NOCACHE=1 no cache is implemented, memory is registered and deregistered for each packet. Performance is poor. It is used for debugging purposes only; end users are not expected to use it.
  • NMAD_IBVERBS_SRQ=1 enables the use of Shared Receive Queues (SRQ) for scalability in number of nodes. The latency penalty is low. It is used by default for more than 16 nodes.
  • NMAD_IBVERBS_CHECKSUM=1 enables the checksum computation on the fly in the driver.
  • NMAD_IBVERBS_ALIGN=<n> sets the alignment of every packet sent through InfiniBand to <n>, using padding. Default is 64 bytes.
  • NMAD_IBVERBS_MEMALIGN=<n> enforces alignment of buffers used internally in all InfiniBand drivers. Default is 4096 bytes.
  • NMAD_IBVERBS_COMP_CHANNEL=1 enables the completion channel (to use blocking syscalls) with driver 'srq'.
  • NMAD_RCACHE_CHECKSUM=1 enables checksums for the 'rcache' driver, for debugging purpose only.

A specific IB device or port may be specified in the driver string through driver attributes. The supported attributes for IB drivers are ibv_device and ibv_port, e.g.:

% padico-launch -n 2 -DNMAD_DRIVER=ibverbs:ibv_device=mlx5_0:ibv_port=1 nm_sr_hello

Either ibv_device, ibv_port, or both, may be given.
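To make the driver string syntax concrete, the sketch below splits a value into rails on '+' and each rail into a driver name and its attributes on ':'. It only illustrates the syntax shown in this document and is not nmad's parser; the function name nmad_driver_split is hypothetical, and combining '+' with attributes in one string is an assumption.

```shell
# Print one line per rail: "driver=<name> <attr>=<value> ...".
# The body runs in a subshell, so the IFS changes stay local to the call.
nmad_driver_split() (
  IFS=+
  for rail in $1; do        # $1 is split on '+' once, before the loop starts
    IFS=:
    line=""
    for field in $rail; do  # first field is the driver name, the rest are attributes
      if [ -z "$line" ]; then
        line="driver=$field"
      else
        line="$line $field"
      fi
    done
    echo "$line"
  done
)

nmad_driver_split 'ibverbs:ibv_device=mlx5_0:ibv_port=1'
```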

When opensm is used as subnet manager, the subnet GID must be customized with a value unique to the given subnet, so that nmad is able to automatically detect IB connectivity. As root:

  • create the default opensm config file:
    % opensm -o -c /var/cache/opensm/opensm.opts
    
  • in the above file, customize the line with subnet_prefix to some other value than the factory default 0xfe80000000000000. Set the same subnet GID on all nodes of the subnet.
  • restart opensm:
    % /etc/init.d/infiniband restart
    

PSM2 tuning

By default, nmad sets HFI_NO_CPUAFFINITY=1 if no value was set by the user, to ensure that PSM2 does not interfere with thread binding as set by mpirun. To disable this feature, the user can set HFI_NO_CPUAFFINITY=0 explicitly.

When multiple Omni-Path ports are present, nmad uses by default psm2 automatic port selection. A specific port may be selected by using the port attribute, e.g.:

% padico-launch -n 2 -DNMAD_DRIVER=psm2:port=2 nm_sr_hello

to select the second port. Ports are numbered from 1. port=0 enables the automatic port selection.

Launcher advanced tuning

The appropriate launcher to use is usually selected automatically. For testing and debugging, it may be forced using environment variable NM_LAUNCHER. Valid values are:

  • 'madico': use PadicoTM as launcher. Launch nodes through ssh.
  • 'pmi2': use slurm PMI2
  • 'pmix': use slurm PMIx
  • 'single': single node
  • 'cmdline': processes are launched by user, connection information is given on command-line. This launcher is able to launch only 2 processes per job.

The default is 'pmix' if PMIx is detected in the job, 'pmi2' if a slurm job is detected with pmi2 enabled, 'madico' if we detect job was launched with 'padico-launch' and neither PMI2 nor PMIx are available, 'single' if nothing else is available. 'cmdline' is never selected by default and should be used only for debug.
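The default selection order described above can be summarized by this sketch. The detection itself is done inside nmad and is not reproduced here: the three arguments are plain 1/0 flags supplied by the caller, and the function name nm_launcher_pick is hypothetical.

```shell
# $1: PMIx detected in the job (1/0)
# $2: slurm job with pmi2 enabled (1/0)
# $3: job started with padico-launch (1/0)
nm_launcher_pick() {
  if [ -n "${NM_LAUNCHER:-}" ]; then
    echo "$NM_LAUNCHER"    # explicit override, for testing and debugging
  elif [ "$1" = 1 ]; then
    echo pmix
  elif [ "$2" = 1 ]; then
    echo pmi2
  elif [ "$3" = 1 ]; then
    echo madico
  else
    echo single            # 'cmdline' is never chosen automatically
  fi
}

nm_launcher_pick 0 1 1
```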

In case PMIx is not installed in system directories, a full path may be given to configure with --with-pmix=/full/path, or it may be set globally through the PMIX_ROOT environment variable.

Multicast tuning

The default routing tree for the multicast interface is a binomial one. You can change it with the environment variable NMAD_MCAST_TREE set to binary, 3ary, 4ary, 8ary, binomial, 3nomial, 4nomial, 8nomial, flat, chain, bitree, ladder or simply default. The default choice will use 4nomial for messages < 32kB then binomial for larger messages.
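The size-dependent behaviour of the default choice can be written as a one-line rule. This is a sketch only: the function name nmad_mcast_default_tree is hypothetical, and reading 32kB as 32768 bytes is an assumption.

```shell
# $1: message size in bytes; prints the tree the 'default' choice would use.
nmad_mcast_default_tree() {
  if [ "$1" -lt 32768 ]; then   # assumption: 32kB means 32 * 1024 bytes
    echo 4nomial
  else
    echo binomial
  fi
}

nmad_mcast_default_tree 4096
nmad_mcast_default_tree 1048576
```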

When the bitree routing tree is selected, the multicast tree will be split in two trees. You have to provide the characteristics of the bitrees with environment variables:

  • the type of the first tree (containing recipients with the higher priorities if priorities are used) with the environment variable NMAD_MCAST_BITREE_FIRST. It can take the same values as NMAD_MCAST_TREE (except bitree).
  • similarly, the type of the second tree (containing remaining recipients) with NMAD_MCAST_BITREE_SECOND.
  • the number of recipients in the first tree has to be set with the environment variable NMAD_MCAST_BITREE_THRESOLD. If the number of recipients in the whole multicast is lower than this value, only the first tree type will be used.
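The way the threshold divides recipients between the two trees can be sketched as follows. This is an illustration of the description above, not nmad code: the function name bitree_split is hypothetical, and the recipients are assumed to be passed already ordered from highest to lowest priority.

```shell
# $1: threshold (value of NMAD_MCAST_BITREE_THRESOLD)
# remaining arguments: recipients, highest priority first (assumed ordering)
bitree_split() {
  threshold=$1
  shift
  first=""
  second=""
  i=0
  for recipient in "$@"; do
    if [ "$i" -lt "$threshold" ]; then
      first="$first $recipient"    # goes to the tree of type NMAD_MCAST_BITREE_FIRST
    else
      second="$second $recipient"  # goes to the tree of type NMAD_MCAST_BITREE_SECOND
    fi
    i=$((i + 1))
  done
  echo "first:$first"
  echo "second:$second"
}

bitree_split 2 n1 n2 n3 n4
```

When there are fewer recipients than the threshold, the second list is empty and only the first tree type is used.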

If you are not using bitrees, you can use the delegate option, which has to be enabled with the environment variable NMAD_MCAST_DELEGATE=1. With this option, the root node of a multicast sends data to the first recipient node and lets this recipient manage the rest of the multicast (perform a binomial tree if this kind of tree is selected, etc).

By default, broadcasting trees are reordered to take into account message priorities. You can disable this reordering with the environment variable NMAD_MCAST_REORDER_TREE=0.

Simulation with simgrid

Simulation may be performed by compiling NewMadeleine with support for simgrid. To do so, the requirements are:

  • simgrid installed (tested with simgrid >= 3.31)
  • support for dladdr() in the libc (GNU extension, glibc only)
  • objdump
  • Nix patchelf (>= 0.18)

NewMadeleine must be configured with --with-simgrid.

Compilation of user code is done as usual, with mpicc for MPI code or by using pkg-config for native NewMadeleine code. Note that building with -fPIC and linking with -shared are forced, so as to generate a dynamically loadable object instead of a plain binary. This should be transparent for configure scripts and makefiles as long as they do not try to run the binary (which is actually a dynamic object).

Then launching must be done with nm_simgrid_run instead of padico-launch or mpirun. See nm_simgrid_run -h for help on accepted parameters.

NewMadeleine will automatically perform global symbols privatization, and supports dynamic linking. Dynamic libraries that need to be privatized must be declared with -lib to nm_simgrid_run. To do so, libraries are automatically duplicated, so enough disk space must be available in <prefix>/var/tmp/. To start a large number of simulated nodes (several hundreds), it may be necessary to increase /proc/sys/vm/max_map_count.

Documentation

To generate doxygen documentation:

% cd $prefix/build/nmad
% make docs

It is available online at https://pm2.gitlabpages.inria.fr/pm2/nmad/doc/.