This document describes MadMPI benchmark installation and configuration.
For any question, contact Alexandre.Denis@inria.fr
For more information, see: http://pm2.gitlabpages.inria.fr/mpibenchmark/
A quick cheat sheet for the impatient:

  ./configure
  make
  mpiexec -n 2 -host host1,host2 ./mpi_bench_suite_overlap | tee out.dat

It runs in 10 minutes to 2 hours, depending on network speed. Then build the performance report using:

  ./mpi_bench_extract out.dat

It outputs data in out.dat.d/
It is possible to transfer the data to another host and extract the performance report with another installation of MadMPI benchmark, so as to avoid installing gnuplot on the computing nodes.
Please send the out.dat file to Alexandre.Denis@inria.fr to have it integrated into the MadMPI benchmark web site.
MadMPI benchmark follows the usual autoconf procedure:

  ./configure [your options here]
  make
  make install

The make install step is optional; the benchmark may be run from its build directory. To get help on supported flags for configure, run:

  ./configure --help

Flags that may be of interest are MPICC= to set the command used to build MPI applications, and --prefix= to set the installation path.
Run mpi_bench_suite_overlap on 2 nodes, capture its standard output in a file, and pass this file to mpi_bench_extract. The processed data is output to a ${file}.d/ directory containing:

  - ${bench}-series/${bench}-s${size}.dat
  - ${bench}-ref.dat
  - ${bench}.gp
  - ${bench}.png
  - all.png
Base benchmarks measure the performance of various point-to-point operations:

  - mpi_bench_sendrecv: send/receive pingpong, used as a reference
  - mpi_bench_bidir: bidirectional send/receive pingpong
  - mpi_bench_noncontig: send/receive pingpong with a non-contiguous datatype, used as a reference
  - mpi_bench_send_overhead: processor time consumed on the sender side to send data (the overhead from LogP); useful to explain overlap benchmarks

The full series may be run with mpi_bench_suite_base.
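As an illustration, the pattern measured by mpi_bench_sendrecv is the classic two-rank pingpong. The sketch below is illustrative only, not the benchmark's actual code; the message size and iteration count are arbitrary assumptions.

```c
/* Minimal send/receive pingpong between ranks 0 and 1, the pattern used
 * as a reference by mpi_bench_sendrecv.  Illustrative sketch only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char**argv)
{
    const int len = 1024, rounds = 1000;  /* arbitrary size & iteration count */
    char*buf = malloc(len);
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    const double t0 = MPI_Wtime();
    for(int i = 0; i < rounds; i++)
    {
        if(rank == 0)
        {
            MPI_Send(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        else if(rank == 1)
        {
            MPI_Recv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    if(rank == 0)
        printf("one-way latency: %g usec\n",
               1e6 * (MPI_Wtime() - t0) / (2.0 * rounds)); /* half round-trip */
    free(buf);
    MPI_Finalize();
    return 0;
}
```

Such a program would be launched on 2 nodes with an mpiexec invocation like the one in the cheat sheet above.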
Overlap benchmarks measure the ability to overlap communication with computation:

  - mpi_bench_overlap_sender: overlap on sender side (i.e. MPI_Isend, computation, MPI_Wait), total time
  - mpi_bench_overlap_recv: overlap on receiver side (i.e. MPI_Irecv, computation, MPI_Wait), total time
  - mpi_bench_overlap_bidir: overlap on both sides
  - mpi_bench_overlap_sender_noncontig: overlap on sender side, with a non-contiguous datatype
  - mpi_bench_overlap_send_overhead: overlap on sender side (i.e. MPI_Isend, computation, MPI_Wait), time measured on sender side only
  - mpi_bench_overlap_Nload: overlap on sender side, with a multi-threaded computation load

The full series may be run with mpi_bench_suite_overlap.
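The sender-side pattern (MPI_Isend, computation, MPI_Wait) can be sketched as below. This is not the benchmark's code: the compute_us helper, the message size, and the computation duration are illustrative assumptions. The idea is that when communication progresses in the background, the total time approaches the maximum of communication and computation time rather than their sum.

```c
/* Sketch of the sender-side overlap pattern measured by
 * mpi_bench_overlap_sender: MPI_Isend, computation, MPI_Wait. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* busy-loop for roughly the given duration (hypothetical helper) */
static void compute_us(double usec)
{
    const double t0 = MPI_Wtime();
    while((MPI_Wtime() - t0) * 1e6 < usec)
        ; /* computation that could overlap with communication */
}

int main(int argc, char**argv)
{
    const int len = 1024 * 1024;  /* arbitrary message size */
    char*buf = malloc(len);
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    const double t0 = MPI_Wtime();
    if(rank == 0)
    {
        MPI_Request req;
        MPI_Isend(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req);
        compute_us(500.0);  /* computation of known duration */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        /* total time ~ max(comm, compute) when overlap works,
         * ~ comm + compute when it does not */
        printf("total time on sender: %g usec\n", 1e6 * (MPI_Wtime() - t0));
    }
    else if(rank == 1)
    {
        MPI_Recv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}
```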
Each mpi_bench_coll_* benchmark measures the performance of the given collective operation. Synchronization uses synchronized clocks.
The full series may be run with mpi_bench_suite_coll.
These benchmarks measure scalability with a large number of requests:

  - mpi_bench_reqs_burst: sends bursts of N non-blocking requests, on the same tag, matched in order
  - mpi_bench_reqs_tags: sends bursts of N non-blocking requests, on different tags, in the same order on sender and receiver
  - mpi_bench_reqs_shuffle: sends bursts of N non-blocking requests on random tags
  - mpi_bench_reqs_anysrc: same as shuffle, but received through MPI_ANY_SOURCE requests
  - mpi_bench_reqs_test: same as shuffle, with completion through MPI_Test

The full series may be run with mpi_bench_suite_reqs.
In the results, the size column is actually the number of requests.
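The burst pattern can be sketched as below: one rank posts N non-blocking sends on a single tag, the other posts N matching receives, and both wait for the whole burst with MPI_Waitall. This is a sketch of the pattern only, not the benchmark's code; N and the message size are arbitrary assumptions.

```c
/* Sketch of the pattern measured by mpi_bench_reqs_burst:
 * a burst of N non-blocking requests on the same tag, matched in order. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char**argv)
{
    const int N = 256, len = 8;  /* arbitrary burst size & message size */
    char*buf = malloc(N * len);
    MPI_Request*reqs = malloc(N * sizeof(MPI_Request));
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    const double t0 = MPI_Wtime();
    for(int i = 0; i < N; i++)
    {
        if(rank == 0)
            MPI_Isend(&buf[i * len], len, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &reqs[i]);
        else if(rank == 1)
            MPI_Irecv(&buf[i * len], len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &reqs[i]);
    }
    MPI_Waitall(N, reqs, MPI_STATUSES_IGNORE); /* wait for the whole burst */
    if(rank == 0)
        printf("burst of %d requests completed in %g usec\n",
               N, 1e6 * (MPI_Wtime() - t0));
    free(reqs);
    free(buf);
    MPI_Finalize();
    return 0;
}
```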
These benchmarks measure the performance of one-sided (RMA) operations.
The full series may be run with mpi_bench_suite_rma.
These benchmarks measure the performance of features related to multi-threading:

  - mpi_bench_thread_1toN_rr: sends data from a single thread on the sender side to N threads on the receiver side, with a round-robin strategy
  - mpi_bench_thread_1toN_single: sends data from a single thread on the sender side, with N receives posted on the receiver side, but only a single thread actually matching
  - mpi_bench_thread_NtoN: sends data from N threads on the sender to N threads on the receiver (parallel ping-pongs)

The full series may be run with mpi_bench_suite_thread.
These benchmarks measure system noise caused by MPI:

  - mpi_bench_noise_nocomm: performs some computation without any communication
  - mpi_bench_noise_posted_recv: performs the same computation with a posted MPI_Irecv
Feed the full output of a given benchmark series into mpi_bench_extract to get split files for each benchmark and automatically generated graphs (if gnuplot is installed).