MadMPI benchmark


README for MadMPI benchmark

This document describes MadMPI benchmark installation and configuration.

Quick Start

A quick cheat sheet for the impatient:

mpiexec -n 2 -host host1,host2 ./mpi_bench_suite_overlap | tee out.dat

It takes between 10 minutes and 2 hours, depending on network speed. Then build the performance report using:

./mpi_bench_extract out.dat

It writes its output to out.dat.d/. The data may be transferred to another host and the performance report extracted there with another installation of MadMPI benchmark, so that gnuplot does not have to be installed on the computing nodes.
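The transfer described above can be sketched as a small shell function. This is only an illustration under stated assumptions: the function name extract_on_host and the host name are hypothetical, ssh/scp access to the report host is assumed, and mpi_bench_extract is assumed to be in the PATH on that host.

```shell
# Sketch: build the report on a separate host that has gnuplot installed.
# Assumptions (not from the README): ssh/scp access to the report host, and
# mpi_bench_extract available in the PATH there.
extract_on_host() {
    out=$1    # output file captured on the computing nodes, e.g. out.dat
    host=$2   # hypothetical report host
    base=$(basename "$out")
    # Copy the raw output to the home directory of the report host,
    # run the extraction there, then fetch the generated directory back.
    scp "$out" "$host:" &&
        ssh "$host" "mpi_bench_extract $base" &&
        scp -r "$host:$base.d" .
}
```

For example, extract_on_host out.dat report-host would leave an out.dat.d/ directory in the current directory.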

Requirements

  • MPI library
  • autoconf (v 2.50 or later, for git users)
  • hwloc (optional)
  • gnuplot (optional, v5.0 or later)
  • GraphicsMagick (optional)
  • doxygen (optional, for doc generation)
  • OpenMP compiler (optional, for OpenMP+MPI benchmarks)


Installation

MadMPI benchmark follows the usual autoconf procedure:

./configure [your options here]
make install

The make install step is optional. The benchmark may be run from its build directory. To get help on supported flags for configure, run:

./configure --help

Flags that may be of interest are MPICC= to give the name of the command used to build MPI applications, and --prefix= to set the installation path.
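As an illustration of these two flags, here is a minimal sketch that wraps the configure and install steps in one function. The function name build_madmpi_bench and the example values are hypothetical; the compiler wrapper name depends on the MPI library in use.

```shell
# Sketch: configure and install in one step.
# Run from the top of the MadMPI benchmark source tree.
build_madmpi_bench() {
    mpicc=$1    # command used to build MPI applications (site-specific)
    prefix=$2   # installation path (hypothetical example value below)
    ./configure MPICC="$mpicc" --prefix="$prefix" && make install
}

# Example call (values are assumptions, adjust to your site):
#   build_madmpi_bench mpicc "$HOME/madmpi-bench"
```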


Usage

  • Benchmarks may be run separately (single benchmark per binary), or as a binary running a full series.
  • For overlap benchmarks, run mpi_bench_suite_overlap on 2 nodes, capture its standard output in a file, and pass this file to mpi_bench_extract. The processed data is written to a ${file}.d/ directory containing:
    • raw series for each packet size (files ${bench}-series/${bench}-s${size}.dat)
    • 2D data formatted to feed gnuplot pm3d graphs, joined with reference non-overlapped values (files ${bench}-ref.dat)
    • gnuplot scripts (files ${bench}.gp)
    • individual graphs for each benchmark (files ${bench}.png)
    • synthetic graphs (all.png)
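
The directory layout listed above can be spelled out with two small helpers that only build path names. The helper names are hypothetical, and the example assumes that ${bench} is the benchmark binary name (such as mpi_bench_overlap_sender), which the list above does not state explicitly.

```shell
# Sketch: path helpers for the ${file}.d/ layout. They only format names
# and do not check that the files exist.

# Raw series for one packet size.
series_file() {
    file=$1; bench=$2; size=$3
    printf '%s.d/%s-series/%s-s%s.dat\n' "$file" "$bench" "$bench" "$size"
}

# Individual graph for one benchmark.
graph_file() {
    file=$1; bench=$2
    printf '%s.d/%s.png\n' "$file" "$bench"
}
```

For instance, series_file out.dat mpi_bench_overlap_sender 1024 yields the path of the raw series for 1024-byte packets under out.dat.d/.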

Base benchmarks

Base benchmarks measure performance of various point-to-point operations:

  • mpi_bench_sendrecv: send/receive pingpong, used as a reference
  • mpi_bench_bidir: bidirectional send/receive pingpong
  • mpi_bench_noncontig: send/receive pingpong with non-contiguous datatype, used as a reference
  • mpi_bench_send_overhead: processor time consumed on sender side to send data (the overhead from LogP). Useful for explaining the results of overlap benchmarks.

The full series may be run with mpi_bench_suite_base.

Overlap benchmarks

  • mpi_bench_overlap_sender: overlap on sender side (i.e. MPI_Isend, computation, MPI_Wait), total time
  • mpi_bench_overlap_recv: overlap on receiver side (i.e. MPI_Irecv, computation, MPI_Wait), total time
  • mpi_bench_overlap_bidir: overlap on both sides
  • mpi_bench_overlap_sender_noncontig: overlap on sender side, with non-contiguous datatype
  • mpi_bench_overlap_send_overhead: overlap on sender side (i.e. MPI_Isend, computation, MPI_Wait), measure time on sender side only
  • mpi_bench_overlap_Nload: overlap on sender side, with multi-threaded computation load

The full series may be run with mpi_bench_suite_overlap.

Collective benchmarks

Each mpi_bench_coll_* benchmark measures performance of the given collective operation. Synchronization uses synchronized clocks.

The full series may be run with mpi_bench_suite_coll.

Requests benchmarks

These benchmarks measure the scalability with a large number of requests.

  • mpi_bench_reqs_burst sends bursts of N non-blocking requests, on the same tag, matched in order
  • mpi_bench_reqs_tags sends bursts of N non-blocking requests, on different tags, in the same order on sender and receiver
  • mpi_bench_reqs_shuffle sends bursts of N non-blocking requests on random tags
  • mpi_bench_reqs_anysrc is the same as shuffle, but receives through MPI_ANY_SOURCE requests
  • mpi_bench_reqs_test is the same as shuffle, with completion through MPI_Test

The full series may be run with mpi_bench_suite_reqs.

In the results, the "size" column actually holds the number of requests.

RMA benchmarks

The full series may be run with mpi_bench_suite_rma.

Threaded benchmarks

These benchmarks measure the performance of features related to multi-threading.

  • mpi_bench_thread_1toN_rr sends data from a single thread on the sender side, to N threads on the receiver side, with a round-robin strategy.
  • mpi_bench_thread_1toN_single sends data from a single thread on the sender side, with N receives posted on the receiver side, but only a single thread actually matching.
  • mpi_bench_thread_NtoN sends data from N threads on the sender to N threads on the receiver (parallel ping-pongs).

The full series may be run with mpi_bench_suite_thread.

Noise benchmarks

These benchmarks measure system noise caused by MPI.

  • mpi_bench_noise_nocomm performs some computation without any communication
  • mpi_bench_noise_posted_recv performs the same computation with a posted MPI_Irecv

Data extraction

Feed the full output of a given benchmark series into mpi_bench_extract to get split files for each benchmark and automatically generate graphs (if gnuplot is installed).
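
A minimal sketch of this run-then-extract workflow, generalizing the Quick Start command to any suite. The function name run_suite and the variables MPIEXEC, BENCH_DIR and NP are hypothetical conveniences. The default of 2 processes mirrors the Quick Start command; other suites, collectives in particular, may call for a different process count.

```shell
# Sketch: run one benchmark suite, capture its output, and extract the data.
MPIEXEC=${MPIEXEC:-mpiexec}   # MPI launcher
BENCH_DIR=${BENCH_DIR:-.}     # directory holding the benchmark binaries
NP=${NP:-2}                   # process count (assumption, see above)

run_suite() {
    suite=$1   # one of: base overlap coll reqs rma thread
    hosts=$2   # comma-separated host list, e.g. host1,host2
    out=$suite.dat
    # Capture the full standard output, then split it per benchmark.
    "$MPIEXEC" -n "$NP" -host "$hosts" "$BENCH_DIR/mpi_bench_suite_$suite" | tee "$out" &&
        "$BENCH_DIR/mpi_bench_extract" "$out"
}

# Example: run_suite overlap host1,host2   (results land in overlap.dat.d/)
```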