This document describes MadMPI benchmark installation and configuration.
For any question, contact Alexandre.Denis@inria.fr
For more information, see: http://pm2.gitlabpages.inria.fr/mpibenchmark/
A quick cheat sheet for the impatient:

  ./configure
  make
  mpiexec -n 2 -host host1,host2 ./mpi_bench_suite_overlap | tee out.dat

It runs in 10 minutes to 2 hours, depending on network speed. Then build the performance report using:

  ./mpi_bench_extract out.dat

It outputs data in out.dat.d/
It is possible to transfer the data to another host and extract the performance report with another installation of MadMPI benchmark, so as to avoid installing gnuplot on the computing nodes.
Please send the out.dat file to Alexandre.Denis@inria.fr to have it integrated into the MadMPI benchmark web site.
MadMPI benchmark follows the usual autoconf procedure:

  ./configure [your options here]
  make
  make install

The make install step is optional; the benchmark may be run from its build directory. To get help on supported flags for configure, run:

  ./configure --help

Flags that may be of interest are MPICC= to set the command used to build MPI applications, and --prefix= to set the installation path.
Run mpi_bench_suite_overlap on 2 nodes, capture its standard output in a file, and pass this file to mpi_bench_extract. The processed data is output to a ${file}.d/ directory containing:

  - ${bench}-series/${bench}-s${size}.dat
  - ${bench}-ref.dat
  - ${bench}.gp
  - ${bench}.png
  - all.png
Base benchmarks measure the performance of various point-to-point operations:

  - mpi_bench_sendrecv: send/receive pingpong, used as a reference
  - mpi_bench_bidir: bidirectional send/receive pingpong
  - mpi_bench_noncontig: send/receive pingpong with a non-contiguous datatype, used as a reference
  - mpi_bench_send_overhead: processor time consumed on the sender side to send data (the overhead from LogP); useful to explain overlap benchmarks

The full series may be run with mpi_bench_suite_base.
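As an illustration, the pattern measured by mpi_bench_sendrecv is the classic two-rank pingpong. The sketch below is illustrative only, not the benchmark's actual code; the message size and iteration count are arbitrary assumptions.

```c
/* Minimal send/receive pingpong between ranks 0 and 1, the pattern used
 * as a reference by mpi_bench_sendrecv.  Illustrative sketch only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char**argv)
{
    const int len = 1024, rounds = 1000;  /* arbitrary size & iteration count */
    char*buf = malloc(len);
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    const double t0 = MPI_Wtime();
    for(int i = 0; i < rounds; i++)
    {
        if(rank == 0)
        {
            MPI_Send(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        else if(rank == 1)
        {
            MPI_Recv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    if(rank == 0)
        printf("one-way latency: %g usec\n",
               1e6 * (MPI_Wtime() - t0) / (2.0 * rounds)); /* half round-trip */
    free(buf);
    MPI_Finalize();
    return 0;
}
```

Such a program would be launched on 2 nodes with an mpiexec invocation like the one in the cheat sheet above.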
Overlap benchmarks measure the ability to overlap communication with computation:

  - mpi_bench_overlap_sender: overlap on sender side (i.e. MPI_Isend, computation, MPI_Wait), total time
  - mpi_bench_overlap_recv: overlap on receiver side (i.e. MPI_Irecv, computation, MPI_Wait), total time
  - mpi_bench_overlap_bidir: overlap on both sides
  - mpi_bench_overlap_sender_noncontig: overlap on sender side, with a non-contiguous datatype
  - mpi_bench_overlap_send_overhead: overlap on sender side (i.e. MPI_Isend, computation, MPI_Wait), time measured on sender side only
  - mpi_bench_overlap_Nload: overlap on sender side, with a multi-threaded computation load

The full series may be run with mpi_bench_suite_overlap.
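The sender-side pattern (MPI_Isend, computation, MPI_Wait) can be sketched as below. This is not the benchmark's code: the compute_us helper, the message size, and the computation duration are illustrative assumptions. The idea is that when communication progresses in the background, the total time approaches the maximum of communication and computation time rather than their sum.

```c
/* Sketch of the sender-side overlap pattern measured by
 * mpi_bench_overlap_sender: MPI_Isend, computation, MPI_Wait. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* busy-loop for roughly the given duration (hypothetical helper) */
static void compute_us(double usec)
{
    const double t0 = MPI_Wtime();
    while((MPI_Wtime() - t0) * 1e6 < usec)
        ; /* computation that could overlap with communication */
}

int main(int argc, char**argv)
{
    const int len = 1024 * 1024;  /* arbitrary message size */
    char*buf = malloc(len);
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    const double t0 = MPI_Wtime();
    if(rank == 0)
    {
        MPI_Request req;
        MPI_Isend(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req);
        compute_us(500.0);  /* computation of known duration */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        /* total time ~ max(comm, compute) when overlap works,
         * ~ comm + compute when it does not */
        printf("total time on sender: %g usec\n", 1e6 * (MPI_Wtime() - t0));
    }
    else if(rank == 1)
    {
        MPI_Recv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}
```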
Each mpi_bench_coll_* benchmark measures the performance of the given collective operation. Synchronization uses synchronized clocks.
The full series may be run with mpi_bench_suite_coll.
These benchmarks measure scalability with a large number of requests:

  - mpi_bench_reqs_burst: sends bursts of N non-blocking requests, on the same tag, matched in order
  - mpi_bench_reqs_tags: sends bursts of N non-blocking requests, on different tags, in the same order on sender and receiver
  - mpi_bench_reqs_shuffle: sends bursts of N non-blocking requests on random tags
  - mpi_bench_reqs_anysrc: same as shuffle, but received through MPI_ANY_SOURCE requests
  - mpi_bench_reqs_test: same as shuffle, with completion through MPI_Test

The full series may be run with mpi_bench_suite_reqs.
In the results, the size column is actually the number of requests.
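The burst pattern can be sketched as below: one rank posts N non-blocking sends on a single tag, the other posts N matching receives, and both wait for the whole burst with MPI_Waitall. This is a sketch of the pattern only, not the benchmark's code; N and the message size are arbitrary assumptions.

```c
/* Sketch of the pattern measured by mpi_bench_reqs_burst:
 * a burst of N non-blocking requests on the same tag, matched in order. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char**argv)
{
    const int N = 256, len = 8;  /* arbitrary burst size & message size */
    char*buf = malloc(N * len);
    MPI_Request*reqs = malloc(N * sizeof(MPI_Request));
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    const double t0 = MPI_Wtime();
    for(int i = 0; i < N; i++)
    {
        if(rank == 0)
            MPI_Isend(&buf[i * len], len, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &reqs[i]);
        else if(rank == 1)
            MPI_Irecv(&buf[i * len], len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &reqs[i]);
    }
    MPI_Waitall(N, reqs, MPI_STATUSES_IGNORE); /* wait for the whole burst */
    if(rank == 0)
        printf("burst of %d requests completed in %g usec\n",
               N, 1e6 * (MPI_Wtime() - t0));
    free(reqs);
    free(buf);
    MPI_Finalize();
    return 0;
}
```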
These benchmarks measure the performance of one-sided (RMA) operations.
The full series may be run with mpi_bench_suite_rma.
These benchmarks measure the performance of features related to multi-threading:

  - mpi_bench_thread_1toN_rr: sends data from a single thread on the sender side to N threads on the receiver side, with a round-robin strategy
  - mpi_bench_thread_1toN_single: sends data from a single thread on the sender side, with N receives posted on the receiver side, but only a single thread actually matching
  - mpi_bench_thread_NtoN: sends data from N threads on the sender to N threads on the receiver (parallel ping-pongs)

The full series may be run with mpi_bench_suite_thread.
These benchmarks measure system noise caused by MPI:

  - mpi_bench_noise_nocomm: performs some computation without any communication
  - mpi_bench_noise_posted_recv: performs the same computation with a posted MPI_Irecv
Feed the full output of a given benchmark series into mpi_bench_extract to get split files for each benchmark and automatically generated graphs (if gnuplot is installed).