A driver is a component that exhibits an interface of type NewMad_minidriver (struct nm_minidriver_iface_s).
Only a subset of the functions is required.
A driver with a context corresponds to a NIC. The same driver may be used in multiple contexts in case of multi-rail. An instance is created for each gate. Drivers are fully encapsulated and may be used in various places. They must not depend on nmad core nor include any private nmad header.
Init
- getprops: the function is used to get properties of the driver. It is mandatory that all drivers define this function. In the properties, the field 'hints' contains hints given by upper layers for the driver; all other fields are initialized to 0 by the caller and are supposed to be filled in by the driver: struct nm_minidriver_capabilities_s exposes the capabilities of the driver; struct nm_drv_profile_s gives hints about nominal performance. This function is always called first, before init. It should be called exactly once. Additional hints are available as context attributes: session_size, rank, wide_url_support.
- init: the function is used to initialize the driver in the given context. It is mandatory that all drivers define this function. It returns a url that the launcher will provide to other nodes. The url is per-context. If a driver needs per-instance urls, it is its responsibility to manage them through the unique per-context url. If wide_url_support is granted by the launcher, the driver may initialize a vector of urls, packed as a single wide per-context url. It must still support the case where a non-scalable launcher (e.g. PMI2) does not support wide urls.
- close: the function closes a driver in the given context. It is always called last, after all instances have been disconnected. This function may be left NULL. It is an error to call any function of the driver after close has been called.
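The call order described above (getprops first, then init, then close last) can be illustrated with a minimal sketch. All types and functions prefixed with sketch_ are hypothetical stand-ins invented for this example; they are not the real nmad declarations, and the url format and property fields are assumptions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-ins; NOT the real nmad declarations. */
struct sketch_props {
  int hints;           /* set by the caller (upper layers) */
  int supports_iovec;  /* filled in by the driver */
  int max_msg_size;    /* filled in by the driver (assumed field) */
};

struct sketch_context {
  int  props_done;     /* getprops has been called */
  int  initialized;    /* init has been called, close has not */
  char url[64];        /* per-context url returned by init */
};

/* Always called first, exactly once per context. */
static void sketch_getprops(struct sketch_context *ctx, struct sketch_props *props)
{
  assert(!ctx->props_done);
  props->supports_iovec = 1;
  props->max_msg_size = 4096;
  ctx->props_done = 1;
}

/* Called after getprops; returns the per-context url. */
static const char *sketch_init(struct sketch_context *ctx, int rank)
{
  assert(ctx->props_done && !ctx->initialized);
  snprintf(ctx->url, sizeof(ctx->url), "sketch://node%d", rank);
  ctx->initialized = 1;
  return ctx->url;
}

/* Called last; no driver function may be called afterwards. */
static void sketch_close(struct sketch_context *ctx)
{
  assert(ctx->initialized);
  ctx->initialized = 0;
}

/* Caller side: zero every field except hints, then run the sequence. */
static int sketch_lifecycle(int rank, char *url_out, size_t url_len)
{
  struct sketch_context ctx;
  struct sketch_props props;
  memset(&ctx, 0, sizeof(ctx));
  memset(&props, 0, sizeof(props));
  props.hints = 1;
  sketch_getprops(&ctx, &props);
  const char *url = sketch_init(&ctx, rank);
  snprintf(url_out, url_len, "%s", url);
  sketch_close(&ctx);
  return props.max_msg_size;
}
```

Calling sketch_lifecycle(3, url, sizeof(url)) walks through the three steps in order and copies the per-context url out before the context is closed.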
Connection establishment
Connection establishment is asynchronous: connect_async() is called on all nodes for each gate, then connect_wait() is called for each gate. Multiple connection establishments may run at the same time. Function connect_async() must not block. A driver supporting fully asynchronous connection establishment is allowed to define only connect_async() and leave connect_wait() as NULL.
Function disconnect() is optional and may be left NULL. It is guaranteed that no communication will take place on a given link after disconnect.
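The two-phase scheme can be sketched as follows: start every connection without blocking, then wait for each one. The sketch_ types, the state machine and the function-pointer typedef are assumptions made for illustration, not the nmad interface. How completion is observed when connect_wait() is NULL is not modeled here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-gate instance state; not the real nmad types. */
enum sketch_state { SKETCH_IDLE, SKETCH_CONNECTING, SKETCH_CONNECTED };
struct sketch_instance { enum sketch_state state; };
typedef void (*sketch_connect_fn)(struct sketch_instance *inst);

/* Phase 1: only starts the connection, never blocks. */
static void sketch_connect_async(struct sketch_instance *inst)
{
  assert(inst->state == SKETCH_IDLE);
  inst->state = SKETCH_CONNECTING;
}

/* Phase 2: a real driver would block here until the peer has answered. */
static void sketch_connect_wait(struct sketch_instance *inst)
{
  assert(inst->state == SKETCH_CONNECTING);
  inst->state = SKETCH_CONNECTED;
}

/* Start all connections first so they overlap, then wait for each one.
 * wait_fn may be NULL; in that case phase 2 is skipped.
 * Returns the number of instances in the CONNECTED state. */
static size_t sketch_connect_all(struct sketch_instance *insts, size_t n,
                                 sketch_connect_fn async_fn,
                                 sketch_connect_fn wait_fn)
{
  size_t i, connected = 0;
  for (i = 0; i < n; i++)
    async_fn(&insts[i]);
  if (wait_fn != NULL)
    for (i = 0; i < n; i++)
      wait_fn(&insts[i]);
  for (i = 0; i < n; i++)
    if (insts[i].state == SKETCH_CONNECTED)
      connected++;
  return connected;
}
```

Issuing every connect_async() before the first connect_wait() is what lets several connection establishments progress at the same time.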
Packets
All send/receive operations are based on struct nm_pkt_s, which represents a packet. It contains fields for:
- data: buf, iov and data contain various representations of the data to send/receive. The one actually used depends on the capabilities of the driver, i.e. whether flags supports_data, supports_buf_send, supports_buf_recv, supports_iovec are set. If no flag is set at all, contiguous data will be used (i.e. iov with count = 1).
- rdv_data: this field is filled by the receiver and read by the sender. It enables the driver to piggy-back some data with the rendez-vous. It is transported by the higher layers. On the receiver side, it is allocated by the driver itself and must be freed by the driver upon request completion. On the sender side, it is passed to the driver and remains allocated during the whole pkt life.
- driver-related fields: the driver may store its own state associated with the packet in the packet itself. The generic field is driver_data. In addition, send_prefetch and recv_prefetch may be used by drivers supporting prefetch (see below).
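As an illustration of how the capability flags could drive the choice of data representation on the send side, here is a minimal sketch. The flag names come from the text above; the struct sketch_caps layout, the enum, the small-packet threshold and the priority order between representations are assumptions made for this example only.

```c
#include <assert.h>
#include <stddef.h>

/* Which representation of the data the caller fills in the pkt. */
enum sketch_repr {
  SKETCH_REPR_BUF,     /* driver-provided buffer */
  SKETCH_REPR_DATA,    /* data layout descriptor */
  SKETCH_REPR_IOV,     /* iovec with several entries */
  SKETCH_REPR_CONTIG   /* iov with a single entry */
};

/* Hypothetical mirror of the capability flags named in the text. */
struct sketch_caps {
  int supports_data;
  int supports_buf_send;
  int supports_buf_recv;
  int supports_iovec;
};

/* Pick a send representation. The priority order is an assumption:
 * buffers for small packets, then data descriptors, then iovecs,
 * and contiguous data when no flag is set at all. */
static enum sketch_repr sketch_pick_send_repr(const struct sketch_caps *caps,
                                              size_t len, size_t small_threshold)
{
  assert(caps != NULL);
  if (caps->supports_buf_send && len <= small_threshold)
    return SKETCH_REPR_BUF;
  if (caps->supports_data)
    return SKETCH_REPR_DATA;
  if (caps->supports_iovec)
    return SKETCH_REPR_IOV;
  return SKETCH_REPR_CONTIG;
}
```

With every flag cleared, the function falls through to the contiguous case, which matches the default stated above.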
Sending
Three methods are available to initiate a send operation: buffer-based, iov-based, data-based. Two methods are available to manage completion: polling and blocking wait. A driver must provide at least one method to initiate a send, must provide polling, and may optionally provide blocking wait. Methods to initiate a send:
- buffer-based: the user calls send_pkt_buf_get() to get a buffer, the driver fills the buf field in the pkt, then the user fills the buffer and calls send_pkt_buf_post() to send it. The driver thus has the opportunity to allocate buffers directly in registered memory. This method is only possible for drivers for small packets.
- iov-based: the user fills the iov field in the pkt, then calls send_pkt_post(). If the driver does not support iovecs (capability supports_iovec=0), then send_pkt_post() will always be called with iov.n=1.
- data-based: the user fills the data field in the pkt with a struct nm_data_s that describes the data layout, then calls send_pkt_post().
The implementor should use the method that allows the lowest number of memory copies: buffer-based for small packets on networks with memory registration or shared memory, data-based for large packets that need a memory copy, iov-based (with iovec support, as much as possible) otherwise. The caller must check the capabilities to know which method to use.
For completion, support of send_pkt_poll() is mandatory; support of send_pkt_wait() is optional. Even when using blocking wait, send_pkt_poll() will be called at least once before.
It must be noted that these functions may be called from different threads, but not concurrently for a given pkt/instance.
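The buffer-based send path and the completion rules can be sketched as below. The mock driver, the sketch_ function signatures and the SKETCH_EAGAIN code are hypothetical; only the order of calls (buf_get, fill, buf_post, poll at least once, then optional wait) follows the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_OK      0
#define SKETCH_EAGAIN  1   /* hypothetical "not completed yet" code */

struct sketch_pkt {
  char  *buf;          /* filled by the driver in buf_get */
  size_t len;
  int    polls_left;   /* mock: send completes after this many polls */
};

struct sketch_driver {
  char   pool[256];    /* stands in for pre-registered memory */
  char   wire[256];    /* stands in for the network */
  size_t wire_len;
  int    has_wait;     /* whether blocking wait is provided */
};

static void sketch_send_pkt_buf_get(struct sketch_driver *d, struct sketch_pkt *pkt, size_t *max_len)
{
  pkt->buf = d->pool;
  *max_len = sizeof(d->pool);
}

static void sketch_send_pkt_buf_post(struct sketch_driver *d, struct sketch_pkt *pkt, size_t len)
{
  assert(len <= sizeof(d->wire));
  memcpy(d->wire, pkt->buf, len);
  d->wire_len = len;
  pkt->len = len;
  pkt->polls_left = 2;
}

static int sketch_send_pkt_poll(struct sketch_driver *d, struct sketch_pkt *pkt)
{
  (void)d;
  if (pkt->polls_left > 0) { pkt->polls_left--; return SKETCH_EAGAIN; }
  return SKETCH_OK;
}

static int sketch_send_pkt_wait(struct sketch_driver *d, struct sketch_pkt *pkt)
{
  (void)d;
  pkt->polls_left = 0;   /* a real driver would block until completion */
  return SKETCH_OK;
}

/* User side: get a buffer, fill it, post it, then complete the send. */
static int sketch_send_small(struct sketch_driver *d, const void *msg, size_t len)
{
  struct sketch_pkt pkt = { NULL, 0, 0 };
  size_t max_len = 0;
  sketch_send_pkt_buf_get(d, &pkt, &max_len);
  if (len > max_len)
    return -1;                               /* too large for this method */
  memcpy(pkt.buf, msg, len);
  sketch_send_pkt_buf_post(d, &pkt, len);
  int rc = sketch_send_pkt_poll(d, &pkt);    /* always polled at least once */
  if (rc == SKETCH_EAGAIN && d->has_wait)
    rc = sketch_send_pkt_wait(d, &pkt);
  while (rc == SKETCH_EAGAIN)
    rc = sketch_send_pkt_poll(d, &pkt);
  return rc;
}
```

Because the user writes straight into the driver-owned buffer, the only copy in this path is the one the mock uses to simulate the wire.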
Receiving
Three methods are available to initiate a recv operation: buffer-based, iov-based, data-based. Two methods are available to manage completion: polling and blocking wait. A driver must provide at least one method to initiate a recv, must provide polling, and may optionally provide blocking wait. For scalability, before posting a recv request on a given instance, it is possible to ask globally (context-wide) which instance has data available for receive. This may be done in a non-blocking way with recv_probe_any() or blocking with recv_wait_any(). After one of these functions returns, it is guaranteed that posting a receive on the returned instance will succeed immediately. This mechanism is reserved to drivers for small packets, without rendez-vous.
Methods to initiate a recv:
- buffer-based: the user calls recv_pkt_poll() (without posting anything before). If there is a pending packet, the driver fills the buf field of the pkt with incoming data. Then, the user calls recv_pkt_buf_release() to give the buffer back to the driver.
- iov-based: the user fills the iov field of the pkt, then calls recv_pkt_post() to submit it. Completion may be polled with recv_pkt_poll() or waited for with recv_pkt_wait(). If the driver does not support iovecs (capability supports_iovec=0), then recv_pkt_post() will always be called with iov.n=1. Please note that capability supports_iovec is global, for both sending and receiving.
- data-based: as for iovec, the user calls recv_pkt_post(), then polls its completion with recv_pkt_poll() or waits with recv_pkt_wait(), with the difference that the data field of pkt is used instead of iov.
For completion, support of recv_pkt_poll() is mandatory for iov-based and data-based methods; blocking wait is optional.
Upon termination, the user may want to cancel pending requests:
- recv_pkt_cancel() cancels a posted recv_pkt_post()
- recv_cancel_any() cancels a recv_wait_any() that may be running at the same time in another thread.
Prefetching
To optimize transfer on networks that require memory registration, it is possible to allow speculative registration on the sender or receiver side while a rendez-vous is still in progress. The user calls send_pkt_prefetch() when it is likely that the message will be accepted on the given network, before having received the RTR, and recv_pkt_prefetch() when it is likely that the message will arrive through the given network, before having received the rendez-vous request. In case the data arrives through another network or with a different layout (multiple chunks), functions send_pkt_unfetch() and recv_pkt_unfetch() are called to cancel the prefetch. The driver may match calls to send_pkt_prefetch() with the corresponding send_pkt_post() through the pkt.
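One possible way for a driver to match a prefetch with the later post through the pkt is sketched below, for the sender side. The registration mock, the counters and the type of the send_prefetch field are assumptions for illustration; a real driver would call its network's memory registration functions instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Mock memory registration handle; hypothetical. */
struct sketch_reg { const void *addr; size_t len; };

struct sketch_pkt {
  const void        *ptr;
  size_t             len;
  struct sketch_reg *send_prefetch;  /* set by prefetch, consumed by post */
};

struct sketch_driver {
  int registrations;   /* total number of registrations performed */
  int live;            /* registrations not yet released */
};

static struct sketch_reg *sketch_register(struct sketch_driver *d, const void *ptr, size_t len)
{
  struct sketch_reg *reg = malloc(sizeof(*reg));
  assert(reg != NULL);
  reg->addr = ptr;
  reg->len = len;
  d->registrations++;
  d->live++;
  return reg;
}

static void sketch_unregister(struct sketch_driver *d, struct sketch_reg *reg)
{
  free(reg);
  d->live--;
}

/* Speculative registration, before the RTR has been received. */
static void sketch_send_pkt_prefetch(struct sketch_driver *d, struct sketch_pkt *pkt)
{
  pkt->send_prefetch = sketch_register(d, pkt->ptr, pkt->len);
}

/* The speculation was wrong: release the registration. */
static void sketch_send_pkt_unfetch(struct sketch_driver *d, struct sketch_pkt *pkt)
{
  if (pkt->send_prefetch != NULL) {
    sketch_unregister(d, pkt->send_prefetch);
    pkt->send_prefetch = NULL;
  }
}

/* Reuse the prefetched registration if there is one, else register now. */
static void sketch_send_pkt_post(struct sketch_driver *d, struct sketch_pkt *pkt)
{
  struct sketch_reg *reg = pkt->send_prefetch != NULL
    ? pkt->send_prefetch
    : sketch_register(d, pkt->ptr, pkt->len);
  /* ... the actual transfer would happen here ... */
  sketch_unregister(d, reg);
  pkt->send_prefetch = NULL;
}
```

In this sketch a post that follows a prefetch performs no second registration, which is the saving the mechanism aims for.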
Rendez-vous data
The driver may want to piggy-back some data with the RTR messages. In this case, on the receiver side, it may fill the rdv_data field of the pkt in recv_pkt_post(). On the sender side, this data will be available in the rdv_data field of the pkt given to send_pkt_post(). This may be useful to driver implementors to transfer memory registration information or addresses for RDMA. See the description of field rdv_data above.
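The round trip of rdv_data can be sketched as follows: the receiver allocates and fills it, the upper layer carries the bytes, the sender reads them, and the receiver frees its copy on completion. The payload struct, the rdv_len field and the way the upper layer copies the bytes are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical piggy-backed payload: an RDMA target address and key. */
struct sketch_rdv { uintptr_t raddr; uint32_t rkey; };

/* rdv_len is an assumed field telling the upper layer how much to carry. */
struct sketch_pkt { void *rdv_data; size_t rdv_len; };

/* Receiver side: the driver allocates and fills rdv_data. */
static void sketch_recv_pkt_post(struct sketch_pkt *pkt, void *dest)
{
  struct sketch_rdv *rdv = malloc(sizeof(*rdv));
  assert(rdv != NULL);
  rdv->raddr = (uintptr_t)dest;
  rdv->rkey = 0x1234u;
  pkt->rdv_data = rdv;
  pkt->rdv_len = sizeof(*rdv);
}

/* Receiver side: the driver frees rdv_data upon request completion. */
static void sketch_recv_complete(struct sketch_pkt *pkt)
{
  free(pkt->rdv_data);
  pkt->rdv_data = NULL;
  pkt->rdv_len = 0;
}

/* Sender side: rdv_data is only read, never freed by the driver. */
static struct sketch_rdv sketch_send_pkt_post(const struct sketch_pkt *pkt)
{
  struct sketch_rdv target;
  assert(pkt->rdv_len == sizeof(target));
  memcpy(&target, pkt->rdv_data, sizeof(target));
  return target;
}

/* Simulates the higher layers carrying rdv_data inside the RTR. */
static struct sketch_rdv sketch_rdv_roundtrip(void *dest)
{
  struct sketch_pkt rpkt = { NULL, 0 };
  unsigned char rtr[sizeof(struct sketch_rdv)];
  sketch_recv_pkt_post(&rpkt, dest);
  memcpy(rtr, rpkt.rdv_data, rpkt.rdv_len);   /* transport by upper layer */
  struct sketch_pkt spkt = { rtr, sizeof(rtr) };
  struct sketch_rdv seen = sketch_send_pkt_post(&spkt);
  sketch_recv_complete(&rpkt);
  return seen;
}
```

The sender in this sketch learns the destination address and key without any extra message, since they travel with the RTR.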
Errors
Errors may be returned when any function is called. Global errors likely to be returned by any of the functions are as follows:
- -NM_EBROKEN the connection is broken. The peer is now considered unreachable.
- -NM_ECLOSED trying to send/receive on a closed connection.
- -NM_ENOTIMPL this feature is not implemented by the driver. This is a non-fatal error.
- -NM_EINVAL an invalid parameter was given to the function. This is a non-fatal error.
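A caller could dispatch on these error codes as in the sketch below. The numeric values are placeholders (the real NM_E* constants are defined by nmad), and the actions chosen, in particular for the closed-connection case, are assumptions. Only the non-fatal status of the last two errors and the unreachable peer on a broken connection come from the list above.

```c
#include <assert.h>

/* Placeholder values; the real NM_E* constants are defined by nmad. */
enum {
  SKETCH_ESUCCESS = 0,
  SKETCH_EBROKEN  = 1,
  SKETCH_ECLOSED  = 2,
  SKETCH_ENOTIMPL = 3,
  SKETCH_EINVAL   = 4
};

/* What the caller does next; hypothetical policy for illustration. */
enum sketch_action {
  SKETCH_CONTINUE,    /* no error */
  SKETCH_REPORT,      /* non-fatal: report and carry on */
  SKETCH_DROP_PEER,   /* peer is now considered unreachable */
  SKETCH_STOP_LINK,   /* stop using this connection (assumed policy) */
  SKETCH_UNKNOWN      /* code not covered by this sketch */
};

/* Errors are returned as negative values, as in the list above. */
static enum sketch_action sketch_classify(int rc)
{
  switch (rc) {
    case SKETCH_ESUCCESS:  return SKETCH_CONTINUE;
    case -SKETCH_EBROKEN:  return SKETCH_DROP_PEER;
    case -SKETCH_ECLOSED:  return SKETCH_STOP_LINK;
    case -SKETCH_ENOTIMPL: return SKETCH_REPORT;
    case -SKETCH_EINVAL:   return SKETCH_REPORT;
    default:               return SKETCH_UNKNOWN;
  }
}
```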