RDMA and RoCE for Network Efficiency and Performance

Overview


Remote Direct Memory Access (RDMA)

Remote Direct Memory Access (RDMA) provides direct memory access from the memory of one host (storage or compute) to the memory of another host without involving the remote operating system and CPU, boosting network and host performance through lower latency, lower CPU load, and higher bandwidth. In contrast, TCP/IP communications typically require copy operations, which add latency and consume significant CPU and memory resources.

RDMA over Converged Ethernet (RoCE)

RDMA over Converged Ethernet (RoCE) is a standard protocol, defined by the InfiniBand Trade Association (IBTA), that enables RDMA’s efficient data transfer over Ethernet networks, allowing transport offload to a hardware RDMA engine and superior performance. RoCE (in its RoCEv2 form) uses UDP encapsulation, allowing it to traverse Layer 3 networks. RDMA is a key capability natively used by the InfiniBand interconnect technology; InfiniBand and Ethernet RoCE share a common user API but have different physical and link layers.

RoCE Fabric Considerations

Mellanox ConnectX-4 and later generations incorporate Resilient RoCE, which provides best-of-breed performance with only a simple enablement of Explicit Congestion Notification (ECN) on the network switches. A lossless fabric, usually achieved by enabling Priority Flow Control (PFC), is no longer mandatory. Resilient RoCE congestion management, implemented in ConnectX NIC hardware, delivers reliable transport even when running UDP over a lossy network.

Implementing Applications over RDMA/RoCE

Application developers have several options for implementing acceleration with RDMA/RoCE, using either the RDMA infrastructure verbs libraries or middleware libraries:

Infrastructure

  • RDMA Verbs - Using the libibverbs library (available inbox in major distributions), which provides the API needed to send and receive data (see the sketch after this list)
  • RDMA Communication Manager (RDMA-CM) - The RDMA-CM library is a communication manager (CM) used to set up reliable, connected, and unreliable datagram data transfers. It works in conjunction with the RDMA verbs API defined by libibverbs.
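
As a minimal illustration of the verbs API, the sketch below opens the first available RDMA device, allocates a protection domain, and registers a memory region that the NIC can then access directly. Error handling is abbreviated, and picking device index 0 is an assumption made for brevity; build with -libverbs.

    /* Hedged sketch: device setup and memory registration with libibverbs. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs || num == 0) {
            fprintf(stderr, "no RDMA devices found\n");
            return 1;
        }

        struct ibv_context *ctx = ibv_open_device(devs[0]); /* assumption: first device */
        struct ibv_pd *pd = ibv_alloc_pd(ctx);              /* protection domain */

        /* Register a buffer so the NIC can DMA into and out of it directly. */
        size_t len = 4096;
        void *buf = malloc(len);
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_REMOTE_WRITE);

        /* lkey/rkey identify the region in local and remote work requests. */
        printf("registered MR: lkey=0x%x rkey=0x%x\n", mr->lkey, mr->rkey);

        ibv_dereg_mr(mr);
        free(buf);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }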

Middleware

  • Unified Communication X (UCX) - An open-source, production-grade communication framework for data-centric and high-performance applications, driven by industry, laboratories, and academia (http://www.openucx.org); see the sketch after this list
  • Accelio - A high-performance asynchronous reliable messaging and RPC open-source, community-driven library
    NOTE: Accelio is no longer recommended for new projects. For new projects, please refer to UCX.
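
As a hedged sketch of the middleware path, the snippet below only initializes a UCX (UCP) context requesting remote-memory-access support and prints the configuration UCX selected; a real application would go on to create workers and endpoints. The feature choice is illustrative, not prescriptive; build with -lucp -lucs.

    /* Hedged sketch: minimal UCX (UCP) context initialization. */
    #include <stdio.h>
    #include <ucp/api/ucp.h>

    int main(void)
    {
        /* Request the one-sided RMA feature set (illustrative choice). */
        ucp_params_t params = {
            .field_mask = UCP_PARAM_FIELD_FEATURES,
            .features   = UCP_FEATURE_RMA,
        };

        ucp_config_t *config;
        if (ucp_config_read(NULL, NULL, &config) != UCS_OK)
            return 1;

        ucp_context_h context;
        ucs_status_t status = ucp_init(&params, config, &context);
        ucp_config_release(config);
        if (status != UCS_OK) {
            fprintf(stderr, "ucp_init failed: %s\n", ucs_status_string(status));
            return 1;
        }

        /* Show the transports/devices UCX chose (RoCE, InfiniBand, TCP, ...). */
        ucp_context_print_info(context, stdout);

        ucp_cleanup(context);
        return 0;
    }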

Soft RoCE

Soft RoCE is a software implementation of RoCE that allows RoCE to run on any Ethernet network adapter, whether or not it offers hardware acceleration. Because Soft RoCE exposes the same verbs API as hardware RoCE, applications written against libibverbs run unchanged. Soft RoCE is released as part of the upstream Linux kernel 4.8 as well as with Mellanox OFED 4.0 and above.


The Soft-RoCE distribution is available at:

RDMA Advantages

  • Zero-copy: applications send and receive data directly to and from remote buffers, with no intermediate copies (illustrated in the sketch after this list)
  • Kernel bypass: data-path operations go from the application straight to the NIC, skipping the kernel and improving latency and throughput
  • Low CPU involvement: access a remote server’s memory without consuming CPU cycles on the remote server
  • Convergence: a single fabric supports both storage and compute traffic
  • Close-to-wire-speed performance, even on lossy fabrics
  • Available on InfiniBand and Ethernet (L2 and L3)
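
To make the zero-copy and low-CPU-involvement points concrete, the hedged sketch below posts a one-sided RDMA WRITE with libibverbs: the local NIC moves the buffer directly into the peer’s memory, and the remote CPU never runs. It assumes an already-connected reliable connection (RC) queue pair and a remote address/rkey exchanged out of band; all names are illustrative.

    /* Hedged sketch: one-sided RDMA WRITE. Assumes a connected RC queue
     * pair `qp`, a registered local buffer (`buf`, `mr`), and the peer's
     * buffer address and rkey obtained out of band. */
    #include <stdint.h>
    #include <string.h>
    #include <infiniband/verbs.h>

    int rdma_write_example(struct ibv_qp *qp, struct ibv_mr *mr,
                           void *buf, uint32_t len,
                           uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,
            .length = len,
            .lkey   = mr->lkey,
        };

        struct ibv_send_wr wr, *bad_wr = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.opcode     = IBV_WR_RDMA_WRITE;  /* one-sided: remote CPU not involved */
        wr.sg_list    = &sge;
        wr.num_sge    = 1;
        wr.send_flags = IBV_SEND_SIGNALED;  /* ask for a local completion */
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;

        /* The NIC DMAs straight from buf to remote memory: zero-copy. */
        return ibv_post_send(qp, &wr, &bad_wr);
    }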

Where is RDMA used?

  • High Performance Computing (HPC): MPI and SHMEM
  • Machine learning: TensorFlow™, Caffe, Microsoft Cognitive Toolkit (CNTK), PaddlePaddle and more
  • Big data: Spark, Hadoop
  • Databases: Oracle, SAP (HANA)
  • Storage: NVMe-oF (remote block access to NVMe SSDs), iSER (iSCSI Extensions for RDMA), Lustre, GPFS, HDFS, Ceph, EMC ScaleIO, VMware Virtual SAN, Dell Fluid Cache, Windows SMB Direct

Hardware support for RDMA and RoCE

NIC/HCA                                ConnectX-3 Pro   ConnectX-4 and above
RDMA hardware acceleration             ✔                ✔
RoCE support / hardware acceleration   ✔                ✔ + “Resilient RoCE”, allowing RoCE to run over lossy fabrics

Software Drivers support for RDMA and RoCE

RDMA and RoCE are supported on major operating systems starting from the following versions:

Operating System   Inbox version          Async version
Linux              RedHat 7.3,            MLNX_OFED 3.0
                   SLES 12 SP2,
                   Kernel 4.4
Windows Server     Windows Server 2016    WinOF-2 1.20,
                                          WinOF 4.70
VMware             ESXi 6.5               MLNX-NATIVE-ESX 4.16.8.8
FreeBSD            Planned for 2017 H2    -