[RxM] Proposal for using multiple conn instead of one conn for one RxM endpoint #9776
-
The capability you describe is already available in the PSM3 provider. The PSM3 provider was designed for Intel Ethernet RDMA NICs, but it is implemented using standard verbs, so it also runs well on other vendors' RoCE and IB NICs. It also supports TCP/IP sockets when RDMA is not available. When using user-space QPs, the PSM3_QP_PER_NIC option controls the number of QPs used, and other modes such as PSM3_RDMA=1 or 2 enable the use of RC QPs and direct application data placement (also see PSM3_MR_CACHE_MODE). PSM3 has many other tuning options, statistics, and diagnostics to aid in fabric and application analysis and tuning. FYI, the PSM3 provider and Intel Ethernet have been used in all the Xeon MLPerf submissions, so it is well tuned for AI use cases.

PSM3 is also available as part of the "Intel Ethernet Fabric Suite". Search intel.com for "Intel Ethernet Fabric Suite" to find the free download, which includes extensive documentation. The Intel Ethernet Fabric Suite also includes a kernel rendezvous module to greatly reduce the number of RC QPs needed (from O(nodes * processes_per_node * processes_per_node) to O(nodes)). The rendezvous module also provides optimized MR caching with GPUDirect support (for both Intel oneAPI/Level Zero based GPUs and NVIDIA CUDA GPUs). By default, the rendezvous module creates 4 RC QPs between each pair of nodes. The recommended approach is to install the Intel Ethernet Fabric Suite and then use PSM3_RDMA=1 or PSM3_GPUDIRECT=1 to enable all the optimized RDMA features. The download includes source and binary rpms or .debs.

For other NICs, you will likely find each vendor has implemented their own optimized provider: for example, the EFA provider for AWS EFA NICs, the CXI provider for HPE's HPC NICs, the opx provider for Cornelis NICs, etc. There is also a UCX provider which allows use of NVIDIA UCX below libfabric.

Note, the original goal of RXM was to provide a simple example of how to create an OFI provider for an RDMA NIC. In keeping with the spirit of that goal, it has purposely been kept simple. Various vendors have cut/pasted portions of that code to implement their own providers; the tuning and performance characteristics of each vendor's hardware are often different, but they are captured in that vendor's provider. So RXM has lived up to its original goal.
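For anyone who wants to try this, here is a minimal sketch of selecting the psm3 provider and applying the knobs named above from application code. The variable names come from the paragraph above; the specific values (4, 1), the API version, and the helper name `open_psm3_info` are illustrative assumptions, not recommendations — consult the Intel Ethernet Fabric Suite documentation for appropriate settings.

```c
#include <stdlib.h>
#include <string.h>

#include <rdma/fabric.h>

/* Illustrative sketch: values are examples only. */
int open_psm3_info(struct fi_info **info)
{
	struct fi_info *hints = fi_allocinfo();
	int ret;

	/* PSM3 reads its tuning knobs from the environment, so set them
	 * before fi_getinfo() (or simply export them in the shell).
	 * overwrite=0 lets shell-exported values take precedence. */
	setenv("PSM3_QP_PER_NIC", "4", 0);	/* multiple QPs per NIC (example value) */
	setenv("PSM3_RDMA", "1", 0);		/* RC QPs + direct data placement */

	hints->ep_attr->type = FI_EP_RDM;
	hints->caps = FI_MSG | FI_RMA;
	hints->fabric_attr->prov_name = strdup("psm3");

	ret = fi_getinfo(FI_VERSION(1, 18), NULL, NULL, 0, hints, info);
	fi_freeinfo(hints);
	return ret;
}
```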
-
Hi Star, all ideas to improve performance will be considered. Am I correct in assuming the bonding/link-aggregation used in your example is just for demonstration purposes and is not related to or used in your idea? Or do you want to bring link aggregation to RXM, and is something missing from the current bonding/link-aggregation scheme?
-
How is this different from using the mrail provider with two rxm endpoints, with one QP per endpoint?
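For concreteness, the manual equivalent of that setup would look roughly like the sketch below: two independent rxm endpoints on one domain, each maintaining its own connection (one QP per peer with verbs underneath). This is illustrative only — a multi-rail layer such as mrail automates the equivalent arrangement and spreads traffic across the rails, and I am deliberately omitting its configuration details here.

```c
#include <stdlib.h>
#include <string.h>

#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>

/* Illustrative sketch: open two rxm endpoints over verbs.
 * CQ/AV binding, fi_enable(), and full cleanup are omitted for brevity. */
static int open_two_rxm_eps(struct fid_ep *eps[2])
{
	struct fi_info *hints, *info = NULL;
	struct fid_fabric *fabric = NULL;
	struct fid_domain *domain = NULL;
	int i, ret;

	hints = fi_allocinfo();
	hints->ep_attr->type = FI_EP_RDM;
	hints->caps = FI_MSG;
	hints->fabric_attr->prov_name = strdup("verbs;ofi_rxm");

	ret = fi_getinfo(FI_VERSION(1, 18), NULL, NULL, 0, hints, &info);
	if (ret)
		goto out;

	ret = fi_fabric(info->fabric_attr, &fabric, NULL);
	if (ret)
		goto out;
	ret = fi_domain(fabric, info, &domain, NULL);
	if (ret)
		goto out;

	/* Each endpoint is an independent rdm endpoint; with verbs;ofi_rxm
	 * each one opens its own QP per peer. */
	for (i = 0; i < 2 && !ret; i++)
		ret = fi_endpoint(domain, info, &eps[i], NULL);
out:
	fi_freeinfo(info);
	fi_freeinfo(hints);
	return ret;
}
```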
-
Most of the functionality in rxm for handling connection multiplexing has been isolated into the address vector and utility code. Those changes were made to allow tcp to support rdm endpoints optimally without code duplication. As a result, tcp no longer needs rxm, though tcp;rxm is supported for wire protocol compatibility. The intent was for verbs to follow this same path, such that rdm endpoints would be implemented directly in verbs. verbs;rxm would continue to be supported for compatibility, but the verbs rdm protocol could be wire compatible with verbs;rxm if desired. (The tcp rdm protocol is optimized and not wire compatible with tcp;rxm.)

It's worth noting that in the case of tcp, a direct implementation of rdm endpoints showed significant performance and stability improvements. My expectation is that verbs rdm would show a minimal performance gain. That's because verbs doesn't add any protocol today, plus rxm is designed around verbs semantics (required use of bounce buffers, memory registration, forced use of a rendezvous protocol, etc.). I do think the resulting code will be more stable under stress scenarios.

From the viewpoint of the implementation, your proposal differs from mrail. Mrail would create N rdm endpoints, each with their own connections, rather than 1 rdm endpoint with N connections. It may still be worthwhile to capture mrail performance, since that's available today. It may not be ideal but might be good enough for immediate use.

Before enhancing rxm, I would first see about adding rdm endpoint support directly into the verbs provider, then update verbs rdm to form connection groups. Because verbs needs to deal directly with HW devices, the implementation to group connections together would differ when working with verbs versus, say, sockets and tcp connections. The connection grouping code may be common.
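To make the "connection group" idea concrete, here is a purely hypothetical sketch — none of these types exist in libfabric today — of what a transport-agnostic group of connections owned by a single rdm endpoint could look like. The transport-specific part is reduced to an ops table, so the same grouping logic could in principle be shared by verbs and tcp.

```c
#include <stddef.h>

/* Hypothetical sketch only -- not existing libfabric code. */

struct conn;	/* an RC QP for verbs, a socket for tcp, ... */

/* Each transport supplies the per-connection operations... */
struct conn_ops {
	struct conn *(*open)(void *transport_ctx);
	int (*connect)(struct conn *c, const void *peer_addr, size_t addrlen);
	void (*close)(struct conn *c);
};

/* ...so the grouping itself can be transport-agnostic: one rdm endpoint
 * would own one conn_group per peer instead of a single connection. */
struct conn_group {
	const struct conn_ops *ops;	/* supplied by verbs, tcp, ... */
	void *transport_ctx;		/* device/fabric handle for open() */
	struct conn **conns;
	size_t n_conns;
};

/* Establish all member connections to a peer.  This loop is the part
 * that could live in shared utility code; only ops differs per transport. */
static int conn_group_connect(struct conn_group *grp, const void *peer_addr,
			      size_t addrlen)
{
	size_t i;
	int ret;

	for (i = 0; i < grp->n_conns; i++) {
		grp->conns[i] = grp->ops->open(grp->transport_ctx);
		if (!grp->conns[i])
			return -1;

		ret = grp->ops->connect(grp->conns[i], peer_addr, addrlen);
		if (ret)
			return ret;
	}
	return 0;
}
```

One question a sketch like this glosses over is how ordering guarantees (e.g., FI_ORDER_SAS) are preserved once traffic is spread across the group's members; that policy would sit above the grouping code.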
-
Hi all,
I'm Star, a developer with an interest in high-performance datacenter networking. I've been following Libfabric for some time.
RxM is a utility provider, and it currently creates only one connection to communicate with each peer; when the core provider is verbs, that connection is a single QP. I've noticed that the performance of verbs;ofi_rxm could benefit from using multiple connections (QPs) instead of one, especially when the NIC is bonded (refer to https://docs.nvidia.com/networking-ethernet-software/cumulus-linux-37/Layer-2/Bonding-Link-Aggregation/). This enhancement could maximize NIC throughput utilization, potentially doubling the bandwidth of verbs;ofi_rxm.
Here’s how I envision it working:
To illustrate the value of this idea, consider the following scenario:
As shown in the picture, two hosts, each with one CX6 NIC (two ports, 100 Gbps per port), are connected via a switch. The two ports are aggregated into a single logical bonded interface for benefits such as load balancing and failover protection.
Linear scaling of bandwidth is also an important feature, and such a configuration is quite common in current datacenters. However, in the context of Libfabric, none of these advantages apply, because RxM creates only one connection on the logical bonded interface.
That's exactly why I came up with this idea. I believe this feature would be very valuable for Libfabric users training large language models, because those workloads are bandwidth-sensitive.
I have built a simple demo that implements all three points in libfabric and obtained some data using fabtests. More specifically, the test uses verbs;ofi_rxm and sets FI_OFI_RXM_NR_CONN=2 to create two connections (QPs) for one RxM endpoint on each side. The result shows that the bandwidth can be nearly doubled, as expected (12232.14 vs. 21938.82). The traffic monitor also confirms the improvement.
[Screenshots: traffic monitor output with FI_OFI_RXM_NR_CONN=1 vs. FI_OFI_RXM_NR_CONN=2]
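To give a feel for the approach without pasting the whole demo, here is a simplified sketch of how a per-peer set of connections sized by FI_OFI_RXM_NR_CONN could be filled in and then striped over on the send path. The type and function names here are made up for illustration; the real demo differs in detail.

```c
#include <stdlib.h>

/* Hypothetical types standing in for rxm's per-peer connection state. */
struct rxm_conn;			/* wraps one verbs QP / MSG endpoint */

struct rxm_conn_set {
	struct rxm_conn **conns;
	size_t nr_conn;			/* from FI_OFI_RXM_NR_CONN */
	size_t next;			/* round-robin cursor */
};

/* Read the connection count once at endpoint setup.
 * Default is 1, i.e. today's behavior. */
static size_t rxm_read_nr_conn(void)
{
	const char *val = getenv("FI_OFI_RXM_NR_CONN");
	long nr = val ? strtol(val, NULL, 10) : 1;

	return nr > 0 ? (size_t) nr : 1;
}

/* Pick the connection to use for the next transmit.  Round-robin is
 * what lets both bonded ports carry traffic; preserving message
 * ordering across the QPs is the main detail a real implementation
 * has to handle on top of this. */
static struct rxm_conn *rxm_conn_set_next(struct rxm_conn_set *set)
{
	struct rxm_conn *conn = set->conns[set->next];

	set->next = (set->next + 1) % set->nr_conn;
	return conn;
}
```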
I would love to hear your thoughts on this proposal. Do you think this would be a valuable addition to Libfabric? Are there any potential issues or improvements that you foresee? If there's interest, I can contribute the code to the community.
Thank you all for considering this proposal. I look forward to your feedback and hope we can make Libfabric even better together.
All the best,
Star