[RxM] Proposal for using multiple conn instead of one conn for one RxM endpoint #9776

staryxchen · 2024-01-30T07:55:55Z

staryxchen
Jan 30, 2024

Hi all,
I'm Star, a developer with an interest in High Performance Datacenter Network. I've been following Libfabric for sometime.

RxM is utility provider and creates only one connection and communicates with peer currently. The connection is specifically refers to QP if the core provider is Verbs. I’ve noticed that the performance of verbs;ofi_rxm could benefit from using multiple connection (qp) instead of one, especially in the context of NIC is bonding (refer to https://docs.nvidia.com/networking-ethernet-software/cumulus-linux-37/Layer-2/Bonding-Link-Aggregation/). This enhancement could maximizing NIC throughput utilization, leading to double bandwidth of verbs;ofi_rxm.

Here’s how I envision it working:

Add a new environment variable called FI_OFI_RXM_NR_CONN to define how many connection used by one rxm_ep
making mulitple rxm_conn bond to one rxm_ep
rxm_ep maintain a variable to decide using which one conn when sending message

To illustrate the value of this idea, consider the following scenario:
As shown in the picture, two hosts with one CX6 (two port, each port 100Gbps) connected via a switch. Two port are aggregated into a single logical bonded interface for the benefit like Load balancing and Failover protection.
Linear scaling of bandwidth is also important feature. Such configuration is quite common in the current datacenter. However in the context of Libfabric, none of these advantages work because RxM only create one connection on the logical bonded interface.

That's exactly why I came up with the idea. I believe this feature is very valueable for the user who using Libfabric when they training large language models, because those tasks are bandwidth-sensitivity.

I have build a simple demo to implement all three of point in libfabric and obtained some data using fabtest. More specially, the test use verbs;ofi_rxm and set FI_OFI_RXM_NR_CONN=2 to create two connection (qp) for one RxM ep in each side. The result shows that the bandwidth could be nearly doubled as expect (12232.14 vs 21938.82). The traffic monitor also proves the improvement.

FI_OFI_RXM_NR_CONN=1

FI_OFI_RXM_NR_CONN=2

I would love to hear your thoughts on this proposal. Do you think this would be a valuable addition to Libfabric? Are there any potential issues or improvements that you foresee? If there's interest, I can contribute the code to the community.
Thank you all for considering this proposal. I look forward to your feedback and hope we can make Libfabric even better together.

All the best,
Star

ToddRimmer · 2024-01-30T13:20:05Z

ToddRimmer
Jan 30, 2024

The capability you describe is already available in the PSM3 provider.

The PSM3 provider was designed for Intel Ethernet RDMA NICs, but is implemented using standard verbs so it also runs well on other vendors RoCE and IB NICs. It also supports TCP/IP sockets when RDMA is not available.

When using user space QPS, the PSM3_QP_PER_NIC option controls this, and other modes such as PSM3_RDMA=1 or 2 enable use of RC QPs and direct application data placement. (Also see PSM3_MR_CACHE_MODE). PSM3 has many other tuning options, statistics and diagnostics to aid in fabric and application analysis and tuning. FYI, the PSM3 provider and Intel Ethernet have been used in all the Xeon MLPerf submissions, so it is well tuned for AI use cases.

PSM3 is also available as part of the "Intel Ethernet Fabric Suite". Search intel.com for "Intel Ethernet Fabric Suite" to find the free download which includes extensive documentation.

The Intel Ethernet Fabric Suite also includes a kernel rendezvous module to greatly reduce the number of RC QPs needed (reducing RC QPs needed from O(nodes * process_per_node * process_per_node) to O(nodes) ). The rendezvous module also provides optimized MR caching with GPU DIrect support (for both Intel OneAPI/Level Zero based GPUs and NVIDIA Cuda GPUs). By default, the rendezvous module creates 4 RC QPs between each pair of nodes. The recommended mode being to install Intel Ethernet Fabric Suite and then use PSM3_RDMA=1 or PSM3_GPUDIRECT=1 to enable all the optimized RDMA features. The download includes source and binary rpms/or .debs.

For other NICs, you will likely find each vendor has implemented their own optimized provider. For example, the EFA provider for AWS EFA NICs, the CXI provider for HPE's HPC NICs, the opx provider for Cornelis NICs, etc. There is also a UCX provider which allows use of NVIDIA UCX below libfabric.

Note, the original goal of RXM was to provide a simple example of how to create an OFI provider for an RDMA NIC. In keeping with the spirit of that goal, it has purposely been kept simple. Various vendors have cut/pasted portions of that code to implement their own providers and the tuning and performance characteristics of each vendors hardware is often different, but captured in their provider. So RXM has lived up to its original goal.

1 reply

staryxchen Jan 31, 2024
Author

I am not familiar with PSM3 and have never used it before. What's worse, I found that I cannot enable PSM3 in my machine which configure a Mellanox RoCE NIC and can enable Verbs provider. Could you provide some requirements needed for enabling PSM3 so that I can verify PSM3 is meet the capability I described?

ghost · 2024-01-30T14:33:15Z

ghost
Jan 30, 2024

Hi Star,

All ideas to improve performance will be considered.

Am I correct in assuming bonding/link-aggregation used in your example is just for demo purpose and not related to nor used in your idea? Or are you wanting to bring link-aggregation to RXM and is there something missing from current bonding/link-aggregation scheme?
Using multiple QPs across multiple ports can certainly increase bandwidth. However, can/will you still guarantee completion ordering or is it for relaxed ordering only?

1 reply

staryxchen Jan 31, 2024
Author

It' correct that bonding/link-aggregation is not related to my idea. In fact bonding/link-aggregation is de facto configuration in our production environment, so all of my considerations in terms of performance optimization are premised on such configuration.
In my opinion, this idea is an enhancements for RxM, so keeping same completion ordering as before is better. My current code implementation selects a connection from multiple connections in a round-robin fashion, and multiple msg_ep share a single msg_cq. Indeed, some unexp_msg have been encountered during testing, but this does not break the completion ordering, as RxM will take measures to handle these situations (ref to rxm_get_unexp_msg).

sunkuamzn · 2024-01-30T17:02:18Z

sunkuamzn
Jan 30, 2024
Collaborator

How is this different from using the mrail provider with two rxm endpoints with 1 qp per endpoint?

1 reply

staryxchen Jan 31, 2024
Author

To be honest, I never used mrail. After I read the document about mrail, I think the starting point of this proposal may be similar to mrail, but mrail still experimental so far. Perhaps we can gain some inspiration from mrail and implement the same feature in RxM.

BTW, mrail seems to be too heavy, it need run over two layer (versb;ofi_rxm), I think it's hard to get better performance than RxM.

shefty · 2024-02-02T21:30:02Z

shefty
Feb 2, 2024
Maintainer

Most of the functionality in rxm for handling connection multiplexing has been isolated into the address vector and utility code. Those changes were made to allow tcp to support rdm endpoints optimally without code duplication. As a result, tcp no longer needs rxm, though tcp;rxm is supported for wire protocol compatibility. The intent was for verbs to follow this same path, such that rdm endpoints would be implemented directly in verbs. verbs;rxm would continue to be supported for compatibility, but the verbs rdm protocol could be wire compatible with verbs;rxm if desired. (The tcp rdm protocol is optimized, and not wire compatible with tcp;rxm).

It's worth noting that in in the case of tcp, a direct implementation of rdm endpoints showed significant performance and stability improvements. My expectation is that verbs rdm would show a minimal performance gain. That's because verbs doesn't add any protocol today, plus rxm is designed around verbs semantics (required use of bounce buffers, memory registration, forced use of a rendezvous protocol, etc.). I do think the resulting code will be more stable under stress scenarios.

From the viewpoint of the implementation, your proposal differs from mrail. Mrail would create N rdm endpoints, each with their own connections, rather than 1 rdm endpoint with N connections. It may still be worthwhile to capture mrail performance, since that's available today. It may not be ideal but might be good enough for immediate use.

Before enhancing rxm, I would first see about adding rdm endpoint support directly into the verbs provider. Then update verbs rdm to form connection groups. Because verbs need to deal directly with HW devices, the implementation to group connections together would differ when working with verbs than, say, sockets and tcp connections. The connection grouping code may be common.

3 replies

staryxchen Feb 4, 2024
Author

The abstract of connection in RxM is quite succinct (👍）, so I think handling connection multiplexing in RxM code is easy. In my implementation, I add the local_index field to struct rxm_conn so that rxm_ep can find the corresponding conn by some strategy instead of only using fixed peer->index . By doing so, each local rxm_ep will have at least 1 connection to a single
remote rxm_ep.

I'm try to capture mrail performance. I wonder it's possible to provide an option like FI_OFI_MRAIL_NUMBER to let user specify how many rail to use instead of setting FI_OFI_MRAIL_ADDR.

Am I correct that you think it is better to support multiple connection in verbs provider than RxM? From the starting point of this proposal (fully utilize the capability of HW devices), it may better to support group connection in core provider as it can work directly with specific types of hardware to get best performance.

BTW, could you please provide me with a brief update on the progress of adding RDM endpoint support directly into the verbs provider when you have a moment?

shefty Feb 5, 2024
Maintainer

Yes, I think supporting multiple connections directly in the core provider will be best in the long term. But it's definitely more work.

The rxm_av abstraction is generic and no longer tied directly to the rxm provider. The name was kept the same to avoid unnecessary code churn. The AV maintains a list of peers (addresses) and connection objects. The connection object differs based on the user. For rxm, it's rxm_conn. For tcp, it's xnet_conn. If you mapped this directly to verbs, it should probably be some new vrb_conn structure that will look similar to xnet_conn.

Verbs rdm support is not there. My intent was to model it closely after the tcp implementation, such that vrb_rdm would look similar to xnet_rdm. The first step was to rework the locking and progress model to make it possible. I think that's been done. Locking and progress should both be handled by the vrb_progress structure. RDM support would require migrating the rxm_ep code to verbs. That opens up the potential to redesign significant pieces of the code. For example, tag matching could move directly to the msg endpoint.

staryxchen Feb 8, 2024
Author

Get it. Thank you for your valuable reply. I'll try to support multiple connections in the core provider next.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RxM] Proposal for using multiple conn instead of one conn for one RxM endpoint #9776

{{title}}

Replies: 4 comments 6 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

[RxM] Proposal for using multiple conn instead of one conn for one RxM endpoint #9776

staryxchen Jan 30, 2024

Replies: 4 comments · 6 replies

ToddRimmer Jan 30, 2024

staryxchen Jan 31, 2024 Author

ghost Jan 30, 2024

staryxchen Jan 31, 2024 Author

sunkuamzn Jan 30, 2024 Collaborator

staryxchen Jan 31, 2024 Author

shefty Feb 2, 2024 Maintainer

staryxchen Feb 4, 2024 Author

shefty Feb 5, 2024 Maintainer

staryxchen Feb 8, 2024 Author

staryxchen
Jan 30, 2024

Replies: 4 comments 6 replies

ToddRimmer
Jan 30, 2024

staryxchen Jan 31, 2024
Author

ghost
Jan 30, 2024

staryxchen Jan 31, 2024
Author

sunkuamzn
Jan 30, 2024
Collaborator

staryxchen Jan 31, 2024
Author

shefty
Feb 2, 2024
Maintainer

staryxchen Feb 4, 2024
Author

shefty Feb 5, 2024
Maintainer

staryxchen Feb 8, 2024
Author