Issue establishing connection between RDM endpoints using TCP provider #9051

mason1504 · 2023-06-16T16:08:02Z

mason1504
Jun 16, 2023

I have an issue with Libfabric v1.18.0 using an FI_EP_RDM endpoint type with the tcp provider.

The issue I see is that when I run up two instances of my application, both sides create an RDM endpoint using the tcp provider and add the corresponding address to the address vector.

Both applications then start sending and receiving data on these endpoints, but sometimes I see an issue where there is a race condition as Libfabric must be attempting to establish a tcp connection from both hosts at the same time, with fi_send being called in a loop in both applications.

Enabling Libfabric debug logging I see the following repeated, and a connection is never actually established between hosts:

Host 1:
libfabric:31984:1686915992::tcp:ep_ctrl:xnet_handle_event_list():519 event FI_CONNREQ
libfabric:31984:1686915992::tcp:ep_ctrl:xnet_process_connreq():422 connreq for 0000024171D3AAC0
libfabric:31984:1686915992::tcp:ep_ctrl:xnet_process_connreq():463 simultaneous, reject peer

Host 2:
libfabric:15804:1686915992::tcp:ep_ctrl:xnet_handle_cm_msg():112 Connection refused from remote
libfabric:15804:1686915992::tcp:ep_ctrl:xnet_req_done():196 Failed to receive connect response
11libfabric:15804:1686915992::tcp:ep_ctrl:xnet_handle_event_list():519 event FI_SHUTDOWN

Note that the call to fi_send returns -FI_EAGAIN and remains in that state.

Note also that this is a race condition, sometimes the connections establish without issue and data is transferred between hosts fine.

I'm just wondering if anyone else has experienced this, I have some ideas on how to resolve but any advice on how best to fix this would be appreciated.

Thanks.

shefty · 2023-06-16T16:29:14Z

shefty
Jun 16, 2023
Maintainer

I'm not aware of any issue in this area. It's actually a common occurrence for a couple of applications.

It is possible for both sides to try to establish a connection to the other at the same time. One side should be chosen as the winner, and the other connection should be dropped. The winner is picked based on comparing the IP address and TCP port numbers. From the log output above, it looks like the connection from host 1 -> 2 won and was established. Do you see any indication of that in the log?

Can you confirm that host 1 thinks it has an active connection to host 2? Hopefully the log can provide some indication why host 2 doesn't think it has a connection back.

2 replies

mason1504 Jun 16, 2023
Author

Thanks for the prompt reply.

The above log entries get logged continuously, so seems it's trying to establish a connection and failing constantly. Sometimes though as I said above it works fine and resolves. When it works I see the following log entry:

libfabric:1352:1686917384::tcp:ep_ctrl:xnet_handle_event_list():519 event FI_CONNREQ
libfabric:1352:1686917384::tcp:ep_ctrl:xnet_process_connreq():422 connreq for 000002BE0FF66A40
libfabric:1352:1686917384::tcp:ep_ctrl:xnet_process_connreq():439 simultaneous, accept peer 000002BE0FF66A40

Notice it says accept peer rather than reject.

shefty Jun 16, 2023
Maintainer

There should be log messages for connections from host 1 -> 2 and from host 2 -> 1. The repeating logs are for host 2 -> 1. Can you attach a full log file (feel free to trim off a bunch of the repeated messages at the end)?

shefty · 2023-06-16T16:31:09Z

shefty
Jun 16, 2023
Maintainer

Created issue #9052 for this discussion, as the description indicates a bug.

0 replies

mason1504 · 2023-06-19T11:17:22Z

mason1504
Jun 19, 2023
Author

Host_2.txt
Host_1.txt

I've attached logs with comments included, search for '//' for my hand written comments.
To summarise:

Host 1 connects to the passive endpoint on host 2 to exchange RDM address information.
The relevant RDM endpoint addresses are then inserted into the address vectors on both hosts, the TCP connection is closed.
Sends between hosts fail, host 1 logs 'Connection refused from remote' and host 2 logs 'simultaneous, reject peer' this must be a race condition as it doesn't always happen.

2 replies

shefty Jun 20, 2023
Maintainer

Hmm... from host 1:

libfabric:2328:1687172123::tcp:ep_ctrl:xnet_handle_cm_msg():104<warn> Failed to read cm data

And host 2:

libfabric:31612:1687172123::tcp:domain:xnet_complete_tx():337<warn> msg send failed

These show up early in the output and look suspect. The message from host 2 may not mean anything, but I'm not as sure about the message from host 1 and need to understand how it can occur.

shefty Jun 20, 2023
Maintainer

This line also shows up in host 1 output:

libfabric:2328:1687172123::tcp:ep_data:xnet_progress_hdr():779<warn> Payload offset is too large

This makes me skeptical this is a peer connection issue, versus some other problem. The payload offset message is reporting some sort of corruption in the message header that's been received. Is there any chance this is somehow mixing libfabric versions or tcp with tcp;ofi_rxm providers? (These seem unlikely.) It looks like there may be some sort of data corruption occurring that's hitting the tcp provider's messages. If you can get more information from the above print location in xnet_progress_hdr():779, maybe it'll help identify what might be going on. For example, print hdr_size.

If hdr_size is too large, the code can catch that problem. But if it's corrupted, yet still in a valid range, then the entire stream is mucked up.

mason1504 · 2023-06-22T11:40:03Z

mason1504
Jun 22, 2023
Author

The provider on both hosts is 'tcp' version 773248 according to the fi_fabric_attr structure, both using Libfabric v1.18.

I don't always see that message regarding the payload offset being too large, but I've added logging into the Libfabric code and when it does happen I see the following:

libfabric:12996:1687430169::tcp:ep_data:xnet_progress_hdr():779<warn> Payload offset is too large hdr_size=244, XNET_MAX_HDR=120

I've attached logs below for an example of where the connection fails without the payload offset being logged as too large:

Host_2-2.txt
Host_1-2.txt

2 replies

shefty Jun 22, 2023
Maintainer

Do you always see this message when the connection setup fails? Or still only sometimes?

shefty Jun 22, 2023
Maintainer

Sorry, I misread your last comment. You answered my question.

mason1504 · 2023-06-22T12:24:48Z

mason1504
Jun 22, 2023
Author

Away from this topic I'm seeing a crash when I run ping pong on Windows, using an RDM endpoint type with the UDP provider. The crash is in the first call to fi_cq_read, recreatable every time. Shall I raise a separate discussion for this issue?

1 reply

shefty Jun 22, 2023
Maintainer

Please open a separate issue for this.

shefty · 2023-06-26T18:52:47Z

shefty
Jun 26, 2023
Maintainer

I think there may be 2 or 3 somewhat independent problems being exposed.

Possible issue:

libfabric:2328:1687172123::tcp:ep_data:xnet_progress_hdr():779<warn> Payload offset is too large

I don't know how this is occurring. This might be difficult to debug.

Possible issue:

libfabric:2328:1687172123::tcp:ep_ctrl:xnet_handle_cm_msg():104<warn> Failed to read cm data

I know how this might occur. There's even a comment in the code about one possibility. I can update the print at this location, which may give help confirm whether the potential problem is occurring or not.

Possible issue:
In either of the above cases, the corresponding connection is being destroyed by one side. However, for some reason, the other side is not detecting that the connection was destroyed. Typically, the peer would detect this when reading the CQ. The socket would be marked in an error state, and the connection would be closed. This would allow subsequent attempts to send to try to establish a new connection. It seems that somehow the connection in question is not being progressed such that the peer can detect that the connection is no longer valid.

1 reply

shefty Jun 26, 2023
Maintainer

@mason1504 - Would you be able to run with this pull request included? #9076 It expands the prints to show the error codes that are being returned.

shefty · 2023-06-26T22:53:10Z

shefty
Jun 26, 2023
Maintainer

From host_2-2 log:

EpListener::acceptNewConnection ConnectMessage: Provider: tcp Ep type: 3 CTI addr: 10.1.26.157:50502 Listen addr: 10.1.26.157:5001
EpMgr::processIncomingConnectRequest, remoteAddress: 10.1.26.157:5001 local address: 10.1.30.37:5000 01234567-89ab-cdef-1241-456789abcdef

Does the app use both FI_EP_MSG and FI_EP_RDM endpoint types? The logs suggest that is the case. The address shown above is also inserted into the AV:

libfabric:12996:1687430168::tcp:av:ofi_av_insert_addr():291<info> inserting addr: fi_sockaddr_in://10.1.26.157:50502

I'm guessing that message is some sort of OOB address exchange. If connections between MSG and RDM endpoints are being mingled, that could explain the problems being reported. How are you selecting and obtaining the address of the RDM endpoints?

6 replies

shefty Jun 27, 2023
Maintainer

There is code to handle simultaneous connections, where both sides try to send at the same time. That's the reason for the messages that show up as 'reject peer' or 'accept peer'. This happens frequently with MPI applications.

mason1504 Jun 28, 2023
Author

I've done a test with the latest code in main, logs are below and the enhanced logging is present:

Host_2-3.txt
Host_1-3.txt

shefty Jun 28, 2023
Maintainer

I'm trying to understand a specific flow. Is there any synchronization done after the addresses are exchanged and inserted into the local AV prior to sending on the rdm endpoint? What I'm trying to determine is if A could send to B prior to B inserted A's address into its AV.

shefty Jun 28, 2023
Maintainer

Based on the logs, I think the answer to my last question is 'yes'. Simpler question. Are you using the same domain to open both msg and rdm endpoints? That isn't supported by the provider. And I don't think the code does anything to prevent an app from doing this.

shefty Jun 29, 2023
Maintainer

If you are mixing msg and rdm endpoints on the same domain, the patch in this PR will detect the problem earlier and fail the ep open call:
#9092
Another change is needed to fix-up the domain returned from fi_getinfo.
I didn't examine the call paths to figure out what problems would arise mixing different ep types on the same domain. I know the locking would not work, so multi-threaded apps can break. But there could be other issues as well.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issue establishing connection between RDM endpoints using TCP provider #9051

{{title}}

Replies: 7 comments 14 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Issue establishing connection between RDM endpoints using TCP provider #9051

mason1504 Jun 16, 2023

Replies: 7 comments · 14 replies

shefty Jun 16, 2023 Maintainer

mason1504 Jun 16, 2023 Author

shefty Jun 16, 2023 Maintainer

shefty Jun 16, 2023 Maintainer

mason1504 Jun 19, 2023 Author

shefty Jun 20, 2023 Maintainer

shefty Jun 20, 2023 Maintainer

mason1504 Jun 22, 2023 Author

shefty Jun 22, 2023 Maintainer

shefty Jun 22, 2023 Maintainer

mason1504 Jun 22, 2023 Author

shefty Jun 22, 2023 Maintainer

shefty Jun 26, 2023 Maintainer

shefty Jun 26, 2023 Maintainer

shefty Jun 26, 2023 Maintainer

shefty Jun 27, 2023 Maintainer

mason1504 Jun 28, 2023 Author

shefty Jun 28, 2023 Maintainer

shefty Jun 28, 2023 Maintainer

shefty Jun 29, 2023 Maintainer

mason1504
Jun 16, 2023

Replies: 7 comments 14 replies

shefty
Jun 16, 2023
Maintainer

mason1504 Jun 16, 2023
Author

shefty Jun 16, 2023
Maintainer

shefty
Jun 16, 2023
Maintainer

mason1504
Jun 19, 2023
Author

shefty Jun 20, 2023
Maintainer

shefty Jun 20, 2023
Maintainer

mason1504
Jun 22, 2023
Author

shefty Jun 22, 2023
Maintainer

shefty Jun 22, 2023
Maintainer

mason1504
Jun 22, 2023
Author

shefty Jun 22, 2023
Maintainer

shefty
Jun 26, 2023
Maintainer

shefty Jun 26, 2023
Maintainer

shefty
Jun 26, 2023
Maintainer

shefty Jun 27, 2023
Maintainer

mason1504 Jun 28, 2023
Author

shefty Jun 28, 2023
Maintainer

shefty Jun 28, 2023
Maintainer

shefty Jun 29, 2023
Maintainer