Replies: 7 comments 14 replies
-
I'm not aware of any issue in this area. It's actually a common occurrence for a pair of peer applications: it is possible for both sides to try to establish a connection to the other at the same time. One side should be chosen as the winner, and the other connection should be dropped. The winner is picked by comparing the IP addresses and TCP port numbers. From the log output above, it looks like the connection from host 1 -> 2 won and was established. Do you see any indication of that in the log? Can you confirm that host 1 thinks it has an active connection to host 2? Hopefully the log can provide some indication of why host 2 doesn't think it has a connection back.
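For illustration, the tie-breaker amounts to an ordering on (IP address, TCP port) pairs, along these lines (a hypothetical sketch, not the tcp provider's actual code):

/* Hypothetical sketch of a simultaneous-connect tie-breaker: order the
 * two sides by (IP address, TCP port) and keep only the connection
 * initiated by the side that compares greater. Illustrative only. */
#include <string.h>
#include <netinet/in.h>

static int addr_cmp(const struct sockaddr_in *a, const struct sockaddr_in *b)
{
    int ret = memcmp(&a->sin_addr, &b->sin_addr, sizeof(a->sin_addr));
    if (ret)
        return ret;
    return (int) ntohs(a->sin_port) - (int) ntohs(b->sin_port);
}

/* On a simultaneous connect, if addr_cmp(local, remote) > 0 the local
 * side wins: its outgoing connection is kept and the incoming connreq
 * is rejected; otherwise the incoming request is accepted and the
 * local attempt is dropped. */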
-
Created issue #9052 for this discussion, as the description indicates a bug.
-
I've attached logs with comments included; search for '//' to find my handwritten comments.
-
The provider on both hosts is 'tcp', version 773248 according to the fi_fabric_attr structure; both hosts are running Libfabric v1.18. I don't always see that message about the payload offset being too large, but I've added logging into the Libfabric code, and when it does happen I see the following:
I've attached logs below for an example where the connection fails without the payload offset being logged as too large:
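For reference, this is roughly how the provider name and version are being read (a simplified sketch of the standard fi_getinfo pattern, not the exact application code):

/* Simplified sketch: reading the provider name/version reported above
 * from the fi_fabric_attr structure returned by fi_getinfo().
 * Error handling omitted for brevity. */
#include <stdio.h>
#include <string.h>
#include <rdma/fabric.h>

int main(void)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info;

    hints->ep_attr->type = FI_EP_RDM;
    hints->fabric_attr->prov_name = strdup("tcp");

    if (!fi_getinfo(FI_VERSION(1, 18), NULL, NULL, 0, hints, &info)) {
        printf("provider: %s, version: %u\n",
               info->fabric_attr->prov_name,
               info->fabric_attr->prov_version);
        fi_freeinfo(info);
    }
    fi_freeinfo(hints);
    return 0;
}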
-
Separately from this topic, I'm seeing a crash when I run ping pong on Windows, using an RDM endpoint type with the UDP provider. The crash is in the first call to fi_cq_read and is reproducible every time. Shall I raise a separate discussion for this issue?
-
I think there may be two or three somewhat independent problems being exposed.
I don't know how this is occurring. It might be difficult to debug.
I know how this might occur. There's even a comment in the code about one possibility. I can update the print at this location, which may help confirm whether the potential problem is occurring or not.
-
From host_2-2 log:
Does the app use both FI_EP_MSG and FI_EP_RDM endpoint types? The logs suggest that is the case. The address shown above is also inserted into the AV:
I'm guessing that message is some sort of OOB address exchange. If connections between MSG and RDM endpoints are being mingled, that could explain the problems being reported. How are you selecting and obtaining the address of the RDM endpoints?
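For RDM endpoints, the usual pattern is to read the endpoint's own address with fi_getname() after fi_enable(), exchange it out-of-band, and insert the peer's address with fi_av_insert(), roughly like this (a sketch with error handling trimmed; names are illustrative):

/* Sketch of the usual RDM address exchange: read the local endpoint
 * address, exchange it out-of-band, then insert the peer's address
 * into the AV. Error handling trimmed. */
#include <rdma/fabric.h>
#include <rdma/fi_cm.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

static int rdm_addr_exchange(struct fid_ep *ep, struct fid_av *av,
                             const void *peer_addr, fi_addr_t *peer_fi_addr)
{
    char addr[128];
    size_t addrlen = sizeof(addr);
    int ret;

    /* the address is only valid after the endpoint has been enabled */
    ret = fi_getname(&ep->fid, addr, &addrlen);
    if (ret)
        return ret;

    /* ...send addr/addrlen to the peer over the OOB channel and
     * receive the peer's address into peer_addr... */

    /* fi_av_insert() returns the number of addresses inserted */
    ret = fi_av_insert(av, peer_addr, 1, peer_fi_addr, 0, NULL);
    return ret == 1 ? 0 : -FI_EINVAL;
}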
-
I have an issue with Libfabric v1.18.0 using an FI_EP_RDM endpoint type with the tcp provider.
The issue I see is that when I start two instances of my application, both sides create an RDM endpoint using the tcp provider and add the corresponding address to the address vector.
Both applications then start sending and receiving data on these endpoints. Sometimes, however, I hit a race condition: Libfabric must be attempting to establish a tcp connection from both hosts at the same time, since fi_send is being called in a loop in both applications.
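For context, the setup on each host looks roughly like this (a simplified sketch with error handling omitted; not the exact application code):

/* Simplified per-host setup: open an FI_EP_RDM endpoint over the tcp
 * provider, bind an AV and CQ, and enable it. Error handling omitted. */
#include <string.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

static void setup(struct fid_ep **ep, struct fid_av **av, struct fid_cq **cq)
{
    struct fi_info *hints = fi_allocinfo(), *info;
    struct fid_fabric *fabric;
    struct fid_domain *domain;
    struct fi_av_attr av_attr = { .type = FI_AV_TABLE };
    struct fi_cq_attr cq_attr = { .format = FI_CQ_FORMAT_CONTEXT };

    hints->ep_attr->type = FI_EP_RDM;
    hints->fabric_attr->prov_name = strdup("tcp");
    fi_getinfo(FI_VERSION(1, 18), NULL, NULL, 0, hints, &info);

    fi_fabric(info->fabric_attr, &fabric, NULL);
    fi_domain(fabric, info, &domain, NULL);
    fi_endpoint(domain, info, ep, NULL);
    fi_av_open(domain, &av_attr, av, NULL);
    fi_cq_open(domain, &cq_attr, cq, NULL);

    fi_ep_bind(*ep, &(*av)->fid, 0);
    fi_ep_bind(*ep, &(*cq)->fid, FI_SEND | FI_RECV);
    fi_enable(*ep);

    fi_freeinfo(hints);
    fi_freeinfo(info);
    /* each side then exchanges addresses and calls fi_av_insert() */
}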
With Libfabric debug logging enabled, I see the following repeated, and a connection is never actually established between the hosts:
Host 1:
libfabric:31984:1686915992::tcp:ep_ctrl:xnet_handle_event_list():519 event FI_CONNREQ
libfabric:31984:1686915992::tcp:ep_ctrl:xnet_process_connreq():422 connreq for 0000024171D3AAC0
libfabric:31984:1686915992::tcp:ep_ctrl:xnet_process_connreq():463 simultaneous, reject peer
Host 2:
libfabric:15804:1686915992::tcp:ep_ctrl:xnet_handle_cm_msg():112 Connection refused from remote
libfabric:15804:1686915992::tcp:ep_ctrl:xnet_req_done():196 Failed to receive connect response
libfabric:15804:1686915992::tcp:ep_ctrl:xnet_handle_event_list():519 event FI_SHUTDOWN
Note that the call to fi_send returns -FI_EAGAIN and continues to return it on every retry.
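For reference, the send loop amounts to the standard retry pattern, roughly like this (a simplified sketch, not the exact code; 'txcq' is the CQ bound to the endpoint for sends):

/* Simplified sketch of the send loop: -FI_EAGAIN from fi_send() is
 * retried after driving provider progress by polling the send CQ. */
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

static ssize_t send_with_retry(struct fid_ep *ep, struct fid_cq *txcq,
                               const void *buf, size_t len, fi_addr_t peer)
{
    struct fi_cq_entry comp;
    ssize_t ret;

    do {
        ret = fi_send(ep, buf, len, NULL, peer, NULL);
        if (ret == -FI_EAGAIN) {
            /* progress the provider; -FI_EAGAIN from fi_cq_read()
             * simply means the completion queue is empty */
            (void) fi_cq_read(txcq, &comp, 1);
        }
    } while (ret == -FI_EAGAIN);

    return ret;
}

In the failing case this loop never exits, which matches the connection never being established.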
Note also that this is a race condition; sometimes the connections are established without issue and data is transferred between the hosts fine.
I'm just wondering if anyone else has experienced this. I have some ideas on how to resolve it, but any advice on how best to fix this would be appreciated.
Thanks.