Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prov/rxm: add FI_PEER support to rxm #10510

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open

Conversation

aingerson
Copy link
Contributor

Add rxm support for FI_PEER (srx, cq, and cntr) and FI_AV_USER_ID to be used with the link provider

@aingerson
Copy link
Contributor Author

(will move first 3 commits into separate PR, leaving here for testing)

Comment on lines 144 to 176
ret = ofi_cq_write_error_trunc(rx_buf->ep->util_ep.rx_cq,
ret = ofi_peer_cq_write_error_trunc(rx_buf->ep->util_ep.rx_cq,
rx_buf->recv_entry->context,
rx_buf->recv_entry->comp_flags |
rx_buf->pkt.hdr.flags,
Copy link
Contributor

@j-xiong j-xiong Nov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alignment is off after the change.

rxm_cq_write(rxm_ep->util_ep.tx_cq, app_context,
comp_flags, 0, NULL, 0, 0);
(void) ofi_peer_cq_write(rxm_ep->util_ep.tx_cq, app_context,
comp_flags, 0, NULL, 0, 0, FI_ADDR_NOTAVAIL);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alignment.


if (ofi_peer_cq_write_error(rxm_ep->util_ep.rx_cq, &err_entry))
FI_WARN(&rxm_prov, FI_LOG_CQ,
"Unable to ofi_peer_cq_Wwrite_error\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo "Wwrite"

Comment on lines 845 to 848
rxm_cq_write_rx_error(def_tx_entry->rxm_ep,
ofi_op_msg,
def_tx_entry->rndv_ack.rx_buf->
recv_entry->context, (int) ret);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try break the line after ( so that all the arguments are aligned. That's the preferred way if lines become too long with the regular style. Same for a few more places following this.

Comment on lines 77 to 78
ret = ofi_peer_cq_write_error_peek(rxm_ep->util_ep.rx_cq, tag,
context);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check the alignment.

Comment on lines 49 to 51
return util_srx_generic_recv(&rxm_ep->srx->ep_fid, msg->msg_iov,
msg->desc, msg->iov_count, msg->addr, msg->context,
flags | rxm_ep->util_ep.rx_msg_flags);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alignment.

Comment on lines 66 to 67
return util_srx_generic_recv(&rxm_ep->srx->ep_fid, &iov, &desc, 1,
src_addr, context, rxm_ep->util_ep.rx_op_flags);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alignment.

Comment on lines 77 to 78
return util_srx_generic_recv(&rxm_ep->srx->ep_fid, iov, desc, count,
src_addr, context, rxm_ep->util_ep.rx_op_flags);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alignment.

Comment on lines 75 to 77
return util_srx_generic_trecv(&rxm_ep->srx->ep_fid, &iov, &desc, 1,
src_addr, context, tag, ignore,
rxm_ep->util_ep.rx_op_flags);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alignment off by 1 space.


if (attr->op_flags & FI_PEER) {
rxm_domain->srx = ((struct fi_peer_srx_context *)
(context))->srx;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check if srx has already been set?

@aingerson
Copy link
Contributor Author

@j-xiong Almost all of the alignment you pointed out are places where I did it on purpose to maintain the 80 character limit. Do you want me to change all of those to adding a new line after the open bracket? It looks weird to me and I don't think we do it most places in libfabric but I can change them all if you like.

@j-xiong
Copy link
Contributor

j-xiong commented Nov 21, 2024 via email

@aingerson
Copy link
Contributor Author

Update -
Fixed alignment and formatting issues
Rewrote bind path
Check for already existing srx on create and bind paths

Copy link
Contributor

@j-xiong j-xiong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

@j-xiong
Copy link
Contributor

j-xiong commented Nov 22, 2024

@shijin-aws Is the AWS CI failure related?

@j-xiong
Copy link
Contributor

j-xiong commented Nov 23, 2024

bot:aws:retest

@j-xiong
Copy link
Contributor

j-xiong commented Nov 23, 2024

2 mpichtestsuite test cases failed with Intel CI over verbs_rxm:

not ok 417 - ./pt2pt/icprobe 4 # time=180.3842689991
APPLICATION TIMED OUT, TIMEOUT = 180s

not ok 418 - ./pt2pt/icprobe 4 # time=180.858914136887
APPLICATION TIMED OUT, TIMEOUT = 180s

The rxm SAR segment type enum was defined inside another struct.
While techincally ok, this made it difficult for editors to find
the type and reported compiler errors. This cleans it up to make
it more readible and easier for editors to find the type

Signed-off-by: Alexia Ingerson <[email protected]>
Add application side support for FI_AV_USER_ID which requires
saving the fi_addr input as the internal fi_addr (for both the
peer API srx use case and for reporting unique source address
information).
When supporting the capability for the application, remove it
form the core provider information as it is only required on the
top layer

Signed-off-by: Alexia Ingerson <[email protected]>
Support using the peer APIs by default using the util peer helper
functions. Instead of going through the rxm-specific functions to write
to CQs and counters, use the ofi_peer_cq/cntr APIs which use the
owner ops. In the default case where rxm is not being used as a peer
these will go to the regular ofi_cq_write functions.

Signed-off-by: Alexia Ingerson <[email protected]>
Remove rxm implementation of receive queues and leverage the util
srx implementation which supports the peer srx API. This allows rxm
to use the peer API calls to match receives.

To do this, move the rxm protocol information from the receive entry
into the rx_buf and allocate it dynamically as needed to track protocol
information. This allows rxm to use the default peer_rx_entry instead of
its own custom receive entry.

With this last piece of the peer API implemented, rxm can also now
advertise full support of the FI_PEER capability. Just like the
FI_AV_USER_ID capability, rxm removes the bit from the core provider
info as it is only a requirement from the application side and not
from the message provider

Signed-off-by: Alexia Ingerson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants