Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make util_ns fields: run and is_initialized atomics #10570

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

dsciebu
Copy link

@dsciebu dsciebu commented Nov 21, 2024

Values of util_ns fields can be read/write in parallel from multiple threads. Making them atomics secures these kind of access.

Signed-off-by: Dariusz Sciebura [email protected]

@dsciebu dsciebu changed the title Make util_ns fileds: run and is_initialized atomics Make util_ns fields: run and is_initialized atomics Nov 21, 2024
return;

ofi_atomic_initialize32(&ns->ref, 0);
ns->listen_sock = INVALID_SOCKET;
if (!ns->hostname)
ns->hostname = OFI_NS_DEFAULT_HOSTNAME;
ns->is_initialized = 1;
ofi_atomic_set32(&ns->is_initialized, 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name service is basically a simple way to support running fabtests. It's not something we expect to have enabled as a general purpose feature.

If it were, the changes here are not sufficient to support multiple threads. Locks would be needed. It's ultimately the responsibility of the caller to have appropriate serialization to the name service calls.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the tip about the name service. I found that it is by default enabled for verbs provider, which, in case of a multithreaded application, leads to data races triggered by e.g. parallel fi_close and util_ns_accept_handler execution. Don't you think, that by as you said: "not expecting this option to be enabled as a general purpose feature" you should default this option to OFF?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you describe what sort of endpoints the app is using? It looks like the name service is related to dgram ep's, and only started if a verbs domain is opened for dgram ep's. Note that this can occur through the rxd provider for rdm ep's.

@j-xiong - I see that the name server is enabled by default, but disabled if specific, unrelated environment variables are set (OMPI_COMM_WORLD_RANK or PMI_RANK). Should the default be to disable it, and use an environment variable to enable instead?

I'm unsure how frequently the verbs dgram provider is used in practice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shefty The name server is needed for verbs dgram to resolve "node" into dst_addr in fi_getinfo(). By default this is a feature an application may use. MPIs (OpenMPI, MPICH, Intel MPI) don't use this feature so the name server can be safely disabled.

Copy link
Author

@dsciebu dsciebu Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My experiments that revealed the issue were made on a Mellanox RDMA card. Using that hardware and setting up the hints the way that rdm fabtests do, I ended up in getting an RXD protocol 'based' endpoint.
When checking my libfabric based app's test, I noticed that Thread Sanitizer reports data race in libfabric and ended up here. Turning off the 'name service' with the proper env variable solves the problem completely.
So, IMO it should be:
a) turned off by default if not necessary or
b) made thread safe.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dsciebu - I agree. Can you open an issue to track this? The name service should be made thread safe, which I'm guessing won't be that difficult. Likely just needs a lock around most of the functions you identified.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants