Can parallel netCDF (based on HDF5) allow some MPI tasks (in the same communicator) to skip all actual IO operations? #407
Replies: 5 comments
-
In collective mode the program will hang until all participating MPI processes contribute. But if this code works with HDF5-1.10.6 and not with HDF5-1.14.0, that is a problem. Can you try with HDF5-1.14.1?
-
@edwardhartnett Thank you!
-
This is expected behavior: in a collective operation ALL processes must participate. IIRC, processes that have no data to write can still call the netCDF put function, but with a data size of 0. This causes them to participate in the collective write even though they write nothing.
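A minimal sketch of this pattern in Fortran (the variable, dimension, and rank-decomposition names such as `ncid`, `varid`, `my_start`, and `my_n` are illustrative, not from the original code):

```fortran
! Sketch only: assumes a netCDF-4 file already opened in parallel with
! nf90_open(..., comm=..., info=...) and a 1-D variable with id varid.
ierr = nf90_var_par_access(ncid, varid, nf90_collective)

if (my_n > 0) then
   ! Ranks that own data write their slab.
   ierr = nf90_put_var(ncid, varid, local_data, &
                       start=(/ my_start /), count=(/ my_n /))
else
   ! Ranks with nothing to write still enter the collective call,
   ! but with a zero-sized count so no data is transferred.
   ierr = nf90_put_var(ncid, varid, empty_data, &
                       start=(/ 1 /), count=(/ 0 /))
end if
```

The point of the zero-count branch is that every rank in the communicator makes the same collective call, which is what keeps the write from hanging.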
-
Best performance usually comes from collective writes.
-
@edhartnett Thanks. Now I have a chance to put together some test codes to verify/clarify the behavior under different versions of the libraries. I may share some updates or ask you questions outside this Q&A, and will finally post an update here.
-
Hi, netCDF developers and community,
First, thanks a lot for your help with the question raised below.
The problem concerns parallel netCDF based on HDF5 (i.e., netCDF-4).
I am a little confused because our program hangs when newer netCDF/HDF5 libraries are used (the key change is probably hdf5/1.14.0 replacing the previous hdf5/1.10.6).
The issue is NOAA-EMC/GSI#447.
My colleagues find the problem arises because, for example, all MPI tasks execute the nf90_open line:

```fortran
nf90_open(filenamein, nf90_nowrite, gfile_loc, comm=mpi_comm_world, info=MPI_INFO_NULL)
```
But by design, some MPI tasks never perform any netCDF IO operations under certain setups.
Those tasks, which belong to mpi_comm_world and run the nf90_open call, hang if they never enter any actual netCDF IO function.
Since this code ran smoothly with the older libraries, could you please confirm or clarify what the parallel netCDF standard says about this situation? Is it related to the choice of independent or collective mode via NF90_VAR_PAR_ACCESS?
Your help is appreciated.
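For context, a hedged sketch of the NF90_VAR_PAR_ACCESS setting mentioned above (`ncid` and `varid` are placeholders; whether this avoids the hang in our case is exactly what I am asking):

```fortran
! Independent access lets each rank read/write this variable on its own,
! so ranks that skip the IO do not block the others for that variable.
ierr = nf90_var_par_access(ncid, varid, nf90_independent)
```

Note that, as I understand it, operations like nf90_close remain collective on a file opened with a communicator, so every rank that called nf90_open must still call them.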