dump walkers periodically for post-processing #4940
base: develop
Conversation
append block index after walkers
in addition to walkers
Hi Paul. Thanks for adding/restoring this ghost feature. As we have discussed, this functionality is useful for a variety of purposes, and clearly it could be evolved further. For now, two questions:
I only did an initial pass and don't fully understand the change yet. Requested more documentation.
@@ -52,7 +52,7 @@ class HDFWalkerOutput
   /** dump configurations
    * @param w walkers
    */
-  bool dump(const WalkerConfigurations& w, int block);
+  bool dump(const WalkerConfigurations& w, int block, const bool identify_block=false);
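A hedged sketch of the documentation being requested below, written as a Doxygen comment on the new declaration (the `@param` wording is my assumption based on the naming discussion later in this thread, not the author's actual docs):

```cpp
/** dump configurations
 * @param w walkers
 * @param block current block index
 * @param identify_block when true, write to a block-tagged dataset
 *        ("walkers{block}") so every dump is preserved for
 *        post-processing; when false (default), overwrite the single
 *        "walkers" dataset as a normal checkpoint does
 */
bool dump(const WalkerConfigurations& w, int block, const bool identify_block = false);
```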
Please document what identify_block=true does.
  std::vector<std::string> names = {"block", hdf::num_walkers, partition_name, dataset_name, weights_name};
  for (auto aname : names)
    if (hout.is_dataset(aname)) hout.unlink(aname);
}
Could you mention the dataset name used for recording instead of checkpointing?
The dataset name for recording is `walkers{block_idx}`, whereas the name for a checkpoint is simply `walkers`.
@@ -62,7 +62,7 @@ class HDFWalkerOutput
   std::array<BufferType, 2> RemoteData;
   std::array<std::vector<QMCTraits::FullPrecRealType>, 2> RemoteDataW;
   int block;
-  void write_configuration(const WalkerConfigurations& W, hdf_archive& hout, int block);
+  void write_configuration(const WalkerConfigurations& W, hdf_archive& hout, int block, const bool identify_block);
Please document what identify_block=true does.
The asan failure was found to be caused by the CI OS image. It has been fixed.
I also noticed that the added change only affects the legacy drivers. Out of curiosity, what is missing for you to adopt the batched drivers?
With regards to batched, I want to remove WalkerConfiguration as an object and heavily refactor to eliminate the "Walker --> Particle" objects. The Walker object, and some of its data possessions stuffed into ParticleSet, is full of values that should be contiguous vectors at crowd or even rank scope, but are instead held in an AOS with respect to walkers and mapped into a byte buffer. As it stands, all this "dumping walkers" functionality is predicated on that; pointer arithmetic is done, etc. The less dependency there is on this, the better.

I think we would be better off writing an "estimator" that writes the per-walker values out. The current approach exposes legacy walker implementation details that we don't want to carry forward either; to the extent that some still exist, that is because refactoring ParticleSet has been blocked for several years now. If you want to dump walkers, it should be a crowd- or population-level operation so it can be synchronized properly and done efficiently. If you want to use parallel HDF5, the appropriate level would be MCPopulation. Ideally it would be done from a separate I/O thread while real work is being done by the main crowd threads.
@prckent apologies for the delayed response. To answer your questions:
I did not test the energies and weights. The same
No, I did not add this feature to the batched drivers. The current implementation breaks down in an MPI run; I think this is due to problems with parallel HDF5 file writing. @ye-luo @PDoakORNL I would love some help getting this feature to work.
Hi Paul - Is #5019 working for you? If so, we can close this PR.
@prckent please keep this PR alive. It is needed for different uses, like replay.
@prckent yes, #5019 does what I need. @ye-luo would it be easier to add a "replay" flag to the new WalkerLog class?
WalkerLog has very different design logic and use cases. For example, it writes one file per MPI rank instead of using scalable I/O. Data are not organized in steps and thus require extra work to recover a true replay.
Proposed changes
Repurpose the unused input "record_configs" to set the frequency of dumping walkers for post-processing.
Example input of a simple harmonic oscillator: tb38_msho.zip
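A hedged sketch of how the repurposed parameter might appear in an input file (the `<qmc>`/`<parameter>` element style is standard QMCPACK input, but the exact placement and value here are my assumptions; the real example is in the attached zip):

```xml
<qmc method="dmc" move="pbyp">
  <!-- assumed usage: record_configs > 0 dumps walkers every N blocks -->
  <parameter name="record_configs">20</parameter>
</qmc>
```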
A normal checkpoint file `{prefix}.config.h5` is overwritten at every checkpoint. Its `walkers` dataset contains the snapshot at the checkpoint.

With "record_configs" > 0, each addition to `config.h5` is preserved. Instead of a single `walkers` dataset, we need to identify the block at which the walkers are dumped. The number `32` in the name `walkers32` identifies that these walkers were dumped at block `32`.

What type(s) of changes does this code introduce?
Does this introduce a breaking change?
What systems has this change been tested on?
Intel workstation
Checklist
Update the following with a yes where the items apply. If you're unsure about any of them, don't hesitate to ask. This is
simply a reminder of what we are going to look for before merging your code.