-
Notifications
You must be signed in to change notification settings - Fork 140
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Driver-portable walker logging #5019
Conversation
//add trace information | ||
bool allow_walker_logs = das.walkerlogs_tag == "yes" || | ||
(das.walkerlogs_tag == "none" && (das.new_run_type == QMCRunType::VMC || das.new_run_type == QMCRunType::DMC || das.new_run_type == QMCRunType::VMC_BATCH || das.new_run_type == QMCRunType::DMC_BATCH)); | ||
new_driver->requestWalkerLogs(allow_walker_logs); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Complicated control logic. Prefer not to have another switch. I would like to have the following.
if "walkerlog" appears globally, all the drivers honor it. if not, each driver honor the "walkerlog" within the driver input section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreeing with Ye here - we use similar logic for maxcpusecs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The control logic protects drivers which either don't support walker logging, or shouldn't.
Traces were never enabled for optimizers, for instance. On another note, I really don't want to write code to support walker logging in all of the legacy drivers. CSVMC anyone?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Further comments welcome.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current walker logging support is limited to VMC/DMC and its functionality and implementation can be completely different for a non-VMC/DMC driver. So its input is more suited for a driver input section rather than globally. If only some VMC/DMC drivers need to turn on logging, the input should be in the driver input section. Right now, we provide convenience to have the input globally and I believe it is better not to another switch to turn it off locally. To me, that is a bad design logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for the other protection, setting allow_walker_logs
correctly to false
for unsupported drivers is what makes the WalkerLogManager
/WalkerLogCollector
operations a no-op.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then default value of allow_walker_logs
is already false
to all the unsupported drivers. Why do we even need to set it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Too much details. I would like to discuss the user-facing input logic first and then we can address the implementation issue, maybe at a later PR.
Are you OK with the following input logic
- if the global
<walkerlog/>
exists, turn on logging in all the supported drivers. - when the global doesn't exist, legacy drivers do nothing. batched drivers will look for driver scope
<walkerlog/>
and act on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will all be simpler once drivers actually parse their own input. Right now driver input parsing is handled in QMCDriverFactory. I suggest we focus on that problem and this one will sort itself out.
I have merged in develop and fixed cmake issues. Please pull before making further changes. |
I've addressed Peter's comment from #5035 requesting simplification of WalkerLogState. |
I've also tried Peter's suggestion to use |
Comments addressed. Ready to proceed. |
You need h5d_append and we don't have a high-level API in hdf_archive right now. added issue #5038. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still quite a few fixes are needed but it is not necessary to hold the PR.
Test this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to hold some sort of line to model good simulation object life cycle control and input handling for the batched code.
@@ -268,7 +268,6 @@ void OperatorBase::deleteTraceQuantities() | |||
have_required_traces_ = false; | |||
request_.reset(); | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be in the change set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding an unecessary file to the PR file list for white space change has never been allowed. I have had to fix such changes many times. Finding these sort of typos takes effort by reviewers and they stack up over time.
@@ -75,6 +75,9 @@ class QMCMain : public MPIObjectBase, public QMCAppBase | |||
///xml mcwalkerset read-in elements | |||
std::vector<xmlNodePtr> walker_set_in_; | |||
|
|||
///walkerlogs xml | |||
xmlNodePtr walker_logs_xml_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please do something like
std::optional<WalkerLogInput> walker_log_input_;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also don't like storing xml, parsing the xml in-place and store the result seems cleaner. We can make some improvements with a later PR.
@@ -304,6 +309,12 @@ bool VMCBatched::run() | |||
IndexType num_blocks = qmcdriver_input_.get_max_blocks(); | |||
//start the main estimator | |||
estimator_manager_->startDriverRun(); | |||
//start walker log manager | |||
wlog_manager_ = std::make_unique<WalkerLogManager>(walker_logs_input, allow_walker_logs, get_root_name(), myComm); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please construct this in the QMCDriverNew constructor as EstimatorManagerNew is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should not be constructed in the constructor. I have moved this line to process() in #5039 at the moment. When the global input doesn't exist, we may look it up locally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Process is also an API element that only exists for legacy compatibility, I'd be very suprised if construction actually needed to delayed until then, but lets discuss later.
/// whether to allow walker logs | ||
bool allow_walker_logs; | ||
/// walker logs input | ||
WalkerLogInput walker_logs_input; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This flag plus optional input object optional unique_pointer_ too much optional stuff. You can either check the presence of the object or the input, the allow is unecessary.
@@ -241,6 +249,10 @@ class QMCDriverNew : public QMCDriverInterface, public MPIObjectBase | |||
void putTraces(xmlNodePtr txml) override {} | |||
void requestTraces(bool allow_traces) override {} | |||
|
|||
void putWalkerLogs(xmlNodePtr wlxml) override; | |||
|
|||
void requestWalkerLogs(bool allow_walker_logs_) override { allow_walker_logs = allow_walker_logs_; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These seem to be API elements required by legacy, they shouldn't do anything for batched and run counter to clean construction. This should be documented here and and note to remove with the legacy drivers added.
} | ||
|
||
void Crowd::stopBlock() { estimator_manager_crowd_.stopBlock(); } | ||
|
||
void Crowd::collectStepWalkerLog(int current_step) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logWalkers(int current_step)
Seems to say everything necessary.
@@ -429,6 +432,14 @@ bool DMCBatched::run() | |||
IndexType num_blocks = qmcdriver_input_.get_max_blocks(); | |||
|
|||
estimator_manager_->startDriverRun(); | |||
|
|||
//start walker log manager | |||
wlog_manager_ = std::make_unique<WalkerLogManager>(walker_logs_input, allow_walker_logs, get_root_name(), myComm); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea of the log manager being a run scope object is promising but then it should not be a member variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. We may get a chance to make it a run scope object after restructuring certain bits. Will give it a try once #5039 got merged.
Hi guys. I had thought this was going in given Ye's approval. If there really is a short list of "must have" items still remaining for this PR after all these passes, please make a short, final list below and I will address them. I need to get past this point and get back to other work, some of which depends on this code being in place. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don’t feel there are blocking issues although many improvements are still needed.
Reverting the whitespace OperatorBase.cpp change to reduce the number files is the only must have. |
Test this please |
The requested renaming has been completed.
Proposed changes
This PR introduces a simplified version of the legacy TraceManager that focuses solely on writing per-walker data to HDF5 throughout the run (no streaming/listener functionality). The new implementation, WalkerLogManager, is compatible with and implemented in both the legacy and batched drivers.
This implementation extends the original to include the ability to write full data for the highest, lowest, and median energy walkers on each rank to disk at every step. The intention of writing this abbreviated data is to enable inexpensive access to data from problematic walkers (e.g. population explosions) in production runs.
Capabilities
Below is an example generated from a short DMC run of diamond on 6 MPI ranks showing the energies of all walkers (blue), highest energy walkers (green), lowest energy walkers (red), median energy walkers (black), vs the mean energy from dmc.dat (magenta line):
Design
Tests
What type(s) of changes does this code introduce?
Does this introduce a breaking change?
What systems has this change been tested on?
Laptop
Checklist
Path out of WIP
Tasks for review round 3
Elements for reviewers to review