Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Skipping over monitors at the end of job. #185

Open
nwinner opened this issue Nov 16, 2021 · 0 comments
Open

[Bug] Skipping over monitors at the end of job. #185

nwinner opened this issue Nov 16, 2021 · 0 comments

Comments

@nwinner
Copy link
Contributor

nwinner commented Nov 16, 2021

System

  • Custodian version: https://github.com/nwinner/custodian branch cp2k. Issue does not relate to this branch, but I am technically using handlers that exist in my fork, rather than already exist in the main repo.
  • Python version: 3.7

Summary

Consider a handler that checks for convergence. It checks to see if convergence is not reached while the job is running, and so it is a monitor. It can happen, however, that the program finishes because max steps reached and exists before custodian cycles to do another check of the monitors. It will proceed to do the final check

It appears that the way that custodian is written, if any error was found and stored in the bool variable has_error, then instead of running a final check with all handlers, only ones with is_monitor=False will be checked. Because of this, my convergence handler was not included in the final check, and so it was bypassed and the job was marked as completed because a different handler addressed the problem that caused has_error.

Maybe I'm not understanding the logic of the code, but this does not seem to be the way it should function. Clarifications would be appreciated if this is intended. I'm not sure about a proposed solution because there could be collisions where running every handler at the end causes a monitor to be run twice.

Example code

This code snippet is taken from custodian.py:448

            # While the job is running, we use the handlers that are
            # monitors to monitor the job.
            if isinstance(p, subprocess.Popen):
                if self.monitors:
                    n = 0
                    while True:
                        n += 1
                        time.sleep(self.polling_time_step)
                        if p.poll() is not None:
                            break
                        terminate = self.terminate_func or p.terminate
                        if n % self.monitor_freq == 0:
                            has_error = self._do_check(self.monitors,
                                                       terminate)
                        if terminate is not None and terminate != p.terminate:
                            time.sleep(self.polling_time_step)
                else:
                    p.wait()
                    if self.terminate_func is not None and \
                            self.terminate_func != p.terminate:
                        self.terminate_func()
                        time.sleep(self.polling_time_step)

                zero_return_code = p.returncode == 0

            logger.info("{}.run has completed. "
                        "Checking remaining handlers".format(job.name))
            # Check for errors again, since in some cases non-monitor
            # handlers fix the problems detected by monitors
            # if an error has been found, not all handlers need to run
            if has_error:
                self._do_check([h for h in self.handlers
                                if not h.is_monitor])
            else:
                has_error = self._do_check(self.handlers)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant