[Bug] Skipping over monitors at the end of job. #185

nwinner · 2021-11-16T19:57:55Z

System

Custodian version: https://github.com/nwinner/custodian branch cp2k. Issue does not relate to this branch, but I am technically using handlers that exist in my fork, rather than already exist in the main repo.
Python version: 3.7

Summary

Consider a handler that checks for convergence. It checks to see if convergence is not reached while the job is running, and so it is a monitor. It can happen, however, that the program finishes because max steps reached and exists before custodian cycles to do another check of the monitors. It will proceed to do the final check

It appears that the way that custodian is written, if any error was found and stored in the bool variable has_error, then instead of running a final check with all handlers, only ones with is_monitor=False will be checked. Because of this, my convergence handler was not included in the final check, and so it was bypassed and the job was marked as completed because a different handler addressed the problem that caused has_error.

Maybe I'm not understanding the logic of the code, but this does not seem to be the way it should function. Clarifications would be appreciated if this is intended. I'm not sure about a proposed solution because there could be collisions where running every handler at the end causes a monitor to be run twice.

Example code

This code snippet is taken from custodian.py:448

            # While the job is running, we use the handlers that are
            # monitors to monitor the job.
            if isinstance(p, subprocess.Popen):
                if self.monitors:
                    n = 0
                    while True:
                        n += 1
                        time.sleep(self.polling_time_step)
                        if p.poll() is not None:
                            break
                        terminate = self.terminate_func or p.terminate
                        if n % self.monitor_freq == 0:
                            has_error = self._do_check(self.monitors,
                                                       terminate)
                        if terminate is not None and terminate != p.terminate:
                            time.sleep(self.polling_time_step)
                else:
                    p.wait()
                    if self.terminate_func is not None and \
                            self.terminate_func != p.terminate:
                        self.terminate_func()
                        time.sleep(self.polling_time_step)

                zero_return_code = p.returncode == 0

            logger.info("{}.run has completed. "
                        "Checking remaining handlers".format(job.name))
            # Check for errors again, since in some cases non-monitor
            # handlers fix the problems detected by monitors
            # if an error has been found, not all handlers need to run
            if has_error:
                self._do_check([h for h in self.handlers
                                if not h.is_monitor])
            else:
                has_error = self._do_check(self.handlers)

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] Skipping over monitors at the end of job. #185

[Bug] Skipping over monitors at the end of job. #185

nwinner commented Nov 16, 2021

[Bug] Skipping over monitors at the end of job. #185

[Bug] Skipping over monitors at the end of job. #185

Comments

nwinner commented Nov 16, 2021

System

Summary

Example code