
Insufficient information stored to recover parent-child relationship between jobs from their JobStore output docs? #374

Open
mkhorton opened this issue Jul 20, 2023 · 7 comments

Comments

@mkhorton
Member

Please advise if I've misinterpreted the code/docs.

Assume:

  • Storing Job outputs via JobStore.
  • Not using a workflow manager, i.e. using jobflow directly.

For a given document in the JobStore, I can see uuid, and I can also see hosts (which can be used to tell that two Jobs belong to the same Flow). However, as far as I can see, there is no way to recover the dependency relationship between two or more Job output documents. Is this correct?

If correct, is this intended usage? What would be a minimal way to retain this information, without adding a dependency on a specific workflow manager?
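
For reference, this is roughly how I'm inspecting the documents (a minimal sketch against the default JobStore; the job name is just a placeholder):

```python
from jobflow import SETTINGS

# Sketch only: pull one output document from the default JobStore and look at
# what's in it. "my_job" is a placeholder name.
store = SETTINGS.JOB_STORE
store.connect()

doc = store.query_one({"name": "my_job"})
print(doc["uuid"])   # identifies this job
print(doc["hosts"])  # uuids of the Flow(s) this job belongs to
# ...but nothing here tells me which other job uuids this job depends on.
```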

@utf
Member

utf commented Jul 27, 2023

Hi @mkhorton, this is something I've spoken to @gpetretto and @davidwaroquiers about. I believe the only other information you need to resolve the job dependencies is the OutputReferences in the job inputs. These are available through the job.input_references property.

The simplest way to enable this would be:

  1. At the beginning of the job.run function, copy the output of job.input_references. We have to copy them at the beginning because the job.resolve_args function resolves the references in place, so by the end of the function the original input references are no longer available.
  2. Add a new field, "input_references", to the data stored at the end of job.run. E.g., here:
    "uuid": self.uuid,

You should then be able to construct the entire flow (including nested flows) and the dependencies between jobs. The only information that will be missing is the names of the Flows (the names of the jobs are fine). The reason is that we don't store flows in the database directly.
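
To illustrate, once such a field exists, recovering the parent → child edges from the store could look something like this (a sketch only; it assumes the new "input_references" field stores the uuids of the referenced jobs):

```python
from collections import defaultdict

def build_dependency_graph(store):
    """Map each job uuid to the uuids of the jobs that depend on it.

    Sketch only: assumes Job.run has been updated to store an
    "input_references" field containing the uuids of the OutputReferences
    found in the job's inputs.
    """
    children_of = defaultdict(set)
    for doc in store.query(properties=["uuid", "input_references"]):
        for parent_uuid in doc.get("input_references", []):
            children_of[parent_uuid].add(doc["uuid"])
    return children_of
```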

@mkhorton
Member Author

mkhorton commented Aug 3, 2023

Thanks for the reply @utf, good to know I wasn't missing anything obvious.

I'll see if I can make a PR to add this, unless @gpetretto or @davidwaroquiers are already working on it? If it'd be welcome, I'd quite like to add a pydantic.BaseModel to describe the JobStore document format too.
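
Something along these lines, perhaps (just a sketch; the field names are my guesses at what Job.run stores today, plus the input_references field proposed above, so they'd need checking against the actual code):

```python
from datetime import datetime
from typing import Any, Optional

from pydantic import BaseModel, Field


class JobStoreDocument(BaseModel):
    """Sketch of a schema for JobStore output documents (field names guessed)."""

    uuid: str                                 # unique identifier of the job
    index: int = 1                            # run index of the job
    name: Optional[str] = None                # job name
    output: Any = None                        # serialized job output
    completed_at: Optional[datetime] = None
    metadata: dict = Field(default_factory=dict)
    hosts: list[str] = Field(default_factory=list)             # uuids of enclosing Flow(s)
    input_references: list[str] = Field(default_factory=list)  # proposed: parent job uuids
```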

@utf
Member

utf commented Aug 14, 2023

A PR would be very welcome. And yes, agreed that we should have a document model for the job store document.

@davidwaroquiers
Contributor

It would indeed be very useful to be able to "reconstruct" the Flow(s) after they have run (or while they are running) in order to visualize them. We have indeed already discussed this but haven't started working on it. This issue also falls within a set of other features that would be nice to have and are somewhat interconnected. I would like to raise the idea of having a meeting with the most active developers/contributors in order to list these out and plan the short/mid-term developments. @utf, what do you think?

@mcgalcode
Contributor

@mkhorton did you end up starting work on this? I offered to make some contributions to jobflow and would love to tackle this one; I'm planning to start working on it now. Happy to hold off or coordinate, though, if you have any concerns or a WIP.

@mkhorton
Member Author

By all means, Max, go ahead! I do not have a WIP. Let me know if you run into any problems, though (perhaps open a PR early so anyone interested can comment?).

@mcgalcode
Contributor

Sounds good Matt! Early PR is a good idea for sure.
