Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add MATRIX as grid provider #161

Open
wants to merge 11 commits into
base: master
Choose a base branch
from
Open

Add MATRIX as grid provider #161

wants to merge 11 commits into from

Conversation

cschwan
Copy link
Contributor

@cschwan cschwan commented Nov 25, 2022

This branch currently adds a ./run_matrix.sh BASH script because I can iterate very quickly with it, but the final product will be integrated into the existing Python program, of course.

@cschwan cschwan self-assigned this Nov 25, 2022
@codecov-commenter
Copy link

codecov-commenter commented Nov 25, 2022

Codecov Report

Base: 31.44% // Head: 31.44% // No change to project coverage 👍

Coverage data is based on head (b03fbd7) compared to base (9c1f06f).
Patch has no changes to coverable lines.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #161   +/-   ##
=======================================
  Coverage   31.44%   31.44%           
=======================================
  Files          20       20           
  Lines         954      954           
=======================================
  Hits          300      300           
  Misses        654      654           
Flag Coverage Δ
unittests 31.44% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

- MATRIX seems to be sensitive to the string after `run_` for the name
  of the run directory. To be safe, choose the next integer
- do not copy runtime.dat - this doesn't seem to be needed
- do not manually copy the files but instead rely on MATRIX's `tar_run`
  and `delete_run` features
The `log` doesn't exist after a process has been generated for the first
time
@felixhekhorn felixhekhorn mentioned this pull request Dec 15, 2022
@alecandido
Copy link
Member

@cschwan @felixhekhorn this PR has eventually to be split:

  • the run_matrix.sh script has to be converted into an external module for PineFarm
  • the pinecard has to be simply lifted top-level (since pinecards folder does not exist any longer)

However, I would recommend to keep as it is for the time being, and go for the final layout only when Matrix will be approximately ready to be used.

@felixhekhorn
Copy link
Contributor

maybe @comane and/or @t7phy and/or @andreab1997 can take over?

@alecandido
Copy link
Member

maybe @comane and/or @t7phy and/or @andreab1997 can take over?

It would actually be appreciated. I could provide help, but consider also the tutorial by @scarlehoff

https://nnpdf.github.io/pineline/tutorials/vrap

(and consider to contribute yourself to Pineline documentation)

@felixhekhorn
Copy link
Contributor

@comane and @t7phy let's discuss better here - as already said even better if you can turn this into a pinefarm PR (should we just copy everything into an issue there @cschwan ?)

what you said in NNPDF/pinefarm#52 sounds very worrisome - the whole purpose of the pineline is to ensure reproducibility and you tell me you start tweaking at random again ... note that this does does not refer to the complexity of a single runcard

We really want to be reproducible and to track metadata (the whole mess about the pineline) and I would say that is a hard requirement for NNPDF4.1 (or whatever comes next). And I am quite confident that the pineline structure is flexible enough to allow for whatever Matrix business you need (of which I have no idea).

Maybe we can discuss further in a meeting? (if possible including @cschwan ? )

@t7phy
Copy link
Member

t7phy commented Jan 19, 2024

@comane and @t7phy let's discuss better here - as already said even better if you can turn this into a pinefarm PR (should we just copy everything into an issue there @cschwan ?)

what you said in NNPDF/pinefarm#52 sounds very worrisome - the whole purpose of the pineline is to ensure reproducibility and you tell me you start tweaking at random again ... note that this does does not refer to the complexity of a single runcard

We really want to be reproducible and to track metadata (the whole mess about the pineline) and I would say that is a hard requirement for NNPDF4.1 (or whatever comes next). And I am quite confident that the pineline structure is flexible enough to allow for whatever Matrix business you need (of which I have no idea).

Maybe we can discuss further in a meeting? (if possible including @cschwan ? )

@felixhekhorn the tweaks I mentioned do not prevent reproducibility whatsoever, there are 4 files that contain all the info to reproduce a run and we can easily store them in pinecard repo, my point was w.r.t. running matrix through pinefarm.
one example is that for a ttbar run at NNLO with an precision of 0.05%, I require about ~800-900 jobs in the main run that can each range from 1 day to 2.5 days. There are 3 stages to a matrix run, warm up, pre run and main run. After pre run, matrix provides you with the jobs and time required and based on the cluster and queue rules, one may want to limit job times, o number of jobs to achieve a desired balance.
another example is that if I have to generate a differential distribution grid for both atlas and cms results (with different binning), the automated approach would be to have a single run per distribution however that is computationally wasteful, I define multiple distributions with different binning but then during the run, they need to be named in a 'non-elegant' way to have them as separate distributions which affects its labelling, and I fix them after grids are generated.
These are just 2 examples out of many considerations, none has to do with reproducibility.
also besides that, there is now another bug in matrix currently, which we are able to get around manually, but an automated code might ofcourse give errors.

@felixhekhorn
Copy link
Contributor

Neither of your two points I consider a valid argument against integrating matrix

let me address the last point first: this is already long planned NNPDF/pinefarm#11 - though we never found the time to work on that; @cschwan already sketched a possible solution there; at the moment there isn't a 1-on-1 mapping of pinecards to grids so far (because one might need to merge something), so we can easily have an n-on-m relation ...

In fact your first point reinforces my point of view, I think: most importantly if you already have some instruction (in whatever form and whatever debug state) please commit them! and then we can iterate from there - as for the problem you describe, I think we can easily find a adjustment within pinefarm: recall that pinefarm is an interface and I'd say there is no need to mirror the arbitrary complex program from below to above, but we can just let them do there job - BUT we (=pinefarm) want to control the framework, which specifically concerns 1) the input and 2) the output and that will be the job of the pinefarm interface

the interface has to:

  1. read the one and only theory runcard that is defined by NNPDF and parse that to MATRIX; read some observables definition which are given from the outside (and which in the best case can be understood without deeper knowledge of MATRIX)
  2. compare in the end the native MATRIX output to the grid output; write all necessary metadata (including the initial instructions) consumed in the generation and expected from the outside

what you do in the middle can be negotiated (including whatever hack for MATRIX bugs), but the framework has to stand - please read the pineline paper where we argue in detail for that!

and it is about reproducibility, because to me this sounds much like you're becoming the new @enocera - the moment you would leave the collaboration, NNPDF is lost because we have no idea on how you obtained the results (may they be right or wrong)

@t7phy
Copy link
Member

t7phy commented Jan 22, 2024

@felixhekhorn you are right, none of the points are an argument against integrating matrix because I never made the point that we shouldn't integrate matrix, did I? 😄
I just pointed out that dealing with matrix is very finicky and we will need to have a dedicated discussion before working on this endeavor.

P.s. I don't plan to leave the collaboration for a long time ;)

@ramonpeter ramonpeter self-assigned this Oct 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants