Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Standardize approach for postprocessing data: use builders? #466

Open
mattmcdermott opened this issue Aug 9, 2023 · 4 comments
Open

Standardize approach for postprocessing data: use builders? #466

mattmcdermott opened this issue Aug 9, 2023 · 4 comments
Labels
discussion API design discussions

Comments

@mattmcdermott
Copy link
Member

mattmcdermott commented Aug 9, 2023

A point of confusion when writing a new Flow seems to be how to process the output data, particularly when it involves outputs from multiple jobs. Should we put postprocessing code in a job? Or should we use builders? Currently, there is only one builder in atomate2 (this is the ElasticBuilder):

class ElasticBuilder(Builder):

IMO we should almost always recommend using builders, which is already the standard for MP (and more robust). I took this strategy in the magnetic ordering workflow (WIP #432), where I implemented a builder under common that can be easily adapted to results from any DFT code. This may be a good design pattern (where possible).

Perhaps we should open this up for discussion?

@computron @utf

@utf
Copy link
Member

utf commented Aug 14, 2023

Overview

I think we should support both. Often you want to be able to use the outputs of flow in a subsequent calculation. E.g., the AMSET flow calculates the elastic constant as part of it. To avoid having to run the builders midway during a workflow, the elastic document should be calculated during the flow itself and stored in the database. Afterwards, if you want to recalculate the elastic constants using different settings, or say you calculated some more displacements, then you can run the builder to generate the elastic collection separately.

In my opinion, the cleanest way to do this is to share as much code as possible between the builder and the calculation of properties during the workflow. The design pattern I adopted was to put class methods on the property task documents that construct the document from some input data. The process then works like this:

  1. The workflow knows exactly which documents are needed to construct the property document as it has the full workflow graph. It can then just call the document class method to construct the property.
  2. The builder's job is to decide which calculations are needed to construct the property. This should be all of the logic of the builder, just deciding which calculations to use, extracting the relevant fields from them, and then passing them to the document class method.

Example

As an example, the ElasticTaskDocument has a method from_stresses.

def from_stresses(
cls,
structure: Structure,
stresses: List[Stress],
deformations: List[Deformation],
uuids: List[str],
job_dirs: List[str],
fitting_method: str = SETTINGS.ELASTIC_FITTING_METHOD,
order: Optional[int] = None,
equilibrium_stress: Optional[Matrix3D] = None,
symprec: float = SETTINGS.SYMPREC,
allow_elastically_unstable_structs: bool = True,
):

In the elastic workflow, this is called using the outputs of jobs in the workflow. This construction happens in a job:

return ElasticDocument.from_stresses(
structure,
stresses,
deformations,
uuids,
job_dirs,
fitting_method=fitting_method,
order=order,
equilibrium_stress=equilibrium_stress,
symprec=symprec,
allow_elastically_unstable_structs=allow_elastically_unstable_structs,
)

In the elastic builder, the builder collects the relevant documents and then calls the exact same method to construct the document:

return ElasticDocument.from_stresses(
structure,
stresses,
deformations,
uuids,
job_dirs,
fitting_method=fitting_method,
symprec=symprec,
)

I think this way we get the best of both worlds. Once you have written the task document and the class method, it is very little effort to call this method in the workflow. Creating the builder is more effort but none of it is duplicated, as you can simply reuse the same code for constructing the document. My suggestion would be to implement this approach in the magnetic ordering workflow. I also recommended this be implemented in the ferroelectric workflow too. E.g., see this comment: https://github.com/materialsproject/atomate2/pull/196/files#r1143386198

Documentation

If people agree with this approach, I suggest we add a page in the developer documentation about this design decision. Hopefully, once we have a few examples, this approach will be clearer to people.

Obviously I would appreciate input from other developers as to whether there are downsides or issues with this approach.

@mattmcdermott
Copy link
Member Author

mattmcdermott commented Aug 14, 2023

Excellent write-up! @utf

I agree -- if the schema construction approach is used, this shouldn't be too much extra work to implement for a developer.

My original hesitation with writing a postprocessing job was how to avoid getting stuck in workflows where failure of at least one job is very common (for example, a very unfavorable transformed structure cannot converge during relaxation). As long as we also have the builder, this provides some extra flexibility :)

@computron computron added the discussion API design discussions label Aug 21, 2023
@computron
Copy link
Member

We discussed this briefly in the atomate2 meeting on 8/25/23 but weren't able to finish the discussion. A few notes to follow up on here:

  1. @janosh points out that the example schema for elasticity is reproduced in almost the same way in the emmett repo. Is there a reason to not consolidate these?
  2. I brought up some concerns regarding data size and duplication (i.e. data consistency) if these builder documents are stored both within the Jobflow infrastructure and as part of an analysis collection. For 10 runs this is really not a problem to duplicate, for 10,000 it could be.
  3. I wanted some more details about what job_dirs was and whether we had a better way to track directories when runs occur across multiple machines. Note that the emmett version of this scheme does not have job_dirs.

@janosh
Copy link
Member

janosh commented Aug 25, 2023

Just to clarify, I didn't compare in detail but the ElasticityDoc in emmet-core

https://github.com/materialsproject/emmet/blob/4f2c35976d05369f3947014dbfef6baf98e1500f/emmet-core/emmet/core/elasticity.py#L156-L353

looks similar to ElasticDocument in atomate2. So if there is potential for consolidation, we should pursue it in the same spirit as #486

class ElasticDocument(BaseModel):
"""Document containing elastic tensor information and related properties."""
structure: Structure = Field(
None, description="The structure for which the elastic data is calculated."
)
elastic_tensor: ElasticTensorDocument = Field(
None, description="Fitted elastic tensor."
)
eq_stress: Matrix3D = Field(
None, description="The equilibrium stress of the structure."
)
derived_properties: DerivedProperties = Field(
None, description="Properties derived from the elastic tensor."
)
formula_pretty: str = Field(
None,
description="Cleaned representation of the formula",
)
fitting_data: FittingData = Field(
None, description="Data used to fit the elastic tensor."
)
fitting_method: str = Field(
None, description="Method used to fit the elastic tensor."
)
order: int = Field(
None, description="Order of the expansion of the elastic tensor."
)
@classmethod
def from_stresses(
cls,
structure: Structure,
stresses: List[Stress],
deformations: List[Deformation],
uuids: List[str],
job_dirs: List[str],
fitting_method: str = SETTINGS.ELASTIC_FITTING_METHOD,
order: Optional[int] = None,
equilibrium_stress: Optional[Matrix3D] = None,
symprec: float = SETTINGS.SYMPREC,
allow_elastically_unstable_structs: bool = True,
):
"""
Create an elastic document from strains and stresses.
Parameters
----------
structure : .Structure
The structure for which strains and stresses were calculated.
stresses : list of Stress
A list of corresponding stresses.
deformations : list of Deformation
A list of corresponding deformations.
uuids: list of str
A list of uuids, one for each deformation calculation.
job_dirs : list of str
A list of job directories, one for each deformation calculation.
fitting_method : str
The method used to fit the elastic tensor. See pymatgen for more details on
the methods themselves. The options are:
- "finite_difference" (note this is required if fitting a 3rd order tensor)
- "independent"
- "pseudoinverse"
order : int or None
Order of the tensor expansion to be fitted. Can be either 2 or 3.
equilibrium_stress : list of list of float
The stress on the equilibrium (relaxed) structure.
symprec : float
Symmetry precision for deriving symmetry equivalent deformations. If
``symprec=None``, then no symmetry operations will be applied.
allow_elastically_unstable_structs : bool
Whether to allow the ElasticDocument to still complete in the event that
the structure is elastically unstable.
"""
strains = [d.green_lagrange_strain for d in deformations]
if symprec is not None:
strains, stresses, uuids, job_dirs = _expand_strains(
structure, strains, stresses, uuids, job_dirs, symprec
)
deformations = [s.get_deformation_matrix() for s in strains]
# -0.1 to convert units from kBar to GPa and stress direction
stresses = [-0.1 * s for s in stresses]
eq_stress = None
if equilibrium_stress:
eq_stress = -0.1 * Stress(equilibrium_stress)
pk_stresses = [s.piola_kirchoff_2(d) for s, d in zip(stresses, deformations)]
if order is None:
order = 2 if len(stresses) < 70 else 3 # TODO: Figure this out better
if order > 2 or fitting_method == "finite_difference":
# force finite diff if order > 2
result = ElasticTensorExpansion.from_diff_fit(
strains, pk_stresses, eq_stress=eq_stress, order=order
)
if order == 2:
result = ElasticTensor(result[0])
elif fitting_method == "pseudoinverse":
result = ElasticTensor.from_pseudoinverse(strains, pk_stresses)
elif fitting_method == "independent":
result = ElasticTensor.from_independent_strains(
strains, pk_stresses, eq_stress=eq_stress
)
else:
raise ValueError(f"Unsupported elastic fitting method {fitting_method}")
ieee = result.convert_to_ieee(structure)
property_tensor = ieee if order == 2 else ElasticTensor(ieee[0])
ignore_errors = bool(allow_elastically_unstable_structs)
property_dict = property_tensor.get_structure_property_dict(
structure, ignore_errors=ignore_errors
)
derived_properties = DerivedProperties(**property_dict)
eq_stress = eq_stress.tolist() if eq_stress is not None else eq_stress
return cls(
structure=structure,
eq_stress=eq_stress,
derived_properties=derived_properties,
formula_pretty=structure.composition.reduced_formula,
fitting_method=fitting_method,
order=order,
elastic_tensor=ElasticTensorDocument(
raw=result.voigt.tolist(), ieee_format=ieee.voigt.tolist()
),
fitting_data=FittingData(
cauchy_stresses=[s.tolist() for s in stresses],
strains=[s.tolist() for s in strains],
pk_stresses=[p.tolist() for p in pk_stresses],
deformations=[d.tolist() for d in deformations],
uuids=uuids,
job_dirs=job_dirs,
),
)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion API design discussions
Projects
None yet
Development

No branches or pull requests

4 participants