Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

General suggestions for code cleanup/improvements #48

Open
halx opened this issue Dec 13, 2015 · 33 comments
Open

General suggestions for code cleanup/improvements #48

halx opened this issue Dec 13, 2015 · 33 comments
Labels

Comments

@halx
Copy link
Collaborator

halx commented Dec 13, 2015

There is currently a lot of cruft growing in the code which is basically down to design choices.

Here some suggestions:
The parsers should be designed in a "polymorphic" fashion such that the main code does not need to know anything about the individual parser codes. This implies that a clear interface and a specification must be designed to define what data is passed in and what is passed out and in what form. It is the responsibility of the parser code to signal back what it can and cannot do. E.g., as it stands now, dhdlt should be None if MD code can't do TI or u_klt should be None if it *BAR analysis can't be done. Energies should be converted to a "canonical" form or it should be signalled back to the main code what the units are. This will relieve the main code maintainer to do code-specific clean-up operations.

Optparse should be replaced with argparse.

The globals need to go as this is really bad programming style. The problem with this is that all these variables are accessible throughout all of the module which mean that they can be modified anywhere. This can create hard to track down bugs (all you need is a second person messing around there...).

@halx
Copy link
Collaborator Author

halx commented Dec 18, 2015

One practical thought on this. I think the code should have an understanding of what the various analysis methods "mean". Not sure if there's something in place but e,g,

all_methods = set(['TI','TI-CUBIC','DEXP','IEXP','GINS','GDEL','BAR','UBAR','RBAR','MBAR'])
ti_methods = set(['TI','TI-CUBIC'])
bar_methods = set(['DEXP','IEXP','GINS','GDEL','BAR','UBAR','RBAR','MBAR'])

other methods if applicable

selected_method = set(P.methods) # or via argparse

if not selected_method & bar_methods: # intersection
do_bar = None

if u_klt is None:
selected_method = selected_method - bar_methods

etc.

@davidlmobley
Copy link
Member

Yes, I agree with all of this, and most especially the issue relating to global variables. Probably we should break up this issue into several separate issues which make specific proposals for code changes that can be handled relatively independently, i.e.:

  • replace optparse with argparse (see also error with -c -m ti+ti-cubic #51)
  • decide canonical form for energies
  • discuss and determine interface for parsers (specify what data goes in and what comes out so that parsers can be independent/polymorphic) and what units are
  • rewrite parsers to conform to interface specifications and use canonical units
  • eliminate global variables
  • make code know what class different methods are in, and use this to (???) (simplify plotting decisions, ...?)

@davidlmobley davidlmobley changed the title Code cleanup General suggestions for code cleanup/improvements Dec 18, 2015
@halx
Copy link
Collaborator Author

halx commented Jan 4, 2016

Ok, that's certainly a plan but the practical problem is that this affetcs
one single file which in turn means conflicts in parallel development. For
the near future this will be just as it is but in the long run it may make
sense to split alchemical_analysis.py in smaller but logical chunks. In
this context I would also suggest to develop the tool into a proper library
with a well-defined API. alchemical_analysis.py will then just be the
command line frontend. Doing this will also mean that writing to stdout
has to be modified e.g. through a logger that explicitly needs to be
switched on for the command line tool but would not output anything by
default. There may be other possibilities.

On 18 December 2015 at 19:09, David L. Mobley [email protected]
wrote:

Yes, I agree with all of this, and most especially the issue relating to
global variables. Probably we should break up this issue into several
separate issues which make specific proposals for code changes that can be
handled relatively independently, i.e.:

  • replace optparse with argparse
  • decide canonical form for energies
  • discuss and determine interface for parsers (specify what data goes
    in and what comes out so that parsers can be independent/polymorphic) and
    what units are
  • rewrite parsers to conform to interface specifications and use
    canonical units
  • eliminate global variables
  • make code know what class different methods are in, and use this to
    (???) (simplify plotting decisions, ...?)


Reply to this email directly or view it on GitHub
#48 (comment)
.

@halx
Copy link
Collaborator Author

halx commented Jan 4, 2016

Ok, I have created a branch 'refactor'. Done so far.

  • dynamic loading of parsers: need to discuss the interface
  • replaced optparse with argparse
  • eliminated globals in main()

@davidlmobley
Copy link
Member

OK. It may take me a couple of days to get to review this one. What needs
to be discussed regarding parsers?

Thanks.

On Mon, Jan 4, 2016 at 4:00 AM, Hannes Loeffler [email protected]
wrote:

Ok, I have created a branch 'refactor'. Done so far.

  • dynamic loading of parsers: need to discuss the interface
  • replaced optparse with argparse
  • eliminated globals in main()


Reply to this email directly or view it on GitHub
#48 (comment)
.

David Mobley
[email protected]
949-385-2436

@halx
Copy link
Collaborator Author

halx commented Jan 4, 2016

The interface i.e. inputs and outputs. At the moment P is passed in which is, frankly, a rather "heavy" data structure. It also seems to be (ab?)used to also contain additional data items beyond what arparse puts in there. The current code heavily relies on this data structure to be present globally so it is in essence another global. I would tend to only pass around data that is really needed. Further cleanup of the code and separation into smaller units will make this obvious.

On the output side the code is mostly where it should be. But we will need a mechanism to report back the parsers "capabilities". Maybe a separate data structure that is compared to other data times. e.g. P.methods to determine what analysis can and should be done.

Overall, I would think this is not high priority as it will become clearer during refactoring what needs to be done. The new code will need heavy testing before a merge and release. I tend to only test the AMBER side because I am most familiar with it but I will have a look through the example date as well to see if I haven't broken the code in an obvious way.

BTW, what is the need to take extras steps for the Desmond parser? Looking into uncorrelate() it seems there is effectively only one line of code different. Can't this be done by the parser itself?

@halx
Copy link
Collaborator Author

halx commented Jan 7, 2016

I think I have an idea how units should be handled. Essentially, the parsers fill up dhdlt and u_klt and report back what units they are in. The parsers should do no further conversion but will need to also report the temperature to compute beta. All conversion are then done in the main code as required and requested by the user

@davidlmobley
Copy link
Member

That sounds reasonable. But I should investigate a bit on HOW the units
should be reported back. For example, there is a units module used in
openmm and some other places which actually can make objects CARRY units in
a pretty convenient way. But I need to investigate whether having these
units attached to return data would slow things down or not.

Otherwise I agree with you though.

On Thu, Jan 7, 2016 at 8:01 AM, Hannes Loeffler [email protected]
wrote:

I think I have an idea how units should be handled. Essentially, the
parsers fill up dhdlt and u_klt and report back what units they are in. The
parsers should do no further conversion but will need to also report
the temperature to compute beta. All conversion are then done in the main
code as required and requested by the user


Reply to this email directly or view it on GitHub
#48 (comment)
.

David Mobley
[email protected]
949-385-2436

@halx
Copy link
Collaborator Author

halx commented Jan 8, 2016

Following the duscussion on openmm/openmm#680
it doesn't look to me that the current maintainer is much interested in
making this module available in a sensible way (i.e. standalone). Pint has
been suggested in the thread. I'll take a look at that one and see if it
is interesting enough.

On 8 January 2016 at 01:29, David L. Mobley [email protected]
wrote:

That sounds reasonable. But I should investigate a bit on HOW the units
should be reported back. For example, there is a units module used in
openmm and some other places which actually can make objects CARRY units in
a pretty convenient way. But I need to investigate whether having these
units attached to return data would slow things down or not.

Otherwise I agree with you though.

On Thu, Jan 7, 2016 at 8:01 AM, Hannes Loeffler [email protected]
wrote:

I think I have an idea how units should be handled. Essentially, the
parsers fill up dhdlt and u_klt and report back what units they are in.
The
parsers should do no further conversion but will need to also report
the temperature to compute beta. All conversion are then done in the main
code as required and requested by the user


Reply to this email directly or view it on GitHub
<
#48 (comment)

.

David Mobley
[email protected]
949-385-2436


Reply to this email directly or view it on GitHub
#48 (comment)
.

@halx
Copy link
Collaborator Author

halx commented Jan 8, 2016

I had a quick look into pint and it looks workable. Possible downsides:
this approach means replacing all 'raw' values with unit objects. This
could mean more memory and also being slower. By how much I don't know but
memory was a concern of Pavel's.

On 8 January 2016 at 10:15, Hannes Loeffler [email protected]
wrote:

Following the duscussion on
openmm/openmm#680 it doesn't look to me
that the current maintainer is much interested in making this module
available in a sensible way (i.e. standalone). Pint has been suggested in
the thread. I'll take a look at that one and see if it is interesting
enough.

On 8 January 2016 at 01:29, David L. Mobley [email protected]
wrote:

That sounds reasonable. But I should investigate a bit on HOW the units
should be reported back. For example, there is a units module used in
openmm and some other places which actually can make objects CARRY units
in
a pretty convenient way. But I need to investigate whether having these
units attached to return data would slow things down or not.

Otherwise I agree with you though.

On Thu, Jan 7, 2016 at 8:01 AM, Hannes Loeffler <[email protected]

wrote:

I think I have an idea how units should be handled. Essentially, the
parsers fill up dhdlt and u_klt and report back what units they are in.
The
parsers should do no further conversion but will need to also report
the temperature to compute beta. All conversion are then done in the
main
code as required and requested by the user


Reply to this email directly or view it on GitHub
<
#48 (comment)

.

David Mobley
[email protected]
949-385-2436


Reply to this email directly or view it on GitHub
#48 (comment)
.

@davidlmobley
Copy link
Member

OK, thanks. I did put a follow-up question in that thread. I've used
simtk.unit relatively recently but it sounds like that's a bad idea (I
don't want to make this require simtk), so looking into pint probably makes
sense.

On Fri, Jan 8, 2016 at 2:15 AM, Hannes Loeffler [email protected]
wrote:

Following the duscussion on
openmm/openmm#680
it doesn't look to me that the current maintainer is much interested in
making this module available in a sensible way (i.e. standalone). Pint has
been suggested in the thread. I'll take a look at that one and see if it
is interesting enough.

On 8 January 2016 at 01:29, David L. Mobley [email protected]
wrote:

That sounds reasonable. But I should investigate a bit on HOW the units
should be reported back. For example, there is a units module used in
openmm and some other places which actually can make objects CARRY units
in
a pretty convenient way. But I need to investigate whether having these
units attached to return data would slow things down or not.

Otherwise I agree with you though.

On Thu, Jan 7, 2016 at 8:01 AM, Hannes Loeffler <
[email protected]>
wrote:

I think I have an idea how units should be handled. Essentially, the
parsers fill up dhdlt and u_klt and report back what units they are in.
The
parsers should do no further conversion but will need to also report
the temperature to compute beta. All conversion are then done in the
main
code as required and requested by the user


Reply to this email directly or view it on GitHub
<

#48 (comment)

.

David Mobley
[email protected]
949-385-2436


Reply to this email directly or view it on GitHub
<
#48 (comment)

.


Reply to this email directly or view it on GitHub
#48 (comment)
.

David Mobley
[email protected]
949-385-2436

@halx
Copy link
Collaborator Author

halx commented Jan 13, 2016

I have addressed most of these except for th unit problem. The harder part will now be to separate the current functions into more useful chunks (probably by means of analysis method and separte plotting from non-plotting functionality) and a library.

I still not convinced that the tool needs a unit managment via a specialised module. Do you expect future functionality that is more than energy conversion at one single point in the code?

@davidlmobley
Copy link
Member

I have addressed most of these except for th unit problem. The harder part will now be to separate the current functions into more useful chunks (probably by means of analysis method and separte plotting from non-plotting functionality) and a library.

Great. Splitting in the way you describe also makes sense, I think.

I still not convinced that the tool needs a unit managment via a specialised module. Do you expect future functionality that is more than energy conversion at one single point in the code?

I am not totally convinced that it does either, though I think in general I'll probably try to be including these from scratch when I write new code that will be dealing with multiple sets of units - simply because having everything carry units makes it SO MUCH easier to make sure you have appropriate units when/where needed.

@halx
Copy link
Collaborator Author

halx commented Jan 13, 2016

My main worry was that this wouldn't work well with the numpy arrays that
we use now. A numpy array can only contain certain data types (a "raw"
number) and wrapping an array would cost some time and effort. Also, that
would have multiplied memory storage even if it was possible.

I should have taken a closer look though and read
http://pint.readthedocs.org/en/0.6/numpy.html . So you can actually wrap a
numpy array together with a unit and do all the stuff you can do with with
scalars! Ok, maybe you cannot do any transformation of an array you can do
with numpy but simple unit conversion is possible. So after all that
doesn't look so bad.

On 13 January 2016 at 16:59, David L. Mobley [email protected]
wrote:

I have addressed most of these except for th unit problem. The harder part
will now be to separate the current functions into more useful chunks
(probably by means of analysis method and separte plotting from
non-plotting functionality) and a library.

Great. Splitting in the way you describe also makes sense, I think.

I still not convinced that the tool needs a unit managment via a
specialised module. Do you expect future functionality that is more than
energy conversion at one single point in the code?

I am not totally convinced that it does either, though I think in general
I'll probably try to be including these from scratch when I write new code
that will be dealing with multiple sets of units - simply because having
everything carry units makes it SO MUCH easier to make sure you have
appropriate units when/where needed.


Reply to this email directly or view it on GitHub
#48 (comment)
.

@davidlmobley
Copy link
Member

Ah, that's a good point. Pint does indeed look promising.

On Wed, Jan 13, 2016 at 11:32 AM, Hannes Loeffler [email protected]
wrote:

My main worry was that this wouldn't work well with the numpy arrays that
we use now. A numpy array can only contain certain data types (a "raw"
number) and wrapping an array would cost some time and effort. Also, that
would have multiplied memory storage even if it was possible.

I should have taken a closer look though and read
http://pint.readthedocs.org/en/0.6/numpy.html . So you can actually wrap a
numpy array together with a unit and do all the stuff you can do with with
scalars! Ok, maybe you cannot do any transformation of an array you can do
with numpy but simple unit conversion is possible. So after all that
doesn't look so bad.

On 13 January 2016 at 16:59, David L. Mobley [email protected]
wrote:

I have addressed most of these except for th unit problem. The harder
part
will now be to separate the current functions into more useful chunks
(probably by means of analysis method and separte plotting from
non-plotting functionality) and a library.

Great. Splitting in the way you describe also makes sense, I think.

I still not convinced that the tool needs a unit managment via a
specialised module. Do you expect future functionality that is more than
energy conversion at one single point in the code?

I am not totally convinced that it does either, though I think in general
I'll probably try to be including these from scratch when I write new
code
that will be dealing with multiple sets of units - simply because having
everything carry units makes it SO MUCH easier to make sure you have
appropriate units when/where needed.


Reply to this email directly or view it on GitHub
<
#48 (comment)

.


Reply to this email directly or view it on GitHub
#48 (comment)
.

David Mobley
[email protected]
949-385-2436

@dotsdl
Copy link

dotsdl commented Jun 15, 2016

FWIW, seconding all the the points raised in this discussion. I think it's a huge boon if tools written in Python can actually be used directly from a Python session, but as currently written the core component of this module is only really usable as a command-line script. What's more, the code is extremely hard to follow, with e.g. functions that operate on names defined outside of their scope (alchemical_analysis.uncorrelate is one example).

Given this, I'm currently ripping out the things I need from it manually to get into a form I can actually use. I'll push the result to GitHub and perhaps that can help to move this issue along. I think this library addresses an important void, but could use some heavy refactoring.

@nathanmlim
Copy link
Collaborator

Hi @dotsdl,
Before you go and refactor the code, there is currently a PR (#68) where @halx has done lots of work in refactoring the code. Nobody has had the chance to review it yet and thus hasn't been merged yet

@davidlmobley
Copy link
Member

@dotsdl - I'm very on board with major changes; basically there's a ton that needs to be done along those lines and unfortunately I haven't had anyone in the group who is actively doing organization/improvements/maintenance right now (though @Limn1 will probably be getting involved). @halx has done a lot that we didn't get to review/discuss yet. Perhaps a call would be warranted?

Nobody has had the chance to review it yet and thus hasn't been merged yet

Elaborating - I've had a bit of a look but we haven't really tested much yet, and it raises larger architecture issues/refactoring issues that I haven't been able to think through.

Maybe we should do a separate issue to think through the best ways to do this. Or we could schedule a call. Interested in being involved? @halx , are you still thinking about this and interested in being involved?

My group doesn't have enough real "code design" experience, I think, yet, so I need to involve outside expertise. The first version of this that we "released" was way too much done just by one student who didn't have the needed expertise.

@mrshirts has also mentioned that he's interested in being more involved going forward.

@davidlmobley
Copy link
Member

@dotsdl - do please give us what you end up with regardless, as even if
you're not involved in the larger "where do we go from here" discussion, at
least seeing what you do/having your summary of it will help us figure out
how to do this best.

Thanks.

On Wed, Jun 15, 2016 at 1:57 PM, Nathan Lim [email protected]
wrote:

Hi @dotsdl https://github.com/dotsdl,
Before you go and refactor the code, there is currently a PR (#68
#68) where @halx
https://github.com/halx has done lots of work in refactoring the code.
Nobody has had the chance to review it yet and thus hasn't been merged yet


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#48 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/AGzUYb2X-5sFETJbX73C9yq49Ped0PlFks5qMGc7gaJpZM4G0Vg5
.

David Mobley
[email protected]
949-385-2436

@mrshirts
Copy link
Contributor

Hi, all- I'm definitely happy to help review code, especially at the physical correctness and functionality level. I'm not as experienced in larger architecture levels.

I still not convinced that the tool needs a unit managment via a specialised module. Do you expect future functionality that is more than energy conversion at one single point in the code?

I don't think a general unit manager will be needed. There is a very low probability of ever needing anything other than energy units of either kJ/mol, kcal/mol, or kT. Something specialized for this purpose should be fine.

@dotsdl
Copy link

dotsdl commented Jun 15, 2016

@davidlmobley I'm happy to be involved in this process. I think converging on a common codebase for doing low-level things like parsing output files from different MD engines and extracting the correct quantities for FEP/TI is important for progress in the larger field, since these should be entirely solvable problems. For my needs at the moment I only need to extract gromacs outputs, so I'll be focused on that.

I do have more of a software engineering focus, so I'm happy to help in terms of library architecture. As I said, I'll push the library that results from my work to a repo on GitHub and we'll see if a path forward emerges from it. We'll also then have something concrete to iterate on. That work will definitely require review from @davidlmobley and @mrshirts for physical correctness, since I'm less well-versed in caveats/pitfalls of alchemical free energy calculations.

@halx
Copy link
Collaborator Author

halx commented Jun 16, 2016

Yes, I have started with some light-weight changes in the code and I would think we should use this as a basis for discussion.

@davidlmobley
Copy link
Member

@dotsdl , @halx - I've had this on hold for a while because of a super crazy summer (including a new baby, a visitor on sabbatical, etc.) and current grant proposal deadlines. But I'll be able to deal a bit with it again after the first week in October. Could we talk around that time to develop a better plan for the future, perhaps also with @mrshirts ?

@dotsdl
Copy link

dotsdl commented Sep 20, 2016

@davidlmobley absolutely. I'm in the midst of writing my dissertation (defending Nov. 4), so of limited use on development fronts. But been working on a library called tentatively alchemlyb that contains the machinery I used to do data munging in Python from GROMACS output and do MBAR and TI with what appear to be best-practices.

I pushed this library to GitHub today, but the API is far from stable, as it's mostly things I built over the course of the past couple of months. Starting November I and two others in our lab will be employed to get this library in a form that is easy for everyone to use, complete with tests, docs, etc., as we have a need for a central library of solid implementations for the different alchemistry projects we are doing ourselves. We will definitely be interested in both you and @mrshirts input to ensure our implementations are scientifically and statistically sound.

We hope to keep the library simple and straightforward enough that it can be used both interactively and as part of larger postprocessing pipelines. The goal is simplicity and clarity. Should we set up a time to discuss general needs more directly?

@davidlmobley
Copy link
Member

@dotsdl - this actually sounds really interesting. I'd been thinking to some extent that what this really needs is to be re-framed into a library format rather than a stand-alone tool which attempts to do everything and keeps having to get extended to fit with different codes. There might then still be a front end which would interact with different tools...

Maybe move this to e-mail so we can schedule, looping in @mrshirts and @halx ?

@dotsdl
Copy link

dotsdl commented Sep 20, 2016

@davidlmobley happy to! You can reach me at [email protected]. Feel free to shoot me a message and we'll get rolling.

@halx
Copy link
Collaborator Author

halx commented Sep 20, 2016

Yes, let's got with email to arrange a discussion in October.

@halx
Copy link
Collaborator Author

halx commented Dec 14, 2016

Looks like I haven't paid much attention to this recently... Where are things standing at the moment?

@davidlmobley
Copy link
Member

@dotsdl - any updates? Should we talk again? I'm still much more enthusiastic about the library approach you were talking about working on, but I haven't seen much action there, and obviously we'll need to do something fairly soon.

@dotsdl
Copy link

dotsdl commented Dec 14, 2016

@davidlmobley I've been working on a proof-of-concept set of components and structure for alchemlyb this past week, and I'm hoping to be able to put both the API proposal that it demonstrates as well as a gist showing how it works in practice during the coming week. Speaking to @orbeckst, we also wanted to draft a white paper that clearly states the problem, aims, and our solution once consensus on the proposal is reached on the issue tracker.

I'll post the issue today and follow it up with the detailed proposal + gist in the coming days. Apologies for the delay.

@mrshirts
Copy link
Contributor

mrshirts commented Dec 14, 2016 via email

@halx
Copy link
Collaborator Author

halx commented Dec 15, 2016 via email

@davidlmobley
Copy link
Member

Sounds good, @dotsdl

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants