Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop using odgi's Python bindings in calyx_depth.py #116

Merged
merged 38 commits into from
Nov 17, 2023
Merged

Conversation

anshumanmohan
Copy link
Contributor

@anshumanmohan anshumanmohan commented Jul 10, 2023

Instead of taking the input graph and parsing as an odgi graph, I instead parse it as a GFA. Then, instead of using odgi to grab simple statistics such as the number of nodes, paths, and the most steps in a given path, I use a method in preprocess.

See this PR's sibling issues:

These changes are with an eye to #107.

Copy link
Collaborator

@sampsyo sampsyo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome; this is great!! Assuming it works, this will be a very useful simplification, for the reasons discussed in #107.

pollen_py/pollen/depth/parse_data.py Outdated Show resolved Hide resolved
pollen_py/pollen/depth/parse_data.py Outdated Show resolved Hide resolved
pollen_py/pollen/depth/parse_data.py Outdated Show resolved Hide resolved
pollen_py/pollen/depth/parse_data.py Outdated Show resolved Hide resolved
pollen_py/pollen/depth/parse_data.py Outdated Show resolved Hide resolved
pollen_py/pollen/depth/parse_data.py Outdated Show resolved Hide resolved
pollen_py/pollen/depth/parse_data.py Outdated Show resolved Hide resolved
@@ -274,19 +285,19 @@ def config_parser(parser):

def run(args):
if args.from_verilog or args.from_interp:
with open(filename, "r") as fp:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@susan-garry I'm confused as to how this filename thing ever worked. Does args.filename do the trick, or is there a deeper issue?

@anshumanmohan
Copy link
Contributor Author

anshumanmohan commented Jul 14, 2023

@susan-garry I need to get on a synchronous call with you to figure out the symlink situation here. I've been trying for a little bit to pass exine a gfa instead of an og, but I'm new enough to the setup that I'm really quite turned around. Does tomorrow work?

If you'd like to tinker on your own, check out this branch and run make test from pollen/. The recipe has been minimized; I just need to get the exine code to "see" the graph. Right now I get FileNotFoundError: [Errno 2] No such file or directory: '../ex2.gfa' although the file is where I think it should be.

@sampsyo
Copy link
Collaborator

sampsyo commented Jul 18, 2023

Hmm; I get a somewhat different set of errors when trying this out. To narrow in on a specific test, I ran:

$ turnt -vp test/depth/subset-paths/ex1.txt
$ odgi depth -i ../basic/ex1.og -d -s ex1.paths
#node.id	depth	depth.uniq
1	1	1
2	2	1
$ exine depth -a -r ../basic/ex1.gfa -s ex1.paths
Traceback (most recent call last):
[...]
  File "/Users/asampson/Library/Python/3.11/lib/python/site-packages/mygfa/mygfa.py", line 292, in parse
    assert False, f"unknown line marker {fields[0]}"
AssertionError: unknown line marker .

That is, the odgi environment works fine, and we crash with a somewhat mysterious error. That problem comes from here:

graph = mygfa.Graph.parse(filename)

Because Graph.parse expects an open file, not a filename. So that should be something like Graph.parse(open(filename)).

@anshumanmohan
Copy link
Contributor Author

Ah I had that right in one spot but wrong in another. Thanks! Fixed!

Now we have a different set of issues, having to do with all the other uses of odgi-ey stuff that are in the file but are not used by calyx_depth. For a taste:

  1. graph.get_step_count(node_h)
  2. graph.get_path(step_h)
  3. graph.for_each_step_on_handle(node_h, parse_step)
  4. graph.for_each_handle(parse_node)
  5. graph.get_path_name(path_h)

I would love a confirmation on this, but as I understand it, these are used to create a .data representation of the given graph, but are not used by calyx_depth.py to generate the Calyx accelerator itself. Only get_dimensions is used in the generation of the accelerator. Is that right?

I'm unsure about how to proceed.

Truly removing odgi from parse_data would mean supporting all these getters and itererators in mygfa. Is that worth our time? Should we instead:

  • Let parse_data use odgi.
  • Just lift the one non-odgi-pure-mygfa method get_dimensions somewhere up above
    ?

The result is that

  • calyx_depth can use get_dimensions and be free of odgi.
  • parse_data can continue to use odgi.

@sampsyo
Copy link
Collaborator

sampsyo commented Jul 19, 2023

I would love a confirmation on this, but as I understand it, these are used to create a .data representation of the given graph, but are not used by calyx_depth.py to generate the Calyx accelerator itself. Only get_dimensions is used in the generation of the accelerator. Is that right?

This looks approximately correct to me, in the sense that the bulk of this happens in parse_data.py and not calyx_depth.py. (It also happens in python_depth.py. However, I don't think this is used anywhere currently—has it perhaps been subsumed by slow-odgi, and can we delete it now?)

In my ideal imagination of how this works, there are two parts of this Calyx-accelerator implementation:

  • A data converter, from GFA to JSON. This needs to look at the graph, obviously.
  • A hardware generator, which produces Calyx code. This should not look at the input graph. It may need some basic parameters like the number of steps per path or whatever, but these should be summarized in a small list of numbers.

Maybe @susan-garry could confirm whether this is the goal, or if something major prevents this idea from being possible?

Truly removing odgi from parse_data would mean supporting all these getters and itererators in mygfa. Is that worth our time? Should we instead: […] The result is that

  • calyx_depth can use get_dimensions and be free of odgi.
  • parse_data can continue to use odgi.

For the scope of this PR, let's just do this! That is, let's make the Calyx-generation thing odgi-free. The data-converter thing will remain odgi-dependent. Then we can tackle that in a future PR. Does this make sense, @anshumanmohan?

One silly question I have about this, however: I had it in my head that your new pollen_data_gen tool can produce the same JSON output as this existing parse_data module, which I believe is wired up to exine depth. Is that true? If so, we can just delete parse_data altogether as obsolete (again, in a different PR).

@anshumanmohan
Copy link
Contributor Author

Great, all makes sense!

Yes, pollen_data_gen's depth-json-generation matches parse_data's depth-json-generation. At least for depth, we can do away with parse_data. I think we've kept parse_data around for two reasons, both flimsy:

  • just plain inertia
  • perceived speed (but this is vague: we have not benchmarked the two json-generators properly)

@sampsyo
Copy link
Collaborator

sampsyo commented Jul 20, 2023

Cool. IMO let's not worry too much about speed—both tools are in the "slow" category because they are primarily in Python (while parse_data uses native code via odgi's Python bindings, it does so in a way where the inner loops are still Python). If we want something fast, we'll need a different approach. So with that in mind, I am even more in favor of deleting the old code instead of trying to port it to mygfa.

@anshumanmohan
Copy link
Contributor Author

anshumanmohan commented Jul 20, 2023

I have reverted all my changes to parse_data and simply made a copy of the method get_dimensions from parse_data into calyx_depth. This copy of get_dimensions uses mygfa, not odgi. calyx_depth no longer calls parse_data at all.

This PR is now, in theory, dead simple.

What remains is a spot of exine engineering:

  • We can pass either a GFA or an OG file to our odgi oracle
  • We need to pass a GFA to calyx_depth
  • We need to pass an OG to parse_data

I have chatted with @susan-garry and we have a plan: she will take on the exine engineering that we need. The user will pass exine a GFA, and exine will pass that GFA to those who need it. For the one person who needs a OG (i.e., parse_data), exine will create a temporary OG file and pass them that.

@anshumanmohan anshumanmohan mentioned this pull request Jul 20, 2023
@anshumanmohan anshumanmohan changed the title Stop using odgi's Python bindings in parse_data.py Stop using odgi's Python bindings in calyx_depth.py Jul 20, 2023
@anshumanmohan
Copy link
Contributor Author

@susan-garry while this PR no longer closes #107 (it goes most of the way, but we also need #147 and #148), it does fully close #137, no? If so, please feel free to link it using the Development panel on the right and then merge!

@anshumanmohan anshumanmohan merged commit 3ebf978 into main Nov 17, 2023
1 check passed
@anshumanmohan anshumanmohan deleted the no-odgi-python branch November 17, 2023 13:49
@susan-garry susan-garry restored the no-odgi-python branch February 28, 2024 02:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pollen_data_gen: support for subset_paths depth: use pollen_data_gen in place of parse_data
3 participants