Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added a draft for fasta_ltrretriever_lai #4984

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

GallVp
Copy link
Member

@GallVp GallVp commented Feb 25, 2024

PR checklist

Closes #XXX

  • Added a draft for fasta_ltrretriever_lai
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the module conventions in the contribution docs
  • If necessary, include test data in your PR.
  • Remove all TODO statements.
  • Emit the versions.yml file.
  • Follow the naming conventions.
  • Follow the parameters requirements.
  • Follow the input/output options guidelines.
  • Add a resource label
  • Use BioConda and BioContainers if possible to fulfil software requirements.
  • Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
    • For modules:
      • nf-core modules test <MODULE> --profile docker
      • nf-core modules test <MODULE> --profile singularity
      • nf-core modules test <MODULE> --profile conda
    • For subworkflows:
      • nf-core subworkflows test <SUBWORKFLOW> --profile docker
      • nf-core subworkflows test <SUBWORKFLOW> --profile singularity
      • nf-core subworkflows test <SUBWORKFLOW> --profile conda

@GallVp GallVp added the WIP Work in progress label Feb 25, 2024
@GallVp GallVp requested a review from a team as a code owner February 25, 2024 22:32
@GallVp GallVp requested review from louperelo and removed request for a team February 25, 2024 22:32
@GallVp GallVp marked this pull request as draft February 25, 2024 22:41
mahesh-panchal
mahesh-panchal previously approved these changes Feb 26, 2024
Copy link
Member

@mahesh-panchal mahesh-panchal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

One detail I'm not positive about, but absolutely makes sense to me is if

 ch_ltrretriever_inputs.map { meta, fasta, ltr -> [ meta, fasta ] },
 ch_ltrretriever_inputs.map { meta, fasta, ltr -> ltr },

is the same as

ch_ltrretriever_inputs.multiMap { meta, fasta, ltr ->
    ch_fasta: [meta, fasta]
    ch_ltr: ltr

We normally use the second version, but I don't see why the first version shouldn't be equivalent. Channels are still queues even if they're asynchronous. I guess I'm just wondering if using two+ map operations could ever make the inputs out of sync.

@GallVp
Copy link
Member Author

GallVp commented Feb 26, 2024

We normally use the second version, but I don't see why the first version shouldn't be equivalent. Channels are still queues even if they're asynchronous. I guess I'm just wondering if using two+ map operations could ever make the inputs out of sync.

Thank you @mahesh-panchal

The two map operations should not desynchronise the channel as it is a FIFO queue. In my experience, desynchronisation is a problem when multiple channels are involved. But here we are explicitly joining (i.e. synchronising) channels into a single channel:

ch_ltrretriever_inputs          = ch_short_ids_fasta.join(ch_ltr_candidates)

If ch_short_ids_fasta and ch_ltr_candidates are out of sync, their messages are synchronised by a unique key using the join operator.

@mahesh-panchal
Copy link
Member

Sorry, perhaps let's move the conversation here where we can see the overview.
So I'm confused by the suggestion. Is the suggestion to make CUSTOM_SHORTENFASTAIDS a native (exec:) process, and CUSTOM_RESTOREGFFIDS local functions instead, or the other way around? And the other process emits the files from the function through the process?

@GallVp
Copy link
Member Author

GallVp commented May 15, 2024

The suggestion is to replace CUSTOM_SHORTENFASTAIDS and CUSTOM_RESTOREGFFIDS with local groovy functions inside the FASTA_LTRRETRIEVER_LAI workflow. Emit the files produced by the local functions as outputs from FASTA_LTRRETRIEVER_LAI.

This delegates the problem of publishing the files produced by the local functions to the pipeline developer. They can decide to crate a local patch of the sub-workflow and add the output directory to collectFile operator or use a module like cat/cat outside the sub-workflow to publish the files.

@mahesh-panchal
Copy link
Member

I would say go ahead an implement it. It's easier to see what you mean when there's code. That's the wonderful thing about using git. We can always go back and checkout a previous state if the idea doesn't work out.

@GallVp
Copy link
Member Author

GallVp commented May 29, 2024

Hi @mahesh-panchal

I have hit another road block. I am not sure if we can attach meta to a file after splitting it with splitFasta operator.

I am trying to replace CUSTOM_SHORTENFASTAIDS (#5001) with following groovy code:

workflow {
    Channel.of(file(params.test_data['sarscov2']['genome']['genome_fasta'], checkIfExists: true))
    | splitFasta( record: [ id: true, sequence: true ] )
    | mix(
        Channel.of(file(params.test_data['actinidia_chinensis']['genome']['genome_1_fasta_gz'], checkIfExists: true))
        | splitFasta( record: [ id: true, sequence: true ] )
    )
    // | operator( [ id: 'test' ] ) // How to insert meta
    | map { record, meta ->
        if ( record.id.size() <= 13 && record.id ==~ /\w+/ ) return [ record, meta ]

        def id_digest   = record.id.md5()[0..<10]
        record.id       = "Ctg$id_digest"

        [ record, meta ]
    }
    | collectFile( sort: { it.size() } ) {
        [ 'test.fa', it.id + '\n' + it.sequence ]
    }
    | view
}
// Execute on macOS: (export PROFILE='docker' && nextflow run playground.nf -c tests/config/nf-test.config)

@mahesh-panchal
Copy link
Member

You can use the splitFasta method, rather than the channel operator.

    Channel.of( file( params.test_data['sarscov2']['genome']['genome_fasta'], checkIfExists: true ) )
        | map { file -> [ [ id:file.baseName ], file ] }
        | flatMap { meta, file -> file.splitFasta( record: [ id: true, sequence: true ] ).collect{ [ meta, it ] } }

Or something like that.

@GallVp GallVp marked this pull request as ready for review June 5, 2024 02:28
@GallVp
Copy link
Member Author

GallVp commented Jun 5, 2024

Hi @mahesh-panchal

Thank you for the hint. I have used it to create a groovy version of the module.

I have also changed the name of the sub-workflow to make the underlying operations explicit: FASTA_LTRHARVEST_LTRFINDER_LTRRETRIEVER_LAI

This has been a very complicated sub-workflow. It has given me a lot of trouble with pesky little bugs even with the custom modules which I tested independently. Now we have put everything into a single workflow and the chances that there is an edge case bug somewhere in there are quite high.

@GallVp GallVp added Ready for Review and removed WIP Work in progress labels Jun 5, 2024
Copy link
Member

@mahesh-panchal mahesh-panchal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll need to get back to this later on. 6th and 7th are holidays here.


take:
ch_fasta // channel: [ val(meta), fasta ]
ch_monoploid_seqs // channel: [ val(meta2), txt ]; Optional: Set to [] if not needed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than use [] for nothing, I think it would better to say use Channel.empty(). This might simplify some logic downstream

Copy link
Member

@mahesh-panchal mahesh-panchal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I still more time to review this.

ch_versions = Channel.empty()

// Prapre input channels
ch_monoploid_seqs_plain = ( ch_monoploid_seqs ?: Channel.empty() )
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming Channel.empty() as input rather than [] would mean this ternary statement isn't needed.

}
| collectFile
| map { tsv ->
[ tsv.baseName.replace('.short.ids', ''), tsv ]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does .short.ids get removed here rather than naming the file without the .short.ids in the first place?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants