How to use `rcOutDir` option to process multiple samples and avoid out of memory errors? #344

bernardo-heberle · 2022-06-22T20:29:47Z

bernardo-heberle
Jun 22, 2022

Hello,

I am running 9 cDNA samples sequenced on the PromethION with Bambu. Each ".bam" file is ~30GB long after filtering the reads, so the total amount of data being processed by Bambu is ~270GB. I noticed that the memory requirements are getting pretty high, as the job will fail with 500GB of RAM memory, but will complete if I increase the RAM to 1000GB. Here is the command I am running:

se_novel <- bambu(reads = bam, annotations = bambuAnnotations, rcOutDir = "./bambu_processed_files/", genome = fa_file, lowMemory=TRUE, ncore=8, opt.discovery = list(min.sampleNumber = 5, min.readCount = 5))

The bam variable is a vector with the paths for the 9 ".bam" files.

I was wondering if I am using the rcOutDir option correctly?

It is my understanding that this option is supposed to help with runs utilizing multiple samples, but I am not sure if there is an intermediary step that I am missing. I ask this because in the long run I intend to use bambu to process several dozens, if not hundreds, of cDNA samples generated on the PromethION. However, with the increase in RAM requirements, memory will probably become a limiting factor for processing a larger number of samples.

Any help with this and/or tips on how to avoid out of memory errors with a large number of large samples will be much appreciated!

cying111 · 2022-06-23T02:08:34Z

cying111
Jun 23, 2022
Maintainer

Hi @bernardo-heberle, usually the processing memory hikes when preprocessing the bam files, so what we would recommend for very large samples, is to 1) process them individually with discovery mode only discovery = TRUE, quantification = FALSE and save the read class file to rcOutDir as what you have done, and then after all files are processed, 2) supply all read class files' paths as a character vector to the rcFile argument rcFile = path_to_all_read_class_files and then re-run bambu with both discovery and quantification mode on. One tip is that when you process the large sample bam file with only discovery mode on, you can set a specific value to yieldSize, which controls the read in of bam files, say 1,000,000 yieldSize = 1000000, so that the memory usage is a bit controlled. Another tip, is that for both steps, you might want to use a limited number of cpus, when you fail with a certain number of cpus, you can always lower the cpu number and try again.

3 replies

bernardo-heberle Feb 1, 2023
Author

Hi @bernardo-heberle, usually the processing memory hikes when preprocessing the bam files, so what we would recommend for very large samples, is to 1) process them individually with discovery mode only discovery = TRUE, quantification = FALSE and save the read class file to rcOutDir as what you have done, and then after all files are processed, 2) supply all read class files' paths as a character vector to the rcFile argument rcFile = path_to_all_read_class_files and then re-run bambu with both discovery and quantification mode on. One tip is that when you process the large sample bam file with only discovery mode on, you can set a specific value to yieldSize, which controls the read in of bam files, say 1,000,000 yieldSize = 1000000, so that the memory usage is a bit controlled. Another tip, is that for both steps, you might want to use a limited number of cpus, when you fail with a certain number of cpus, you can always lower the cpu number and try again.

I have one more question about this @cying111. Currently, when running step "1)" I am setting NDR=1 and then on step "2)" when combining all the files I am using the Bambu auto-generated NDR threshold. Is this a reasonable approach or would you suggest already using the Bambu auto-generated NDR threshold on step "1)" when processing files individually?

andredsim Feb 2, 2023
Maintainer

Hi,

The rcOutDir files are output before the NDR prefiltering step so the NDR=1 you are setting will not impact them (and consequently will not impact step 2). It would only impact the output of the annotations produced from step 1). Therefore your approach is reasonable and the output from step 2 should be as expected.

If you plan to run this more in future and you do not need the annotation output from each individual file from step 1, you can set discovery = FALSE in step 1 (and no NDR is needed as it won't be used). This means that only rcOutDir files will be generated and will save you a bit of time.

Kind Regards,
Andre SIm

bernardo-heberle Feb 2, 2023
Author

Great, thank you Andre!

bernardo-heberle · 2022-06-26T03:08:07Z

bernardo-heberle
Jun 26, 2022
Author

Thank you for the quick and effective response @cying111! The solutions you suggested worked well and I am able to run all of the samples with no issues now.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to use `rcOutDir` option to process multiple samples and avoid out of memory errors? #344

{{title}}

Replies: 2 comments 3 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

How to use rcOutDir option to process multiple samples and avoid out of memory errors? #344

bernardo-heberle Jun 22, 2022

Replies: 2 comments · 3 replies

cying111 Jun 23, 2022 Maintainer

bernardo-heberle Feb 1, 2023 Author

andredsim Feb 2, 2023 Maintainer

bernardo-heberle Feb 2, 2023 Author

bernardo-heberle Jun 26, 2022 Author

How to use `rcOutDir` option to process multiple samples and avoid out of memory errors? #344

bernardo-heberle
Jun 22, 2022

Replies: 2 comments 3 replies

cying111
Jun 23, 2022
Maintainer

bernardo-heberle Feb 1, 2023
Author

andredsim Feb 2, 2023
Maintainer

bernardo-heberle Feb 2, 2023
Author

bernardo-heberle
Jun 26, 2022
Author